CONFLICT was founded in 2012. The first iPhone with a Retina display had been out for two years. Kubernetes did not exist. Neither did Docker. TensorFlow was three years away from its public release, and the idea that a neural network could write coherent prose would have gotten you laughed out of most engineering conversations.
We have built software across every major technology wave since then: mobile, cloud, DevOps, microservices, serverless, and now AI. Each wave changed what was possible. None of them changed what was hard. The hard parts of building software – understanding the problem, designing systems that handle real-world complexity, coordinating human effort, maintaining quality under pressure – have stayed stubbornly constant.
This post is the story of how we got from there to here. Not a theoretical framework or an industry analysis, but a retrospective on what we built, what broke, what we learned, and what thirteen years of making mistakes with real client projects taught us about building in the AI era. The lessons are specific, sometimes embarrassing, and – we hope – useful.
The Early Days: ML Before It Was Cool (and Before It Was Easy)
Our first encounters with machine learning were not glamorous. Around 2014, a client needed a recommendation engine for their eCommerce platform. Not the sophisticated collaborative filtering you see today. Simple content-based recommendations: if a customer bought this, suggest things like this.
The technology was available. Scikit-learn existed. Papers on collaborative filtering were plentiful. What was not available was any kind of reasonable deployment path. We built the model in Python, serialized it, and wrote a custom service to load and serve predictions. There was no MLflow, no SageMaker, no model serving framework. You wrote everything yourself, including the parts that had nothing to do with the actual ML.
The recommendation engine worked. Sort of. It improved click-through rates by a modest but measurable amount. But the engineering effort to build, deploy, and maintain it was wildly disproportionate to the business value. We spent weeks on infrastructure that today would be a few configuration files.
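The pieces we hand-rolled back then can be sketched in a few lines: train, pickle, reload, predict. The recommender below is a toy stand-in for illustration (the real model was scikit-learn based), and the real service wrapped the loaded model in a small hand-written HTTP layer on top of this.

```python
# Rough sketch of the 2014-era serving path: serialize the trained model,
# then load it in a separate serving process. The ContentRecommender here
# is a toy stand-in, not the actual model.
import pickle

class ContentRecommender:
    """Toy content-based recommender: suggest items sharing a category."""
    def __init__(self, catalog):
        self.catalog = catalog  # {item_id: category}

    def recommend(self, item_id):
        cat = self.catalog[item_id]
        return [i for i, c in self.catalog.items() if c == cat and i != item_id]

# "Training" and serialization, as the build step did it.
model = ContentRecommender({"a": "shoes", "b": "shoes", "c": "hats"})
with open("recommender.pkl", "wb") as f:
    pickle.dump(model, f)

# What the hand-written prediction service did on startup.
with open("recommender.pkl", "rb") as f:
    served = pickle.load(f)

print(served.recommend("a"))  # ["b"]
```

Everything around this round-trip (request handling, batching, health checks, redeploys) was bespoke code, which is where the weeks went.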
Around the same time, we did our first NLP project: text classification for a client that needed to categorize incoming support tickets by topic and urgency. We used bag-of-words features with a logistic regression classifier. It was not sophisticated by any standard, even at the time. But it worked well enough to route 60% of tickets correctly without human triage, which saved the client’s support team meaningful hours per week.
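That classifier's shape is easy to reconstruct with scikit-learn; the tickets and labels below are illustrative, not the client's data.

```python
# Sketch of the early ticket classifier: bag-of-words features feeding
# a logistic regression model. Tickets and labels are illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

tickets = [
    "Cannot log in to my account",
    "Payment failed twice, urgent",
    "How do I export my data?",
    "Site is down, nothing loads",
]
labels = ["auth", "billing", "howto", "outage"]

model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(tickets, labels)

# New tickets get routed to a topic queue; low-confidence predictions
# fell back to human triage in the real system.
print(model.predict(["I can't sign in"]))
```

Unsophisticated, but the real version of this routed the majority of tickets without a human touching them first.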
The lesson from those early projects was simple and durable: the ML is usually the easy part. The hard part is everything else – data pipelines, deployment infrastructure, monitoring, retraining, handling edge cases, and convincing stakeholders that a model that is right 85% of the time is better than a process that is right 70% of the time but feels more controllable.
SafeSocial: When ML Had Real Consequences (2018-2022)
The project that changed how we thought about machine learning was SafeSocial.
Between 2018 and early 2022, we built ML models to detect toxicity, profanity, and violence in social media content. This was not eCommerce recommendations or ticket routing. This was content moderation at scale – the kind of work where a false negative means a child sees content they should not see, and a false positive means legitimate speech gets silenced.
We paired our custom models with Google’s Perspective API, the toxicity scoring system that came out of Jigsaw (formerly Google Ideas). Perspective API gave us a strong baseline for text-based toxicity detection. It scored text on attributes like toxicity, severe toxicity, identity attack, insult, profanity, and threat. But Perspective alone was not sufficient for what SafeSocial needed. Social media content is not just text. It is images, videos, slang that evolves weekly, context-dependent language where the same word is harmless in one conversation and harmful in another, and adversarial users who deliberately misspell words or use substitution characters to evade detection.
We built supplementary models that handled what Perspective could not. Image classification for violent or explicit visual content. Custom text classifiers trained on platform-specific language patterns. Ensemble models that combined Perspective’s toxicity scores with our own classifiers to improve precision without destroying recall.
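The ensemble idea is simple to sketch. Perspective returns attribute scores in [0, 1]; assume our custom classifier does the same. The weights and thresholds below are made up for illustration, not the production values.

```python
# Illustrative sketch of combining a general-purpose toxicity score
# (e.g. from Perspective) with a platform-specific classifier's score.
# Weights and thresholds here are placeholders, not production values.
def ensemble_score(perspective_score: float, custom_score: float,
                   w_general: float = 0.6, w_custom: float = 0.4) -> float:
    """Weighted blend of two [0, 1] toxicity scores."""
    return w_general * perspective_score + w_custom * custom_score

def should_flag(perspective_score: float, custom_score: float,
                threshold: float = 0.7) -> bool:
    # Either model being highly confident flags the content outright;
    # the blended score decides the ambiguous middle. This keeps recall
    # on clear-cut cases while the ensemble improves precision elsewhere.
    if max(perspective_score, custom_score) > 0.95:
        return True
    return ensemble_score(perspective_score, custom_score) >= threshold

print(should_flag(0.85, 0.60))  # blended 0.75 -> flagged
print(should_flag(0.40, 0.50))  # blended 0.44 -> not flagged
```

In production, borderline blended scores went to human review rather than a binary allow/block decision.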
The technical challenges were significant. Training data for toxicity detection is inherently unpleasant to work with. Annotation is subjective – reasonable people disagree about what constitutes toxicity, especially across cultural contexts. Model drift was constant because the language of online harassment evolves specifically to evade detection systems. We retrained models on a weekly cadence, which in 2018 felt aggressive but was barely adequate.
But the hardest part was not technical. It was the responsibility. Every threshold we set, every confidence score we chose as a cutoff, had direct consequences for real people. Set the threshold too low, and harmful content gets through. Set it too high, and you over-moderate, disproportionately silencing marginalized communities whose language patterns more frequently trigger false positives – a well-documented problem in content moderation systems.
SafeSocial taught us three things that shaped every AI project we built afterward:
ML in production is an ongoing commitment, not a deliverable. You do not build a toxicity model and ship it. You build a toxicity model, ship it, monitor it, retrain it, adapt it to new patterns, and continue doing that for as long as the system is live. The models we built in month one were meaningfully degraded by month six without continuous maintenance.
Hybrid approaches outperform pure ML. The combination of Perspective API’s general-purpose toxicity scoring with our custom models consistently outperformed either approach alone. General models provide breadth. Custom models provide specificity. This principle – combining general-purpose AI capabilities with domain-specific engineering – became a core tenet of how we approach every AI system.
Consequences demand guardrails. When your model’s output directly affects people’s safety and speech, you need robust guardrails: human review pipelines, appeals processes, confidence thresholds that favor caution, monitoring for bias in model outputs, and clear escalation paths for edge cases. This experience is why we talk so much about guardrails now in the context of agentic AI. We learned the lesson early, in a domain where the stakes were impossible to ignore.
GPT-3 and the API-Accessible AI Moment
June 2020. OpenAI released GPT-3 with API access. This was the inflection point that changed the trajectory of our AI work.
Before GPT-3, using AI in a client project meant training a model. You needed data, expertise, infrastructure, and time. The barrier to entry was high enough that AI was reserved for projects where the business case justified months of ML engineering.
GPT-3 collapsed that barrier. Suddenly, you could send text to an API and get back coherent, contextually relevant responses. No training data. No model infrastructure. No ML expertise required for basic tasks. The cost of experimenting with AI dropped from months and tens of thousands of dollars to hours and a few dollars in API credits.
We started small. Internal prototyping. Can we use this for generating first drafts of technical documentation? Can it summarize meeting notes? Can it extract structured data from unstructured client communications? The answer was yes to all of these, with caveats that we would spend the next several years learning to manage.
The first client project where we used GPT-3 in production was a content processing pipeline. The client had thousands of product descriptions that needed to be standardized, expanded, and optimized for search. Previously, this was a manual process done by freelance copywriters at significant cost and with inconsistent quality.
We built a pipeline that used GPT-3 to generate initial drafts, applied rule-based validation to catch common errors, and routed the output for human review. The pipeline reduced the cost per product description by roughly 70% and improved throughput from about 50 descriptions per day to over 500.
It also produced some spectacular failures. Product descriptions that were factually wrong. Tone inconsistencies that slipped past automated checks. One memorable incident where the model generated enthusiastic copy for a product that had been discontinued, because the prompt did not include inventory status as context. That last one taught us more about context engineering than any research paper could have.
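The shape the pipeline settled into after that incident can be sketched as follows. The field names and banned phrases are illustrative; the point is that inventory status now travels in the prompt, and rule-based checks run before anything reaches a human reviewer.

```python
# Sketch of the generate -> validate -> review pipeline shape.
# Product fields and banned claims are illustrative examples.
from dataclasses import dataclass

@dataclass
class Product:
    name: str
    discontinued: bool

BANNED_CLAIMS = ("guaranteed", "miracle", "best in the world")

def build_prompt(product: Product) -> str:
    # Inventory status goes into the prompt: omitting it was exactly
    # the context gap behind the discontinued-product incident.
    status = "DISCONTINUED - do not write promotional copy" \
        if product.discontinued else "in stock"
    return (f"Write a standardized product description for {product.name}. "
            f"Inventory status: {status}.")

def validate(draft: str, product: Product) -> list[str]:
    """Return a list of problems; an empty list lets the draft proceed."""
    problems = []
    if product.discontinued:
        problems.append("discontinued product: route to human review")
    problems.extend(f"banned claim: {w}" for w in BANNED_CLAIMS
                    if w in draft.lower())
    return problems

print(validate("A miracle device, guaranteed to last.", Product("Widget", False)))
```

Drafts with a non-empty problem list went to the human review queue instead of straight to publication.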
The Prompt Engineering Phase (and Why We Moved Past It)
Between 2021 and early 2023, our AI work was dominated by what the industry called prompt engineering. We were good at it. We developed systematic approaches to prompt design: structured templates, few-shot example libraries, chain-of-thought patterns for complex reasoning tasks, output format specifications. We built internal tools for prompt versioning, A/B testing different prompt variations, and tracking performance metrics across prompt changes.
This work produced real results. A client’s customer service automation improved its resolution rate from 34% to 58% through prompt optimization alone. A document analysis pipeline went from 72% accuracy to 87% by restructuring the prompt to provide better examples and clearer output constraints.
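The structured-template approach is easy to sketch. The examples and output schema below are illustrative, not our internal tooling; the pattern is a versioned template with slots for few-shot examples and an explicit output format.

```python
# Sketch of a structured few-shot prompt template: fixed instruction,
# example library, explicit output format. Contents are illustrative.
FEW_SHOT = [
    ("The app crashes when I upload a photo",
     '{"topic": "bug", "urgency": "high"}'),
    ("Love the new dashboard!",
     '{"topic": "feedback", "urgency": "low"}'),
]

def build_prompt(ticket: str) -> str:
    examples = "\n\n".join(f"Ticket: {t}\nOutput: {o}" for t, o in FEW_SHOT)
    return (
        "Classify the support ticket. Respond with JSON containing "
        '"topic" and "urgency".\n\n'
        f"{examples}\n\nTicket: {ticket}\nOutput:"
    )

print(build_prompt("Site is completely down"))
```

Versioning these templates and A/B testing example sets against held-out cases was what turned prompt tweaking into something measurable.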
But we also hit the ceiling. And we hit it repeatedly, in ways that forced us to rethink the entire approach.
The ceiling was this: prompt engineering treats the prompt as the primary lever for controlling AI behavior. As long as the task is simple enough that all the relevant information fits in a well-crafted prompt, this works. The moment the task requires external context – codebase knowledge, database records, document retrieval, multi-step reasoning across information sources – the prompt is not the bottleneck. The context is.
We have written extensively about the shift from prompt engineering to context engineering. The transition was not a sudden realization. It was a gradual accumulation of evidence from project after project where the prompt was fine but the output was wrong because the model did not have the information it needed.
The turning point was a project in late 2022 where we were building an internal knowledge assistant for a client. We spent three weeks optimizing prompts. The accuracy plateaued at 71%. Then we spent one week improving the retrieval system – better chunking, better embedding models, better relevance filtering. Accuracy jumped to 89%. Same prompt. Better context.
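One of the retrieval improvements from that week is simple to sketch: overlapping chunking, so a fact split across a chunk boundary still appears whole in at least one chunk. The sizes below are placeholders; in practice we tuned them per corpus.

```python
# Sketch of overlapping word-based chunking for retrieval. Chunk size
# and overlap are illustrative; tune both against your own corpus.
def chunk(text: str, size: int = 400, overlap: int = 80) -> list[str]:
    words = text.split()
    step = size - overlap  # each chunk repeats the last `overlap` words
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```

Paired with a better embedding model and a relevance cutoff on retrieved chunks, this kind of change, not the prompt, was what moved accuracy.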
That project changed how we thought about AI system design permanently.
Building CalliopeAI: Solving Our Own Problem
By early 2024, we had a problem that was becoming increasingly expensive to ignore. Every AI project we built required its own orchestration logic. Each project had its own prompt management, its own model provider integration, its own evaluation framework, and its own approach to multi-model routing. We were rebuilding the same infrastructure project after project, and each rebuild was slightly different in ways that made it hard to transfer learnings.
We built CalliopeAI to solve this. We have written about its origins and its technical architecture in detail. The short version: CalliopeAI is the AI workbench that we wished existed when we started building multi-model applications. It provides unified orchestration across model providers, prompt management with versioning and A/B testing, multi-model routing based on task type and quality requirements, and an evaluation engine that continuously measures model performance.
Building CalliopeAI taught us several things that shaped our subsequent client work:
Multi-model is not optional. We started with the assumption that we would default to one provider and use others as fallbacks. Within the first month of production use, we discovered that routing different task types to different models produced consistently better results at lower cost than using any single model for everything. Analysis tasks performed better on Claude. Structured output tasks performed better on GPT-4. Classification tasks performed equally well on cost-efficient models at a fraction of the price.
Evaluation is the foundation. Without systematic evaluation, multi-model routing is guesswork. You think Claude is better for analysis because a few examples looked good. But is it consistently better? Better by how much? Better on what types of analysis? The evaluation engine answered these questions with data instead of intuition, and the data often contradicted our intuitions.
Abstraction layers pay for themselves immediately. The first time a model provider changed their API, the first time a new model was released that outperformed our default, the first time a provider had a multi-hour outage – each of these events justified the entire investment in the abstraction layer. Organizations that call provider APIs directly from application code pay for that shortcut every time the landscape shifts.
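The routing lesson above reduces, at its simplest, to a table from task type to model with a sensible default. The model names and task categories below are illustrative examples, not CalliopeAI's actual configuration.

```python
# Illustrative multi-model routing table. Model names and task types
# are examples only; the real routing is driven by evaluation data.
ROUTES = {
    "analysis": "claude-3-5-sonnet",
    "structured_output": "gpt-4",
    "classification": "small-cost-efficient-model",
}
DEFAULT_MODEL = "gpt-4"

def pick_model(task_type: str) -> str:
    """Route a task to its best-performing model, falling back to a default."""
    return ROUTES.get(task_type, DEFAULT_MODEL)

print(pick_model("analysis"))
print(pick_model("something_new"))
```

The table itself is trivial; the evaluation engine that justifies each row with data is where the real work sits.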
The Amplification Effect
The most important lesson we have learned about AI in software engineering is also the most uncomfortable: AI amplifies whatever you already have.
Good engineering practices get amplified into great outcomes. Disciplined teams with clear specifications, strong testing habits, well-organized codebases, and rigorous review processes get dramatically more value from AI tools. The tools accelerate their existing strengths.
Sloppy practices get amplified into disasters. Teams with vague requirements, minimal testing, messy codebases, and perfunctory reviews produce bad code faster. The speed of AI generation means they create more technical debt in less time, with less human review per line of generated code.
This is not a hypothetical observation. We have seen both sides, sometimes on different projects within the same quarter.
Harvard Business School’s research on AI adoption in organizations supports this pattern broadly. Their work has found that AI tends to amplify existing organizational capabilities rather than transforming weak organizations into strong ones. Organizations that were already high-performing saw the largest gains from AI adoption, while organizations with underlying operational problems often found that AI made those problems more visible and more costly.
The implication is clear: if your engineering fundamentals are weak, fix them before you invest heavily in AI tooling. AI tools on top of a weak foundation produce impressive demos and unreliable production systems.
Specifically, the foundations that matter most for AI-amplified engineering:
- Specification discipline. AI tools need clear, precise, testable specifications. If your current process produces vague user stories that rely on developer intuition to fill gaps, AI agents will generate code that is confidently wrong about the gaps.
- Test coverage. When code is generated by agents, tests are the primary quality gate. If your test coverage is thin, you have no reliable way to verify that generated code works correctly.
- Codebase organization. Agentic coding tools work better with well-organized codebases. Clear module boundaries, consistent naming conventions, and logical file structures help agents understand the system context.
- Review rigor. Code review becomes more important, not less, when AI generates code. The review is no longer checking whether a human made a typo. It is evaluating whether the agent’s implementation correctly satisfies the specification, handles edge cases, and integrates cleanly with the existing system.
From Using AI Tools to AI-Native Delivery
The progression we went through over the last several years was not just about adopting better tools. It was about fundamentally restructuring how we deliver software.
Phase 1 was using AI tools. We added Copilot to our editors. We used ChatGPT for research and exploration. The tools sat alongside our existing process but did not change it.
Phase 2 was integrating AI into our process. We built AI steps into our pipelines. Automated code review with AI feedback. AI-assisted test generation. AI-powered documentation. The process started to change, but the human was still the primary implementer.
Phase 3 – where we operate now – is AI-native delivery. The engineering process is designed around AI capabilities from the start. Specifications are written in formats optimized for agent consumption. Agents are the primary implementers for well-defined tasks. Humans focus on architecture, specification, review, and the tasks that require judgment, creativity, or domain expertise that agents lack. Our HiVE (High-Velocity Engineering) methodology encodes this operating model.
The difference between Phase 2 and Phase 3 is not just the volume of AI usage. It is who is doing what. In Phase 2, the human writes code and the AI helps. In Phase 3, the AI writes code and the human directs, reviews, and decides. The human’s role shifts from implementer to architect, specifier, and evaluator.
This shift is uncomfortable for engineers who derive professional identity from writing code. We went through that discomfort ourselves. The resolution came from recognizing that the value was never in the typing. The value was always in the thinking – understanding the problem, designing the solution, evaluating the tradeoffs, ensuring quality. Those activities are more prominent in an AI-native workflow, not less.
Specific Mistakes We Made
Retrospectives are only valuable if they include the failures. Here are specific mistakes we made during this transition:
Mistake 1: Trusting AI output without sufficient verification. Early in our adoption, we shipped agent-generated code with insufficient review. On one project, an agent-generated database migration contained a subtle error that did not surface in our test environment but caused data corruption in production. The migration was technically correct SQL, but it applied transformations in the wrong order for the production data set. We caught and fixed it within hours, but it was a wake-up call about the difference between “compiles and passes tests” and “is correct.”
Mistake 2: Underinvesting in specifications. For the first several months of agentic engineering, we tried to use our existing user stories as agent input. The results were inconsistent. We knew that specifications needed to be more precise for agents, but the organizational habit of writing user stories was strong, and the upfront cost of writing detailed specifications felt like it slowed us down. It took three projects with poor first-pass success rates before we committed fully to spec-driven development. Once we did, first-pass success rates went from roughly 40% to over 85%.
Mistake 3: Treating all tasks as equally suitable for AI. Not every task benefits from agent implementation. Tasks that require deep domain expertise, nuanced stakeholder communication, or creative problem-solving in ambiguous situations are still better handled by experienced humans. We wasted time on prompting and re-prompting agents for tasks that an experienced engineer would have completed faster and better. Learning to classify tasks by suitability for agent implementation was a skill that took months to develop.
Mistake 4: Ignoring the organizational change. Adopting AI-native development is not a tooling change. It is an organizational change. Roles shift. Skills become more or less valuable. Workflows change. We underestimated the adjustment period required for our team and for our clients’ teams. Some engineers thrived immediately. Others needed time and coaching to find their footing in the new model.
Mistake 5: Over-indexing on speed. The speed of AI-assisted development is seductive. We could generate a first draft of an entire system in a day. But first drafts are not finished systems. Several times, we showed clients an AI-generated prototype and set expectations for delivery timelines based on the speed of generation rather than the total effort including review, testing, hardening, and integration. The generation was fast. Everything else took the same amount of time it always did.
What the Next Thirteen Years Look Like
Prediction is a fool’s errand in technology, but patterns are observable.
The pattern we see is that AI will continue to compress the implementation phase of software development. The time from “we know what to build” to “the code exists” will keep shrinking. This means that the phases that bookend implementation – understanding the problem, designing the solution, specifying the requirements, and verifying the result – will become proportionally more important and more valuable.
The engineers who thrive will be the ones who are excellent at these bookend activities. Understanding business domains deeply. Designing systems that handle real-world complexity. Writing specifications that are precise enough for agents to execute correctly. Evaluating output with the judgment that comes from experience.
The engineering organizations that thrive will be the ones that treated AI adoption as an organizational transformation, not just a tool purchase. They invested in their foundations. They adapted their processes. They developed the discipline of specification-driven development. They built the evaluation infrastructure to know whether their AI-assisted output actually works.
At CONFLICT, we will keep doing what we have always done: build software that solves real problems for real organizations, adapt our methods to use the best available tools, and maintain the engineering discipline that makes speed sustainable rather than reckless. The tools have changed dramatically since 2012. The principles have not changed at all.
The next thirteen years will bring capabilities we cannot imagine today, just as GPT-3’s capabilities were unimaginable in 2012. We will adopt them, make mistakes with them, learn from those mistakes, and keep building. That is what thirteen years of practice teaches you: the technology changes, the learning never stops, and the fundamentals always matter more than the tools.