
We Replaced Our Own Workflows Before We Replaced Yours

There is a credibility problem in AI consulting. Companies that have never used AI agents for their own engineering are advising clients on how to adopt them. Agencies that still run two-week sprints with fifteen-person teams are selling “AI transformation” engagements. Firms whose internal tooling consists of Jira, Confluence, and a shared Google Drive are promising to build AI-native platforms for enterprises.

This is not a criticism of AI expertise. It is a criticism of unearned authority. If your own workflows have not been rebuilt around AI, you do not understand the problems that arise when you do, and you cannot credibly guide someone else through the transition.

At CONFLICT, every tool and methodology we recommend to clients, we built for ourselves first. Not as a side project. Not as a demo. As production infrastructure that runs our business every day. We replaced our own workflows before we offered to replace anyone else’s, and that decision is the reason our tools work in the real world and not just in pitch decks.

The Principle Behind the Practice

The concept of dogfooding – using your own products internally before releasing them to customers – has been a principle in software development since at least the 1980s. The term is most commonly attributed to Microsoft, where Paul Maritz is reported to have sent a memo titled “Eating our own Dogfood” urging the company to increase internal usage of its own products. The reasoning was simple: if Microsoft’s employees would not use Microsoft’s software, why should anyone else?

The principle has proven durable because it addresses a real problem. Products built in isolation from their use context tend to optimize for the wrong things. They look good in demos. They check the feature boxes. But they fail in the workflows where actual work happens, because the builders never experienced those workflows themselves.

For a consultancy, dogfooding is more than a product development practice. It is a credibility mechanism. When we tell a client that our discovery process produces better context for AI agents, we can demonstrate it because we run that process on our own projects. When we say that spec-driven development produces higher first-pass accuracy, we can show the data from our own engagements. When we recommend a tool, we can describe the problems we hit using it and the workarounds we developed, because we hit those problems first.

This is the difference between theoretical knowledge and operational knowledge. Theoretical knowledge says “agents need structured context to produce good output.” Operational knowledge says “agents need structured context, and here is the specific schema we use, the failure mode we discovered when the schema was too flat, the iteration we did to add cross-references, and the quality improvement we measured after each change.” Clients benefit from the second kind, and you only get the second kind by doing the work yourself.

The Internal Stack

Here is what our internal tooling actually looks like – the systems we built to run our own engineering practice, each of which emerged from a specific problem we could not solve with vendor products.

CalliopeAI: Multi-Model Orchestration

The problem: we were building AI features for clients across multiple model providers – OpenAI, Anthropic, Google, and others – and every project required custom integration code for each provider. Switching between models meant rewriting prompts, response parsing, and error handling. Evaluating which model performed best for a given task required building one-off benchmarks for every engagement. The overhead was real and it was growing.

The solution: CalliopeAI is our AI workbench. It provides a unified interface for working with any model provider, combining three capabilities that we could not find together in any vendor product.

Multi-model orchestration. A single API call can route to any supported model. The routing can be static (always use Claude for analysis tasks) or dynamic (route to the model that scores highest on this task type based on historical evaluation data). When a provider goes down, traffic automatically reroutes. When a new model launches, it can be added to the rotation without changing application code.
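The routing logic described above can be sketched in a few lines. This is an illustrative toy, not CalliopeAI's actual API – the class name, the score table, and the failover behavior are all assumptions made for the example:

```python
class ModelRouter:
    """Toy sketch of static/dynamic routing with failover.
    All names here are illustrative, not CalliopeAI's real interface."""

    def __init__(self, providers):
        # providers: dict of name -> callable(prompt) -> response string
        self.providers = providers
        self.scores = {name: 0.5 for name in providers}  # historical eval scores
        self.down = set()  # providers currently failing

    def route_static(self, name, prompt):
        # Static routing: always use the named model.
        return self._call(name, prompt)

    def route_dynamic(self, prompt):
        # Dynamic routing: pick the healthy provider with the best score.
        healthy = [n for n in self.providers if n not in self.down]
        best = max(healthy, key=lambda n: self.scores[n])
        return self._call(best, prompt)

    def _call(self, name, prompt):
        try:
            return self.providers[name](prompt)
        except Exception:
            # Failover: mark the provider down and reroute the same request.
            self.down.add(name)
            return self.route_dynamic(prompt)
```

The point of the sketch is the shape of the abstraction: application code calls the router, never a provider SDK directly, so adding a model to the rotation touches the provider registry, not the application.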

Prompt versioning. Every prompt is versioned, with metadata about which models it was written for, what evaluations it has passed, and how it performs relative to previous versions. This solves the problem we kept hitting where a prompt that worked well on GPT-4 produced inconsistent results on Claude, or where a prompt improvement for one task regressed performance on another. Version control for prompts is as important as version control for code, and we did not find a vendor product that handled it with the rigor we needed.

Evaluation. Built-in evaluation pipelines that measure model and prompt performance against defined criteria. When we change a prompt, the evaluation pipeline runs automatically and tells us whether the change improved, degraded, or had no effect on output quality. This replaced the ad-hoc “run it a few times and see if it looks right” approach that most teams use, which is fine for experimentation but unacceptable for production systems.
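The improved/degraded/no-effect verdict reduces to a simple loop. A hedged sketch, assuming each test case carries its own pass/fail check – not the actual pipeline:

```python
def evaluate(run_prompt, cases, threshold=0.8):
    """Score a prompt against labeled cases.
    run_prompt(input) -> output; each case has check(output) -> bool."""
    passed = sum(1 for case in cases if case["check"](run_prompt(case["input"])))
    score = passed / len(cases)
    return {"score": score, "passed": score >= threshold}

def compare(old_prompt, new_prompt, cases):
    # Did the prompt change improve, degrade, or not affect output quality?
    old = evaluate(old_prompt, cases)
    new = evaluate(new_prompt, cases)
    if new["score"] > old["score"]:
        verdict = "improved"
    elif new["score"] < old["score"]:
        verdict = "degraded"
    else:
        verdict = "no effect"
    return verdict, old["score"], new["score"]
```

The contrast with "run it a few times and see if it looks right" is that the cases and checks are fixed, so the same change gets the same verdict every time it is evaluated.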

We use CalliopeAI on every client engagement that involves AI capabilities, which at this point is most of them. We also use it internally for our own AI-powered tools. The system has processed millions of requests across multiple providers, and every failure mode, rate limit edge case, and provider quirk we have encountered is handled in production-tested code.

PlanOpticon: From Meetings to Knowledge

The problem: we record every client discovery session, every internal architecture review, every planning meeting. We have always done this. What we did not do was extract value from the recordings. They sat in Google Drive, and the knowledge in them was accessible only to the people who attended the meeting. If someone needed context from a session they missed, they had to ask a colleague or – in theory – rewatch the recording. Nobody rewatches recordings.

The solution: PlanOpticon processes video recordings and extracts structured knowledge. It produces full transcripts with speaker identification, extracts key decisions and action items, identifies diagrams and visual content, and builds knowledge graphs that map relationships between concepts, people, and decisions across sessions.
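The knowledge graph at the core of this can be pictured as typed nodes (people, decisions, concepts) connected by labeled edges. A toy sketch of that shape – this is not PlanOpticon's actual data model, just an illustration of what "map relationships across sessions" means concretely:

```python
class KnowledgeGraph:
    """Toy knowledge graph: typed nodes plus labeled edges."""

    def __init__(self):
        self.nodes = {}   # node_id -> {"type": ..., "label": ...}
        self.edges = []   # (source_id, relation, target_id)

    def add(self, node_id, node_type, label):
        self.nodes[node_id] = {"type": node_type, "label": label}

    def link(self, source, relation, target):
        self.edges.append((source, relation, target))

    def decisions_by(self, person_id):
        # Walk the edges to find every decision a person made,
        # regardless of which session it was extracted from.
        return [self.nodes[target]["label"]
                for source, relation, target in self.edges
                if source == person_id and relation == "decided"]
```

Once every session feeds the same graph, a question like "what has this stakeholder decided so far?" becomes a traversal instead of a rewatch.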

What started as a tool for processing client meetings evolved into something more significant: a planning agent. PlanOpticon does not just extract knowledge – it uses that knowledge to generate project plans grounded in real domain context. When you process ten discovery sessions through PlanOpticon, the knowledge graph contains the domain model, the stakeholder priorities, the technical constraints, the organizational dynamics, and the decision history. Feed that knowledge graph to the planning agent and it produces a project plan that reflects the actual complexity of the project, not a template based on estimated story points.

We open-sourced PlanOpticon on PyPI and GitHub, which we wrote about in our piece on why we open source our best work. The strategic reasoning applies: the code demonstrates our capability in a way that no pitch deck can. But the operational benefit is equally important. Running PlanOpticon on our own meetings means that every internal project benefits from the same knowledge extraction and planning capabilities we offer to clients. Our internal planning is better because of the tool, and the tool is better because we use it for internal planning.

Boilerworks: Production-Ready From Day One

The problem: every new project started with 40 to 60 hours of scaffolding. CI/CD pipelines, database migrations, authentication, logging, health checks, monitoring, testing infrastructure, Dockerfiles, Kubernetes manifests, secrets management, deployment scripts. We were solving the same problems on every engagement, slightly differently each time, and the slight differences made it impossible to simply copy the previous project’s infrastructure.

The solution: Boilerworks is an opinionated project scaffolding platform. Describe what you are building – web API, background worker, data pipeline, frontend application – and Boilerworks generates a complete, production-ready project foundation with infrastructure configuration, CI/CD, application scaffolding, testing, observability, and security baseline.

The key word is “opinionated.” Boilerworks does not ask whether you want structured logging. It sets up structured logging. It does not ask whether you want integration tests. It configures the integration test framework. These opinions are not arbitrary. They are distilled from thirteen years of building production systems and observing which shortcuts during scaffolding produce problems six months later.
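"Opinionated" has a simple mechanical meaning: the baseline is unconditional and only the project type varies. A hypothetical sketch – the file names, contents, and project types below are invented for illustration and are not Boilerworks' real layout:

```python
from pathlib import Path

# Opinionated baseline: generated for every project, never asked about.
BASELINE = {
    ".github/workflows/ci.yml": "# CI pipeline: lint, test, build\n",
    "logging.conf": "# structured JSON logging config\n",
    "tests/test_health.py": "def test_health():\n    assert True\n",
    "Dockerfile": "FROM python:3.12-slim\n",
}

# Only this part varies with what you say you are building.
PER_TYPE = {
    "web-api": {"app/main.py": "# API entrypoint with health check\n"},
    "worker": {"worker/main.py": "# background worker loop\n"},
}

def scaffold(project_type, root):
    """Write the baseline plus the type-specific files under root.
    Returns the sorted list of relative paths written."""
    files = {**BASELINE, **PER_TYPE[project_type]}
    for rel_path, content in files.items():
        path = Path(root) / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)
    return sorted(files)
```

Because logging, tests, CI, and the container image live in the unconditional half, there is no code path where a project ships without them – which is the whole point of the opinions.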

We use Boilerworks to start every internal project and every client engagement. This means that every project we ship has the same production-grade foundation: the same logging patterns, the same monitoring setup, the same security baseline, the same deployment pipeline. The consistency reduces cognitive load when engineers move between projects and ensures that no project ships without the observability and operational infrastructure that production systems require.

The internal usage also drives improvement. Every time Boilerworks generates a foundation that needs modification for a specific project type, we feed that modification back into the platform. The scaffolding gets more comprehensive and more accurate with every project, because we are the primary users and we feel the friction of every gap.

HiVE: The Methodology That Connects Everything

The tools above are components. HiVE – High-Velocity Engineering – is the methodology that ties them together into a coherent delivery system.

HiVE is spec-driven, agent-executed, and human-reviewed. The workflow moves through defined stages:

  1. Discovery and context building. Deep stakeholder sessions, processed through PlanOpticon, producing a federated knowledge system that serves as the foundation for everything downstream.
  2. Specification. Senior engineers write formal specifications – functional requirements, non-functional requirements, interface contracts, validation criteria, domain context – drawing from the knowledge system built in discovery.
  3. Project scaffolding. Boilerworks generates the production-ready project foundation, configured for the specific technology stack, cloud provider, and operational requirements of the engagement.
  4. Agent execution. AI agents implement the specifications, using CalliopeAI for any AI capabilities and operating within the guardrails defined by the specification.
  5. Quality gates. Automated testing, security scanning, performance validation, and specification compliance checks enforce quality structurally. Human review focuses on architectural decisions, business logic correctness, and domain alignment – the things that require judgment, not just verification.
  6. Deployment and monitoring. The infrastructure generated by Boilerworks includes deployment pipelines and monitoring from day one. There is no “hardening phase” because the foundation was production-grade from the start.
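The control flow of the stages above – artifacts passing through stages, with gates that must pass before the next stage runs – can be sketched generically. The stage and gate names here are illustrative, not HiVE's internal vocabulary:

```python
def run_pipeline(spec, stages, gates):
    """Run each (name, stage) in order; every gate registered for a stage
    must pass on that stage's output before the pipeline continues."""
    artifact = spec
    for name, stage in stages:
        artifact = stage(artifact)
        for gate in gates.get(name, []):
            if not gate(artifact):
                raise RuntimeError(f"quality gate failed after stage: {name}")
    return artifact
```

The structural property this captures is that quality is enforced between stages, not inspected at the end – a gate failure stops the pipeline at the stage that caused it.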

We did not design HiVE in a conference room and then impose it on projects. We evolved it through dozens of engagements, observing what worked, what failed, and what needed adjustment. The methodology is a product of experience, not theory. It works because it has been tested against the full range of project types, client organizations, and technical challenges that a thirteen-year consultancy encounters.

How We Use Agentic Coding Internally

Beyond our custom tools, we use agentic coding tools – Claude Code, primarily – for our own development. This is not a productivity hack bolted onto our existing process. It is integrated into the HiVE workflow at every level.

Our engineers write specifications for internal features the same way they write specifications for client features. Those specifications go to agents for implementation. The output is reviewed against the specification and the quality gates. The process is identical whether we are building a feature for a client platform or a feature for CalliopeAI.

This means that every improvement we make to our specification process, our quality gates, or our agent orchestration benefits both our internal tools and our client delivery. The feedback loop is tight. When we discover that a certain specification format produces better agent output on a client project, we adopt that format for internal development. When we discover a failure mode in agent-generated code on an internal project, we add a quality gate that catches it on client projects too.

The volume of this internal usage matters. We are not running agents on a couple of side projects and extrapolating. We are running agents on every project, every day, across every part of our business. We know what agents are good at (implementing well-specified features, writing tests, generating boilerplate, refactoring code that follows clear patterns). We know what they are bad at (making architectural decisions without sufficient context, handling ambiguous requirements, reasoning about complex system interactions that span multiple services). We know these things from thousands of hours of operational experience, not from reading research papers.

When a client asks us whether agents can handle a particular type of work, our answer comes from direct experience, not theoretical assessment. That is the value of dogfooding at scale.

Why Vendor Products Were Not Enough

A reasonable question: why build custom tools instead of buying vendor products? The AI tooling market has exploded. There are products for prompt management, model evaluation, project scaffolding, meeting transcription, and every other capability we built internally. Why not just buy the best-in-class vendor for each category?

We tried. We evaluated vendor products for every capability we ended up building. Here is what we found:

Vendor products optimize for the general case. A prompt management tool built for the market needs to serve solo developers, startups, and enterprises. The result is a product that is broad but shallow – it handles the 80% case well and the 20% case poorly. Our needs are consistently in the 20%. We need prompt versioning that integrates with our evaluation pipeline, our model routing logic, and our specification workflow. No vendor product provided that integration because no vendor product was designed for our specific workflow.

Vendor products create dependencies. Every SaaS product in your stack is a dependency. The vendor can change pricing, deprecate features, suffer outages, or get acquired. When your delivery methodology depends on a vendor product, you have outsourced a critical capability to a company whose incentives may not align with yours. Our tools are owned by us. We control the roadmap, the uptime, and the feature set.

Vendor products do not compose well. We needed meeting transcription that feeds into knowledge graph extraction that feeds into project planning that feeds into specification generation that feeds into agent-driven development. No vendor ecosystem provides that end-to-end pipeline. Building with vendors means building integration glue between five or six products, each with its own API conventions, authentication model, and update cadence. The integration glue becomes its own maintenance burden.

Vendor products do not encode operational knowledge. When we build a tool, we encode our operational knowledge into it. PlanOpticon does not just transcribe meetings – it extracts the specific types of knowledge that matter for software project discovery because we built it to serve that purpose. Boilerworks does not just scaffold projects – it scaffolds them with the specific patterns and configurations that thirteen years of production deployments have taught us are essential. This operational knowledge is our competitive advantage, and it lives in the tools.

This does not mean we use zero vendor products. We use cloud infrastructure from AWS. We use model APIs from OpenAI, Anthropic, and Google. We use GitHub for version control. We buy commodity infrastructure and build differentiated capability. The dividing line is clear: if the capability is a commodity, buy it. If the capability is a differentiator, build it.

What This Means for Trust

When a consultancy recommends a tool or methodology they do not use themselves, the recommendation is theoretical. It may be well-researched. It may be based on industry best practices. But it has not been tested against the consultancy’s own operational reality.

When we recommend HiVE, we are recommending a methodology we have run on our own projects for years. When we recommend spec-driven development, we are recommending a process that governs how we build our own tools. When we integrate CalliopeAI into a client’s platform, we are deploying a system that handles our own AI workloads. When we scaffold a project with Boilerworks, we are using the same foundation that powers our internal systems.

This is not a marketing claim. It is a verifiable structural fact. Our open-source projects – PlanOpticon on GitHub and PyPI – are available for anyone to inspect. The architecture, the code quality, the testing patterns, the documentation – it is all public. A prospective client can evaluate our engineering standards by looking at the software we use ourselves.

The trust equation in consulting has always been: “Can you do what you say you can do?” Open source and dogfooding together provide the most credible answer possible to that question. We do not just say we can build production-grade AI systems. We publish our own and let people evaluate them on their merits.

The Compounding Advantage of Internal Usage

There is a compounding effect to building and using your own tools that vendor products cannot replicate.

Every project we run generates feedback that improves our tools. A discovery session that reveals a gap in PlanOpticon’s knowledge extraction gets fed into the next PlanOpticon release. A client engagement where Boilerworks’ scaffolding did not include a needed integration type results in that integration type being added for every future project. An evaluation pipeline in CalliopeAI that catches a failure mode on one engagement catches it on every subsequent engagement.

This flywheel has been running for years. The tools get better with every project. The projects get better because the tools are better. The methodology evolves based on what the tools make possible. New tool capabilities enable new methodological improvements, which create new tool requirements. The cycle is self-reinforcing.

A consultancy using vendor products cannot drive this cycle. They can request features from vendors. They can file bug reports. They can switch to different vendors when their needs diverge from the product roadmap. But they cannot encode their operational learning directly into the tools that power their delivery, and that encoding is where the compounding advantage lives.

The Harder Path That Pays Off

Building internal tools is not the easy path. It requires engineering investment that does not directly produce client revenue. It requires maintaining multiple production systems in addition to client deliverables. It requires the discipline to actually use your own tools even when a quick workaround would be faster for a specific project.

We have been on this path for years, long before AI made it trendy to talk about internal tooling. We built our own infrastructure because we were frustrated with the alternatives. We built our own delivery methodology because the industry-standard approaches were not producing the results we wanted. We rebuilt our workflows around AI because we saw the potential to deliver better outcomes for clients, and we wanted to understand the transition deeply enough to guide others through it.

The payoff is a consultancy that practices what it preaches. Not because it is virtuous, but because it is the only way to build the operational knowledge that makes the preaching worth listening to. We replaced our own workflows first. Then we proved the replacements work by running our business on them for years. Then – and only then – we started offering to replace yours.

That sequence matters. It is the difference between a consultancy that sells ideas and one that sells proven systems. We have always preferred to be the latter.