DORA Metrics Cannot Measure What Agents Do

The DORA metrics – deployment frequency, lead time for changes, mean time to recover, change failure rate – were the most useful framework the industry produced for measuring software delivery in the last decade. They captured something real: high-performing teams ship more often, with shorter cycles, recover faster from incidents, and break fewer things when they ship. The four numbers correlated with each other and with the kinds of organizational health that engineering leaders had previously struggled to measure.

We adopted them. Most serious engineering orgs did. They have been good for the industry.

They are also, increasingly, the wrong instrument for measuring what happens in an org where agents do most of the implementation work. We have been watching DORA dashboards from agent-heavy teams paint pictures that are confusing at best and actively misleading at worst. The metrics are doing what they were designed to do. The thing they were designed to measure has changed shape underneath them.

What DORA Actually Measured

DORA’s four metrics were proxies for a deeper claim: software delivery quality is determined by how well a team’s process is functioning. Frequent deploys mean the team has confidence in its pipeline. Short lead times mean the team has cut wait states out of its workflow. Low mean time to recover means the team can detect and respond to production issues efficiently. Low change failure rate means the team’s pre-production discipline – review, testing, validation – is doing its job.

These proxies were tight because the underlying activity was human, slow, and serial. Each deploy represented hours of human work. Each commit was the product of human authorship. Each test run was scaffolding around human-written code. The metrics measured the throughput and quality of human-mediated work, which was the dominant cost of software delivery.

The proxies were also tight because the work was relatively uniform across the team. A senior engineer’s commit and a mid-level engineer’s commit, while different in quality, were broadly the same kind of artifact: a focused, scoped change written by a human who carried the implementation in their head. The metrics averaged across this uniform substrate and produced useful signals.

Agent-heavy delivery breaks both of these assumptions.

What Changes When Agents Do the Writing

The first thing that changes is the artifact distribution. A team running agents produces commits at a far higher rate – sometimes hundreds per day per engineer – because the implementation cost has collapsed. Where those commits flow through to production, the classical “deployment frequency” number goes up dramatically with them. A DORA dashboard would interpret this as “the team’s process is healthier.” That inference no longer holds. The team’s output rate is higher, but whether its process is healthier is a separate question, and the number cannot answer it.

The second thing is the variance in artifact quality. Agent-generated commits cluster around two extremes: very high quality on routine work where the agent’s training distribution covers the case well, and very low quality on adversarial edge cases where the agent hallucinates plausible-looking but broken behavior. The average commit quality is not the right summary statistic for this distribution. A team might be shipping ten high-quality commits and one low-quality one. The DORA “change failure rate” averages across them and gives you a number that does not capture the bimodal reality.
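The ten-and-one example above can be put into numbers. This is a minimal sketch, not a real pipeline: the `deploys` records and the "routine" and "edge_case" labels are hypothetical, and it assumes the one low-quality commit is the one that fails in production.

```python
from collections import defaultdict

# Hypothetical deploy records: (kind_of_work, caused_failure).
# Ten routine changes that hold up and one edge-case change that fails.
deploys = [("routine", False)] * 10 + [("edge_case", True)]

def failure_rate(records):
    """Fraction of deploys that caused a failure."""
    return sum(failed for _, failed in records) / len(records)

def failure_rate_by_kind(records):
    """Change failure rate computed separately for each kind of work."""
    groups = defaultdict(list)
    for kind, failed in records:
        groups[kind].append((kind, failed))
    return {kind: failure_rate(group) for kind, group in groups.items()}

overall = failure_rate(deploys)
by_kind = failure_rate_by_kind(deploys)

print(f"overall: {overall:.1%}")
for kind, rate in by_kind.items():
    print(f"{kind}: {rate:.1%}")
```

The aggregate comes out at roughly 9%, which looks like an ordinary change failure rate. The segmented view shows 0% on one kind of work and 100% on the other, which is the shape the single number hides.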

The third thing is the locus of work. Lead time for changes used to be a measure of how long it took to move a unit of work through the team’s process: commit to merge to deploy. In an agent-heavy team, the work is increasingly front-loaded into specification and back-loaded into review. The implementation phase between them is short; the spec phase is long; the review phase is long. The traditional lead time, measured from “first commit” to “deployed,” never sees the spec phase, and it folds review into the same number as an implementation phase that has been compressed to near-zero. The actual cycle time is dominated by activities that DORA either does not measure or does not separate out.
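The gap between a commit-anchored lead time and the full cycle can be sketched with a few timestamps. Everything here is an assumption made for illustration: the `feature` record, its field names, and the dates are invented, and real tooling would pull these events from a tracker and a deploy log.

```python
from datetime import datetime

# Hypothetical timestamps for one feature; all values are invented.
feature = {
    "spec_started": datetime(2025, 3, 3, 9, 0),
    "first_commit": datetime(2025, 3, 5, 14, 0),
    "review_started": datetime(2025, 3, 5, 15, 0),
    "deployed": datetime(2025, 3, 7, 11, 0),
}

def hours_between(start, end):
    """Elapsed wall-clock hours between two timestamps."""
    return (end - start).total_seconds() / 3600

# Lead time as a commit-anchored dashboard would see it.
dora_lead_time = hours_between(feature["first_commit"], feature["deployed"])

# Cycle time with the clock started when spec work began.
full_cycle_time = hours_between(feature["spec_started"], feature["deployed"])

# The same interval broken into phases.
phases = {
    "spec": hours_between(feature["spec_started"], feature["first_commit"]),
    "implementation": hours_between(feature["first_commit"], feature["review_started"]),
    "review_and_deploy": hours_between(feature["review_started"], feature["deployed"]),
}

print(f"commit-anchored lead time: {dora_lead_time:.0f} h")
print(f"full cycle time: {full_cycle_time:.0f} h")
for name, hours in phases.items():
    print(f"  {name}: {hours:.0f} h")
```

With these made-up numbers, more than half of the elapsed time is spec work that happens before the first commit, so it never enters the commit-anchored figure at all.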

The fourth thing is the recovery path. When an incident is rooted in agent-generated code, recovery looks different from recovery for human-written code. The fix is often not “patch the bug” but “improve the spec and regenerate.” The mean time to recover number conflates these two recovery modes and gives you an aggregate that hides the more interesting distinction.

The result is a dashboard that still produces four numbers, that still trend, that still get reviewed in monthly business reviews, and that no longer correspond to the underlying reality of how the team is functioning.

What We Have Started Measuring Instead

We have not replaced DORA with a different four-metric framework, because we have not yet found four metrics that correlate as cleanly with delivery health in the new environment. What we have done is supplement DORA with a set of measurements that capture what agents specifically change.

Specification quality. Measured as the rate of spec revisions per shipped feature. A team that ships features with stable, accurate specs is producing high-quality work; a team that has to revise specs three or four times per feature is producing work that is technically passing tests but failing the intent. We track this. We review trends. We treat spec instability as a leading indicator of downstream problems.
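As a rough illustration of the spec-revision measurement, here is a minimal sketch. The feature names and counts are hypothetical, and the threshold of three is taken loosely from the "three or four times" remark above rather than from any calibrated cutoff.

```python
from statistics import mean

# Hypothetical log: feature id -> number of spec revisions before it shipped.
spec_revisions = {
    "checkout-v2": 0,
    "export-csv": 1,
    "sso-login": 4,
    "audit-log": 3,
}

# Assumed cutoff for "unstable"; tune it to your own history.
UNSTABLE_THRESHOLD = 3

def revisions_per_feature(revisions):
    """Average number of spec revisions per shipped feature."""
    return mean(revisions.values())

def unstable_features(revisions, threshold=UNSTABLE_THRESHOLD):
    """Features whose spec was revised at least `threshold` times."""
    return sorted(name for name, count in revisions.items() if count >= threshold)

average_revisions = revisions_per_feature(spec_revisions)
flagged = unstable_features(spec_revisions)

print(f"spec revisions per shipped feature: {average_revisions}")
print(f"unstable specs: {flagged}")
```

The average is the trend line to watch over time; the flagged list is what gets discussed in review, since a healthy average can still hide a couple of features whose specs never settled.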

Review depth. Measured as time spent in review per line of code reviewed, segmented by reviewer seniority. Agent-generated code requires more review per line than equivalent human-generated code – because the reviewer carries the cognitive load that the author used to. If the review depth number drops, it is rarely a productivity win; it is usually a sign that reviewers are skimming work they should be reading carefully.
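Review depth as described above reduces to a ratio per seniority band. The sketch below assumes review records are available as simple tuples; the `reviews` data and the "senior" and "mid" labels are hypothetical.

```python
from collections import defaultdict

# Hypothetical review records: (reviewer_seniority, minutes_in_review, lines_reviewed).
reviews = [
    ("senior", 30, 200),
    ("senior", 45, 300),
    ("mid", 10, 400),
    ("mid", 20, 600),
]

def review_depth_by_seniority(records):
    """Seconds of review per line reviewed, pooled within each seniority band."""
    minutes = defaultdict(float)
    lines = defaultdict(int)
    for seniority, mins, n_lines in records:
        minutes[seniority] += mins
        lines[seniority] += n_lines
    # Bands with zero reviewed lines are skipped to avoid dividing by zero.
    return {band: minutes[band] * 60 / lines[band] for band in minutes if lines[band] > 0}

depth = review_depth_by_seniority(reviews)
for band, seconds_per_line in depth.items():
    print(f"{band}: {seconds_per_line:.1f} s/line")
```

The sketch pools minutes and lines within a band before dividing, so a large review counts for more than a small one. Averaging per-review ratios instead is a defensible alternative, but it lets tiny reviews swing the number.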

Spec-to-implementation drift. Measured by sampling shipped features and evaluating whether the implementation matches the spec. This is a manual measurement, done on a rotation, on a sample. It is the most expensive measurement we run, and the most informative. Drift trends are the canary on agent productivity gains: if the team is shipping faster and drift is growing, the apparent productivity gain is being purchased with technical debt that has not yet shown up in production.
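The evaluation itself is manual, but drawing the sample and turning verdicts into a rate can be scripted. A minimal sketch under stated assumptions: `pick_audit_sample` and `drift_rate` are hypothetical helpers, the feature names are placeholders, and the verdicts are invented stand-ins for what reviewers would record by hand.

```python
import random

def pick_audit_sample(shipped_features, sample_size, seed):
    """Choose features for manual spec-versus-implementation review."""
    rng = random.Random(seed)  # fixed seed so the draw can be reproduced later
    k = min(sample_size, len(shipped_features))
    return rng.sample(shipped_features, k)

def drift_rate(verdicts):
    """Share of audited features whose implementation did not match the spec."""
    if not verdicts:
        raise ValueError("no audited features")
    drifted = sum(1 for matches_spec in verdicts.values() if not matches_spec)
    return drifted / len(verdicts)

# Placeholder names for forty shipped features.
shipped = [f"feature-{i}" for i in range(1, 41)]
sample = pick_audit_sample(shipped, sample_size=5, seed=7)

# Reviewers fill these in by hand; here one of five is marked as drifted.
verdicts = {name: True for name in sample}
verdicts[sample[0]] = False

current_drift = drift_rate(verdicts)
print(f"audited: {sample}")
print(f"drift rate: {current_drift:.0%}")
```

With samples this small the rate is coarse, so the trend across rotations carries more information than any single reading.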

Agent rerun rate. Measured as the number of times a given specification has to be regenerated before producing acceptable output. A high rerun rate is a sign that the specifications are not yet precise enough for the model and tooling in use. It is also a sign that the team is burning compute and reviewer time on attempts that are not converging.
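One way to compute a rerun rate, sketched from a hypothetical generation log. The `attempts` format is an assumption: one entry per generation attempt, in order, tagged with whether a reviewer accepted the output.

```python
# Hypothetical generation log: (spec_id, accepted), in the order attempts were made.
attempts = [
    ("spec-a", True),
    ("spec-b", False),
    ("spec-b", False),
    ("spec-b", True),
    ("spec-c", False),
    ("spec-c", True),
]

def reruns_per_spec(log):
    """Regenerations per spec: attempts after the first, up to the first acceptance.

    Specs that were never accepted are counted by their attempts so far.
    """
    attempt_counts = {}
    accepted_specs = set()
    for spec_id, accepted in log:
        if spec_id in accepted_specs:
            continue  # ignore anything logged after the spec was accepted
        attempt_counts[spec_id] = attempt_counts.get(spec_id, 0) + 1
        if accepted:
            accepted_specs.add(spec_id)
    return {spec_id: count - 1 for spec_id, count in attempt_counts.items()}

reruns = reruns_per_spec(attempts)
rerun_rate = sum(reruns.values()) / len(reruns)

print(reruns)
print(f"mean reruns per spec: {rerun_rate:.1f}")
```

The per-spec counts are usually more actionable than the mean, because the specs that fail to converge tend to be a small, identifiable subset.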

Incident root-cause distribution. Measured by tagging incidents with whether the root cause was specification (we asked for the wrong thing), generation (the agent produced something different from what was specified), review (the spec and the implementation were both wrong and the reviewer missed it), or operation (the system was correct as deployed and failed for environmental reasons). The distribution tells us where to invest. A team whose incidents are dominated by specification issues needs better specification practice; one whose incidents are dominated by generation issues needs to look at the model and tooling in use; one whose incidents are dominated by review issues needs better review tooling; one whose incidents are dominated by operation issues needs better infrastructure.
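The tagging scheme above maps naturally onto a small enumeration and a counter. This is a sketch only: the `RootCause` names mirror the four categories in the text, and the `incidents` list is invented.

```python
from collections import Counter
from enum import Enum

class RootCause(Enum):
    SPECIFICATION = "specification"  # we asked for the wrong thing
    GENERATION = "generation"        # output differed from the spec
    REVIEW = "review"                # the reviewer missed the problem
    OPERATION = "operation"          # correct as deployed, failed for environmental reasons

# Hypothetical tags for five incidents.
incidents = [
    RootCause.SPECIFICATION,
    RootCause.SPECIFICATION,
    RootCause.REVIEW,
    RootCause.OPERATION,
    RootCause.SPECIFICATION,
]

def root_cause_distribution(tags):
    """Share of incidents per root cause, including causes with zero incidents."""
    if not tags:
        return {}
    counts = Counter(tags)
    return {cause: counts[cause] / len(tags) for cause in RootCause}

distribution = root_cause_distribution(incidents)
dominant = max(distribution, key=distribution.get)

for cause, share in distribution.items():
    print(f"{cause.value}: {share:.0%}")
print(f"dominant root cause: {dominant.value}")
```

Reporting the zero-count categories explicitly is deliberate: a category that never appears is either good news or a sign that nobody is tagging incidents that way, and both are worth noticing.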

These measurements are noisier than DORA, harder to dashboard, and require more judgment to interpret. They are also the measurements that correspond to the work that actually determines delivery health in an agent-heavy environment.

Why DORA Will Stick Around Anyway

Most engineering orgs will not abandon DORA. The metrics are well-known, defended in industry research, embedded in tooling, and culturally entrenched. The dashboards exist. The KPIs are signed off. The board reviews them. Replacing them is a political project, not just an analytical one.

That is fine. DORA will continue to produce numbers. The numbers will continue to mean something. They will just mean less than they used to.

The senior leaders who care about understanding what is actually happening in their delivery process will quietly add the supplementary measurements alongside DORA. The dashboards will get longer. The conversations will get more nuanced. The four-number summary will fade into one slide among several, instead of the headline.

This is the normal path for measurement frameworks at the end of their useful life. They do not get abandoned. They get supplemented, then over-supplemented, then quietly de-emphasized as the new measurements take their place. We are in the early supplementation phase for DORA, and the supplements are not yet standardized.

The Honest Statement

DORA was the right framework for the era of human-authored software delivery. It is no longer sufficient for the era of agent-augmented delivery. The metrics still produce numbers; the numbers still trend; the trends no longer reliably correspond to what is happening in the work.

If your engineering leadership team is still reviewing DORA without supplementary measurements, you are reviewing a dashboard that is increasingly disconnected from operational reality. Add the supplements. Pay attention to specification quality, review depth, drift, rerun rate, and incident root-cause distribution. Treat DORA as a partial summary, not as the answer.

The industry will probably converge on a new compact framework eventually. It does not yet exist. While we wait, measure what actually matters.