AI ROI math is contaminated at the inputs. The 40% time savings is self-reported. The 3x PR throughput is a review-queue traffic jam. The board number is one cherry-picked team. Four measurement layers, the rework tax nobody applies, and the attribution problem.
Why self-reported time savings overstate gains by 30-50% — and what controlled experiments actually find
The rework tax: roughly 37-40% of gross AI time savings evaporate into error correction
Four measurement layers and which questions each one answers (most orgs stop at Layer 1)
Attribution: three approaches that get you a defensible number without a perfect control group
The full cost denominator — license fees are 10-25% of true cost
A SQL query for rework-adjusted ROI, a TypeScript measurement schema, and four YAML configs you can run Monday
Your vendor reports 40% developer time savings. Your VP of Engineering shows a 3x increase in pull request volume. The CEO tells the board the AI investment is paying back ahead of schedule. Everyone is happy.
None of those numbers mean what anyone in the room thinks they mean.
The 40% figure comes from a self-reported survey where developers estimated how long the task would have taken without AI — about as reliable as asking someone how much they would have spent without a coupon. The 3x PR throughput is real, and it is a problem: AI generated more small, trivial changes that now clog the review queue. The board slide projected one team's output across the entire org.
This is the operating state of AI ROI measurement in most companies. A polite fiction nobody examines closely because everyone needs the number to be true.
Spend is compounding faster than measurement. The gap is structural, not technical.
Gartner forecasts that worldwide AI spending will hit $2.5 trillion in 2026[1], roughly 44% above 2025. The same firm predicts that over 40% of agentic AI projects will be cancelled by the end of 2027 due to escalating costs and unclear value[2].
Read those two numbers next to each other. Spend is accelerating. Measurement is not. Only about 29% of executives say they can confidently quantify AI ROI. Deloitte's 2026 State of AI report found 74% of organizations hope to grow revenue through AI, but only around 20% report actually doing so[3]. Hope is not a metric.
The most rigorous data we have now contradicts what vendors sell. METR ran a randomized controlled trial in which 16 experienced open-source developers completed 246 tasks, each randomly assigned to allow or disallow early-2025 AI tools. The result: AI increased task completion time by 19%[10]. Before the study, developers forecast AI would reduce completion time by 24%. After completing it, they still estimated a 20% reduction. The self-reported belief and the measured reality pointed in opposite directions.
Faros AI went wider: they analyzed telemetry from 10,000 developers across 1,255 engineering teams and found developers merge 98% more PRs and complete 21% more tasks with AI tools — yet DORA metrics at the organizational level stayed flat[11]. Review time increased 91%. PR size surged 154%. More code, faster, with no improvement in how quickly software reached production or how stable it was when it got there.
A Workday study found roughly 37-40% of time saved through AI is offset by time spent correcting, verifying, or rewriting low-quality outputs[5]. Only 14% of employees consistently reported net-positive outcomes.
The problem is not the formula. The inputs to the formula are contaminated.
Organizations are not bad at arithmetic. They are calculating against poisoned inputs. Five structural failure modes show up in nearly every AI ROI deck.
Measuring AI productivity requires knowing what would have happened without AI. You cannot run the same quarter twice. Self-reported estimates ('this would have taken me 4 hours') carry a known bias — people overestimate task difficulty after getting help, the same way you overestimate how long a drive would have taken once you used the GPS.
When a team ships 30% faster after adopting an AI coding tool, was it the AI? The new team lead who started the same month? The deploy pipeline that landed the same sprint? The fact that this feature was a straightforward CRUD endpoint? Isolating the AI signal from every other concurrent variable is structurally hard in real environments.
Bain's 2025 Technology Report makes this concrete: writing and testing code accounts for only 25-35% of the time from initial idea to product launch[12]. AI accelerates that slice. The remaining 65-75% — requirements gathering, planning, deployment, maintenance — stays unchanged. Speed up one stage without clearing the next, and you have built a more impressive traffic jam.
Raw throughput ignores the rework tax. AI helps you draft in 20 minutes instead of 60. You spend 25 minutes fixing hallucinations and correcting tone. Net savings: 15 minutes — not 40. Most organizations track the 40 and quietly forget the 25.
ROI requires a cost denominator. Most organizations dramatically undercount cost. License fees are the visible part. Training time, prompt engineering effort, review overhead for AI outputs, infrastructure for local models, opportunity cost of integration work — all belong in the denominator. Almost nobody puts them there.
Conflate what AI does with what the business gets, and the math falls apart.
Honest measurement separates what AI tools do from what the business gets. Different things, different time horizons, different metrics. Conflating them is where almost every ROI claim breaks.
The stack is four layers. Each answers a specific question. You need all four. Most organizations stop at Layer 1 and file the win report.
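One way to keep the four layers from blurring together is to record them as separate, optional fields on the same record. The sketch below is illustrative only: the type and field names (`LayerSnapshot`, `deepestLayerMeasured`, and the individual metrics) are assumptions, not an established schema.

```typescript
// Hypothetical schema: one record per team per reporting period.
// All names here are illustrative; adapt them to your own data sources.
interface ActivityLayer { weeklyActiveUsers: number; licensedUsers: number }
interface TaskLayer { grossMinutesSaved: number; reworkMinutes: number }
interface DeliveryLayer { medianCycleTimeDays: number; defectsEscaped: number }
interface BusinessLayer { revenueDelta: number; avoidedCost: number; totalCostOfOwnership: number }

interface LayerSnapshot {
  team: string;
  periodStart: string;       // ISO date, e.g. "2026-01-05"
  activity?: ActivityLayer;  // Layer 1: are people using the tool?
  task?: TaskLayer;          // Layer 2: are individual tasks faster or better?
  delivery?: DeliveryLayer;  // Layer 3: is the team shipping better work faster?
  business?: BusinessLayer;  // Layer 4: did the business get anything?
}

// Returns the deepest layer for which this layer and every layer below it
// has data, or 0 if even Layer 1 is missing.
function deepestLayerMeasured(s: LayerSnapshot): 0 | 1 | 2 | 3 | 4 {
  if (!s.activity) return 0;
  if (!s.task) return 1;
  if (!s.delivery) return 2;
  if (!s.business) return 3;
  return 4;
}
```

A report built on a snapshot where `deepestLayerMeasured` returns 1 is an adoption report, whatever the slide title says.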
Usage is necessary. Usage is not value. Email has 100% adoption.
Layer 1 tracks whether people use the tools you bought. Active users, session frequency, feature adoption, prompts per day. This is where every vendor dashboard lives, and it is the layer most organizations mistake for ROI.
Activity answers exactly one question: are people using the tool? That matters — a tool nobody uses has zero ROI by definition. But a tool everyone uses also has zero ROI if it does not change outcomes. Email has 100% adoption. Nobody claims email has positive ROI.
Treat activity as a health check. If adoption is low, investigate. If adoption is high, move to Layer 2. Either way, activity is not the answer.
Vanity activity metrics:
Total number of AI tool licenses purchased
Monthly active users with no segmentation
Total prompts sent across the organization
Percentage of developers with Copilot enabled
Activity metrics worth tracking:
Weekly active users who use the tool 3+ days per week
Adoption rate by team, role, and tenure band
Feature-level usage: completions vs chat vs inline
30-day drop-off: who started and stopped
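Two of these metrics, users active on 3+ days in a week and drop-off between two windows, are easy to compute from raw usage events. A minimal sketch, assuming a hypothetical `UsageEvent` shape with one row per user per interaction; real tool telemetry will look different.

```typescript
// Hypothetical event shape; `day` is an ISO date such as "2026-01-05".
interface UsageEvent { userId: string; day: string }

// Users who were active on at least `minDays` distinct days in the supplied
// events. Pass in one week of events to get "3+ days per week" users.
function engagedUsers(events: UsageEvent[], minDays = 3): string[] {
  const daysByUser: Record<string, Record<string, true>> = {};
  for (const e of events) {
    const days = daysByUser[e.userId] ?? {};
    days[e.day] = true; // repeated events on the same day count once
    daysByUser[e.userId] = days;
  }
  return Object.keys(daysByUser)
    .filter((id) => Object.keys(daysByUser[id] ?? {}).length >= minDays)
    .sort();
}

// Users who appear in the earlier window but not in the later one.
function droppedOff(earlier: UsageEvent[], later: UsageEvent[]): string[] {
  const laterIds = later.map((e) => e.userId);
  const seen: Record<string, true> = {};
  const dropped: string[] = [];
  for (const e of earlier) {
    if (seen[e.userId]) continue;
    seen[e.userId] = true;
    if (laterIds.indexOf(e.userId) === -1) dropped.push(e.userId);
  }
  return dropped.sort();
}
```

Segmenting either result by team, role, or tenure band is a matter of filtering the events before calling these functions.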
Gross savings is theater. Net savings is the only number worth defending.
Layer 2 measures whether AI makes individual tasks faster or higher quality. Time-and-motion studies, A/B experiments, before-and-after comparisons. This is also the layer most exposed to the biases above.
The discipline that separates real Layer 2 measurement from theater is the rework discount. For every time-savings claim, you need a paired measurement of time spent on error correction, review, and revision of AI-generated output. Net savings — gross saved minus rework — is the only number you can hand to finance with a straight face.
The METR experiment[10] is the sharpest illustration of why this matters. Developers believed they were faster. The clock said they were slower. The belief came from the first-draft experience (AI autocompleted the function in seconds). The clock captured the full cycle: reading the output, spotting the edge case the model missed, rewriting the error handling, running tests, debugging the subtle off-by-one that only appeared at scale. First-draft time and total task time are different numbers. Most Layer 2 measurement only captures the first one.
| Task | Without AI | With AI (gross) | Rework time | Net savings | Actual gain |
|---|---|---|---|---|---|
| Write first draft of feature spec | 90 min | 25 min | 20 min | 45 min | 50% |
| Generate unit test scaffolding | 45 min | 10 min | 15 min | 20 min | 44% |
| Draft customer email response | 15 min | 3 min | 8 min | 4 min | 27% |
| Code review preparation | 30 min | 12 min | 5 min | 13 min | 43% |
| Data analysis script | 60 min | 15 min | 22 min | 23 min | 38% |
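The arithmetic behind each row is the same: net savings is the no-AI time minus both the AI-assisted time and the rework time, and actual gain is net savings as a share of the no-AI time. A short sketch that reproduces the table's columns; the `TaskTiming` type and function names are illustrative.

```typescript
// All durations are in minutes. Names are illustrative, not a standard.
interface TaskTiming {
  task: string;
  withoutAi: number;   // time to do the task unaided
  withAiGross: number; // time to get a first result with AI
  rework: number;      // time spent correcting or verifying the AI output
}

// Net savings: what is left after paying the rework tax.
function netSavings(t: TaskTiming): number {
  return t.withoutAi - t.withAiGross - t.rework;
}

// Gross gain ignores rework; this is the number most decks report.
function grossGainPct(t: TaskTiming): number {
  return Math.round((100 * (t.withoutAi - t.withAiGross)) / t.withoutAi);
}

// Actual gain: net savings as a percentage of the unaided time.
function actualGainPct(t: TaskTiming): number {
  return Math.round((100 * netSavings(t)) / t.withoutAi);
}

const featureSpec: TaskTiming = {
  task: "Write first draft of feature spec",
  withoutAi: 90,
  withAiGross: 25,
  rework: 20,
};
console.log(netSavings(featureSpec), grossGainPct(featureSpec), actualGainPct(featureSpec));
```

For the feature-spec row, the gross figure is 72% while the rework-adjusted figure is 50%, which is the gap the rework discount exists to expose.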
Team-level throughput, quality, cycle time. The layer that requires patience and a control group.
Layer 3 is where individual task improvements either compound into delivery gains or evaporate into bottleneck shifts. This is the most important layer and the one that demands the most patience — meaningful delivery outcome data takes 8-12 weeks to stabilize, longer in complex orgs.
The LSE Business Review named the failure mode: current measurement focuses on minutes saved and cost reduced, almost nothing on the quality or novelty of what gets produced[4]. Quality and novelty are harder to observe than time savings. That difficulty is not a reason to skip them.
The Faros AI data[11] is the clearest available evidence of what Layer 3 looks like when you skip Layer 2 discipline. Ninety-three percent of developers at surveyed organizations used AI coding tools. PR volume nearly doubled. Review time increased 91%. PR size surged 154%. Bug rates ticked up 9%. DORA metrics — change lead time, deployment frequency, change failure rate, time to restore — stayed flat at the organizational level. More code was generated faster. None of it translated into faster, more reliable software reaching users.
Measure at the team level, not the individual level. Individual metrics produce toxic incentives — people optimize for looking productive with AI rather than being productive with AI. The question is whether the team ships better work faster, not who accepted the most suggestions.
Here is the counterintuitive pattern most organizations miss. Some of the best-performing AI-augmented teams show lower story point velocity than comparable non-AI teams at week 8 — because they are shipping fewer, higher-impact deliverables. Points measure throughput. Throughput is not value. A team shipping 20 high-impact features beats one shipping 45 minor tickets, and the dashboard will tell you the opposite if you let it.
AI-assisted task completion rate — share of tasks where AI was used and the output was accepted without major revision
Review cycle time — are code and content reviews getting faster or slower since adoption?
First-pass quality rate — share of AI-assisted deliverables accepted on first review
Rework ratio — hours correcting AI output divided by hours saved generating it
End-to-end cycle time — ticket creation to production deploy, not just coding time
Defect escape rate — bugs found in production per release, controlling for release volume
Feature throughput — features delivered per sprint, adjusted for scope and complexity
Customer-facing quality — NPS, support ticket volume, error rates in user-facing flows
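Two of the metrics above have definitions precise enough to compute directly: rework ratio (hours correcting divided by hours saved) and first-pass quality rate (share of AI-assisted deliverables accepted on first review). A minimal sketch, assuming a hypothetical `Deliverable` record that your tracker may or may not be able to supply.

```typescript
// Hypothetical per-deliverable record; field names are assumptions.
interface Deliverable {
  aiAssisted: boolean;
  hoursSaved: number;       // gross hours saved generating the output
  hoursCorrecting: number;  // hours spent correcting the AI output
  acceptedFirstReview: boolean;
}

// Hours correcting AI output divided by hours saved generating it.
// Returns null when no hours were saved, since the ratio is undefined.
function reworkRatio(items: Deliverable[]): number | null {
  const ai = items.filter((d) => d.aiAssisted);
  const saved = ai.reduce((sum, d) => sum + d.hoursSaved, 0);
  const correcting = ai.reduce((sum, d) => sum + d.hoursCorrecting, 0);
  return saved > 0 ? correcting / saved : null;
}

// Share of AI-assisted deliverables accepted on first review.
// Returns null when there are no AI-assisted deliverables.
function firstPassQualityRate(items: Deliverable[]): number | null {
  const ai = items.filter((d) => d.aiAssisted);
  if (ai.length === 0) return null;
  return ai.filter((d) => d.acceptedFirstReview).length / ai.length;
}
```

Both functions take a team's deliverables as a whole, in keeping with the team-level measurement the section argues for.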
Revenue, margin, avoided cost, strategic optionality. Where ROI either lives or does not.
Layer 4 connects delivery improvements to business outcomes. This is where ROI actually lives. It is also the layer that requires the closest collaboration between engineering, finance, and product — the three functions that usually disagree about what counts.
The formula is straightforward:
ROI = (Revenue delta + Margin improvement + Avoided cost - Total cost of ownership) / Total cost of ownership
The formula is not the problem. Honest inputs are the problem. Revenue delta from AI is nearly impossible to isolate — did the feature drive revenue because it shipped faster, or because it was the right feature regardless of build speed? Margin improvements require accounting for the full cost stack, not just license fees. Avoided cost is inherently speculative.
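To see how much the result depends on the cost figure, it helps to compute both the net benefit and that benefit as a ratio of total cost. A sketch with illustrative names and invented input figures; nothing here comes from a real deployment.

```typescript
// Illustrative inputs, all in the same currency and period.
interface RoiInputs {
  revenueDelta: number;
  marginImprovement: number;
  avoidedCost: number;
  totalCostOfOwnership: number;
}

// Benefits minus total cost of ownership.
function netBenefit(i: RoiInputs): number {
  return i.revenueDelta + i.marginImprovement + i.avoidedCost - i.totalCostOfOwnership;
}

// Net benefit divided by total cost of ownership (0.25 means a 25% return).
function roiRatio(i: RoiInputs): number {
  if (i.totalCostOfOwnership <= 0) {
    throw new Error("total cost of ownership must be positive");
  }
  return netBenefit(i) / i.totalCostOfOwnership;
}

// Same benefits, two cost figures: the full cost stack vs. license fees only.
const fullCost: RoiInputs = {
  revenueDelta: 50000, marginImprovement: 20000, avoidedCost: 30000,
  totalCostOfOwnership: 80000,
};
const licenseOnly: RoiInputs = { ...fullCost, totalCostOfOwnership: 20000 };
console.log(roiRatio(fullCost), roiRatio(licenseOnly));
```

With these made-up inputs the ratio moves from 0.25 to 4 purely by shrinking the cost figure, which is why the cost side deserves as much scrutiny as the benefit side.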
McKinsey's 2025 State of AI[13] puts numbers on the gap: only 39% of respondents attribute any EBIT impact to AI at all. Of those, most report that less than 5% of their organization's EBIT is attributable to AI use. At the function level, the picture is better — software engineering, manufacturing, and IT report 10-20% cost reductions, while marketing and product development see revenue uplifts above 10% in high-performing organizations. The key variable McKinsey identifies: high performers set explicit growth targets and redesign workflows, not just tool access.
Gartner introduced two adjacent frameworks worth using: Return on Employee (ROE) measures how AI changes employee capability and satisfaction. Return on Future (ROF) quantifies strategic optionality — the future opportunities AI capabilities create[8]. Neither is traditional ROI. That is the point. Traditional ROI was built for capital expenditures with predictable returns, not for capability investments where the upside is uncertain and potentially structural.
A four-step rollout the finance partner will not laugh at.
Three checks that take under an hour and expose the worst distortions.
You do not need to redesign your measurement infrastructure this week. You need to stop producing numbers you cannot defend. Three checks expose the worst distortions without waiting for a control group to mature.
Check 1: Pull your PR review time trend. If your team adopted AI coding tools in the last 6 months, compare median review cycle time before and after. If review time increased, you have PR bloat. Faros AI found a 91% increase in review time[11] across 10,000 developers — your team may not be an outlier. AI accelerated code generation and shifted the bottleneck downstream.
Check 2: Apply the rework ratio to your last productivity claim. Take the last time-savings number your team reported to leadership. Ask whoever reported it: what share of that saved time was spent reviewing, correcting, or rewriting AI output? If they do not have a number, the original claim is not ROI — it is activity data relabeled. The Workday research[5] puts the rework rate at 37-40% across knowledge-work tasks. Apply that discount as a conservative floor and see if the story holds.
Check 3: Count the variables. List every material change that happened in the same quarter your AI ROI number was measured. New team members, process changes, infrastructure upgrades, project-mix shifts, new tooling beyond AI. If there are more than two concurrent changes, you cannot attribute the outcome to AI with confidence. Reframe the number explicitly: 'We saw X improvement in a quarter that included Y and Z other changes. AI contributed — we cannot isolate the share.'
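The first two checks reduce to a few lines of arithmetic. The sketch below assumes a hypothetical `ReviewedPr` export with an open date and a review duration, and treats the 37-40% rework rate cited above as a fixed range; both are simplifications.

```typescript
// Hypothetical PR export; `openedOn` is an ISO date such as "2025-03-14".
interface ReviewedPr { openedOn: string; reviewHours: number }

function median(values: number[]): number {
  if (values.length === 0) throw new Error("cannot take the median of an empty list");
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Check 1: median review time before vs. after the adoption date.
// ISO dates in the same format compare correctly as plain strings.
function reviewTimeShift(prs: ReviewedPr[], adoptionDate: string) {
  const before = median(prs.filter((p) => p.openedOn < adoptionDate).map((p) => p.reviewHours));
  const after = median(prs.filter((p) => p.openedOn >= adoptionDate).map((p) => p.reviewHours));
  return { before, after, changePct: (100 * (after - before)) / before };
}

// Check 2: apply the cited 37-40% rework rate to a claimed savings figure.
const REWORK_RATE = { low: 0.37, high: 0.40 };

function applyReworkDiscount(claimedHoursSaved: number): { low: number; high: number } {
  return {
    low: claimedHoursSaved * (1 - REWORK_RATE.high),  // heavier discount
    high: claimedHoursSaved * (1 - REWORK_RATE.low),  // lighter discount
  };
}
```

A positive `changePct` from `reviewTimeShift` suggests the bottleneck moved into review, and `applyReworkDiscount` gives the range to quote in place of a claim that was never measured net of rework.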
| Situation | Best approach | Confidence | Time to results |
|---|---|---|---|
| Pre-rollout, planning phase | Parallel team experiment (treatment + control) | High | 12+ weeks |
| Control group politically impossible | Alternating sprint design (AI on / AI off) | Medium | 6-8 weeks |
| Already rolled out, no baseline | Regression discontinuity on rollout date | Low-medium | Retrospective |
| Leadership wants a number now | Confidence-ranged estimate with explicit caveats | Low | Immediate |
| Task-level only, no delivery data | Time-diary study with rework discount applied | Low-medium | 2-4 weeks |
If your last deck did any of these, the number was not real.
AI saves 40 minutes. Rework takes 25. The savings is 15. Report the net number or do not report a number.
People overestimate savings and underestimate cleanup. METR's RCT showed developers believed they were 20% faster when they were actually 19% slower. Surveys are sentiment data, not ROI inputs.
The team that adopted AI first is usually the most enthusiastic and capable. Their numbers are the ceiling, not the average.
You need a real control group or a real pre-AI baseline. Hypothetical counterfactuals are not evidence.
Faster coding that creates a review bottleneck has not improved delivery. Faros AI measured this directly: 98% more PRs, flat DORA metrics. Measure end-to-end or do not measure.
License fees, training, integration, review overhead, infrastructure — all of it goes in the denominator.
Adoption is a leading indicator. Revenue is a lagging outcome. They are not interchangeable, so stop reporting one as the other.
Three approaches that get closer to honest. None of them are easy. All of them beat the alternative.
Attribution — how much of an improvement was caused by AI versus everything else changing at the same time — is the hardest problem in AI ROI measurement. There is no perfect solution. Three approaches get you closer to a defensible number than the default of attributing every win to AI and hoping nobody asks.
Weak attribution evidence:
Before-and-after comparison with no controls
Self-reported developer surveys on time saved
Vendor-provided productivity dashboards
Anecdotal success stories from champion users
Stronger attribution evidence:
Parallel team experiments with matched control groups
Structured time-diary studies with sampled participants
Independent measurement using delivery system data
Statistical analysis controlling for project complexity and team changes
Approach 1: Parallel team experiments. Match teams on size, seniority, project type, sprint cadence. Assign one to treatment (with AI), one to control (without AI). Run for 12 weeks minimum. Compare Layer 3 delivery outcomes, not Layer 2 task metrics. This is the gold standard. It requires organizational will to temporarily withhold AI tools from some teams. Most orgs cannot stomach that. The ones that do get the cleanest numbers.
Approach 2: Alternating sprint design. When sustaining a control group is politically impossible, alternate sprints with and without AI tools. Two on, two off, repeated three times. Compare delivery metrics across the alternating periods. Controls for team composition. Does not control for project variation.
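The alternating design can be sketched as a schedule generator plus a per-arm average. The names (`alternatingSchedule`, `meanByArm`) and the choice of a simple mean are assumptions made for illustration; a real analysis would also look at variance and project mix.

```typescript
type Arm = "ai_on" | "ai_off";

// Builds the on/off pattern: `blockLength` sprints with AI, then the same
// number without, repeated `repeats` times. Defaults give two on, two off, x3.
function alternatingSchedule(blockLength = 2, repeats = 3): Arm[] {
  const schedule: Arm[] = [];
  for (let r = 0; r < repeats; r++) {
    for (let i = 0; i < blockLength; i++) schedule.push("ai_on");
    for (let i = 0; i < blockLength; i++) schedule.push("ai_off");
  }
  return schedule;
}

// Averages one delivery metric (for example cycle time in days) per arm.
function meanByArm(schedule: Arm[], metric: number[]): Record<Arm, number> {
  if (schedule.length !== metric.length) {
    throw new Error("one metric value per sprint is required");
  }
  const sums: Record<Arm, number> = { ai_on: 0, ai_off: 0 };
  const counts: Record<Arm, number> = { ai_on: 0, ai_off: 0 };
  schedule.forEach((arm, i) => {
    sums[arm] += metric[i];
    counts[arm] += 1;
  });
  if (counts.ai_on === 0 || counts.ai_off === 0) {
    throw new Error("both arms need at least one sprint");
  }
  return { ai_on: sums.ai_on / counts.ai_on, ai_off: sums.ai_off / counts.ai_off };
}
```

Because the same team appears in both arms, the comparison holds team composition constant; it still cannot tell AI effects apart from differences in the work that happened to land in each sprint.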
Approach 3: Regression discontinuity. If you rolled out AI on a specific date, compare delivery trends before and after, controlling for other known changes. Weaker than experiments. Works retrospectively when nobody planned ahead. Use team-level data, not org-level — Simpson's paradox is real and it lives in aggregated AI ROI dashboards.
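A simplified stand-in for the before-and-after trend comparison: fit a straight line to the pre-rollout data, project it forward, and measure how far the post-rollout values sit from that projection. This is a sketch under strong assumptions (a linear pre-trend, one team, no adjustment for other changes), not a full regression discontinuity model, and the names are hypothetical.

```typescript
// One observation per week for a single team, e.g. cycle time in days.
interface WeeklyPoint { week: number; value: number }

// Ordinary least-squares line through the supplied points.
function fitLine(points: WeeklyPoint[]): { slope: number; intercept: number } {
  const n = points.length;
  if (n < 2) throw new Error("need at least two pre-rollout points");
  const meanX = points.reduce((s, p) => s + p.week, 0) / n;
  const meanY = points.reduce((s, p) => s + p.value, 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const p of points) {
    const dx = p.week - meanX;
    sxy += dx * (p.value - meanY);
    sxx += dx * dx;
  }
  if (sxx === 0) throw new Error("pre-rollout points must span more than one week");
  const slope = sxy / sxx;
  return { slope, intercept: meanY - slope * meanX };
}

// Mean gap between post-rollout values and the projected pre-rollout trend.
// Negative means the metric came in below where the old trend was heading.
function meanDeviationFromPreTrend(series: WeeklyPoint[], rolloutWeek: number): number {
  const pre = series.filter((p) => p.week < rolloutWeek);
  const post = series.filter((p) => p.week >= rolloutWeek);
  if (post.length === 0) throw new Error("no post-rollout observations");
  const { slope, intercept } = fitLine(pre);
  const total = post.reduce((s, p) => s + (p.value - (intercept + slope * p.week)), 0);
  return total / post.length;
}
```

Run it per team rather than on org-wide aggregates, and list every other change in the same window next to the result, since the projection attributes all of the deviation to whatever happened at the rollout week.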
License cost is 10-25% of true cost. The rest is doing damage you are not measuring.
| Cost category | Typical range | Usually tracked? | Notes |
|---|---|---|---|
| Tool license fees | $200-600/yr | Yes | The only cost most orgs count |
| Onboarding and training | $500-1,200/yr | Rarely | Initial training plus ongoing learning time |
| Prompt engineering effort | $300-800/yr | No | Time crafting, testing, and refining prompts |
| Review overhead for AI output | $1,000-3,000/yr | No | Code review, content review, fact-checking — Faros data shows 91% review time increase |
| Integration and maintenance | $200-500/yr | Sometimes | IDE plugins, API integrations, config drift |
| Infrastructure (local models) | $0-2,000/yr | Varies | GPU compute for teams running local models |
| Opportunity cost of adoption | $500-1,500/yr | Never | Time evaluating, comparing, and switching tools |
| Total realistic cost | $2,700-9,600/yr | — | 3-16x the license fee alone |
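Summing the rows is a useful habit because it forces every category into the calculation. A small sketch using the annual dollar ranges exactly as given in the table; the `CostRange` type and `totalCostRange` function are illustrative names.

```typescript
// Annual cost ranges in dollars, copied from the table above.
interface CostRange { category: string; low: number; high: number }

const COST_CATEGORIES: CostRange[] = [
  { category: "Tool license fees", low: 200, high: 600 },
  { category: "Onboarding and training", low: 500, high: 1200 },
  { category: "Prompt engineering effort", low: 300, high: 800 },
  { category: "Review overhead for AI output", low: 1000, high: 3000 },
  { category: "Integration and maintenance", low: 200, high: 500 },
  { category: "Infrastructure (local models)", low: 0, high: 2000 },
  { category: "Opportunity cost of adoption", low: 500, high: 1500 },
];

// Adds up the low ends and the high ends separately.
function totalCostRange(costs: CostRange[]): { low: number; high: number } {
  return costs.reduce(
    (total, c) => ({ low: total.low + c.low, high: total.high + c.high }),
    { low: 0, high: 0 },
  );
}

console.log(totalCostRange(COST_CATEGORIES));
```

Replacing the table's ranges with your own measured figures, category by category, turns this into the cost input for the ROI calculation.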
Restraint over impression. Every metric earns its place by changing a decision.
An honest AI ROI dashboard is an exercise in restraint. The temptation is to fill it with up-and-to-the-right activity charts. Resist. Every metric on the dashboard answers a decision: expand the tool, reduce the tool, change how the tool is used, or investigate further. If a metric does not move a decision, cut it.
The dashboards are not the bottleneck. The incentives that produce the numbers are.
The biggest barrier to honest AI ROI measurement is not technical. It is political. Nobody wants to be the person telling the CEO that the AI investment the board approved is showing ambiguous returns. So numbers get massaged, uncomfortable findings get footnoted, and the executive summary stays optimistic.
This pattern does not break with better dashboards. It breaks with structural changes to who owns the measurement.
A note on what we got wrong the first time. The first measurement frameworks we built shared Layer 3 delivery data with the same team that owned the AI rollout. The data going into board decks improved every quarter. Not because results improved — because measurement ownership was misaligned. Moving measurement to the finance and people analytics function, with engineering as a consumer rather than an owner, produced numbers 40% lower on average and dramatically more credible to the board. Uncomfortable. Necessary. Drift is the default state of any measurement system without an owner whose incentives point the other way.
Separate the team that measures from the team that deploys. The people responsible for AI adoption should not be calculating its ROI. Stand up an independent measurement function — even if it is one analyst — reporting to finance or strategy, not engineering.
Pre-register hypotheses. Before deploying a tool, write down what you expect it to improve, by how much, over what time period. This blocks the post-hoc rationalization where any metric that went up becomes the goal you had all along.
Publish negative results internally. Build a culture where reporting that an AI tool did not produce expected ROI is rewarded, not punished. The orgs that learn fastest are the ones that admit what does not work.
Tie incentives to outcomes, not adoption. If the AI champion's bonus depends on adoption rates, they will drive adoption regardless of value. Tie incentives to Layer 3 and 4.
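Pre-registration is easier to enforce when the hypothesis is written down as data before the rollout and evaluated mechanically afterwards. A minimal sketch; the `Hypothesis` shape, the verdict labels, and the example values are all assumptions.

```typescript
// Hypothetical pre-registration record, written before the tool is deployed.
interface Hypothesis {
  metric: string;                      // e.g. "end-to-end cycle time (days)"
  direction: "increase" | "decrease";  // which way counts as improvement
  expectedChangePct: number;           // magnitude, as a positive percentage
  registeredOn: string;                // ISO date
  evaluateOn: string;                  // ISO date; no verdict before this
}

type Verdict = "too_early" | "met" | "not_met";

// Compares the observed change against what was registered in advance.
// ISO dates in the same format compare correctly as plain strings.
function evaluateHypothesis(
  h: Hypothesis,
  baseline: number,
  observed: number,
  observedOn: string,
): Verdict {
  if (observedOn < h.evaluateOn) return "too_early";
  const changePct = (100 * (observed - baseline)) / baseline;
  const met =
    h.direction === "decrease"
      ? changePct <= -h.expectedChangePct
      : changePct >= h.expectedChangePct;
  return met ? "met" : "not_met";
}

const cycleTime: Hypothesis = {
  metric: "end-to-end cycle time (days)",
  direction: "decrease",
  expectedChangePct: 15,
  registeredOn: "2026-01-05",
  evaluateOn: "2026-03-30",
};
```

Because the metric, direction, and threshold are fixed up front, a metric that happened to improve but was never registered cannot be promoted to the goal after the fact.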
How long should we measure before reporting AI ROI?
Twelve weeks minimum for Layer 3 delivery data to stabilize. Layer 1 and 2 are available immediately. Neither is ROI. Reporting earlier creates pressure to lock in optimistic narratives that quietly become the official story. The number that lands in the board deck this quarter sets the bar you will be measured against next quarter. Set it honestly or do not set it.
What if leadership demands ROI numbers before we have reliable data?
Report what you have with explicit confidence ranges. 'Layer 1 adoption is at 78%. Layer 2 gross time savings is 25-35% with a 15-20% rework discount still being measured. Layer 3-4 needs 8 more weeks.' Honest uncertainty is more defensible than confident fiction. Senior leaders who have seen a few cycles know the difference.
Should we measure individual developer productivity with AI tools?
No. Individual metrics produce gaming, resentment, and misleading signals. Measure at the team level. The effect of ten developers using AI well does not show up in any one person's numbers. What matters is whether the team ships better work faster, not whether Developer #7 accepted more suggestions than Developer #3. Individual AI productivity dashboards are a recruitment problem waiting to happen.
How do we handle the Hawthorne effect in AI measurement?
You cannot eliminate it. You can reduce it. Use long measurement windows — the effect fades over time. Pull metrics from delivery system data rather than human observation. Compare against control groups who also know they are being measured. The bias affects both groups similarly, which is exactly the point of a control.
What is a realistic payback period for AI developer tools?
For well-implemented coding assistants with honest cost accounting: 2-4 quarters to net positive ROI at the team level. If anyone claims payback in weeks, they are either excluding costs from the denominator or counting gross savings without the rework discount. Both are common. Both are wrong.
The METR study found AI made developers slower. Does that mean AI tools are useless?
No — but it means the productivity narrative needs more precision. The METR study used experienced developers on open-source tasks with early-2025 models. Tool maturity, task type, team experience level, and model capability all shift the outcome. The finding is not 'AI is useless.' It is 'controlled experiments produce different numbers than surveys, and the gap tells you something important about how your organization is measuring.'
How do we separate AI impact from other concurrent changes?
Pre-register your baseline, run a matched control group, and track everything else that changed in the same window. If you cannot do a control group, apply regression discontinuity on the rollout date — compare trend lines before and after, and explicitly list every other change in your confidence caveats. The goal is not perfect isolation; it is a defensible narrative that acknowledges what you cannot control.
Statistics from Gartner, Deloitte, Workday, METR, Anthropic, Faros AI, Bain, McKinsey, and LSE research published between 2025 and early 2026. AI ROI measurement is a fast-moving field and specific percentages will shift as model capability improves. The structural problems — attribution difficulty, rework tax, counterfactual bias, bottleneck shifting — are stable regardless of which year's data you reference.