Measuring AI ROI: The Honest Framework Finance Will Accept

Most AI ROI Numbers Are Fiction. Here Is How You Stop Producing Them.

AI ROI math is contaminated at the inputs. The 40% time savings is self-reported. The 3x PR throughput is a review-queue traffic jam. The board number is one cherry-picked team. Four measurement layers, the rework tax nobody applies, and the attribution problem.

Governance & Adoption · beginner · Mar 3, 2026 · 7 min read

By Viktor Bezdek · VP Engineering, Groupon

Your vendor reports 40% developer time savings. Your VP of Engineering shows a 3x increase in pull request volume. The CEO tells the board the AI investment is paying back ahead of schedule. Everyone is happy.

None of those numbers mean what anyone in the room thinks they mean.

The 40% figure comes from a self-reported survey where developers estimated how long the task would have taken without AI — about as reliable as asking someone how much they would have spent without a coupon. The 3x PR throughput is real, and it is a problem: AI generated more small, trivial changes that now clog the review queue. The board slide projected one team's output across the entire org.

This is the operating state of AI ROI measurement in most companies. A polite fiction nobody examines closely because everyone needs the number to be true.

The Industry Already Knows the Numbers Are Bad

Spend is compounding faster than measurement. The gap is structural, not technical.

$2.5T

Global AI spending Gartner forecasts for 2026. Actual spend depends on adoption velocity.

40%+

Agentic AI projects Gartner expects to be cancelled by 2027. Treat as directional, not gospel.

29%

Executives who say they can confidently measure AI ROI. Methodology varies across surveys.

~40%

Of AI time savings consumed by rework and error correction (Workday, 2025). Varies by task.

Gartner forecasts worldwide AI spending hits $2.5 trillion in 2026^[1] — roughly 44% above 2025. The same firm predicts over 40% of agentic AI projects get cancelled by end of 2027 due to escalating costs and unclear value^[2].

Read those two numbers next to each other. Spend is accelerating. Measurement is not. Approximately 29% of executives say they can confidently quantify AI ROI. Deloitte's 2026 State of AI report found 74% of organizations hope to grow revenue through AI — only around 20% report actually doing so^[3]. Hope is not a metric.

A Workday study found roughly 40% of time saved through AI is offset by time spent correcting, verifying, or rewriting low-quality outputs — the ratio shifts by task type and team maturity^[5]. Only 14% of employees in that study consistently reported net-positive outcomes. The productivity gain that lights up a vendor slide deck dissolves the moment you account for the cleanup it triggered downstream.

Why the Math Breaks Before It Even Starts

The problem is not the formula. The inputs to the formula are contaminated.

Organizations are not bad at arithmetic. They are calculating against poisoned inputs. Five structural failure modes show up in nearly every AI ROI deck.

[01]
The counterfactual problem
Measuring AI productivity requires knowing what would have happened without AI. You cannot run the same quarter twice. Self-reported estimates ('this would have taken me 4 hours') carry a known bias — people overestimate task difficulty after getting help, the same way you overestimate how long a drive would have taken once you used the GPS.
[02]
The attribution problem
When a team ships 30% faster after adopting an AI coding tool, was it the AI? The new team lead who started the same month? The deploy pipeline that landed the same sprint? The fact that this feature was a straightforward CRUD endpoint? Isolating the AI signal from every other concurrent variable is structurally hard in real environments.
[03]
The local optimization trap
AI accelerates individual tasks. Tasks are not the bottleneck in most knowledge work. Writing code faster does not help if the constraint is review, QA, or stakeholder approval. Speed up one stage without clearing the next, and you have built a more impressive traffic jam.
[04]
The quality discount nobody applies
Raw throughput ignores the rework tax. AI helps you draft in 20 minutes instead of 60. You spend 25 minutes fixing hallucinations and correcting tone. Net savings: 15 minutes — not 40. Most organizations track the 40 and quietly forget the 25.
[05]
The denominator problem
ROI requires a cost denominator. Most organizations dramatically undercount cost. License fees are the visible part. Training time, prompt engineering effort, review overhead for AI outputs, infrastructure for local models, opportunity cost of integration work — all belong in the denominator. Almost nobody puts them there.

Four Measurement Layers. Each Answers a Different Question.

Conflate what AI does with what the business gets, and the math falls apart.

Honest measurement separates what AI tools do from what the business gets. Different things, different time horizons, different metrics. Conflating them is where almost every ROI claim breaks.

The stack is four layers. Each answers a specific question. You need all four. Most organizations stop at Layer 1 and file the win report.

AI ROI Measurement Stack: Four Layers Between Tool and Outcome

Activity is the floor, not the ceiling. Most orgs stop at the gate and claim ROI. The numbers that pay back live one layer deeper.

Layer 1: Activity Is a Health Check, Not a Value Metric

Usage is necessary. Usage is not value. Email has 100% adoption.

Layer 1 tracks whether people use the tools you bought. Active users, session frequency, feature adoption, prompts per day. This is where every vendor dashboard lives, and it is the layer most organizations mistake for ROI.

Activity answers exactly one question: are people using the tool? That matters — a tool nobody uses has zero ROI by definition. But a tool everyone uses also has zero ROI if it does not change outcomes. Email has 100% adoption. Nobody claims email has positive ROI.

Treat activity as a health check. If adoption is low, investigate. If adoption is high, move to Layer 2. Either way, activity is not the answer.

Vanity

Total number of AI tool licenses purchased
Monthly active users with no segmentation
Total prompts sent across the organization
Percentage of developers with Copilot enabled

Decision-Driving

Weekly active users who use the tool 3+ days per week
Adoption rate by team, role, and tenure band
Feature-level usage — completions vs chat vs inline
30-day drop-off — who started and stopped
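
The decision-driving metrics above can be derived from raw usage events rather than read off a vendor dashboard. The sketch below is one way to do it, assuming a hypothetical `UsageEvent` export (user, team, ISO day); the field names and sample data are illustrative and not any vendor's API.

```typescript
// Hypothetical usage-log record; field names are assumptions for illustration.
interface UsageEvent {
  userId: string;
  team: string;
  date: string; // ISO day, e.g. "2026-03-02"
}

// De-duplicate strings without relying on Set, so older compile targets work.
function distinct(values: string[]): string[] {
  return values.filter((v, i) => values.indexOf(v) === i);
}

// Users in one week's events who were active on at least `minDays` distinct days.
function regularUsers(weekEvents: UsageEvent[], minDays: number = 3): string[] {
  const users = distinct(weekEvents.map((e) => e.userId));
  return users.filter((u) => {
    const days = distinct(
      weekEvents.filter((e) => e.userId === u).map((e) => e.date)
    );
    return days.length >= minDays;
  });
}

// 30-day drop-off: share of users seen in the first window
// who do not appear at all in the latest window.
function dropOffRate(firstWindow: UsageEvent[], latestWindow: UsageEvent[]): number {
  const starters = distinct(firstWindow.map((e) => e.userId));
  if (starters.length === 0) return 0;
  const stillActive = distinct(latestWindow.map((e) => e.userId));
  const dropped = starters.filter((u) => stillActive.indexOf(u) === -1);
  return dropped.length / starters.length;
}

// Illustrative data only.
const sampleWeek: UsageEvent[] = [
  { userId: "a", team: "payments", date: "2026-03-02" },
  { userId: "a", team: "payments", date: "2026-03-03" },
  { userId: "a", team: "payments", date: "2026-03-05" },
  { userId: "b", team: "payments", date: "2026-03-02" },
  { userId: "b", team: "payments", date: "2026-03-02" },
];
const sampleLater: UsageEvent[] = [
  { userId: "a", team: "payments", date: "2026-04-01" },
];

const regulars = regularUsers(sampleWeek);
const dropOff = dropOffRate(sampleWeek, sampleLater);
```

Counting distinct days rather than raw events is the design choice that matters here: user "b" fires two events on one day and still does not count as a regular user.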

Layer 2: The Rework Discount Is the Whole Game

Gross savings is theater. Net savings is the only number worth defending.

Layer 2 measures whether AI makes individual tasks faster or higher quality. Time-and-motion studies, A/B experiments, before-and-after comparisons. This is also the layer most exposed to the biases above.

The discipline that separates real Layer 2 measurement from theater is the rework discount. For every time-savings claim, you need a paired measurement of time spent on error correction, review, and revision of AI-generated output. Net savings — gross saved minus rework — is the only number you can hand to finance with a straight face.

Task	Without AI	With AI (gross)	Rework time	Net savings	Actual gain
Write first draft of feature spec	90 min	25 min	20 min	45 min	50%
Generate unit test scaffolding	45 min	10 min	15 min	20 min	44%
Draft customer email response	15 min	3 min	8 min	4 min	27%
Code review preparation	30 min	12 min	5 min	13 min	43%
Data analysis script	60 min	15 min	22 min	23 min	38%
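
The arithmetic behind the table can be written down once and reused for every task you measure. This is a minimal sketch; the `TaskTiming` type and function names are assumptions introduced for illustration, and the rows are the five from the table.

```typescript
// One row of the Layer 2 table; all durations are in minutes.
interface TaskTiming {
  task: string;
  withoutAi: number;
  withAiGross: number;
  rework: number;
}

// Net savings: unassisted time minus assisted time and the rework it triggered.
function netSavings(t: TaskTiming): number {
  return t.withoutAi - t.withAiGross - t.rework;
}

// Actual gain: net savings as a whole-number percentage of the unassisted time.
function actualGainPct(t: TaskTiming): number {
  return Math.round((netSavings(t) / t.withoutAi) * 100);
}

// Rework ratio: rework divided by gross savings (unassisted minus assisted time).
function reworkRatio(t: TaskTiming): number {
  const gross = t.withoutAi - t.withAiGross;
  return gross === 0 ? 0 : t.rework / gross;
}

const layer2Rows: TaskTiming[] = [
  { task: "Write first draft of feature spec", withoutAi: 90, withAiGross: 25, rework: 20 },
  { task: "Generate unit test scaffolding", withoutAi: 45, withAiGross: 10, rework: 15 },
  { task: "Draft customer email response", withoutAi: 15, withAiGross: 3, rework: 8 },
  { task: "Code review preparation", withoutAi: 30, withAiGross: 12, rework: 5 },
  { task: "Data analysis script", withoutAi: 60, withAiGross: 15, rework: 22 },
];

const layer2Summary = layer2Rows.map((r) => ({
  task: r.task,
  netMinutes: netSavings(r),
  gainPct: actualGainPct(r),
}));
```

The customer email row shows why the pairing matters: 12 minutes of gross savings shrink to 4 once the 8 minutes of rework are subtracted.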

Layer 3: Where Task Gains Either Compound or Evaporate

Team-level throughput, quality, cycle time. The layer that requires patience and a control group.

Layer 3 is where individual task improvements either compound into delivery gains or evaporate into bottleneck shifts. This is the most important layer and the one that demands the most patience — meaningful delivery outcome data takes 8-12 weeks to stabilize, longer in complex orgs.

The LSE Business Review named the failure mode: current measurement focuses on minutes saved and cost reduced, almost nothing on the quality or novelty of what gets produced^[4]. Quality and novelty are harder to observe than time savings. That difficulty is not a reason to skip them.

Measure at the team level, not the individual level. Individual metrics produce toxic incentives — people optimize for looking productive with AI rather than being productive with AI. The question is whether the team ships better work faster, not who accepted the most suggestions.

Here is the counterintuitive pattern most organizations miss. Some of the best-performing AI-augmented teams show lower story point velocity than comparable non-AI teams at week 8 — because they are shipping fewer, higher-impact deliverables. Points measure throughput. Throughput is not value. A team shipping 20 high-impact features beats one shipping 45 minor tickets, and the dashboard will tell you the opposite if you let it.

Leading indicators (visible in 2-4 weeks)

AI-assisted task completion rate — share of tasks where AI was used and the output was accepted without major revision
Review cycle time — are code and content reviews getting faster or slower since adoption?
First-pass quality rate — share of AI-assisted deliverables accepted on first review
Rework ratio — hours correcting AI output divided by hours saved generating it

Lagging indicators (meaningful at 8-12 weeks)

End-to-end cycle time: ticket creation to production deploy, not just coding time
Defect escape rate: bugs found in production per release, controlling for release volume
Feature throughput: features delivered per sprint, adjusted for scope and complexity
Customer-facing quality: NPS, support ticket volume, error rates in user-facing flows

Layer 4: The Only Layer That Pays Back

Revenue, margin, avoided cost, strategic optionality. Where ROI either lives or does not.

Layer 4 connects delivery improvements to business outcomes. This is where ROI actually lives. It is also the layer that requires the closest collaboration between engineering, finance, and product — the three functions that usually disagree about what counts.

The formula is straightforward:

ROI = ((Revenue delta + Margin improvement + Avoided cost) - Total cost of ownership) / Total cost of ownership

The formula is not the problem. Honest inputs are the problem. Revenue delta from AI is nearly impossible to isolate — did the feature drive revenue because it shipped faster, or because it was the right feature regardless of build speed? Margin improvements require accounting for the full cost stack, not just license fees. Avoided cost is inherently speculative.
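
As a sketch, the calculation fits in a few lines once the inputs exist. The version below assumes ROI is reported as net benefit relative to total cost of ownership, the conventional percentage form; the `RoiInputs` type and the example figures are hypothetical.

```typescript
// All amounts in the same currency and for the same period (assumption).
interface RoiInputs {
  revenueDelta: number;
  marginImprovement: number;
  avoidedCost: number;
  totalCostOfOwnership: number;
}

// Net benefit: everything gained minus everything spent.
function netBenefit(i: RoiInputs): number {
  return i.revenueDelta + i.marginImprovement + i.avoidedCost - i.totalCostOfOwnership;
}

// ROI as a percentage of total cost of ownership; null when cost is zero,
// because a ratio against a zero denominator is undefined.
function roiPct(i: RoiInputs): number | null {
  if (i.totalCostOfOwnership === 0) return null;
  return (netBenefit(i) / i.totalCostOfOwnership) * 100;
}

// Hypothetical quarter: revenue delta left at 0 because it could not be isolated.
const quarterExample: RoiInputs = {
  revenueDelta: 0,
  marginImprovement: 30000,
  avoidedCost: 20000,
  totalCostOfOwnership: 40000,
};

const exampleNet = netBenefit(quarterExample);
const exampleRoi = roiPct(quarterExample);
```

Entering zero for an input you cannot isolate is a deliberate choice in this sketch: it understates the return instead of guessing at it.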

Gartner introduced two adjacent frameworks worth using: Return on Employee (ROE) measures how AI changes employee capability and satisfaction. Return on Future (ROF) quantifies strategic optionality — the future opportunities AI capabilities create^[8]. Neither is traditional ROI. That is the point. Traditional ROI was built for capital expenditures with predictable returns, not for capability investments where the upside is uncertain and potentially structural.

Net savings

Rework-adjusted cost savings per quarter, fully loaded

Time-to-market

Feature delivery speed delta versus pre-AI baseline

Capacity freed

Hours reallocated from routine to strategic work per team

Quality delta

Change in defect rate and customer satisfaction scores

Building the Stack Without Lying to Yourself

A four-step rollout the finance partner will not laugh at.

[01]

Establish the pre-AI baseline before deploying anything

yaml

# baseline-metrics.yml — measure the system before you change it
baseline:
  period: "4 weeks minimum before AI rollout"
  metrics:
    - cycle_time_p50: "median days from ticket to deploy"
    - cycle_time_p90: "90th percentile, where outliers live"
    - defect_escape_rate: "bugs per release reaching production"
    - first_pass_review_rate: "% of PRs approved without revision"
    - team_throughput: "story points or features per sprint"
  rules:
    - "Same team, same type of work, or the comparison is fiction"
    - "Exclude outlier sprints — launches, incidents, holidays"
    - "Record project complexity scores for later normalization"
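
The two cycle-time metrics in the baseline can be computed from per-ticket data as follows. This is a sketch under stated assumptions: the `TicketCycle` record is hypothetical, the sample numbers are invented, and the nearest-rank percentile is one of several common definitions, so pick one and keep it fixed across baseline and follow-up.

```typescript
// Hypothetical per-ticket record; the field names are assumptions.
interface TicketCycle {
  sprint: string;
  cycleDays: number; // ticket creation to production deploy
}

// Nearest-rank percentile: the smallest observed value with at least
// p percent of the data at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error("percentile needs at least one value");
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] as number;
}

// Baseline p50 and p90 after dropping sprints flagged as outliers
// (launches, incidents, holidays).
function baselineCycleTime(tickets: TicketCycle[], outlierSprints: string[]) {
  const kept = tickets
    .filter((t) => outlierSprints.indexOf(t.sprint) === -1)
    .map((t) => t.cycleDays);
  return {
    p50: percentile(kept, 50),
    p90: percentile(kept, 90),
    sampleSize: kept.length,
  };
}

// Small helper to build illustrative data.
function ticketsFor(sprint: string, days: number[]): TicketCycle[] {
  return days.map((d) => ({ sprint, cycleDays: d }));
}

// Invented numbers; S3 stands in for an incident sprint.
const baselineTickets: TicketCycle[] = ticketsFor("S1", [2, 3, 4]).concat(
  ticketsFor("S2", [5, 6, 7, 8, 9, 10, 30]),
  ticketsFor("S3", [40, 50])
);

const baselineStats = baselineCycleTime(baselineTickets, ["S3"]);
```

Recording `sampleSize` next to the percentiles makes it obvious when a baseline rests on too few tickets to compare against.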

[02]

Roll out to a subset of teams. Keep a real control group.

yaml

# rollout-plan.yml — matched teams, 12-week minimum, no shortcuts
rollout:
  treatment_group:
    teams: ["backend-payments", "frontend-dashboard"]
    headcount: 14
  control_group:
    teams: ["backend-orders", "frontend-onboarding"]
    headcount: 12
  duration: "12 weeks minimum"
  matching_criteria:
    - "Similar team size and seniority mix"
    - "Similar project type and complexity"
    - "Same sprint cadence and review process"

[03]

Collect all four layers from week one — not as an afterthought

typescript

// measurement-collection.ts — one shape, four layers, no missing fields
interface AIROIMeasurement {
  layer1_activity: {
    weeklyActiveUsers: number;
    sessionsPerUserPerWeek: number;
    featureUsageBreakdown: Record<string, number>;
    dropoffRate30Day: number;
  };
  layer2_efficiency: {
    grossTimeSavedMinutes: number;
    reworkTimeMinutes: number;
    netTimeSavedMinutes: number;
    reworkRatio: number; // rework / gross savings — the number that matters
  };
  layer3_delivery: {
    cycleTimeP50Days: number;
    defectEscapeRate: number;
    firstPassReviewRate: number;
    featureThroughputPerSprint: number;
  };
  layer4_business: {
    costPerFeatureDelivered: number;
    capacityFreedHoursPerWeek: number;
    revenuePerEngineerPerQuarter: number;
  };
}

[04]

Run quarterly honest-ROI reviews with cross-functional attendance

yaml

# quarterly-review-template.yml — the people in the room set the truth bar
review:
  attendees:
    - engineering_lead
    - finance_partner
    - product_manager
    - hr_people_analytics  # for satisfaction and capacity data
  agenda:
    - "Layer 1-2 dashboard review (10 min)"
    - "Layer 3 treatment vs control comparison (20 min)"
    - "Layer 4 financial impact estimate (15 min)"
    - "Rework tax trend (10 min)"
    - "Decision: expand, maintain, or reduce investment (5 min)"
  anti_patterns:
    - "Never present Layer 1 metrics as ROI"
    - "Never use self-reported time savings without rework discount"
    - "Never compare against a hypothetical baseline"

Seven Patterns of Self-Deception in AI ROI Reporting

If your last deck did any of these, the number was not real.

AI ROI Self-Deception Patterns

[01]

Counting gross savings without the rework discount

AI saves 40 minutes. Rework takes 25. The savings is 15. Report the net number or do not report a number.

[02]

Treating self-reported surveys as primary evidence

People overestimate savings and underestimate cleanup. Surveys are sentiment data, not ROI inputs.

[03]

Projecting one team's results across the org

The team that adopted AI first is usually the most enthusiastic and capable. Their numbers are the ceiling, not the average.

[04]

Comparing against a fictional 'without AI' scenario

You need a real control group or a real pre-AI baseline. Hypothetical counterfactuals are not evidence.

[05]

Measuring task speed while ignoring system throughput

Faster coding that creates a review bottleneck has not improved delivery. Measure end-to-end or do not measure.

[06]

Excluding AI costs from the denominator

License fees, training, integration, review overhead, infrastructure — all of it goes in the denominator.

[07]

Presenting leading indicators as if they were lagging outcomes

Adoption is a leading indicator. Revenue is a lagging outcome. They are not interchangeable. Stop.

Attribution: How Much of This Is Actually the AI?

Three approaches that get closer to honest. None of them are easy. All of them beat the alternative.

Attribution — how much of an improvement was caused by AI versus everything else changing at the same time — is the hardest problem in AI ROI measurement. There is no perfect solution. Three approaches get you closer to a defensible number than the default of attributing every win to AI and hoping nobody asks.

Weak Attribution

Before-and-after comparison with no controls
Self-reported developer surveys on time saved
Vendor-provided productivity dashboards
Anecdotal success stories from champion users

Defensible Attribution

Parallel team experiments with matched control groups
Structured time-diary studies with sampled participants
Independent measurement using delivery system data
Statistical analysis controlling for project complexity and team changes

Approach 1: Parallel team experiments. Match teams on size, seniority, project type, sprint cadence. Assign one to treatment (with AI), one to control (without AI). Run for 12 weeks minimum. Compare Layer 3 delivery outcomes, not Layer 2 task metrics. This is the gold standard. It requires organizational will to temporarily withhold AI tools from some teams. Most orgs cannot stomach that. The ones that do get the cleanest numbers.
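
One common way to read a parallel experiment is a difference-in-differences estimate: subtract the control group's change from the treatment group's change, so anything that moved both groups cancels out. The article does not name this technique, so treat the sketch below as a swapped-in illustration; the `GroupOutcome` type and the numbers are hypothetical.

```typescript
// A Layer 3 metric for one group, measured before and after the experiment.
interface GroupOutcome {
  baseline: number; // e.g. median cycle time in days during the pre-AI baseline
  followUp: number; // the same metric at the end of the experiment
}

// Naive before-and-after change for a single group.
function change(g: GroupOutcome): number {
  return g.followUp - g.baseline;
}

// Difference-in-differences: treatment change minus control change.
function diffInDiff(treatment: GroupOutcome, control: GroupOutcome): number {
  return change(treatment) - change(control);
}

// Hypothetical median cycle times in days.
const treatmentTeams: GroupOutcome = { baseline: 6.0, followUp: 4.5 };
const controlTeams: GroupOutcome = { baseline: 6.5, followUp: 6.0 };

const naiveEffect = change(treatmentTeams);
const didEffect = diffInDiff(treatmentTeams, controlTeams);
```

In this invented example the naive comparison credits AI with a 1.5-day improvement, while the control teams improved by half a day on their own, leaving one day attributable to the treatment.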

Approach 2: Alternating sprint design. When sustaining a control group is politically impossible, alternate sprints with and without AI tools. Two on, two off, repeated three times. Compare delivery metrics across the alternating periods. Controls for team composition. Does not control for project variation.
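
A minimal sketch of the alternating comparison, assuming sprints are numbered from 1 and follow a strict two-on, two-off schedule. The `SprintOutcome` type and the cycle times are hypothetical.

```typescript
interface SprintOutcome {
  sprint: number;
  aiEnabled: boolean;
  cycleTimeDays: number;
}

// Two sprints on, two sprints off, repeating: sprints 1-2 on, 3-4 off, 5-6 on, and so on.
function twoOnTwoOff(sprint: number): boolean {
  return (sprint - 1) % 4 < 2;
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("no sprints in this condition");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Average the delivery metric separately for AI-on and AI-off sprints.
function alternatingComparison(sprints: SprintOutcome[]) {
  const withAi = mean(sprints.filter((s) => s.aiEnabled).map((s) => s.cycleTimeDays));
  const withoutAi = mean(sprints.filter((s) => !s.aiEnabled).map((s) => s.cycleTimeDays));
  return { withAi, withoutAi, delta: withAi - withoutAi };
}

// Invented cycle times for twelve sprints (three full on/off rounds).
const observedCycleTimes = [4, 5, 6, 6, 4, 5, 5, 7, 4, 5, 6, 6];
const sprintOutcomes: SprintOutcome[] = observedCycleTimes.map((days, i) => ({
  sprint: i + 1,
  aiEnabled: twoOnTwoOff(i + 1),
  cycleTimeDays: days,
}));

const alternatingResult = alternatingComparison(sprintOutcomes);
```

A difference in means says nothing about whether the work in the on-sprints was simply easier, which is the project-variation gap this design leaves open.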

Approach 3: Regression discontinuity in time. If you rolled out AI on a specific date, compare delivery trends before and after that date, controlling for other known changes. This design is also known as an interrupted time series. Weaker than experiments. Works retrospectively when nobody planned ahead. Use team-level data, not org-level: Simpson's paradox is real and it lives in aggregated AI ROI dashboards.
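
The Simpson's paradox warning is easier to trust once you have seen it happen. The sketch below uses invented review counts for two hypothetical teams: each team's first-pass review rate improves after rollout, yet the org-level rate falls because the mix of work shifted toward the harder team.

```typescript
interface ReviewCounts {
  team: string;
  period: "before" | "after";
  accepted: number; // PRs approved on first review
  total: number; // PRs reviewed
}

// Invented numbers, chosen to show the paradox.
const reviewData: ReviewCounts[] = [
  { team: "A", period: "before", accepted: 80, total: 100 },
  { team: "A", period: "after", accepted: 45, total: 50 },
  { team: "B", period: "before", accepted: 20, total: 50 },
  { team: "B", period: "after", accepted: 50, total: 100 },
];

// Pooled first-pass rate over whatever rows are passed in.
function firstPassRate(rows: ReviewCounts[]): number {
  const accepted = rows.reduce((sum, r) => sum + r.accepted, 0);
  const total = rows.reduce((sum, r) => sum + r.total, 0);
  return total === 0 ? 0 : accepted / total;
}

// Rows for one period, optionally restricted to one team (null means the whole org).
function pick(team: string | null, period: "before" | "after"): ReviewCounts[] {
  return reviewData.filter(
    (r) => r.period === period && (team === null || r.team === team)
  );
}

const teamABefore = firstPassRate(pick("A", "before")); // 80 of 100
const teamAAfter = firstPassRate(pick("A", "after")); // 45 of 50
const teamBBefore = firstPassRate(pick("B", "before")); // 20 of 50
const teamBAfter = firstPassRate(pick("B", "after")); // 50 of 100
const orgBefore = firstPassRate(pick(null, "before")); // 100 of 150
const orgAfter = firstPassRate(pick(null, "after")); // 95 of 150
```

Both teams gain ten points, and the aggregated dashboard still shows a decline. The same mechanism can run in the flattering direction, which is the more common failure in ROI decks.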

The Denominator Is Bigger Than the License Fee

By the ranges below, the license fee is well under a quarter of true cost. The rest is doing damage you are not measuring.

Cost category	Typical range	Usually tracked?	Notes
Tool license fees	$200-600/yr	Yes	The only cost most orgs count
Onboarding and training	$500-1,200/yr	Rarely	Initial training plus ongoing learning time
Prompt engineering effort	$300-800/yr	No	Time crafting, testing, and refining prompts
Review overhead for AI output	$1,000-3,000/yr	No	Code review, content review, fact-checking
Integration and maintenance	$200-500/yr	Sometimes	IDE plugins, API integrations, config drift
Infrastructure (local models)	$0-2,000/yr	Varies	GPU compute for teams running local models
Opportunity cost of adoption	$500-1,500/yr	Never	Time evaluating, comparing, and switching tools
Total realistic cost	$2,700-9,600/yr	n/a	About 4.5-48x the license fee alone, depending on where each line lands
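
Summing the table is trivial, which is part of the point: the denominator goes wrong because lines are left out, not because the math is hard. The sketch below encodes the seven rows and totals them; the `CostLine` type is an assumption, and treating only the "Yes" row as tracked is a simplification of the table's "Usually tracked?" column.

```typescript
interface CostLine {
  category: string;
  low: number; // lower end of the typical range, per year
  high: number; // upper end of the typical range, per year
  tracked: string; // the "Usually tracked?" label from the table
}

const costLines: CostLine[] = [
  { category: "Tool license fees", low: 200, high: 600, tracked: "Yes" },
  { category: "Onboarding and training", low: 500, high: 1200, tracked: "Rarely" },
  { category: "Prompt engineering effort", low: 300, high: 800, tracked: "No" },
  { category: "Review overhead for AI output", low: 1000, high: 3000, tracked: "No" },
  { category: "Integration and maintenance", low: 200, high: 500, tracked: "Sometimes" },
  { category: "Infrastructure (local models)", low: 0, high: 2000, tracked: "Varies" },
  { category: "Opportunity cost of adoption", low: 500, high: 1500, tracked: "Never" },
];

// Sum the low ends and the high ends of a set of cost lines.
function sumRange(lines: CostLine[]): { low: number; high: number } {
  return {
    low: lines.reduce((sum, l) => sum + l.low, 0),
    high: lines.reduce((sum, l) => sum + l.high, 0),
  };
}

const totalCost = sumRange(costLines);
// Everything except the one line that is reliably counted.
const untrackedCost = sumRange(costLines.filter((l) => l.tracked !== "Yes"));
```

Keeping each line as a range rather than a point estimate lets the uncertainty carry through to the ROI figure instead of disappearing at the first sum.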

An ROI Dashboard That Does Not Lie

Restraint over impression. Every metric earns its place by changing a decision.

An honest AI ROI dashboard is an exercise in restraint. The temptation is to fill it with up-and-to-the-right activity charts. Resist. Every metric on the dashboard answers a decision: expand the tool, reduce the tool, change how the tool is used, or investigate further. If a metric does not move a decision, cut it.

AI ROI Dashboard Design Principles

Lead with Layer 3 delivery outcomes — not Layer 1 activity
Show rework-adjusted savings alongside gross savings on every efficiency metric
Include a treatment-vs-control comparison on at least one metric
Display total cost of ownership, not license cost, in any ROI calculation
Show leading indicators with directional arrows, not as achievements
Attach a confidence interval or uncertainty range to every projected number
Separate team-level from org-level views — aggregation hides signal
Add a visible rework-tax trend line that updates quarterly

roi-dashboard-query.sql

-- Rework-adjusted ROI by team, quarterly. Net only. No vanity columns.
-- Metrics and costs are aggregated separately before joining, so several
-- rows per team and quarter in either table cannot multiply the sums.
WITH metrics AS (
  SELECT
    team_id,
    quarter,
    SUM(gross_time_saved_hours) AS gross_saved,
    SUM(rework_hours) AS rework
  FROM ai_metrics
  GROUP BY team_id, quarter
),
costs AS (
  SELECT
    team_id,
    quarter,
    SUM(total_cost) AS total_cost
  FROM ai_costs
  GROUP BY team_id, quarter
),
team_metrics AS (
  SELECT
    t.team_name,
    m.quarter,
    m.gross_saved,
    m.rework,
    m.gross_saved - m.rework AS net_saved,
    c.total_cost,
    -- Net savings valued at blended hourly rate
    (m.gross_saved - m.rework) * t.blended_hourly_rate AS net_value
  FROM teams t
  JOIN metrics m ON t.id = m.team_id
  JOIN costs c ON t.id = c.team_id AND m.quarter = c.quarter
)
SELECT
  team_name,
  quarter,
  gross_saved,
  rework,
  -- 100.0 forces decimal arithmetic where the hour columns are integers
  ROUND(100.0 * rework / NULLIF(gross_saved, 0), 1) AS rework_pct,
  net_saved,
  total_cost,
  ROUND(100.0 * (net_value - total_cost) / NULLIF(total_cost, 0), 1) AS roi_pct
FROM team_metrics
ORDER BY quarter DESC, roi_pct DESC;

Honest Measurement Is a Governance Problem, Not a Data Problem

The dashboards are not the bottleneck. The incentives that produce the numbers are.

The biggest barrier to honest AI ROI measurement is not technical. It is political. Nobody wants to be the person telling the CEO that the AI investment the board approved is showing ambiguous returns. So numbers get massaged, uncomfortable findings get footnoted, and the executive summary stays optimistic.

This pattern does not break with better dashboards. It breaks with structural changes to who owns the measurement.

A note on what we got wrong the first time. The first measurement frameworks we built shared Layer 3 delivery data with the same team that owned the AI rollout. The data going into board decks improved every quarter. Not because results improved — because measurement ownership was misaligned. Moving measurement to the finance and people analytics function, with engineering as a consumer rather than an owner, produced numbers 40% lower on average and dramatically more credible to the board. Uncomfortable. Necessary. Drift is the default state of any measurement system without an owner whose incentives point the other way.

Governance moves that make honest measurement possible

Separate the team that measures from the team that deploys. The people responsible for AI adoption should not be calculating its ROI. Stand up an independent measurement function — even if it is one analyst — reporting to finance or strategy, not engineering.
Pre-register hypotheses. Before deploying a tool, write down what you expect it to improve, by how much, over what time period. This blocks the post-hoc rationalization where any metric that went up becomes the goal you had all along.
Publish negative results internally. Build a culture where reporting that an AI tool did not produce expected ROI is rewarded, not punished. The orgs that learn fastest are the ones that admit what does not work.
Tie incentives to outcomes, not adoption. If the AI champion's bonus depends on adoption rates, they will drive adoption regardless of value. Tie incentives to Layer 3 and 4.
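
Pre-registration works best when the hypothesis is a record that can be checked mechanically later. This is one possible shape, not a standard: the `PreRegisteredHypothesis` type, its field names, and the thresholds are all assumptions for illustration.

```typescript
// Written down before rollout and not edited afterwards.
interface PreRegisteredHypothesis {
  metric: string;
  direction: "increase" | "decrease"; // which way counts as improvement
  minImprovementPct: number; // smallest improvement that counts as success
  evaluateAfterWeeks: number;
}

// Compare the observed value against the registered expectation.
function evaluateHypothesis(
  h: PreRegisteredHypothesis,
  baseline: number,
  observed: number
): "met" | "not met" {
  if (baseline === 0) throw new Error("baseline must be non-zero");
  const changePct = ((observed - baseline) / baseline) * 100;
  // For metrics where lower is better, a negative change is the improvement.
  const improvementPct = h.direction === "decrease" ? -changePct : changePct;
  return improvementPct >= h.minImprovementPct ? "met" : "not met";
}

// Hypothetical registration: median cycle time should fall by at least 15%.
const cycleTimeHypothesis: PreRegisteredHypothesis = {
  metric: "cycle_time_p50",
  direction: "decrease",
  minImprovementPct: 15,
  evaluateAfterWeeks: 12,
};

const smallGain = evaluateHypothesis(cycleTimeHypothesis, 6, 5.4); // about 10% faster
const largeGain = evaluateHypothesis(cycleTimeHypothesis, 6, 4.5); // 25% faster
```

A 10% improvement is real and still returns "not met", which is exactly the conversation pre-registration is meant to force.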

How long should we measure before reporting AI ROI?

Twelve weeks minimum for Layer 3 delivery data to stabilize. Layer 1 and 2 are available immediately. Neither is ROI. Reporting earlier creates pressure to lock in optimistic narratives that quietly become the official story. The number that lands in the board deck this quarter sets the bar you will be measured against next quarter. Set it honestly or do not set it.

What if leadership demands ROI numbers before we have reliable data?

Report what you have with explicit confidence ranges. 'Layer 1 adoption is at 78%. Layer 2 gross time savings is 25-35% with a 15-20% rework discount still being measured. Layer 3-4 needs 8 more weeks.' Honest uncertainty is more defensible than confident fiction. Senior leaders who have seen a few cycles know the difference.

Should we measure individual developer productivity with AI tools?

No. Individual metrics produce gaming, resentment, and misleading signals. Measure at the team level. A team of ten developers using AI effectively does not look like any individual metric — what matters is whether the team ships better work faster, not whether Developer #7 accepted more suggestions than Developer #3. Individual AI productivity dashboards are a recruitment problem waiting to happen.

How do we handle the Hawthorne effect in AI measurement?

You cannot eliminate it. You can reduce it. Use long measurement windows — the effect fades over time. Pull metrics from delivery system data rather than human observation. Compare against control groups who also know they are being measured. The bias affects both groups similarly, which is exactly the point of a control.

What is a realistic payback period for AI developer tools?

For well-implemented coding assistants with honest cost accounting: 2-4 quarters to net positive ROI at the team level. If anyone claims payback in weeks, they are either excluding costs from the denominator or counting gross savings without the rework discount. Both are common. Both are wrong.
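
Payback can be checked with cumulative sums: the payback quarter is the first one where cumulative net value (after the rework discount) covers cumulative total cost. The sketch below assumes a hypothetical `QuarterResult` record and invented figures.

```typescript
interface QuarterResult {
  quarter: string;
  netValue: number; // rework-adjusted savings valued in currency
  totalCost: number; // full cost of ownership for the quarter, not just licenses
}

// Returns the 1-based index of the first quarter where cumulative net value
// is at least cumulative cost, or null if that never happens in the data.
function paybackQuarter(results: QuarterResult[]): number | null {
  let cumulativeValue = 0;
  let cumulativeCost = 0;
  let payback: number | null = null;
  results.forEach((r, i) => {
    cumulativeValue += r.netValue;
    cumulativeCost += r.totalCost;
    if (payback === null && cumulativeValue >= cumulativeCost) {
      payback = i + 1;
    }
  });
  return payback;
}

// Invented figures: heavier cost up front, value ramping with adoption.
const quarterlyResults: QuarterResult[] = [
  { quarter: "Q1", netValue: 10000, totalCost: 30000 },
  { quarter: "Q2", netValue: 25000, totalCost: 20000 },
  { quarter: "Q3", netValue: 40000, totalCost: 20000 },
];

const paybackAt = paybackQuarter(quarterlyResults);
```

Note that Q2 already looks profitable on its own; only the cumulative view shows the investment is still underwater until Q3.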

A note on methodology

Statistics from Gartner, Deloitte, Workday, METR, Anthropic, and LSE research published between 2025 and early 2026. AI ROI measurement is a fast-moving field and specific percentages will shift. The structural problems — attribution difficulty, rework tax, counterfactual bias — are stable regardless of which year's data you reference.

Key terms in this piece

AI ROI measurement, AI productivity metrics, measuring AI ROI, AI rework tax, AI attribution problem, AI leading lagging indicators, honest AI metrics, AI cost accounting

Sources

[1] Gartner: Worldwide AI Spending Will Total $2.5 Trillion in 2026 (gartner.com)
[2] Gartner Predicts Over 40% of Agentic AI Projects Will Be Canceled by End of 2027 (gartner.com)
[3] Deloitte 2026 State of AI in the Enterprise (deloitte.com)
[4] LSE Business Review: AI Productivity Gains Should Be Measured in More Than Minutes Saved (blogs.lse.ac.uk)
[5] Tech.co: Time Saved by AI Offset by Fixing Errors (Workday Research) (tech.co)
[6] METR: Uplift Update, Developer Productivity Experiment Findings (metr.org)
[7] Anthropic Research: Estimating Productivity Gains from Claude (anthropic.com)
[8] Gartner: AI Value Metrics, Return on Employee and Return on Future Frameworks (gartner.com)
[9] Larridin: AI ROI Measurement Best Practices (larridin.com)
