Sprint velocity becomes meaningless when agents generate code faster than humans can review it. Story points go up; cycle time stays flat. The sprint board looks productive right up until the review queue contains thirty-seven unmerged pull requests and it's day eight of ten.
This is what happens when teams add coding agents without redesigning the operating model around them. The agents execute overnight — reliably, quickly, without complaint. By morning, GitHub shows dozens of open pull requests. The developers arrive, feel good about the output, and then realize nobody planned for who reviews all of this. The sprint was designed the old way: estimate what the team can build, assign the work, measure velocity in points completed. That model assumed code generation was the rate-limiting step. Agents removed that assumption in one sprint cycle.
The throughput wall is the point where agent code generation capacity outpaces human review capacity. It is not a tooling problem — adding better agents makes it worse. It is a sprint design problem: the operating model was built for a world where writing code was slow and expensive. Redesigning for a world where writing code is nearly free requires inverting the planning logic entirely. You plan sprints backwards now, from review throughput, not forward from agent capability.
When the Constraint Shifts Overnight
Code generation is no longer the slowest stage. Review is. Everything downstream from that fact is different.
Eliyahu Goldratt's Theory of Constraints[1] offers the clearest framework for what happens here. Every system with interdependent stages is bounded by its slowest stage — the constraint. Improve any other stage and you do not improve the system: you build up more work-in-progress in front of the actual bottleneck.
For most software delivery pipelines, code generation was never the constraint. Code review, QA, and deployment approval were slower. AI agents accelerated code generation — a non-constraint stage — without touching any downstream stage. The result is more code arriving at the review gate faster than reviewers can clear it. The queue grows. Review quality degrades as reviewers rush through larger piles of increasingly unfamiliar code. And the problem compounds: agents building on top of unreviewed code produce PRs with assumptions baked in that nobody has validated.[6]
This is why practitioners describe encountering the wall as a sudden shift rather than a gradual slowdown.[4] The agents are efficient enough that a small team can hit review capacity within the first sprint or two of adoption. The code keeps generating. The reviews don't keep up.
Why Adding More Agents Makes the Wall Taller
Optimizing a non-constraint stage builds inventory, not throughput.
[Flow diagram: the sprint start forks into Coding Agents A, B, and C, which execute overnight and converge at the human review gate. The PR queue (the wall) clears slowly through the human reviewer; approved PRs pass QA and deploy, while returned PRs go back for rework.]
The Agile Leadership Day India framework for AI-augmented Scrum teams[5] puts this plainly: "A 24/7 AI agent will quickly outpace human reviewers. If you do not plan human capacity for code review, your agents will stack up a massive backlog of unmerged pull requests, stalling your entire continuous integration pipeline."
The instinct when facing this is to add more agents — the reasoning being that agents are cheap, so scaling them is low-cost. This is the wrong response. More agents at the generation stage mean more PRs at the review stage. The constraint is not generation; it is review. Adding capacity to the wrong stage increases inventory without improving throughput.[7] The correct response is to constrain agent output to what reviewers can actually clear.
Designing the Sprint Backwards From Review Capacity
The planning sequence flips: review budget first, agent assignment second.
The inversion is this: instead of asking "what can agents generate this sprint?", ask "what can reviewers approve this sprint?" Build the sprint backwards from that number.
This feels counterintuitive because agents have surplus capacity: they could generate far more than the sprint assigns them, and that surplus looks like waste. It is not. Spare capacity at a non-constraint stage is exactly what the Theory of Constraints says a healthy system should have. Agents sitting idle while the review queue clears is the right state. Agents running while the review queue grows is the wrong state, regardless of how full the sprint board looks.
| Planning forward (from agent capacity) | Planning backward (from review capacity) |
|---|---|
| Assign tickets to team and agents; estimate all work in story points | Calculate review throughput first: reviewers × sustainable hours × sprint days |
| Agents execute overnight; morning reveals a large, unplanned PR queue | Set the sprint PR budget from review capacity — this caps agent ticket assignment |
| Measure velocity by story points or PRs opened | Measure velocity by PRs merged, not opened |
| Review happens when reviewers have bandwidth — backlog carries forward | Stage agent execution so PRs arrive in daily batches reviewers can absorb |
| Sprint ends with open agent PRs; next sprint starts already behind | Sprint ends with zero open agent PRs — no carryover, clean start |
The Review Throughput Formula
A calculation every engineering manager should run before the next sprint kickoff.
The formula has three inputs: the number of available reviewers, the sustainable daily review hours per reviewer, and the average hours required to review one agent PR.
Sprint review throughput = (reviewers × sustainable review hours per day × sprint days) ÷ hours per agent PR review
The "sustainable" qualifier matters. A senior engineer can maintain focused code review for roughly 2–2.5 hours per day before quality degrades meaningfully. Review beyond that threshold still happens, but defect detection drops — reviewers are reading without fully processing. This is not a criticism of the individuals; it is how sustained cognitive load works under context-switching pressure.
Using 2.5 hours as the sustainable budget: a team with three reviewers across a ten-day sprint has 75 review-hours available. If the average agent PR is 400 lines and requires approximately 75 minutes of careful review, that's roughly 60 PRs the team can responsibly approve in that sprint. Assign agents to work that will generate more than 60 PRs and the sprint is over-committed before it starts.
Two variables drive this formula above all others: PR count and PR size. Reducing average PR size from 600 lines to 300 lines roughly doubles review throughput with no additional headcount. This is the highest-leverage control variable in a hybrid team's sprint design.
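As a concrete worked example, here is the same calculation as a minimal Python sketch. The function name and inputs are illustrative; the numbers are the ones used above.

```python
def sprint_pr_budget(reviewers: int,
                     review_hours_per_day: float,
                     sprint_days: int,
                     hours_per_pr_review: float) -> int:
    """How many agent PRs the team can responsibly approve this sprint."""
    total_review_hours = reviewers * review_hours_per_day * sprint_days
    return int(total_review_hours // hours_per_pr_review)


# Worked example from above: 3 reviewers, 2.5 sustainable hours per day,
# a 10-day sprint, and ~75 minutes (1.25 h) per 400-line agent PR.
print(sprint_pr_budget(3, 2.5, 10, hours_per_pr_review=1.25))   # -> 60

# Halving average PR size roughly halves review time per PR,
# which roughly doubles the budget with no extra headcount.
print(sprint_pr_budget(3, 2.5, 10, hours_per_pr_review=0.625))  # -> 120
```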
| PR Size (lines) | Defect Detection Rate | Typical Review Time | Elite Benchmark |
|---|---|---|---|
| < 100 lines | 87% | 20–30 min | ✓ Below elite median |
| 101–300 lines | 78% | 45–75 min | ✓ At elite median |
| 301–600 lines | 65% | 90–150 min | ⚠ Above elite median |
| 601–1,000 lines | 42% | 2.5–3.5 hrs | ✗ Well above limit |
| > 1,000 lines | 28% | 3.5+ hrs (often rushed) | ✗ No review is reliable |
Ticket Sizing When 30–50% of Tickets Are Agent-Executed
Story points estimate human effort. They need a replacement for agent work.
Story points were designed to capture human cognitive and execution cost. An agent executes a 5-point ticket in minutes. Applying story points to agent work produces inflated velocity numbers and, worse, a planning model that has no concept of the review burden being created.
Hybrid teams need a different sizing unit for agent work: scope (what gets changed) rather than effort (how long it takes). The relevant question for an agent ticket is not "how many hours?" but "how many files, how many systems, and how many review-hours does this generate?"
1. Separate agent tickets from human tickets in the backlog. Tag every ticket as human-executed or agent-executed before backlog grooming. Human tickets retain story points. Agent tickets get scope-based sizing: small (1–2 files, under 200 lines), medium (3–8 files, 200–500 lines), large (more than 8 files or more than 500 lines). Large agent tickets must be broken into medium tickets before assignment — an agent that generates a 1,200-line PR from a single large ticket will produce a review that nobody can responsibly approve in one session.
2. Calculate the sprint PR budget before assigning agent tickets. At sprint planning, run the review throughput formula before discussing agent ticket assignments. This number is a hard cap — it does not flex upward because agents have surplus capacity. Assign agent tickets until the cumulative expected PR count hits the budget, then stop. Work beyond the budget is backlogged to the next sprint.
3. Run a mid-sprint review queue health check. At the sprint midpoint, check the ratio of open agent PRs to remaining reviewer capacity. If more than 30% of agent PRs have been waiting more than two days without a first-review comment, the sprint is over-committed or execution was staged poorly. Pause new agent ticket assignment, clear the queue, then resume. This ceremony prevents the compounding failure: agents continuing to generate while the backlog grows, ending the sprint with dozens of unmerged PRs that carry forward as technical debt and planning confusion. (A sketch of the sizing bands and this health check follows the list.)
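Both the step-1 sizing bands and the step-3 health check can be expressed in a few lines. The sketch below assumes ticket and PR data are available as plain values and dictionaries; the field names are illustrative, not any tracker's actual schema.

```python
from datetime import datetime, timedelta

# Step 1: scope-based sizing for agent tickets. Thresholds mirror the
# small/medium/large bands above; tune them to your codebase.
def size_agent_ticket(files_touched: int, lines_changed: int) -> str:
    if files_touched <= 2 and lines_changed < 200:
        return "small"
    if files_touched <= 8 and lines_changed <= 500:
        return "medium"
    return "large"  # split into medium tickets before assignment

# Step 3: mid-sprint review queue health check. open_agent_prs is assumed
# to be a list of dicts with opened_at and first_review_at (None if no
# review yet) -- illustrative fields, not any particular platform's API.
def queue_is_overcommitted(open_agent_prs: list[dict],
                           now: datetime,
                           max_stale_ratio: float = 0.30,
                           stale_after: timedelta = timedelta(days=2)) -> bool:
    if not open_agent_prs:
        return False
    stale = [pr for pr in open_agent_prs
             if pr["first_review_at"] is None
             and now - pr["opened_at"] > stale_after]
    return len(stale) / len(open_agent_prs) > max_stale_ratio
```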
The Three Sprint Ceremonies Hybrid Teams Are Missing
Add these to planning cadence before the throughput wall hits you in month two.
[Flow diagram: sprint planning starts from review capacity, which caps the sprint and sets the sprint PR budget. Agent execution forks into tasks A, B, and C, then joins into staged PR batches. At the human review gate, PRs that fit the budget are approved, merged, and shipped; oversize PRs are returned for re-scoping.]
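The staged-PR-batches step in the middle of that flow is the one teams most often skip: rather than letting every agent PR land overnight, execution is sequenced so each day's arrivals fit the daily review budget. A rough sketch of one way to do the staging, assuming tickets are represented as (ticket_id, expected_pr_count) pairs:

```python
# Spread agent tickets across sprint days so each day's PR arrivals
# stay within the daily review budget. The greedy day-by-day fill is
# deliberately simple; the ticket shape is an illustrative assumption.
def stage_agent_tickets(tickets: list[tuple[str, int]],
                        sprint_days: int,
                        daily_pr_capacity: int) -> list[list[str]]:
    schedule: list[list[str]] = [[] for _ in range(sprint_days)]
    day, prs_today = 0, 0
    for ticket_id, expected_prs in tickets:
        if expected_prs > daily_pr_capacity:
            continue  # oversize ticket: split it before scheduling
        if prs_today + expected_prs > daily_pr_capacity:
            day, prs_today = day + 1, 0  # today's review budget is full
        if day >= sprint_days:
            break  # remaining tickets go back to the backlog
        schedule[day].append(ticket_id)
        prs_today += expected_prs
    return schedule

# Example: reviewers can absorb roughly 6 agent PRs per day.
print(stage_agent_tickets([("T-101", 2), ("T-102", 1), ("T-103", 3),
                           ("T-104", 2), ("T-105", 1)],
                          sprint_days=10, daily_pr_capacity=6))
```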
Acceptance Criteria That Actually Clear the Wall
An agent PR isn't done when it's submitted. It's done when it's reviewed and merged.
Standard acceptance criteria were written for human-generated PRs, where the developer carries context about what they changed and why. Agent-generated PRs arrive without that human context — the reviewer is reading code from a process that has no standing relationship with the codebase and no institutional memory.
This means agent ticket acceptance criteria need two additions: context requirements (what the agent must include in the PR description to make the reviewer's job feasible) and scope constraints (guardrails that prevent agents from modifying code outside the ticket's stated boundaries, which is a common pattern that silently expands review surface area).
Required acceptance criteria for agent-executed tickets
- ✓ PR size within the sprint limit (team-defined, typically 300–500 lines). Agent must submit separate PRs if scope overflows the limit.
- ✓ PR description includes: what changed, why, and which areas carry the most risk. Agent-generated summaries count if complete.
- ✓ Test coverage at or above the codebase baseline. Agents must generate tests alongside implementation code — not as a separate follow-up ticket.
- ✓ No changes to files outside the ticket's stated scope. Agents commonly 'improve' adjacent code; this creates unplanned review surface.
- ✓ Review-ready signal only after a developer scans for obvious issues. One human pass before the PR enters the queue.
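Most of these criteria are mechanical, so they can be checked before a PR ever reaches a human. A hedged sketch of what that gate might look like; the PR fields and the ticket-scope set are assumptions about how a team might represent the data, not any platform's API.

```python
# Mechanical pre-review gate for agent PRs. The pr dict shape
# (lines_changed, files_changed, description, coverage_delta,
# human_prechecked) is an illustrative assumption.
MAX_PR_LINES = 400                       # team-defined sprint limit
REQUIRED_SECTIONS = ("what changed", "why", "risk")

def agent_pr_failures(pr: dict, ticket_scope: set[str]) -> list[str]:
    failures = []
    if pr["lines_changed"] > MAX_PR_LINES:
        failures.append(f"exceeds {MAX_PR_LINES}-line limit; split into separate PRs")
    missing = [s for s in REQUIRED_SECTIONS if s not in pr["description"].lower()]
    if missing:
        failures.append(f"description missing sections: {missing}")
    if pr["coverage_delta"] < 0:
        failures.append("test coverage falls below the codebase baseline")
    out_of_scope = set(pr["files_changed"]) - ticket_scope
    if out_of_scope:
        failures.append(f"files outside ticket scope: {sorted(out_of_scope)}")
    if not pr["human_prechecked"]:
        failures.append("no human pre-scan before entering the review queue")
    return failures  # empty list means the PR may enter the review queue
```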
Warning signals that the throughput wall is already here
- Review queue depth growing sprint-over-sprint without a matching increase in merged PRs
- PRs sitting more than three days without a first-review comment
- Reviewers approving PRs in under ten minutes — pattern recognition, not genuine review
- Sprint velocity (points completed) rising while cycle time stays flat or grows
- Agent-generated tickets entering QA at higher defect rates than human-written tickets
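Most of these signals can be read directly off PR metadata the team already exports. A rough sketch of two of them, with illustrative field names:

```python
from datetime import timedelta

# Two of the signals above, computed from exported PR records. Field
# names (merged_at, approved_at, review_started_at) are illustrative.
def queue_growth(sprint_prs: list[dict]) -> int:
    """Positive and rising sprint-over-sprint means the wall is growing."""
    opened = len(sprint_prs)
    merged = sum(1 for p in sprint_prs if p.get("merged_at"))
    return opened - merged

def rubber_stamp_rate(sprint_prs: list[dict]) -> float:
    """Share of approvals granted after less than ten minutes of review."""
    approved = [p for p in sprint_prs
                if p.get("approved_at") and p.get("review_started_at")]
    if not approved:
        return 0.0
    fast = [p for p in approved
            if p["approved_at"] - p["review_started_at"] < timedelta(minutes=10)]
    return len(fast) / len(approved)
```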
How many agent-generated PRs can a senior engineer responsibly review per week?
There is no universal number, but you can calculate it from your team's context. A sustainable upper bound is roughly 2–2.5 hours of focused review per day. At 45–90 minutes per small-to-medium agent PR (200–400 lines), that is 2–3 agent PRs per reviewer per day, or 10–15 per reviewer per week. For larger PRs (600+ lines), the number drops toward 5–10 per week. Beyond these ranges, defect detection rates fall measurably — reviewers start pattern-matching rather than reading carefully. The LinearB benchmark data puts elite team PR sizes below 219 lines for exactly this reason: small PRs review fast and catch more defects per hour of reviewer time.
What should we do when agents generate more PRs than reviewers can handle in a sprint?
Stop assigning new agent tickets. The instinct is to keep agents running because they are cheap and the tickets exist in the backlog. Resist it. A growing review queue creates compounding problems: agents may start building on unreviewed code, reviewers lose context between sessions, and the queue ages in a way that makes it progressively harder to clear. Pause agent execution, clear the existing queue to zero, then resume with a corrected sprint PR budget. One sprint of under-assignment is better than three sprints of compounding overhang.
Do story points still work for sprints with agent-executed tickets?
For human-executed work, yes. For agent-executed work, no — not as a capacity planning tool. Story points capture human effort; agents collapse execution time to near-zero, making point estimates meaningless for agent tickets. Most teams running hybrid sprints eventually split their sprint board: agent tickets tracked by scope and review budget, human tickets tracked by story points. The two systems run in parallel without conflict.
Should we hire more reviewers or improve AI review tooling first?
Improve tooling first. AI-assisted code review tools that handle first-pass flagging (security vulnerabilities, missing test coverage, obvious code smells) can meaningfully reduce the time a human reviewer needs per PR. If automated tooling reduces per-PR review time by 20–30%, that directly expands sprint PR budget without headcount. Once tooling is in place and the PR budget is still insufficient, then additional reviewer headcount is justified — and the tooling makes new reviewers more effective from day one.
How does sprint retrospective change with a hybrid human-agent team?
The core retro questions shift. Instead of 'did we estimate correctly?', ask 'was our review capacity accurately planned?'. Instead of 'what slowed individual developers?', ask 'where did the review queue back up and why?'. Velocity calibration — the traditional retro focus — becomes less relevant because agent execution time is consistent and fast. Throughput system analysis becomes the primary goal: was the sprint PR budget right? Did staging work? Were acceptance criteria enforced? These questions produce more actionable adjustments than estimation accuracy discussions.
Sources and data notes
Defect detection rates by PR size (87% to 28% across size ranges) are from SmartBear/Cisco research as reported and contextualized in Vitalii Petrenko's analysis of the LinearB 8.1M PR dataset. LinearB benchmark figures (elite team median PR size and pickup time) are from the same source. The sustainable review hours estimate (2–2.5 hours per day) is practitioner-derived and consistent with cognitive load research on sustained analytical tasks, but not from a single citable study — treat it as a calibration starting point, not a fixed ceiling. Specific teams report different sustainable windows based on reviewer experience, codebase familiarity, and PR quality.
Sources:
- [1] Goldratt Institute: Theory of Constraints (goldratt.com)
- [2] Vitalii Petrenko: The Hidden Cost of Slow Code Reviews — Data from 8 Million PRs (LinearB benchmark, SmartBear/Cisco defect data) (medium.com)
- [3] Abhilash M: Your AI Coding Agent Is a 100x Developer — But Your Code Review Process Isn't (medium.com)
- [4] Kukicola: The Review Bottleneck — When AI Codes Faster Than You Can Read (kukicola.io)
- [5] Agile Leadership Day India: AI-Augmented Scrum Framework — Running Scrum When Half Your Team is AI Agents (agileleadershipdayindia.org)
- [6] ZenBusiness: Breaking Bottlenecks — Applying Theory of Constraints to Software Development (tech.zenbusiness.com)
- [7] iSixSigma: From Concept to Code — Leveraging Theory of Constraints for Software Development (isixsigma.com)
- [8] GitHub: Pull Request Throughput and Time-to-Merge Available in Copilot Usage Metrics API (github.blog)
- [9] Qodo: Build a Code Review Process That Handles 10x More PRs (qodo.ai)
- [10] More Than Monkeys: The Pragmatic Engineer's Guide to the Theory of Constraints (morethanmonkeys.medium.com)