Velocity is a lie when agents generate code faster than humans can read it. Story points climb. Cycle time does not improve. The board looks productive right up until day eight, when the review queue holds thirty-seven open pull requests and nobody is sure which ones are safe to merge.
This is what teams hit in the first or second sprint after onboarding coding agents. The agents run overnight. By morning, GitHub shows dozens of PRs. The developers feel the dopamine, then realize nobody owns review. The sprint was planned the old way — estimate what the team can build, assign the work, count points. That model assumed code generation was the slow stage. Agents removed that assumption in one cycle.
The throughput wall is the point where agent generation capacity outruns human review capacity. It is not a tooling problem. Better agents make it worse. It is a sprint design problem: the operating model was built for a world where writing code was the bottleneck. Writing code is no longer the bottleneck. Review is. And review does not scale by adding agents — it scales by capping what agents are allowed to push at it.
The planning logic inverts. You no longer ask what agents can generate this sprint. You ask what reviewers can clear. Then you build the sprint backwards from that number.
The Constraint Moved. Your Sprint Design Did Not.
Code generation stopped being the slow stage. Everything downstream of that fact is different.
Goldratt's Theory of Constraints[1] names this exactly. Every system of interdependent stages is bounded by its slowest stage — the constraint. Improve any other stage and you do not improve the system. You build inventory in front of the actual bottleneck.
For most software pipelines, code generation was never the constraint. Review, QA, and deployment approval were slower. Agents accelerated a non-constraint stage and left every downstream stage untouched. The result: more code arriving at the review gate faster than reviewers can clear it. The queue grows. Quality erodes — reviewers rush larger piles of unfamiliar code. Then the failure compounds: agents start building on top of unreviewed code, baking in assumptions nobody validated.[6]
Practitioners describe hitting the wall as a step change, not a gradual slowdown.[4] One sprint, the queue is fine. The next, it has calcified. Agents are efficient enough that a small team blows through review capacity inside the first two sprints of adoption.
What broke in our early hybrid designs: we assumed senior engineers would drift toward review as their primary contribution once agents handled implementation. They did not. They kept owning implementation and treated agents as a force multiplier on their own output. Review became a shared burden with no owner — which is the structural definition of nobody's job. The fix was not a tool. It was role redesign: one senior engineer per sprint named as primary reviewer, protected from new implementation assignments for the sprint duration. That single ownership change broke the queue accumulation pattern.
More Agents Build Inventory, Not Throughput
Optimizing a non-constraint stage is the most reliable way to make the constraint worse.
The Agile Leadership Day India framework for AI-augmented Scrum[5] states the failure mode plainly: "A 24/7 AI agent will quickly outpace human reviewers. If you do not plan human capacity for code review, your agents will stack up a massive backlog of unmerged pull requests, stalling your entire continuous integration pipeline."
The instinct on first contact is to add more agents. Agents are cheap. Scaling them feels low-risk. It is the wrong move. More agents at generation means more PRs at review. Generation is not the constraint. Adding capacity to a non-constraint stage builds inventory. It does not raise throughput.[7] The correct move is the opposite: constrain agent output to what reviewers can actually clear, and treat any surplus generation capacity as deliberate slack — not waste.
Plan the Sprint Backwards From Review Capacity
Review budget first. Agent assignment second. The order is not negotiable.
The inversion is one sentence: stop asking what agents can generate this sprint. Ask what reviewers can approve. Build everything backwards from that number.
This feels wrong because agents have surplus capacity sitting on the table. They could generate three times the work the sprint assigns them. That surplus reads as waste. It is not waste. It is capacity beyond what the system can actually put through review. Agents idling while the review queue clears is the right state. Agents running while the review queue grows is the failure state, no matter how green the sprint board looks.
Sprint planned from generation capacity:
- Tickets assigned to team and agents, all estimated in story points
- Agents run overnight; morning reveals an unplanned PR queue
- Velocity measured in story points or PRs opened
- Review happens when reviewers find bandwidth; backlog rolls forward
- Sprint ends with open agent PRs; next sprint starts already underwater

Sprint planned from review capacity:
- Review throughput calculated first: reviewers × sustainable hours × sprint days
- Sprint PR budget set from review capacity, which caps agent ticket assignment
- Velocity measured in PRs merged, not opened
- Agent execution staged so PRs land in daily batches reviewers can absorb
- Sprint ends with zero open agent PRs: no carryover, clean start
The Review Throughput Formula
Three inputs. One number. Run it before sprint kickoff or run into the wall again.
The inputs are the number of available reviewers, the sustainable daily review hours per reviewer, and the average hours required to review one agent PR.
Sprint review throughput = (reviewers × sustainable review hours per day × sprint days) ÷ hours per agent PR review
The word sustainable carries the formula. A senior engineer can hold focused review for roughly 2–2.5 hours per day before quality degrades meaningfully. Past that threshold, review still happens. Defect detection drops. Reviewers are reading without processing. This is not a character flaw. It is how sustained cognitive load behaves under context-switching pressure.
Use 2.5 hours as the sustainable budget. Three reviewers across a ten-day sprint give you 75 review-hours. If the average agent PR is 400 lines and takes roughly 75 minutes of careful review, that is about 60 PRs the team can responsibly approve in the sprint. Assign agents to work that will produce more than 60 PRs and the sprint is over-committed before it starts.
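The arithmetic above can be written down as a small planning helper. This is a minimal sketch, not a prescribed tool: the function names `sprint_pr_budget` and `daily_pr_cap` and the choice to work in whole minutes are assumptions made for illustration. The inputs are the figures from the worked example.

```python
def sprint_pr_budget(reviewers: int, review_minutes_per_day: int,
                     sprint_days: int, minutes_per_pr: int) -> int:
    """Number of agent PRs the team can review in one sprint (rounded down)."""
    total_review_minutes = reviewers * review_minutes_per_day * sprint_days
    return total_review_minutes // minutes_per_pr


def daily_pr_cap(reviewers: int, review_minutes_per_day: int,
                 minutes_per_pr: int) -> int:
    """PRs that can land per day without outrunning that day's review time."""
    return (reviewers * review_minutes_per_day) // minutes_per_pr


# Figures from the worked example: 3 reviewers, 2.5 h/day, 10 days, 75 min per PR.
budget = sprint_pr_budget(3, 150, 10, 75)
per_day = daily_pr_cap(3, 150, 75)
print(budget, per_day)  # 60 6
```

Rounding down is deliberate: a fractional PR cannot be reviewed, and the cap is meant to be conservative. The daily cap is one possible way to stage agent output in batches reviewers can absorb.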
Two variables dominate the formula: PR count and PR size. Cutting average PR size from 600 lines to 300 lines roughly doubles review throughput with zero new headcount. PR size is the highest-leverage control variable in a hybrid team's sprint design. Nothing else comes close.
| PR Size (lines) | Defect Detection Rate | Typical Review Time | Elite Benchmark |
|---|---|---|---|
| < 100 lines | 87% | 20–30 min | ✓ Below elite median |
| 101–300 lines | 78% | 45–75 min | ✓ At elite median |
| 301–600 lines | 65% | 90–150 min | ⚠ Above elite median |
| 601–1,000 lines | 42% | 2.5–3.5 hrs | ✗ Well above limit |
| > 1,000 lines | 28% | 3.5+ hrs (often rushed) | ✗ No review is reliable |
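One way to use the table in planning is as a lookup from PR size to its band. The sketch below hard-codes the table's figures. The names `ReviewBand`, `REVIEW_BANDS`, and `review_band` are assumptions for illustration, as is placing a PR of exactly 100 lines in the smallest band, since the table leaves that value unassigned.

```python
from typing import NamedTuple, Optional


class ReviewBand(NamedTuple):
    max_lines: Optional[int]   # inclusive upper bound; None means no limit
    defect_detection: float    # fraction of defects caught, from the table
    review_time: str           # typical review time, from the table


# Rows copied from the table, smallest band first.
REVIEW_BANDS = [
    ReviewBand(100, 0.87, "20-30 min"),
    ReviewBand(300, 0.78, "45-75 min"),
    ReviewBand(600, 0.65, "90-150 min"),
    ReviewBand(1000, 0.42, "2.5-3.5 hrs"),
    ReviewBand(None, 0.28, "3.5+ hrs"),
]


def review_band(changed_lines: int) -> ReviewBand:
    """Return the table row that applies to a PR of the given size."""
    for band in REVIEW_BANDS:
        if band.max_lines is None or changed_lines <= band.max_lines:
            return band
    raise AssertionError("unreachable: the last band has no upper bound")


print(review_band(400).defect_detection)  # 0.65
print(review_band(1200).review_time)      # 3.5+ hrs
```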
Story Points Lie When Half the Tickets Are Agent-Executed
Story points estimate human effort. Agent work needs a different unit, not a smaller number.
Story points were built to capture human cognitive and execution cost. An agent finishes a 5-point ticket in minutes. Apply story points to agent work and you get inflated velocity numbers and — worse — a planning model with no concept of the review burden it is generating.
Hybrid teams need a different unit for agent work: scope (what gets changed), not effort (how long it takes). The relevant question for an agent ticket is not "how many hours?" — it is "how many files, how many systems, how many review-hours does this generate?"
The non-obvious corollary: teams that drop story points entirely for hybrid sprints often report better planning accuracy, not worse. Story points in a hybrid context create false precision. You are estimating effort for work that finishes in minutes while ignoring the review burden that actually determines sprint throughput. Replacing points with scope sizing for agent tickets removes a metric that was misleading the planning process, not informing it.
Three Ceremonies Hybrid Teams Are Missing
Add these to the planning cadence before the wall finds you in month two.
- [01] Split the backlog: agent tickets and human tickets are different objects
  Tag every ticket as human-executed or agent-executed before grooming. Human tickets keep story points. Agent tickets get scope sizing: small (1–2 files, under 200 lines), medium (3–8 files, 200–500 lines), large (more than 8 files or more than 500 lines). Large agent tickets are not sprint-ready. Decompose them into mediums or do not assign them. An agent that turns a single large ticket into a 1,200-line PR has produced something nobody can responsibly review in one session, and the reviewer who tries is doing pattern-matching, not review.
- [02] Compute the sprint PR budget before any agent tickets get assigned
  Run the review throughput formula at sprint planning before discussing agent ticket assignments. The number is a hard cap. It does not flex up because agents have surplus capacity. Assign agent tickets until cumulative expected PR count hits the budget. Then stop. Anything past the cap is backlogged to the next sprint. Honor the cap and the queue stays clean. Bend it once and the wall returns by mid-sprint.
- [03] Run a mid-sprint queue health check before the queue tells you it is too late
  At the sprint midpoint, check the ratio of open agent PRs to remaining reviewer capacity. If more than 30% of agent PRs have been waiting more than two days without a first-review comment, the sprint is over-committed or staging is broken. Pause new agent assignment. Clear the queue. Then resume. This ceremony exists to prevent the compounding failure mode: agents continuing to generate while the backlog grows, ending the sprint with dozens of unmerged PRs that carry forward as technical debt and planning noise.
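The three ceremonies can be sketched as three small checks. This is a hedged illustration under stated assumptions: the `AgentTicket` and `OpenPR` types, their field names, and the example ticket names are hypothetical. So is the choice to keep assigning smaller tickets after one does not fit, and to measure the 30% against open agent PRs. The thresholds themselves (sizing bands, the hard PR cap, the 30% and two-day limits) come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class AgentTicket:
    name: str
    files: int             # expected number of files touched
    lines: int             # expected number of changed lines
    expected_prs: int = 1  # PRs this ticket is expected to produce


@dataclass
class OpenPR:
    days_waiting: float
    has_first_review: bool


def scope_size(ticket: AgentTicket) -> str:
    """Ceremony 01: size agent tickets by scope, not effort."""
    if ticket.files > 8 or ticket.lines > 500:
        return "large"  # not sprint-ready: decompose before assigning
    if ticket.files <= 2 and ticket.lines < 200:
        return "small"
    return "medium"


def assign_within_budget(tickets: List[AgentTicket], pr_budget: int
                         ) -> Tuple[List[AgentTicket], List[AgentTicket]]:
    """Ceremony 02: never exceed the PR budget; whatever does not fit is backlogged."""
    assigned, backlog, used = [], [], 0
    for t in tickets:
        if scope_size(t) == "large" or used + t.expected_prs > pr_budget:
            backlog.append(t)
        else:
            assigned.append(t)
            used += t.expected_prs
    return assigned, backlog


def queue_over_committed(open_prs: List[OpenPR]) -> bool:
    """Ceremony 03: True if more than 30% of open agent PRs have waited
    more than two days without a first-review comment."""
    if not open_prs:
        return False
    stale = sum(1 for pr in open_prs
                if pr.days_waiting > 2 and not pr.has_first_review)
    return stale / len(open_prs) > 0.30


# Hypothetical tickets and queue, for illustration only.
tickets = [
    AgentTicket("rename-config-keys", files=2, lines=120),
    AgentTicket("migrate-billing-module", files=14, lines=1200),
    AgentTicket("add-retry-to-client", files=5, lines=350),
]
assigned, backlog = assign_within_budget(tickets, pr_budget=2)
print([t.name for t in assigned])  # ['rename-config-keys', 'add-retry-to-client']
print([t.name for t in backlog])   # ['migrate-billing-module']

queue = [OpenPR(3.0, False), OpenPR(2.5, False),
         OpenPR(0.5, False), OpenPR(4.0, True)]
print(queue_over_committed(queue))  # True
```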
Acceptance Criteria That Actually Clear the Wall
An agent PR is not done when it is submitted. It is done when it is reviewed and merged.
Standard acceptance criteria were written for human-generated PRs, where the developer carries context about what changed and why. Agent PRs arrive without that context. The reviewer is reading code from a process that has no standing relationship with the codebase and no institutional memory.
Agent ticket acceptance criteria need two additions the human version does not. Context requirements — what the agent must include in the PR description so the reviewer's job is feasible. Scope constraints — guardrails that stop agents from modifying code outside the ticket's stated boundaries, which is a common drift pattern that silently expands the review surface. Both belong in the ticket template, enforced before the PR opens. If they live in reviewer goodwill, they will be skipped under load.
Required acceptance criteria for agent-executed tickets
- ✓ PR size within the sprint limit (team-defined, typically 300–500 lines). Scope overflow means separate PRs, not one big one.
- ✓ PR description names what changed, why, and which areas carry the most risk. Agent-generated summaries count if they are complete.
- ✓ Test coverage at or above the codebase baseline. Tests ship with implementation, not as a follow-up ticket nobody picks up.
- ✓ No changes to files outside the ticket's stated scope. Agents 'improve' adjacent code by default; that is unplanned review surface.
- ✓ Review-ready signal only after a developer scans for obvious failures. One human pass before the PR enters the queue.
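The checklist can be enforced mechanically before a PR enters the review queue. The following is a minimal sketch, assuming a hypothetical `AgentPR` record whose fields a team would populate from its own tooling. The 400-line default is just one value inside the 300–500 range, and the three required description keywords are illustrative stand-ins for "what changed, why, and risk". The substring check is deliberately naive.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentPR:
    changed_lines: int
    description: str
    coverage: float            # test coverage of the PR branch, 0..1
    changed_files: List[str]
    human_scanned: bool        # a developer did a first pass


def acceptance_failures(pr: AgentPR, allowed_prefixes: List[str],
                        baseline_coverage: float,
                        max_lines: int = 400) -> List[str]:
    """Return the unmet criteria; an empty list means review-ready."""
    failures = []
    if pr.changed_lines > max_lines:
        failures.append("PR exceeds the size limit; split it")
    text = pr.description.lower()
    for section in ("what changed", "why", "risk"):
        if section not in text:
            failures.append(f"description is missing: {section}")
    if pr.coverage < baseline_coverage:
        failures.append("test coverage below the codebase baseline")
    out_of_scope = [f for f in pr.changed_files
                    if not any(f.startswith(p) for p in allowed_prefixes)]
    if out_of_scope:
        failures.append(f"out-of-scope files: {out_of_scope}")
    if not pr.human_scanned:
        failures.append("no human pass before queueing")
    return failures


# Hypothetical PR: too large, and it touches a file outside the ticket scope.
pr = AgentPR(
    changed_lines=520,
    description="What changed: retry logic. Why: flaky upstream. Risk: timeouts.",
    coverage=0.82,
    changed_files=["client/retry.py", "billing/invoice.py"],
    human_scanned=True,
)
for failure in acceptance_failures(pr, ["client/"], baseline_coverage=0.80):
    print(failure)
```

Returning every failure at once, rather than stopping at the first, lets the agent or developer fix the whole PR in one pass before it costs any reviewer time.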
Warning signals that the wall is already here
- Review queue depth growing sprint-over-sprint without a matching rise in merged PRs
- PRs sitting more than three days without a first-review comment
- Reviewers approving PRs in under ten minutes (pattern recognition, not review)
- Sprint velocity (points completed) climbing while cycle time stays flat or grows
- Agent-generated tickets entering QA at higher defect rates than human-written ones
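Several of these signals can be computed from data most teams already collect. The sketch below is illustrative only: the `SprintSnapshot` and `PRRecord` types and their fields are hypothetical, and it covers just three of the signals (queue growth without merge growth, waits of more than three days for a first review, approvals in under ten minutes).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SprintSnapshot:
    open_prs_at_end: int
    merged_prs: int


@dataclass
class PRRecord:
    days_open: float
    days_to_first_review: Optional[float]  # None: no first review yet
    review_minutes: Optional[float]        # None: not approved yet


def warning_signals(previous: SprintSnapshot, current: SprintSnapshot,
                    prs: List[PRRecord]) -> List[str]:
    """Return the warning signals that fire for the current sprint."""
    signals = []
    if (current.open_prs_at_end > previous.open_prs_at_end
            and current.merged_prs <= previous.merged_prs):
        signals.append("queue depth growing without more merges")

    def first_review_wait(pr: PRRecord) -> float:
        # A PR with no review yet has been waiting for as long as it has been open.
        if pr.days_to_first_review is None:
            return pr.days_open
        return pr.days_to_first_review

    if any(first_review_wait(pr) > 3 for pr in prs):
        signals.append("PRs waiting over three days for first review")
    if any(pr.review_minutes is not None and pr.review_minutes < 10 for pr in prs):
        signals.append("approvals in under ten minutes")
    return signals


# Hypothetical data for two consecutive sprints.
previous = SprintSnapshot(open_prs_at_end=8, merged_prs=40)
current = SprintSnapshot(open_prs_at_end=19, merged_prs=38)
prs = [
    PRRecord(days_open=5.0, days_to_first_review=None, review_minutes=None),
    PRRecord(days_open=1.0, days_to_first_review=0.5, review_minutes=6.0),
]
print(warning_signals(previous, current, prs))
```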
How many agent-generated PRs can a senior engineer responsibly review per week?
No universal number, but you can compute it from your team. The sustainable upper bound is roughly 2–2.5 hours of focused review per day. At 45–90 minutes per small-to-medium agent PR (200–400 lines), that is 2–3 agent PRs per reviewer per day, or 10–15 per reviewer per week. For PRs above 600 lines, the number drops toward 5–10 per week. Past those ranges, defect detection rates fall measurably: reviewers stop reading and start pattern-matching. The LinearB benchmark[2] puts elite team PR sizes below 219 lines for exactly this reason: small PRs review fast and catch more defects per hour of reviewer time.
What should we do when agents generate more PRs than reviewers can handle in a sprint?
Stop assigning new agent tickets. The instinct is to keep agents running because they are cheap and the backlog is full. Resist it. A growing review queue compounds: agents start building on unreviewed code, reviewers lose context between sessions, and the queue ages in a way that makes it progressively harder to clear. Pause agent execution. Drain the queue to zero. Resume with a corrected sprint PR budget. One sprint of under-assignment beats three sprints of compounding overhang every time.
Do story points still work for sprints with agent-executed tickets?
For human-executed work, yes. For agent-executed work, no — not as a capacity planning tool. Story points capture human effort. Agents collapse execution time to near-zero, which makes point estimates meaningless for agent tickets. Most teams running hybrid sprints end up splitting the board: agent tickets tracked by scope and review budget, human tickets tracked by story points. The two systems run in parallel without conflict because they measure different things.
Should we hire more reviewers or improve AI review tooling first?
Tooling first. AI-assisted review tools that handle first-pass flagging — security vulnerabilities, missing test coverage, obvious code smells — meaningfully reduce the time a human reviewer needs per PR. If automated tooling cuts per-PR review time by 20–30%, that directly expands the sprint PR budget without headcount. Once tooling is in place and the budget is still too small, then additional reviewer headcount is justified — and the tooling makes new reviewers more effective from day one. Hire into a working system, not into a broken one.
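To see what a per-PR time reduction does to the budget, the throughput formula can be re-run with a shorter review time. This is a small sketch, assuming the 75 review-hours and 75-minute average from the earlier worked example. The function name `budget_with_tooling` is hypothetical, and integer arithmetic is used to avoid floating-point rounding.

```python
def budget_with_tooling(total_review_minutes: int, minutes_per_pr: int,
                        reduction_pct: int) -> int:
    """Sprint PR budget after tooling cuts per-PR review time by reduction_pct percent."""
    # budget = total / (minutes_per_pr * (100 - pct) / 100), kept in integers
    return (total_review_minutes * 100) // (minutes_per_pr * (100 - reduction_pct))


total = 3 * 150 * 10  # 3 reviewers x 150 min/day x 10 days = 4,500 minutes
for pct in (0, 20, 30):
    print(pct, budget_with_tooling(total, 75, pct))
# 0 60
# 20 75
# 30 85
```

Under these assumptions, a 20–30% cut in per-PR review time lifts the sprint budget from 60 PRs to roughly 75–85 with no change in headcount.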
How does sprint retrospective change with a hybrid human-agent team?
The questions shift. Instead of 'did we estimate correctly?', ask 'was review capacity planned accurately?'. Instead of 'what slowed individual developers?', ask 'where did the review queue back up and why?'. Velocity calibration — the traditional retro focus — becomes less interesting because agent execution time is consistent and fast. Throughput system analysis becomes the primary work: was the sprint PR budget right, did staging hold, were acceptance criteria enforced? These produce more actionable adjustments than estimation accuracy ever did.
- [1] Goldratt Institute: Theory of Constraints (goldratt.com)
- [2] Vitalii Petrenko: The Hidden Cost of Slow Code Reviews: Data from 8 Million PRs (LinearB benchmark, SmartBear/Cisco defect data) (medium.com)
- [3] Abhilash M: Your AI Coding Agent Is a 100x Developer, But Your Code Review Process Isn't (medium.com)
- [4] Kukicola: The Review Bottleneck: When AI Codes Faster Than You Can Read (kukicola.io)
- [5] Agile Leadership Day India: AI-Augmented Scrum Framework: Running Scrum When Half Your Team is AI Agents (agileleadershipdayindia.org)
- [6] ZenBusiness: Breaking Bottlenecks: Applying Theory of Constraints to Software Development (tech.zenbusiness.com)
- [7] iSixSigma: From Concept to Code: Leveraging Theory of Constraints for Software Development (isixsigma.com)
- [8] GitHub: Pull Request Throughput and Time-to-Merge Available in Copilot Usage Metrics API (github.blog)
- [9] Qodo: Build a Code Review Process That Handles 10x More PRs (qodo.ai)
- [10] More Than Monkeys: The Pragmatic Engineer's Guide to the Theory of Constraints (morethanmonkeys.medium.com)