A practical guide to loop engineering in Claude Code: how to design tests, hooks, subagents, workflows, and context gates that turn agentic coding into evidence-driven iteration.
Loop engineering Claude Code is the discipline of deciding what feedback an agent must receive before you trust its next move. The prompt starts the session, but the loop determines whether the session improves, drifts, or stops early with a confident summary and no evidence.
Claude Code is built for this. The official best-practices guide frames it as an agentic coding environment that can read files, run commands, make changes, and work through problems autonomously, but it also warns that autonomy needs verification criteria, context management, and evidence rather than assertions.[1] Anthropic's agent patterns research makes the same point at the system level: coding agents work unusually well because code can be checked with tests, builds, tool results, and environmental ground truth at each step.[7]
That is the practical shift. A weak Claude Code workflow says, "Fix the bug." A stronger workflow says, "Write the failing test, run this command, edit only the files that satisfy the failure, rerun the focused test, then run the broader check before you report." The second prompt is not just longer. It defines the feedback topology.
This matters because Claude Code sessions are no longer one answer at a time. Anthropic's 2026 analysis of Claude Code sessions found that users make most planning decisions while Claude makes most execution decisions; a typical prompt can trigger many agent actions, with a long tail of sessions exceeding 100 actions per prompt.[11] The human still steers, but the machine is doing enough action that vague completion criteria become a production risk.
The answer is not to bury Claude under rules. The answer is to engineer loops that make the right behavior cheaper than the wrong behavior. Use tests when the desired behavior is executable. Use hooks when a rule must run every time. Use subagents when research or review would pollute the main context. Use headless mode when the loop belongs in CI. Use /goal or explicit stop gates when the task is long enough that "done" needs an independent check.
The failure mode is not that Claude Code ignores you. It is that it optimizes against incomplete signals.
A Claude Code loop has six parts: a goal, a context boundary, an action surface, a verification signal, a repair path, and a stop rule. If any part is missing, the human becomes that missing part.
The goal says what success means. The context boundary says which files, docs, logs, screenshots, or external sources matter. The action surface says which tools Claude may use. The verification signal says what must turn green. The repair path says what Claude should do when the signal fails. The stop rule says when it is allowed to finish.
The official best-practices guide is explicit about the verification signal: give Claude a check it can run, such as a test suite, build, linter, script, or browser screenshot, and ask it to show evidence rather than merely assert success.[1] That maps cleanly to Anthropic's evaluator-optimizer pattern, where one pass generates a result and another pass evaluates it against clear criteria.[7]
In Claude Code, the evaluator does not have to be another model. It can be bun test. It can be a Zod schema validator. It can be npm run build. It can be a browser snapshot after clicking the changed UI. It can also be a subagent or workflow that reads the transcript and tries to refute the result. The implementation detail matters less than the contract: every loop iteration must return information that can change the next action.
A practical loop contract should fit on one screen:
The failure budget is underused. If Claude tries three fixes and the same test still fails, the loop should change shape. That may mean re-reading the code, asking for one clarification, adding instrumentation, using a subagent, shrinking the task, or deleting the attempted abstraction. Without a failure budget, the loop becomes repetition with new syntax.
One large instruction tries to cover intent, style, tests, and edge cases.
The final answer is treated as the completion signal.
Failures are handled by adding more instructions after the fact.
Research, editing, and verification all share one crowded context window.
The human discovers missing tests or stale assumptions during review.
A small contract defines objective, scope, red signal, green signal, and stop condition.
Completion requires evidence from a command, validator, browser check, or reviewer.
Failures feed back into a bounded repair loop with an iteration budget.
Noisy research and adversarial review move to subagents or workflows.
The human reviews evidence and design tradeoffs instead of re-running basic checks.
TDD is not ceremony here. It is how you give the agent a measurable target.
Do not start with the implementation. Ask Claude to add one test that expresses the behavior through the public surface. Run it and confirm it fails for the expected reason.
Tell Claude which files are in scope and which convention sources to read first. This prevents accidental architecture changes during a local bug fix.
Name the exact focused command and the broader command that must pass. For a UI change, include a browser action, not just a render test.
After repeated failures, require root-cause analysis before another fix. The loop should become more diagnostic, not more frantic.
The final message should say what passed, what did not run, what was assumed, and what the reviewer should inspect first.
loop-contract.mdGoal: add merchant CSV import validation.
Scope:
- Read src/imports/csv.ts, tests/imports, and package scripts first.
- Edit only csv import code and its tests unless you state a scope expansion.
Red signal:
- Add one failing test for duplicate merchant_id rows returning a validation error with the duplicate IDs.
- Run: bun run test -- tests/imports/csv.test.ts -t duplicate
- Confirm it fails because the behavior is missing.
Green signal:
- Implement the smallest fix.
- Run the focused duplicate test.
- Run the full csv import test file.
- Run typecheck if this changes exported types.
Stop rule:
- Do not report completion unless the focused and file-level tests pass.
- Include the exact commands and outcomes.
Failure budget:
- If two fixes fail with the same assertion, stop editing and explain the data flow before trying again.Stop hooks convert a quality rule from an instruction into a lifecycle constraint.
.claude/settings.json and .claude/hooks/verify-before-stop.sh{
"hooks": {
"Stop": [
{
"hooks": [
{
"type": "command",
"command": "${CLAUDE_PROJECT_DIR}/.claude/hooks/verify-before-stop.sh",
"timeout": 120,
"statusMessage": "Verifying before stop"
}
]
}
]
}
}
#!/usr/bin/env bash
# .claude/hooks/verify-before-stop.sh
set -euo pipefail
# Avoid recursive blocking when Claude is already responding to this hook.
if jq -e '.stop_hook_active == true' >/dev/null; then
exit 0
fi
bun run test -- lib/articles.test.ts >&2 || {
echo 'Focused tests are failing. Fix them before stopping.' >&2
exit 2
}
exit 0The longer the session, the more aggressively the loop must manage what enters the main context.
Claude Code loads CLAUDE.md at session start, so it is the right place for commands, repo etiquette, and broad conventions. Keep it short enough to remain useful.[1]
A deep research routine, release checklist, or content pipeline should load only when needed. This avoids paying context for every session.
Claude Code subagents run in their own context with their own prompts, tools, permissions, and model settings. They can return a summary instead of flooding the main session with every search result.[3]
Anthropic's context engineering guidance recommends giving agents the smallest high-signal set of tokens that maximizes the desired outcome, while using compaction, notes, retrieval, and subagents to avoid context overload.[8]
Agent tools should return high-signal information, expose unambiguous parameters, and be tested with realistic tasks rather than superficial demos.[9]
.claude/agents/loop-verifier.md---
name: loop-verifier
description: Read-only reviewer that checks whether a Claude Code session satisfied its loop contract.
tools: Read, Grep, Glob, Bash
model: sonnet
permissionMode: read-only
maxTurns: 6
---
You are a verification reviewer. Do not edit files.
Inputs:
- The user's objective
- The intended edit scope
- The verification commands claimed by the main session
- The current git diff
Check:
1. Does every changed production line trace to the objective?
2. Did the session create or reuse a failing test before implementation?
3. Do the claimed commands cover the behavior, not just syntax?
4. Are there unverified assumptions or stale docs?
5. Is there any scope expansion that should be called out?
Return:
- verdict: pass | fail
- blocking_findings: filename:line plus reason
- missing_evidence: commands or browser checks still needed
- reviewer_next_step: the first thing a human should inspect| Mechanism | Use it when | Failure it prevents |
|---|---|---|
| One prompt | The task is small, the check is obvious, and the human is watching. | Unverified final summaries. |
/goal | The task may span turns and needs a standing completion condition. | Stopping before the observable target holds. |
| Stop hook | A command must run before Claude can finish, with no exceptions. | Forgetting to run tests, lint, schema validation, or browser checks. |
| PostToolUse hook | You want immediate feedback after edits or tool calls. | Accumulating many bad edits before the agent notices. |
| Verification subagent | The worker context is biased or crowded and a fresh read-only review is cheap. | Self-grading and hidden scope drift. |
| Workflow | The process needs multiple agents, repeatable orchestration, or scripted fan-out. | Rebuilding a complex process by memory in every session.[5] |
Headless claude -p | The task is bounded enough to run inside CI or a local script. | Manual review of a repeatable diff or artifact check.[4] |
When a loop becomes repeatable, remove it from the chat transcript and make it a command.
scripts/claude-loop-review.sh#!/usr/bin/env bash
set -euo pipefail
prompt='Review this diff for missing tests, unsafe scope expansion, and unsupported completion claims. Return pass only when the diff has behavior evidence.'
schema='{"type":"object","properties":{"verdict":{"type":"string","enum":["pass","fail"]},"findings":{"type":"array","items":{"type":"string"}}},"required":["verdict","findings"]}'
git diff --merge-base main HEAD |
claude --bare -p "$prompt" \
--allowedTools 'Read,Grep' \
--output-format json \
--json-schema "$schema" |
tee /tmp/claude-loop-review.json |
jq -e '.structured_output.verdict == "pass"'The easiest way to make loop engineering Claude Code concrete is to write the loop as if it were a small runbook. Here are three complete patterns that cover most real engineering work.
Use this when the symptom is already known and the codebase has a test runner. Claude Code's common workflow recipes treat test-running and verification as a normal part of coding work, not a special ceremony.[6] Start by asking Claude to read the failing behavior path, not to patch it. The first turn should produce a small hypothesis and one reproducing test. A good prompt is: Users can submit the signup form twice by double-clicking. Read the form component and existing tests. Add one failing test that proves the duplicate submit happens, run that test, then fix the smallest production code path that makes it pass. Do not change analytics or styling.
The red signal is the duplicate-submit test. The green signal is the focused test passing, followed by the containing test file. The repair rule is narrow: if the second fix still fails, Claude must stop editing and explain whether the duplicate request comes from button state, form state, mutation retries, or route navigation. That prevents the agent from trying unrelated debounce, state, and API changes in one diff.
The mistake to avoid is asking for the fix and then asking for tests afterward. That reverses the learning signal. Claude may write tests around the implementation it just created, and those tests often encode the current shape of the code rather than the behavior the product needs. When the test comes first, the implementation has to satisfy an external condition. When the test comes second, the test can become a justification artifact.
Use this when the change touches what a user sees. The loop needs two verifiers: a code-level verifier and a browser verifier. The prompt should name the page, viewport, primary action, and expected state after interaction. For example: On /automation, add a compact newsletter CTA after the third article card. Match the existing card density. Add or update the relevant component test first. After tests pass, open the route in a browser, click the CTA, and re-snapshot the result. Report the screenshot or snapshot evidence.
The browser requirement is not decorative. A component test can prove that JSX exists while the page still fails because of stale data, hidden overflow, mobile overlap, disabled buttons, or a broken client interaction. The loop closes only after the browser action creates the expected state. If the page has no meaningful changed control, click the nearest workflow control that proves the page remains navigable, such as a table-of-contents link, source backlink, share button, search button, or form submit.
A good UI loop also names the visual risks before the edit: text overflow, small touch targets, broken focus order, hover-only affordances, stale image paths, and mobile layout shifts. That list is not a style lecture. It is the verifier's search plan. If Claude knows the risks before it opens the browser, it can inspect the rendered page adversarially instead of confirming that the DOM contains the new node.
Use this when the change spans many files or data records. A migration loop needs inventory before edits. The prompt should force Claude to list the affected contract, search for all call sites, and define a static invariant before implementation. For example: Rename the article relationship type from applies-to to applies in the schema and all data. First inspect the schema, content files, tests, and rendering code. Add a failing schema or fixture test that rejects the old value. Then update only the places required by that invariant. Run the validator and the article tests.
The red signal is the schema or fixture that proves the old state is no longer accepted. The green signal is a validator over the migrated data plus the tests that exercise rendering. The stop rule should require a diff audit: no unrelated formatting churn, no generated files unless the pipeline requires them, and no relationship slugs invented by pattern matching.
A migration loop should also have a rollback question. If this change ships and a downstream consumer still sends the old value, what happens? Sometimes the right answer is a hard rejection. Sometimes it is a compatibility adapter with a removal date. Sometimes it is a two-step migration where readers accept both values before writers switch. The loop should force that decision before the edit, because the wrong compatibility posture can be invisible to local tests.
When a Claude Code loop goes wrong, the symptom usually falls into one of five buckets.
First, the red signal is too weak. The test checks truthiness, a status code, or an internal call count instead of the behavior the user cares about. Fix the assertion before fixing production code. A weak red signal trains the agent to satisfy the wrong target. Anthropic's agent eval guidance makes the same operational point at a larger scale: evaluation tasks should be realistic and paired with verifiable outcomes before teams rely on iteration signals.[10]
Second, the context boundary is too broad. Claude reads half the repository, spends the context budget, and returns with a plan that mixes the real task with every nearby smell. Fix the boundary by naming the files, examples, and docs that matter. Push unrelated findings into a not changed note.
Third, the verifier is too late. If the loop runs tests only at the end of a long edit, the failure report arrives after many variables changed. Add a PostToolUse hook, a focused command after each patch, or a smaller task slice. Earlier feedback makes root cause cheaper.
Fourth, the stop rule is social instead of mechanical. A line in CLAUDE.md that says to run tests is useful, but it is advisory. A Stop hook that blocks until tests pass is mechanical. Use social rules for judgment and deterministic rules for invariants.
Fifth, the repair path repeats itself. Two failed attempts with the same assertion should not lead to a third blind patch. The loop should switch to diagnosis: print the actual value, compare a working fixture, inspect the call path, or ask one scoped question. Repetition is not iteration unless the next action uses new evidence.
The final output of an engineered loop is not a victory lap. It is a review handoff. The reviewer needs to know what changed, what proved it, what did not run, and where the risk remains. A good handoff says: Changed the CSV import validator to reject duplicate merchant IDs. Authored a failing duplicate-row test first; it failed on the missing validation branch, then passed after the fix. Ran bun run test -- tests/imports/csv.test.ts and bun run test -- --project unit. Did not run browser verification because no UI path changed. Assumed duplicate IDs should reject the whole file rather than keep first row.
That paragraph gives the reviewer leverage. They can inspect the policy assumption first, then the validator, then the test. They do not have to infer whether the agent remembered to run the suite. They do not have to decode whether the final answer is based on fresh output or hope.
These examples show the same structure with different tools. Bug fixes emphasize a reproducing test. UI work adds browser interaction. Migrations add inventory and static invariants. Failure-mode loops add diagnosis. Review loops add evidence handoff. The loop changes because the failure mode changes. That is the point: do not cargo-cult a single Claude Code ritual. Engineer the smallest loop that catches the specific way this task can go wrong.
Is loop engineering just better prompting?
No. Better prompting helps, but loop engineering decides the feedback topology around the prompt: what evidence is gathered, what gates are deterministic, which context is isolated, and when the agent is allowed to stop.
Should every Claude Code task use hooks and subagents?
No. Start with the lightest loop that can enforce the risk. A small typo fix may need one prompt and one focused command. A migration, auth change, or UI workflow deserves tests, browser verification, and possibly a fresh reviewer.
What is the most common bad loop?
The most common bad loop is implementation without a red signal. Claude edits until the code looks plausible, then the human becomes the first real verifier. Write or identify the failing check first.
When should a loop stop trying fixes?
After repeated failures with the same symptom, stop patching and switch to diagnosis. Re-read the data flow, compare working examples, instrument a boundary, or split the task. More edits are not always more progress.