Ship an agent. It triages tickets, flags transactions, routes leads. Week one, it works. Week three, the team is dismissing 40% of its alerts. Week six, someone has built a private spreadsheet to track which alerts are worth opening.
That spreadsheet is the failure mode and the fix in the same place. The dismiss, modify, and escalate actions your reviewers already perform are labeled training data. Your team is generating the calibration signal the agent needs, every day, in production. Most teams log that signal as a debug artifact and forget about it.
This is the architecture that uses it. An audit log that captures every decision with the threshold that triggered it. A weekly tuner agent that reads four weeks of human responses and proposes evidence-backed threshold changes. A human approval gate that keeps the system accountable. Eight weeks in, the agent matches your team's judgment without anyone editing a rule.
Drift Is the Default. Manual Tuning Is the Tax.
The world moves faster than the prompt. Without a feedback mechanism, accuracy decays from the day you deploy.
Drift is structural, not exceptional. The fraud patterns that defined last quarter's training data are not this quarter's. The triage agent calibrated for 200 daily tickets behaves differently at 2,000. The world keeps moving. The prompt does not.
Manual tuning is the obvious response and a bad one. It depends on someone noticing — usually a frustrated stakeholder filing a complaint long after dismissals have normalized. It then routes through an engineer who has to diagnose, adjust, redeploy. And every fix addresses the symptom that escalated, not the structure underneath it. The cycle is whack-a-mole because the inputs are anecdotes.
A 2025 ISACA analysis tracked self-modifying AI systems with no structured feedback mechanism and found roughly 3x higher rates of unexpected behavioral change versus systems with formal oversight — the magnitude varies by system type but the direction does not.[5] The fix is not heroics. It is plumbing. Every human response to an agent decision becomes structured data. A second agent reads that data and proposes calibrations. A human approves them. The loop closes.
The Audit Log Is the Substrate. Get the Schema Right or Tune Nothing.
The tuner is only as good as what the log captured. Every record needs decision, evidence, and human response in one place.
Before you can tune anything, you need something to tune from. That means logging every agent decision with enough structure to reconstruct why it fired and what happened next.
The audit log is not a debug log. A debug log records what code ran. The audit log records what the agent decided, what evidence it used, and how a human responded. Three fields, one row, queryable for pattern analysis.
schemas/action-log.ts

```typescript
interface ActionLogEntry {
  id: string;
  timestamp: string; // ISO 8601
  agentId: string; // Which agent fired the decision
  sessionId: string; // Groups related decisions in one run

  // What the agent decided
  decision: {
    action: string; // e.g. "flag", "escalate", "auto-resolve"
    confidence: number; // 0-1 score from the model
    thresholdUsed: number; // The threshold that gated the action
    reasoning: string; // Short rationale captured at decision time
  };

  // Evidence the agent considered
  context: {
    inputHash: string; // Dedup key for the input
    features: Record<string, number>; // Scored features that drove the decision
    matchedRules: string[]; // Rules or patterns that matched
  };

  // Human response (filled async, never blocks the agent)
  humanResponse?: {
    action: "approve" | "dismiss" | "modify" | "escalate";
    respondedAt: string;
    responderId: string;
    modifiedAction?: string; // What they changed it to, if anything
    note?: string; // Free-form rationale, optional
    timeToRespond: number; // Seconds from decision to human response
  };
}
```

Store the entries somewhere queryable with time-range indexing. A PostgreSQL table with JSONB columns covers most teams. High-volume systems should partition by week, because the tuner only ever reads the last four.
The human response field starts empty and gets backfilled as reviewers process their queue. That asynchrony is the whole point. The agent does not block waiting for a human. The human responds when they get to it. The tuner reads the joined record on a weekly cadence. Every layer runs at the speed it can sustain.
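The asynchronous backfill can be sketched in a few lines. This is a minimal illustration, not the article's implementation: the in-memory `log` object stands in for the database, and `recordDecision`, `attachHumanResponse`, and the trimmed `LoggedDecision` shape are hypothetical names chosen for the example.

```typescript
// Hypothetical in-memory stand-in for the action log; a real system would use a database.
type HumanAction = "approve" | "dismiss" | "modify" | "escalate";

interface HumanResponse {
  action: HumanAction;
  respondedAt: string; // ISO 8601
  responderId: string;
  timeToRespond: number; // Seconds from decision to human response
}

interface LoggedDecision {
  id: string;
  timestamp: string; // ISO 8601, set when the agent decides
  action: string;
  confidence: number;
  thresholdUsed: number;
  humanResponse?: HumanResponse;
}

const log: Record<string, LoggedDecision> = {};

// Called by the agent at decision time. Returns immediately; no human is involved.
function recordDecision(entry: LoggedDecision): void {
  log[entry.id] = entry;
}

// Called later by the review workflow. Fills in the human response and
// derives timeToRespond from the two timestamps.
function attachHumanResponse(
  id: string,
  action: HumanAction,
  responderId: string,
  respondedAt: string
): LoggedDecision {
  const entry = log[id];
  if (!entry) throw new Error(`Unknown decision id: ${id}`);
  const seconds = (Date.parse(respondedAt) - Date.parse(entry.timestamp)) / 1000;
  entry.humanResponse = { action, respondedAt, responderId, timeToRespond: seconds };
  return entry;
}

recordDecision({
  id: "d-1",
  timestamp: "2025-01-06T09:00:00Z",
  action: "flag",
  confidence: 0.72,
  thresholdUsed: 0.7,
});
const updated = attachHumanResponse("d-1", "dismiss", "reviewer-7", "2025-01-06T11:30:00Z");
console.log(updated.humanResponse?.timeToRespond); // 9000 (two and a half hours)
```

The decision row exists from the moment the agent acts. The response arrives whenever the reviewer gets to it, and the tuner only needs to filter on rows where `humanResponse` is present.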
The Loop, End to End
How dismissals, modifications, and escalations move from human reviewers back into the threshold configuration the agent reads at runtime.
Four components, four cadences. Decoupling them is the architecture.
The Primary Agent runs in real time. It reads the current threshold config, makes decisions, writes every one to the action log. It never blocks on a human.
The Action Log accumulates decision records. Reviewer responses backfill async — most teams see the human field populated within 24-48 hours.
The Tuner Agent runs once a week, typically Sunday night or Monday morning. It reads four weeks of joined data, clusters dismissals and escalations, and emits a structured proposal with evidence counts.
The Approval Interface routes the proposals to a designated approver — a team lead or ops owner. They accept, reject, or modify each one. Nothing reaches the live config without explicit human sign-off.
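To make the decoupling concrete, here is a small sketch of the primary agent's gate. It assumes a `ThresholdConfig` object loaded at runtime and a hypothetical `decide` function; neither name comes from the article. The point it illustrates is that the threshold is data, so an approved change takes effect by swapping the config, and every record carries the threshold that was live when the decision fired.

```typescript
// Hypothetical shape for the live threshold config the agent reads at runtime.
interface ThresholdConfig {
  version: number;
  thresholds: Record<string, number>;
}

interface DecisionRecord {
  action: "flag" | "auto-resolve";
  confidence: number;
  thresholdName: string;
  thresholdUsed: number; // Captured so the tuner can see what gated the action
  configVersion: number;
}

// Gate one scored input against the named threshold from the live config.
function decide(
  confidence: number,
  thresholdName: string,
  config: ThresholdConfig
): DecisionRecord {
  const threshold = config.thresholds[thresholdName];
  if (threshold === undefined) {
    throw new Error(`No threshold configured for ${thresholdName}`);
  }
  return {
    action: confidence >= threshold ? "flag" : "auto-resolve",
    confidence,
    thresholdName,
    thresholdUsed: threshold,
    configVersion: config.version,
  };
}

// Same input, two config versions: the behavior changes without a code deploy.
const configV1: ThresholdConfig = { version: 1, thresholds: { fraud_flag: 0.7 } };
const configV2: ThresholdConfig = { version: 2, thresholds: { fraud_flag: 0.75 } };

console.log(decide(0.72, "fraud_flag", configV1).action); // "flag"
console.log(decide(0.72, "fraud_flag", configV2).action); // "auto-resolve"
```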
| Manual tuning | Feedback loop |
|---|---|
| An engineer reviews alert quality monthly, when they remember | The tuner reads four weeks of joined responses every week |
| Threshold changes ride a code deploy | Threshold changes ship via config flip after approval, no deploy |
| No structured record of which alerts get dismissed | Every dismiss, modify, and escalate is captured with the threshold that fired |
| Drift surfaces through stakeholder complaints | Drift surfaces in pattern clusters before a human notices |
| Each fix patches a single symptom | Proposals address recurring clusters with cited evidence |
The Tuner Is an Analyst, Not a Fine-Tune
An LLM reading structured data and producing structured proposals. Three phases. A hard cap on output.
The tuner is not a fine-tune. It is not a training run. It is an LLM analyst that reads structured logs and produces structured proposals. Treat it as the data analyst who works exclusively on your agent's behavior, on a weekly schedule, with no other distractions.
Three phases per cycle: pattern detection, root cause clustering, proposal generation.
We got the third phase wrong on the first build. We let the tuner emit as many proposals as it found patterns for, assuming throughput would speed convergence. It did the opposite. Approvers waved through too many changes in a single week, and when missed escalations spiked the following week, no one could attribute the regression to a specific change. The 3-proposal cap looked arbitrary until we ran it. The cap forces the tuner to rank by expected impact, which forces it to write better proposals. The constraint is the leverage point.
- [01]
Aggregate the Response Patterns

```typescript
// Phase 1: Pattern Detection
// actionLog is the project's own query layer (not shown here).
import { subWeeks } from "date-fns"; // assumed source of subWeeks

const now = new Date();
const fourWeeks = await actionLog.query({
  from: subWeeks(now, 4),
  to: now,
  hasHumanResponse: true // Only rows a reviewer has already answered
});

const patterns = {
  falsePositives: fourWeeks.filter(
    e => e.humanResponse?.action === "dismiss"
  ),
  missedSignals: fourWeeks.filter(
    e => e.humanResponse?.action === "escalate"
  ),
  modifications: fourWeeks.filter(
    e => e.humanResponse?.action === "modify"
  ),
  approvals: fourWeeks.filter(
    e => e.humanResponse?.action === "approve"
  )
};
```

- [02]
Cluster Dismissals and Escalations by Feature Similarity

```typescript
// Phase 2: Root Cause Clustering
// clusterByFeatures is a project helper (not shown here).
const dismissalClusters = clusterByFeatures(
  patterns.falsePositives,
  { minClusterSize: 5, similarityThreshold: 0.8 }
);

const escalationClusters = clusterByFeatures(
  patterns.missedSignals,
  { minClusterSize: 3, similarityThreshold: 0.7 }
);
// Lower bar for missed signals: smaller, looser clusters still qualify.
// Missing a real one costs more than a false alarm.
// The asymmetry lives in the clustering, not the prompt.
```

- [03]
Emit Proposals With Evidence Attached

```typescript
// Phase 3: Proposal Generation
// tunerLLM, currentThresholds, computeTrends, and ProposalSchema
// are defined elsewhere in the project (not shown here).
const proposals = await tunerLLM.generate({
  system: TUNER_SYSTEM_PROMPT,
  data: {
    dismissalClusters,
    escalationClusters,
    currentThresholds,
    weeklyTrends: computeTrends(fourWeeks)
  },
  outputSchema: ProposalSchema
});
```
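The article does not show how `clusterByFeatures` works, so the following is only one plausible sketch: a greedy, single-pass grouping that compares each entry's feature vector to the first member of each existing cluster using cosine similarity. The `FeatureEntry` and `ClusterOptions` types and the choice of cosine similarity are assumptions made for illustration; a production system might use a proper clustering library instead.

```typescript
interface FeatureEntry {
  id: string;
  features: Record<string, number>;
}

interface ClusterOptions {
  minClusterSize: number;
  similarityThreshold: number;
}

// Cosine similarity over sparse feature maps; missing keys count as 0.
function cosineSimilarity(a: Record<string, number>, b: Record<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const key of Object.keys(a)) {
    const bValue = b[key] === undefined ? 0 : b[key];
    dot += a[key] * bValue;
    normA += a[key] * a[key];
  }
  for (const key of Object.keys(b)) {
    normB += b[key] * b[key];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Greedy single pass: each entry joins the first cluster whose seed (first
// member) is similar enough, otherwise it starts a new cluster. Clusters
// smaller than minClusterSize are dropped as noise.
function clusterByFeatures(entries: FeatureEntry[], options: ClusterOptions): FeatureEntry[][] {
  const clusters: FeatureEntry[][] = [];
  for (const entry of entries) {
    let placed = false;
    for (const cluster of clusters) {
      if (cosineSimilarity(cluster[0].features, entry.features) >= options.similarityThreshold) {
        cluster.push(entry);
        placed = true;
        break;
      }
    }
    if (!placed) clusters.push([entry]);
  }
  return clusters.filter(cluster => cluster.length >= options.minClusterSize);
}

const escalated: FeatureEntry[] = [
  { id: "e-1", features: { amount: 0.9, velocity: 0.1 } },
  { id: "e-2", features: { amount: 0.8, velocity: 0.2 } },
  { id: "e-3", features: { amount: 0.85, velocity: 0.15 } },
  { id: "e-4", features: { amount: 0.1, velocity: 0.9 } },
];
const found = clusterByFeatures(escalated, { minClusterSize: 3, similarityThreshold: 0.7 });
console.log(found.length, found[0].map(e => e.id)); // 1 [ 'e-1', 'e-2', 'e-3' ]
```

The greedy pass is order-dependent, which is acceptable for a sketch. The cluster-size floor is what keeps one-off dismissals from turning into proposals.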
The Tuner Prompt: Constraints In, Proposals Out
The system prompt is where the tuner becomes a tool instead of an opinion. Hard rules, structured output, no narrative.
prompts/tuner-system.txt

```text
You are a threshold tuning analyst for an AI agent system.

Your job: read 4 weeks of agent decision logs and propose
threshold adjustments that cut false positives without
lifting missed signals.

INPUT:
- Clusters of dismissed decisions (false positives)
- Clusters of escalated decisions (missed signals)
- Current threshold configuration
- Week-over-week trend data

RULES:
1. No proposal without at least 5 cited log entries
2. Each proposal must include: current value, proposed value,
   expected impact, supporting evidence count
3. Flag any proposal that might lift missed signals
4. If dismissal rate < 15%, propose nothing (system is healthy)
5. Hard cap: 3 proposals per cycle
6. Confidence label per proposal: low / medium / high

OUTPUT FORMAT:
{
  proposals: [{
    thresholdName: string,
    currentValue: number,
    proposedValue: number,
    direction: "increase" | "decrease",
    confidence: "low" | "medium" | "high",
    expectedImpact: string,
    evidenceCount: number,
    sampleEntryIds: string[],
    riskAssessment: string
  }],
  summary: string,
  systemHealth: "healthy" | "needs-attention" | "degraded"
}
```

The Approval Gate Is What Stops Recursive Drift
Tuner proposes. Human disposes. Without that gate, the system optimizes against itself.
The approval gate is the structural difference between a self-improving system and an unsupervised one. The tuner emits proposals. A human accepts, rejects, or modifies. The gate is not decoration — it is the mechanism that keeps the system from optimizing against criteria it set for itself.
A 2026 OneReach AI enterprise study tracked agentic AI deployments and found that systems with human-in-the-loop oversight had roughly 60% fewer production incidents than fully autonomous deployments. The spread varies by use case, but the pattern holds.[4] The approver does not need to read every statistical detail. They need to answer one question: does this change match how we want the system to behave?
| Field | Source | Why it is in the view |
|---|---|---|
| Threshold name | Tuner proposal | Names the decision boundary that moves |
| Current value | Live config | The baseline being changed |
| Proposed value | Tuner analysis | The new threshold |
| Evidence count | Action log query | How many log entries back the proposal |
| Sample entries | Action log | 3-5 representative dismissed or escalated cases |
| Expected impact | Tuner estimate | Predicted shift in false positive or miss rate |
| Risk assessment | Tuner analysis | What this change might break |
| Approver decision | Human input | Accept, reject, or modify with rationale captured |
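The approval view above maps naturally onto a small function that turns an approver's decision into a config change plus an audit record. This is a sketch under assumptions: `applyDecision`, `ApproverDecision`, and `ApprovalRecord` are hypothetical names, and the stale-proposal check is an added safety measure rather than something the article specifies.

```typescript
interface Proposal {
  thresholdName: string;
  currentValue: number;
  proposedValue: number;
  evidenceCount: number;
}

type ApproverDecision =
  | { kind: "accept"; rationale: string }
  | { kind: "reject"; rationale: string }
  | { kind: "modify"; value: number; rationale: string };

interface ApprovalRecord {
  thresholdName: string;
  decision: "accept" | "reject" | "modify";
  from: number;
  to: number; // Equals `from` when the proposal is rejected
  approverId: string;
  rationale: string;
}

interface ApprovalResult {
  thresholds: Record<string, number>;
  record: ApprovalRecord;
}

// Returns a new thresholds object; the live config is never mutated in place.
function applyDecision(
  live: Record<string, number>,
  proposal: Proposal,
  decision: ApproverDecision,
  approverId: string
): ApprovalResult {
  const liveValue = live[proposal.thresholdName];
  // Assumed safety check: refuse proposals computed against an outdated baseline.
  if (liveValue !== proposal.currentValue) {
    throw new Error(`Stale proposal for ${proposal.thresholdName}: live value is ${liveValue}`);
  }

  let to = liveValue;
  if (decision.kind === "accept") to = proposal.proposedValue;
  if (decision.kind === "modify") to = decision.value;

  const thresholds = { ...live, [proposal.thresholdName]: to };
  const record: ApprovalRecord = {
    thresholdName: proposal.thresholdName,
    decision: decision.kind,
    from: liveValue,
    to,
    approverId,
    rationale: decision.rationale,
  };
  return { thresholds, record };
}

const liveThresholds = { velocity: 0.7 };
const proposal: Proposal = {
  thresholdName: "velocity",
  currentValue: 0.7,
  proposedValue: 0.78,
  evidenceCount: 23,
};
const result = applyDecision(
  liveThresholds,
  proposal,
  { kind: "modify", value: 0.75, rationale: "Move halfway first" },
  "ops-lead"
);
console.log(result.thresholds.velocity, liveThresholds.velocity); // 0.75 0.7
```

Rejections produce a record too, which is what lets rejection rate and rationale be tracked later.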
Eight Weeks to Convergence. The Phases Are Predictable.
Two months of weekly cycles, four phases. Knowing the curve is what stops teams from killing the loop in week three.
The convergence pattern is predictable enough to plan around. The reason teams need to know it: phase one looks like nothing is working, and the temptation to pull the plug peaks exactly when the loop is collecting the data it needs. Set expectations on the curve, not the first cycle.
- [01]
Weeks 1-2: The Loop Is Logging, Not Tuning
The system writes decisions and human responses. It proposes nothing. This is the dataset the tuner needs for its first read. Dismissal rates stay high — the agent is running on its original, untuned thresholds and nothing is changing yet. The phase looks like inaction. It is the substrate.
- [02]
Weeks 3-4: First Cuts, High Confidence
The tuner runs its first cycle on two weeks of data. Early proposals tend to be the obvious ones — large dismissal clusters, clear feature patterns. False positives often drop 15-25% in this phase, though the magnitude depends entirely on how miscalibrated the original thresholds were.
- [03]
Weeks 5-6: Smaller Clusters, Tighter Calls
The tuner now has four weeks of data, including performance after the first adjustments, so it can measure the impact of its own earlier proposals. Proposals get more specific: smaller clusters, tighter confidence intervals. False positive rates often drop another 10-15%. Watch for over-correction surfacing as new escalation clusters.
- [04]
Weeks 7-8: The Tuner Goes Quiet
Equilibrium. Dismissal rates settle below 15%. The tuner starts proposing nothing, which is the correct output for a healthy system. The approver shifts from active reviewer to exception-based oversight. The loop is now calibrated to your team's judgment.
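Tracking the curve across these phases comes down to one number per week. The sketch below computes a weekly dismissal rate from reviewed entries and applies the 15% health line used throughout this article. The function name `weeklyDismissalRates` and the trimmed `ReviewedEntry` shape are hypothetical, and it is not the article's `computeTrends`, which is not shown.

```typescript
type HumanAction = "approve" | "dismiss" | "modify" | "escalate";

interface ReviewedEntry {
  timestamp: string; // ISO 8601 decision time
  humanAction: HumanAction;
}

interface WeeklyRate {
  weekStart: string;
  reviewed: number;
  dismissalRate: number;
  healthy: boolean;
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;
const HEALTHY_DISMISSAL_RATE = 0.15; // Same line the tuner prompt uses

// Buckets reviewed entries into consecutive weeks starting at windowStart.
function weeklyDismissalRates(
  entries: ReviewedEntry[],
  windowStart: string,
  weeks: number
): WeeklyRate[] {
  const startMs = Date.parse(windowStart);
  const reviewed: number[] = [];
  const dismissed: number[] = [];
  for (let i = 0; i < weeks; i++) {
    reviewed.push(0);
    dismissed.push(0);
  }

  for (const entry of entries) {
    const week = Math.floor((Date.parse(entry.timestamp) - startMs) / WEEK_MS);
    if (week < 0 || week >= weeks) continue; // Outside the window
    reviewed[week] += 1;
    if (entry.humanAction === "dismiss") dismissed[week] += 1;
  }

  return reviewed.map((count, week) => {
    const rate = count === 0 ? 0 : dismissed[week] / count;
    return {
      weekStart: new Date(startMs + week * WEEK_MS).toISOString(),
      reviewed: count,
      dismissalRate: rate,
      // A week with no reviews is not evidence of health.
      healthy: count > 0 && rate < HEALTHY_DISMISSAL_RATE,
    };
  });
}

// Helper that fabricates sample rows for the example only.
function makeEntries(day: string, dismissals: number, approvals: number): ReviewedEntry[] {
  const out: ReviewedEntry[] = [];
  for (let i = 0; i < dismissals; i++) out.push({ timestamp: day, humanAction: "dismiss" });
  for (let i = 0; i < approvals; i++) out.push({ timestamp: day, humanAction: "approve" });
  return out;
}

const sample = makeEntries("2025-01-07T12:00:00Z", 4, 6)
  .concat(makeEntries("2025-01-15T12:00:00Z", 1, 9));
const trend = weeklyDismissalRates(sample, "2025-01-06T00:00:00Z", 2);
console.log(trend.map(w => w.dismissalRate)); // [ 0.4, 0.1 ]
```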
Reference Layout
Where the four components live in the repo.
Self-Improving Agent Project
```text
self-improving-agent/
├── src/
│   ├── agent/
│   │   ├── primary-agent.ts
│   │   ├── decision-engine.ts
│   │   └── threshold-config.ts
│   ├── tuner/
│   │   ├── tuner-agent.ts
│   │   ├── pattern-detector.ts
│   │   ├── proposal-generator.ts
│   │   └── prompts/tuner-system.txt
│   ├── audit/
│   │   ├── action-log.ts
│   │   ├── schemas.ts
│   │   └── migrations/
│   └── approval/
│       ├── approval-api.ts
│       └── notification.ts
├── config/
│   ├── thresholds.json
│   └── tuner-schedule.json
└── tests/
    ├── tuner.test.ts
    ├── pattern-detector.test.ts
    └── approval-flow.test.ts
```

Guardrails: Bound the Loop or It Will Run Away
A self-tuning system without limits is a system that will eventually optimize itself off a cliff. These are the bounds that keep it reversible.
Bounds on the Loop
Maximum 3 threshold changes per cycle
Caps blast radius. Keeps attribution clean — when something regresses, you can name the change that did it.
No threshold moves more than 20% in a single cycle
Prevents overnight personality changes. Large corrections spread across multiple cycles, which is what convergence actually looks like.
Every change passes through human approval before it goes live
The tuner proposes. It does not deploy. The gate is the mechanism, not a courtesy.
Auto-rollback if error rate exceeds baseline by 10%
Post-change degradation past tolerance restores the previous config without waiting for a human. Fast escape hatch, no questions.
Full audit trail of every proposal, approval, rejection, and rollback
Replay fidelity for compliance and debugging. Every change links back to the evidence that produced it.
The tuner cannot rewrite its own evaluation criteria
Engineers set the meta-rules. The tuner does not. Without this bound, the loop optimizes the loop, and recursive drift sets in.
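Three of the bounds above are mechanical enough to enforce in code rather than trusting the tuner prompt alone. The sketch below is one way to do that, with assumptions stated plainly: proposals are ranked by evidence count as a stand-in for expected impact, the 20% limit is read as relative to the current value, and "exceeds baseline by 10%" is read as a relative increase in error rate. The names `boundProposals` and `shouldRollback` are hypothetical.

```typescript
interface ProposedChange {
  thresholdName: string;
  currentValue: number;
  proposedValue: number;
  evidenceCount: number;
}

const MAX_CHANGES_PER_CYCLE = 3;
const MAX_RELATIVE_MOVE = 0.2; // No threshold moves more than 20% per cycle
const ROLLBACK_TOLERANCE = 0.1; // Assumed to mean 10% relative to baseline

// Keeps the best-supported proposals and clamps each move to the allowed band.
function boundProposals(proposals: ProposedChange[]): ProposedChange[] {
  return proposals
    .slice() // Copy so the caller's array is not reordered
    .sort((a, b) => b.evidenceCount - a.evidenceCount)
    .slice(0, MAX_CHANGES_PER_CYCLE)
    .map(p => {
      const maxMove = Math.abs(p.currentValue) * MAX_RELATIVE_MOVE;
      const lower = p.currentValue - maxMove;
      const upper = p.currentValue + maxMove;
      return { ...p, proposedValue: Math.min(Math.max(p.proposedValue, lower), upper) };
    });
}

// True when the post-change error rate is past tolerance and the previous
// config should be restored.
function shouldRollback(baselineErrorRate: number, currentErrorRate: number): boolean {
  return currentErrorRate > baselineErrorRate * (1 + ROLLBACK_TOLERANCE);
}

const bounded = boundProposals([
  { thresholdName: "amount", currentValue: 0.6, proposedValue: 0.62, evidenceCount: 12 },
  { thresholdName: "velocity", currentValue: 0.5, proposedValue: 0.7, evidenceCount: 40 },
  { thresholdName: "geo", currentValue: 0.8, proposedValue: 0.78, evidenceCount: 7 },
  { thresholdName: "device", currentValue: 0.4, proposedValue: 0.42, evidenceCount: 25 },
]);
console.log(bounded.map(p => p.thresholdName)); // [ 'velocity', 'device', 'amount' ]
console.log(shouldRollback(0.05, 0.056)); // true
```

In this example the `velocity` proposal asked for a 40% jump and is clamped to roughly 0.6. The remainder of the correction has to earn its way through later cycles.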
What to Watch
Four numbers that tell you the loop is healthy or that something has slipped.
Build Order
Sequenced so each step unblocks the next. Skip nothing.
Pre-Production Build Order
Action log schema defined with decision, context, and human response fields
Action log storage deployed with time-range indexing
Primary agent instrumented to write every decision to the log
Human response capture wired into the existing review workflow
Pattern detection with clustering for dismissals and escalations implemented
Tuner system prompt written and tested against sample data
Approval interface built with evidence display attached to every proposal
Auto-rollback triggers configured with explicit tolerance bands
Two-week baseline collection period scheduled before first tuner run
First tuner cycle executed with full team review
Eight-week convergence tracked and the final threshold config locked in
Frequently Asked Questions
What if our team does not respond to enough alerts to generate useful data?
Below roughly 60% response rate the patterns get noisy and the tuner starts proposing on thin evidence. Fix the input before fixing the loop. Replace full triage with a binary thumbs-up/thumbs-down on each alert. Even that signal is enough for the tuner to identify the worst false positive clusters. A complete record of weak signals beats a sparse record of strong ones.
Does this work for rule-based agents, not just LLMs?
Yes. The loop is agent-architecture agnostic. Rule-based systems are easier to tune than LLM-based ones, because a deterministic rule tree leaves no ambiguity about what fired the decision. The tuner itself uses an LLM to read the data, but the agent it tunes can be a decision tree, a logistic regression, or a transformer. One adaptation: replace the confidence field in the action log with the rule or rule combination that triggered the decision. The tuner clusters by rule rather than by confidence range, and the rest works the same.
How do you stop the tuner from over-fitting to recent data?
The 4-week rolling window is the primary defense: long enough to absorb temporal variation, short enough to react to real shifts. The 3-proposal cap and the 20% per-threshold cap add damping. Domains with strong seasonality (retail, fraud, anything tax-cycle adjacent) should extend the window to 6 or 8 weeks. Choose the window length deliberately for your domain rather than inheriting four weeks as a default.
What does it mean when the tuner and the approver consistently disagree?
Persistent rejection is a signal that the tuner's system prompt is stale, not that the approver is wrong. Track rejection rate and rationale. After three consecutive rejections of the same proposal type, feed the rejection reasoning back into the tuner prompt as a new constraint. That is the meta-feedback loop — the tuner gets tuned by the same mechanism it tunes everything else with.
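A rough sketch of that meta-feedback step follows. It assumes a "proposal type" means the same threshold moved in the same direction, which is an interpretation rather than something the article defines, and `constraintsFromRejections` is a hypothetical helper. It scans review outcomes in chronological order and emits a constraint line for any type rejected three times in a row.

```typescript
interface ReviewOutcome {
  thresholdName: string;
  direction: "increase" | "decrease";
  accepted: boolean;
  rationale: string;
}

// history must be in chronological order (oldest first).
function constraintsFromRejections(history: ReviewOutcome[], streak: number = 3): string[] {
  // Current run of consecutive rejection rationales per proposal type.
  const runs: Record<string, string[]> = {};
  for (const outcome of history) {
    const key = `${outcome.direction} ${outcome.thresholdName}`;
    if (outcome.accepted) {
      runs[key] = []; // An acceptance resets the streak
    } else {
      runs[key] = (runs[key] || []).concat(outcome.rationale);
    }
  }

  const constraints: string[] = [];
  for (const key of Object.keys(runs)) {
    if (runs[key].length >= streak) {
      constraints.push(`Do not propose "${key}" unless it addresses: ${runs[key].join("; ")}`);
    }
  }
  return constraints;
}

const reviewHistory: ReviewOutcome[] = [
  { thresholdName: "velocity", direction: "decrease", accepted: false, rationale: "Holiday traffic" },
  { thresholdName: "amount", direction: "increase", accepted: true, rationale: "Matches our review notes" },
  { thresholdName: "velocity", direction: "decrease", accepted: false, rationale: "Too few samples" },
  { thresholdName: "velocity", direction: "decrease", accepted: false, rationale: "Regulator requires it" },
];
console.log(constraintsFromRejections(reviewHistory).length); // 1
```

The returned lines would be appended to the RULES section of the tuner prompt by an engineer, which keeps the tuner from rewriting its own criteria.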
Can the system get too conservative over time?
Threshold creep is a real failure mode. The tuner optimizes against dismissals, dismissals drop, and the system slowly tightens until it suppresses real signals. Defend by tracking escalation rate alongside dismissal rate. If escalations fall below historical norms while dismissals fall, the loop has gone too far. Bake the trade-off into the tuner prompt explicitly. Watch both numbers, not one.
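The two-number check can be written down directly. This is a minimal sketch: `detectThresholdCreep` and `WeeklyRates` are hypothetical names, and the historical escalation norm is assumed to be a single rate the team has measured beforehand.

```typescript
interface WeeklyRates {
  dismissalRate: number;
  escalationRate: number;
}

// Flags the pattern described above: dismissals are falling while escalations
// have dropped below their historical norm, which suggests real signals are
// being suppressed rather than false positives being removed.
function detectThresholdCreep(
  previous: WeeklyRates,
  current: WeeklyRates,
  historicalEscalationRate: number
): boolean {
  const dismissalsFalling = current.dismissalRate < previous.dismissalRate;
  const escalationsBelowNorm = current.escalationRate < historicalEscalationRate;
  return dismissalsFalling && escalationsBelowNorm;
}

const lastWeek: WeeklyRates = { dismissalRate: 0.18, escalationRate: 0.05 };
const thisWeek: WeeklyRates = { dismissalRate: 0.11, escalationRate: 0.02 };
console.log(detectThresholdCreep(lastWeek, thisWeek, 0.04)); // true
```

A `true` result is a reason to pause approvals and review recent changes, not a trigger for an automatic threshold move.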
Self-improving agents are not a research problem. They are a plumbing problem. The agents in production today that stay sharp are not running clever architectures. They are running boring, well-instrumented feedback loops. An audit log with the right schema. A tuner on a weekly schedule. A human at the gate. Eight weeks.
There is a counterpoint worth naming, because it is the failure mode this article does not solve. The loop assumes the underlying task is stable. If your business logic shifts every quarter — pricing rules, compliance rewrites, product catalog churn — automated threshold tuning will mask the need for a real architecture revision. The agent will converge beautifully on a target that is no longer the target. Treat the loop as a precision tool for a stable domain. It is not a substitute for rethinking a misaligned system.
The team is already producing the calibration data. The infrastructure to use it is the work.
- [1] 7 Tips to Build Self-Improving AI Agents With Feedback Loops (datagrid.com)
- [2] Autonomous AI Systems: Human-in-the-Loop Design (blog.eduonix.com)
- [3] Yohei Nakajima, Better Ways to Build Self-Improving AI Agents (yoheinakajima.com)
- [4] Human-in-the-Loop Agentic AI Systems: Enterprise Guide (onereach.ai)
- [5] Unseen, Unchecked, Unraveling: Inside the Risky Code of Self-Modifying AI, ISACA (isaca.org)
- [6] AI Trends 2026: Test-Time Reasoning and Reflective Agents, Hugging Face (huggingface.co)
- [7] Enterprise RLHF Implementation Checklist: Complete Deployment Framework (cleverx.com)
- [8] Agent Loop: Adaptive AI Agents, Complete Guide 2026 (gleecus.com)