Every dismiss, modify, and escalate is a labeled training signal. Most teams log it as a debug artifact and move on. Here is the audit schema, the weekly tuner, and the human approval gate that turn that signal into thresholds that converge in eight weeks.
Why agent drift is structural — not a deployment bug you fix once
The exact audit log schema that makes threshold tuning possible
A three-phase tuner agent design with runnable TypeScript
The tuner system prompt constraints that prevent runaway optimization
The human approval gate and what the approver actually needs to see
A decision table for when to use this loop vs. when to redesign the agent
The eight-week convergence curve — and what week three looks like when you almost quit
Ship an agent. It triages tickets, flags transactions, routes leads. Week one, it works. Week three, the team is dismissing 40% of its alerts. Week six, someone has built a private spreadsheet to track which alerts are worth opening.
That spreadsheet is the failure mode and the fix in the same place. The dismiss, modify, and escalate actions your reviewers already perform are labeled training data. The model is generating the calibration signal it needs, every day, in production. Most teams log that signal as a debug artifact and forget about it.
This is the architecture that uses it. An audit log that captures every decision with the threshold that triggered it. A weekly tuner agent that reads four weeks of human responses and proposes evidence-backed threshold changes. A human approval gate that keeps the system accountable. Eight weeks in, the agent matches your team's judgment without anyone editing a rule.
The world moves faster than the prompt. Without a feedback mechanism, accuracy decays from the day you deploy.
Drift is structural, not exceptional. The fraud patterns that defined last quarter's training data are not this quarter's. The triage agent calibrated for 200 daily tickets behaves differently at 2,000. The world keeps moving. The prompt does not.
Manual tuning is the obvious response and a bad one. It depends on someone noticing — usually a frustrated stakeholder filing a complaint long after dismissals have normalized. It then routes through an engineer who has to diagnose, adjust, redeploy. And every fix addresses the symptom that escalated, not the structure underneath it. The cycle is whack-a-mole because the inputs are anecdotes.
The scale of the problem is concrete in domains where alerting agents have been deployed longest. Transaction monitoring false positive rates sit at 93% across the industry in 2025 — down from 97% in 2015, which means a decade of investment bought four percentage points of improvement with no feedback loop in place.[9] A 2025 ISACA analysis tracked self-modifying AI systems with no structured feedback mechanism and found roughly 3x higher rates of unexpected behavioral change versus systems with formal oversight — the magnitude varies by system type but the direction does not.[5]
The fix is not heroics. It is plumbing. Every human response to an agent decision becomes structured data. A second agent reads that data and proposes calibrations. A human approves them. The loop closes.
Threshold tuning converges on the right signal when the task is stable. When the task itself is wrong, convergence accelerates the problem.
| Situation | Right move | Why |
|---|---|---|
| Agent decisions are right 60–80% of the time; dismissals cluster around specific feature ranges | Deploy the loop | Threshold miscalibration — the tuner can fix this in 3–4 cycles |
| Dismissal rate is stable but has always been high (from day one) | Audit the task definition first | The agent may be solving the wrong problem; tuning won't help |
| Human responses are inconsistent — different reviewers dismiss the same alert | Fix the review process before the loop | The tuner needs a consistent signal to learn from; reviewer disagreement poisons the dataset |
| Alert volume is under ~30/day and the team reviews everything | Manual review is fine | Below this volume, clusters rarely reach the 5-entry minimum; overhead outweighs benefit |
| Business logic changes quarterly — pricing rules, compliance rewrites, catalog churn | Use the loop, but trigger a manual threshold audit on every logic change | The tuner converges on a moving target; you need a human reset after each major shift |
| Reviewers respond to fewer than 60% of alerts | Fix response rate before adding the tuner | Pattern detection on sparse data produces noise, not signal |
The tuner is only as good as what the log captured. Every record needs decision, evidence, and human response in one place.
Before you can tune anything, you need something to tune from. That means logging every agent decision with enough structure to reconstruct why it fired and what happened next.
The audit log is not a debug log. A debug log records what code ran. The audit log records what the agent decided, what evidence it used, and how a human responded. Three fields, one row, queryable for pattern analysis.
Store the entries somewhere queryable with time-range indexing. A PostgreSQL table with JSONB columns covers most teams. High-volume systems should partition by week — the tuner only ever reads the last four.
The human response field starts empty and gets backfilled as reviewers process their queue. That asynchrony is the whole point. The agent does not block waiting for a human. The human responds when they get to it. The tuner reads the joined record on a weekly cadence. Every layer runs at the speed it can sustain.
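
A minimal sketch of what one such record could look like in TypeScript, assuming an in-memory object rather than a real database row. Only the `confidence` and `timeToRespond` field names come from this article; every other name, and the choice of seconds as the unit, is an illustrative assumption.

```typescript
// Hypothetical audit log shape. Field names other than `confidence` and
// `timeToRespond` are assumptions made for illustration.
type HumanAction = "dismiss" | "modify" | "escalate";

interface AgentDecision {
  action: string;         // what the agent decided, e.g. "flag_transaction"
  thresholdName: string;  // which threshold fired
  thresholdValue: number; // value of that threshold at decision time
  confidence: number;     // agent score compared against the threshold
}

interface HumanResponse {
  action: HumanAction;
  reviewerId: string;
  respondedAt: string;    // ISO 8601 timestamp
  timeToRespond: number;  // seconds (assumed unit)
}

interface AuditLogEntry {
  id: string;
  createdAt: string;      // ISO 8601 timestamp
  decision: AgentDecision;
  evidence: Record<string, number | string | boolean>; // features the agent used
  humanResponse: HumanResponse | null;                 // null until a reviewer responds
}

// Written by the primary agent at decision time; it never waits on a human.
function recordDecision(
  id: string,
  decision: AgentDecision,
  evidence: AuditLogEntry["evidence"],
  now: Date = new Date(),
): AuditLogEntry {
  return { id, createdAt: now.toISOString(), decision, evidence, humanResponse: null };
}

// Backfill step, run later when a reviewer processes the alert.
// Returns a new entry instead of mutating the original.
function backfillResponse(entry: AuditLogEntry, response: HumanResponse): AuditLogEntry {
  return { ...entry, humanResponse: response };
}
```

Keeping `humanResponse` nullable makes the asynchrony explicit: an entry is valid the moment the agent writes it, and the tuner can filter on non-null responses when it reads.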
One implementation detail that bites teams: the timeToRespond field is not just metadata. Fast responses (under 60 seconds) from the same reviewer on the same day are a signal that the reviewer is triaging by habit, not evidence — exactly the pattern that produces an unreliable training signal. The tuner should weight rapid dismissals lower than deliberate ones. Build that into the clustering logic from the start, not as a retrofit.
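
One way the down-weighting could be sketched. The 60-second cutoff is taken from the paragraph above; the 0.25 weight, the "three or more in one day" habit rule, and all function and field names are assumptions to adjust against your own data.

```typescript
// Sketch only: cutoff from the article, weight and habit rule assumed.
interface ReviewedResponse {
  reviewerId: string;
  respondedAt: string;   // ISO 8601 timestamp in UTC
  timeToRespond: number; // seconds
  action: "dismiss" | "modify" | "escalate";
}

const FAST_SECONDS = 60;    // from the article
const HABIT_MIN_COUNT = 3;  // assumed: fast dismissals per reviewer per day
const FAST_WEIGHT = 0.25;   // assumed down-weight

const isFastDismissal = (r: ReviewedResponse): boolean =>
  r.action === "dismiss" && r.timeToRespond < FAST_SECONDS;

// Groups by reviewer and calendar day (first 10 chars of the ISO timestamp).
const dayKey = (r: ReviewedResponse): string =>
  `${r.reviewerId}|${r.respondedAt.slice(0, 10)}`;

// Returns one weight per response, in input order.
function weightResponses(responses: ReviewedResponse[]): number[] {
  const fastCounts = new Map<string, number>();
  for (const r of responses) {
    if (isFastDismissal(r)) {
      fastCounts.set(dayKey(r), (fastCounts.get(dayKey(r)) ?? 0) + 1);
    }
  }
  return responses.map((r) =>
    isFastDismissal(r) && (fastCounts.get(dayKey(r)) ?? 0) >= HABIT_MIN_COUNT
      ? FAST_WEIGHT
      : 1,
  );
}
```

A single fast dismissal keeps full weight here; only a run of them from the same reviewer on the same day is treated as habit.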
How dismissals, modifications, and escalations move from human reviewers back into the threshold configuration the agent reads at runtime.
Four components, four cadences. Decoupling them is the architecture.
The Primary Agent runs in real time. It reads the current threshold config, makes decisions, writes every one to the action log. It never blocks on a human.
The Action Log accumulates decision records. Reviewer responses backfill async — most teams see the human field populated within 24–48 hours.
The Tuner Agent runs once a week, typically Sunday night or Monday morning. It reads four weeks of joined data, clusters dismissals and escalations, and emits a structured proposal with evidence counts.
The Approval Interface routes the proposals to a designated approver — a team lead or ops owner. They accept, reject, or modify each one. Nothing reaches the live config without explicit human sign-off.
The cadence gap between components is not a weakness. It is why the loop is safe. Real-time agent decisions can't accidentally trigger threshold changes. Threshold changes can't bypass human review. Each boundary is a deliberate control point, not an accident of scheduling.
An engineer reviews alert quality monthly, when they remember
Threshold changes ride a code deploy
No structured record of which alerts get dismissed
Drift surfaces through stakeholder complaints
Each fix patches a single symptom
The tuner reads four weeks of joined responses every week
Threshold changes ship via config flip after approval, no deploy
Every dismiss, modify, and escalate is captured with the threshold that fired
Drift surfaces in pattern clusters before a human notices
Proposals address recurring clusters with cited evidence
An LLM reading structured data and producing structured proposals. Three phases. A hard cap on output.
The tuner is not a fine-tune. It is not a training run. It is an LLM analyst that reads structured logs and produces structured proposals. Treat it as the data analyst who works exclusively on your agent's behavior, on a weekly schedule, with no other distractions.
Three phases per cycle: pattern detection, root cause clustering, proposal generation.
We got the third phase wrong on the first build. We let the tuner emit as many proposals as it found patterns for, assuming throughput would speed convergence. It did the opposite. Approvers waved through too many changes in a single week, and when missed escalations spiked the following week, no one could attribute the regression to a specific change. The 3-proposal cap looked arbitrary until we ran it. The cap forces the tuner to rank by expected impact, which forces it to write better proposals. The constraint is the leverage point.
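
The tuner described here is an LLM, so the sketch below covers only the deterministic scaffolding that could sit around it: grouping dismissals, dropping clusters under the 5-entry minimum, and capping the ranked list at three. Bucketing by confidence decile and ranking by weighted evidence count are assumptions standing in for "expected impact"; all names are hypothetical.

```typescript
// Deterministic pre-aggregation around an LLM tuner. Names are illustrative.
interface DismissalRecord {
  thresholdName: string;
  confidence: number; // 0..1, agent score at decision time
  weight: number;     // e.g. a lower weight for rapid dismissals
}

interface Cluster {
  thresholdName: string;
  bucket: string;        // confidence range, e.g. "0.6-0.7"
  evidenceCount: number; // raw number of log entries
  weightedCount: number; // sum of weights, used for ranking
}

const MIN_CLUSTER_SIZE = 5; // from the article
const MAX_PROPOSALS = 3;    // from the article

// Phases 1 and 2: group dismissals by threshold and confidence decile,
// then drop clusters with too little evidence.
function clusterDismissals(
  records: DismissalRecord[],
  minClusterSize: number = MIN_CLUSTER_SIZE,
): Cluster[] {
  const byKey = new Map<string, Cluster>();
  for (const r of records) {
    const lower = Math.min(9, Math.floor(r.confidence * 10)) / 10;
    const bucket = `${lower.toFixed(1)}-${(lower + 0.1).toFixed(1)}`;
    const key = `${r.thresholdName}|${bucket}`;
    const cluster = byKey.get(key) ??
      { thresholdName: r.thresholdName, bucket, evidenceCount: 0, weightedCount: 0 };
    cluster.evidenceCount += 1;
    cluster.weightedCount += r.weight;
    byKey.set(key, cluster);
  }
  return Array.from(byKey.values()).filter((c) => c.evidenceCount >= minClusterSize);
}

// Phase 3 input: rank clusters and keep at most `maxProposals` of them.
function topClusters(clusters: Cluster[], maxProposals: number = MAX_PROPOSALS): Cluster[] {
  return clusters
    .slice()
    .sort((a, b) => b.weightedCount - a.weightedCount)
    .slice(0, maxProposals);
}
```

Applying the cap in code as well as in the prompt means a tuner that ignores its instructions still cannot hand the approver more than three changes.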
The system prompt is where the tuner becomes a tool instead of an opinion. Hard rules, structured output, no narrative.
Tuner proposes. Human disposes. Without that gate, the system optimizes against itself.
The approval gate is the structural difference between a self-improving system and an unsupervised one. The tuner emits proposals. A human accepts, rejects, or modifies. The gate is not decoration — it is the mechanism that keeps the system from optimizing against criteria it set for itself.
A 2026 OneReach AI enterprise study tracked agentic AI deployments and found that systems with human-in-the-loop oversight ran roughly 60% fewer production incidents than fully autonomous deployments — the spread varies by use case but the pattern holds.[4] The approver does not need to read every statistical detail. They need to answer one question: does this change match how we want the system to behave?
The approver role is not a rubber stamp. It is the one place in the loop where institutional knowledge, regulatory context, and domain judgment enter the system. Engineers who treat approvals as overhead are misreading the architecture.
| Field | Source | Why it is in the view |
|---|---|---|
| Threshold name | Tuner proposal | Names the decision boundary that moves |
| Current value | Live config | The baseline being changed |
| Proposed value | Tuner analysis | The new threshold |
| Evidence count | Action log query | How many log entries back the proposal |
| Sample entries | Action log | 3–5 representative dismissed or escalated cases |
| Expected impact | Tuner estimate | Predicted shift in false positive or miss rate |
| Risk assessment | Tuner analysis | What this change might break |
| Last change date | Config history | Was this threshold touched recently? How did it perform? |
| Approver decision | Human input | Accept, reject, or modify with rationale captured |
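
A rough sketch of how an approver decision might be applied to the live config, assuming the config is a flat name-to-number map. The types and the staleness check are assumptions for illustration, not a description of a specific approval API.

```typescript
// Hypothetical approval step: a proposal only reaches the config through
// an explicit human decision, and every change records who approved it.
type ThresholdConfig = Record<string, number>;

interface Proposal {
  thresholdName: string;
  currentValue: number;
  proposedValue: number;
  evidenceCount: number;
}

type ApproverDecision =
  | { kind: "accept"; rationale: string }
  | { kind: "reject"; rationale: string }
  | { kind: "modify"; value: number; rationale: string };

interface ConfigChange {
  thresholdName: string;
  previousValue: number;
  newValue: number;
  approvedBy: string;
  rationale: string;
}

function applyDecision(
  config: ThresholdConfig,
  proposal: Proposal,
  decision: ApproverDecision,
  approverId: string,
): { config: ThresholdConfig; change: ConfigChange | null } {
  // Refuse proposals computed against a value that has since changed.
  if (config[proposal.thresholdName] !== proposal.currentValue) {
    throw new Error(`Stale proposal for ${proposal.thresholdName}`);
  }
  if (decision.kind === "reject") {
    return { config, change: null };
  }
  const newValue = decision.kind === "modify" ? decision.value : proposal.proposedValue;
  return {
    // New object: the previous config stays available for rollback.
    config: { ...config, [proposal.thresholdName]: newValue },
    change: {
      thresholdName: proposal.thresholdName,
      previousValue: proposal.currentValue,
      newValue,
      approvedBy: approverId,
      rationale: decision.rationale,
    },
  };
}
```

Returning the change record alongside the new config keeps the rationale attached to the change, which is what the "Last change date" row relies on in later cycles.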
Two months of weekly cycles, four phases. Knowing the curve is what stops teams from killing the loop in week three.
The convergence pattern is predictable enough to plan around. The reason teams need to know it: phase one looks like nothing is working, and the temptation to pull the plug peaks exactly when the loop is collecting the data it needs. Set expectations on the curve, not the first cycle.
The system writes decisions and human responses. It proposes nothing. This is the dataset the tuner needs for its first read. Dismissal rates stay high — the agent is running on its original, untuned thresholds and nothing is changing yet. The phase looks like inaction. It is the substrate.
The tuner runs its first cycle on two weeks of data. Early proposals tend to be the obvious ones — large dismissal clusters, clear feature patterns. False positives often drop 15–25% in this phase, though the magnitude depends entirely on how miscalibrated the original thresholds were.
Four weeks of data including post-adjustment performance. The tuner can now measure the impact of its own earlier proposals. Proposals get more specific — smaller clusters, tighter confidence intervals. False positive rates often shave another 10–15%. Watch for over-correction surfacing as new escalation clusters.
Equilibrium. Dismissal rates settle below 15%. The tuner starts proposing nothing, which is the correct output for a healthy system. The approver shifts from active reviewer to exception-based oversight. The loop is now calibrated to your team's judgment.
The loop has predictable failure modes. Each one has an early signal if you are watching the right metric.
Self-improving systems fail in ways that look like success until they don't. The loop can converge on the wrong target, amplify a latent bias in the training signal, or drift past equilibrium into excessive conservatism. All three failure modes share one feature: they are invisible unless you are tracking the right counters.
Reward hacking. The tuner optimizes the metric it can see: dismissal rate. If reviewers start approving more alerts simply because the tuner has pushed thresholds up and fewer alerts fire, dismissal rate drops, the system looks healthy, and actual detection quality is unchanged. Defense: track true positive rate separately from dismissal rate. The two should improve together: as dismissal rate falls, true positive rate should hold or rise. If dismissal rate falls while confirmed escalations also fall, you have drifted into over-suppression.
Reviewer habit drift. This is the inverse problem. Reviewers who process 100+ alerts per day develop pattern recognition that shortcuts deliberate review — exactly what produces the 22% accuracy decline documented in high-volume alert environments.[10] They dismiss quickly; the tuner reads the fast dismissals as signal; thresholds lift; more real signals get suppressed. Defense: the timeToRespond weight discussed in the schema section. Fast dismissals from the same reviewer in a single session get down-weighted in the clustering phase.
Threshold creep toward silence. The tuner optimizes against dismissals. Dismissals drop. The system slowly tightens until it suppresses real signals to maintain a clean dismissal rate. The loop has now confused silence with accuracy. Defense: alert when escalation rate falls below historical baseline alongside dismissal rate. Both should fall together in healthy convergence. If escalations fall faster than dismissals, the system is going quiet, not sharp.
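
The "escalations falling faster than dismissals" check can be sketched as a comparison of relative drops against a baseline week. This version compares raw weekly counts, and the 10-point tolerance is an assumption; the names are hypothetical.

```typescript
// Sketch of an over-suppression alarm. Tolerance and names are assumptions.
interface WeeklyCounts {
  alerts: number;
  dismissed: number;
  escalated: number;
}

type LoopHealth = "healthy" | "over_suppression";

// Fraction by which `current` fell relative to `baseline` (negative if it rose).
function relativeDrop(baseline: number, current: number): number {
  return baseline === 0 ? 0 : (baseline - current) / baseline;
}

function checkSuppression(
  baseline: WeeklyCounts,
  current: WeeklyCounts,
  tolerance: number = 0.1,
): LoopHealth {
  const dismissalDrop = relativeDrop(baseline.dismissed, current.dismissed);
  const escalationDrop = relativeDrop(baseline.escalated, current.escalated);
  // Going quiet, not sharp: escalations fell, and fell faster than dismissals.
  const goingQuiet = escalationDrop > 0 && escalationDrop > dismissalDrop + tolerance;
  return goingQuiet ? "over_suppression" : "healthy";
}
```

A week where dismissals drop by more than half while escalations barely move reads as healthy; a week where escalations collapse while dismissals shrink more slowly trips the alarm.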
Where the four components live in the repo.
self-improving-agent/
├── src/
│   ├── agent/
│   │   ├── primary-agent.ts
│   │   ├── decision-engine.ts
│   │   └── threshold-config.ts
│   ├── tuner/
│   │   ├── tuner-agent.ts
│   │   ├── pattern-detector.ts
│   │   ├── proposal-generator.ts
│   │   └── prompts/tuner-system.txt
│   ├── audit/
│   │   ├── action-log.ts
│   │   ├── schemas.ts
│   │   └── migrations/
│   └── approval/
│       ├── approval-api.ts
│       └── notification.ts
├── config/
│   ├── thresholds.json
│   └── tuner-schedule.json
└── tests/
    ├── tuner.test.ts
    ├── pattern-detector.test.ts
    └── approval-flow.test.ts

A self-tuning system without limits is a system that will eventually optimize itself off a cliff. These are the bounds that keep it reversible.
Caps blast radius. Keeps attribution clean — when something regresses, you can name the change that did it.
Prevents overnight personality changes. Large corrections spread across multiple cycles, which is what convergence actually looks like.
Lets new data accumulate before touching the same knob twice. Prevents the tuner from oscillating around a value.
The tuner proposes. It does not deploy. The gate is the mechanism, not a courtesy.
Post-change degradation past tolerance restores the previous config without waiting for a human. Fast escape hatch, no questions.
Replay fidelity for compliance and debugging. Every change links back to the evidence that produced it.
Engineers set the meta-rules. The tuner does not. Without this bound, the loop optimizes the loop, and recursive drift sets in.
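
Three of these bounds are mechanical enough to enforce in code before a proposal ever reaches the approver. The sketch below assumes the 3-proposal cap and 20% per-threshold cap mentioned elsewhere in this article; the 14-day cooldown is an assumed value, and all names are hypothetical.

```typescript
// Guardrail filter run on the tuner's ranked proposals. Cooldown length
// and all identifiers are assumptions for illustration.
interface Proposal {
  thresholdName: string;
  currentValue: number;
  proposedValue: number;
}

interface Guardrails {
  maxProposalsPerCycle: number; // 3 in this article
  maxRelativeChange: number;    // 0.2 = 20% per threshold per cycle
  cooldownDays: number;         // assumed value
}

interface Rejected {
  proposal: Proposal;
  reason: "cooldown" | "over_cap";
}

const DEFAULT_GUARDRAILS: Guardrails = {
  maxProposalsPerCycle: 3,
  maxRelativeChange: 0.2,
  cooldownDays: 14,
};

// Limits the step size so large corrections spread across several cycles.
// A threshold currently at 0 cannot move under a purely relative cap.
function clampStep(p: Proposal, maxRelativeChange: number): Proposal {
  const maxStep = Math.abs(p.currentValue) * maxRelativeChange;
  const delta = p.proposedValue - p.currentValue;
  const clamped = Math.max(-maxStep, Math.min(maxStep, delta));
  return { ...p, proposedValue: p.currentValue + clamped };
}

function enforceGuardrails(
  ranked: Proposal[],
  lastChangedAt: Partial<Record<string, string>>, // ISO timestamps by threshold
  today: Date,
  g: Guardrails = DEFAULT_GUARDRAILS,
): { allowed: Proposal[]; rejected: Rejected[] } {
  const msPerDay = 24 * 60 * 60 * 1000;
  const allowed: Proposal[] = [];
  const rejected: Rejected[] = [];
  for (const p of ranked) {
    const last = lastChangedAt[p.thresholdName];
    const daysSince = last === undefined
      ? Infinity
      : (today.getTime() - new Date(last).getTime()) / msPerDay;
    if (daysSince < g.cooldownDays) {
      rejected.push({ proposal: p, reason: "cooldown" });
    } else if (allowed.length >= g.maxProposalsPerCycle) {
      rejected.push({ proposal: p, reason: "over_cap" });
    } else {
      allowed.push(clampStep(p, g.maxRelativeChange));
    }
  }
  return { allowed, rejected };
}
```

The guardrail values live in a constant the tuner never writes to, which is one simple way to keep the meta-rules in engineers' hands.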
Four numbers that tell you the loop is healthy or that something has slipped.
Sequenced so each step unblocks the next. Skip nothing.
What if our team does not respond to enough alerts to generate useful data?
Below roughly 60% response rate the patterns get noisy and the tuner starts proposing on thin evidence. Fix the input before fixing the loop. Replace full triage with a binary thumbs-up/thumbs-down on each alert. Even that signal is enough for the tuner to identify the worst false positive clusters. A complete record of weak signals beats a sparse record of strong ones.
Does this work for rule-based agents, not just LLMs?
Yes. The loop is agent-architecture agnostic. Rule-based systems are easier to tune than LLM confidence scores — there is no ambiguity about what fired the decision when a deterministic rule tree is the engine. The tuner itself uses an LLM to read the data, but the agent it tunes can be a decision tree, a logistic regression, or a transformer. One adaptation: replace the confidence field in the action log with the rule or rule combination that triggered the decision. The tuner clusters by rule rather than by confidence range, and the rest works the same.
How do you stop the tuner from over-fitting to recent data?
The 4-week rolling window is the primary defense — long enough to absorb temporal variation, short enough to react to real shifts. The 3-proposal cap and 20% per-threshold cap add the damping. Domains with strong seasonality (retail, fraud, anything tax-cycle adjacent) extend the window to 6 or 8 weeks. The window is a constraint, not a default.
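
As a small illustration, the rolling window reduces to a timestamp filter with the window length as a parameter. The `createdAt` field name and the function name are assumptions.

```typescript
// Keeps only entries created inside the last `windowWeeks` weeks.
// Works on any record with an ISO 8601 `createdAt` string (assumed field name).
function withinWindow<T extends { createdAt: string }>(
  entries: T[],
  now: Date,
  windowWeeks: number = 4,
): T[] {
  const msPerWeek = 7 * 24 * 60 * 60 * 1000;
  const cutoff = now.getTime() - windowWeeks * msPerWeek;
  return entries.filter((e) => {
    const t = new Date(e.createdAt).getTime();
    return t >= cutoff && t <= now.getTime();
  });
}
```

A seasonal domain would call it with `windowWeeks` set to 6 or 8 instead of relying on the 4-week value.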
Can the system get too conservative over time?
Threshold creep is a real failure mode. The tuner optimizes against dismissals, dismissals drop, and the system slowly tightens until it suppresses real signals. Defend by tracking escalation rate alongside dismissal rate. If escalations fall below historical norms while dismissals fall, the loop has gone too far. Bake the trade-off into the tuner prompt explicitly. Watch both numbers, not one.
How does this differ from RLHF fine-tuning?
RLHF updates model weights using preference data. This loop updates threshold configuration using behavioral data. No GPU required. No model re-deployment. The tuner reads structured decision logs and proposes config changes; the model itself never changes. For most production triage, routing, and classification agents, threshold miscalibration is the primary failure mode, not the model's underlying capability. Fix the calibration before deciding you need a fine-tune.
What does a healthy week-8 tuner output look like?
Ideally: no proposals, systemHealth: 'healthy'. The tuner's job is to run out of things to fix, not to keep finding problems. If the tuner is still generating 3 proposals in week 8, either the task domain is genuinely unstable (fine — the loop is doing its job) or the minClusterSize threshold is too low and the system is flagging noise as signal. Raise minClusterSize from 5 to 8 and see if proposals disappear. Silence is the success state.
Self-improving agents are not a research problem. They are a plumbing problem. The agents in production today that stay sharp are not running clever architectures. They are running boring, well-instrumented feedback loops. An audit log with the right schema. A tuner on a weekly schedule. A human at the gate. Eight weeks.
There is a counterpoint worth naming, because it is the failure mode this architecture does not solve. The loop assumes the underlying task is stable. If your business logic shifts every quarter — pricing rules, compliance rewrites, product catalog churn — automated threshold tuning will mask the need for a real architecture revision. The agent will converge beautifully on a target that is no longer the target. The loop is a precision tool for a stable domain. It is not a substitute for rethinking a misaligned system.
The team is already producing the calibration data. The infrastructure to use it is the work.