Self-Improving Agents: The Feedback Loop That Tunes Itself

Your Agents Are Already Generating Their Calibration Data. You Are Throwing It Away.

Every dismiss, modify, and escalate is a labeled training signal. Most teams log it as a debug artifact and move on. Here are the audit schema, the weekly tuner, and the human approval gate that turn that signal into thresholds that converge in eight weeks.

AI Engineering Platform · advanced · Nov 11, 2025 · 5 min read

By Viktor Bezdek · VP Engineering, Groupon

Ship an agent. It triages tickets, flags transactions, routes leads. Week one, it works. Week three, the team is dismissing 40% of its alerts. Week six, someone has built a private spreadsheet to track which alerts are worth opening.

That spreadsheet is the failure mode and the fix in the same place. The dismiss, modify, and escalate actions your reviewers already perform are labeled training data. The model is generating the calibration signal it needs, every day, in production. Most teams log that signal as a debug artifact and forget about it.

This is the architecture that uses it. An audit log that captures every decision with the threshold that triggered it. A weekly tuner agent that reads four weeks of human responses and proposes evidence-backed threshold changes. A human approval gate that keeps the system accountable. Eight weeks in, the agent matches your team's judgment without anyone editing a rule.

Drift Is the Default. Manual Tuning Is the Tax.

The world moves faster than the prompt. Without a feedback mechanism, accuracy decays from the day you deploy.

Drift is structural, not exceptional. The fraud patterns that defined last quarter's training data are not this quarter's. The triage agent calibrated for 200 daily tickets behaves differently at 2,000. The world keeps moving. The prompt does not.

Manual tuning is the obvious response and a bad one. It depends on someone noticing — usually a frustrated stakeholder filing a complaint long after dismissals have normalized. It then routes through an engineer who has to diagnose, adjust, redeploy. And every fix addresses the symptom that escalated, not the structure underneath it. The cycle is whack-a-mole because the inputs are anecdotes.

A 2025 ISACA analysis tracked self-modifying AI systems with no structured feedback mechanism and found roughly 3x higher rates of unexpected behavioral change versus systems with formal oversight — the magnitude varies by system type but the direction does not.^[5] The fix is not heroics. It is plumbing. Every human response to an agent decision becomes structured data. A second agent reads that data and proposes calibrations. A human approves them. The loop closes.

~40%

Alert dismissal rate by week 3 with no tuning loop in place — varies by domain

~3x

Higher drift incidence in self-modifying systems with no formal feedback mechanism (ISACA, 2025)

6-10 weeks

Convergence window for a properly instrumented self-tuning loop

The Audit Log Is the Substrate. Get the Schema Right or Tune Nothing.

The tuner is only as good as what the log captured. Every record needs decision, evidence, and human response in one place.

Before you can tune anything, you need something to tune from. That means logging every agent decision with enough structure to reconstruct why it fired and what happened next.

The audit log is not a debug log. A debug log records what code ran. The audit log records what the agent decided, what evidence it used, and how a human responded. Three fields, one row, queryable for pattern analysis.

schemas/action-log.ts

interface ActionLogEntry {
  id: string;
  timestamp: string;                // ISO 8601
  agentId: string;                  // Which agent fired the decision
  sessionId: string;                // Groups related decisions in one run

  // What the agent decided
  decision: {
    action: string;                 // e.g. "flag", "escalate", "auto-resolve"
    confidence: number;             // 0-1 score from the model
    thresholdUsed: number;          // The threshold that gated the action
    reasoning: string;              // Short rationale captured at decision time
  };

  // Evidence the agent considered
  context: {
    inputHash: string;              // Dedup key for the input
    features: Record<string, number>; // Scored features that drove the decision
    matchedRules: string[];         // Rules or patterns that matched
  };

  // Human response (filled async, never blocks the agent)
  humanResponse?: {
    action: "approve" | "dismiss" | "modify" | "escalate";
    respondedAt: string;
    responderId: string;
    modifiedAction?: string;        // What they changed it to, if anything
    note?: string;                  // Free-form rationale, optional
    timeToRespond: number;          // Seconds from decision to human response
  };
}

Store the entries somewhere queryable with time-range indexing. A PostgreSQL table with JSONB columns covers most teams. High-volume systems should partition by week — the tuner only ever reads the last four.

The human response field starts empty and gets backfilled as reviewers process their queue. That asynchrony is the whole point. The agent does not block waiting for a human. The human responds when they get to it. The tuner reads the joined record on a weekly cadence. Every layer runs at the speed it can sustain.
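The asynchronous backfill can be sketched as follows. This is a minimal in-memory illustration, not the storage layer described here: `recordDecision`, `recordHumanResponse`, and the object-backed `store` are hypothetical names, and the entry type is trimmed to the fields the backfill touches.

```typescript
// Trimmed version of the ActionLogEntry schema; only the fields
// needed to illustrate the async backfill are kept.
type HumanAction = "approve" | "dismiss" | "modify" | "escalate";

interface HumanResponse {
  action: HumanAction;
  respondedAt: string;   // ISO 8601
  responderId: string;
  timeToRespond: number; // seconds from decision to human response
}

interface LogEntry {
  id: string;
  timestamp: string;     // ISO 8601, set when the agent decided
  humanResponse?: HumanResponse;
}

// Hypothetical in-memory store standing in for the real table.
const store: Record<string, LogEntry> = {};

// The agent writes immediately and never waits for a reviewer.
function recordDecision(id: string, timestamp: string): void {
  store[id] = { id, timestamp };
}

// Called later from the review workflow. timeToRespond is derived
// from the two timestamps instead of being supplied by the client.
function recordHumanResponse(
  id: string,
  action: HumanAction,
  responderId: string,
  respondedAt: string
): LogEntry {
  const entry = store[id];
  if (!entry) throw new Error(`unknown log entry: ${id}`);
  const elapsedMs = Date.parse(respondedAt) - Date.parse(entry.timestamp);
  entry.humanResponse = {
    action,
    respondedAt,
    responderId,
    timeToRespond: Math.round(elapsedMs / 1000),
  };
  return entry;
}

recordDecision("d-1", "2025-11-03T09:00:00Z");
const updated = recordHumanResponse("d-1", "dismiss", "reviewer-7", "2025-11-03T10:30:00Z");
console.log(updated.humanResponse?.timeToRespond); // 5400
```

Deriving `timeToRespond` on the server side keeps the field consistent with the two timestamps the tuner will also read.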

The Loop, End to End

How dismissals, modifications, and escalations move from human reviewers back into the threshold configuration the agent reads at runtime.

The Self-Improving Agent Loop

Decisions flow into the audit log. The weekly tuner reads four weeks of joined data, proposes changes, and routes them through a human approval gate before anything reaches the live config.

Four components, four cadences. Decoupling them is the architecture.

The Primary Agent runs in real time. It reads the current threshold config, makes decisions, writes every one to the action log. It never blocks on a human.

The Action Log accumulates decision records. Reviewer responses backfill async — most teams see the human field populated within 24-48 hours.

The Tuner Agent runs once a week, typically Sunday night or Monday morning. It reads four weeks of joined data, clusters dismissals and escalations, and emits a structured proposal with evidence counts.

The Approval Interface routes the proposals to a designated approver — a team lead or ops owner. They accept, reject, or modify each one. Nothing reaches the live config without explicit human sign-off.
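A minimal sketch of the primary agent's side of the loop, assuming the agent flags an item when model confidence meets or exceeds a named threshold. That gating rule, the `decide` function, the `configVersion` field, and the `fraud.flag` threshold name are illustrative assumptions, not details of the system described here. The point it shows is that the threshold is read at decision time and recorded with the decision, so an approved change takes effect without a deploy.

```typescript
interface ThresholdConfig {
  version: number;
  thresholds: Record<string, number>;
}

interface DecisionRecord {
  action: "flag" | "auto-resolve";
  confidence: number;
  thresholdUsed: number;
  configVersion: number; // hypothetical extra field for attribution
}

// Hypothetical live config; in practice this would be loaded from
// config/thresholds.json or a config service.
let liveConfig: ThresholdConfig = { version: 1, thresholds: { "fraud.flag": 0.7 } };

function decide(thresholdName: string, confidence: number): DecisionRecord {
  const threshold = liveConfig.thresholds[thresholdName];
  if (threshold === undefined) {
    throw new Error(`no threshold configured for ${thresholdName}`);
  }
  return {
    action: confidence >= threshold ? "flag" : "auto-resolve",
    confidence,
    thresholdUsed: threshold, // logged so the tuner can see what gated the action
    configVersion: liveConfig.version,
  };
}

const before = decide("fraud.flag", 0.75); // flagged under threshold 0.7

// An approved proposal lands as a config swap, not a code change.
liveConfig = { version: 2, thresholds: { "fraud.flag": 0.8 } };
const after = decide("fraud.flag", 0.75);  // same input, now auto-resolved
```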

Anecdote-Driven

An engineer reviews alert quality monthly, when they remember
Threshold changes ride a code deploy
No structured record of which alerts get dismissed
Drift surfaces through stakeholder complaints
Each fix patches a single symptom

Loop-Driven

The tuner reads four weeks of joined responses every week
Threshold changes ship via config flip after approval, no deploy
Every dismiss, modify, and escalate is captured with the threshold that fired
Drift surfaces in pattern clusters before a human notices
Proposals address recurring clusters with cited evidence

The Tuner Is an Analyst, Not a Fine-Tune

An LLM reading structured data and producing structured proposals. Three phases. A hard cap on output.

The tuner is not a fine-tune. It is not a training run. It is an LLM analyst that reads structured logs and produces structured proposals. Treat it as the data analyst who works exclusively on your agent's behavior, on a weekly schedule, with no other distractions.

Three phases per cycle: pattern detection, root cause clustering, proposal generation.

We got the third phase wrong on the first build. We let the tuner emit as many proposals as it found patterns for, assuming throughput would speed convergence. It did the opposite. Approvers waved through too many changes in a single week, and when missed escalations spiked the following week, no one could attribute the regression to a specific change. The 3-proposal cap looked arbitrary until we ran it. The cap forces the tuner to rank by expected impact, which forces it to write better proposals. The constraint is the leverage point.

[01]

Aggregate the Response Patterns

typescript

// Phase 1: Pattern Detection
const fourWeeks = await actionLog.query({
  from: subWeeks(now, 4),
  to: now,
  hasHumanResponse: true
});

const patterns = {
  falsePositives: fourWeeks.filter(
    e => e.humanResponse?.action === "dismiss"
  ),
  missedSignals: fourWeeks.filter(
    e => e.humanResponse?.action === "escalate"
  ),
  modifications: fourWeeks.filter(
    e => e.humanResponse?.action === "modify"
  ),
  approvals: fourWeeks.filter(
    e => e.humanResponse?.action === "approve"
  )
};

[02]

Cluster Dismissals and Escalations by Feature Similarity

typescript

// Phase 2: Root Cause Clustering
const dismissalClusters = clusterByFeatures(
  patterns.falsePositives,
  { minClusterSize: 5, similarityThreshold: 0.8 }
);

const escalationClusters = clusterByFeatures(
  patterns.missedSignals,
  { minClusterSize: 3, similarityThreshold: 0.7 }
);

// Looser settings for missed signals: a smaller minimum cluster
// and a lower similarity bar, so escalation patterns surface sooner.
// Missing a real one costs more than a false alarm.
// The asymmetry lives in the clustering, not the prompt.
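The implementation of `clusterByFeatures` is not shown in this piece. One possible shape, offered as a sketch under assumptions, is a greedy single-pass clustering over cosine similarity of the scored features. The `Entry` and `Cluster` types and the choice of cosine similarity are assumptions made for illustration; a production tuner might use a proper clustering library instead.

```typescript
type Features = Record<string, number>;

interface Entry {
  id: string;
  features: Features;
}

interface ClusterOptions {
  minClusterSize: number;
  similarityThreshold: number; // minimum cosine similarity to join a cluster
}

interface Cluster {
  seed: Entry;      // first member; later entries are compared against it
  members: Entry[];
}

// Cosine similarity over named features; a key missing on one side counts as 0.
function cosineSimilarity(a: Features, b: Features): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const key of Object.keys(a)) {
    const x = a[key] ?? 0;
    dot += x * (b[key] ?? 0);
    normA += x * x;
  }
  for (const key of Object.keys(b)) {
    const y = b[key] ?? 0;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Each entry joins the first cluster whose seed is similar enough,
// otherwise it starts a new cluster. Small clusters are dropped at the end.
function clusterByFeatures(entries: Entry[], opts: ClusterOptions): Cluster[] {
  const clusters: Cluster[] = [];
  for (const entry of entries) {
    let placed = false;
    for (const cluster of clusters) {
      if (cosineSimilarity(cluster.seed.features, entry.features) >= opts.similarityThreshold) {
        cluster.members.push(entry);
        placed = true;
        break;
      }
    }
    if (!placed) clusters.push({ seed: entry, members: [entry] });
  }
  return clusters.filter(c => c.members.length >= opts.minClusterSize);
}

const sample: Entry[] = [
  { id: "a", features: { amount: 1, velocity: 0 } },
  { id: "b", features: { amount: 0.9, velocity: 0.1 } },
  { id: "c", features: { amount: 0, velocity: 1 } },
];
const found = clusterByFeatures(sample, { minClusterSize: 2, similarityThreshold: 0.9 });
// found holds one cluster containing "a" and "b"; "c" stands alone and is filtered out.
```

Greedy seeding is order-dependent, which is acceptable for a weekly analyst pass but worth replacing if cluster stability across runs matters.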

[03]

Emit Proposals With Evidence Attached

typescript

// Phase 3: Proposal Generation
const proposals = await tunerLLM.generate({
  system: TUNER_SYSTEM_PROMPT,
  data: {
    dismissalClusters,
    escalationClusters,
    currentThresholds,
    weeklyTrends: computeTrends(fourWeeks)
  },
  outputSchema: ProposalSchema
});
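`computeTrends` is called above but not defined in this piece. A minimal sketch, assuming it buckets responded entries into weeks counted back from the run time and reports dismissal and escalation rates per bucket. The simplified `RespondedEntry` input, the `WeekTrend` output shape, and the explicit `now` parameter are assumptions for illustration.

```typescript
type HumanAction = "approve" | "dismiss" | "modify" | "escalate";

interface RespondedEntry {
  timestamp: string; // ISO 8601 decision time
  action: HumanAction;
}

interface WeekTrend {
  weeksAgo: number;  // 0 = most recent seven days
  total: number;
  dismissalRate: number;
  escalationRate: number;
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function computeTrends(entries: RespondedEntry[], now: Date, weeks = 4): WeekTrend[] {
  const buckets: { total: number; dismiss: number; escalate: number }[] = [];
  for (let i = 0; i < weeks; i++) buckets.push({ total: 0, dismiss: 0, escalate: 0 });

  for (const entry of entries) {
    const ageMs = now.getTime() - Date.parse(entry.timestamp);
    if (ageMs < 0) continue; // ignore entries stamped in the future
    const bucket = buckets[Math.floor(ageMs / WEEK_MS)];
    if (!bucket) continue;   // older than the window
    bucket.total += 1;
    if (entry.action === "dismiss") bucket.dismiss += 1;
    if (entry.action === "escalate") bucket.escalate += 1;
  }

  return buckets.map((b, weeksAgo) => ({
    weeksAgo,
    total: b.total,
    dismissalRate: b.total === 0 ? 0 : b.dismiss / b.total,
    escalationRate: b.total === 0 ? 0 : b.escalate / b.total,
  }));
}

const trends = computeTrends(
  [
    { timestamp: "2025-11-08T12:00:00Z", action: "dismiss" },
    { timestamp: "2025-11-07T00:00:00Z", action: "approve" },
    { timestamp: "2025-10-30T00:00:00Z", action: "dismiss" },
  ],
  new Date("2025-11-10T00:00:00Z")
);
// trends[0] covers the latest week: 2 entries, dismissalRate 0.5
// trends[1] covers the week before: 1 entry, dismissalRate 1
```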

The Tuner Prompt: Constraints In, Proposals Out

The system prompt is where the tuner becomes a tool instead of an opinion. Hard rules, structured output, no narrative.

prompts/tuner-system.txt

You are a threshold tuning analyst for an AI agent system.

Your job: read 4 weeks of agent decision logs and propose
threshold adjustments that cut false positives without
lifting missed signals.

INPUT:
- Clusters of dismissed decisions (false positives)
- Clusters of escalated decisions (missed signals)
- Current threshold configuration
- Week-over-week trend data

RULES:
1. No proposal without at least 5 cited log entries
2. Each proposal must include: current value, proposed value,
   expected impact, supporting evidence count
3. Flag any proposal that might lift missed signals
4. If dismissal rate < 15%, propose nothing (system is healthy)
5. Hard cap: 3 proposals per cycle
6. Confidence label per proposal: low / medium / high

OUTPUT FORMAT:
{
  proposals: [{
    thresholdName: string,
    currentValue: number,
    proposedValue: number,
    direction: "increase" | "decrease",
    confidence: "low" | "medium" | "high",
    expectedImpact: string,
    evidenceCount: number,
    sampleEntryIds: string[],
    riskAssessment: string
  }],
  summary: string,
  systemHealth: "healthy" | "needs-attention" | "degraded"
}
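Prompt rules are instructions, not guarantees, so it can help to re-check the mechanical ones in code after the tuner responds. The sketch below re-applies rules 1, 4, and 5 from the prompt. The `enforceRules` function is a hypothetical name, the `Proposal` type is trimmed to the fields it needs, and the sketch assumes the tuner emits proposals already ranked by expected impact, so the cap keeps the first three.

```typescript
interface Proposal {
  thresholdName: string;
  currentValue: number;
  proposedValue: number;
  evidenceCount: number;
}

const MIN_EVIDENCE = 5;             // rule 1: at least 5 cited log entries
const HEALTHY_DISMISSAL_RATE = 0.15; // rule 4: below this, propose nothing
const MAX_PROPOSALS = 3;            // rule 5: hard cap per cycle

function enforceRules(proposals: Proposal[], dismissalRate: number): Proposal[] {
  if (dismissalRate < HEALTHY_DISMISSAL_RATE) return [];
  const backed = proposals.filter(p => p.evidenceCount >= MIN_EVIDENCE);
  return backed.slice(0, MAX_PROPOSALS); // assumes input is ranked by impact
}

const candidates: Proposal[] = [
  { thresholdName: "fraud.flag", currentValue: 0.7, proposedValue: 0.75, evidenceCount: 42 },
  { thresholdName: "lead.route", currentValue: 0.5, proposedValue: 0.55, evidenceCount: 2 }, // thin evidence
  { thresholdName: "ticket.escalate", currentValue: 0.6, proposedValue: 0.55, evidenceCount: 17 },
  { thresholdName: "ticket.autoResolve", currentValue: 0.9, proposedValue: 0.85, evidenceCount: 9 },
  { thresholdName: "refund.flag", currentValue: 0.4, proposedValue: 0.45, evidenceCount: 6 }, // over the cap
];

const kept = enforceRules(candidates, 0.32); // three proposals survive
const none = enforceRules(candidates, 0.1);  // healthy system, nothing proposed
```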

The Approval Gate Is What Stops Recursive Drift

Tuner proposes. Human disposes. Without that gate, the system optimizes against itself.

The approval gate is the structural difference between a self-improving system and an unsupervised one. The tuner emits proposals. A human accepts, rejects, or modifies. The gate is not decoration — it is the mechanism that keeps the system from optimizing against criteria it set for itself.

A 2026 OneReach AI enterprise study tracked agentic AI deployments and found that systems with human-in-the-loop oversight ran roughly 60% fewer production incidents than fully autonomous deployments — the spread varies by use case but the pattern holds.^[4] The approver does not need to read every statistical detail. They need to answer one question: does this change match how we want the system to behave?

Field	Source	Why it is in the view
Threshold name	Tuner proposal	Names the decision boundary that moves
Current value	Live config	The baseline being changed
Proposed value	Tuner analysis	The new threshold
Evidence count	Action log query	How many log entries back the proposal
Sample entries	Action log	3-5 representative dismissed or escalated cases
Expected impact	Tuner estimate	Predicted shift in false positive or miss rate
Risk assessment	Tuner analysis	What this change might break
Approver decision	Human input	Accept, reject, or modify with rationale captured
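The last row of that view is what changes the live config. A minimal sketch of applying reviewed proposals, assuming thresholds live in a flat name-to-number map: accepted proposals take the tuner's value, modified ones take the approver's value, and rejected ones leave the config untouched. The `ApproverDecision` union and `applyReviewed` are hypothetical names.

```typescript
type Thresholds = Record<string, number>;

type ApproverDecision =
  | { kind: "accept"; rationale: string }
  | { kind: "reject"; rationale: string }
  | { kind: "modify"; value: number; rationale: string };

interface ReviewedProposal {
  thresholdName: string;
  proposedValue: number;
  decision: ApproverDecision;
}

// Returns a new config; the live object is never mutated, which keeps
// the previous version available for rollback.
function applyReviewed(live: Thresholds, reviewed: ReviewedProposal[]): Thresholds {
  const next: Thresholds = { ...live };
  for (const item of reviewed) {
    const decision = item.decision;
    switch (decision.kind) {
      case "accept":
        next[item.thresholdName] = item.proposedValue;
        break;
      case "modify":
        next[item.thresholdName] = decision.value;
        break;
      case "reject":
        break; // nothing reaches the config without sign-off
    }
  }
  return next;
}

const liveThresholds: Thresholds = { "fraud.flag": 0.7, "lead.route": 0.5, "ticket.escalate": 0.3 };
const reviewedBatch: ReviewedProposal[] = [
  { thresholdName: "fraud.flag", proposedValue: 0.8, decision: { kind: "accept", rationale: "matches reviewer notes" } },
  { thresholdName: "lead.route", proposedValue: 0.6, decision: { kind: "reject", rationale: "seasonal spike" } },
  { thresholdName: "ticket.escalate", proposedValue: 0.4, decision: { kind: "modify", value: 0.35, rationale: "smaller step" } },
];
const nextThresholds = applyReviewed(liveThresholds, reviewedBatch);
```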

Eight Weeks to Convergence. The Phases Are Predictable.

Two months of weekly cycles, four phases. Knowing the curve is what stops teams from killing the loop in week three.

The convergence pattern is predictable enough to plan around. The reason teams need to know it: phase one looks like nothing is working, and the temptation to pull the plug peaks exactly when the loop is collecting the data it needs. Set expectations on the curve, not the first cycle.

[01]

Weeks 1-2: The Loop Is Logging, Not Tuning

The system writes decisions and human responses. It proposes nothing. This is the dataset the tuner needs for its first read. Dismissal rates stay high because the agent is running on its original, untuned thresholds and nothing is changing yet. The phase looks like inaction. It is the substrate.

[02]

Weeks 3-4: First Cuts, High Confidence

The tuner runs its first cycle on two weeks of data. Early proposals tend to be the obvious ones: large dismissal clusters, clear feature patterns. False positives often drop 15-25% in this phase, though the magnitude depends entirely on how miscalibrated the original thresholds were.

[03]

Weeks 5-6: Smaller Clusters, Tighter Calls

The tuner now has four weeks of data, including post-adjustment performance, so it can measure the impact of its own earlier proposals. Proposals get more specific: smaller clusters, tighter confidence intervals. False positive rates often drop another 10-15%. Watch for over-correction surfacing as new escalation clusters.

[04]

Weeks 7-8: The Tuner Goes Quiet

Equilibrium. Dismissal rates settle below 15%. The tuner starts proposing nothing, which is the correct output for a healthy system. The approver shifts from active reviewer to exception-based oversight. The loop is now calibrated to your team's judgment.

Reference Layout

Where the four components live in the repo.

Self-Improving Agent Project

tree

self-improving-agent/
├── src/
│   ├── agent/
│   │   ├── primary-agent.ts
│   │   ├── decision-engine.ts
│   │   └── threshold-config.ts
│   ├── tuner/
│   │   ├── tuner-agent.ts
│   │   ├── pattern-detector.ts
│   │   ├── proposal-generator.ts
│   │   └── prompts/tuner-system.txt
│   ├── audit/
│   │   ├── action-log.ts
│   │   ├── schemas.ts
│   │   └── migrations/
│   └── approval/
│       ├── approval-api.ts
│       └── notification.ts
├── config/
│   ├── thresholds.json
│   └── tuner-schedule.json
└── tests/
    ├── tuner.test.ts
    ├── pattern-detector.test.ts
    └── approval-flow.test.ts

Guardrails: Bound the Loop or It Will Run Away

A self-tuning system without limits is a system that will eventually optimize itself off a cliff. These are the bounds that keep it reversible.

Bounds on the Loop

[01]

Maximum 3 threshold changes per cycle

Caps blast radius. Keeps attribution clean — when something regresses, you can name the change that did it.

[02]

No threshold moves more than 20% in a single cycle

Prevents overnight personality changes. Large corrections spread across multiple cycles, which is what convergence actually looks like.

[03]

Every change passes through human approval before it goes live

The tuner proposes. It does not deploy. The gate is the mechanism, not a courtesy.

[04]

Auto-rollback if error rate exceeds baseline by 10%

Post-change degradation past tolerance restores the previous config without waiting for a human. Fast escape hatch, no questions.

[05]

Full audit trail of every proposal, approval, rejection, and rollback

Replay fidelity for compliance and debugging. Every change links back to the evidence that produced it.

[06]

The tuner cannot rewrite its own evaluation criteria

Engineers set the meta-rules. The tuner does not. Without this bound, the loop optimizes the loop, and recursive drift sets in.
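Two of these bounds are simple enough to sketch directly. The example below clamps a proposed value to a 20% move and checks the rollback trigger. Both functions are hypothetical, and two readings are assumed: the 20% cap is taken relative to the current value, and "exceeds baseline by 10%" is taken as a relative increase rather than ten percentage points.

```typescript
const MAX_STEP = 0.2;            // bound 02: no more than 20% per cycle
const ROLLBACK_TOLERANCE = 0.1;  // bound 04: 10% over baseline (assumed relative)

// Limits a proposed threshold to at most MAX_STEP away from the current
// value, measured relative to the current value.
function clampStep(current: number, proposed: number): number {
  const maxDelta = Math.abs(current) * MAX_STEP;
  const delta = proposed - current;
  if (Math.abs(delta) <= maxDelta) return proposed;
  return current + (delta > 0 ? maxDelta : -maxDelta);
}

// True when the post-change error rate has degraded past tolerance and
// the previous config should be restored.
function shouldRollback(baselineErrorRate: number, currentErrorRate: number): boolean {
  return currentErrorRate > baselineErrorRate * (1 + ROLLBACK_TOLERANCE);
}

const clamped = clampStep(0.5, 0.8);          // large jump is cut back to roughly 0.6
const untouched = clampStep(0.5, 0.55);       // within bounds, passes through as 0.55
const rollback = shouldRollback(0.05, 0.056); // true: above 0.055
const hold = shouldRollback(0.05, 0.054);     // false: within tolerance
```

A large correction that gets clamped simply reappears as a smaller proposal in the next cycle, which matches the multi-cycle convergence described above.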

What to Watch

Four numbers that tell you the loop is healthy or that something has slipped.

< 15%

Dismissal rate after stabilization

90%

Proposal acceptance rate by approvers

< 4 hrs

Proposal-to-approval latency

Missed critical escalations per week

Build Order

Sequenced so each step unblocks the next. Skip nothing.

Pre-Production Build Order

Action log schema defined with decision, context, and human response fields
Action log storage deployed with time-range indexing
Primary agent instrumented to write every decision to the log
Human response capture wired into the existing review workflow
Pattern detection with clustering for dismissals and escalations implemented
Tuner system prompt written and tested against sample data
Approval interface built with evidence display attached to every proposal
Auto-rollback triggers configured with explicit tolerance bands
Two-week baseline collection period scheduled before first tuner run
First tuner cycle executed with full team review
Eight-week convergence tracked and the final threshold config locked in

Frequently Asked Questions

What if our team does not respond to enough alerts to generate useful data?

Below roughly 60% response rate the patterns get noisy and the tuner starts proposing on thin evidence. Fix the input before fixing the loop. Replace full triage with a binary thumbs-up/thumbs-down on each alert. Even that signal is enough for the tuner to identify the worst false positive clusters. A complete record of weak signals beats a sparse record of strong ones.
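A quick way to check whether the input is good enough is to compute the response rate before each tuner run and skip the run when it falls short. This is a sketch under assumptions: `responseRate` and `hasEnoughSignal` are hypothetical helpers, and the 0.6 cutoff is the rough figure from the answer above, not a hard constant.

```typescript
interface LoggedDecision {
  id: string;
  humanResponse?: { action: string };
}

const MIN_RESPONSE_RATE = 0.6; // "roughly 60%"; adjust per team

// Share of logged decisions that a human has responded to.
function responseRate(entries: LoggedDecision[]): number {
  if (entries.length === 0) return 0;
  const responded = entries.filter(e => e.humanResponse !== undefined).length;
  return responded / entries.length;
}

function hasEnoughSignal(entries: LoggedDecision[]): boolean {
  return responseRate(entries) >= MIN_RESPONSE_RATE;
}

const week: LoggedDecision[] = [
  { id: "1", humanResponse: { action: "dismiss" } },
  { id: "2" },
  { id: "3", humanResponse: { action: "approve" } },
  { id: "4" },
  { id: "5" },
];
const enough = hasEnoughSignal(week); // false: only 2 of 5 have a response
```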

Does this work for rule-based agents, not just LLMs?

Yes. The loop is agent-architecture agnostic. Rule-based systems are easier to tune than LLM confidence scores — there is no ambiguity about what fired the decision when a deterministic rule tree is the engine. The tuner itself uses an LLM to read the data, but the agent it tunes can be a decision tree, a logistic regression, or a transformer. One adaptation: replace the confidence field in the action log with the rule or rule combination that triggered the decision. The tuner clusters by rule rather than by confidence range, and the rest works the same.

How do you stop the tuner from over-fitting to recent data?

The 4-week rolling window is the primary defense — long enough to absorb temporal variation, short enough to react to real shifts. The 3-proposal cap and 20% per-threshold cap add the damping. Domains with strong seasonality (retail, fraud, anything tax-cycle adjacent) extend the window to 6 or 8 weeks. The window is a constraint, not a default.

What does it mean when the tuner and the approver consistently disagree?

Persistent rejection is a signal that the tuner's system prompt is stale, not that the approver is wrong. Track rejection rate and rationale. After three consecutive rejections of the same proposal type, feed the rejection reasoning back into the tuner prompt as a new constraint. That is the meta-feedback loop — the tuner gets tuned by the same mechanism it tunes everything else with.
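The three-rejection rule can be tracked mechanically. A minimal sketch, assuming reviews arrive in chronological order and that a "proposal type" can be expressed as a string key such as threshold name plus direction. `Review`, `proposalType`, and `constraintsFromRejections` are hypothetical names.

```typescript
interface Review {
  proposalType: string; // hypothetical grouping key, e.g. "fraud.flag:increase"
  accepted: boolean;
  rationale: string;
}

const REJECTION_STREAK = 3;

// Walks reviews oldest-first and returns, per proposal type, the rationales
// of the current unbroken rejection streak once it reaches REJECTION_STREAK.
function constraintsFromRejections(reviews: Review[]): Record<string, string[]> {
  const streaks: Record<string, string[]> = {};
  for (const review of reviews) {
    if (review.accepted) {
      streaks[review.proposalType] = []; // an acceptance resets the streak
    } else {
      const current = streaks[review.proposalType] ?? [];
      current.push(review.rationale);
      streaks[review.proposalType] = current;
    }
  }
  const result: Record<string, string[]> = {};
  for (const type of Object.keys(streaks)) {
    const rationales = streaks[type] ?? [];
    if (rationales.length >= REJECTION_STREAK) result[type] = rationales;
  }
  return result;
}

const newConstraints = constraintsFromRejections([
  { proposalType: "fraud.flag:increase", accepted: false, rationale: "holiday volume" },
  { proposalType: "lead.route:decrease", accepted: false, rationale: "too aggressive" },
  { proposalType: "fraud.flag:increase", accepted: false, rationale: "holiday volume again" },
  { proposalType: "lead.route:decrease", accepted: true, rationale: "fine at smaller step" },
  { proposalType: "fraud.flag:increase", accepted: false, rationale: "still seasonal" },
  { proposalType: "lead.route:decrease", accepted: false, rationale: "regressed" },
]);
// Only "fraud.flag:increase" qualifies; its three rationales become prompt constraints.
```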

Can the system get too conservative over time?

Threshold creep is a real failure mode. The tuner optimizes against dismissals, dismissals drop, and the system slowly tightens until it suppresses real signals. Defend by tracking escalation rate alongside dismissal rate. If escalations fall below historical norms while dismissals fall, the loop has gone too far. Bake the trade-off into the tuner prompt explicitly. Watch both numbers, not one.
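Watching both numbers can be reduced to one check per cycle. The sketch below flags creep when dismissals have fallen across the window and the latest escalation rate sits well below the historical norm. `detectThresholdCreep`, the `escalationNorm` input, and the 20% tolerance band are assumptions for illustration; how "historical norm" is measured is left to the team.

```typescript
interface WeeklyRates {
  dismissalRate: number;
  escalationRate: number;
}

function detectThresholdCreep(
  weeks: WeeklyRates[],   // oldest first
  escalationNorm: number, // hypothetical: long-run escalation rate
  tolerance = 0.2         // hypothetical: allowed relative drop below the norm
): boolean {
  if (weeks.length < 2) return false;
  const first = weeks[0];
  const last = weeks[weeks.length - 1];
  if (!first || !last) return false;
  const dismissalsFalling = last.dismissalRate < first.dismissalRate;
  const escalationsBelowNorm = last.escalationRate < escalationNorm * (1 - tolerance);
  return dismissalsFalling && escalationsBelowNorm;
}

const creeping = detectThresholdCreep(
  [
    { dismissalRate: 0.4, escalationRate: 0.05 },
    { dismissalRate: 0.25, escalationRate: 0.045 },
    { dismissalRate: 0.12, escalationRate: 0.02 },
  ],
  0.05
); // true: dismissals fell and escalations dropped well under the norm

const healthy = detectThresholdCreep(
  [
    { dismissalRate: 0.4, escalationRate: 0.05 },
    { dismissalRate: 0.14, escalationRate: 0.048 },
  ],
  0.05
); // false: escalations are holding near the norm
```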

Self-improving agents are not a research problem. They are a plumbing problem. The agents in production today that stay sharp are not running clever architectures. They are running boring, well-instrumented feedback loops. An audit log with the right schema. A tuner on a weekly schedule. A human at the gate. Eight weeks.

There is a counterpoint worth naming, because it is the failure mode this article does not solve. The loop assumes the underlying task is stable. If your business logic shifts every quarter — pricing rules, compliance rewrites, product catalog churn — automated threshold tuning will mask the need for a real architecture revision. The agent will converge beautifully on a target that is no longer the target. Treat the loop as a precision tool for a stable domain. It is not a substitute for rethinking a misaligned system.

The team is already producing the calibration data. The infrastructure to use it is the work.

Key terms in this piece

self-improving agents, AI feedback loops, human-in-the-loop AI, agent threshold tuning, self-tuning AI systems, audit schema design, AI agent drift, automated threshold adjustment

Sources

[1] 7 Tips to Build Self-Improving AI Agents With Feedback Loops (datagrid.com)
[2] Autonomous AI Systems: Human-in-the-Loop Design (blog.eduonix.com)
[3] Yohei Nakajima — Better Ways to Build Self-Improving AI Agents (yoheinakajima.com)
[4] Human-in-the-Loop Agentic AI Systems — Enterprise Guide (onereach.ai)
[5] Unseen, Unchecked, Unraveling: Inside the Risky Code of Self-Modifying AI — ISACA (isaca.org)
[6] AI Trends 2026: Test-Time Reasoning and Reflective Agents — Hugging Face (huggingface.co)
[7] Enterprise RLHF Implementation Checklist: Complete Deployment Framework (cleverx.com)
[8] Agent Loop: Adaptive AI Agents — Complete Guide 2026 (gleecus.com)
