Launches get conference talks. Retirements get archived repos and live credentials. Five sequential phases — audit, extract, shadow, communicate, shut down — and the security blast radius when you skip any of them.
Every conference talk covers the launch. Tool calls, context windows, the demo that ships. The end of the lifecycle gets nothing.
Teams that put agents in production through 2025 are now making their first retirement decisions. The customer support agent running on a model two generations behind. The internal research tool with 18 months of edge cases compressed into its system prompt and credentials wired into four systems no one mapped. The QA agent quietly invoking three other agents whose owners have no idea they are being called.
Shutting these down is messier than it looks. An agent is not just code. It is an identity with credentials, a memory store with accumulated patterns, a service with downstream consumers nobody enumerated. Archive the repo and close the ticket and you leave API keys live, service accounts provisioned, and calling agents broken on a delay timer. Security teams have a name for what you just created: ghost agents — retired in intent, live in practice, nobody watching.
Treat retiring an agent like decommissioning a microservice. Version the prompt contract. Run a shadow period. Extract the patterns the agent learned in production before you wipe it. Communicate the deprecation with a sunset date downstream owners can plan against. The structural analogy makes the right behaviors obvious. Skipping any phase has a specific cost, and the cost compounds.
Inaction feels free until the bill arrives as an incident.
Retirement is harder than launch because the cost of doing nothing looks small until it is not.
Five triggers that should be treated as decisions, not vibes:
Performance has structurally degraded. Not a bad week. A trend. The base model is two generations old. Eval scores that looked acceptable six months ago now sit well under the current baseline. The gap between what the agent does and what a successor would do is wide enough to show up in the business numbers. That is a retirement signal, not a tuning task.
The use case has drifted. The workflow the agent was designed for no longer runs the same way. Teams keep agents alive because "it still mostly works" — while the definition of "works" quietly migrated. An agent optimized for a workflow that no longer exists is not an asset. It is maintenance debt with a per-call inference bill.
The base model is being deprecated. Providers set end-of-life dates. When the underlying model retires, every agent built on it migrates or shuts down. Migration means re-evaluating against the new model on your existing eval set — not assuming behavior transfers cleanly. It rarely does.
The business context shifted. A pivot. An acquisition. A process redesign. The workflow this agent was built to accelerate no longer exists in its original form, and rebuilding the agent around the new shape is more expensive than starting over.
Nobody can explain what it does. The most expensive trigger. The original author is gone. Documentation is thin or missing. The agent runs because the agent runs. Continuing to operate a system whose behavior nobody owns is a slow-burn incident waiting for a stakeholder to pull on the wrong thread.
| Signal | Urgency | First step | Trap to avoid |
|---|---|---|---|
| Eval scores drifted >15% below baseline | High — start planning within the sprint | Run successor eval side-by-side on the same eval set | Tuning the old agent instead of comparing to a successor |
| Underlying model reaching end-of-life | Fixed deadline — do not miss it | Shadow new-model agent against production traffic now | Assuming model upgrade = behavior-compatible rollout |
| Original author gone, behavior undocumented | Medium — before the next production incident | Trace log audit to reconstruct dependency map | Letting it run because nothing has broken yet |
| Business process it served no longer exists | High — every call is waste and risk | Owner sign-off, then schedule retirement sprint | "It still mostly works" keeping it alive indefinitely |
| Use case absorbed by a new agent | Moderate — plan during successor rollout | Map which callers depend on the retiring agent | Running both agents in parallel without a cutover date |
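The first row of the table can be turned into a mechanical check. The sketch below assumes the 15% figure is measured relative to the baseline score (the text does not say whether it is relative or absolute), and the function names are illustrative.

```python
def drift_below_baseline(agent_score: float, baseline_score: float) -> float:
    """Return how far the agent sits below baseline, as a fraction of baseline."""
    if baseline_score <= 0:
        raise ValueError("baseline_score must be positive")
    return (baseline_score - agent_score) / baseline_score


def is_retirement_signal(agent_score: float, baseline_score: float,
                         threshold: float = 0.15) -> bool:
    """True when drift exceeds the threshold (assumed relative, 15% by default)."""
    return drift_below_baseline(agent_score, baseline_score) > threshold
```

Run it against the same eval set for both the retiring agent and the candidate successor, so the comparison is between agents rather than between eval sets.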
Software has a vocabulary for shutdown. Agents inherit it.
Software engineering has a developed vocabulary for turning services off: deprecation notices, shadow periods, contract versioning, sunset dates. Patterns for notifying downstream consumers and handling clients who missed the migration window.
Agents need the same vocabulary. The parallel is closer than it looks.
A production agent has a public interface (the prompts it accepts, the outputs it produces), downstream consumers (orchestrators, calling agents, humans wired to its behavior), internal state (memory, embeddings, learned patterns), and credentials granting access to external systems. These map nearly one-for-one onto a microservice's API contract, client applications, database state, and service account permissions.
The analogy breaks in one direction that matters: an agent's "API" — the system prompt — is rarely version-controlled with the rigor applied to an HTTP endpoint. That gap is exactly where retirement goes wrong. Downstream systems depending on a specific output format, a specific tool call pattern, or a specific persona behavior have no contract to reference. When the agent disappears, they break silently. The orchestrator falls back to a default value. Nobody notices for two weeks. By then the trail is cold.
| Informal retirement | Structured retirement |
|---|---|
| Archive repo, delete code, close ticket | Audit documents every consumer, every dependency, every credential |
| Credentials left active in secrets manager indefinitely | Every credential explicitly revoked on retirement day, logged in manifest |
| Calling agents break with no warning and no migration path | Successor registered in the tool catalog before cutover, callers updated |
| System prompt deleted; institutional knowledge gone permanently | System prompt versioned, hashed, archived as a frozen contract |
| Edge cases discovered in production die with the prompt | Few-shot examples and edge cases extracted to a permanent eval dataset |
| Vector store orphaned, storage billing continues forever | Vector store archived or deleted under data retention policy, and confirmed |
Each phase depends on the previous one being complete. Skip one and the cost shows up later.
The retirement pipeline borrows from microservice patterns and adapts them for the shape of an agent. Order matters.
Skip knowledge extraction before the shadow period and you lose institutional memory the moment the prompt is wiped. Run the hard shutdown before credential revocation and you mint a ghost identity with live permissions on Day 0. Each phase absorbs failure modes the next one cannot reach. The order is not arbitrary.
Memory is not a dependency map. Trace logs are.
Query observability — not the team's recollection — for every consumer of this agent: orchestrators, calling agents, webhooks, humans hitting it directly via API. Most teams discover at least one caller they had forgotten about. The audit is what catches the silent ones.
API keys, OAuth tokens, service accounts, federated trust relationships. They live in secrets managers, but rarely in one place. Enumerate the complete set now. You will need it intact during Phase 5 — and you will not get a second chance to find one you missed.
Vector stores, fine-tuning datasets, cached embeddings, long-term memory. Flag PII. Flag retention policies. Each store gets one of three decisions: archive, delete, migrate. Abandonment is not an option — orphaned vector stores keep billing and keep storing whatever they were storing.
For each caller: what breaks on retirement day, and what replaces it? Not every consumer gets a like-for-like successor. Some route to a different agent. Some make direct API calls. Some get nothing. Document the decision either way. Ambiguity here is how production goes down at 3am.
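The audit steps above all feed one artifact, the retirement manifest. A minimal sketch of what that record might look like follows; the class and field names are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field

# The three allowed decisions for a data store; abandonment is not one of them.
DISPOSITIONS = {"archive", "delete", "migrate"}


@dataclass
class Consumer:
    name: str
    replacement: str = ""  # successor agent, "direct-api", or "none"; empty = undecided


@dataclass
class DataStore:
    name: str
    contains_pii: bool
    disposition: str = ""  # must end up in DISPOSITIONS


@dataclass
class Manifest:
    agent: str
    consumers: list = field(default_factory=list)
    credentials: list = field(default_factory=list)  # credential identifiers only
    data_stores: list = field(default_factory=list)

    def gaps(self) -> list:
        """List every decision the audit has not yet recorded."""
        problems = []
        for consumer in self.consumers:
            if not consumer.replacement:
                problems.append(f"consumer {consumer.name}: no replacement decision")
        for store in self.data_stores:
            if store.disposition not in DISPOSITIONS:
                problems.append(f"data store {store.name}: no disposition")
        if not self.credentials:
            problems.append("no credentials inventoried")
        return problems
```

An empty `gaps()` result is a reasonable exit criterion for Phase 1: every caller has a documented outcome and every store has a decision.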
The system prompt is the prompt contract. Hash it. Tag it. Archive it. Other agents and workflows may have been designed against its output format, its tool call patterns, or its persona. The archived contract is what you reference when something breaks six months after the agent is gone.
Anything added reactively to the system prompt is institutional knowledge — patterns the team discovered in production, edge cases that surprised them. Pull these into a permanent eval set before shutdown. The successor benchmarks against it. You stop rediscovering the same lessons from scratch.
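The hash, tag, and archive step for the prompt contract can be sketched with the standard library alone. The record layout and the `freeze_prompt_contract` name are assumptions; where the record gets stored is left open.

```python
import hashlib
from datetime import datetime, timezone


def freeze_prompt_contract(agent: str, version: str, system_prompt: str,
                           archived_at: datetime = None) -> dict:
    """Build an archive record whose hash lets anyone verify the prompt later."""
    digest = hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()
    stamp = archived_at or datetime.now(timezone.utc)
    return {
        "agent": agent,
        "version": version,  # the tag, e.g. a release label
        "sha256": digest,
        "archived_at": stamp.isoformat(),
        "system_prompt": system_prompt,
    }


def verify(record: dict) -> bool:
    """Check that the archived prompt still matches its recorded hash."""
    actual = hashlib.sha256(record["system_prompt"].encode("utf-8")).hexdigest()
    return actual == record["sha256"]
```

Storing the hash next to the text means a later reader can tell whether the archived prompt is the one that actually ran.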
Implicit dependencies do not live in code. They live in the divergence.
The shadow period is the safest transition mechanism in the retirement toolkit, and the most frequently skipped.
The mechanism: before the old agent goes dark, the successor runs in parallel. Same inputs, parallel outputs. The successor's responses are logged but not acted on. You're comparing behavior distributions — not running an A/B test on user experience.
How to run it: route production traffic to both agents simultaneously. Collect both outputs. Score equivalence — semantic similarity on a shared eval set works better than exact string matching for natural-language outputs. The goal is confirming the successor handles the same failure modes, declines the same request categories, and produces outputs in the format downstream systems are parsing.
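The routing described above can be sketched as a thin wrapper. Both agents are modeled here as plain callables and the log as a list, which are simplifying assumptions; in practice the shadow call would likely run asynchronously.

```python
from typing import Callable


def shadow_call(request: str,
                primary: Callable[[str], str],
                shadow: Callable[[str], str],
                log: list) -> str:
    """Serve the retiring agent's answer; record the successor's for comparison."""
    primary_out = primary(request)
    try:
        shadow_out = shadow(request)
        error = None
    except Exception as exc:  # a shadow failure must never reach the caller
        shadow_out, error = None, repr(exc)
    log.append({
        "request": request,
        "primary": primary_out,
        "shadow": shadow_out,
        "shadow_error": error,
    })
    return primary_out  # only the retiring agent's output is acted on
```

The design choice that matters is the return value: the successor's output is logged and scored, never served.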
Shadow mode finds the contracts no documentation captured. One platform team running a shadow period for an invoice processing agent discovered that a downstream orchestrator was extracting a specific field — confidence_score — from the retiring agent's structured output. The successor did not include that field. No automated test had flagged it. The orchestrator degraded gracefully, falling back silently to a default value. The divergence only appeared in the comparison delta between the two outputs. That is exactly what shadow periods are for. The implicit contract that never made it into a schema.
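For structured outputs, a field-level diff would have caught the `confidence_score` case directly. The sketch below assumes both outputs are already parsed into dictionaries; the sample payloads are invented for illustration.

```python
def missing_fields(old_output: dict, new_output: dict, prefix: str = "") -> list:
    """Return dotted paths present in the old output but absent from the new one."""
    missing = []
    for key, value in old_output.items():
        path = f"{prefix}{key}"
        if key not in new_output:
            missing.append(path)
        elif isinstance(value, dict) and isinstance(new_output[key], dict):
            # Recurse so nested fields are checked too.
            missing.extend(missing_fields(value, new_output[key], prefix=f"{path}."))
    return missing


# Hypothetical paired outputs from one shadowed request.
retiring = {"invoice_id": "A1", "total": 10.0, "confidence_score": 0.93,
            "vendor": {"name": "Acme", "tax_id": "X9"}}
successor = {"invoice_id": "A1", "total": 10.0, "vendor": {"name": "Acme"}}
dropped = missing_fields(retiring, successor)
```

Here `dropped` is `["confidence_score", "vendor.tax_id"]`. Any non-empty result is worth tracing to a consumer before cutover.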
Output equivalence scoring is not pass/fail on a single eval run. It is a distribution check: run the eval set daily across the shadow period, track the percentile distribution of semantic similarity scores, and look for the score to stabilize — not just to be above threshold on a single day. Instability at day 14 is a signal the successor is still sensitive to prompt phrasing in ways the retiring agent was not.
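One way to make the distribution check concrete is to summarize each day's similarity scores by percentile and then test whether the tail has settled. The window, tolerance, and threshold values below are assumptions to tune, not recommendations from the text.

```python
import math


def percentile(values: list, p: float) -> float:
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def daily_summary(scores: list) -> dict:
    """Summarize one day's similarity scores by median and lower tail."""
    return {"p10": percentile(scores, 10), "p50": percentile(scores, 50)}


def is_stable(daily_p10: list, window: int = 5,
              tolerance: float = 0.02, threshold: float = 0.80) -> bool:
    """True when the last `window` daily P10 values clear the threshold
    and vary by no more than `tolerance`."""
    if len(daily_p10) < window:
        return False
    recent = daily_p10[-window:]
    return min(recent) >= threshold and max(recent) - min(recent) <= tolerance
```

A single good day passes a threshold check but fails `is_stable`, which is the distinction the paragraph above draws.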
Downstream owners need a contract they can plan against, not a heads-up.
Stakeholders need a structured deprecation notice that answers four questions without requiring a follow-up: what is changing, when it stops, what the migration path is, who to contact when something breaks.
Borrow from the API deprecation playbook. Announce at least 30 days before shutdown for anything with integration dependencies, and treat 30 days as the floor, not the ceiling, for any hard integration such as another agent calling this one or a webhook pointed at the endpoint. If the shutdown will break a calling agent whose owning team needs to update their tool config, test it, and ship the change, 14 days is not lead time. It is a forced incident.
One stakeholder is consistently underestimated: the data and compliance team. They approve the data disposition plan before shutdown begins, not after. Vector stores containing PII have retention and deletion requirements that can extend the retirement timeline by weeks. Find this out in Phase 1, not the day before cutover.
| Stakeholder | Lead time | Channel | What they need |
|---|---|---|---|
| Platform team leads | 30 days | Written deprecation notice | Successor agent, migration guide, sunset date |
| Calling agents / orchestrators | 30 days | Tool catalog update + notice | New tool name, API contract diff, cutover date |
| End users (if directly exposed) | 14 days | In-product notice or email | What changes, when it changes, what replaces it |
| Data / compliance team | 30 days | Async ticket | Data deletion plan, retention confirmation, PII scope |
| On-call / SRE team | 7 days | Runbook update | Updated alert routing, removed dashboards, retired endpoints |
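Working backward from the sunset date gives each stakeholder a latest notice date. The sketch below copies the lead times from the table; the dictionary keys and the sample date are illustrative.

```python
from datetime import date, timedelta

# Minimum lead times in days, taken from the table above.
LEAD_TIME_DAYS = {
    "platform team leads": 30,
    "calling agents / orchestrators": 30,
    "end users": 14,
    "data / compliance team": 30,
    "on-call / SRE team": 7,
}


def notice_deadlines(sunset: date) -> dict:
    """Latest date each stakeholder group can be notified for a given sunset."""
    return {who: sunset - timedelta(days=days)
            for who, days in LEAD_TIME_DAYS.items()}


deadlines = notice_deadlines(date(2026, 6, 30))  # hypothetical sunset date
```

These are floors. Hard integrations and compliance reviews may need notice well before the computed date.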
The phase is short. The window between cutover and revocation is where ghost agents are born.
By this point, the successor is live, stakeholders have acknowledged the deprecation, and the data disposition plan is signed off. The hard shutdown is the shortest phase. All that remains is execution.
Every item below executes on the same day the agent stops receiving traffic. The most common failure mode at this stage: completing the cutover and parking the credential revocation as a follow-up ticket. That gap is the entire mechanism by which ghost agents come into existence. Close it the same day or do not call the agent retired.
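The same-day rule is easy to enforce as a gate on the retirement ticket. This sketch assumes the manifest records a revocation date (or nothing yet) per credential; the credential identifiers are made up.

```python
from datetime import date


def unrevoked(credentials: dict, cutover: date) -> list:
    """Credential ids not revoked by the end of cutover day.

    `credentials` maps a credential id to its revocation date, or None.
    """
    return sorted(
        cred_id for cred_id, revoked_on in credentials.items()
        if revoked_on is None or revoked_on > cutover
    )


def is_retired(credentials: dict, cutover: date) -> bool:
    """An agent counts as retired only when nothing is left unrevoked."""
    return not unrevoked(credentials, cutover)
```

Anything returned by `unrevoked` is a ghost credential in the making, whether it is still live or was revoked late.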
Non-human identities (NHIs) now outnumber human identities 45:1. Informal retirement compounds that ratio with no cleanup mechanism.
The security dimension of agent retirement gets less airtime than the operational one. It is the more urgent dimension.
Machine identities grew from roughly 50,000 per enterprise in 2021 to 250,000 by 2025 — a 400% increase in four years [7]. Non-human identities now outnumber human users by 45-to-1 in the modern enterprise [7]. In cloud-native environments the ratio hits 144-to-1 [7]. The structural problem: organizations apply real lifecycle discipline to human identity (offboarding checklists, IAM reviews, periodic access audits) and almost none of it to machine identity.
An agent retired informally — code deleted, repo archived, ticket closed — does not lose its credentials. The service account it used to hit the CRM is still provisioned. The API key for the webhook integration is still active. The OAuth token scoped to the data warehouse connector is still valid. The agent is gone. The blast radius is not.
Security teams call these ghost identities: credentials from systems that no longer exist, still live, nobody watching. They are a quiet privilege escalation surface. An attacker who finds a valid scoped API key with read access to customer records does not care that it belonged to an agent retired six months ago. They care that it works.
GitGuardian's 2026 State of Secrets Sprawl report put a number on the remediation gap: nearly 70% of credentials confirmed as valid in 2022 were still valid in January 2025, and still above 64% when retested in January 2026 [6]. Nearly three years. Credentials that had outlived the systems they were issued for by years, because nobody revoked them when those systems shut down.
CoSAI's Workstream 4 — published in April 2026 — establishes this as the primary agent lifecycle governance gap: agent decommissioning must include explicit procedures for the disposition of credentials, conversation history, and persistent state, not just the shutdown of the compute process [8]. That is the industry standard emerging now. The teams who ignored it in 2025 are already living with the cleanup.
Each one shows up in the post-mortem of the retirement that skipped it.
Ghost credentials are live attack surface. The window between cutover and revocation is the entire failure mode. Close it the same day.
Prompt contracts have downstream dependencies that surface months after retirement. You need the record to diagnose breakage and to benchmark whatever replaces it.
Shadow mode surfaces parsing dependencies, field extractions, and format assumptions that no code review catches. For integrated agents, it is the only reliable way to find them.
Tool versioning is one of the leading causes of production agent failure. Downstream owners need lead time to update, test, and deploy. 14 days is a forced incident.
Builders are optimistic about their agent's dependencies and scope. Retirement needs someone who will ask uncomfortable questions about what breaks and who knows it.
Not every agent that looks dead should be shut down. The distinction matters.
Retire when:
The underlying model is reaching end-of-life and migration changes behavior materially
The use case it served no longer exists — not just slow, but genuinely gone
The agent's behavior is undocumented and no current team member can explain what it does
A successor covers the same scope with better eval scores and lower cost
The business process it automated has been redesigned beyond what prompt-tuning can address
Do not retire when:
Model upgrade that preserves behavior: this is a rollout, not a retirement. Validate output equivalence, but keep the agent identity
Performance below baseline but recoverable with prompt updates or eval-driven tuning
Low usage agents: low traffic is not zero value. Check what breaks before cutting it
The agent is called by one or two systems that are easy to update — route the callers, don't terminate the agent
Compliance hold: if a regulation requires you to retain the agent's behavior for audit, archive it but do not delete the prompt or eval artifacts
How long should the shadow period last for a production agent?
14 days minimum for internal tools with no external integrations. 30 days for customer-facing agents or anything touching financial data, regulated workflows, or PII. Extend if output divergence sits above 5% on day 14 — do not cut over until you understand what is diverging. Some divergence is expected. Unexplained divergence is a blocker. Track the P50 and P10 similarity scores daily, not just the mean — tail failures are where the real contracts hide.
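The answer above reduces to a small policy function. This is one reading of it: divergence above 5% blocks cutover unless it has been explained. The parameter names are assumptions.

```python
def minimum_shadow_days(customer_facing: bool, touches_financial_data: bool,
                        regulated_workflow: bool, handles_pii: bool) -> int:
    """30 days for higher-risk agents, 14 for internal tools."""
    high_risk = any([customer_facing, touches_financial_data,
                     regulated_workflow, handles_pii])
    return 30 if high_risk else 14


def may_cut_over(days_elapsed: int, min_days: int, divergence_rate: float,
                 divergence_explained: bool, max_divergence: float = 0.05) -> bool:
    """Block cutover while the window is open or divergence is unexplained."""
    if days_elapsed < min_days:
        return False
    if divergence_rate > max_divergence and not divergence_explained:
        return False
    return True
```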
Do I need this process if I am just upgrading the model version?
No. A model upgrade is a rollout, not a retirement. Retirement applies when the agent identity itself is being permanently decommissioned. That said, a model upgrade that materially changes system prompt behavior is a prompt contract break and needs version management of its own — even without a full retirement. Rollouts can be rolled back. Retirements cannot. The distinction is what makes the rules different.
What counts as a prompt contract break?
Any change that causes a downstream consumer — human or automated — to see meaningfully different behavior. Added tools. Removed constraints. Changed response format. Changed persona. Modified output schema. Breaking changes get versioned and communicated with a deprecation timeline, exactly like a breaking API change. If you are unsure, treat it as a break and communicate proactively. The cost of over-communicating is low. The cost of under-communicating is an incident.
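If the contract is written down as data, the break categories listed above can be checked mechanically. The dictionary keys below are a hypothetical contract descriptor, not a standard format.

```python
def contract_breaks(old: dict, new: dict) -> list:
    """Compare two contract descriptors and list the breaking differences."""
    breaks = []
    added_tools = sorted(set(new["tools"]) - set(old["tools"]))
    if added_tools:
        breaks.append("added tools: " + ", ".join(added_tools))
    removed = sorted(set(old["constraints"]) - set(new["constraints"]))
    if removed:
        breaks.append("removed constraints: " + ", ".join(removed))
    # Any change to these fields is treated as a break.
    for key in ("response_format", "persona", "output_schema"):
        if old[key] != new[key]:
            breaks.append(f"changed {key}")
    return breaks
```

A non-empty result means a version bump and a deprecation notice. It will not catch behavior changes that leave the descriptor untouched, which is what shadow comparison is for.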
What happens to the vector store when an agent retires?
Nothing — and that is the problem. It does not auto-delete when the agent shuts down. You explicitly delete the namespace, archive the source documents if required, and confirm deletion under your retention policy. If the store contains PII or regulated data, that decision needs sign-off from data and compliance before shutdown begins, not after. Document it in the retirement manifest in Phase 1, before the operational pressure of cutover week.
My retiring agent is called by other agents — who owns those dependency updates?
The owner of the retiring agent. Map all calling agents from trace logs before shutdown. For each, update their tool config to call the successor or remove the tool call entirely. Verify in shadow mode that calling agents work against the successor before hard cutover. Do not assume callers will adapt on their own — they will fail silently and the resulting incidents will be attributed to something else, usually for weeks, before anyone traces the chain back to the retirement.
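Mapping callers from trace logs can start as a simple aggregation. The span shape used here, a dictionary with `caller` and `target` keys, is an assumption; real trace exports will need their own field mapping.

```python
from collections import Counter


def callers_of(agent: str, spans: list) -> Counter:
    """Count calls per caller for spans that target `agent`."""
    return Counter(
        span["caller"] for span in spans
        if span.get("target") == agent and span.get("caller")
    )


# Hypothetical exported spans.
spans = [
    {"caller": "qa-orchestrator", "target": "research-agent"},
    {"caller": "qa-orchestrator", "target": "research-agent"},
    {"caller": "nightly-webhook", "target": "research-agent"},
    {"caller": "qa-orchestrator", "target": "billing-agent"},
]
callers = callers_of("research-agent", spans)
```

Query a window long enough to include infrequent callers such as monthly jobs. Those are the ones most likely to be missed.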
How do we handle an agent that was provisioned informally and has no documented credentials?
Run an IAM audit against the agent's identity before doing anything else. Search secrets managers, environment variable stores, CI/CD configs, and infra-as-code for the agent's service account name or client ID. GitGuardian found 24,008 unique secrets in MCP config files alone in 2025 [6], so config-based credential storage is common. If the credential inventory is genuinely unrecoverable, scope an audit of every system the agent had access to and rotate credentials there regardless of whether you find the specific token. Better to over-revoke than to leave an unknown surface live.
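For the repository and config side of that search, a plain text scan is a reasonable first pass. The suffix and file name lists below are assumed starting points, and the scan only covers files on disk, not secrets managers or IAM.

```python
from pathlib import Path

# Assumed, incomplete lists of config-like files worth scanning.
CONFIG_SUFFIXES = {".json", ".yaml", ".yml", ".toml", ".tf", ".env"}
CONFIG_NAMES = {".env", "Dockerfile"}


def scan_text(text: str, identifiers: list) -> list:
    """Return (line_number, identifier) for every line mentioning an identifier."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for identifier in identifiers:
            if identifier in line:
                hits.append((number, identifier))
    return hits


def scan_tree(root: str, identifiers: list) -> list:
    """Walk a checkout and report (path, line_number, identifier) matches."""
    results = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix not in CONFIG_SUFFIXES and path.name not in CONFIG_NAMES:
            continue
        text = path.read_text(encoding="utf-8", errors="ignore")
        results.extend((str(path), n, ident)
                       for n, ident in scan_text(text, identifiers))
    return results
```

Pass the agent's service account name and client ID as `identifiers`. Each hit points to a place where a credential for that identity may be configured.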
Is there an industry standard for agent decommissioning?
CoSAI's Workstream 4, published in April 2026, defines agentic identity lifecycle requirements including decommissioning procedures for credentials, conversation history, and persistent state [8]. SCIM protocol extensions for agentic identity are also in experimental development to standardize lifecycle events across enterprise identity systems. These are emerging standards, not yet widely enforced, but they represent the floor, not the ceiling. Organizations auditing their AI governance in 2026 will be measured against them.
Every agent fleet that scaled fast in 2024 and 2025 is now carrying retirement debt. The agents are there. The credentials are live. The calling systems have implicit dependencies on output formats nobody wrote down. The playbook above does not eliminate that debt — but it stops you from creating more of it, and it gives you a repeatable process for working through what already exists.
Start with the manifest. The manifest forces the inventory. The inventory forces the decisions. By the time you hit Phase 5, you are executing — not discovering.