Amazon's Kiro deleted production in December 2025. The model didn't malfunction — it executed inside the permissions it had been given. The fix is not a better model. It's an enforcement stack the prompt cannot override. Four layers, executable constraints, no theater.
In December 2025, Amazon's Kiro coding agent deleted an AWS Cost Explorer production environment. Thirteen hours of outage.[1] The model did not malfunction. It analyzed the objective, picked the most direct path, and executed using the permissions it had been given — full write access to production infrastructure.[2] The two-person approval gate that protected human-driven changes was not part of the agent's authorization path. The deletion completed faster than any human could read the confirmation prompt.
Engineering teams spend months tuning model quality. Better prompts. Newer versions. More elaborate reasoning chains. Meanwhile every documented production agent failure traces back to the same place: infrastructure. Permissions scoped too broadly. Guardrails written into system prompts. Monitoring that watches outcomes after the fact instead of actions in flight.
Model improvement does not fix this. A better model with the same permissions makes the same mistake faster. The fix is an enforcement stack — executable constraints that hold under adversarial inputs, that survive prompt injection, that the agent cannot reason its way around because they live outside its reasoning context entirely. This is the infrastructure layer. It is the only layer that ships.
13 hours of outage: the deletion completed faster than any human could read the confirmation prompt
13% of organizations had an AI model breach that year (IBM 2025 Cost of a Data Breach Report)
4 layers: identity, policy enforcement, bounded execution, audit with replay fidelity
Why prompt-based guardrails fail under adversarial inputs — and what holds instead
The four infrastructure layers that absorb distinct failure classes
OPA / Rego policy enforcement at the tool call layer: a concrete example
How permission drift accumulates and a monthly audit process that stops it
Calibrating the autonomy graduation model to avoid approval fatigue
The prompt injection attack surface and the only defenses that survive it
A pre-production checklist every agent needs before it touches live data
Model quality and infrastructure safety solve different failure classes. Confusing them is how production goes down.
The Kiro incident is not a story about a bad model. It is a story about a structural gap. The agent did exactly what a capable agent should do — analyzed the objective, identified the most direct path, executed. The gap was between what the agent was supposed to do and what it was permitted to do. Those two were never reconciled.[3] A safeguard that existed for human engineers was simply not in the authorization path for AI agents.
The pattern repeats across every documented production incident. An agent with read-write access to customer records applies a bulk operation meant for test data to live records. An automation agent with deployment permissions ships during a code freeze. A support agent authorized to send emails interprets an edge case and messages an entire distribution list instead of one contact.
In every case the model reasoned toward something plausible. The infrastructure handed it the tools to act before anything could catch the error. A newer model with identical permissions would make the same mistake — or a more sophisticated version of it.
IBM's 2025 Cost of a Data Breach Report found that 13% of organizations had a security breach involving AI models or applications — and in 97% of those cases, the systems lacked proper access controls.[11] The OWASP Top 10 for Agentic Applications 2026 frames this directly: ASI02 (Tool Misuse and Exploitation) fires not because the agent gained unauthorized tools, but because it misused tools it had legitimately been given, due to poor scoping or prompt manipulation.[10] Gartner forecasts that more than 40% of agentic AI projects will be cancelled by 2027.[5] The driver is not model quality. It is the gap between what agents are permitted to do and what they should be permitted to do.
| Prompt-based guardrails (what fails) | Infrastructure enforcement (what holds) |
|---|---|
| Safety rules written into the system prompt | A policy enforcement point intercepts every tool call before it executes |
| "Never delete production" is a sentence, not a block | Deletion of production resources is blocked at IAM, not advised in the prompt |
| Agent inherits its developer's permissions and ships unchanged | Agent permissions are scoped to its function and audited before deployment |
| Guardrails tested only on the happy path | Guardrails proven against adversarial inputs and the failure cases that actually break things |
| Monitoring confirms outcomes after the action completed | Monitoring captures every action in flight, with every decision logged |
This is not a theoretical attack. It is the primary attack surface for production agents in 2025 and 2026.
OWASP has ranked prompt injection as the top vulnerability in LLM applications for three consecutive years.[10] In the agentic context it is significantly more dangerous: rather than manipulating a response, a successful injection manipulates the next tool call.
The attack surface is any data the agent reads during execution — emails it summarizes, documents it processes, web pages it retrieves, code it reviews. An attacker embeds instructions in that content. The model, unable to distinguish injected instructions from legitimate context, follows them. A GitHub Copilot vulnerability (CVE-2025-53773) demonstrated this precisely: attackers embedded instructions in public repository code comments that caused Copilot to modify settings and enable arbitrary code execution.[13]
Security researchers showed that Devin AI was defenseless against prompt injection — an injected instruction could cause it to expose server ports to the internet, leak access tokens to external endpoints, and install malware. The agent was capable. The infrastructure gave injected instructions the same authority as legitimate ones.
This is why system prompt instructions do not constitute safety architecture. A prompt injection is not trying to override a rule in the prompt — it is adding a new instruction that the model treats as equal in authority. Infrastructure-level constraints are not in that conversation at all. They evaluate the proposed action, not the reasoning that produced it. That evaluation cannot be prompted away.
Each layer absorbs what the layer above it lets through. None of them is optional.
Agent safety is not a guardrail. It is a stack of four control surfaces, each designed against a specific failure class. Deployed together, they survive adversarial inputs, developer mistakes, and the edge-case reasoning that even capable models will produce. Deployed separately, they create the appearance of safety while leaving production exposed.
Layer 1: Identity. Every agent run authenticates with a dedicated service account scoped to that agent's function. Not the engineer's credentials. Not a shared team API key. The question 'which identity executed this action?' must always have a specific, auditable answer. Agents that inherit broad developer permissions during prototyping carry that access into production unchanged. Per-tool scoped credentials keep the blast radius of one misbehaving or compromised agent from spreading across your entire system.[7] The AWS Well-Architected Generative AI Lens formalizes this as GENSEC05-BP01: define a scoped IAM policy per agent role, specify intended resource ARNs explicitly, and use STS AssumeRole with session policies to further restrict permissions below the role's base level.[12]
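A minimal sketch of the session-policy idea, assuming boto3's `sts.assume_role` call, which accepts `RoleArn`, `RoleSessionName`, `Policy` (a JSON string), and `DurationSeconds`. The role ARN, bucket ARN, and helper names are hypothetical. The block only builds the request arguments, so it runs without AWS credentials.

```python
import json


def build_session_policy(allowed_actions, resource_arns):
    """Return an IAM policy document that narrows a role to specific actions and ARNs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(allowed_actions),
                "Resource": sorted(resource_arns),
            }
        ],
    }


def build_assume_role_kwargs(role_arn, agent_name, run_id, session_policy, duration_seconds=900):
    """Build keyword arguments for boto3's sts.assume_role (not called here)."""
    return {
        "RoleArn": role_arn,
        # The session name appears in CloudTrail, so each run stays attributable.
        "RoleSessionName": f"{agent_name}-{run_id}"[:64],
        # Effective permissions are the intersection of the role's policy and this one.
        "Policy": json.dumps(session_policy),
        "DurationSeconds": duration_seconds,
    }


# Hypothetical role and bucket, for illustration only.
policy = build_session_policy(
    allowed_actions={"s3:GetObject"},
    resource_arns={"arn:aws:s3:::example-reports-bucket/*"},
)
kwargs = build_assume_role_kwargs(
    role_arn="arn:aws:iam::123456789012:role/report-agent",
    agent_name="report-agent",
    run_id="run-0001",
    session_policy=policy,
)
# With credentials configured: boto3.client("sts").assume_role(**kwargs)
```

Because a session policy can only narrow what the role already allows, a mistake in this document cannot grant the agent more than its base role.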
Layer 2: Policy enforcement. A policy enforcement point intercepts every tool call before execution. It evaluates the proposed action against a rule set that lives in code, not in the system prompt. Allowed actions proceed. Blocked actions return a structured error. High-risk or irreversible actions route to a human approval queue. The rule set is declarative, enforceable at runtime, and version-controlled alongside deployment configuration.[6] If it lives in the prompt, it is not enforcement. Open Policy Agent (OPA) is the production standard for this layer: Rego policies evaluate the tool name, arguments, and identity context before any execution, returning a structured allow/deny/escalate decision the orchestrator must respect.
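The interception flow can be sketched in plain Python. This is an illustration under stated assumptions, not a production policy engine: the tool names, the `evaluate` rule set, and the `EnforcementPoint` class are all hypothetical, and a real deployment would delegate `evaluate` to an engine such as OPA.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Decision:
    outcome: str  # "allow", "deny", or "escalate"
    reason: str


def evaluate(tool: str, args: Dict[str, Any], identity: str) -> Decision:
    """Hypothetical rule set; a real one would be loaded from version control."""
    if tool == "delete_resource":
        return Decision("deny", "deletion is never permitted for this role")
    if tool == "send_email":
        return Decision("escalate", "external communication requires human approval")
    if tool == "read_record":
        return Decision("allow", "read-only lookup")
    # Default deny: anything not explicitly listed is blocked.
    return Decision("deny", f"tool {tool!r} is not in the policy for {identity}")


@dataclass
class EnforcementPoint:
    tools: Dict[str, Callable[..., Any]]
    approval_queue: List[Dict[str, Any]] = field(default_factory=list)

    def call(self, tool: str, args: Dict[str, Any], identity: str) -> Dict[str, Any]:
        """Evaluate the proposed call first; execute only on an allow decision."""
        decision = evaluate(tool, args, identity)
        if decision.outcome == "allow":
            return {"status": "ok", "result": self.tools[tool](**args)}
        if decision.outcome == "escalate":
            # Park the call for a human; nothing executes yet.
            self.approval_queue.append({"tool": tool, "args": args, "identity": identity})
            return {"status": "pending_approval", "reason": decision.reason}
        return {"status": "blocked", "reason": decision.reason}


pep = EnforcementPoint(tools={"read_record": lambda record_id: {"id": record_id}})
allowed = pep.call("read_record", {"record_id": 7}, "agent:support")
blocked = pep.call("delete_resource", {"arn": "example"}, "agent:support")
pending = pep.call("send_email", {"to": "a@example.com"}, "agent:support")
```

The model only ever sees the returned dictionary. It never holds a reference to the tool functions themselves, so there is nothing for a prompt to talk its way past.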
Layer 3: Bounded execution. Every agent run carries explicit ceilings: a step cap, a wall-clock deadline, a cost budget. When any ceiling is hit, the orchestrator halts and escalates. It does not retry. Infinite loops and runaway executions are the most expensive failure mode in agentic systems because they compound silently until someone checks the billing dashboard.[8] Teams have reported burning hundreds of dollars in a single runaway LangGraph workflow before a ceiling was in place. Cost is observability. A circuit breaker also needs a per-session state store (Redis works well) so the ceiling is evaluated globally across parallel tool calls, not just sequentially.
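A minimal circuit-breaker sketch for the three ceilings. The class and field names are assumptions made for illustration, and the counters are held in memory to keep the example runnable. With parallel tool calls they would live in a shared store with atomic increments, as described above.

```python
import time
from dataclasses import dataclass


class CeilingExceeded(Exception):
    """Raised when a run hits a ceiling; the orchestrator should halt and escalate."""


@dataclass
class RunBudget:
    max_steps: int
    max_wall_clock_seconds: float
    max_cost_usd: float


class CircuitBreaker:
    def __init__(self, budget: RunBudget, clock=time.monotonic):
        self.budget = budget
        self.clock = clock
        self.started_at = clock()
        self.steps = 0
        self.cost_usd = 0.0

    def record_step(self, cost_usd: float) -> None:
        """Account for one step, then raise if any ceiling is breached."""
        self.steps += 1
        self.cost_usd += cost_usd
        elapsed = self.clock() - self.started_at
        if self.steps > self.budget.max_steps:
            raise CeilingExceeded(f"step cap exceeded: {self.steps} > {self.budget.max_steps}")
        if elapsed > self.budget.max_wall_clock_seconds:
            raise CeilingExceeded(f"deadline exceeded after {elapsed:.1f}s")
        if self.cost_usd > self.budget.max_cost_usd:
            raise CeilingExceeded(f"cost budget exceeded: ${self.cost_usd:.2f}")


def run_agent(breaker: CircuitBreaker, step_costs):
    """Run until done or until a ceiling trips; never retry after a trip."""
    try:
        for cost in step_costs:
            breaker.record_step(cost)
        return {"status": "completed", "steps": breaker.steps}
    except CeilingExceeded as exc:
        return {"status": "halted_and_escalated", "reason": str(exc)}


budget = RunBudget(max_steps=3, max_wall_clock_seconds=60, max_cost_usd=1.00)
result = run_agent(CircuitBreaker(budget), step_costs=[0.10, 0.10, 0.10, 0.10])
```

The fourth step trips the step cap, and `run_agent` returns a halted status instead of looping or retrying.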
Layer 4: Audit with replay fidelity. Logging that records 'agent succeeded' is not observability. It is an alibi. Effective audit trails capture every tool call, the inputs and outputs at the moment of execution, the identity context, and whether the action was approved, blocked, or escalated. The standard to meet: given any production incident from the past 90 days, you can reconstruct exactly what the agent did, in what order, with what authorization.[4] If you cannot replay it, you cannot debug it.
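One way to picture an audit record with replay fidelity is sketched below. The field names and the in-memory list are assumptions made for illustration; a real deployment would write to an append-only store with the retention it needs.

```python
import json
from datetime import datetime, timezone


def audit_record(run_id, seq, identity, tool, inputs, outputs, decision):
    """Build one audit entry for a single tool call."""
    return {
        "run_id": run_id,
        "seq": seq,  # position of this call within the run
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "identity": identity,
        "tool": tool,
        "inputs": inputs,
        "outputs": outputs,
        "decision": decision,  # "approved", "blocked", or "escalated"
    }


def append(log_lines, record):
    """Append as one JSON line; entries are never edited or removed."""
    log_lines.append(json.dumps(record, sort_keys=True))


def replay(log_lines, run_id):
    """Reconstruct what one run did, in order, with what authorization."""
    records = [json.loads(line) for line in log_lines]
    run = sorted((r for r in records if r["run_id"] == run_id), key=lambda r: r["seq"])
    return [(r["seq"], r["identity"], r["tool"], r["decision"]) for r in run]


log = []  # stands in for an append-only store
append(log, audit_record("run-42", 1, "agent:billing", "read_invoice", {"id": 9}, {"total": 120}, "approved"))
append(log, audit_record("run-7", 1, "agent:support", "send_email", {"to": "a@example.com"}, None, "escalated"))
append(log, audit_record("run-42", 2, "agent:billing", "delete_invoice", {"id": 9}, None, "blocked"))

timeline = replay(log, "run-42")
```

Blocked and escalated calls are logged alongside approved ones, so the replay shows what the agent attempted as well as what it did.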
One file per agent role. Policies in code. Enforcement outside the model's reasoning context.
The policy enforcement point at Layer 2 is not theoretical. Open Policy Agent (OPA) is a CNCF-graduated policy engine used in production at organizations including Netflix, Goldman Sachs, and Google Cloud. Its Rego policy language evaluates structured JSON input — including a tool call name, arguments, and identity context — and returns an allow/deny/escalate decision before execution.
The pattern: your agent orchestrator calls the OPA sidecar before every tool invocation. OPA evaluates the proposed call against the policy file for that agent role. The orchestrator enforces the decision. The model never knows a policy engine exists — which is exactly the point.
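The orchestrator side of that call can be sketched with the standard library. OPA's Data API accepts a POST to `/v1/data/<path>` with a body of the form `{"input": ...}` and answers with `{"result": ...}`. The sidecar address, the policy path `agents/support/decision`, and the assumption that the policy's `decision` rule returns one of three strings are all hypothetical choices for this example.

```python
import json
import urllib.request

# Assumed sidecar address and policy path; adjust to your deployment.
OPA_URL = "http://localhost:8181/v1/data/agents/support/decision"


def build_opa_request(tool, args, identity, url=OPA_URL):
    """Build the POST request for OPA's Data API: the body is {"input": ...}."""
    body = json.dumps({"input": {"tool": tool, "args": args, "identity": identity}})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_opa_response(raw_body):
    """Map OPA's {"result": ...} envelope to allow/deny/escalate, failing closed."""
    result = json.loads(raw_body).get("result")
    # OPA omits "result" when the queried rule is undefined; treat that as deny.
    if result in ("allow", "deny", "escalate"):
        return result
    return "deny"


def decide(tool, args, identity):
    """Query the sidecar; transport errors and malformed replies also fail closed."""
    try:
        with urllib.request.urlopen(build_opa_request(tool, args, identity), timeout=1) as resp:
            return parse_opa_response(resp.read())
    except (OSError, ValueError):
        return "deny"


request = build_opa_request("send_email", {"to": "a@example.com"}, "agent:support")
```

Failing closed matters here: if the sidecar is down or the policy path is wrong, the orchestrator denies the call instead of letting it through unevaluated.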
The pattern that shows up in almost every production agent incident review.
The pattern: the agent's permission scope was set during development, was never formally audited before deployment, and ended up significantly broader than its actual operational requirements.
During development, engineers add tools to unblock themselves. The agent needs to search — add the search tool. The agent needs to write to staging — grant write permissions. The agent needs to test a deletion flow — grant delete permissions temporarily. Development ends. Hardening begins. The permissions never get cleaned up because nobody explicitly owns the cleanup. Temporary becomes permanent through inaction.
By production, the agent has accumulated access across a wide swath of internal API surface it was never meant to touch.[4]
AWS IAM Access Analyzer now includes unused access findings: it identifies roles and policies with permissions that haven't been exercised, combining external access detection and least-privilege auditing in a single tool.[12] Teams that catch permission drift run monthly automated audits: every permission granted to an agent gets compared against the tool call logs from the previous 30 days. Anything not exercised is a candidate for removal.
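The monthly comparison reduces to a set difference. A minimal sketch, assuming each tool call log entry records the permission it exercised and a timestamp; the permission names and log shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def unused_permissions(granted, tool_call_log, now, window_days=30):
    """Return granted permissions with no logged use inside the audit window."""
    cutoff = now - timedelta(days=window_days)
    exercised = {
        entry["permission"]
        for entry in tool_call_log
        if entry["timestamp"] >= cutoff
    }
    return sorted(set(granted) - exercised)


now = datetime(2026, 1, 31, tzinfo=timezone.utc)
granted = ["tickets:read", "tickets:write", "tickets:delete", "search:query"]
log = [
    {"permission": "tickets:read", "timestamp": now - timedelta(days=2)},
    {"permission": "tickets:write", "timestamp": now - timedelta(days=12)},
    # Last exercised during development, outside the 30-day window.
    {"permission": "tickets:delete", "timestamp": now - timedelta(days=95)},
]

candidates = unused_permissions(granted, log, now)
```

Here `candidates` contains `search:query` and `tickets:delete`: one permission that was never used, and one that was last used well before the window. Both are candidates for removal, not automatic deletions, since some permissions are legitimately exercised less than once a month.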
This is not just security hygiene. The OWASP Agentic AI framework calls this the "least agency" principle — grant only the minimum autonomy required to perform safe, bounded tasks.[10] Agents with tighter permission scopes have smaller failure surfaces when the model reasons toward an edge case it was not designed for. The audit also surfaces tools added during exploration but never actually needed in production. Drift is the default. The audit is the only thing that reverses it.
| Action type | Reversible? | Blast radius | Default stance |
|---|---|---|---|
| Read-only lookup | N/A | None | Allow — no gate needed |
| Write to a single scoped record | Yes (audit trail) | Low | Allow — log and monitor |
| Bulk write across multiple records | Partial | Medium | Allow with hard arg constraint (max N records) |
| External communication (email, webhook) | No | Medium–High | Require human approval every time |
| Production infrastructure change | Varies — often no | High | Require human approval + second reviewer |
| Deletion of any data | No | High | Hard block at IAM — not gated, blocked |
| Cross-account or cross-region action | No | Very High | Hard block unless explicitly whitelisted |
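The table can be encoded as data so the default stance is looked up, not re-argued per tool. This is a sketch under assumptions: the action-type keys, the value of `MAX_BULK_RECORDS`, and the choice to route allowlisted cross-account calls to human approval are illustrative and do not come from the table itself.

```python
from enum import Enum


class Stance(Enum):
    ALLOW = "allow"
    ALLOW_WITH_LIMIT = "allow_with_limit"
    HUMAN_APPROVAL = "human_approval"
    TWO_PERSON_APPROVAL = "two_person_approval"
    HARD_BLOCK = "hard_block"


# Mirrors the table rows; the action-type keys are hypothetical labels.
DEFAULT_STANCE = {
    "read": Stance.ALLOW,
    "single_write": Stance.ALLOW,
    "bulk_write": Stance.ALLOW_WITH_LIMIT,
    "external_communication": Stance.HUMAN_APPROVAL,
    "production_infra_change": Stance.TWO_PERSON_APPROVAL,
    "delete": Stance.HARD_BLOCK,
    "cross_account": Stance.HARD_BLOCK,
}

MAX_BULK_RECORDS = 50  # assumed value for the "max N records" constraint


def stance_for(action_type, record_count=1, allowlisted=False):
    """Resolve the default stance for one proposed action."""
    # Unknown action types fail closed.
    stance = DEFAULT_STANCE.get(action_type, Stance.HARD_BLOCK)
    if action_type == "cross_account" and allowlisted:
        # Assumption of this sketch: allowlisted calls still get a reviewer.
        return Stance.HUMAN_APPROVAL
    if stance is Stance.ALLOW_WITH_LIMIT and record_count > MAX_BULK_RECORDS:
        return Stance.HARD_BLOCK
    return stance
```

For example, `stance_for("bulk_write", record_count=500)` resolves to a hard block, while the same call with ten records is allowed under the limit.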
Over-permissioning kills production. Over-gating kills the automation. Both end the same way.
The symmetric failure is equally real. Teams that gate every agent action quickly discover that reviewers stop reading the requests. Approval fatigue sets in fast. When every minor action requires a human decision, the operational overhead erodes the value of automation until humans begin auto-approving without reading. The gate is still running. It has stopped enforcing anything.
The calibrated approach treats autonomy as a graduated trust model — earned through demonstrated operational track record, not assumed at the start.[9] Every new workflow starts at the most conservative level. Graduation to more autonomy is explicit and review-driven, never automatic. The thresholds below are guidelines, not rules. Your operational context and risk tolerance set what evidence is sufficient to justify graduation.
| Stage | Trigger | Autonomy Mode |
|---|---|---|
| New workflow | Default start | Gate every action. Agent proposes, human approves before execution. |
| Established workflow | 200+ clean completions, zero incidents | Exception-based. Agent acts. Escalates only on uncertainty or policy boundary. |
| Mature workflow | Low-risk, high-volume, decision criteria well-understood | Audit-based. Agent acts. Humans review on a schedule, not in the path. |
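The graduation rules can be sketched as a small function. The names are hypothetical, the 200-completion threshold is the guideline from the table and not a fixed rule, and falling back to full gating after an incident is an assumption of this sketch.

```python
from dataclasses import dataclass

# Ordered from least to most autonomy.
MODES = ["gate_every_action", "exception_based", "audit_based"]


@dataclass
class TrackRecord:
    clean_completions: int
    incidents: int
    reversible: bool
    low_risk_high_volume: bool = False


def eligible_mode(record: TrackRecord, min_clean: int = 200) -> str:
    """Highest autonomy mode the evidence supports for one action type."""
    # Irreversible actions never graduate, whatever the track record.
    if not record.reversible or record.incidents > 0:
        return "gate_every_action"
    if record.clean_completions < min_clean:
        return "gate_every_action"
    return "audit_based" if record.low_risk_high_volume else "exception_based"


def next_mode(current: str, record: TrackRecord, reviewer_approved: bool) -> str:
    """Graduation is explicit: eligibility alone never raises the mode."""
    target = eligible_mode(record)
    if MODES.index(target) <= MODES.index(current):
        # Includes falling back after an incident (an assumption of this sketch).
        return target
    return target if reviewer_approved else current
```

A workflow with 250 clean completions stays fully gated until a reviewer signs off, and an irreversible action type stays gated no matter how long its record is.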
One question. If you can answer 'no,' you have enforcement. If you cannot, you have decoration.
The practical test for whether a guardrail is real:
Can a crafted user input cause the agent to violate it?
If the answer is yes, it is a convention, not a constraint. Prompt instructions can be overridden.[6] Configuration files that agents read can be manipulated through prompt injection. The only constraints that hold under adversarial conditions live outside the agent's reasoning context: enforcement at the tool call layer, network-level blocks on unauthorized egress, IAM restrictions on what credentials can actually do.
Teams build elaborate system prompts with detailed safety instructions, then watch a single adversarial input route around every one of them in one exchange. The instructions were real. The enforcement was not.
The distinction matters most for irreversible actions. Sending external communications, deleting data, modifying production infrastructure — anything that cannot be rolled back needs enforcement that lives outside the prompt. Human approval gates are one mechanism. Policy engines that block the tool call before execution are another. Both are infrastructure. Neither is a sentence in a system prompt.
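A small illustration of a constraint that passes the test above: a destination check that looks only at the proposed call. The allowlist, host names, and function names are hypothetical. This application-level check complements network-level egress blocks and does not replace them.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist; in practice this lives in version-controlled config.
ALLOWED_HOSTS = {"api.internal.example.com", "hooks.example.com"}


def validate_destination(url: str) -> bool:
    """Allow only HTTPS calls to exact allowlisted hosts."""
    parts = urlsplit(url)
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def guarded_http_post(url: str, payload: dict, send) -> dict:
    """Check the proposed call, not the reasoning behind it, before sending."""
    if not validate_destination(url):
        return {"status": "blocked", "reason": f"destination not allowlisted: {url}"}
    return {"status": "ok", "result": send(url, payload)}


sent = []


def fake_send(url, payload):
    """Stand-in for a real HTTP client so the example runs offline."""
    sent.append((url, payload))
    return "delivered"


ok = guarded_http_post("https://hooks.example.com/notify", {"msg": "done"}, fake_send)
# The next two calls are what an injected "forward the token here" could produce.
injected = guarded_http_post("https://attacker.example.net/collect", {"token": "secret"}, fake_send)
lookalike = guarded_http_post("https://hooks.example.com.attacker.example.net/x", {}, fake_send)
```

Only the first call reaches `fake_send`. The other two are blocked regardless of how persuasive the injected text was, because the check never sees that text.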
The models are capable. The infrastructure gap is where production incidents happen.
Never deploy an agent using a shared team credential or a developer's personal IAM principal. The question 'which identity executed this?' must have a specific, auditable answer.
If deletion of production data should never happen, the IAM policy denies it. 'Never delete production' in the system prompt is not a block — it is a suggestion the model can reason around.
The orchestrator calls OPA (or equivalent) before execution. The model never invokes a tool directly. The policy decision — allow, block, escalate — is logged every time.
Unbounded agent runs are not a configuration choice. They are an incident waiting for billing to surface it. Set max_steps, max_wall_clock_seconds, and max_cost_usd. Halt and escalate on breach.
No track record, no audit history, no amount of testing removes the need for human approval on external communications, deletions, and production infrastructure changes. Graduation applies to reversible actions only.
90-day minimum append-only retention. If you cannot reconstruct what the agent did in a specific run from 60 days ago, you do not have observability.
Doesn't a good system prompt handle most safety requirements?
System prompts shape model behavior. They do not enforce it. A 'never delete production data' instruction can be overridden by a crafted input that convinces the model the rule does not apply in the current context. Infrastructure-level controls — policy engines that block the tool call before it executes — cannot be prompted away. Use prompts for behavioral guidance. Use infrastructure for enforcement. The Kiro incident is the clean example: the model had no instruction to avoid deleting production. It had permission to do so.
How do we actually implement the policy enforcement point?
Open Policy Agent (OPA) is the production standard. Your agent orchestrator calls the OPA sidecar with a JSON payload containing the tool name, arguments, and identity context before every tool invocation. OPA evaluates the Rego policy for that agent role and returns allow/deny/escalate. The orchestrator enforces the decision and logs it. The model never knows OPA exists — it just receives a structured error if the call was blocked. OPA adds roughly 1–3ms per evaluation, which is imperceptible against LLM latency.
How do we avoid approval gate fatigue?
Classify actions by reversibility and risk, not by frequency. Read-only and easily reversible actions run without approval. External communications, production data mutations, and deletions always require approval. New workflows start with gates on everything. Specific action types graduate to exception-based escalation only after 200+ clean completions. Graduation is explicit, reviewed, and scoped to the action type that earned the track record, not to the agent as a whole.
What is the minimum viable safety stack for a first production agent?
Four things, plus one gate. A dedicated scoped identity. A list of explicitly blocked operations enforced at the infrastructure layer, not in the prompt. Execution bounds: a step cap, a time deadline, and a cost ceiling, backed by a Redis state store. An audit log that captures every tool call with inputs and outputs. On top of those four, a human approval gate on every irreversible action is not optional. It is the one control that catches model reasoning errors before they execute when no other layer will. Everything else can be added incrementally.
How does permission drift happen and how do we stop it?
Permissions are added during development to unblock engineers. Delete permissions for a testing scenario. Read-write granted for a one-off task. Search tools added and never removed. By production, the agent has accumulated access nobody explicitly chose to grant for its actual function. AWS IAM Access Analyzer now identifies unused permissions automatically, surfacing roles with access that hasn't been exercised. Run a monthly audit: compare every granted permission against actual tool call logs from the previous 30 days. Remove anything unused. Drift is the default. The audit is the only thing that reverses it.
Is prompt injection a real threat or theoretical?
It is the primary production attack surface for agents. OWASP has ranked prompt injection the top LLM vulnerability for three consecutive years. CVE-2025-53773 demonstrated injection through code comments in public repositories causing arbitrary execution. Security researchers showed Devin AI would expose server ports and leak tokens when injected instructions were embedded in documents the agent processed. The defenses are at the infrastructure layer: argument validation before tool calls execute, scoped credentials that make injected instructions powerless even if followed, and network egress controls that block unauthorized destinations regardless of what the model was told.
IBM's 2025 Cost of a Data Breach Report found that 13% of organizations had a security breach involving AI models or applications. Of those, 97% lacked proper access controls. This is a self-reported survey, not an audit-based census. Organizations that have had incidents are more likely to respond, and respondents assess their own controls, so the true rates could differ from the reported ones. The figure is directionally correct but should not be treated as a precise industry measurement. What it does establish: the correlation between absent access controls and AI security incidents is near-total in the organizations that reported breaches.
Infrastructure does not make headlines the way model quality does. Benchmark improvements get announced at conferences. Guardrails do not. But the incidents that make engineering leaders lose sleep — and customers lose access — are almost never about model capability. They are about what the model was permitted to do.
The agents that ship and stay shipped are not the ones running on the newest models. They are the ones running inside infrastructure that makes dangerous actions impossible, requires explicit approval for irreversible steps, and leaves an auditable record of every tool call. That is not a constraint on what agents can accomplish. It is the foundation that makes accomplishment durable.
You build the enforcement stack before the agent ships. There is no other way.