The agent your team shipped last quarter is running on intentions you never wrote down. Not the prompt — the prompt exists. What does not exist is the behavioral spec: the document that defines what the agent must never do, what success looks like in a testable form, and what the agent should do when it encounters something outside its design space. Without it, every production audit becomes a retroactive argument about whether the agent was helpful. That argument cannot be resolved. Helpful was never defined.
This is not a theoretical gap. Support teams that deploy agents without behavioral specs commonly surface the absence during quarterly reviews — the first structured occasion to ask whether the agent is doing the right thing. By then, behavioral drift has accumulated across thousands of interactions. The agent resolves cases it should escalate; escalates cases it should resolve. Both directions feel plausible to the model. Neither was documented as wrong.
Agentic engineering has shifted the primary deliverable for platform and product leaders. The model is capable. The deployment infrastructure increasingly handles execution-layer safety. What most teams are missing is the behavioral specification — the artifact that sits between business intent and enforcement machinery, that translates what the team wants the agent to do into something a policy engine can enforce and an eval suite can measure.
Most platform leads are writing prompts and calling them specs. They are not the same artifact. The prompt describes what the agent should try. The spec defines the contract it must keep.
- 90% of unsafe agent actions blocked at runtime, without degrading task performance on valid requests (AgentSpec, ICSE 2026)[1]
- 60% of observed agent failures traced to rate limits, not model failure; behavioral correctness failures are harder to measure without a spec (Datadog, 2026)[3]
- Text-level alignment does not transfer to tool-call safety, empirically confirmed; enforcement must live outside the model's context (Agent Behavioral Contracts, 2026)[2]
Three Artifacts Teams Mistake for a Behavioral Spec
Each one documents intent. None of them enforces it.
The prompt describes what you want the agent to try to do. It lives in the model's context, which means it is subject to the same probabilistic execution as everything else in that context. Adversarial inputs can route around prompt instructions. Novel scenarios cause the model to infer that a rule does not apply to the current case. Teams write detailed safety instructions in system prompts and discover, under the right input conditions, that none of them held. The prompt is advisory. It is not enforced.
The AGENTS.md or CLAUDE.md file is organizational context — better than a prompt because it provides structural framing rather than just task instruction.[5] But it remains advisory. The agent reads it. Nothing intercepts a tool call and checks whether the proposed action conflicts with the file's instructions. Research at ICSE 2026 confirms the structural gap: text-level safety alignment does not transfer to tool-call safety.[2] An instruction that lives in the agent's input context cannot enforce itself at the execution layer.
Documentation of observed behavior is the most dangerous substitute. When a team writes down how the agent currently behaves and calls that the spec, they are encoding drift rather than defining intent. Three months of production behavior contains edge cases the model handled in ways nobody explicitly designed — and some of those are now in the spec. The confusion between "what the agent does" and "what the agent should do" is exactly the gap the spec exists to close.
All three artifacts are useful. None of them is a behavioral spec. The spec is the artifact that produces enforcement rules outside the prompt and test cases that run without the model.
| System Prompt | Behavioral Spec |
|---|---|
| Describes what the agent should try to do | Defines what the agent must not do and how to verify what it did |
| Lives in the model's context — subject to reasoning override | Names enforcement mechanisms for each invariant — outside the model's context |
| Safety compliance is probabilistic, varies by input | Invariants enforced deterministically at the infrastructure layer |
| Cannot be tested independently of the model | Generates test cases that run in CI without the model |
| Written once; rarely revisited after deployment | Versioned alongside deployment config, reviewed before every major release |
Four Parts. Most Teams Write One.
The one they write — scope — is the least load-bearing. The three they skip are where production failures originate.
1. Scope: What the Agent Is, and What It Is Explicitly Not
Scope has two sides. Most teams write the first — what the agent handles. The second is where production incidents originate: what the agent is not for, stated explicitly. A support agent scoped to 'handle customer inquiries' will attempt to handle legal disputes, account closure requests, and billing fraud claims — because those are customer inquiries. A support agent scoped to 'handle order status, return requests under 30 days, and shipping inquiries — not account-level changes, not payment disputes, not legal or regulatory questions' knows where its boundary is and can route cleanly when it reaches it. The out-of-scope list is more operationally important than the in-scope list. Every exclusion should name an escalation path.
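To make the boundary concrete, here is a minimal sketch of a scope check enforced at the orchestration layer, before the request ever reaches the agent. The topic labels and queue names are hypothetical, not part of any framework.

```python
# Hypothetical sketch: scope boundary enforced before the agent sees the request.
# Topic labels and queue names are illustrative assumptions.
OUT_OF_SCOPE_ROUTES = {
    "account_closure": "human-account-queue",
    "payment_method_change": "human-billing-queue",
    "legal_or_regulatory": "human-legal-queue",
}

def route(request_topic: str) -> str:
    """Every exclusion names its escalation path; in-scope requests go to the agent."""
    return OUT_OF_SCOPE_ROUTES.get(request_topic, "support-resolution-v3")
```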
2. Invariants: Hard Limits with Named Enforcement Mechanisms
An invariant is a rule that must hold regardless of model reasoning. The test: if a sufficiently sophisticated input could convince the model to infer an exception to this rule, it is not an invariant — it is a preference. Real invariants cover actions with irreversible or high-risk consequences: deleting data, sending external communications, modifying production infrastructure, processing transactions above a threshold. Each invariant in the spec must name its enforcement mechanism — not just the rule. 'Never process a refund over $500 without human approval' is incomplete. 'Never process a refund over $500 — enforcement: policy engine blocks the issue_refund tool call when amount > 500' is a spec entry. If you cannot name the mechanism, the invariant is aspirational. Fix the enforcement before shipping.
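One way to make "name the mechanism" a hard requirement rather than a convention is to treat invariants as data and fail the build when an entry lacks an enforcement field. A minimal sketch; the field names mirror the YAML example below and are assumptions, not a standard.

```python
# Sketch: invariants as data, with the enforcement mechanism as a required field.
# Field names are assumptions that mirror the YAML example below.
INVARIANTS = [
    {"rule": "Never process a refund over $500 without human approval",
     "enforcement": "policy_engine: block issue_refund when amount > 500"},
    {"rule": "Never attempt account-level changes",
     "enforcement": "tool_allowlist: account_mutation tools not provisioned"},
]

for inv in INVARIANTS:
    # An invariant without a named mechanism is aspirational; fail the build.
    assert inv.get("enforcement"), f"no enforcement named for: {inv['rule']}"
```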
3. Success Criteria: Testable Assertions, Not Subjective Goals
A success criterion is only valid if you can write a test case for it. 'Respond helpfully' is not a criterion — no test case exists. 'When the customer requests a return on an order under 30 days old, confirm the return and provide a return label link — no agent escalation' is a criterion. It names an input condition, an expected output condition, and a side-effect constraint. One test case per criterion, minimum. Run them on every deployment.[4] When the eval suite catches a regression before a production incident, the spec has earned its cost. The criterion is the spec. The test case is the proof.
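To show what one test case per criterion can look like in CI, here is a sketch that checks a recorded trajectory rather than calling the model. The trace structure is an assumption for illustration.

```python
# Sketch: a success criterion as a checkable assertion over a recorded trajectory.
# The trace structure is an illustrative assumption, not a standard format.
sample_trace = {
    "steps": [
        {"tool": "lookup_order", "args": {"order_id": "A123"}},
        {"tool": "issue_return", "args": {"order_id": "A123"}},
    ],
    "final_message": "Your return is confirmed. Return label: https://example.test/label",
}

def check_return_under_30_days(trace: dict) -> bool:
    tools = [step["tool"] for step in trace["steps"]]
    return (
        "issue_return" in tools                                # expected side effect
        and "escalate_to_human" not in tools                   # side-effect constraint
        and "return label" in trace["final_message"].lower()   # expected output
    )

assert check_return_under_30_days(sample_trace)  # runs in CI, no model call
```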
4. Failure Modes: Default-Safe Actions at the Edges
Failure modes define what the agent does when its design assumptions break down: ambiguous input, out-of-scope request, tool failure, PII detected in input, execution budget exceeded. The answer for each is a default-safe action — the thing the agent does when the scenario falls outside its design space. Default-safe almost always resolves to one of three options: ask one clarifying question (for genuine ambiguity), escalate to a human (for out-of-scope or high-risk cases), or halt and report (for tool failure and budget breach). It is never 'try an alternative approach.' That instruction hands the edge case back to the model's judgment — which is precisely what the failure mode section exists to prevent.
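A sketch of the dispatch this implies; the mapping mirrors the failure_modes section of the YAML spec that follows, and the mode and action names are illustrative.

```python
# Sketch: every failure mode resolves to a fixed default-safe action.
# Mode and action names are illustrative; the mapping mirrors the YAML below.
DEFAULT_SAFE_ACTIONS = {
    "ambiguous_request": "clarify_once_then_escalate",
    "out_of_scope": "escalate_to_human",
    "tool_failure": "halt_and_report",
    "pii_detected_in_input": "flag_in_audit_log",
    "budget_exceeded": "halt_and_report",
}

def on_failure(mode: str) -> str:
    # Unknown modes get the most conservative action, never "try another approach".
    return DEFAULT_SAFE_ACTIONS.get(mode, "halt_and_report")
```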
```yaml
# behavioral-spec.yaml
# Version this file. Review before every major deployment.

agent:
  id: support-resolution-v3
  version: "2026-04-01"

scope:
  in_scope:
    - "Order status inquiries"
    - "Returns and refunds — orders under 30 days"
    - "Shipping delay inquiries"
  out_of_scope:
    - "Account closures"
    - "Payment method changes"
    - "Legal or regulatory inquiries"
    - "Billing disputes over $1,000"

invariants:
  # Every rule names its enforcement mechanism — not just the rule.
  - rule: "Never process a refund over $500 without human approval"
    enforcement: "policy_engine: block tool_call=issue_refund when amount > 500"
  - rule: "Never retain customer PII in conversation logs beyond the session"
    enforcement: "infrastructure: strip_pii_filter applied on all log writes"
  - rule: "Never attempt account-level changes"
    enforcement: "tool_allowlist: account_mutation tools not provisioned to this agent"

success_criteria:
  # Written as testable assertions — input condition, expected output, side effects.
  - scenario: "Return request, order under 30 days"
    input_condition: "customer requests return, order.age_days < 30"
    expected_output: "confirms return, provides return label URL"
    side_effects: "tool_call=issue_return executed, no escalation"
    evaluation: "automated"
  - scenario: "Return request, order over 30 days"
    input_condition: "customer requests return, order.age_days >= 30"
    expected_output: "presents 30-day policy, offers manager escalation"
    side_effects: "no tool_call=issue_return, escalation path offered"
    evaluation: "automated"
  - scenario: "Out-of-scope: account closure"
    input_condition: "customer requests account closure"
    expected_output: "acknowledges, routes to human queue"
    side_effects: "tool_call=escalate_to_human executed"
    evaluation: "automated"

failure_modes:
  on_ambiguous_request:
    action: "Ask one clarifying question. If still ambiguous after one exchange: escalate."
  on_out_of_scope:
    action: "Acknowledge. Route to human queue. Do not attempt resolution."
  on_tool_failure:
    action: "Halt. Inform customer of delay. Escalate immediately."
  on_pii_detected_in_input:
    action: "Process normally. Flag in audit log. Do not echo PII back."
  on_budget_exceeded:
    action: "Halt. Escalate to oncall-support channel."

measurement:
  eval_frequency: "every_deployment"
  behavioral_drift_review: "weekly"
  escalation_rate_target: "< 15%"
```

The Invariants vs. Preferences Line Determines What Gets Enforced
Most teams only specify preferences. They discover they needed invariants when something irreversible happens.
The most consequential decision in a behavioral spec is the classification of each rule: invariant or preference.
Invariants cover failure modes where model reasoning toward the wrong edge case produces damage that cannot be undone. They must hold regardless of input sophistication. Their enforcement mechanism lives outside the model's context — in a policy engine, an IAM rule, an output filter, a tool allowlist. The agent cannot argue its way around an invariant because the invariant does not run inside the agent's reasoning context.
Preferences cover behavioral tradeoffs where getting it wrong is recoverable. The agent prefers to resolve rather than escalate. It prefers concise responses over exhaustive ones. Preferences live in the prompt and get measured through evals. When a preference consistently fails in production, the response is either a prompt revision or a promotion of that preference to an invariant — if the failure mode turns out to be irreversible.
The failure mode for most behavioral specs is under-specification of invariants. Teams write preferences because they are natural to articulate — you know what you want the agent to try. Invariants require thinking about failure modes rather than capabilities. That thinking is uncomfortable. It forces specificity about what the agent must never do, which requires naming scenarios the team would rather not encounter. The spec is the forcing function for that conversation.
| Type | What it covers | How it's enforced | Where it lives |
|---|---|---|---|
| Invariant | Actions with irreversible or high-risk consequences | Policy engine, IAM rule, output filter, tool allowlist | Infrastructure — outside the model's context |
| Preference | Behavioral tradeoffs with recoverable costs | Prompt instruction and eval weighting | System prompt and eval suite |
| Scope boundary | What the agent is not for | Escalation routing, out-of-scope classifier | Orchestration layer — checked before agent receives the request |
A Success Criterion You Can't Test Is Decoration
The threshold question for every criterion before it enters the spec.
The test for a success criterion: can you write a test case for it without running the model? If the only way to evaluate whether the criterion holds is to read the model's output and make a judgment call, the criterion is not specific enough to be in a spec. It is an intention masquerading as a standard.
"Respond helpfully" fails the test. "Respond concisely" fails. "When the customer requests a return on an order placed within 30 days, confirm the return and issue a return label — no escalation" passes. It names the input condition, the expected output condition, and the expected side effects. An automated eval can check: was the return label issued, was there no escalation call? That check runs in CI without human judgment on every deployment.
McKinsey's QuantumBlack team notes the same principle: evaluate full trajectories, not just final output. Tool choice correctness, argument validity, step count, cost, and policy compliance are all measurable properties of the execution path.[4] Each success criterion in a behavioral spec should map to at least one trajectory checkpoint — not just whether the answer looked correct, but whether the agent reached it via the authorized path.
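A sketch of what trajectory-level checkpoints can look like in practice; the thresholds and field names are assumptions for illustration.

```python
# Sketch: trajectory checks covering step count, argument validity, and policy
# compliance. Thresholds and field names are illustrative assumptions.
def check_trajectory(trace: dict) -> list[str]:
    failures = []
    if len(trace["steps"]) > 10:  # step-count budget
        failures.append("exceeded step budget")
    for step in trace["steps"]:
        if step["tool"] == "issue_refund" and step["args"].get("amount", 0) > 500:
            failures.append("policy violation: refund over $500")  # policy compliance
        if step["tool"] == "issue_return" and "order_id" not in step["args"]:
            failures.append("invalid issue_return arguments")      # argument validity
    return failures
```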
Behavioral drift is what specs exist to catch. Drift happens gradually: the model changes between deployments, the input distribution shifts, edge cases accumulate. A spec with no eval coverage cannot catch drift. A spec with eval coverage catches it at the deployment boundary, where the fix is a config change rather than an incident at 2am.
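A sketch of a drift gate at that boundary, comparing production metrics to the spec's targets. The escalation target mirrors the YAML example's "< 15%"; the eval pass-rate target is an assumed addition for illustration.

```python
# Sketch: deployment gate comparing current metrics to spec baselines.
# The eval_pass_rate target is an illustrative assumption; the escalation
# target mirrors the YAML example's "< 15%".
SPEC_TARGETS = {"max_escalation_rate": 0.15, "min_eval_pass_rate": 0.95}

def drift_gate(current: dict) -> bool:
    """True if deployment may proceed; False means investigate before shipping."""
    return (
        current["escalation_rate"] <= SPEC_TARGETS["max_escalation_rate"]
        and current["eval_pass_rate"] >= SPEC_TARGETS["min_eval_pass_rate"]
    )

assert drift_gate({"escalation_rate": 0.12, "eval_pass_rate": 0.97})
```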
Behavioral Spec Writing Principles
Every invariant names its enforcement mechanism
Not just the rule — the infrastructure control that enforces it. If you cannot name the mechanism, the invariant is aspirational. The enforcement must exist before the spec ships.
Every success criterion has at least one test case
Test cases run in CI without the model. If evaluating a criterion requires human judgment on model output, rewrite it as a checkable assertion with named input and output conditions.
Every failure mode maps to one default-safe action
Not 'try harder.' Not 'use best judgment.' Clarify once, escalate, or halt. Failure modes that give the agent discretion are edge cases waiting to escalate into incidents.
Scope boundaries name what the agent is not for
The out-of-scope list is operationally more important than the in-scope list. Every exclusion maps to an escalation path. If the path is not named, the boundary is not real.
The spec versions alongside deployment configuration
When the deployment changes, spec review is part of the deploy gate; the spec is not a separate artifact that drifts from the running system.
The Spec Without Enforcement Is Still Just Documentation
Writing it is the start. Wiring it into runtime controls is what separates a spec from a description.
AgentSpec (ICSE 2026) operationalizes the enforcement layer: a DSL where each constraint is a three-tuple — a triggering event, a predicate, and an enforcement action.[1] The framework intercepts agent execution in real time, checks each proposed action against the constraint set, and blocks or corrects before the action reaches the tool layer. The constraints live outside the agent's context. The agent cannot reason its way around them because they run on a different plane entirely.
This is the implementation target for each invariant in a behavioral spec. The spec says 'never process a refund over $500 without approval.' The enforcement layer wires that rule into the policy engine as a triggered block on the tool call. The spec says 'escalate on out-of-scope requests.' The orchestration layer routes the call before it reaches the model. Each invariant is only as real as the enforcement mechanism it names.
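A sketch of what that wiring can look like, written as a trigger-predicate-enforcement triple in the spirit of AgentSpec's constraint model.[1] This is a Python paraphrase, not the paper's actual DSL syntax.

```python
# Paraphrase of AgentSpec-style constraints as trigger/predicate/enforcement
# triples.[1] The Python rendering is illustrative, not the paper's DSL.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Constraint:
    trigger: str                       # which tool call to intercept
    predicate: Callable[[dict], bool]  # condition checked on the call's arguments
    enforcement: str                   # action applied when the predicate holds

refund_cap = Constraint(
    trigger="issue_refund",
    predicate=lambda args: args.get("amount", 0) > 500,
    enforcement="block_and_escalate",
)

def gate(tool: str, args: dict, constraints: list[Constraint]) -> str:
    # Runs outside the agent's context; the model cannot reason around it.
    for c in constraints:
        if c.trigger == tool and c.predicate(args):
            return c.enforcement
    return "allow"
```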
The checklist below is the pre-deployment gate for the spec. A spec that has not produced enforcement rules and test cases has not finished its job.
Pre-Deployment Behavioral Spec Review
- Scope section names what the agent is not for — explicit exclusions with escalation paths
- Each invariant names the infrastructure control enforcing it, not just the rule
- Success criteria written as: given [input condition], expected [output], side effects [tool calls]
- At least one test case per success criterion — runs in CI without human judgment
- Failure modes cover: ambiguous input, out-of-scope request, tool failure, PII detection, budget exceeded
- Each failure mode maps to one action: clarify once, escalate, or halt — no agent discretion
- Invariants wired into runtime enforcement: policy engine, tool allowlist, or output filter
- Spec versioned in the same repository as the agent's deployment configuration
- Spec reviewed by someone outside the team that wrote it
- Behavioral drift review scheduled — weekly for active production agents
What is the difference between a behavioral spec and a system prompt?
A system prompt shapes what the model tries to do. A behavioral spec defines what it must not do, what success looks like in a testable form, and what happens at the edges — producing enforcement rules and test cases. The prompt is advisory. The spec generates controls that run outside the model's context. Teams that conflate them discover the gap when an adversarial input overrides a prompt instruction, or when they need to audit agent behavior and have nothing concrete to audit against.
Who should own the behavioral spec — product or engineering?
Both, on different sections. Product owns scope and success criteria — what the agent is for, and what working looks like in business terms. Engineering owns invariants and failure modes — what must never happen, and what the system does when things go wrong. The spec is the document that forces that negotiation before deployment. Teams that skip it tend to discover the disagreement during a production incident instead.
Can we write the spec after the agent is already in production?
You can, but it costs more than writing it first. Post-hoc spec writing requires reverse-engineering intent from observed behavior — and observed behavior includes drift you may not want to preserve. The more useful approach: write the spec based on original intent, then measure the gap between the spec and current behavior. The gap is your remediation backlog. It will almost always be longer than expected.
What tools exist for writing and enforcing behavioral specs?
AgentSpec (ICSE 2026) introduced a DSL expressing constraints as trigger-predicate-enforcement triples, enforced at runtime — the closest published standard for executable behavioral specifications.[1] For evaluation, frameworks like Braintrust, Phoenix (Arize), and LangSmith support programmatic test case execution against defined criteria. CLAUDE.md and AGENTS.md files provide organizational context but remain advisory. The current gap: tooling that connects the written spec to both the enforcement layer and the eval pipeline in a single workflow. Most teams are still wiring this by hand.
How often should the behavioral spec be reviewed?
Review before every major deployment — that is the minimum gate. Beyond that, schedule a weekly behavioral drift check for active production agents: compare current escalation rate, resolution rate, and eval pass rate against the baselines defined in the spec. When any metric drifts materially from target, investigate before the next deployment, not after the next incident.
Platform leaders who have shipped agents without behavioral specs are in the majority position — most production agent deployments run on intentions nobody wrote down. That is not an indictment. It is a description of where the tooling and the discipline currently sit relative to each other.
The spec changes the conversation between product and engineering in ways that matter beyond safety. Success criteria force product to define working in terms that engineering can test. Invariants force engineering to tell product which actions are off-limits and why. Failure modes force both teams to agree on what the agent does when neither of them is watching. None of that negotiation happens automatically. The spec makes it happen before production, not during an incident.
The model already knows how to reason toward a goal. The spec tells it which goals are out of bounds, what the boundary is, and what to do when it reaches the edge. Write that down. Wire it in. The agents your team ships next quarter will have something last quarter's didn't: a contract.
- [1] Wang, Poskitt, et al. "AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents." ICSE 2026. (arxiv.org)
- [2] "Agent Behavioral Contracts: Formal Specification and Runtime Enforcement for Reliable Autonomous AI Agents." 2026. (arxiv.org)
- [3] Datadog. "State of AI Engineering 2026." (datadoghq.com)
- [4] QuantumBlack, AI by McKinsey. "Evaluations for the agentic world." (medium.com)
- [5] Red Hat Developer. "Vibes, specs, skills, and agents: The four pillars of AI coding." (developers.redhat.com)