Create the software factory itself: model access, orchestration, agent runtime, prompt/version management, evals, observability, deployment flows, guardrails, and reusable internal components.
Karpathy’s four coding-agent principles are useful, but production agents need scoped edits, test-gaming controls, trust-boundary calibration, and calibrated reporting.
Detection tells you something is wrong. The four-step diagnostic pipeline — behavioral telemetry, failure clustering, root cause attribution, eval generation — tells you what failed, why, and how to stop it from shipping again. Most teams build partial detection and stop there.
Most teams architect for capability and optimize for cost after the invoice lands. Here is the playbook for building cost constraints in from day one: task profile audits, three-tier routing, and synthetic benchmarking before your first deploy.
Most production agents run on intentions nobody wrote down. Here is how to write the behavioral spec — scope, invariants, testable success criteria, and failure modes — that translates business intent into something your infrastructure can enforce.
Amazon's Kiro deleted production in December 2025. The model didn't malfunction — it executed inside the permissions it had been given. The fix is not a better model. It's an enforcement stack the prompt cannot override. Four layers, executable constraints, no theater.
Four agents coordinate. The trace backend shows 3 to 10 orphaned root spans, no causal thread. The model is not the failure. Context propagation is. Five gaps, the minimal code to close each, and the build order that actually ships.
Valid JSON, clean dashboards, no alerts — and the agent's reasoning depth dropped 67% between two model updates. Three detection layers catch what HTTP error rates structurally cannot: execution fingerprinting, semantic drift, and user-signal triangulation.
How to apply semantic versioning and consumer-driven contract testing to AI agent system prompts — treating prompts as versioned API contracts with explicit breaking change classification, agent manifests, and CDC-style registration for multi-agent production systems.
Most teams promote to multi-agent before proving the single agent. Three gates — observability, override readiness, behavioral consistency — decide whether orchestration is earned or inherited. Skip them and a $3.50 task becomes a $47,000 incident.
Latency, error rate, and token cost stay green while LLM output quality degrades for weeks. The infrastructure layer cannot see semantic failure. Sampled evals, prompt hash drift, and distribution alerts are the signals that catch it before users do.
Train once, control the weights, call it sovereignty. Twelve months later the model is confidently wrong about pricing, policy, and headcount. The playbook for when to retrain, what to retrain on, and how to validate without breaking live agents.
Two LangChain agents burned $47K in eleven days. The model worked. The budget math didn't. Multi-agent cost is a heavy-tailed distribution, monitoring is structurally too late, and only synchronous SDK-level enforcement stops the spiral.
Most agent failures return HTTP 200. The dashboard stays green while the reasoning chain quietly compounds the wrong premise. Here is the triage runbook, the failure-mode field guide, and the postmortem template that survives non-deterministic systems.
SRE runbooks assume one process, one stack trace, one bad line. Agent failures are distributed across dozens of reasoning steps — the wrong premise gets laundered through 33 more calls before the user sees it. Here is the taxonomy, the triage, the postmortem.
Billing anomaly alerts run on a 24–48 hour lag. The retry loop is already an invoice by the time anyone sees it. The control that catches it is per-session, in-process, and lives in the orchestration layer — profiled envelope, 3x P95 trip, defined degradation.
The dashboard goes green while the model invents a refund policy. Status codes are not a quality signal for generative output. The fix is an eval stack: CI gates, judge models, sampled production scoring, and a dataset that compounds with every failure.
Every dismiss, modify, and escalate is a labeled training signal. Most teams log it as a debug artifact and move on. Here is the audit schema, the weekly tuner, and the human approval gate that turn that signal into thresholds that converge in eight weeks.
One orchestrator decomposes the question. Four subagents work the threads in isolation. Synthesis weighs the evidence. The brief lands in twenty minutes — not because the model is faster, but because the topology stopped wasting wall-clock on serial wait.
Your private /deploy shortcut saves you twenty minutes a day and helps exactly one person. Plugins move the same workflow into a parameterized package every team installs in minutes. Here is the full lifecycle — skill, context files, MCP wiring, marketplace.
Roughly nine in ten skill files fail one of five basic checks. The body is rarely the problem. The description is — that 100-token blurb is the only thing the agent reads when deciding whether to load you. Engineer it, or stay invisible.