Engineering directors burn 45 minutes every morning reconstructing a picture five tools could have assembled. Replace the loop: five parallel collectors, one orchestrator, a confidence score, a 90-second RED/AMBER/GREEN brief. Triage out of working memory, into code.
Four signal layers, scored monthly per service, produce a fragility register that names your next outage weeks before it happens. Size is not risk. Neglect is risk. The heat map measures neglect.
The dashboard goes green while the model invents a refund policy. Status codes are not a quality signal for generative output. The fix is an eval stack: CI gates, judge models, sampled production scoring, and a dataset that compounds with every failure.
Latency, error rate, and token cost stay green while LLM output quality degrades for weeks. The infrastructure layer cannot see semantic failure. Sampled evals, prompt hash drift, and distribution alerts are the signals that catch it before users do.
Valid JSON, clean dashboards, no alerts — and the agent's reasoning depth dropped 67% between two model updates. Three detection layers catch what HTTP error rates structurally cannot: execution fingerprinting, semantic drift, and user-signal triangulation.
Amazon's Kiro deleted production in December 2025. The model didn't malfunction — it executed inside the permissions it had been given. The fix is not a better model. It's an enforcement stack the prompt cannot override. Four layers, executable constraints, no theater.
Detection tells you something is wrong. The four-step diagnostic pipeline — behavioral telemetry, failure clustering, root cause attribution, eval generation — tells you what failed, why, and how to stop it from shipping again. Most teams build partial detection and stop there.