Roughly nine in ten skill files fail at least one of five basic checks. The body is rarely the problem; the description is. That 100-token blurb is the only thing the agent reads when deciding whether to load your skill. Engineer it, or stay invisible.
Latency, error rate, and token cost can stay green while LLM output quality degrades for weeks. The infrastructure layer cannot see semantic failure. Sampled evals, prompt hash drift, and distribution alerts are the signals that catch it before users do.
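To make two of those signals concrete, here is a minimal sketch of a prompt hash drift check and a crude distribution alert. The function names, the choice of output length as the tracked metric, and the z-score threshold of 3.0 are all assumptions made for illustration; a production system would likely track several metrics and use a more robust statistical test.

```python
import hashlib
import statistics


def prompt_hash(prompt: str) -> str:
    """Return a stable SHA-256 fingerprint of the prompt text."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def prompt_drifted(deployed_prompt: str, recorded_hash: str) -> bool:
    """True when the deployed prompt no longer matches the hash recorded at release."""
    return prompt_hash(deployed_prompt) != recorded_hash


def distribution_alert(
    baseline: list[float], recent: list[float], z_threshold: float = 3.0
) -> bool:
    """Flag a shift when the recent mean sits far from the baseline mean.

    Hypothetical metric: any per-response number, such as output length.
    The distance is measured in baseline standard deviations.
    """
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)  # needs at least two baseline points
    if stdev == 0:
        return statistics.mean(recent) != mean
    z = abs(statistics.mean(recent) - mean) / stdev
    return z > z_threshold


# Example: the prompt was edited without a release, and outputs got much longer.
released = prompt_hash("You are a helpful assistant.")
print(prompt_drifted("You are a helpful assistant!", released))
print(distribution_alert([100, 102, 98, 101, 99], [150, 148, 152]))
```

Neither check touches latency or error rate, which is the point: both can fire while every infrastructure dashboard stays green.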
How to apply semantic versioning and consumer-driven contract (CDC) testing to AI agent system prompts: treat prompts as versioned API contracts, with explicit breaking-change classification, agent manifests, and CDC-style registration for multi-agent production systems.
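As a rough illustration of the idea, the sketch below classifies a prompt change as major, minor, or patch and checks a consumer's registered expectations against a manifest. The `PromptManifest` shape, its fields, and the classification rule (removing an output field or tool is breaking, adding one is not) are hypothetical choices made for this example, not a standard schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptManifest:
    """Hypothetical agent manifest: what the prompt promises to its consumers."""
    version: str
    output_fields: frozenset[str]
    tools: frozenset[str]


def classify_change(old: PromptManifest, new: PromptManifest) -> str:
    """Assumed rule: removals are breaking, additions are features, the rest is a patch."""
    removed = (old.output_fields - new.output_fields) | (old.tools - new.tools)
    added = (new.output_fields - old.output_fields) | (new.tools - old.tools)
    if removed:
        return "major"
    if added:
        return "minor"
    return "patch"


def bump(version: str, level: str) -> str:
    """Apply a semantic-version bump to a MAJOR.MINOR.PATCH string."""
    major, minor, patch = (int(part) for part in version.split("."))
    if level == "major":
        return f"{major + 1}.0.0"
    if level == "minor":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"


def unmet_expectations(manifest: PromptManifest, required: set[str]) -> set[str]:
    """CDC-style check: fields a registered consumer needs that the manifest lacks."""
    return set(required) - manifest.output_fields


old = PromptManifest("1.4.2", frozenset({"summary", "confidence"}), frozenset({"search"}))
new = PromptManifest("1.4.2", frozenset({"summary"}), frozenset({"search"}))

level = classify_change(old, new)
print(level, bump(old.version, level))
print(unmet_expectations(new, {"summary", "confidence"}))
```

Running the consumer check in CI before a prompt ships is what makes the contract consumer-driven: the downstream agent's registered needs, not the prompt author's judgment, decide whether a change is breaking.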