Roughly nine in ten skill files fail at least one of five basic checks. The body is rarely the problem. The description is: that 100-token blurb is the only thing the agent reads when deciding whether to load your skill. Engineer it, or stay invisible.
Why the description field — not the body — is the only selection signal the agent sees at startup
How to write descriptions that survive user phrasing variation and competing skill pressure
Three-tier progressive disclosure: metadata, body, references — and how to tier them correctly
Ambiguity policy design: clarify vs. assume, decided per action class in advance
Output format contracts: serving the human reader and the downstream tool at once
Skill shadowing: the performance cliff that appears when skill libraries grow past ~20 skills
Hook exit code semantics: why exit 1 is a silent non-enforcer and exit 2 is the only real block
Pre-ship test protocol: a number to ship against, not a feeling
Your skill works on your machine. With your phrasing. On the one task you tested. A teammate tries it with different wording and the skill never fires. Or it fires when it shouldn't. Or it runs and emits output that looks fine and quietly breaks the next step in the pipeline.
That's the default state. Around 90% of skill files in shared repos and team configs fail at least one of five basic checks. The usual failures: a vague description, a body past 500 lines, reference files nested more than a level deep, no documented ambiguity strategy, no pre-ship test protocol. The official docs cover the file format. They spend almost no time on the engineering decisions that determine whether the skill ever runs in production.[1]
One field controls all the others. The description. Get it wrong and the body never loads — the agent didn't pick your skill, and nothing inside it ever ran. The rest of this guide covers what flows from that fact: metadata engineering for reliable triggering, three-tier progressive disclosure for token discipline, an ambiguity policy that survives real users, output formats that serve humans and the next tool in the chain, and a test protocol that gives you a number to ship against instead of a feeling.
Selection happens before the body exists in the agent's world. Lose that round and nothing else runs.
The docs say it. They don't dwell on it. The description is the only thing the agent reads when deciding whether to load your skill.[1] The body, the instructions, the templates, the validation scripts — none of it exists in the agent's context until after the description has already won or lost.
At startup, every installed skill contributes roughly 100 tokens of metadata to the system prompt. Every one of them competes inside that shared budget. The agent scans the descriptions and picks the best match for the current task. A vague description loses to a precise competitor. A description with no trigger vocabulary loses to one that mirrors the user's actual phrasing. A first-person blurb loses to a third-person one because descriptions are injected into the system prompt, where a switch in point of view muddies the match.[2]
There's no algorithmic routing. No embeddings, no intent classifier, no pattern matching at the code level.[7] The system formats all available skill descriptions into the Skill tool's context and lets the model make the call. The model is reading descriptions the same way it reads everything else — as text. Write the description for a reader, not a regex.
Think of a library card catalog. Nobody reads the book to decide whether to pull it off the shelf. The catalog card decides. Write the card.
"Helps with documents" — matches every task and none. Loses every selection round.
"I can process PDFs for you" — first person collides with the system prompt's voice and degrades selection accuracy
"Processes data" — no trigger vocabulary, no when. The agent has nothing to match against
"Useful tool for developers" — names the audience, not the capability. Audience is not a trigger
"Extracts text and tables from PDF files, fills forms, merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction."
"Generates descriptive commit messages by analyzing git diffs. Use when the user asks for help writing commit messages or reviewing staged changes."
"Analyzes Excel spreadsheets, creates pivot tables, generates charts. Use when analyzing Excel files, spreadsheets, tabular data, or .xlsx files."
"Deploys the application to production via the CI pipeline. Use when the user says deploy, ship, release, or push to prod."
"Generates commit messages" tells the agent the capability. "Use when reviewing staged changes or the user asks for commit help" tells it the trigger condition. The agent matches on both. Drop either and you lose half the selection signal.
If your users say "deploy" and your description says "provision infrastructure," the skill is invisible to half of them. The author's vocabulary is the wrong vocabulary. Use the words that arrive in the chat box.
Longer descriptions eat more of the shared metadata budget and risk truncation. The constraint is not stylistic — it is the budget every other installed skill is also pulling from. Pack maximum signal into minimum space.
The body is loaded after selection, not before. Any 'when to use this skill' line buried in the body had zero effect on whether the skill ran. Trigger conditions in the body are dead text.
Research consistently finds agents err on the side of not using skills when unsure.[12] An under-specified description means the skill won't trigger when it should. "For reports" reliably loses to "Use when the user asks to generate a performance summary, weekly report, or dashboard export." Being explicit about edge cases and adjacent phrasing pays off.
Authors test with their own phrasing because it's the phrasing they have. Confirmation bias built into the harness. Recruit a teammate, capture their natural request, run it cold. The gap between author phrasing and user phrasing is where the activation rate dies.
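The description rules above lend themselves to a quick lint pass before any eval runs. What follows is a minimal sketch, not an official validator: the function name `lint_description`, the word lists, and the 400-character ceiling are all assumptions chosen for illustration.

```python
import re

# Hypothetical thresholds and word lists for illustration; tune them to your own library.
MAX_DESCRIPTION_CHARS = 400
FIRST_PERSON = re.compile(r"\b(I|[Ww]e|[Mm]y|[Oo]ur)\b")
TRIGGER_CLAUSE = re.compile(r"\buse (this )?(skill )?when\b", re.IGNORECASE)
VAGUE_OPENERS = ("helps with", "useful tool", "processes data")


def lint_description(description: str) -> list[str]:
    """Return the problems found in a skill description (an empty list means it passed)."""
    text = description.strip()
    if not text:
        return ["description is empty"]
    problems = []
    if FIRST_PERSON.search(text):
        problems.append("first person: rewrite in third person")
    if not TRIGGER_CLAUSE.search(text):
        problems.append("no trigger clause: add a 'Use when ...' sentence")
    if text.lower().startswith(VAGUE_OPENERS):
        problems.append("vague opener: name the concrete capability")
    if len(text) > MAX_DESCRIPTION_CHARS:
        problems.append(f"too long: {len(text)} chars > {MAX_DESCRIPTION_CHARS}")
    return problems
```

A lint pass only catches the structural misses. Whether the vocabulary matches what teammates actually type still has to be measured with real queries.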
Performance degrades as skill libraries grow. Skill shadowing is the failure mode nobody talks about until their tenth production incident.
Add 20 skills to a team's Claude Code setup and something counterintuitive happens: selection gets worse, not just slower. Recent research studying agent skill libraries at scale found performance degrading by up to 21% when scaling from a small set of focused skills to a 202-skill library.[12] The model starts having trouble determining which skill is worth loading, and potentially helpful skills go unused.
The mechanism is called skill shadowing: when two or more skills have overlapping description vocabulary, they compete for the same trigger queries and split the selection signal. The agent loads skill A when B was needed, loads B when A was needed, or loads neither. All three outcomes are wrong.
The fix isn't fewer skills. It's cleaner description boundaries.
| Symptom | Root cause | Fix |
|---|---|---|
| Skill A triggers when Skill B should, and vice versa | Overlapping trigger vocabulary — both descriptions match the same user phrasing | Differentiate via domain nouns: "Excel .xlsx files" vs "CSV flat files" beats two vague "spreadsheet" descriptions |
| Neither skill triggers on queries both were supposed to catch | Competing descriptions split the signal — model is uncertain, picks neither | Merge the two skills if they serve the same workflow, or make one explicitly call the other |
| New skill installed, existing high-performing skill stops triggering as reliably | New skill's description shadows the existing one's vocabulary | Audit descriptions when adding skills — search for vocabulary overlap before shipping |
| Skill activates correctly in isolation, fails in the team config with 30 other skills | Skill library competition — works when it's the only candidate, loses against a more precisely worded competitor | Test descriptions against the full installed skill set, not in isolation |
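
The "search for vocabulary overlap before shipping" fix in the table can be approximated with a word-set comparison. This is a rough sketch under stated assumptions: the stopword list, the Jaccard measure, and the 0.3 default threshold are illustrative choices, and the real selection is made by the model reading text, not by word sets.

```python
import re
from itertools import combinations

# Words too common to signal a domain; a hypothetical starter list.
STOPWORDS = {"a", "an", "and", "the", "or", "to", "of", "for", "from", "in", "on",
             "with", "when", "use", "user", "asks", "mentions", "by", "is"}


def vocabulary(description: str) -> set[str]:
    """Lowercased content words of a description, with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", description.lower())) - STOPWORDS


def overlap_report(descriptions: dict[str, str], threshold: float = 0.3):
    """Yield (skill_a, skill_b, jaccard, shared_words) for every pair at or above the threshold."""
    vocab = {name: vocabulary(text) for name, text in descriptions.items()}
    for a, b in combinations(sorted(vocab), 2):
        union = vocab[a] | vocab[b]
        shared = vocab[a] & vocab[b]
        score = len(shared) / len(union) if union else 0.0
        if score >= threshold:
            yield a, b, round(score, 2), sorted(shared)
```

Pairs that show up in the report are candidates for sharper domain nouns or a merge. A clean report does not prove the absence of shadowing; it only removes the obvious collisions.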
Every token spent on a skill is a token unavailable for reasoning. Tier the load.
The context window is a shared resource. Your skill competes with the system prompt, conversation history, every other installed skill's metadata, and the actual user request. Every token spent on skill content is a token taken from reasoning.[6]
Production skills load in three tiers, on demand. Tier collapse — pulling everything in upfront — is the failure mode.
Tier 1 — Metadata. Always loaded. ~100 tokens per skill. The name and description from the YAML frontmatter. This is the only fragment the agent sees at selection time.[1]
Tier 2 — Entry point. Loaded when the skill wins selection. The markdown body of SKILL.md. Target under 500 lines. Holds the workflow, decisions, pointers to references. Past 500 lines and the agent starts reading partial chunks and missing context.
Tier 3 — References. Loaded on demand. Templates, examples, API docs, scripts. A skill with 2,000 lines of API reference in a Tier 3 file costs zero tokens on every task that does not touch the API. That's the leverage point.
Note: when using skills through the Claude Agent SDK rather than the CLI, allowed-tools frontmatter is ignored — tool access is controlled via the allowedTools option on query(). The SDK loads skills via settingSources (user and project), not via the frontmatter fields the CLI reads.[13]
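The Tier 1 size and the Tier 2 line cap can be checked mechanically. A minimal sketch, assuming SKILL.md opens with frontmatter between two `---` lines; the function names are hypothetical, and the four-characters-per-token figure is a crude estimate, not a tokenizer.

```python
def split_skill_file(text: str) -> tuple[str, str]:
    """Split SKILL.md text into (frontmatter, body). Frontmatter sits between leading '---' lines."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    return "", text


def tier_report(text: str, max_body_lines: int = 500) -> dict:
    """Summarize how much of a skill file lands in Tier 1 and Tier 2."""
    frontmatter, body = split_skill_file(text)
    body_lines = len(body.splitlines())
    return {
        # Crude estimate: about four characters per token. An assumption, not a tokenizer.
        "tier1_tokens_estimate": len(frontmatter) // 4,
        "tier2_body_lines": body_lines,
        "tier2_within_cap": body_lines <= max_body_lines,
    }
```

A body that fails the cap is usually a sign that reference material belongs in Tier 3 files.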
my-skill/
├── SKILL.md
│ ├── # Frontmatter — Tier 1, always loaded
│ └── # Body — Tier 2, loaded on selection, capped at 500 lines
├── reference/
│ ├── api-docs.md
│ ├── schema-definitions.md
│ └── troubleshooting.md
├── templates/
│ ├── output-template.md
│ └── report-format.json
├── examples/
│ ├── simple-case.md
│ └── complex-case.md
└── scripts/
├── validate.py
└── transform.sh

Asking too much makes the skill annoying. Assuming too aggressively destroys output. Both kill adoption.
Real users don't write specifications. They say "fix the tests" when they mean the three failing integration tests in the auth module. They say "make a report" when they mean a quarterly revenue breakdown by region, in markdown.
The skill needs a written ambiguity policy. Two failure modes pull in opposite directions. Always-ask kills the experience — the skill feels slow and pedantic, users stop reaching for it. Always-assume produces wrong output that takes longer to repair than to redo from scratch. The fix is not a single rule. It is a per-class-of-action decision documented in the skill itself, so the agent doesn't re-derive it under pressure.
| Action class | Default | Why |
|---|---|---|
| Destructive (deploy, delete, overwrite) | Always confirm before acting | The action is irreversible. The cost of a wrong assumption is hours of recovery, sometimes more. |
| Output format unspecified | Pick a default. State the assumption. | Format preferences correct in seconds. Blocking on format choice wastes the user's turn. |
| Scope unclear ("fix the tests") | Take the narrowest reasonable scope. Report what you did. | Fix the three failing tests and surface what was done. Asking which tests when the answer is obvious from context wastes a round trip. |
| Multiple valid approaches | Pick the conventional one. Name the choice. | The user did not come for a methodology debate. Naming the choice makes it easy to reverse. |
| Missing required parameter | Suggest the most likely value, ask for confirmation | "Did you mean the staging environment?" beats "Which environment?" The first is a one-word answer. The second restarts the conversation. |
| Domain terminology | Use the project's glossary. Do not negotiate it. | If the codebase calls it a "widget," use that. Renaming on the fly creates drift between the skill output and the rest of the system. |
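
The policy itself lives in the skill's prose, but the same table can be written down as data, which is handy when a validation script or a reviewer needs to check that every action class has a documented default. This is an illustrative sketch: the key names and the `resolve` helper are hypothetical, and falling back to confirmation for unknown classes is an assumption, not something the table prescribes.

```python
from enum import Enum


class Response(Enum):
    CONFIRM_FIRST = "confirm before acting"
    ASSUME_AND_STATE = "pick a default and state the assumption"
    NARROW_AND_REPORT = "take the narrowest scope and report what was done"
    SUGGEST_AND_CONFIRM = "suggest the likely value and ask for confirmation"
    USE_GLOSSARY = "use the project's term as-is"


# Keys are hypothetical labels for the action classes in the table above.
AMBIGUITY_POLICY = {
    "destructive": Response.CONFIRM_FIRST,
    "format_unspecified": Response.ASSUME_AND_STATE,
    "scope_unclear": Response.NARROW_AND_REPORT,
    "multiple_approaches": Response.ASSUME_AND_STATE,
    "missing_parameter": Response.SUGGEST_AND_CONFIRM,
    "domain_terminology": Response.USE_GLOSSARY,
}


def resolve(action_class: str) -> Response:
    """Look up the documented default; unknown classes fall back to the most cautious option."""
    return AMBIGUITY_POLICY.get(action_class, Response.CONFIRM_FIRST)
```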
Human reader makes a decision. Downstream tool ingests the result. Optimize for one and the other becomes copy-paste glue.
Skill output is rarely the final destination. It flows into two consumers: a human deciding the next move, and a downstream tool that needs to parse what came out. Pick one and the other pays a tax.
Human-formatted output the next tool can't parse forces manual copy-paste — every chained workflow grinds. Machine-formatted output the human can't scan forces a formatter pass before any decision. The right shape serves both. The pattern below is plan-validate-execute: the skill writes a structured plan a human can read at a glance, then validates it, then applies it. The intermediate artifact is the contract.
Lead with the answer. The first line is the conclusion, not the reasoning chain.
Use structured formatting — headers, lists, tables — so the eye finds what it needs without reading every word.
Mark what changed, what was skipped, what needs attention. The reader is looking for exceptions.
Keep status messages consistent. Always "3 files updated, 1 skipped" — never sometimes "Updated 3 files" and sometimes "Done." Drift in status format breaks scripts and reader trust.
Emit structured data — JSON, YAML — for any output another tool will read.
Use plan-validate-execute. Write a changes.json, validate it, then apply. The plan is the audit trail.
Keep stdout clean. Diagnostics go to stderr or a log file. Mixing them poisons every parser downstream.
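Return exit codes that mean something: 0 for success, 1 for handled errors, 2 for unrecoverable failures. The next tool reads the exit code, so a script that exits 0 on every path hides its failures. This convention covers the skill's own scripts; hook exit codes follow separate rules.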
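The plan-validate-execute pattern can be sketched in a few lines. Everything named here is an assumption for illustration: the plan shape (a `changes` list of `action` and `path` entries), the two action names, and the `apply_change` callback. The point is the ordering: nothing is applied until the whole plan validates, diagnostics go to stderr, and stdout carries one machine-readable status line.

```python
import json
import sys

ALLOWED_ACTIONS = {"update", "skip"}  # hypothetical action names for this sketch


def validate_plan(plan) -> list[str]:
    """Return validation errors for a changes plan; an empty list means it is safe to apply."""
    changes = plan.get("changes") if isinstance(plan, dict) else None
    if not isinstance(changes, list) or not changes:
        return ["plan must contain a non-empty 'changes' list"]
    errors = []
    for i, change in enumerate(changes):
        if not isinstance(change, dict):
            errors.append(f"changes[{i}]: expected an object")
            continue
        if change.get("action") not in ALLOWED_ACTIONS:
            errors.append(f"changes[{i}]: unknown action {change.get('action')!r}")
        if not change.get("path"):
            errors.append(f"changes[{i}]: missing path")
    return errors


def run(plan_json: str, apply_change=lambda change: None) -> int:
    """Validate, then apply. Returns the exit code the caller should pass to sys.exit."""
    try:
        plan = json.loads(plan_json)
    except json.JSONDecodeError as exc:
        print(f"unreadable plan: {exc}", file=sys.stderr)
        return 2  # unrecoverable: nothing can be applied
    errors = validate_plan(plan)
    if errors:
        print("\n".join(errors), file=sys.stderr)  # diagnostics stay off stdout
        return 1  # handled error: the plan was rejected before any change was made
    updated = skipped = 0
    for change in plan["changes"]:
        if change["action"] == "update":
            apply_change(change)
            updated += 1
        else:
            skipped += 1
    # One consistent, machine-readable status line on stdout.
    print(json.dumps({"updated": updated, "skipped": skipped}))
    return 0
```

Because the plan is a file, it doubles as the audit trail: a reviewer can read it before the apply step, and a failed run leaves evidence of what was about to happen.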
Migrations are a narrow bridge with cliffs. Code reviews are open terrain. Mixing the constraint level kills both.
Common failure: every instruction written as an absolute rule. Different parts of a skill need different constraint levels. Database migrations need exact commands. Code reviews need direction, not a script. Over-constrain the flexible work and the agent ignores context that matters. Under-constrain the fragile work and the agent reasons toward something that breaks production.[6]
A useful mental model. Some tasks are a narrow bridge with cliffs on both sides — one safe path, deviation means failure, write the exact command. Other tasks are an open field — many paths reach the goal, the agent's judgment about which route to take is usually better than the author's hardcoded choice. Match the constraint to the terrain. Mismatch is where rigidity meets recklessness.
A corollary: the most elaborate skill files are often the least reliable. Teams that pour effort into exhaustive instructions discover the skill is too rigid for the variety of inputs it actually meets. The skills that survive in production are narrowly scoped with sparse instructions — one thing, done clearly, no attempt to cover every adjacent case. Scope discipline beats coverage ambition.
The decision to build a skill is the first place to make the wrong call. If the agent already handles the task, the skill is decoration.
| Situation | Right choice | Why |
|---|---|---|
| You keep pasting the same 10-line instruction block into chat | Skill (Tier 2 body) | The repetition is the signal. Extract it. The body loads only when needed — zero startup cost when the task is irrelevant. |
| A section of CLAUDE.md has grown into a multi-step procedure | Skill (move it out) | CLAUDE.md is always in context. A procedure that belongs in one task type doesn't need to burn tokens on every other task. |
| A fact or constraint applies globally across all tasks | CLAUDE.md entry, not a skill | CLAUDE.md is the right place for global context. A skill that merely states a policy will often not be selected at all. |
| The agent already handles the task correctly without any intervention | No abstraction — test first | Writing a skill for a problem that doesn't exist adds selection noise and description budget overhead. Run the baseline eval first. |
| The task involves highly volatile external APIs or frequently updated docs | Tier 3 reference files, updated independently | Baking volatile information into the skill body means every API change requires a skill edit. Keep volatile content in reference files the body can point to. |
| The workflow spans multiple distinct domains (e.g., code review + deployment) | Two separate skills, invoked in sequence | One skill trying to cover two domains will have a blurred description that wins selection inconsistently for both. |
Every field has a failure mode when omitted or misused. Here's the audit.
| Field | When to use | Failure mode |
|---|---|---|
| name | Always. Becomes the slash command. Lowercase. Hyphens only. | Spaces or uppercase break invocation. Vague names like "helper" never get reached for. |
| description | Always. The trigger surface. Carry both what and when. | First person. No trigger vocabulary. Either one and the skill is invisible. |
| disable-model-invocation | Side-effect skills: deploy, delete, send messages, mutate state. | Left off for destructive operations. The agent now auto-triggers a deletion. This is how skills delete production. |
| user-invocable | Set to false for background knowledge skills the user should not call directly. | Hidden from users who should be able to call it. Discoverability dies in a config. |
| allowed-tools | Skill needs specific tools without per-call approval friction. CLI only — ignored by the SDK. | Bash(*) when Bash(git *) would have done. Permission scope wider than function. Drift starts here. SDK users must use allowedTools on query() instead. |
| context | Set to fork when the skill should run in isolation from the parent context. | Forking a skill that only emits reference content. Pointless context split, no actionable task. |
| agent | A specific subagent type (Explore, Plan) fits the task better than general-purpose. | Research tasks running general-purpose when Explore's read-only toolset would have prevented half the failure surface. |
| mode | Set to true for skills that modify Claude's behavior or context globally (e.g., a "verbose debugging mode"). Appears in a dedicated Mode Commands section in the skill list. | Used on skills that do narrow tasks — it groups them in the wrong place and can confuse selection. |
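
Several of the failure modes in the table can be caught by a script before review. A minimal sketch, assuming the frontmatter has already been parsed into a dict by whatever YAML loader the project uses; `audit_frontmatter` and the list of side-effect verbs are hypothetical, and allowing digits in the name is an assumption beyond the "lowercase, hyphens only" rule stated above.

```python
import re

NAME_PATTERN = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")
# Hypothetical list of verbs that suggest side effects; extend it for your own domain.
SIDE_EFFECT_WORDS = ("deploy", "delete", "send", "overwrite", "drop")


def audit_frontmatter(meta: dict) -> list[str]:
    """Flag failure modes from the table for an already-parsed frontmatter dict."""
    findings = []
    if not NAME_PATTERN.match(str(meta.get("name", ""))):
        findings.append("name: use lowercase words separated by hyphens")
    description = str(meta.get("description", ""))
    if not description:
        findings.append("description: missing, so the skill has no trigger surface")
    if "Bash(*)" in str(meta.get("allowed-tools", "")):
        findings.append("allowed-tools: Bash(*) is wider than most skills need")
    has_side_effects = any(word in description.lower() for word in SIDE_EFFECT_WORDS)
    if has_side_effects and not meta.get("disable-model-invocation"):
        findings.append("disable-model-invocation: set it for side-effect skills")
    return findings
```

The keyword match for side effects is deliberately blunt. It is meant to prompt a human decision, not to replace one.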
First-time plugin authors write hooks that feel like enforcement but aren't. Exit code 1 is not a block.
Claude hooks communicate via exit code. The semantics are not what Unix tradition suggests.
Exit code 0 means proceed normally. Any JSON output on stdout is parsed for structured control fields. Exit code 2 is the only code that blocks — it tells Claude Code to halt the current action, and the content on stderr gets fed back to Claude as an error message to reason about. Any other exit code, including 1, logs an error and lets the tool through anyway.[14]
Every first-time plugin author writes a PreToolUse hook, tests it, sees "BLOCKED" in the log, then watches Claude execute the call regardless. Because 1 is not block. Only 2 is. The hook ran. The enforcement did not.
A second sharp edge: PostToolUse cannot block anything. The tool already ran. Exit code 2 on a PostToolUse hook shows stderr to Claude — it does not undo what executed. Use Pre for guards. Use Post for cleanup and context injection.
| Exit code | Behavior | When to use |
|---|---|---|
| 0 | Proceed. JSON on stdout is parsed for control fields (permissionDecision, additionalContext, etc.). | Normal flow. Optionally emit structured JSON to influence behavior. |
| 2 | Block (PreToolUse, PermissionRequest, UserPromptSubmit, and others). Stderr fed to Claude as error context. | Policy enforcement. Any hook meant to stop an action must use exit 2, not exit 1. |
| 1 (or any other non-zero) | Non-blocking error. Logs to transcript and debug log. Execution continues. | Unexpected failures you want to surface without halting the workflow. Not for policy enforcement. |
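
A PreToolUse guard that actually blocks might look like the sketch below. The forbidden fragments are a made-up policy, and the payload field names (`tool_name`, `tool_input`, `command`) are assumptions about the JSON the hook receives on stdin; check them against the hooks reference for your version before relying on them.

```python
import json
import sys

# Hypothetical policy for this sketch: refuse shell commands containing these fragments.
FORBIDDEN_FRAGMENTS = ("rm -rf", "git push --force")


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, stderr_message) for a PreToolUse event."""
    if event.get("tool_name") != "Bash":
        return 0, ""
    command = event.get("tool_input", {}).get("command", "")
    for fragment in FORBIDDEN_FRAGMENTS:
        if fragment in command:
            # Exit 2 is the blocking code; the message is what Claude gets to reason about.
            return 2, f"Blocked by policy: command contains '{fragment}'."
    return 0, ""


def main() -> int:
    try:
        event = json.load(sys.stdin)
    except json.JSONDecodeError:
        # Decide deliberately what a broken payload means. Here it blocks; returning 1
        # instead would log an error and let the tool call proceed.
        print("Hook received unreadable input.", file=sys.stderr)
        return 2
    code, message = decide(event)
    if message:
        print(message, file=sys.stderr)
    return code

# As a hook script, finish the file with: sys.exit(main())
```

Keeping the decision in a pure function makes the hook testable without Claude Code in the loop. The only job of `main` is to translate that decision into the exit code and stderr text the hook contract expects.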
Five gates between draft and production. Skip any of them and the activation rate is whatever the universe gives you.
Writing evals before writing extensive documentation is the highest-leverage move in skill engineering.[4] It forces the question every author wants to skip — does the skill solve a real problem or an imagined one. If the agent already handles the task without the skill, you don't need the skill. The eval surfaces that fact in minutes instead of months.
The protocol below is two halves. Automated checks catch structural problems — descriptions that fail selection, files past the line cap, references nested too deep. Observational testing catches behavioral problems the harness can't see — the skill firing on adjacent tasks, ignoring a reference file, re-reading the same section four times because the body is structured wrong. Run both. One without the other ships blind.
Pick three concrete tasks the skill is supposed to handle. Run the agent on those tasks with no skill loaded. Document the specific failures. That's the baseline. If the agent already succeeds without the skill, the skill is decoration — cut it before you write the body.
The description is the single largest failure point. Test at least five phrasings, including phrasings from people who do not know the skill exists. They say 'make a report,' not 'invoke the report-generator skill.' If your eval set only contains phrasings the author would use, the activation rate measurement is theater.
A skill that works perfectly when it's the only installed skill may lose selection against a more precisely worded competitor in the team config with 25 other skills. Always run trigger evals against the full skill library the skill will actually compete in.
Opus over-interprets verbose instructions. Haiku under-interprets sparse ones. A skill tuned to one model and deployed across all of them is going to misfire somewhere. Pin the matrix.
Watch how the agent actually uses the skill in practice. The order of file reads tells you whether the structure matches the workflow. Files the agent never opens are dead weight. Sections re-read repeatedly are candidates for promotion to SKILL.md — the agent is signaling that the content belongs higher in the load order.
Use one Claude instance to author the skill. Use a separate instance to test it on real tasks. The author instance knows what it intended; the test instance reveals what is actually missing. Iterate between them until the test instance handles every scenario without intervention.
After shipping 54 plugins in skillstack, one metric predicts adoption better than any other: activation rate — the percentage of relevant user queries that actually trigger the skill. A careful first author lands around 8/10. Not a failure. A starting point. The gap from 8/10 to 10/10 closes through iteration on the description field, never the body.
The loop: 10-15 positive trigger queries, 5-10 negatives, run them against the model, count the misses, rewrite the description to close each specific gap, run again. plugin-dev's run_eval.py collapses that into one command. Without the harness, every description edit is a coin flip — you cannot tell whether the change helped, hurt, or did nothing.
Author tests with author phrasing — confirmation bias is the harness
Teammates discover the failures in production, not pre-ship
Silent misses: skill never fires, user concludes it does not exist
No baseline to grade description edits against — every change is a guess
Diverse trigger queries catch the phrasings the author would never have written
Activation rate score gives a concrete target to push against — 8/10, then 9/10, then 10/10
Negative queries prove the skill stays out of unrelated tasks
Every description edit has a measurable delta — iterate with evidence, not vibes
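The scoring half of that loop is small enough to sketch. This is not the `run_eval.py` mentioned above, whose interface is not shown here; `activation_report` and the `skill_fired` callback are hypothetical names. The callback stands in for whatever runs the agent on a query and reports whether the skill loaded.

```python
from typing import Callable


def activation_report(
    skill_fired: Callable[[str], bool],
    positives: list[str],
    negatives: list[str],
) -> dict:
    """Score a description against queries that should and should not trigger the skill."""
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative query")
    misses = [q for q in positives if not skill_fired(q)]
    false_fires = [q for q in negatives if skill_fired(q)]
    return {
        "activation_rate": (len(positives) - len(misses)) / len(positives),
        "false_fire_rate": len(false_fires) / len(negatives),
        "misses": misses,            # each miss is a phrasing the description must absorb
        "false_fires": false_fires,  # each false fire is vocabulary to narrow
    }
```

The lists matter more than the rates. A rate says how far there is to go; the misses say which sentence of the description to rewrite next.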
A single SKILL.md is a function. A plugin — skills, hooks, MCP extensions, scripts — is a system. The threshold is real. A new failure class lives on the other side. The most common one costs hours the first time it hits.
Claude hooks communicate via exit code. Exit code 0 means proceed. Exit code 2 means block the tool call. Exit code 1 logs an error and lets the tool through anyway. Every first-time plugin author writes a PreToolUse hook, tests it, sees "BLOCKED" in the log, watches Claude execute the call regardless. Because 1 is not block. Only 2 is. The hook ran. The enforcement did not.[14]
Ideation — frame the problem, scope it, write the success criteria before any code
Research — survey prior art, competing skills, the pattern library you should not reinvent
Architecture — skill graph, hook design, MCP surface decided up front, not retrofitted
Composition — write each skill using the skill-foundry patterns
Hooks — PreToolUse and PostToolUse with the correct exit codes (2 blocks, not 1)
Validation — schema check, integration tests, edge cases run before merge
Evaluation — activation rate plus output quality, both with numeric thresholds
Documentation — README, examples, changelog as part of shipping, not after
Reviewed across hundreds of skill files. These are the failure modes that ship in the most polished-looking work.
Trigger conditions buried in the markdown body
The body loads after selection. Any "when to use this skill" guidance inside the body had zero effect on whether the skill ran. The selection decision was already made on the description alone. Move every trigger condition into the description field. The body is for after the skill has already won.
Listing options without picking a default
"You can use pypdf, or pdfplumber, or PyMuPDF, or pdf2image" — the agent picks arbitrarily, and that arbitrary pick may not match what the team standardized on. Name one default. Mention alternatives only for specific edge cases. "Use pdfplumber for text extraction. For scanned PDFs requiring OCR, switch to pdf2image with pytesseract." One default, narrow exceptions, no debate.
Reference chains: SKILL.md to file A to file B to file C
When the agent follows a chain of references, it starts using partial reads on deeply nested files: head -100, mid-file skips, dropped sections. Critical content goes missing because it lived two hops from the entry point. Link every reference file directly from SKILL.md. If reference-a.md needs something from reference-b.md, link both from SKILL.md instead.
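Reference chains are easy to detect once the files are in memory. A simplified sketch: `nested_references` is a hypothetical helper, it only recognizes markdown links to `.md` files, and it assumes every link is written relative to the skill root, which real skill folders may not do.

```python
import re

LINK = re.compile(r"\]\(([^)#\s]+\.md)\)")  # markdown links to .md files


def nested_references(files: dict[str, str], entry: str = "SKILL.md") -> dict[str, list[str]]:
    """Map each file that the entry point does not link directly to the files that do link it."""
    direct = set(LINK.findall(files.get(entry, "")))
    nested: dict[str, list[str]] = {}
    for path, text in files.items():
        if path == entry:
            continue
        for target in LINK.findall(text):
            if target not in direct and target != entry:
                nested.setdefault(target, []).append(path)
    return nested
```

An empty result means every linked reference is one hop from SKILL.md. Anything else names the file to promote and the file that was hiding it.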
Voodoo constants in scripts
A TIMEOUT = 47 or MAX_RETRIES = 5 without justification is a maintenance trap. The agent cannot reason about whether the value fits the current situation, so it either leaves the trap intact or rewrites it on a guess. Document the reasoning beside the value, for example next to MAX_RETRIES = 3: "Three retries balance reliability against latency; most intermittent failures resolve by the second retry." The constant becomes legible. Drift becomes auditable.
Writing for one model, deploying against all of them
Instructions that land cleanly for Opus may confuse Haiku. Instructions explicit enough for Haiku may push Opus into over-literal interpretation. The matrix is real. Test every model in the deployment target, and tune the level of detail per tier. Author bias toward your daily-driver model is the default failure mode here.
Does the allowed-tools frontmatter field work when using the Claude Agent SDK?
No. The allowed-tools frontmatter field is only enforced by the Claude Code CLI. When running skills through the SDK, tool access is controlled via the allowedTools option on query(). Any allowed-tools lines in your SKILL.md frontmatter are silently ignored in SDK contexts. Define tool permissions in your SDK configuration, not in the skill file, when deploying to SDK-based applications.
How do I know if my skill is being shadowed by another skill in the library?
The diagnostic is simple: run your positive trigger queries with only your skill installed, then run the same queries with the full team skill library installed. If the activation rate drops in the full config, a competing skill is shadowing yours. Fix: diff the descriptions, find the overlapping vocabulary, and sharpen the domain specificity. "Spreadsheet" loses to "Excel .xlsx" and "CSV flat files" as distinct, non-overlapping terms.
The patterns above are not abstract. They are the patterns I extracted while building skillstack. Three artifacts implement them directly. One install adds the whole stack.
Skill authoring framework with a numeric quality gate, not a feeling
47 reference files, 25 utility scripts, 17 templates, 23 worked examples. Includes analyze_skill.py — run it on any SKILL.md and get a 0–100 quality score before shipping.
/plugin marketplace add viktorbezdek/skillstack
/plugin install skill-foundry@skillstack
Claude Code plugin authoring toolkit: ideation through evaluation, with numbers
8 skills, 4 scripts, the full plugin lifecycle. 109 trigger evals plus 24 output evals across 133 test cases. Activation rate stops being a guess.
/plugin marketplace add viktorbezdek/skillstack
/plugin install plugin-dev@skillstack
Part of skillstack-workflows
6-phase gated workflow from idea to shipped plugin — gates, not vibes
Orchestrates plugin-ideation → plugin-research → plugin-architecture → skill-foundry → plugin-validation → plugin-evaluation. Each phase has an exit gate. You do not advance until the previous gate clears.
/plugin marketplace add viktorbezdek/skillstack
/plugin install skillstack-workflows@skillstack
Every rule from this guide working together. Read it as the audit, not as inspiration.
Every principle from this guide is in that file. The description carries both what (generates a changelog) and when (user asks for changelog, release notes, what changed, what shipped). The workflow takes the narrowest reasonable interpretation when input is ambiguous. The output format reads cleanly for the human and parses cleanly for the next tool. And the template loads only when the workflow reaches the render step — Tier 3 reference, zero token cost on every other path.[3]
The test that matters is not whether the skill works when you type /generate-changelog v2.0.0 v2.1.0. It is whether it fires when a teammate types "what shipped this week" into the chat without knowing the skill exists. Design for that phrasing. Test for it. The activation rate is the only number that says yes.
The skills that get used in production are not the ones with the most thorough instructions. They are the ones whose descriptions win selection against everything else competing for that 100-token slot — including the 20 other skills in your team's config. Engineer the description. Keep the library small and the boundaries sharp. Everything else is downstream of those two decisions.