Your skill works on your machine. With your phrasing. On the one task you tested. A teammate tries it with different wording and the skill never fires. Or it fires when it shouldn't. Or it runs and emits output that looks fine and quietly breaks the next step in the pipeline.
That's the default state. Around 90% of skill files in shared repos and team configs fail at least one of five basic checks. The failures: a vague description, a body past 500 lines, reference files nested more than one level deep, no documented ambiguity strategy, no pre-ship test protocol. The official docs cover the file format. They spend almost no time on the engineering decisions that decide whether the skill ever runs in production.[1]
One field controls all the others. The description. Get it wrong and the body never loads — the agent didn't pick your skill, and nothing inside it ever ran. The rest of this guide covers what flows from that fact: metadata engineering for reliable triggering, three-tier progressive disclosure for token discipline, an ambiguity policy that survives real users, output formats that serve humans and the next tool in the chain, and a test protocol that gives you a number to ship against instead of a feeling.
The Description Is the Product. The Body Is Inventory.
Selection happens before the body exists in the agent's world. Lose that round and nothing else runs.
The docs say it. They don't dwell on it. The description is the only thing the agent reads when deciding whether to load your skill.[1] The body, the instructions, the templates, the validation scripts — none of it exists in the agent's context until after the description has already won or lost.
At startup, every installed skill contributes roughly 100 tokens of metadata to the system prompt. With hundreds of skills installed, every description competes inside that budget. The agent scans the descriptions and picks the best match for the current task. A vague description loses to a precise competitor. Missing trigger vocabulary loses to a description that mirrors the user's actual phrasing. A first-person blurb loses to a third-person one because the description is injected into the system prompt, where an inconsistent point of view degrades selection.[2]
Library card catalog. Nobody reads the book to decide whether to pull it off the shelf. The catalog card decides. Write the card.
Descriptions that lose selection:
- "Helps with documents": matches every task and none. Loses every selection round.
- "I can process PDFs for you": first person collides with the system prompt's voice and degrades selection accuracy.
- "Processes data": no trigger vocabulary, no when. The agent has nothing to match against.
- "Useful tool for developers": names the audience, not the capability. Audience is not a trigger.
Descriptions that win selection:
- "Extracts text and tables from PDF files, fills forms, merges documents. Use when working with PDF files or when the user mentions PDFs, forms, or document extraction."
- "Generates descriptive commit messages by analyzing git diffs. Use when the user asks for help writing commit messages or reviewing staged changes."
- "Analyzes Excel spreadsheets, creates pivot tables, generates charts. Use when analyzing Excel files, spreadsheets, tabular data, or .xlsx files."
- "Deploys the application to production via the CI pipeline. Use when the user says deploy, ship, release, or push to prod."
Description Field: Hard Rules
Include both WHAT and WHEN. One without the other selects unreliably.
"Generates commit messages" tells the agent the capability. "Use when reviewing staged changes or the user asks for commit help" tells it the trigger condition. The agent matches on both. Drop either and you lose half the selection signal.
Mirror the vocabulary the user will actually type.
If your users say "deploy" and your description says "provision infrastructure," the skill is invisible to half of them. The author's vocabulary is the wrong vocabulary. Use the words that arrive in the chat box.
Stay under 200 characters in Claude Code, under 1024 on the API.
Longer descriptions eat more of the shared metadata budget and risk truncation. The constraint is not stylistic — it is the budget every other installed skill is also pulling from. Pack maximum signal into minimum space.
Trigger conditions live in the description, never in the body.
The body is loaded after selection, not before. Any 'when to use this skill' line buried in the body had zero effect on whether the skill ran. Trigger conditions in the body are dead text.
Test against the phrases real users type, not the ones the author would use.
Authors test with their own phrasing because it's the phrasing they have. Confirmation bias built into the harness. Recruit a teammate, capture their natural request, run it cold. The gap between author phrasing and user phrasing is where the activation rate dies.
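The hard rules above can be partly checked by machine before a human ever reads the description. The sketch below is a heuristic, not an official validator: `lint_description` is a hypothetical helper, the 200-character default mirrors the Claude Code limit quoted above, and the first-person and "Use when" checks are simple pattern matches that will miss edge cases.

```python
import re

# Heuristic: common first-person words. An assumption, not an exhaustive list.
FIRST_PERSON = re.compile(r"\b(I|[Mm]e|[Mm]y|[Ww]e|[Oo]ur)\b")


def lint_description(description: str, max_chars: int = 200) -> list[str]:
    """Return a list of rule violations for a skill description."""
    problems = []
    # Folded YAML scalars arrive with newlines; normalize whitespace first.
    text = " ".join(description.split())
    if len(text) > max_chars:
        problems.append(f"too long: {len(text)} > {max_chars} characters")
    if FIRST_PERSON.search(text):
        problems.append("first person: rewrite in third person")
    if "use when" not in text.lower():
        problems.append("no WHEN clause: add 'Use when ...'")
    return problems


print(lint_description("I can process PDFs for you"))
```

A clean result only means the description passes the mechanical rules. Whether it mirrors real user vocabulary still has to be tested against phrases captured from teammates.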
Three Tiers, or Your Context Window Pays the Bill
Every token spent on a skill is a token unavailable for reasoning. Tier the load.
The context window is a shared resource. Your skill competes with the system prompt, conversation history, every other installed skill's metadata, and the actual user request. Every token spent on skill content is a token taken from reasoning.[6]
Production skills load in three tiers, on demand. Tier collapse — pulling everything in upfront — is the failure mode.
Tier 1 — Metadata. Always loaded. ~100 tokens per skill. The name and description from the YAML frontmatter. This is the only fragment the agent sees at selection time.[1]
Tier 2 — Entry point. Loaded when the skill wins selection. The markdown body of SKILL.md. Target under 500 lines. Holds the workflow, decisions, pointers to references. Past 500 lines and the agent starts reading partial chunks and missing context.
Tier 3 — References. Loaded on demand. Templates, examples, API docs, scripts. A skill with 2,000 lines of API reference in a Tier 3 file costs zero tokens on every task that does not touch the API. That's the leverage point.
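The Tier 2 cap is easy to check mechanically. A minimal sketch, assuming the frontmatter is delimited by two `---` lines at the top of SKILL.md; `split_skill_file` and `body_over_budget` are hypothetical helper names, not part of any official tooling.

```python
def split_skill_file(text: str) -> tuple[str, str]:
    """Split SKILL.md text into (frontmatter, body)."""
    lines = text.splitlines()
    if lines and lines[0].strip() == "---":
        for i in range(1, len(lines)):
            if lines[i].strip() == "---":
                # Everything between the delimiters is Tier 1 metadata;
                # everything after the second delimiter is the Tier 2 body.
                return "\n".join(lines[1:i]), "\n".join(lines[i + 1:])
    # No frontmatter found: treat the whole file as body.
    return "", text


def body_over_budget(text: str, max_lines: int = 500) -> bool:
    """True when the Tier 2 body exceeds the line cap."""
    _, body = split_skill_file(text)
    return len(body.splitlines()) > max_lines
```

Counting lines is a proxy for token cost, not a measurement of it. A body of 400 very long lines can still be more expensive than one of 500 short ones.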
Production Skill File Layout
my-skill/
├── SKILL.md
│   ├── # Frontmatter: Tier 1, always loaded
│   └── # Body: Tier 2, loaded on selection, capped at 500 lines
├── reference/
│   ├── api-docs.md
│   ├── schema-definitions.md
│   └── troubleshooting.md
├── templates/
│   ├── output-template.md
│   └── report-format.json
├── examples/
│   ├── simple-case.md
│   └── complex-case.md
└── scripts/
    ├── validate.py
    └── transform.sh

Clarify vs. Assume: Pick a Default Per Class of Action
Asking too much makes the skill annoying. Assuming too aggressively destroys output. Both kill adoption.
Real users don't write specifications. They say "fix the tests" when they mean the three failing integration tests in the auth module. They say "make a report" when they mean a quarterly revenue breakdown by region, in markdown.
The skill needs a written ambiguity policy. Two failure modes pull in opposite directions. Always-ask kills the experience — the skill feels slow and pedantic, users stop reaching for it. Always-assume produces wrong output that takes longer to repair than to redo from scratch. The fix is not a single rule. It is a per-class-of-action decision documented in the skill itself, so the agent doesn't re-derive it under pressure.
| Action class | Default | Why |
|---|---|---|
| Destructive (deploy, delete, overwrite) | Always confirm before acting | The action is irreversible. The cost of a wrong assumption is hours of recovery, sometimes more. |
| Output format unspecified | Pick a default. State the assumption. | Format preferences correct in seconds. Blocking on format choice wastes the user's turn. |
| Scope unclear ("fix the tests") | Take the narrowest reasonable scope. Report what you did. | Fix the three failing tests and surface what was done. Asking which tests when the answer is obvious from context wastes a round trip. |
| Multiple valid approaches | Pick the conventional one. Name the choice. | Methodology debates are not what the user came for. A named choice is easy to reverse. |
| Missing required parameter | Suggest the most likely value, ask for confirmation | "Did you mean the staging environment?" beats "Which environment?" The first is a one-word answer. The second restarts the conversation. |
| Domain terminology | Use the project's glossary. Do not negotiate it. | If the codebase calls it a "widget," use that. Renaming on the fly creates drift between the skill output and the rest of the system. |
SKILL.md
## Ambiguity policy
Applied per action class. Never re-derived in the moment.
1. **Destructive actions**: Confirm before deploying, deleting, or
   overwriting. Show the exact diff or the exact target. No exceptions.
2. **Scope**: Take the narrowest reasonable interpretation.
   If "fix the tests" could mean 3 tests or 30, fix the 3 failing
   ones and report back. The user can widen if needed.
3. **Format**: Default to markdown. State the assumption inline:
   "Generating in markdown. Say the word for JSON or HTML."
4. **Missing parameter**: Suggest the most likely value and ask for
   confirmation. "Deploy to staging (the most recent target)? Say
   'production' if you meant prod."

Output Has Two Consumers. Both or Neither.
Human reader makes a decision. Downstream tool ingests the result. Optimize for one and the other becomes copy-paste glue.
Skill output is rarely the final destination. It flows into two consumers: a human deciding the next move, and a downstream tool that needs to parse what came out. Pick one and the other pays a tax.
Human-formatted output the next tool can't parse forces manual copy-paste — every chained workflow grinds. Machine-formatted output the human can't scan forces a formatter pass before any decision. The right shape serves both. The pattern below is plan-validate-execute: the skill writes a structured plan a human can read at a glance, then validates it, then applies it. The intermediate artifact is the contract.
For the human reading it
Lead with the answer. The first line is the conclusion, not the reasoning chain.
Use structured formatting — headers, lists, tables — so the eye finds what it needs without reading every word.
Mark what changed, what was skipped, what needs attention. The reader is looking for exceptions.
Keep status messages consistent. Always "3 files updated, 1 skipped" — never sometimes "Updated 3 files" and sometimes "Done." Drift in status format breaks scripts and reader trust.
For the next tool in the chain
- ✓ Emit structured data (JSON, YAML) for any output another tool will read.
- ✓ Use plan-validate-execute. Write a changes.json, validate it, then apply. The plan is the audit trail.
- ✓ Keep stdout clean. Diagnostics go to stderr or a log file. Mixing them poisons every parser downstream.
- ✓ Return exit codes that mean something. 0 for success, 1 for handled errors, 2 for unrecoverable failures. The next tool reads exit codes; missing exit code semantics fail silently.
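The stdout, stderr, and exit-code rules above fit in a few lines of Python. This is a minimal sketch of a hypothetical skill script: the `run` function, the `.md` filter, and the JSON keys are illustrative assumptions, not an existing tool's interface.

```python
import json
import sys

EXIT_OK, EXIT_HANDLED_ERROR, EXIT_UNRECOVERABLE = 0, 1, 2


def run(paths: list[str], out=sys.stdout, err=sys.stderr) -> int:
    """Process paths; machine-readable result on `out`, diagnostics on `err`."""
    if not paths:
        print("no input files given", file=err)
        return EXIT_UNRECOVERABLE
    updated, skipped = [], []
    for path in paths:
        if path.endswith(".md"):
            updated.append(path)
        else:
            skipped.append(path)
            # Diagnostics never touch stdout, so a downstream parser sees only JSON.
            print(f"skipped {path}: not a markdown file", file=err)
    json.dump({"updated": updated, "skipped": skipped}, out)
    out.write("\n")
    return EXIT_HANDLED_ERROR if skipped else EXIT_OK
```

Wired up as a script, the last line would be `sys.exit(run(sys.argv[1:]))`, so the caller can branch on the exit code without parsing any text.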
- [01] Generate the plan as a structured artifact
    # Skill emits the plan to disk; never executes inline
    python scripts/analyze.py input.pdf > changes.json
- [02] Validate the plan before anything mutates
    # Validation runs against the plan, not against production
    python scripts/validate.py changes.json
    # Errors are specific enough to act on:
    # "Field 'signature_date' not found. Available: customer_name, order_total"
- [03] Execute only after validation passes
    # Mutation happens only when the plan is verified
    python scripts/execute.py changes.json --output result.pdf
- [04] Verify the output as a separate step
    # Independent verification; execute is not allowed to grade itself
    python scripts/verify.py result.pdf
    # Output: "OK: 12/12 fields populated, 0 validation errors"
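The validation step is where the "specific enough to act on" error comes from. Here is a sketch of what the core of a script like `scripts/validate.py` might look like. The shape of `changes.json` (a `changes` list of objects with a `field` key) is an assumption made for illustration; the source does not define the schema.

```python
def validate_plan(plan: dict, available_fields: set[str]) -> list[str]:
    """Check every planned change against the fields the target actually has."""
    errors = []
    for change in plan.get("changes", []):
        field = change.get("field")
        if field not in available_fields:
            # Name the bad field and list the valid ones, so the fix is obvious.
            options = ", ".join(sorted(available_fields))
            errors.append(f"Field '{field}' not found. Available: {options}")
    return errors


plan = {"changes": [{"field": "signature_date", "value": "2024-01-01"}]}
print(validate_plan(plan, {"customer_name", "order_total"}))
```

Because the function only reads the plan, it can run as many times as needed without touching the target file.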
Degrees of Freedom: Constrain the Fragile, Free the Open Field
Migrations are a narrow bridge with cliffs. Code reviews are open terrain. Mixing the constraint level kills both.
Common failure: every instruction written as an absolute rule. Different parts of a skill need different constraint levels. Database migrations need exact commands. Code reviews need direction, not a script. Over-constrain the flexible work and the agent ignores context that matters. Under-constrain the fragile work and the agent reasons toward something that breaks production.[6]
A useful mental model. Some tasks are a narrow bridge with cliffs on both sides — one safe path, deviation means failure, write the exact command. Other tasks are an open field — many paths reach the goal, the agent's judgment about which route to take is usually better than the author's hardcoded choice. Match the constraint to the terrain. Mismatch is where rigidity meets recklessness.
Frontmatter: One Field, One Job
Every field has a failure mode when omitted or misused. Here's the audit.
SKILL.md
---
name: deploy-staging
description: >-
  Deploys the application to the staging environment.
  Use when the user says deploy, ship, push to staging,
  or asks to test in a staging-like environment.
disable-model-invocation: true
allowed-tools: Bash(kubectl *), Bash(helm *), Read
context: fork
agent: general-purpose
---

| Field | When to use | Failure mode |
|---|---|---|
| name | Always. Becomes the slash command. Lowercase. Hyphens only. | Spaces or uppercase break invocation. Vague names like "helper" never get reached for. |
| description | Always. The trigger surface. Carry both what and when. | First person. No trigger vocabulary. Either one and the skill is invisible. |
| disable-model-invocation | Side-effect skills: deploy, delete, send messages, mutate state. | Left off for destructive operations. The agent now auto-triggers a deletion. This is how skills delete production. |
| user-invocable | Set to false for background knowledge skills the user should not call directly. | Hidden from users who should be able to call it. Discoverability dies in a config. |
| allowed-tools | Skill needs specific tools without per-call approval friction. | Bash(*) when Bash(git *) would have done. Permission scope wider than function. Drift starts here. |
| context | Set to fork when the skill should run in isolation from the parent context. | Forking a skill that only emits reference content. Pointless context split, no actionable task. |
| agent | A specific subagent type (Explore, Plan) fits the task better than general-purpose. | Research tasks running general-purpose when Explore's read-only toolset would have prevented half the failure surface. |
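Several failure modes in the table are mechanical enough to lint. The sketch below assumes the frontmatter has already been parsed into a plain dict; `audit_frontmatter` is a hypothetical helper, the name pattern allows digits as an assumption beyond "lowercase, hyphens only", and the keyword list for destructive skills is illustrative.

```python
import re

# Lowercase words separated by single hyphens. Digits allowed by assumption.
NAME_PATTERN = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
# Illustrative keywords that suggest side effects.
DESTRUCTIVE_WORDS = ("deploy", "delete")


def audit_frontmatter(fields: dict) -> list[str]:
    """Flag the frontmatter failure modes that can be detected mechanically."""
    findings = []
    name = fields.get("name", "")
    if not NAME_PATTERN.match(name):
        findings.append(f"name {name!r}: use lowercase letters, digits, and hyphens only")
    if "Bash(*)" in fields.get("allowed-tools", ""):
        findings.append("allowed-tools: Bash(*) is wider than most skills need")
    looks_destructive = any(word in name.lower() for word in DESTRUCTIVE_WORDS)
    if looks_destructive and not fields.get("disable-model-invocation"):
        findings.append("destructive-sounding name without disable-model-invocation: true")
    return findings
```

A name-based guess at destructiveness is crude. It catches the obvious `deploy-*` and `delete-*` cases and nothing else, so it complements a human review rather than replacing it.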
Test the Skill Before You Ship It. Otherwise You're Shipping a Hope.
Five gates between draft and production. Skip any of them and the activation rate is whatever the universe gives you.
Writing evals before writing extensive documentation is the highest-leverage move in skill engineering.[4] It forces the question every author wants to skip — does the skill solve a real problem or an imagined one. If the agent already handles the task without the skill, you don't need the skill. The eval surfaces that fact in minutes instead of months.
The protocol below is two halves. Automated checks catch structural problems — descriptions that fail selection, files past the line cap, references nested too deep. Observational testing catches behavioral problems the harness can't see — the skill firing on adjacent tasks, ignoring a reference file, re-reading the same section four times because the body is structured wrong. Run both. One without the other ships blind.
- [01] Three eval scenarios first, prose second
  Pick three concrete tasks the skill is supposed to handle. Run the agent on those tasks with no skill loaded. Document the specific failures. That's the baseline. If the agent already succeeds without the skill, the skill is decoration. Cut it before you write the body.
- [02] Trigger evals across phrasings: author bias is the enemy
  The description is the single largest failure point. Test at least five phrasings, including phrasings from people who do not know the skill exists. They say 'make a report,' not 'invoke the report-generator skill.' If your eval set only contains phrasings the author would use, the activation rate measurement is theater.
- [03] Test on every model tier the skill ships against
  Opus over-interprets verbose instructions. Haiku under-interprets sparse ones. A skill tuned to one model and deployed across all of them is going to misfire somewhere. Pin the matrix.
- [04] Observe navigation, not just outcomes
  Watch how the agent actually uses the skill in practice. The order of file reads tells you whether the structure matches the workflow. Files the agent never opens are dead weight. Sections re-read repeatedly are candidates for promotion to SKILL.md; the agent is signaling that the content belongs higher in the load order.
- [05] Two-agent feedback loop: author and adversary
  Use one Claude instance to author the skill. Use a separate instance to test it on real tasks. The author instance knows what it intended; the test instance reveals what is actually missing. Iterate between them until the test instance handles every scenario without intervention.
Pre-Ship Skill File Checklist
Description carries both WHAT and WHEN — never one without the other
Description in third person — no first-person voice collision
SKILL.md body under 500 lines — partial reads do not get to truncate it
Reference files linked exactly one level deep from SKILL.md — no chains
Destructive operations require explicit confirmation in the body
Ambiguity policy documented per action class — clarify vs. assume decided in advance
Output format works for the human reader and the next tool — both, not either
Three or more eval scenarios written, run, passing
Trigger evals run against five or more natural phrasings, including teammate-captured
Tested across every model tier the skill will deploy on — Haiku, Sonnet, Opus
No time-sensitive information baked into instructions — dated content is decay
Terminology consistent throughout — no synonym drift between sections
All file paths use forward slashes — cross-platform invariant
Validation scripts emit specific, actionable error messages — not just exit codes
Activation Rate Is the Number. Iterate the Description Against It.
After shipping 54 plugins in skillstack, one metric predicts adoption better than any other: activation rate — the percentage of relevant user queries that actually trigger the skill. A careful first author lands around 8/10. Not a failure. A starting point. The gap from 8/10 to 10/10 closes through iteration on the description field, never the body.
The loop: 10-15 positive trigger queries, 5-10 negatives, run them against the model, count the misses, rewrite the description to close each specific gap, run again. plugin-dev's run_eval.py collapses that into one command. Without the harness, every description edit is a coin flip — you cannot tell whether the change helped, hurt, or did nothing.
Without a trigger eval harness:
- Author tests with author phrasing: confirmation bias is the harness
- Teammates discover the failures in production, not pre-ship
- Silent misses: skill never fires, user concludes it does not exist
- No baseline to grade description edits against: every change is a guess
With one:
- Diverse trigger queries catch the phrasings the author would never have written
- Activation rate score gives a concrete target to push against: 8/10, then 9/10, then 10/10
- Negative queries prove the skill stays out of unrelated tasks
- Every description edit has a measurable delta: iterate with evidence, not vibes
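The arithmetic behind the loop is small enough to write down. The sketch below is not `run_eval.py`; it is a hypothetical stand-in that shows what gets counted. The `fires` callback is an assumption: in a real harness it would run the query against the model and report whether the skill was selected.

```python
from typing import Callable


def activation_report(
    positives: list[str],
    negatives: list[str],
    fires: Callable[[str], bool],
) -> dict:
    """Score a description: positives should fire, negatives should not."""
    missed = [q for q in positives if not fires(q)]
    false_fires = [q for q in negatives if fires(q)]
    hits = len(positives) - len(missed)
    return {
        "activation_rate": hits / len(positives) if positives else 0.0,
        "missed": missed,            # each entry is a gap the description must close
        "false_fires": false_fires,  # each entry is a task the skill should stay out of
    }
```

The `missed` list matters more than the rate itself. Each entry is a concrete phrasing to fold into the next description rewrite.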
evals/trigger_queries.yaml
# plugin-dev trigger eval format. Positives must fire. Negatives must not.
positives:
  - "run the eval suite"
  - "measure activation rate"
  - "check how often my skill triggers"
  - "test trigger evals"
negatives:
  - "write a changelog"
  - "generate a diagram"
  - "format this JSON"

Plugins Are a Different Failure Class. Exit Code 1 Is the Trap.
A single SKILL.md is a function. A plugin — skills, hooks, MCP extensions, scripts — is a system. The threshold is real. A new failure class lives on the other side. The most common one costs hours the first time it hits.
Claude hooks communicate via exit code. Exit code 0 means proceed. Exit code 2 means block the tool call. Exit code 1 logs an error and lets the tool through anyway. Every first-time plugin author writes a PreToolUse hook, tests it, sees "BLOCKED" in the log, watches Claude execute the call regardless. Because 1 is not block. Only 2 is. The hook ran. The enforcement did not.
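Naming the exit codes makes the trap harder to fall into. This is a minimal sketch of a PreToolUse hook body in Python. It assumes the hook receives a JSON event on stdin with `tool_name` and `tool_input.command` fields (check the hooks documentation for the exact payload), and the `rm -rf` rule is an illustrative policy, not a recommended one.

```python
import json
import sys

ALLOW = 0
BLOCK = 2  # exit code 1 would only log an error; the tool call would still run


def decide(event: dict) -> tuple[int, str]:
    """Return (exit_code, reason) for a single tool-call event."""
    command = event.get("tool_input", {}).get("command", "")
    if event.get("tool_name") == "Bash" and "rm -rf" in command:
        return BLOCK, "Blocked: recursive delete is not allowed by this hook"
    return ALLOW, ""


def main() -> int:
    """Read the event from stdin, explain any block on stderr, return the exit code."""
    code, reason = decide(json.load(sys.stdin))
    if reason:
        print(reason, file=sys.stderr)
    return code
```

As a script, the entry point would be `sys.exit(main())`. Keeping `decide` separate from the I/O means the blocking logic can be tested offline with plain dicts.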
Plugin Lifecycle: Eight Phases (plugin-dev)
- ✓ Ideation: frame the problem, scope it, write the success criteria before any code
- ✓ Research: survey prior art, competing skills, the pattern library you should not reinvent
- ✓ Architecture: skill graph, hook design, MCP surface decided up front, not retrofitted
- ✓ Composition: write each skill using the skill-foundry patterns
- ✓ Hooks: PreToolUse and PostToolUse with the correct exit codes (2 blocks, not 1)
- ✓ Validation: schema check, integration tests, edge cases run before merge
- ✓ Evaluation: activation rate plus output quality, both with numeric thresholds
- ✓ Documentation: README, examples, changelog as part of shipping, not after
Anti-Patterns: Why Skills That Look Right Fail
Reviewed across hundreds of skill files. These are the failure modes that ship in the most polished-looking work.
Across hundreds of skill files in open-source repos and production codebases, the same failure patterns recur.[6] These are not hypothetical. They are the reasons skills break the moment they leave the author's machine.
A non-obvious one: the most elaborate skill files are often the least reliable. Teams that pour effort into exhaustive instructions, templates, and deep reference documentation discover the skill is too rigid for the variety of inputs it actually meets. The skills that survive in production are narrowly scoped with sparse instructions — one thing, done clearly, no attempt to cover every adjacent case. Scope discipline beats coverage ambition.
Trigger conditions buried in the markdown body
The body loads after selection. Any "when to use this skill" guidance inside the body had zero effect on whether the skill ran. The selection decision was already made on the description alone. Move every trigger condition into the description field. The body is for after the skill has already won.
Listing options without picking a default
"You can use pypdf, or pdfplumber, or PyMuPDF, or pdf2image" — the agent picks arbitrarily, and that arbitrary pick may not match what the team standardized on. Name one default. Mention alternatives only for specific edge cases. "Use pdfplumber for text extraction. For scanned PDFs requiring OCR, switch to pdf2image with pytesseract." One default, narrow exceptions, no debate.
Reference chains: SKILL.md to file A to file B to file C
When the agent follows a chain of references, it starts using partial reads on deeply nested files — head -100, mid-file skips, dropped sections. Critical content goes missing because it lived two hops from the entry point. Every reference file links directly from SKILL.md. If reference-a.md needs something from reference-b.md, link both from SKILL.md instead.
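Reference chains can be detected before they ship. The sketch below assumes references are written as standard markdown links to `.md` files with paths relative to the skill directory; `nested_references` is a hypothetical helper and only looks one hop past SKILL.md.

```python
import re
from pathlib import Path

# Markdown links whose target ends in .md, with an optional #anchor.
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.md)(?:#[^)\s]*)?\)")


def nested_references(skill_dir: Path) -> list[str]:
    """Return 'a.md -> b.md' for each reference file that links to another .md file."""
    entry = skill_dir / "SKILL.md"
    chains = []
    for target in LINK.findall(entry.read_text(encoding="utf-8")):
        ref = skill_dir / target
        if not ref.is_file():
            continue  # external or missing target; not a local reference file
        for deeper in LINK.findall(ref.read_text(encoding="utf-8")):
            if "://" not in deeper:  # ignore links to external sites
                chains.append(f"{target} -> {deeper}")
    return chains
```

Any non-empty result means some content lives two hops from the entry point. The fix is the one described above: link both files from SKILL.md directly.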
Voodoo constants in scripts
A TIMEOUT = 47 or MAX_RETRIES = 5 without justification is a maintenance trap. The agent cannot reason about whether the value fits the current situation, so it either leaves the trap intact or rewrites it on a guess. Document the reasoning beside the value, for example next to MAX_RETRIES = 3: "Three retries balance reliability against latency; most intermittent failures resolve by the second retry." The constant becomes legible. Drift becomes auditable.
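In a script, the documented version looks like this. A minimal sketch: the values and the reasoning in the comments are illustrative assumptions, and `call_with_retries` is a hypothetical helper, not code from any of the tools named in this guide.

```python
import time

# Three retries after the first attempt. Most intermittent failures resolve
# by the second retry, so a fourth mostly adds latency. (Illustrative reasoning.)
MAX_RETRIES = 3

# Base delay in seconds. It doubles on each retry, so the worst case
# waits 0.5 + 1.0 + 2.0 = 3.5 seconds in total.
BACKOFF_SECONDS = 0.5


def call_with_retries(operation, retries=MAX_RETRIES, backoff=BACKOFF_SECONDS,
                      sleep=time.sleep):
    """Run operation(), retrying on any exception with exponential backoff."""
    for attempt in range(retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the original error
            sleep(backoff * 2 ** attempt)
```

With the reasoning next to the value, an agent or a reviewer can tell whether a different environment justifies a different number, instead of guessing.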
Writing for one model, deploying against all of them
Instructions that land cleanly for Opus may confuse Haiku. Instructions explicit enough for Haiku may push Opus into over-literal interpretation. The matrix is real. Test every model in the deployment target, and tune the level of detail per tier. Author bias toward your daily-driver model is the default failure mode here.
The Toolkit That Encodes These Patterns
The patterns above are not abstract. They are the patterns I extracted while building skillstack. Three artifacts implement them directly. One install adds the whole stack.
skill-foundry
Skill authoring framework with a numeric quality gate, not a feeling
47 reference files, 25 utility scripts, 17 templates, 23 worked examples. Includes analyze_skill.py — run it on any SKILL.md and get a 0–100 quality score before shipping.
- Philosophy-first: the mechanism behind the rule, not just the rule
- Anti-pattern catalog: the 12 failure modes that recur across hundreds of skill files
- analyze_skill.py: 0–100 quality score, no judgment calls
- 17 templates covering every pattern in this guide
/plugin marketplace add viktorbezdek/skillstack
/plugin install skill-foundry@skillstack

plugin-dev
Claude Code plugin authoring toolkit — ideation through evaluation, with numbers
8 skills, 4 scripts, the full plugin lifecycle. 109 trigger evals plus 24 output evals across 133 test cases. Activation rate stops being a guess.
- run_eval.py: activation-rate measurement in one command
- test_hook.sh: offline hook validation before deploy — exit code 2 means block, not 1
- scaffold_plugin.py: opinionated scaffold with the failure modes pre-mitigated
- Exit-code semantics baked into every hook template, not left to memory
/plugin marketplace add viktorbezdek/skillstack
/plugin install plugin-dev@skillstack

build-a-plugin
Part of skillstack-workflows
6-phase gated workflow from idea to shipped plugin — gates, not vibes
Orchestrates plugin-ideation → plugin-research → plugin-architecture → skill-foundry → plugin-validation → plugin-evaluation. Each phase has an exit gate. You do not advance until the previous gate clears.
- Phase gates enforce quality before the next phase runs — no skipping ahead
- Composes skill-foundry and plugin-dev into one orchestrated run
- Output is a complete plugin: evals, hooks, docs, ready to ship
/plugin marketplace add viktorbezdek/skillstack
/plugin install skillstack-workflows@skillstack

Every Pattern in One Skill File
Every rule from this guide working together. Read it as the audit, not as inspiration.
SKILL.md
---
name: generate-changelog
description: >-
  Generates a formatted changelog from git history between two refs.
  Use when the user asks for a changelog, release notes, what changed,
  or a summary of recent commits.
allowed-tools: Bash(git *), Read, Write
---
# Generate Changelog
## Workflow
1. Determine the ref range:
   - Two refs provided: use them directly
   - One ref provided: use `ref..HEAD`
   - None provided: last tag to HEAD
   - State the assumption inline: "Generating changelog from v2.1.0 to HEAD"
2. Gather commits:
   ```bash
   git log --oneline --no-merges $FROM..$TO
   ```
3. Categorize by conventional commit prefix:
   - feat: -> Features
   - fix: -> Bug Fixes
   - perf: -> Performance
   - docs: -> Documentation
   - Other -> Maintenance
4. Render the output via the template in [templates/changelog.md](templates/changelog.md).
   Loaded only when this step runs. Tier 3 reference, on demand.
5. Write to CHANGELOG.md. Append. Never overwrite.
## Ambiguity policy
- **No refs specified**: default to last tag..HEAD, state the assumption.
- **Scope unclear**: current branch only, narrowest reasonable interpretation.
- **Format unspecified**: markdown, grouped by category.

Every principle from this guide is in that file. The description carries both what (generates a changelog) and when (the user asks for a changelog, release notes, what changed, or a summary of recent commits). The workflow takes the narrowest reasonable interpretation when input is ambiguous. The output format reads cleanly for the human and parses cleanly for the next tool. And the template loads only when the workflow reaches the render step: Tier 3 reference, zero token cost on every other path.[3]
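Step 3 of the workflow in that skill file, categorizing by conventional commit prefix, is the kind of deterministic work that could live in a script. A minimal sketch, assuming input in `git log --oneline` format (short hash, space, subject); `categorize` is a hypothetical helper, and its handling of scopes such as `feat(auth):` and the `!` breaking-change marker goes beyond what the skill file specifies.

```python
CATEGORIES = {
    "feat": "Features",
    "fix": "Bug Fixes",
    "perf": "Performance",
    "docs": "Documentation",
}


def categorize(oneline_log: str) -> dict[str, list[str]]:
    """Group commit subjects by section, keyed on the conventional commit prefix."""
    grouped: dict[str, list[str]] = {}
    for line in oneline_log.splitlines():
        if not line.strip():
            continue
        _sha, _, subject = line.partition(" ")
        head, colon, _ = subject.partition(":")
        # "feat(auth)!" -> "feat"; subjects without a colon have no prefix.
        prefix = head.split("(", 1)[0].rstrip("!").strip() if colon else ""
        section = CATEGORIES.get(prefix, "Maintenance")
        grouped.setdefault(section, []).append(subject)
    return grouped


log = "a1b2c3d feat(auth): add SSO\nd4e5f6a fix: handle empty diff\n0a1b2c3 bump deps"
print(categorize(log))
```

Anything without a recognized prefix lands in Maintenance, which matches the "Other" row in the workflow.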
The test that matters is not whether the skill works when you type /generate-changelog v2.0.0 v2.1.0. It is whether it fires when a teammate types "what shipped this week" into the chat without knowing the skill exists. Design for that phrasing. Test for it. The activation rate is the only number that says yes.
The skills that get used in production are not the ones with the most thorough instructions. They are the ones whose descriptions win selection against everything else competing for that 100-token slot. Engineer the description. Everything else is downstream of that one decision.
- [1] Anthropic, Claude Code Skills Documentation (code.claude.com)
- [2] Anthropic, Agent Skills Best Practices (platform.claude.com)
- [3] agentskills.io, Agent Skills Open Standard Specification (agentskills.io)
- [4] Towards Data Science, How to Build a Production-Ready Claude Code Skill (towardsdatascience.com)
- [5] Benjamin Abt, Agent Skills Standard: GitHub Copilot (benjamin-abt.com)
- [6] Bibek Poudel, The SKILL.md Pattern: How to Write AI Agent Skills That Actually Work (bibek-poudel.medium.com)
- [7] DeepWiki, 8.1 SKILL.md Format Specification (deepwiki.com)
- [8] Viktor Bezdek, SkillStack: Battle-tested skills for Claude Code (github.com)
- [9] Viktor Bezdek, Plugin Dev: Claude Code Plugin Authoring Toolkit (github.com)
- [10] Viktor Bezdek, Skill Foundry: Skill Authoring Framework (github.com)
- [11] Viktor Bezdek, build-a-plugin: End-to-End Plugin Authoring Workflow (github.com)