A practical guide to understanding the usage paths that make AI app costs spike after the prototype reaches real traffic.
An AI app bill is rarely surprising because the model is expensive. It is surprising because nobody knows which workflow created the spend. The invoice says tokens. The product says import, summarize, draft, classify, retry, embed, cache, and batch. Until those two views are joined, cost work is guesswork.
Cost is one of the four production pillars, and the current market still treats it as an afterthought. OpenAI's current Batch API guide advertises a 50 percent cost discount compared with synchronous APIs, with completion within 24 hours, for latency-tolerant jobs. Anthropic's prompt-caching docs price 5-minute cache writes at 1.25 times the base input price, 1-hour writes at 2 times, and cache reads at 0.1 times. OpenTelemetry's GenAI semantic conventions define attributes for provider, model, input tokens, output tokens, cache tokens, tool calls, and related fields. Those are not trivia. They are the basis of a cost ledger.
The practical move is to stop reading cost by provider first. Read it by workflow. A workflow has an owner, user promise, latency budget, model path, cache behavior, retry policy, and fallback. Once those are visible, the bill starts naming architectural mistakes.
Provider totals do not tell the product team what to fix.
The first cost table should be boring: workflow name, request count, input tokens, output tokens, cached input tokens, cache writes, retries, batch jobs, tool calls, median latency, P95 latency, and owner. If that sounds too detailed, that is the point. Without those fields the team cannot tell whether the bill came from real usage, duplicate retries, long context, failed tool loops, or background work that should have been batched.
A useful ledger separates synchronous user work from offline work. If a user waits for a draft, latency matters and synchronous pricing may be appropriate. If the system tags old records overnight, batch or flex-style processing can be a product win. The user promise decides the cost path. Finance cannot infer that from an invoice.
Retries deserve their own column. Retrying a 429 or a timeout with exponential backoff is normal. A retry loop that resubmits a long prompt five times is a design error. A retry loop around non-idempotent actions is worse because it can create both cost and correctness failures. Cost review should therefore sit next to reliability review, not after it.
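A capped retry loop that records its own attempts might look like the sketch below. `TransientError`, `RetryLog`, and `call_with_retries` are hypothetical names for illustration, not part of any provider SDK, and the wrapper is only appropriate around idempotent calls.

```python
import random
import time
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical error for retryable failures such as a 429 or a timeout."""

    def __init__(self, reason):
        super().__init__(reason)
        self.reason = reason


@dataclass
class RetryLog:
    attempts: int = 0
    reasons: list = field(default_factory=list)
    outcome: str = "pending"


def call_with_retries(send, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call send() with capped exponential backoff and record every retry.

    Returns (result, log). The log is what feeds the retry column of a ledger.
    """
    log = RetryLog()
    for attempt in range(1, max_attempts + 1):
        log.attempts = attempt
        try:
            result = send()
            log.outcome = "success"
            return result, log
        except TransientError as err:
            log.reasons.append(err.reason)
            if attempt == max_attempts:
                break  # cap reached: stop resubmitting the prompt
            # Delay doubles each attempt; jitter spreads out simultaneous retries.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay * random.uniform(0.5, 1.0))
    log.outcome = "gave_up"
    return None, log
```

Returning the log alongside the result keeps retry count, retry reason, and final outcome together, so the same record can serve both the cost review and the reliability review.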
OpenAI's Batch API guide describes a 50 percent discount compared with synchronous APIs for jobs that can wait.
Anthropic's docs list cache read tokens at one tenth of the base input token price.
Every expensive path needs a product or engineering owner who can change behavior, not just monitor spend.
| Field | Why it matters | Bad smell |
|---|---|---|
| Workflow name | Connects spend to user promise | Provider-only totals |
| Prompt and model version | Explains cost changes after releases | No version history for spikes |
| Input, output, cache read, cache write tokens | Separates context cost from generation cost | Only total tokens stored |
| Retry count and reason | Finds reliability bugs that inflate spend | Retries hidden in SDK logs |
| Batch or synchronous path | Makes latency tradeoffs explicit | Offline jobs billed as user-waiting work |
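The fields in the table can be captured as one record per request and rolled up by workflow. The following is a minimal sketch; `LedgerRow` and `summarize_by_workflow` are illustrative names, and the exact field set is an assumption based on the table above.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LedgerRow:
    workflow: str                  # set where the product action starts
    owner: str
    model: str
    prompt_version: str
    path: str                      # "sync" or "batch"
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    retry_count: int = 0
    retry_reason: Optional[str] = None


def summarize_by_workflow(rows):
    """Aggregate request counts, token counts, and retries per workflow."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        t = totals[row.workflow]
        t["requests"] += 1
        t["input_tokens"] += row.input_tokens
        t["output_tokens"] += row.output_tokens
        t["cache_read_tokens"] += row.cache_read_tokens
        t["cache_write_tokens"] += row.cache_write_tokens
        t["retries"] += row.retry_count
    # Convert nested defaultdicts to plain dicts for reporting.
    return {name: dict(t) for name, t in totals.items()}
```

Keeping the four token counts as separate fields is the point of the design: a single total would hide whether a spike came from context, generation, or cache misses.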
Prompt caching and batch processing only work when the product path is shaped for them.
Prompt caching pays when repeated context is stable and placed where the provider can reuse it. That means the prompt structure matters. Put durable instructions, schemas, tool definitions, policies, and long reference context before volatile user-specific pieces when the provider's caching rules reward stable prefixes. If every request shuffles the same paragraphs in a new order, the cache miss is an architecture bug.
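One way to keep the reusable part of a prompt stable is to assemble it in a fixed order and fingerprint it. This is a sketch under assumptions: `build_prompt` and `prefix_fingerprint` are hypothetical helpers, and real cache eligibility depends on each provider's own rules, which this code does not model.

```python
import hashlib


def build_prompt(instructions, tool_schema, reference_docs, user_message):
    """Return (stable_prefix, full_prompt) with volatile content placed last."""
    # Sorting keeps the reference order identical across requests, so the
    # prefix is byte-for-byte stable no matter how callers order the docs.
    stable_parts = [instructions, tool_schema, *sorted(reference_docs)]
    stable_prefix = "\n\n".join(stable_parts)
    return stable_prefix, stable_prefix + "\n\n" + user_message


def prefix_fingerprint(stable_prefix):
    """Short hash to log per request; a changing value signals prefix churn."""
    return hashlib.sha256(stable_prefix.encode("utf-8")).hexdigest()[:12]
```

Logging the fingerprint next to the workflow name makes it easy to see when a release changed the prefix and reset cache reuse.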
Batch has a different constraint: the user cannot be waiting. Reports, nightly enrichments, backfills, bulk classification, offline eval runs, and migration jobs are natural candidates. Chat, checkout, live support, and interactive copilots usually are not. A product manager can make this call faster than a cost dashboard can. Ask whether the user promise includes immediacy.
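The immediacy question can be written down as a small routing rule. The sketch below is illustrative only: `WorkflowPolicy` and `choose_path` are hypothetical names, and the 24-hour default mirrors the batch completion window cited earlier rather than any guarantee.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowPolicy:
    name: str
    user_is_waiting: bool      # does the user promise include immediacy?
    max_wait_hours: float      # how long the result may take and still be useful


def choose_path(policy, batch_window_hours=24.0):
    """Pick "sync" or "batch" from the user promise, not from the invoice."""
    if policy.user_is_waiting:
        return "sync"
    # Only batch when the workflow tolerates the full batch completion window.
    if policy.max_wait_hours >= batch_window_hours:
        return "batch"
    return "sync"
```

For example, a policy for nightly tagging with `user_is_waiting=False` and `max_wait_hours=24.0` routes to batch, while a live support draft stays synchronous regardless of cost.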
The trap is optimizing cost while breaking trust. Moving a task to batch may save money and still fail if the user expects a result in the same session. Caching may reduce cost and still be wrong if the cached prefix includes stale policy text. Cost work has to preserve the product contract.
Provider totals reviewed after spend spikes
Prompt length discussed without workflow context
Retries treated as reliability-only noise
Batch considered only after finance complains
Per-workflow cost ledger reviewed with releases
Stable context arranged to improve cache behavior
Retry cost and retry cause tracked together
Latency-tolerant work designed for batch from the start
Log provider, model, prompt version, and workflow name for every important request.
Record input, output, cache-read, and cache-write token counts separately.
Track retry count, retry reason, and final outcome.
Separate user-waiting workflows from offline workflows.
Review whether stable prompt prefixes can improve cache hit rate.
Move latency-tolerant bulk work to batch where product promises allow it.
Set a per-workflow budget guard and owner.
Investigate cost spikes by release, not just by calendar date.
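A per-workflow budget guard from the checklist above could be as small as the following sketch. `BudgetGuard`, the daily limit, and the 80 percent warning threshold are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class BudgetGuard:
    workflow: str
    owner: str                  # who can change behavior when the guard trips
    daily_limit_usd: float
    warn_ratio: float = 0.8     # assumed threshold for an early warning
    spent_usd: float = 0.0

    def __post_init__(self):
        if self.daily_limit_usd <= 0:
            raise ValueError("daily_limit_usd must be positive")

    def record(self, cost_usd):
        """Add one request's cost and return the resulting status."""
        self.spent_usd += cost_usd
        return self.status()

    def status(self):
        ratio = self.spent_usd / self.daily_limit_usd
        if ratio >= 1.0:
            return "over_budget"
        if ratio >= self.warn_ratio:
            return "warning"
        return "ok"
```

What happens on `"over_budget"` (alert the owner, degrade to a cheaper path, or pause offline work) is a product decision that belongs to the workflow owner, not to the guard.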
Add the workflow name at the boundary where the product action starts. Do not infer it later from endpoint names.
Input, output, cache read, cache write, and retry tokens answer different questions. Store them separately.
Move offline work to batch, stabilize cacheable prefixes, cut repeated context, and cap retries where the evidence points.
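Storing the counts separately is what makes the arithmetic possible. Below is a hedged sketch of a per-request cost estimate using the cache multipliers quoted earlier (1.25 times for 5-minute writes, 0.1 times for reads). The function name and the per-million-token prices passed in are hypothetical, and the code assumes the four token counts do not overlap.

```python
def request_cost_usd(
    input_tokens,
    output_tokens,
    cache_read_tokens,
    cache_write_tokens,
    input_price_per_mtok,
    output_price_per_mtok,
    cache_write_multiplier=1.25,   # 5-minute cache write, relative to base input
    cache_read_multiplier=0.1,     # cache read, relative to base input
):
    """Estimate one request's cost in USD from separately stored token counts.

    Assumes input_tokens excludes tokens counted as cache reads or writes.
    """
    billable_input = (
        input_tokens
        + cache_write_tokens * cache_write_multiplier
        + cache_read_tokens * cache_read_multiplier
    )
    total = (
        billable_input * input_price_per_mtok
        + output_tokens * output_price_per_mtok
    )
    return total / 1_000_000
```

With made-up prices of 3 USD per million input tokens and 15 USD per million output tokens, a request with 10,000 uncached input tokens and 500 output tokens comes to about 0.0375 USD. If 9,000 of those input tokens are served as cache reads instead, the estimate drops to about 0.0132 USD, while the request that first writes those 9,000 tokens to the cache costs about 0.0443 USD.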
The ledger shows where money goes. It does not decide what value is worth paying for.
Some expensive workflows are worth it. A high-value expert workflow may justify a large context window, a stronger model, and human review. A background enrichment job for inactive records may not. The ledger gives the team the numbers needed to make that distinction.
A workflow-level ledger turns cost from a vague anxiety into an inspectable production surface. It connects reliability, evals, and product design: retries cost money, eval runs cost money, prompts carry cost, and user promises decide which optimizations are allowed.
The review should also include a negative decision: which usage is intentionally not optimized yet. A team may decide to pay for a larger model on the first support draft because the downstream human-editing cost is higher than the token cost. That decision is healthy when it is written down with a workflow owner and a revisit trigger. It is unhealthy when the same spend hides in an invoice nobody can explain.
A cost ledger also changes roadmap arguments. Instead of debating whether AI is expensive in the abstract, the team can ask whether the renewal-risk workflow, import workflow, or nightly enrichment workflow is earning its budget.
The practical test is simple. If a builder cannot explain last week's AI spend by workflow, they are not managing cost. They are reading receipts.
Should I always use the cheapest model?
No. Use the cheapest path that meets the workflow's quality, latency, safety, and recovery requirements. The cheapest model can be expensive if it causes retries, escalations, or manual cleanup.
When should I use batch processing?
Use batch for work that can wait: offline evals, backfills, bulk classification, enrichment, and reports. Avoid it for workflows where the user promise is immediate feedback.
What is the first cost metric to add?
Cost per workflow. Total spend is useful for finance, but workflow-level cost tells builders which product path needs a design change.