Token Economics: Reading Your AI App Bill

An AI app bill is rarely surprising because the model is expensive. It is surprising because nobody knows which workflow created the spend. The invoice says tokens. The product says import, summarize, draft, classify, retry, embed, cache, and batch. Until those two views are joined, cost work is guesswork.

Cost is one of the four production pillars, and the current market still treats it as an afterthought. The providers already publish the levers. OpenAI's current Batch API guide advertises a 50 percent cost discount compared with synchronous APIs and completion within 24 hours for latency-tolerant jobs. Anthropic's prompt-caching docs list 5-minute cache writes at 1.25 times base input price, 1-hour writes at 2 times, and reads at 0.1 times. OpenTelemetry's GenAI semantic conventions define attributes for provider, model, input tokens, output tokens, cache tokens, tool calls, and related fields. Those are not trivia. They are the basis of a cost ledger.
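The cache multipliers quoted above are easier to reason about as arithmetic. The sketch below is illustrative only: it works in units of "one uncached send of the shared prefix", and it assumes a single cache write followed by cache hits on every later request before the entry expires. The function names are hypothetical, not part of any provider SDK.

```python
# Multipliers relative to the base input-token price, as quoted in the text.
CACHE_WRITE_5M = 1.25
CACHE_WRITE_1H = 2.0
CACHE_READ = 0.1


def prefix_cost(requests, write_multiplier, cached=True):
    """Cost of sending one shared prefix `requests` times.

    The unit is (prefix tokens x base input price). Assumes one cache write
    and that every later request reads from the cache.
    """
    if requests <= 0:
        return 0.0
    if not cached:
        return float(requests)
    return write_multiplier + CACHE_READ * (requests - 1)


def break_even_requests(write_multiplier):
    """Smallest request count at which caching is cheaper than not caching."""
    n = 1
    while prefix_cost(n, write_multiplier) >= prefix_cost(n, write_multiplier, cached=False):
        n += 1
    return n
```

Under these assumptions a 5-minute cache write pays for itself on the second request that reuses the prefix, and a 1-hour write on the third. A prefix that is reused only once per expiry window costs more with caching than without it.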

The practical move is to stop reading cost by provider first. Read it by workflow. A workflow has an owner, user promise, latency budget, model path, cache behavior, retry policy, and fallback. Once those are visible, the bill starts naming architectural mistakes.

Cost has to be charged to a workflow

Provider totals do not tell the product team what to fix.

The first cost table should be boring: workflow name, request count, input tokens, output tokens, cached input tokens, cache writes, retries, batch jobs, tool calls, median latency, P95 latency, and owner. If that sounds too detailed, that is the point. Without those fields the team cannot tell whether the bill came from real usage, duplicate retries, long context, failed tool loops, or background work that should have been batched.
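A minimal sketch of that boring table, assuming one row per model request and an in-memory rollup. The `LedgerRow` and `summarize` names are hypothetical; latency percentiles are left out to keep the example short.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class LedgerRow:
    """One model request, attributed to the workflow that caused it."""
    workflow: str
    owner: str
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int = 0
    cache_write_tokens: int = 0
    retries: int = 0
    batch: bool = False
    tool_calls: int = 0


def summarize(rows):
    """Roll rows up per workflow so spend can be read by product path."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in rows:
        t = totals[row.workflow]
        t["requests"] += 1
        t["input_tokens"] += row.input_tokens
        t["output_tokens"] += row.output_tokens
        t["cached_input_tokens"] += row.cached_input_tokens
        t["cache_write_tokens"] += row.cache_write_tokens
        t["retries"] += row.retries
        t["batch_jobs"] += int(row.batch)
        t["tool_calls"] += row.tool_calls
    return {name: dict(t) for name, t in totals.items()}
```

The point of keeping the token columns separate is that each one answers a different question: long context shows up in input tokens, verbose generation in output tokens, and reliability problems in retries.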

A useful ledger separates synchronous user work from offline work. If a user waits for a draft, latency matters and synchronous pricing may be appropriate. If the system tags old records overnight, batch or flex-style processing can be a product win. The user promise decides the cost path. Finance cannot infer that from an invoice.

Retries deserve their own column. A 429 or timeout with exponential backoff is normal. A retry loop that resubmits a long prompt five times is a design error. A retry loop around non-idempotent actions is worse because it can create both cost and correctness failures. Cost review should therefore sit next to reliability review, not after it.
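One way to give retries their own column is to have the retry wrapper report what it cost. This is a sketch, not any SDK's built-in behavior: `RetryableError` stands in for whatever exception a real client raises on a 429 or timeout, `send` is a hypothetical callable that performs one request, and the wrapper should only be placed around idempotent calls.

```python
import random
import time


class RetryableError(Exception):
    """Stand-in for a 429 or timeout; real SDKs raise their own types."""


def call_with_retries(send, prompt_tokens, max_attempts=3, base_delay=0.5,
                      sleep=time.sleep):
    """Call `send()` with capped exponential backoff and report retry cost.

    Returns (result, stats). `stats` records attempts, retry reasons, and
    how many input tokens were resubmitted because of retries. On final
    failure the same stats are attached to the raised exception.
    """
    stats = {"attempts": 0, "retry_reasons": [], "resubmitted_input_tokens": 0}
    for attempt in range(1, max_attempts + 1):
        stats["attempts"] = attempt
        try:
            return send(), stats
        except RetryableError as exc:
            stats["retry_reasons"].append(str(exc))
            if attempt == max_attempts:
                exc.stats = stats  # keep the cost record for the ledger
                raise
            stats["resubmitted_input_tokens"] += prompt_tokens
            # Exponential backoff with jitter: the delay doubles each attempt.
            sleep(base_delay * 2 ** (attempt - 1) * (1 + random.random()))
```

The hard cap on attempts is what keeps a long prompt from being resubmitted indefinitely, and the returned stats put retry cost and retry cause in the same record.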

50%

OpenAI Batch discount

OpenAI's Batch API guide describes a 50 percent discount compared with synchronous APIs for jobs that can wait.

0.1x

Anthropic cache read multiplier

Anthropic's docs list cache read tokens at one tenth of the base input token price.

1 owner

Workflow accountability

Every expensive path needs a product or engineering owner who can change behavior, not just monitor spend.

Field	Why it matters	Bad smell
Workflow name	Connects spend to user promise	Provider-only totals
Prompt and model version	Explains cost changes after releases	No version history for spikes
Input, output, cache read, cache write tokens	Separates context cost from generation cost	Only total tokens stored
Retry count and reason	Finds reliability bugs that inflate spend	Retries hidden in SDK logs
Batch or synchronous path	Makes latency tradeoffs explicit	Offline jobs billed as user-waiting work

Figure: Reading the bill by workflow. The cost loop starts with workflow attribution, then separates context, generation, retry, cache, batch, and owner decisions before changing architecture.

Caching and batch are design choices, not billing tricks

They only work when the product path is shaped for them.

Prompt caching pays when repeated context is stable and placed where the provider can reuse it. That means the prompt structure matters. Put durable instructions, schemas, tool definitions, policies, and long reference context before volatile user-specific pieces when the provider's caching rules reward stable prefixes. If every request shuffles the same paragraphs in a new order, the cache miss is an architecture bug.
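A small sketch of that ordering, assuming a provider that rewards stable prefixes. The `build_prompt` helper and its parameter names are hypothetical. It assembles the durable parts in a fixed order, appends the volatile user input last, and returns a short hash of the prefix so that unexpected prefix churn shows up in logs.

```python
import hashlib


def build_prompt(instructions, schema, tool_definitions, reference_context,
                 user_input):
    """Assemble a prompt with the stable parts first and volatile input last.

    Returns (prompt, prefix_id). Two requests with the same prefix_id sent
    an identical prefix, which is the precondition for a cache hit.
    """
    # Fixed order: reshuffling these parts changes the prefix and the hash.
    stable_parts = [instructions, schema, tool_definitions, reference_context]
    prefix = "\n\n".join(stable_parts)
    prefix_id = hashlib.sha256(prefix.encode("utf-8")).hexdigest()[:12]
    return prefix + "\n\n" + user_input, prefix_id
```

Logging `prefix_id` next to the cache-read token count makes it possible to tell whether a drop in cache hits came from a prompt change or from traffic patterns.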

Batch has a different constraint: the user cannot be waiting. Reports, nightly enrichments, backfills, bulk classification, offline eval runs, and migration jobs are natural candidates. Chat, checkout, live support, and interactive copilots usually are not. A product manager can make this call faster than a cost dashboard can. Ask whether the user promise includes immediacy.
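The immediacy question can be written down as a routing rule. This sketch assumes each workflow declares whether a user is waiting and how long a result may take. The `Workflow` fields are hypothetical, and the 24-hour window is taken from the Batch API figure quoted earlier in the article.

```python
from dataclasses import dataclass
from enum import Enum


class Path(Enum):
    SYNC = "sync"
    BATCH = "batch"


@dataclass(frozen=True)
class Workflow:
    name: str
    user_is_waiting: bool
    max_wait_hours: float


# Completion window quoted for batch jobs; treat as an assumption to verify.
BATCH_WINDOW_HOURS = 24


def choose_path(workflow):
    """Pick the cost path from the user promise, not from the price list."""
    if workflow.user_is_waiting:
        return Path.SYNC
    if workflow.max_wait_hours >= BATCH_WINDOW_HOURS:
        return Path.BATCH
    # Nobody is waiting, but the deadline is tighter than the batch window.
    return Path.SYNC
```

For example, a nightly enrichment job with a 48-hour tolerance routes to batch, while a live support draft stays synchronous regardless of price.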

The trap is optimizing cost while breaking trust. Moving a task to batch may save money and still fail if the user expects a result in the same session. Caching may reduce cost and still be wrong if the cached prefix includes stale policy text. Cost work has to preserve the product contract.

Invoice reading

Provider totals reviewed after spend spikes
Prompt length discussed without workflow context
Retries treated as reliability-only noise
Batch considered only after finance complains

Cost engineering

Per-workflow cost ledger reviewed with releases
Stable context arranged to improve cache behavior
Retry cost and retry cause tracked together
Latency-tolerant work designed for batch from the start

AI cost ledger checklist

Log provider, model, prompt version, and workflow name for every important request.
Record input, output, cache-read, and cache-write token counts separately.
Track retry count, retry reason, and final outcome.
Separate user-waiting workflows from offline workflows.
Review whether stable prompt prefixes can improve cache hit rate.
Move latency-tolerant bulk work to batch where product promises allow it.
Set a per-workflow budget guard and owner.
Investigate cost spikes by release, not just by calendar date.
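The budget-guard item in the checklist can be sketched as follows. This is an illustration under simple assumptions: spend is tracked in memory, budgets are in dollars, and the 80 percent warning threshold is an arbitrary default. `WorkflowBudget` and `BudgetGuard` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class WorkflowBudget:
    owner: str
    limit_usd: float
    spent_usd: float = 0.0


class BudgetGuard:
    """Tracks spend per workflow and reports when a budget is near or over."""

    def __init__(self, budgets, warn_ratio=0.8):
        self.budgets = budgets        # workflow name -> WorkflowBudget
        self.warn_ratio = warn_ratio  # assumed threshold, tune per team

    def record(self, workflow, cost_usd):
        """Add spend and return (status, owner); status is ok, warn, or over."""
        if workflow not in self.budgets:
            # Untagged or unowned spend is the failure the ledger exists to catch.
            raise KeyError(f"no budget or owner set for workflow {workflow!r}")
        budget = self.budgets[workflow]
        budget.spent_usd += cost_usd
        if budget.spent_usd > budget.limit_usd:
            return "over", budget.owner
        if budget.spent_usd >= self.warn_ratio * budget.limit_usd:
            return "warn", budget.owner
        return "ok", budget.owner
```

Returning the owner with the status keeps the alert actionable: the person who can change the workflow's behavior is named in the same event that reports the overrun.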

[01]
Tag every request with a workflow
Add the workflow name at the boundary where the product action starts. Do not infer it later from endpoint names.
[02]
Split the token bill
Input, output, cache read, cache write, and retry tokens answer different questions. Store them separately.
[03]
Change architecture, not only settings
Move offline work to batch, stabilize cacheable prefixes, cut repeated context, and cap retries where the evidence points.
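Step [01] can be sketched with Python's standard `contextvars` module, so the workflow name is set once where the product action starts and read wherever the model call is logged. The `workflow` context manager and `ledger_tags` helper are hypothetical names for illustration.

```python
import contextvars
from contextlib import contextmanager

# Holds the workflow name for the current request or task.
_current_workflow = contextvars.ContextVar("workflow", default=None)


@contextmanager
def workflow(name):
    """Tag everything inside the block with a workflow name."""
    token = _current_workflow.set(name)
    try:
        yield
    finally:
        _current_workflow.reset(token)


def ledger_tags():
    """Tags to attach to a model call; fails loudly if none was set."""
    name = _current_workflow.get()
    if name is None:
        raise RuntimeError("model call made outside a tagged workflow")
    return {"workflow": name}
```

A handler would wrap its work in `with workflow("import"):` and the logging layer would call `ledger_tags()` for each request. Raising on a missing tag is a deliberate choice here: it surfaces untagged spend during development instead of after the invoice arrives.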

Cost control does not replace product judgment

The ledger shows where money goes. It does not decide what value is worth paying for.

Some expensive workflows are worth it. A high-value expert workflow may justify a large context window, a stronger model, and human review. A background enrichment job for inactive records may not. The ledger gives the team the numbers needed to make that distinction.

A workflow ledger turns cost from a vague anxiety into an inspectable production surface. It connects reliability, evals, and product design: retries cost money, eval runs cost money, prompts carry cost, and user promises decide which optimizations are legal.

The review should also include a negative decision: which usage is intentionally not optimized yet. A team may decide to pay for a larger model on the first support draft because the downstream human-editing cost is higher than the token cost. That decision is healthy when it is written down with a workflow owner and a revisit trigger. It is unhealthy when the same spend hides in an invoice nobody can explain.

A cost ledger also changes roadmap arguments. Instead of debating whether AI is expensive in the abstract, the team can ask whether the renewal-risk workflow, import workflow, or nightly enrichment workflow is earning its budget.

The practical test is simple. If a builder cannot explain last week's AI spend by workflow, they are not managing cost. They are reading receipts.

Should I always use the cheapest model?

No. Use the cheapest path that meets the workflow's quality, latency, safety, and recovery requirements. The cheapest model can be expensive if it causes retries, escalations, or manual cleanup.
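A rough way to see this is to compare expected cost per completed task rather than price per call. The model below is a deliberate simplification: it assumes every rejected result is fixed by manual cleanup at a flat cost. The function name is hypothetical and all inputs are supplied by the caller.

```python
def expected_cost_per_task(cost_per_attempt, attempts_per_task,
                           acceptance_rate, cleanup_cost_per_rejection=0.0):
    """Expected cost to complete one task, including retries and cleanup.

    Simplified model: model spend is attempts x price, and each rejected
    result (probability 1 - acceptance_rate) incurs a flat cleanup cost.
    """
    if not 0 <= acceptance_rate <= 1:
        raise ValueError("acceptance_rate must be between 0 and 1")
    model_cost = cost_per_attempt * attempts_per_task
    expected_cleanup = (1 - acceptance_rate) * cleanup_cost_per_rejection
    return model_cost + expected_cleanup
```

With made-up numbers, a model at $0.002 per attempt that averages 1.5 attempts and is accepted 70 percent of the time, with $0.50 of cleanup per rejection, costs more per task than a model at $0.02 per attempt that is accepted 95 percent of the time on the first try.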

When should I use batch processing?

Use batch for work that can wait: offline evals, backfills, bulk classification, enrichment, and reports. Avoid it for workflows where the user promise is immediate feedback.

What is the first cost metric to add?

Cost per workflow. Total spend is useful for finance, but workflow-level cost tells builders which product path needs a design change.

Key terms in this piece

AI app costs, token economics, prompt caching, LLM bill

Sources

[1] OpenAI, Prompt caching guide (developers.openai.com)
[2] Anthropic, Prompt caching documentation (platform.claude.com)
[3] OpenAI, Batch API guide (developers.openai.com)
[4] LangChain, Agent observability (langchain.com)
[5] OpenTelemetry, GenAI semantic conventions (opentelemetry.io)
