When your multi-agent pipeline has no cost parentage, every billing incident is a surprise you could have prevented. How to build a real-time agent cost attribution ledger using OTel baggage propagation, gateway enforcement, and rate-of-spend anomaly detection.
Your subagent retry loop just spent $2,100 in 11 minutes. You know this because the Anthropic billing dashboard crossed a threshold alert — not because your system caught it.
When you open the trace, every LLM call is attributed to one API key and one project. Which of your seven coordinated agents caused it? Which workflow triggered the loop? Which user intent started the chain? The gateway does not know. The observability tool does not know. Your post-mortem will say "estimated cost impact: $2,100" and your estimate of which agent is responsible will be a guess dressed as analysis.
This is not a hypothetical risk. In November 2025, a four-agent LangChain market research pipeline running A2A coordination entered an unintended loop between an Analyzer and a Verifier agent. They spent 11 days ping-ponging requests. Week 1: $127. Week 2: $891. Week 3: $6,240. Week 4: $18,400. Total: $47,000.[10] The team had monitoring dashboards. They did not have enforcement. The dashboard was green for 264 hours.
That is the attribution gap. It is not a reporting inconvenience — it is a production control failure.
The structural cause is not a missing dashboard filter. Orchestrators call subagents. Subagents make LLM calls. Each call hits the provider with the same API key and zero context about the originating workflow. The provider sees a cost pool. Your observability layer sees individual spans. Nobody sees the workflow that assembled them. Finance asks for a cost breakdown by feature or team. Engineering has one number and a spreadsheet built from deployment dates and gut feel.
Fix this before the next pipeline scales. The ledger is cheap. The post-mortems are not.
Why attribution context is dropped at the handoff boundary — and the structural fix
OTel baggage propagation as the attribution substrate (with runnable Python)
Gateway middleware that enforces per-agent budgets pre-forward, not post-hoc
Fair-share attribution for shared context windows and RAG retrieval layers
Rate-of-spend anomaly detection that catches loops in under 90 seconds
A chargeback architecture with self-serve finance drill-down
A pre-ship checklist and FAQ covering the production edge cases
Agentic models consume 5-30× more tokens per task than single-agent chat — Gartner, March 2026 [1]
Handoff-based swarm patterns generate 14,000+ tokens on multi-domain tasks vs ~9,000 for subagent patterns — AugmentCode [2]
Up from $1.2M in 2024 — a 480% increase in two years while attribution tooling has not kept pace [3]
Cost compounded week-over-week: $127 → $891 → $6,240 → $18,400. No budget ceiling, no enforcement. [10]
The problem is not that observability tools lack cost views. Attribution context was never injected into the call in the first place.
Orchestrators initialize with a user request. They call subagents — sometimes in parallel, sometimes sequentially, often both. Each subagent makes multiple LLM calls. Each call hits the gateway with the same API key and no context about the originating workflow, the user who triggered it, or the orchestrator that assembled the context.
Most observability platforms — Langfuse, LangSmith, Helicone — record cost at the span where the LLM call was made: the subagent span, not the originating workflow boundary.[5] The result is systematically misleading. Subagents look expensive. Orchestrators look cheap. The real cost driver — the workflow topology that created the call volume — stays invisible.
The handoff attribution problem compounds this. When a workflow spans six agents and context grows with each handoff, the final agent often carries 80% of the total token cost because it inherits everything the previous agents wrote. Naive per-call logging attributes that to the last agent rather than the orchestrator that assembled the context. You penalize the closer, not the architect.
This isn't a tooling gap you can patch with a dashboard filter. It's a structural gap: attribution context was never attached at call time because no orchestrator initialization step injected it. The gateway cannot reconstruct parentage retroactively. The OTel gen_ai semantic conventions define gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and gen_ai.request.model as standard span attributes[11] — but they say nothing about workflow parentage. That context has to travel via baggage, and somebody has to put it there.
Trace context propagation across process boundaries is the mechanism that makes per-agent cost rollup possible — and it is already in your stack.
OTel baggage travels with the trace context through process boundaries. A subagent spawned in a different process inherits the originating workflow's attribution tags automatically — as long as the orchestrator injected them at initialization.
The mechanism: the orchestrator sets baggage entries (workflow_id, user_id, agent_role, task_id) when the workflow starts. Every downstream span in every process that descends from that trace inherits the baggage. The gateway reads baggage headers from incoming requests and stamps attribution on the billing event before forwarding to the provider.
The BaggageSpanProcessor, distributed in OpenTelemetry's contrib packages rather than the core SDK (opentelemetry-processor-baggage for Python), copies selected baggage entries onto each span as attributes when the span starts. This means you do not have to thread attribution fields through every function call. Set the baggage once at the orchestrator; it flows through every agent invocation, every tool call, every subagent spawn downstream.[11]
Combined with the standard gen_ai.* span attributes — gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.system, gen_ai.request.model — this gives you the full attribution record per span: model, token counts, workflow parentage, and agent role.[12]
This is not a new framework. OTel is already in most platform stacks. The gap is that nobody wired the orchestrator initialization to set attribution baggage before the first LLM call fires.
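To make the mechanism concrete, here is a minimal, dependency-free sketch of what travels on the wire: the W3C `baggage` header is a comma-separated list of `key=value` pairs with percent-encoded values. In a real service you would let OpenTelemetry's `baggage.set_baggage` and `W3CBaggagePropagator` produce and parse this header rather than hand-rolling it. The helper names `encode_baggage` and `decode_baggage` and the sample IDs are illustrative assumptions.

```python
from urllib.parse import quote, unquote

# The four attribution keys named in the text.
ATTRIBUTION_KEYS = ("workflow_id", "user_id", "agent_role", "task_id")


def encode_baggage(entries: dict[str, str]) -> str:
    """Serialize attribution entries as a W3C `baggage` header value."""
    return ",".join(f"{key}={quote(value, safe='')}" for key, value in entries.items())


def decode_baggage(header: str) -> dict[str, str]:
    """Parse a `baggage` header value back into a dict, ignoring properties."""
    entries = {}
    for member in header.split(","):
        member = member.strip()
        if "=" not in member:
            continue
        key, _, rest = member.partition("=")
        value = rest.split(";", 1)[0]  # drop optional ;property suffixes
        entries[key.strip()] = unquote(value.strip())
    return entries


# Orchestrator side: set attribution once, at workflow start (sample values).
headers = {"baggage": encode_baggage({
    "workflow_id": "wf-market-research",
    "user_id": "user 42",
    "agent_role": "orchestrator",
    "task_id": "task-001",
})}

# Gateway side: recover the same attribution from the incoming request.
attribution = decode_baggage(headers["baggage"])
```

The point of the sketch is the round trip: whatever the orchestrator writes at initialization is exactly what the gateway can read before it forwards the call.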
Attribution data is only useful if the gateway acts on it before the damage is done.
Costs recorded after the provider billing event — no way to halt in flight
Threshold alerts fire after cumulative spend crosses a fixed monthly cap
Budget decisions use prior month actuals; agentic volume growth makes them stale within weeks
Runaway retry loop runs for 264 hours before manual detection — $47K later [10]
Post-mortem says 'estimated cost impact' with a range spanning an order of magnitude
Attribution is a reporting feature owned by the finance team
Gateway reads attribution header before forwarding request to provider — enforcement is synchronous
Per-agent spend checked against dynamic threshold at call time; request rejected with HTTP 402 if over budget
Circuit-breaker halts agent when rate-of-spend exceeds 3× the 7-day rolling baseline
Runaway loop caught in under 90 seconds; workflow paused, alert routed to on-call
Finance self-serve dashboard with per-team, per-feature, per-workflow drill-down
Attribution is a runtime enforcement primitive owned by the platform team
LiteLLM's a2a_iteration_budgets parameter caps the number of LLM calls per agentic session using a session_id.[6] max_iterations sets the call-count ceiling; max_budget_per_session sets the dollar ceiling. When either limit is hit, LiteLLM rejects the next call before it reaches the provider. This is enforcement at the call-count and session-budget layer — the right mechanism, at too coarse a grain for multi-workflow platforms.
Cost enforcement at the per-agent, per-workflow level requires one more step: a gateway middleware that reads the propagated baggage headers, looks up the current session spend for that workflow_id in the cost ledger, and rejects the request before forwarding if the budget gate fails. The reject returns HTTP 402 to the calling agent — the same signal LiteLLM uses — so existing retry logic handles it without changes.
The enforcement call is synchronous and pre-forward — not a post-hoc alert. An alert system tells you what happened. A gateway enforcement plane stops what is happening.
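A pre-forward budget gate can be sketched in a few lines. This is an illustration under stated assumptions, not a drop-in middleware: the in-memory dicts stand in for the cost ledger and budget config, `budget_gate` and `parse_baggage` are hypothetical names, and the per-call cost is an estimate because the real token count is only known after the provider responds.

```python
from urllib.parse import unquote

# Hypothetical in-memory stand-ins for the cost ledger and the budget config.
SPEND_USD = {"wf-research": 48.75, "wf-support": 3.10}
BUDGET_USD = {"wf-research": 50.00, "wf-support": 25.00}


def parse_baggage(header: str) -> dict[str, str]:
    """Parse a W3C `baggage` header value into a dict."""
    pairs = (m.split(";", 1)[0].partition("=") for m in header.split(",") if "=" in m)
    return {key.strip(): unquote(value.strip()) for key, _, value in pairs}


def budget_gate(headers: dict[str, str], estimated_cost_usd: float) -> tuple[int, str]:
    """Decide before forwarding. Returns (http_status, reason)."""
    workflow_id = parse_baggage(headers.get("baggage", "")).get("workflow_id")
    if workflow_id is None:
        # Assumption: unattributed calls are rejected rather than billed to nobody.
        return 400, "missing workflow_id baggage"
    budget = BUDGET_USD.get(workflow_id)
    if budget is None:
        return 403, f"no budget registered for {workflow_id}"
    spent = SPEND_USD.get(workflow_id, 0.0)
    if spent + estimated_cost_usd > budget:
        return 402, f"{workflow_id} over budget"
    return 200, "forward to provider"


status, reason = budget_gate({"baggage": "workflow_id=wf-research"}, 2.00)
```

Rejecting unattributed calls outright is a design choice worth making explicitly: a gateway that forwards calls without a `workflow_id` recreates the shared cost pool the ledger exists to remove.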
RAG pipelines and memory layers inject content into context windows that all downstream agents read. If a retrieval step adds 4,000 tokens and three agents each process that context, naive attribution charges all 4,000 tokens to the agent that triggered retrieval — inflating that agent's cost and burying the retrieval infrastructure cost entirely.
The handoff attribution problem is worst when context grows through sequential agents. Consider a concrete example: Agent A produces a 2,000-token summary. Agent B reads that summary and produces a 3,000-token analysis. Agent C reads both and makes the final LLM call with 5,000 tokens of inherited context plus 1,000 of its own. Per-call attribution charges Agent C with 6,000 input tokens when it only contributed 1,000 of them.
The fair-share formula: each agent's attributable input tokens = their marginal context contribution + (shared context tokens ÷ number of agents that read it). Retrieval costs — embedding calls, vector search — are attributed separately as infrastructure spend, not per-agent LLM spend. This prevents the retrieval layer from hiding inside whichever agent triggered it.
No existing observability tool implements this. Langfuse records the full context at each span.[5] LangSmith does the same. The calculation has to happen in your ledger layer, using the agent-scoped marginal token counts you track per handoff rather than the cumulative context size the provider bills.
This is a limitation worth naming: fair-share calculation requires knowing which agents read which tokens. If your orchestrator does not track context composition per agent, you cannot compute marginal attribution. The alternative is to instrument at the context assembly step — when the orchestrator prepares the prompt for each subagent, record which tokens it added versus which it forwarded from a previous agent. That single instrumentation point unlocks the entire fair-share calculation.
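The fair-share formula above can be sketched as a small function. It assumes the orchestrator already records, per context segment, how many tokens it holds and which agents read it; the function name, agent names, and marginal token counts are illustrative, while the 4,000-token retrieval shared by three agents mirrors the RAG case described earlier.

```python
from collections import defaultdict


def fair_share_input_tokens(
    marginal: dict[str, int],
    shared_segments: list[tuple[int, list[str]]],
) -> dict[str, float]:
    """Own marginal tokens plus an equal split of every shared segment read.

    marginal: tokens each agent added to the context itself.
    shared_segments: (token_count, readers) for context read by several agents.
    """
    attributed: dict[str, float] = defaultdict(float)
    for agent, tokens in marginal.items():
        attributed[agent] += tokens
    for tokens, readers in shared_segments:
        share = tokens / len(readers)  # shared tokens / number of agents that read them
        for agent in readers:
            attributed[agent] += share
    return dict(attributed)


# A 4,000-token retrieval result read by three agents (marginal counts are made up).
result = fair_share_input_tokens(
    marginal={"analyzer": 1000, "verifier": 500, "writer": 1500},
    shared_segments=[(4000, ["analyzer", "verifier", "writer"])],
)
```

Each agent carries roughly 1,333 tokens of the shared retrieval on top of its own contribution, and the attributed totals still sum to the 7,000 tokens actually in play, so nothing is double counted or lost.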
| Scenario | Naive Attribution | Fair-Share Attribution | What You Miss |
|---|---|---|---|
| Agent A writes 2K tokens, Agent B inherits + adds 1K, Agent C inherits + adds 1K | Agent A: 2K, Agent B: 3K, Agent C: 4K — totals 9K | Agent A: 2K, Agent B: 1.67K, Agent C: 1.33K — totals 5K | Fair-share credits shared tokens proportionally; naive penalizes later agents |
| RAG retrieval adds 4K tokens shared across 3 agents | Attributed to whichever agent triggered retrieval | 4K ÷ 3 = 1.33K per agent, plus retrieval cost as infra spend | Retrieval layer costs stay invisible in naive models |
| Orchestrator sends identical context to 4 parallel agents | Each agent billed for full context | Shared context split 4 ways; each agent billed for its own output only | Parallel fan-out amplifies naive attribution error by agent count |
Dynamic rate-of-spend thresholds detect runaway loops in minutes. Static caps detect them on the billing cycle.
LiteLLM's monthly budget enforcement and Langfuse usage alerts share the same structural flaw: they trip after cumulative spend crosses a fixed number.[6] A retry loop that consumes the monthly cap in 45 minutes is indistinguishable from normal end-of-month spend until it hits the ceiling. The $47K incident ran for 264 hours before anyone noticed — because the billing dashboard showed cumulative spend climbing gradually, not a spike.[10]
The rate-of-spend formula: alert_threshold = rolling_avg_15min_spend(7d) × 3.0. If the 15-minute spend rate for a workflow exceeds three times its 7-day rolling average, trigger an alert and optionally pause the agent. This catches the acceleration signal before cumulative damage exceeds a meaningful amount.
Implementation: a lightweight aggregation job running every 60 seconds, reading from the cost ledger, computing per-workflow 15-minute spend rates, and comparing against stored 7-day baselines. The check_rate_of_spend_anomaly method in the CostLedger class above implements this check — call it from a background thread or a scheduled task.
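As a standalone sketch of the threshold check itself: the baseline is a list of 15-minute spend totals covering the prior 7 days (672 windows), and the current window is flagged when it exceeds three times their average. The function names and the sample dollar figures are assumptions for illustration.

```python
from statistics import mean

MULTIPLIER = 3.0  # the 3x factor from the rate-of-spend formula


def alert_threshold(baseline_window_spend_usd: list[float]) -> float:
    """Three times the average 15-minute spend over the 7-day baseline."""
    return mean(baseline_window_spend_usd) * MULTIPLIER


def is_spend_anomalous(current_15min_spend_usd: float, baseline: list[float]) -> bool:
    """True when the current 15-minute window exceeds the dynamic threshold."""
    if not baseline:
        # Assumption: with no baseline there is no verdict, so do not alert.
        return False
    return current_15min_spend_usd > alert_threshold(baseline)


# 7 days x 96 windows per day, each averaging $0.40 (sample data).
baseline = [0.40] * (7 * 96)
looping = is_spend_anomalous(1.50, baseline)
normal = is_spend_anomalous(1.00, baseline)
```

Run per workflow on each aggregation tick; the only state the job needs is the list of window totals, which keeps the check cheap enough to execute every 60 seconds.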
Two detection mechanisms cover two failure modes: rate-of-spend anomaly detection catches spikes, and a weekly trend check catches drift. If the 7-day rolling average for a workflow grows faster than 15% week-over-week, flag it for review. A context window that grows by 200 tokens per run over 30 days will not trigger a spike alert — but it will show up in the trend check before it compounds.
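The drift check is simple arithmetic on two consecutive 7-day averages. A minimal sketch, with hypothetical function names and the 15% limit taken from the text:

```python
def week_over_week_growth(previous_7d_avg: float, current_7d_avg: float) -> float:
    """Fractional growth of the 7-day rolling average versus the prior week."""
    if previous_7d_avg <= 0:
        raise ValueError("previous average must be positive")
    return (current_7d_avg - previous_7d_avg) / previous_7d_avg


def needs_trend_review(previous_7d_avg: float, current_7d_avg: float,
                       limit: float = 0.15) -> bool:
    """Flag a workflow whose weekly average grew faster than the limit."""
    return week_over_week_growth(previous_7d_avg, current_7d_avg) > limit


# Sample averages in USD per day: 20% growth is flagged, 10% is not.
flagged = needs_trend_review(10.0, 12.0)
steady = needs_trend_review(10.0, 11.0)
```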
Cost attribution without a chargeback interface just relocates the spreadsheet inside the platform team.
Attribution without chargeback is internal accounting with no consequences. Finance still builds a spreadsheet. Product teams still run workflows without skin in the game. The goal is a self-serve interface where teams can see their spend trajectory before the billing period closes — and a governance model where routing optimization requests flow through a defined process rather than ad-hoc Slack threads.[13][14]
The attribution taxonomy is the contract between the ledger and the finance system. workflow_id maps to a feature; a feature maps to a team; a team maps to a cost center. This mapping lives in a config file, not hardcoded in the gateway. Platform engineers own the schema. Product teams declare their feature identifiers at workflow registration time. Without this contract, the ledger produces attribution data that finance cannot consume without manual translation — which means they will not use it.
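One way to express that contract is a small config document plus a resolver. The JSON layout, the identifiers, and the cost-center code below are all hypothetical; the point is only that the chain workflow, feature, team, cost center is data, not gateway code.

```python
import json

# Hypothetical taxonomy config; in practice this would live in its own file.
TAXONOMY_JSON = """
{
  "workflows": {"wf-market-research": "competitive-insights"},
  "features":  {"competitive-insights": "growth-team"},
  "teams":     {"growth-team": "CC-1042"}
}
"""


def resolve_cost_center(workflow_id: str, taxonomy: dict) -> dict[str, str]:
    """Walk workflow -> feature -> team -> cost center.

    Raises KeyError for anything unregistered, so gaps surface at lookup time.
    """
    feature = taxonomy["workflows"][workflow_id]
    team = taxonomy["features"][feature]
    return {
        "workflow_id": workflow_id,
        "feature": feature,
        "team": team,
        "cost_center": taxonomy["teams"][team],
    }


taxonomy = json.loads(TAXONOMY_JSON)
owner = resolve_cost_center("wf-market-research", taxonomy)
```

Failing loudly on an unregistered workflow is deliberate in this sketch: a silent default bucket is how unattributed spend accumulates.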
A typical B2B SaaS feature handling 100,000 requests monthly at Sonnet-class rates costs $270–$1,500 in direct LLM spend, before accounting for agentic multipliers. Agentic workflows running 5–30× more tokens per task push that range to $1,350–$45,000 per feature per month at scale.[1] Teams need to see that number before it lands on the invoice, not after.
workflow_id maps to a feature, feature maps to a team, team maps to a cost center. This mapping lives in a config file, not in the gateway. Platform engineers own the schema; product teams declare their feature identifiers at workflow registration time. The taxonomy is the contract between the ledger and the finance system — without it, attribution data cannot be consumed by finance without manual translation.
Sum attributable spend per workflow_id per day. Join against the taxonomy. Write to a time-series store. Keep both per-agent granularity (for debugging and post-mortems) and per-feature rollups (for chargeback). The aggregation job runs on whatever cadence finance needs — hourly for monitoring, daily for chargeback statements.
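A minimal sketch of that aggregation, keeping both granularities. The event shape (`ts`, `workflow_id`, `agent_role`, `cost_usd`) and the function names are assumptions; a real job would read from the ledger store and write to a time-series database instead of returning dicts.

```python
from collections import defaultdict
from datetime import datetime, timezone


def per_agent_daily(events: list[dict], workflow_to_feature: dict[str, str]) -> dict:
    """Sum spend per (day, feature, workflow_id, agent_role)."""
    totals: dict[tuple, float] = defaultdict(float)
    for event in events:
        day = datetime.fromtimestamp(event["ts"], tz=timezone.utc).date().isoformat()
        # Unmapped workflows stay visible instead of being dropped.
        feature = workflow_to_feature.get(event["workflow_id"], "unmapped")
        totals[(day, feature, event["workflow_id"], event["agent_role"])] += event["cost_usd"]
    return dict(totals)


def per_feature_daily(per_agent: dict) -> dict:
    """Roll the per-agent rows up to (day, feature) for chargeback."""
    totals: dict[tuple, float] = defaultdict(float)
    for (day, feature, _workflow, _agent), cost in per_agent.items():
        totals[(day, feature)] += cost
    return dict(totals)


start = datetime(2025, 12, 1, 9, 30, tzinfo=timezone.utc).timestamp()
events = [
    {"ts": start, "workflow_id": "wf-a", "agent_role": "analyzer", "cost_usd": 1.25},
    {"ts": start + 60, "workflow_id": "wf-a", "agent_role": "verifier", "cost_usd": 0.75},
    {"ts": start + 86400, "workflow_id": "wf-a", "agent_role": "analyzer", "cost_usd": 2.00},
]
per_agent = per_agent_daily(events, {"wf-a": "competitive-insights"})
per_feature = per_feature_daily(per_agent)
```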
Return daily, weekly, and monthly spend by team, feature, and workflow — plus current spend vs. budget remaining, and a projected end-of-period total based on the current 7-day rate. Teams can see when they are trending toward overage before the period closes. This single endpoint eliminates most of the finance escalation requests that land on the platform team.
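The projection behind that endpoint is a straight-line extrapolation of the last 7 days. A sketch with hypothetical function and field names and sample figures:

```python
def project_period_total(spend_to_date_usd: float, last_7d_spend_usd: float,
                         days_remaining: int) -> float:
    """Extend the current 7-day daily rate to the end of the billing period."""
    daily_rate = last_7d_spend_usd / 7
    return spend_to_date_usd + daily_rate * days_remaining


def budget_status(budget_usd: float, spend_to_date_usd: float,
                  last_7d_spend_usd: float, days_remaining: int) -> dict:
    """The per-team payload: current position plus projected end-of-period total."""
    projected = project_period_total(spend_to_date_usd, last_7d_spend_usd, days_remaining)
    return {
        "spend_to_date_usd": spend_to_date_usd,
        "budget_remaining_usd": budget_usd - spend_to_date_usd,
        "projected_total_usd": projected,
        "projected_overage": projected > budget_usd,
    }


# $900 spent, $420 over the last 7 days ($60/day), 12 days left, $1,500 budget.
status = budget_status(1500.0, 900.0, 420.0, 12)
```

In this sample the team is under budget today but projected to finish at $1,620, which is exactly the signal they need before the period closes.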
Product teams own their cost lines and can request model routing changes or context window optimizations. The platform team approves routing changes that affect shared infrastructure — a change that saves $2,000/month for one team but adds 50ms p99 latency for three others requires a different decision process than one that only affects the requesting team's workflow.
Not every multi-agent system needs a full attribution stack on day one. Here is the decision threshold.
Can I retrofit attribution to an existing pipeline without a rewrite?
Yes, if your agents call LLMs through a single gateway. Add baggage injection at the gateway entry point using a middleware layer — agents do not need to change. Cost is one middleware function and a schema migration in your cost store. The only hard dependency is that all LLM calls route through the same gateway. If some agents call providers directly, those calls are invisible to the ledger until you route them through the proxy. Start there before any baggage instrumentation work.
Does OpenTelemetry baggage survive async queue boundaries?
Not automatically. When an agent publishes a task to a queue, serialize the baggage context into the message payload using W3CBaggagePropagator.inject(carrier, context) and deserialize it on the consumer side with extract(carrier). The orchestrator_init.py code above shows both sides of this pattern with the _otel_carrier key. Without this, any async handoff drops the attribution thread and your cost records fragment into orphans with no rollup path.
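The queue pattern reduces to carrying the propagation carrier inside the message body. The sketch below uses only the standard library and a hand-written carrier so it runs anywhere; in real code the carrier dict would be filled by `W3CBaggagePropagator().inject(carrier)` on the producer and read back with `.extract(carrier)` on the consumer. The `publish` and `consume` helpers and the sample task are illustrative assumptions.

```python
import json
import queue

task_queue: "queue.Queue[str]" = queue.Queue()


def publish(task: dict, carrier: dict[str, str]) -> None:
    """Producer: embed the propagation carrier in the message payload."""
    task_queue.put(json.dumps({**task, "_otel_carrier": carrier}))


def consume() -> tuple[dict, dict[str, str]]:
    """Consumer: split the carrier back out before handling the task."""
    message = json.loads(task_queue.get_nowait())
    carrier = message.pop("_otel_carrier", {})
    return message, carrier


# Stand-in for a carrier populated by the OTel propagator.
carrier = {"baggage": "workflow_id=wf-market-research,agent_role=analyzer"}
publish({"task": "summarize", "doc_id": "d-17"}, carrier)
task, restored = consume()
```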
What granularity is right — per-agent or per-step?
Per-agent for chargeback and budget enforcement. Per-step for debugging and optimization. Store both — the ledger is cheap. Step-level data is what you use in post-mortems; agent-level aggregation is what finance and product teams see. If you have to pick one, start with per-agent. Per-step attribution is useless if you cannot answer 'which agent owns this cost center.'
How do I attribute streaming responses where token counts arrive at stream end?
Buffer the attribution event. Stamp the start-of-stream with the attribution metadata and the request_id. When the stream closes and token counts are final, emit the cost event with both fields populated. Never approximate token counts mid-stream for billing or enforcement purposes — partial counts produce systematic undercharging until the stream completes, which then triggers a spike that looks like an anomaly but is deferred billing.
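A buffer for streamed calls can be as small as a dict keyed by `request_id`. This is a sketch under assumptions: the class name, method names, and event fields are hypothetical, and a production version would also need a timeout for streams that never close.

```python
class StreamAttributionBuffer:
    """Hold attribution at stream start; emit one cost event at stream end."""

    def __init__(self) -> None:
        self._open: dict[str, dict] = {}
        self.emitted: list[dict] = []

    def start(self, request_id: str, attribution: dict) -> None:
        """Stamp the start of the stream with its attribution metadata."""
        self._open[request_id] = dict(attribution)

    def finish(self, request_id: str, input_tokens: int, output_tokens: int) -> dict:
        """Emit the cost event once the final token counts are known."""
        attribution = self._open.pop(request_id)
        event = {
            "request_id": request_id,
            **attribution,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
        }
        self.emitted.append(event)
        return event

    def pending(self) -> int:
        """Streams still open, for which nothing has been billed yet."""
        return len(self._open)


buffer = StreamAttributionBuffer()
buffer.start("req-1", {"workflow_id": "wf-a", "agent_role": "writer"})
still_open = buffer.pending()
event = buffer.finish("req-1", input_tokens=5200, output_tokens=840)
```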
What happens to attribution when an orchestrator retries a failed subagent?
Each retry is a separate LLM call with the same baggage context. The ledger records all of them. Track retry count as a span attribute (agent.retry_count) so the finance report can distinguish baseline spend from retry overhead. A workflow that succeeds on the fourth attempt spent 4× the expected tokens on that step. That pattern — high retry counts on a specific agent — surfaces model reliability issues that pure cost monitoring misses entirely.
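Separating baseline spend from retry overhead is a single pass over the ledger rows. The sketch assumes each row carries the `agent.retry_count` attribute described above, with 0 meaning the first attempt; the function name and sample costs are illustrative.

```python
def split_retry_overhead(calls: list[dict]) -> dict[str, float]:
    """Split spend into first-attempt cost and cost incurred by retries."""
    baseline = 0.0
    overhead = 0.0
    for call in calls:
        if call["agent.retry_count"] == 0:
            baseline += call["cost_usd"]
        else:
            overhead += call["cost_usd"]
    return {"baseline_usd": baseline, "retry_overhead_usd": overhead}


# A step that succeeds on the fourth attempt: one first try plus three retries.
calls = [{"agent.retry_count": n, "cost_usd": 0.05} for n in range(4)]
report = split_retry_overhead(calls)
```

Here three quarters of the step's spend is retry overhead, which is the figure that points at a reliability problem rather than a pricing one.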
How should I backfill the 7-day rolling baseline for a new workflow?
Run a load test against the gateway for 30–60 minutes at realistic throughput before enabling rate-of-spend anomaly detection. Feed those records into the rolling window store with synthetic timestamps spread across the prior 7 days. This gives the anomaly detector a baseline to compare against from day one, rather than triggering false positives during the first week of production traffic when any spend looks like a spike against an empty window.
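One possible shape for the backfill, sketched under an explicit assumption: rather than stretching 30 to 60 minutes of records thinly over a week (which would understate the per-window rate), it replays the measured 15-minute window totals cyclically until all 672 windows of the prior 7 days are filled. The function name, record fields, and sample figures are hypothetical.

```python
WINDOW_SECONDS = 15 * 60
WINDOWS_PER_WEEK = 7 * 24 * 4  # 672 fifteen-minute windows


def backfill_baseline(measured_window_spend_usd: list[float], now_ts: float) -> list[dict]:
    """Tile load-test window totals across the prior 7 days with synthetic timestamps."""
    if not measured_window_spend_usd:
        raise ValueError("need at least one measured 15-minute window")
    start = now_ts - WINDOWS_PER_WEEK * WINDOW_SECONDS
    return [
        {
            "window_start": start + i * WINDOW_SECONDS,
            "spend_usd": measured_window_spend_usd[i % len(measured_window_spend_usd)],
            "synthetic": True,  # lets you exclude these rows from chargeback reports
        }
        for i in range(WINDOWS_PER_WEEK)
    ]


# A 60-minute load test yields four measured windows (sample figures).
now_ts = 1_800_000_000.0
baseline = backfill_baseline([0.38, 0.42, 0.40, 0.40], now_ts)
```

Marking the rows as synthetic matters: they exist to seed the anomaly detector and should age out of the rolling window as real traffic replaces them.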
The ledger is not a reporting artifact that satisfies the next finance review. It is the enforcement substrate that makes autonomous agents safe to run at scale.
Every agentic system without per-agent attribution is making an implicit bet: nothing will go wrong before the monthly bill arrives. That bet holds for the first few months. It stops holding when the pipeline scales, context windows grow with each handoff, and the subagent topology picks up three more agents for a feature that shipped under deadline pressure.
The $47K incident ran for 11 days because the dashboard showed green.[10] Green dashboards do not mean your system is safe — they mean your monitoring has not caught the problem yet. A gateway enforcement plane shows red before the damage compounds.
Build the ledger before you need the post-mortem. The implementation is a middleware layer, a few hundred lines of Python, and an OTel baggage injection at orchestrator init. Platform teams that wire attribution into the enforcement plane stop writing post-mortems with sections called 'estimated cost impact.' Teams that treat cost attribution as an observability feature keep writing them.