Your subagent retry loop just spent $2,100 in 11 minutes. You know this because the Anthropic billing dashboard crossed a threshold alert — not because your system caught it.
When you open the trace, every LLM call is attributed to one API key and one project. Which of your seven coordinated agents caused it? Which workflow triggered the loop? Which user intent started the chain? The gateway does not know. The observability tool does not know. Your post-mortem will say "estimated cost impact: $2,100" and your estimate of which agent is responsible will be a guess dressed as analysis.
That is the attribution gap. It is not a reporting inconvenience — it is a production control failure.
The structural cause is not a missing dashboard filter. Orchestrators call subagents. Subagents make LLM calls. Each call hits the provider with the same API key and zero context about the originating workflow. The provider sees a cost pool. Your observability layer sees individual spans. Nobody sees the workflow that assembled them. Finance asks for a cost breakdown by feature or team. Engineering has one number and a spreadsheet built from deployment dates and gut feel.
Fix this before the next pipeline scales. The ledger is cheap. The post-mortems are not.
Agentic models consume 5-30× more tokens per task than single-agent chat — Gartner, March 2026 [1]
Handoff-based swarm patterns generate 14,000+ tokens on multi-domain tasks vs ~9,000 for subagent patterns — AugmentCode [2]
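Average enterprise AI budget reached $7M in 2026, up from $1.2M in 2024: a roughly 480% increase in two years while attribution tooling has not kept pace [3]
Token prices fell 280× over two years while total enterprise AI spend rose 320%: volume driven by agentic workloads outpaces unit cost reductions, so spend grows even as per-token cost collapses [4]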
Cost Attribution Fails at the Handoff Boundary
The problem is not that observability tools lack cost views. Attribution context was never injected into the call in the first place.
Orchestrators initialize with a user request. They call subagents — sometimes in parallel, sometimes sequentially, often both. Each subagent makes multiple LLM calls. Each call hits the gateway with the same API key and no context about the originating workflow, the user who triggered it, or the orchestrator that assembled the context.
Most observability platforms (Langfuse, LangSmith, Helicone) record cost at the span where the LLM call was made — the subagent span — not at the originating workflow boundary.[5] The result is systematically misleading: subagents look expensive, orchestrators look cheap, and the real cost driver — the workflow topology that created the call volume — stays invisible.
The handoff attribution problem compounds this. When a workflow spans six agents and context grows with each handoff, the final agent often carries 80% of the total token cost because it inherits everything the previous agents wrote. Naive per-call logging attributes that to the last agent rather than the orchestrator that assembled the context. You penalize the closer, not the architect.
This is not a tooling limitation. It is a structural gap: attribution context was never attached at call time because no orchestrator initialization step injected it. The gateway cannot reconstruct parentage retroactively. Fix it at the source.
OpenTelemetry Baggage Is the Attribution Substrate
Trace context propagation across process boundaries is the mechanism that makes per-agent cost rollup possible — and it is already in your stack.
OpenTelemetry baggage travels with the trace context through process boundaries. A subagent spawned in a different process inherits the originating workflow's attribution tags, as long as the orchestrator injected them at initialization and every hop passes the propagation headers along.
The mechanism: the orchestrator sets baggage entries (workflow_id, user_id, agent_role, task_id) when the workflow starts. Every downstream span in every process that descends from that trace inherits the baggage. The gateway reads baggage headers from incoming requests and stamps attribution on the billing event before forwarding to the provider.
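The gateway side of that mechanism can be sketched without any SDK. This is a minimal, simplified parser for the W3C `baggage` header, not a full implementation: it drops optional properties after `;`, and it assumes header names arrive lowercased. `ATTRIBUTION_KEYS`, `parse_baggage_header`, and `stamp_billing_event` are hypothetical names; a real gateway would more likely use the OTel propagator's `extract`.

```python
from urllib.parse import unquote

# The four tags the orchestrator is expected to set (from the text above).
ATTRIBUTION_KEYS = ("workflow_id", "user_id", "agent_role", "task_id")


def parse_baggage_header(header_value: str) -> dict[str, str]:
    """Parse a W3C `baggage` header into a key/value dict (simplified)."""
    entries: dict[str, str] = {}
    for member in header_value.split(","):
        key_value = member.split(";", 1)[0]  # drop optional properties
        if "=" not in key_value:
            continue  # skip empty or malformed members
        key, value = key_value.split("=", 1)
        entries[key.strip()] = unquote(value.strip())
    return entries


def stamp_billing_event(headers: dict[str, str], model: str) -> dict[str, str]:
    """Build the attribution part of a billing event from request headers."""
    baggage_entries = parse_baggage_header(headers.get("baggage", ""))
    event = {"model": model}
    for key in ATTRIBUTION_KEYS:
        # Record missing tags explicitly so unattributed spend stays visible.
        event[key] = baggage_entries.get(key, "unattributed")
    return event


incoming = {"baggage": "workflow_id=wf-42,user_id=u%2F7,agent_role=subagent-retriever"}
event = stamp_billing_event(incoming, model="example-model")
```

Stamping `unattributed` instead of dropping the event is a design choice: calls that bypass orchestrator initialization show up in the ledger as a line item rather than vanishing into the pool.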
The OpenTelemetry gen_ai semantic conventions define the standard span attributes for AI workloads: gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.system, gen_ai.request.model.[7] Combined with baggage-propagated workflow context, these give you the full attribution record: model, token counts, workflow parentage, and agent role — all attached to the same span.
This is not a new framework. OTel is already in most platform stacks. The gap is that nobody wired the orchestrator initialization to set attribution baggage before the first LLM call fires.
orchestrator_init.py

from opentelemetry import baggage, context, trace
from opentelemetry.baggage.propagation import W3CBaggagePropagator
from opentelemetry.propagators.composite import CompositePropagator
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator
import httpx

# Carries both the trace context and the baggage across process boundaries.
PROPAGATOR = CompositePropagator(
    [TraceContextTextMapPropagator(), W3CBaggagePropagator()]
)


# Set at orchestrator init, before any agent spawns.
# Every downstream span in every process inherits these tags.
def init_workflow_context(
    workflow_id: str,
    user_id: str,
    task_id: str,
) -> context.Context:
    ctx = context.get_current()
    ctx = baggage.set_baggage("workflow_id", workflow_id, context=ctx)
    ctx = baggage.set_baggage("user_id", user_id, context=ctx)
    ctx = baggage.set_baggage("agent_role", "orchestrator", context=ctx)
    ctx = baggage.set_baggage("task_id", task_id, context=ctx)
    return ctx


def make_attributed_llm_call(
    payload: dict,
    ctx: context.Context,
    gateway_url: str,
) -> dict:
    tracer = trace.get_tracer(__name__)
    with tracer.start_as_current_span(
        "llm.completion",
        context=ctx,
        attributes={
            "gen_ai.system": "anthropic",
            "gen_ai.request.model": payload.get("model", "unknown"),
        },
    ) as span:
        headers: dict[str, str] = {}
        # Inject trace + baggage into outgoing HTTP headers.
        # Gateway reads these and stamps attribution on the billing event.
        # Build the injected context from ctx explicitly: start_as_current_span
        # uses ctx only to parent the span, so the ambient context inside this
        # block does not carry ctx's baggage unless ctx was attached beforehand.
        PROPAGATOR.inject(headers, context=trace.set_span_in_context(span, ctx))
        response = httpx.post(gateway_url, json=payload, headers=headers)
        response.raise_for_status()
        data = response.json()
        # Standard gen_ai attributes, rolled up by the cost ledger.
        # Anthropic-style responses nest token counts under "usage"; fall back
        # to top-level keys in case the gateway flattens them.
        usage = data.get("usage", data)
        span.set_attribute("gen_ai.usage.input_tokens", usage.get("input_tokens", 0))
        span.set_attribute("gen_ai.usage.output_tokens", usage.get("output_tokens", 0))
        return data


# Subagent in a separate process: extract baggage from incoming request headers.
def subagent_context_from_request(headers: dict) -> context.Context:
    ctx = PROPAGATOR.extract(carrier=headers)
    # Override agent_role for this subagent's spans.
    ctx = baggage.set_baggage("agent_role", "subagent-retriever", context=ctx)
    return ctx

Observability Without Enforcement Is Just a Better Post-Mortem
Attribution data is only useful if the gateway acts on it before the damage is done.
Observability only:
- Costs recorded after the provider billing event, with no way to halt in flight
- Threshold alerts fire after cumulative spend crosses a fixed monthly cap
- Budget decisions use prior month actuals; agentic volume growth makes them stale within weeks
- Runaway retry loop caught by billing alert or human escalation (median detection: 45 minutes)
- Post-mortem says 'estimated cost impact' with a range spanning an order of magnitude
- Attribution is a reporting feature owned by the finance team

Gateway enforcement:
- Gateway reads attribution header before forwarding request to provider
- Per-agent spend checked against dynamic threshold at call time; enforcement happens pre-forward
- Circuit-breaker halts agent when rate-of-spend exceeds 3× the 7-day rolling baseline
- Runaway loop caught in under 90 seconds; workflow paused, alert routed to on-call
- Finance self-serve dashboard with per-team, per-feature, per-workflow drill-down
- Attribution is a runtime enforcement primitive owned by the platform team
LiteLLM's a2a_iteration_budgets parameter caps the number of LLM calls per agentic session.[6] That is enforcement at the call-count layer. Cost enforcement requires one more step: a gateway middleware that reads the propagated baggage headers, looks up the current session spend for that workflow_id, and rejects the request before forwarding if the budget gate fails.
The enforcement call is synchronous and pre-forward — not a post-hoc alert. The distinction matters. An alert system tells you what happened. A gateway enforcement plane stops what is happening.
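A pre-forward gate has one awkward property: output tokens are unknown until the provider responds. One way to handle that, sketched below, is to reserve the worst case using the request's `max_tokens` ceiling and reject if the reservation would breach the budget. This is an assumption about how the gate could work, not a description of any gateway's behavior; `budget_gate`, `worst_case_cost`, `BudgetGateRejected`, and the prices are all hypothetical.

```python
from decimal import Decimal

# Hypothetical prices; substitute your model's real rates.
PRICE_PER_INPUT_TOKEN = Decimal("0.000003")
PRICE_PER_OUTPUT_TOKEN = Decimal("0.000015")


class BudgetGateRejected(Exception):
    """Raised before forwarding when a request could exceed the budget."""


def worst_case_cost(input_tokens: int, max_output_tokens: int) -> Decimal:
    # Output size is unknown before the call, so assume the request's
    # max_tokens ceiling is fully used.
    return (
        Decimal(input_tokens) * PRICE_PER_INPUT_TOKEN
        + Decimal(max_output_tokens) * PRICE_PER_OUTPUT_TOKEN
    )


def budget_gate(
    spent_by_workflow: dict[str, Decimal],
    workflow_id: str,
    input_tokens: int,
    max_output_tokens: int,
    budget_usd: Decimal,
) -> Decimal:
    """Return the reserved worst-case cost, or raise before forwarding."""
    spent = spent_by_workflow.get(workflow_id, Decimal("0"))
    reserve = worst_case_cost(input_tokens, max_output_tokens)
    if spent + reserve > budget_usd:
        raise BudgetGateRejected(
            f"workflow {workflow_id}: {spent} spent + {reserve} worst case "
            f"exceeds budget {budget_usd}"
        )
    return reserve


spend = {"wf-42": Decimal("4.90")}
reserve = budget_gate(spend, "wf-42", 10_000, 1_000, Decimal("5.00"))
```

Reserving the ceiling is deliberately conservative: it can reject a call that would have fit, but it never forwards one that cannot. Once the response arrives, replace the reservation with the actual cost.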
gateway_middleware.py

"""Gateway ledger: enforces per-workflow spend budgets from OTel baggage attribution.
Call it from a FastAPI middleware layer in front of your LLM provider proxy.
"""
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from decimal import Decimal
import threading

PRICE_PER_INPUT_TOKEN = Decimal("0.000003")  # adjust per model
PRICE_PER_OUTPUT_TOKEN = Decimal("0.000015")


def _utcnow() -> datetime:
    # Timezone-aware replacement for the deprecated datetime.utcnow().
    return datetime.now(timezone.utc)


class BudgetExhaustedError(Exception):
    pass


@dataclass
class SpendRecord:
    workflow_id: str
    agent_role: str
    user_id: str
    task_id: str
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: Decimal
    timestamp: datetime = field(default_factory=_utcnow)


class CostLedger:
    """Per-workflow, per-agent spend tracker with budget enforcement."""

    def __init__(self, budget_usd_per_workflow: Decimal = Decimal("5.00")):
        self._budget = budget_usd_per_workflow
        self._records: list[SpendRecord] = []
        self._lock = threading.Lock()
        # Rolling 7-day window for rate-of-spend baselines.
        self._rolling_window: dict[str, deque[tuple[datetime, Decimal]]] = {}

    def record_and_enforce(
        self,
        workflow_id: str,
        agent_role: str,
        user_id: str,
        task_id: str,
        model: str,
        input_tokens: int,
        output_tokens: int,
    ) -> SpendRecord:
        cost = (
            Decimal(input_tokens) * PRICE_PER_INPUT_TOKEN
            + Decimal(output_tokens) * PRICE_PER_OUTPUT_TOKEN
        )
        record = SpendRecord(
            workflow_id=workflow_id,
            agent_role=agent_role,
            user_id=user_id,
            task_id=task_id,
            model=model,
            input_tokens=input_tokens,
            output_tokens=output_tokens,
            cost_usd=cost,
        )
        with self._lock:
            current_total = sum(
                r.cost_usd
                for r in self._records
                if r.workflow_id == workflow_id
            )
            if current_total + cost > self._budget:
                raise BudgetExhaustedError(
                    f"Workflow {workflow_id} budget exhausted: "
                    f"${float(current_total + cost):.4f} > ${float(self._budget):.2f}"
                )
            self._records.append(record)
            self._update_rolling_baseline(workflow_id, cost)
        return record

    def _update_rolling_baseline(self, workflow_id: str, cost: Decimal) -> None:
        # Caller must hold self._lock.
        now = _utcnow()
        cutoff = now - timedelta(days=7)
        if workflow_id not in self._rolling_window:
            self._rolling_window[workflow_id] = deque()
        window = self._rolling_window[workflow_id]
        window.append((now, cost))
        # Evict records older than 7 days.
        while window and window[0][0] < cutoff:
            window.popleft()

    def check_rate_of_spend_anomaly(
        self, workflow_id: str, multiplier: float = 3.0
    ) -> bool:
        """Returns True if 15-min spend > 3x the 7-day per-15min average."""
        with self._lock:
            window = self._rolling_window.get(workflow_id, deque())
            if not window:
                return False
            now = _utcnow()
            cutoff_15m = now - timedelta(minutes=15)
            cutoff_7d = now - timedelta(days=7)
            spend_15m = sum(c for t, c in window if t >= cutoff_15m)
            total_7d = sum(c for t, c in window if t >= cutoff_7d)
        # Number of 15-minute buckets in 7 days: 7 * 24 * 4 = 672.
        # With less than 7 days of history the average is understated, so a
        # brand-new workflow trips immediately; backfill a baseline first.
        avg_15m = total_7d / Decimal("672")
        return spend_15m > avg_15m * Decimal(str(multiplier))

    def finance_report(self, workflow_id: str | None = None) -> list[dict]:
        """Finance-drillable output: per agent_role, workflow, cost, model, timestamp."""
        with self._lock:
            records = [
                r for r in self._records
                if workflow_id is None or r.workflow_id == workflow_id
            ]
        return [
            {
                "workflow_id": r.workflow_id,
                "agent_role": r.agent_role,
                "user_id": r.user_id,
                "task_id": r.task_id,
                "model": r.model,
                "input_tokens": r.input_tokens,
                "output_tokens": r.output_tokens,
                "cost_usd": float(r.cost_usd),
                "timestamp": r.timestamp.isoformat(),
            }
            for r in records
        ]

RAG pipelines and memory layers inject content into context windows that all downstream agents read. If a retrieval step adds 4,000 tokens and three agents each process that context, naive attribution charges all 4,000 tokens to the agent that triggered retrieval, inflating that agent's cost and burying the retrieval infrastructure cost entirely.
The handoff attribution problem is worst when context grows through sequential agents. Agent A produces a 2,000-token summary. Agent B reads that summary and produces a 3,000-token analysis. Agent C reads both and makes the final LLM call with 5,000 tokens of inherited context plus 1,000 of its own. Per-call attribution charges Agent C with 6,000 input tokens when it only contributed 1,000 of them.
The fair-share formula: each agent's attributable input tokens = their marginal context contribution + (shared context tokens ÷ number of agents that read it). Retrieval costs — embedding calls, vector search — are attributed separately as infrastructure spend, not per-agent LLM spend. This prevents the retrieval layer from hiding inside whichever agent triggered it.
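The formula can be sketched as a small function. This reading allocates each shared segment's size once, split equally across the agents that read it; how to reconcile those shares with the provider's per-call billed totals is left open here. `fair_share_input_tokens` is a hypothetical name, the shared segment is the 4,000-token retrieval read by three agents from the example above, and the marginal counts are made up for illustration.

```python
from fractions import Fraction


def fair_share_input_tokens(
    marginal_tokens: dict[str, int],
    shared_segments: list[tuple[int, set[str]]],
) -> dict[str, Fraction]:
    """Marginal contribution plus an equal split of each shared segment.

    shared_segments holds (token_count, agents_that_read_it) pairs.
    """
    attributed = {agent: Fraction(tokens) for agent, tokens in marginal_tokens.items()}
    for segment_tokens, readers in shared_segments:
        if not readers:
            continue  # nobody read it, so nothing is allocated to agents
        share = Fraction(segment_tokens, len(readers))
        for agent in readers:
            attributed[agent] = attributed.get(agent, Fraction(0)) + share
    return attributed


# Illustrative marginal counts (assumed), one shared 4,000-token retrieval.
marginal = {"planner": 500, "analyst": 800, "writer": 1000}
shared = [(4000, {"planner", "analyst", "writer"})]
result = fair_share_input_tokens(marginal, shared)
```

Fractions keep the split exact, so the shares always sum back to the total; round only at the reporting edge.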
No existing observability tool implements this. Langfuse records the full context at each span.[5] LangSmith does the same. The calculation has to happen in your ledger layer, using the agent-scoped marginal token counts you track per handoff rather than the cumulative context size the provider bills.
This is a limitation worth naming: the fair-share calculation requires knowing which agents read which tokens. If your orchestrator does not track context composition per agent, you cannot compute marginal attribution. The alternative is to instrument at the context assembly step — when the orchestrator prepares the prompt for each subagent, record which tokens it added versus which it forwarded from a previous agent.
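Instrumenting the assembly step can be as small as tagging each prompt segment with its origin. The sketch below assumes the orchestrator already has a token count for every segment (the tokenizer is out of scope); `PromptAssembly` and its methods are hypothetical names. The numbers are Agent C's prompt from the sequential example above.

```python
from dataclasses import dataclass, field


@dataclass
class PromptAssembly:
    """Tracks where each piece of one agent's prompt came from."""

    agent_role: str
    segments: list[tuple[str, int]] = field(default_factory=list)  # (origin, tokens)

    def add(self, origin: str, tokens: int) -> None:
        self.segments.append((origin, tokens))

    def marginal_tokens(self) -> int:
        # Tokens this agent added itself.
        return sum(t for origin, t in self.segments if origin == self.agent_role)

    def forwarded_tokens(self) -> dict[str, int]:
        # Tokens inherited from earlier agents, grouped by who wrote them.
        forwarded: dict[str, int] = {}
        for origin, tokens in self.segments:
            if origin != self.agent_role:
                forwarded[origin] = forwarded.get(origin, 0) + tokens
        return forwarded


assembly = PromptAssembly(agent_role="agent_c")
assembly.add("agent_a", 2000)  # summary written by Agent A
assembly.add("agent_b", 3000)  # analysis written by Agent B
assembly.add("agent_c", 1000)  # Agent C's own instructions
```

Persist the two outputs alongside the span and the ledger has what it needs to compute marginal attribution later.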
Static Monthly Caps Catch Fires After the Building Burns Down
Dynamic rate-of-spend thresholds detect runaway loops in minutes. Static caps detect them on the billing cycle.
LiteLLM's monthly budget enforcement and Langfuse usage alerts share the same structural flaw: they trip after cumulative spend crosses a fixed number.[6] A retry loop that consumes the monthly cap in 45 minutes is indistinguishable from normal end-of-month spend until it hits the ceiling.
The rate-of-spend formula: alert_threshold = rolling_avg_15min_spend(7d) × 3.0. If the 15-minute spend rate for a workflow exceeds three times its 7-day rolling average, trigger an alert and optionally pause the agent. This catches the acceleration signal before cumulative damage exceeds a meaningful amount.
Implementation: a lightweight aggregation job running every 60 seconds, reading from the cost ledger, computing per-workflow 15-minute spend rates, and comparing against stored 7-day baselines. The check_rate_of_spend_anomaly method in the CostLedger class above implements this check — call it from a background thread or a scheduled task.
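A minimal sketch of that polling loop, using only the standard library. The three callables are stand-ins: in practice `is_anomalous` would be the ledger's `check_rate_of_spend_anomaly`, and `on_anomaly` would pause the agent and page on-call. `run_anomaly_checks` is a hypothetical name, and the demo below stops itself after the first hit purely so it terminates.

```python
import threading
from typing import Callable, Iterable


def run_anomaly_checks(
    workflow_ids: Callable[[], Iterable[str]],
    is_anomalous: Callable[[str], bool],
    on_anomaly: Callable[[str], None],
    stop: threading.Event,
    interval_seconds: float = 60.0,
) -> None:
    """Poll every active workflow until `stop` is set."""
    while not stop.is_set():
        for workflow_id in workflow_ids():
            if is_anomalous(workflow_id):
                on_anomaly(workflow_id)
        # Event.wait doubles as an interruptible sleep.
        stop.wait(interval_seconds)


flagged: list[str] = []
stop = threading.Event()


def pause_and_alert(workflow_id: str) -> None:
    flagged.append(workflow_id)
    stop.set()  # demo only: end the loop after the first hit


worker = threading.Thread(
    target=run_anomaly_checks,
    args=(lambda: ["wf-1", "wf-2"], lambda wf: wf == "wf-2", pause_and_alert, stop, 0.01),
)
worker.start()
worker.join(timeout=5)
```

Using an `Event` instead of `time.sleep` means shutdown does not have to wait out a full interval.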
Two detection mechanisms cover two failure modes: rate-of-spend anomaly detection catches spikes, and a weekly trend check catches drift. If the 7-day rolling average for a workflow grows faster than 15% week-over-week, flag it for review. A context window that grows by 200 tokens per run over 30 days will not trigger a spike alert — but it will show up in the trend check before it compounds into a meaningful cost increase.
The Platform Team Owns the Ledger. Product Teams Own Their Budget Lines.
Cost attribution without a chargeback interface just relocates the spreadsheet inside the platform team.
- [01] Define the attribution taxonomy. workflow_id maps to a feature, feature maps to a team, team maps to a cost center. This mapping lives in a config file, not in the gateway. Platform engineers own the schema; product teams declare their feature identifiers at workflow registration time. The taxonomy is the contract between the ledger and the finance system.
- [02] Aggregate the ledger. Sum attributable spend per workflow_id per day. Join against the taxonomy. Write to a time-series store. Keep both per-agent granularity (for debugging) and per-feature rollups (for chargeback). The aggregation job runs on whatever cadence finance needs: hourly for monitoring, daily for chargeback.
- [03] Expose a self-serve interface. A simple API that returns daily, weekly, and monthly spend by team, feature, and workflow. Product teams query it without needing gateway access or platform team involvement. The interface shows current spend vs. budget remaining, not just historical totals. Teams can see when they are trending toward overage before the period closes.
- [04] Run monthly chargeback reviews. Product teams own their cost lines and can request model routing changes or context window optimizations to reduce spend. The platform team approves routing changes that affect shared infrastructure: a change that saves $2,000/month for one team but adds 50ms p99 latency for three others requires a different decision process than a change that only affects the requesting team's workflow.
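The first two steps can be sketched together: a taxonomy lookup joined against ledger rows, rolled up per team per day. Everything named here is illustrative. The `TAXONOMY` entries, cost centers, and `rollup_by_team` are hypothetical, and the input rows are assumed to have the shape the ledger's finance report emits (workflow_id, cost_usd, ISO 8601 timestamp).

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical taxonomy: workflow_id -> feature -> team -> cost center.
# In practice this would be loaded from the config file described above.
TAXONOMY = {
    "wf-support-triage": {"feature": "support-triage", "team": "support", "cost_center": "CC-100"},
    "wf-doc-search": {"feature": "doc-search", "team": "knowledge", "cost_center": "CC-200"},
}
UNMAPPED = {"feature": "unmapped", "team": "unmapped", "cost_center": "unmapped"}


def rollup_by_team(records: list[dict]) -> dict[tuple[str, str], Decimal]:
    """Sum cost per (day, team), sending unregistered workflows to 'unmapped'."""
    totals: dict[tuple[str, str], Decimal] = defaultdict(Decimal)
    for record in records:
        entry = TAXONOMY.get(record["workflow_id"], UNMAPPED)
        day = record["timestamp"][:10]  # date prefix of an ISO 8601 timestamp
        # Go through str() so float inputs do not drag binary noise into Decimal.
        totals[(day, entry["team"])] += Decimal(str(record["cost_usd"]))
    return dict(totals)


rows = [
    {"workflow_id": "wf-support-triage", "cost_usd": 0.25, "timestamp": "2026-03-01T10:00:00+00:00"},
    {"workflow_id": "wf-support-triage", "cost_usd": 0.5, "timestamp": "2026-03-01T11:30:00+00:00"},
    {"workflow_id": "wf-doc-search", "cost_usd": 0.1, "timestamp": "2026-03-01T12:00:00+00:00"},
    {"workflow_id": "wf-unregistered", "cost_usd": 0.05, "timestamp": "2026-03-01T12:05:00+00:00"},
]
daily = rollup_by_team(rows)
```

The `unmapped` bucket is the useful part: any spend that lands there is a workflow that shipped without a taxonomy entry.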
Attribution Checklist: Before You Ship a New Workflow
workflow_id injected at orchestrator initialization
agent_role set per agent, not per deployment — one deployment can run multiple roles
user_id or tenant_id propagated from the entry point through every baggage context
OTel baggage headers verified in a test trace before promoting to production
Per-agent budget cap configured in the gateway ledger
Rate-of-spend baseline established from a load test (7-day window requires backfill)
Retrieval costs attributed separately from generation costs in the taxonomy
Chargeback taxonomy entry created: workflow_id → feature → team → cost center
Alert routing confirmed — who gets paged on a rate-of-spend trigger
Shared context window marginal attribution documented for multi-agent paths
Can I retrofit attribution to an existing pipeline without a rewrite?
Yes, if your agents call LLMs through a single gateway. Add baggage injection at the gateway entry point using a middleware layer — agents do not need to change. Cost is one middleware function and a schema migration in your cost store. The only hard dependency is that all LLM calls route through the same gateway. If some agents call providers directly, those calls are invisible to the ledger until you route them through the proxy.
Does OpenTelemetry baggage survive async queue boundaries?
Not automatically. When an agent publishes a task to a queue, serialize the baggage context into the message payload and deserialize it on the consumer side. The OTel propagation API has inject and extract methods for exactly this case: call inject(carrier, context) on a propagator instance (for example W3CBaggagePropagator(), ideally composed with the trace context propagator) when publishing, and extract(carrier) when consuming. Without this, any async handoff drops the attribution thread and your cost records fragment.
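The envelope itself is simple enough to sketch without the SDK. Here the carrier is just the dict of headers a propagator's `inject` would fill on the publishing side and `extract` would read on the consuming side; the `otel` field name, `wrap_message`, and `unwrap_message` are hypothetical, and the header values are hand-written samples.

```python
import json


def wrap_message(body: dict, carrier: dict[str, str]) -> str:
    """Publisher side: ship the propagation headers inside the message."""
    return json.dumps({"otel": carrier, "body": body})


def unwrap_message(raw: str) -> tuple[dict, dict[str, str]]:
    """Consumer side: recover the body and the carrier to extract from."""
    envelope = json.loads(raw)
    # Tolerate messages published before the envelope existed.
    return envelope["body"], envelope.get("otel", {})


# In real code this dict would be filled by propagator.inject(carrier, context=ctx).
carrier = {
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "baggage": "workflow_id=wf-42,agent_role=orchestrator",
}
raw = wrap_message({"task": "summarize", "doc_id": "d-17"}, carrier)
body, received_carrier = unwrap_message(raw)
# The consumer would now call propagator.extract(carrier=received_carrier).
```

If your broker supports message headers or attributes natively, carrying the carrier there instead of in the body keeps the payload schema untouched.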
What granularity is right — per-agent or per-step?
Per-agent for chargeback and budget enforcement. Per-step for debugging and optimization. Store both — the ledger is cheap. Step-level data is what you use in post-mortems; agent-level aggregation is what finance and product teams see. If you have to pick one, pick per-agent first. Per-step attribution is useless if you cannot answer 'which agent owns this cost center.'
How do I attribute streaming responses where token counts arrive at stream end?
Buffer the attribution event. Stamp the start-of-stream with the attribution metadata and the request_id. When the stream closes and token counts are final, emit the cost event with both fields populated. Never approximate token counts mid-stream for billing or enforcement purposes — partial counts produce systematic undercharging until the stream completes, which then triggers a spike that looks like an anomaly but is just deferred billing.
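A minimal sketch of that buffering, assuming the gateway sees a stream-open and a stream-close hook per request. `StreamCostBuffer` and its method names are hypothetical.

```python
class StreamCostBuffer:
    """Holds attribution from stream start until final token counts arrive."""

    def __init__(self) -> None:
        self._pending: dict[str, dict[str, str]] = {}

    def start(self, request_id: str, attribution: dict[str, str]) -> None:
        # Stamp attribution at stream open; no token counts yet.
        self._pending[request_id] = dict(attribution)

    def finish(self, request_id: str, input_tokens: int, output_tokens: int) -> dict:
        # Emit one complete cost event once the counts are final.
        attribution = self._pending.pop(request_id)  # KeyError if never started
        return {
            **attribution,
            "request_id": request_id,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
        }

    def open_streams(self) -> list[str]:
        # Streams that started but never closed; worth monitoring on their own.
        return list(self._pending)


buffer = StreamCostBuffer()
buffer.start("req-1", {"workflow_id": "wf-42", "agent_role": "subagent-retriever"})
buffer.start("req-2", {"workflow_id": "wf-42", "agent_role": "orchestrator"})
cost_event = buffer.finish("req-1", input_tokens=1200, output_tokens=350)
```

Anything left in `open_streams()` for long is either a hung connection or a client that disconnected mid-stream, and both still cost tokens.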
What happens to attribution when an orchestrator retries a failed subagent?
Each retry is a separate LLM call with the same baggage context. The ledger records all of them. The retry count itself is worth tracking — store it as a span attribute (agent.retry_count) so the finance report can distinguish baseline spend from retry overhead. A workflow that succeeds on the fourth attempt spent 4× the expected tokens on that step; that pattern surfaces model reliability issues that pure cost monitoring misses.
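Splitting the report is straightforward once the retry count rides along with each cost record. A sketch, assuming each record carries a `retry_count` where 0 means the first attempt; `split_retry_overhead` and the sample costs are hypothetical.

```python
from decimal import Decimal


def split_retry_overhead(calls: list[dict]) -> dict[str, Decimal]:
    """Separate first-attempt spend from spend caused by retries."""
    baseline = Decimal("0")
    overhead = Decimal("0")
    for call in calls:
        cost = Decimal(str(call["cost_usd"]))
        if call.get("retry_count", 0) == 0:
            baseline += cost  # first attempt: what the step should have cost
        else:
            overhead += cost  # every later attempt is retry overhead
    return {"baseline": baseline, "retry_overhead": overhead}


# One step that succeeded on the fourth attempt, at an assumed $0.02 per call.
attempts = [
    {"task_id": "t-9", "retry_count": n, "cost_usd": 0.02} for n in range(4)
]
report = split_retry_overhead(attempts)
```

Here the step cost four times its baseline, and three quarters of that is overhead, which is the number to put in front of whoever owns model reliability.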
The ledger is not a reporting artifact that satisfies the next finance review. It is the enforcement substrate that makes autonomous agents safe to run at scale.
Every agentic system without per-agent attribution is making an implicit bet: nothing will go wrong before the monthly bill arrives. That bet pays off for the first few months. It stops paying off when the pipeline scales, context windows grow with each handoff, and the subagent topology picks up three more agents for a feature that shipped under deadline pressure.
Build the ledger before you need the post-mortem. The implementation is a middleware layer, a few hundred lines of Python, and an OTel baggage injection at orchestrator init. The alternative is a post-mortem with a section called 'estimated cost impact' and a range that spans an order of magnitude.
Platform teams that wire attribution into the enforcement plane stop writing those sections. Teams that treat cost attribution as observability-only keep writing them.
- [1] Gartner. Agentic models require 5-30x more tokens per task than standard generative AI chatbots (gartner.com)
- [2] AugmentCode. Multi-Agent Cost Compounding: Why 3 Agents Cost 10x (augmentcode.com)
- [3] Mavvrik AI. Average enterprise AI budget grew from $1.2M in 2024 to $7M in 2026 (mavvrik.ai)
- [4] Navya AI. Token prices fell 280x over two years while total enterprise AI spend rose 320% (navyaai.com)
- [5] Langfuse. Langfuse token and cost tracking documentation (langfuse.com)
- [6] LiteLLM. LiteLLM agent iteration budgets documentation (docs.litellm.ai)
- [7] Uptrace. OpenTelemetry gen_ai semantic conventions for AI systems (uptrace.dev)
- [8] Braintrust. Best tools for tracking LLM costs in production (2026) (braintrust.dev)
- [9] TrueFoundry. Agentic Token Explosion: How to Attribute, Budget, and Control LLM Costs (truefoundry.com)