There's a specific moment every team building multi-agent systems hits: the first production failure they can't debug. The orchestrator sent a task to three workers. One worker called an MCP tool server. Something upstream produced garbage, and the final output is wrong. You open Jaeger. You see five root spans. None of them connect.
This isn't an instrumentation skill gap. It's a structural problem. Distributed tracing for multi-agent systems breaks at five specific points that microservice tracing never had to handle: MCP server boundaries, async queue handoffs, missing orchestrator span hierarchy, cost attribution at the wrong level, and head-based sampling that discards your most important failure traces.
The OpenTelemetry GenAI semantic conventions cover the LLM call itself well — gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and tool spans. What they don't yet standardize is the agent-to-agent relationship layer. There's no gen_ai.agent.role attribute in the 2026 spec, no standard for orchestrator delegation depth, and no automatic context propagation across MCP servers.[1] You get solid single-agent traces and broken multi-agent traces — unless you close the gaps yourself.
This article covers exactly those five gaps and the minimal code required to close each one.
Key Takeaways
- MCP servers don't auto-propagate the W3C traceparent header — each server call appears as an isolated root span unless you inject the header manually at the call site.
- Async queue boundaries sever trace context silently unless you serialize the OTel propagator state into the message metadata before enqueue.
- The GenAI semantic conventions don't yet cover orchestrator-to-worker span attributes. Use the app.agent.* namespace to avoid conflicts with the eventual standard.
- Assign token cost attribution at the worker span level, not the orchestrator span level — so you can see per-agent cost, not just aggregate cost per request.
- Head-based sampling drops failures before they're visible. For multi-agent systems, tail-based or adaptive sampling is the only option that captures the traces you actually need.
Why Multi-Agent Tracing Breaks Differently
The structural gap between microservice tracing and agent tracing
Distributed tracing in a multi-agent system means preserving a single trace_id and coherent parent-child span hierarchy across N agent processes, tool calls, message queues, and MCP servers — so every reasoning step, delegation, and invocation appears as a connected causal graph rather than isolated fragments.[4]
Microservice tracing solves the same problem for HTTP services: propagate a traceparent header through every request, and the spans self-assemble into a tree. It works because HTTP clients propagate headers by default when instrumented, and service boundaries are explicit synchronous calls.
Multi-agent systems break this assumption in four ways. Some calls go through MCP servers that don't receive HTTP headers automatically. Some handoffs go through message queues where there are no headers at all. Agent frameworks may not emit the invoke_agent span structure the GenAI conventions specify. And when the system is under load, a head-based sampler will make its decision before it knows whether the trace will be interesting — discarding the slow, looping, or failing traces you most need.
The result teams consistently report: a 4-agent workflow that fails halfway produces 3–10 orphaned root spans in Jaeger or Zipkin with no way to reconstruct the causal chain.[1] Each agent did something. Nobody knows what triggered it.
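The context being lost at each of these boundaries is small: a single W3C traceparent entry (plus optional tracestate) that propagate.inject writes into a plain dict. A minimal sketch of what the carrier holds (the span name is illustrative):

traceparent_demo.py

from opentelemetry import trace, propagate
from opentelemetry.sdk.trace import TracerProvider

trace.set_tracer_provider(TracerProvider())  # minimal SDK setup so spans are recorded
tracer = trace.get_tracer("app.agent")

with tracer.start_as_current_span("invoke_agent orchestrator"):
    carrier = {}
    propagate.inject(carrier)  # the default global propagator emits W3C Trace Context
    # carrier now looks like:
    # {"traceparent": "00-<32-hex trace_id>-<16-hex parent span_id>-01"}
    # Any process that runs propagate.extract(carrier) before starting its own
    # spans will attach them to this same trace_id.
    print(carrier)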
The 5 Propagation Gaps
Where trace context breaks by default in multi-agent pipelines
Gap 1: MCP Server Boundaries
The most common source of orphaned root spans in 2026 agent systems
Model Context Protocol (MCP) servers are the clearest example of this propagation failure. When an agent invokes an MCP server to execute a tool, the protocol doesn't automatically carry the calling agent's traceparent header, so each MCP server operation appears as an isolated root span unless you manually inject the header at the invocation site.[1]
This is the symptom from the opening scenario: a query that should produce one connected trace instead yields several disconnected root spans, with no way to join them in the dashboard. The orchestrator span shows the call was made. The MCP server span shows the tool ran. There's no parent-child relationship between them.
The fix requires two steps: inject the current OTel context into a carrier dictionary inside the tool span, then pass that dictionary as request headers on the HTTP call to the MCP server. One function call, one extra argument.
mcp_tracing.py

from opentelemetry import trace, propagate
import httpx

MCP_SERVER_URL = "http://localhost:8080"  # placeholder; point at your MCP server

tracer = trace.get_tracer("app.agent")

def call_mcp_tool(tool_name: str, args: dict) -> dict:
    with tracer.start_as_current_span(f"execute_tool {tool_name}") as span:
        span.set_attribute("gen_ai.operation.name", "execute_tool")
        span.set_attribute("tool.name", tool_name)

        # Inject inside the span so the traceparent points at this execute_tool
        # span and the MCP server's spans become its children.
        carrier = {}
        propagate.inject(carrier)  # serialize current traceparent + tracestate

        response = httpx.post(
            f"{MCP_SERVER_URL}/tools/{tool_name}",
            json=args,
            headers=carrier,  # propagate context across the MCP boundary
        )
        span.set_attribute("tool.http.status_code", response.status_code)
        return response.json()

Gap 2: Async Queue Context Loss
Why message-passing architectures sever trace context silently
Message queues are the second most common trace-severing boundary. When an orchestrator enqueues a task for a worker agent, the trace context — specifically the current trace_id and span_id — has no automatic path to the consumer side. The worker starts processing and emits its first span, which becomes a new root span with no connection to the orchestrator.
The gap is silent. No error is thrown. The queue sends and receives messages correctly. The worker completes its task. The traces are just permanently disconnected.[4]
The pattern for closing this gap is consistent across all queue implementations: serialize the OTel propagator state into the message metadata before enqueue, and extract it before creating any spans on the consumer side.
queue_tracing.py

import json

from opentelemetry import propagate, trace

tracer = trace.get_tracer("app.agent")

# PRODUCER SIDE: serialize context into the message
def enqueue_task(queue, task: dict) -> None:
    carrier = {}
    propagate.inject(carrier)  # capture current trace context
    message = {
        "task": task,
        "_trace_context": carrier,  # attach alongside the payload
    }
    queue.send_message(json.dumps(message))

# CONSUMER SIDE: restore context before creating any spans
def process_message(message_body: str) -> None:
    message = json.loads(message_body)
    carrier = message.get("_trace_context", {})
    ctx = propagate.extract(carrier)  # restore parent span context

    with tracer.start_as_current_span(
        "invoke_agent worker",
        context=ctx,  # this span becomes a child of the orchestrator span
    ) as span:
        span.set_attribute("gen_ai.operation.name", "invoke_agent")
        span.set_attribute("app.agent.role", "worker")
        process_task(message["task"])

Gap 3: Orchestrator Span Hierarchy
How to structure agent delegation spans before the standard arrives
The OpenTelemetry GenAI semantic conventions define invoke_agent spans for agent execution and execute_tool spans for tool invocations. As of early 2026, the conventions cover individual LLM calls, tool invocations, and session-level metrics with reasonable consistency. Major vendors — Datadog, Honeycomb, New Relic — already support them natively.[2]
What the spec doesn't yet cover is the relationship between orchestrator and worker agents. There's no standard gen_ai.agent.role attribute (orchestrator vs. specialist), no standard way to represent delegation depth, and no standard for cost attribution across a full agent DAG.[1] The OTel GenAI SIG is working on this, but multi-agent orchestration spans are experimental heading into 2026. If you start emitting spans with gen_ai.agent.role attributes today, you'll likely need to rename them when the standard stabilizes.
The practical approach: use the app.agent.* namespace for all orchestrator-specific attributes. This separates your custom attributes from the gen_ai.* namespace and won't conflict with whatever attribute names the spec eventually standardizes. The invoke_agent span name itself follows the GenAI convention and should stay as-is.[2]
For in-process agent calls, the startActiveSpan API (start_as_current_span in the Python SDK) handles parent-child relationships automatically — any spans created inside the callback become children of the current span without explicit parent references.[3] The propagation problems described in Gaps 1 and 2 only apply at process and service boundaries.
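A minimal sketch of that structure in Python — the agent names, the delegation-depth attribute, and the run_worker helper are illustrative, not from any framework:

orchestrator_spans.py

from opentelemetry import trace

tracer = trace.get_tracer("app.agent")

def run_worker(name: str, task: str) -> str:
    # Started while the orchestrator span is active, so the parent-child
    # link is automatic for in-process delegation.
    with tracer.start_as_current_span(f"invoke_agent {name}") as span:
        span.set_attribute("gen_ai.operation.name", "invoke_agent")  # GenAI convention
        span.set_attribute("app.agent.role", "worker")               # custom namespace
        span.set_attribute("app.agent.name", name)
        span.set_attribute("app.agent.delegation_depth", 1)          # illustrative attribute
        return f"{name} result for: {task}"

def run_orchestrator(task: str) -> str:
    with tracer.start_as_current_span("invoke_agent orchestrator") as span:
        span.set_attribute("gen_ai.operation.name", "invoke_agent")
        span.set_attribute("app.agent.role", "orchestrator")
        span.set_attribute("app.agent.name", "planner")
        return run_worker("researcher", task)

If the GenAI SIG later standardizes role or depth attributes, only the app.agent.* lines change; the span names and gen_ai.operation.name stay valid.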
Gaps 4 & 5: Cost Attribution and Sampling
Two instrumentation decisions that compound when you get them wrong
Cost attribution is the span design decision most teams get wrong first. The natural instinct is to track token costs at the orchestrator span level — the span that initiates the whole workflow. But that gives you aggregate cost per request, not cost per agent call. When a 5-agent workflow runs over budget, you need to know which worker is expensive, not just that the request was.
The correct level is the invoke_agent span for each worker, where you record gen_ai.usage.input_tokens and gen_ai.usage.output_tokens on the LLM call spans nested beneath it. In a 50-step task with 5 agents and 10 tool calls each, a correctly instrumented trace produces 250+ spans.[6] Query them by app.agent.name and you get per-worker cost breakdown without post-processing.
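A sketch of that nesting, assuming the OpenAI Python SDK (which exposes response.usage.prompt_tokens and response.usage.completion_tokens); swap in whatever usage fields your provider or framework returns:

worker_cost.py

from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer("app.agent")
client = OpenAI()

def worker_llm_call(worker_name: str, prompt: str) -> str:
    # The invoke_agent span carries the worker's identity...
    with tracer.start_as_current_span(f"invoke_agent {worker_name}") as agent_span:
        agent_span.set_attribute("gen_ai.operation.name", "invoke_agent")
        agent_span.set_attribute("app.agent.role", "worker")
        agent_span.set_attribute("app.agent.name", worker_name)

        # ...and the nested LLM call span carries the token usage, so a query
        # grouped by app.agent.name rolls cost up per worker.
        with tracer.start_as_current_span("chat gpt-4o") as llm_span:
            response = client.chat.completions.create(
                model="gpt-4o",
                messages=[{"role": "user", "content": prompt}],
            )
            llm_span.set_attribute("gen_ai.request.model", "gpt-4o")
            llm_span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
            llm_span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
            return response.choices[0].message.content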
Sampling is the more consequential decision. Head-based sampling makes its choice at trace ingestion — before the trace has completed. For fast-completing requests, this is efficient. For multi-agent workflows that loop, retry, or cascade into failures, it discards failing traces at the same rate as clean ones, so at a 1% sample rate you will almost never keep the traces you need to debug. That's worse than no sampling — it produces false confidence that everything is running cleanly.[7]
Tail-based sampling decides after the trace completes. Configure it to retain 100% of traces with errors, traces that exceeded a latency threshold, or traces where total token cost exceeded your per-request budget. Traces that completed cleanly and cheaply get sampled aggressively. The ones that matter get kept.
Adaptive sampling adjusts the rate based on observed error rates and latency anomalies — recommended for production systems where traffic and failure modes vary.[6] More complex to configure than tail-based, but it substantially reduces storage costs at scale. Start with tail-based and migrate to adaptive once you have a clear picture of your failure distribution.
Figures behind these recommendations: orphaned root spans appear without explicit propagation at MCP and queue boundaries (tianpan.co analysis, Apr 2026); a 50-step task with 5 agents and 10 tool calls each produces 250+ spans (Zylos Research, Apr 2026); a single failed agent loop can burn 50K–100K tokens at GPT-4o rates (TrackAI production analysis); tail-based sampling should retain all error and latency-outlier traces and sample clean traces at 5–10%.
Head-based sampling
- Decision at trace ingestion — before the trace completes
- Fast, low memory overhead
- Drops slow and failed traces at the same rate as successful ones
- Cannot filter on trace outcome (error, cost, latency)
- Creates a misleading baseline: only 'normal' traces are sampled

Tail-based sampling
- Decision after the trace completes — based on outcome
- Higher memory overhead (buffers spans until the decision)
- Retains 100% of error and latency-outlier traces
- Configurable: keep failures and budget overruns, sample clean traces
- Accurate failure visibility at manageable storage cost
Build Order: The 4-Phase Path
Sequenced to deliver working traces at each phase, not just at completion
Phase 1: W3C Context Propagation at Every Service Boundary (Days 1–3)
Add propagate.inject(carrier) before every HTTP call to an external service — MCP tools, agent-to-agent HTTP calls, third-party APIs — and propagate.extract(carrier) at the start of every consumer: queue workers, HTTP handlers, webhook receivers. Don't build span hierarchies until propagation is working end-to-end. A clean tree built on broken propagation produces broken trees.

Phase 2: Agent Span Instrumentation with app.agent.* Attributes (Days 3–7)
Wrap each agent's main execution in an invoke_agent span. Set gen_ai.operation.name: invoke_agent, app.agent.role (orchestrator or worker), and app.agent.name. For in-process agent calls, the startActiveSpan API handles parent-child relationships automatically. For cross-process calls, Phase 1 propagation carries the context.

Phase 3: Per-Worker Cost Attribution (Days 7–10)
Confirm every LLM call span records gen_ai.usage.input_tokens and gen_ai.usage.output_tokens nested under the correct worker span, not floating under the orchestrator. If your framework emits these automatically (AG2, LangChain with OTel instrumentation), run a test trace with two workers and verify each worker span shows its own token breakdown.

Phase 4: Tail-Based Sampling Configuration (Days 10–14)
Configure the OTel Collector tail-based sampling processor with rules: retain 100% of traces with span errors, retain 100% of traces that exceeded your latency SLO, retain 100% of traces where aggregate token cost exceeded your per-request budget, and sample clean fast traces at 5–10%. Size the span buffer based on your P99 trace duration. A starting-point config sketch follows.
Production Trace Readiness Checklist
What to verify before claiming multi-agent observability
- W3C traceparent header injected manually at every MCP server invocation
- W3C traceparent header injected at every inter-agent HTTP call not handled by auto-instrumentation
- Trace context serialized into message metadata before every queue enqueue
- Trace context extracted from message metadata before creating any spans on the consumer side
- Every agent execution wrapped in an invoke_agent span with app.agent.role and app.agent.name attributes
- gen_ai.usage.input_tokens and gen_ai.usage.output_tokens recorded on LLM call spans nested under the correct worker span
- A test request with 3+ agents produces a single connected trace tree in your backend, not multiple root spans
- Tail-based sampling configured to retain 100% of error traces and latency outliers
- app.agent.* namespace used for custom attributes, not gen_ai.* (reserved for the semantic conventions spec)
- Alert configured for orphaned root spans — a new root span mid-workflow indicates a propagation gap
Is OpenTelemetry auto-instrumentation enough for multi-agent systems?
For LLM calls, mostly yes — frameworks like AG2, CrewAI, and LangChain emit OTel-compliant spans natively or via instrumentation packages. For multi-agent coordination, no. Auto-instrumentation handles HTTP client calls for standard libraries but doesn't know about your message queue payloads, MCP server call sites, or agent delegation patterns. The split is clear: let auto-instrumentation handle LLM call spans and standard HTTP, and add manual propagation at queues, MCP boundaries, and cross-service agent calls.
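For the auto-instrumented half, a sketch assuming the opentelemetry-instrumentation-httpx package: once the instrumentor is enabled, inter-agent httpx calls carry traceparent automatically, while queues and MCP call sites (which typically don't go through an instrumented HTTP client) still need the manual carrier shown in Gaps 1 and 2.

auto_http.py

from opentelemetry.instrumentation.httpx import HTTPXClientInstrumentor

# After this call, every httpx request emits a client span and injects the
# W3C traceparent header automatically — no per-call carrier needed.
HTTPXClientInstrumentor().instrument()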
What backend works best for multi-agent traces?
OpenTelemetry-first tools — Phoenix by Arize, Langfuse, SigNoz — emit standard OTel format and export to any compatible backend. If you're not already locked into a vendor, OTel-first gives you flexibility: instrument once, switch backends without changing agent code. Datadog and Honeycomb adopted the GenAI semantic conventions in 2025 and ingest agent spans without SDK changes. The choice should depend on whether you need the platform's higher-level features (session replay, LLM-as-judge scoring, A/B prompt comparison) or whether raw trace visibility is sufficient.
How do I trace parallel agents without creating a tangled span tree?
For genuinely parallel agents — workers that run concurrently without sequential dependency — use OTel span links rather than parent-child relationships. A span link connects two spans that are causally related but not in a direct parent-child hierarchy. The resulting trace view shows a DAG rather than a strict tree, which accurately represents the execution structure. For sequential agent chains where each step depends on the previous output, parent-child relationships are correct and produce cleaner waterfall views.
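A minimal sketch of the span-link approach, assuming fan-out workers that start their own traces (for example because they run in separate tasks or processes) and link back to the orchestrator's span rather than parenting under it; the agent names and helpers are illustrative:

parallel_links.py

from opentelemetry import context as otel_context
from opentelemetry import trace

tracer = trace.get_tracer("app.agent")

def run_parallel_worker(name: str, fan_out_ctx: trace.SpanContext, task: str) -> str:
    with tracer.start_as_current_span(
        f"invoke_agent {name}",
        context=otel_context.Context(),   # fresh context: not a child of the caller
        links=[trace.Link(fan_out_ctx)],  # but causally linked to the fan-out span
    ) as span:
        span.set_attribute("app.agent.name", name)
        return f"{name} finished: {task}"

def fan_out(task: str) -> list[str]:
    with tracer.start_as_current_span("invoke_agent orchestrator") as span:
        ctx = span.get_span_context()
        # Sequential here only to keep the sketch short; in practice these
        # workers would run concurrently (threads, asyncio tasks, a queue).
        return [run_parallel_worker(w, ctx, task) for w in ("search", "summarize")]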
When will the GenAI semantic conventions cover multi-agent orchestration?
The OTel GenAI SIG (launched April 2024) is actively working on orchestrator-to-worker span semantics. There's no public timeline for stabilization. The safest path: use existing conventions for what they cover and use your own namespaced attributes for what they don't. Monitoring the opentelemetry/semantic-conventions GitHub repository for GenAI SIG activity gives you advance warning when orchestration attributes are being drafted.
The five propagation gaps don't require a complete observability overhaul. Three lines of Python close the MCP boundary gap. Four lines close the queue context gap. Namespace discipline costs nothing. Tail-based sampling is a Collector config change.
What's harder is knowing which gaps exist in your architecture before a production failure surfaces them. The checklist above is designed to catch them in staging, where debugging a disconnected trace tree is frustrating but not urgent. Finding out in production — while tracing a live failure across 4 agents, 20 steps, and 5 orphaned root spans — is the expensive version of the same lesson.
Run a test request through your system today. Count the root spans in your trace backend. If you see more than one, you know exactly where to start.
- [1] Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap (tianpan.co, April 2026)
- [2] OpenTelemetry for AI Agents: Observability, Tracing, and the GenAI Semantic Conventions (Zylos Research, zylos.ai, February 2026)
- [3] AG2 OpenTelemetry Tracing: Full Observability for Multi-Agent Systems (AG2, docs.ag2.ai, February 2026)
- [4] Distributed Tracing for Agents: Tracing Multi-Agent Systems (Agent Patterns, agentpatterns.tech, March 2026)
- [5] AgentTrace: Causal Graph Tracing for Root Cause Analysis in Deployed Multi-Agent Systems (arXiv:2603.14688, arxiv.org)
- [6] AI Agent Observability: Tracing, Debugging, and the OpenTelemetry Standard (Zylos Research, zylos.ai, April 2026)
- [7] How to Trace and Debug Multi-Agent Systems: A Production Guide (Future AGI, futureagi.com, March 2026)
- [8] Production Observability for Multi-Agent AI with KAOS + OTel + SigNoz (HackerNoon, hackernoon.com, March 2026)