Four agents coordinate. The trace backend shows 3 to 10 orphaned root spans, no causal thread. The model is not the failure. Context propagation is. Five gaps, the minimal code to close each, and the build order that actually ships.
Every team building multi-agent systems hits the same wall on the same day: the first production failure they cannot debug. The orchestrator dispatches a task to three workers. One worker calls an MCP tool server. Something upstream returns garbage. The final output is wrong. You open Jaeger. Five root spans. None of them connect.
This is not a skill gap. It's a structural one. Distributed tracing for multi-agent systems breaks at five specific points microservice tracing never had to handle: MCP server boundaries, async queue handoffs, missing orchestrator span hierarchy, cost attributed at the wrong level, and head-based sampling that throws away the failure traces you needed.
The OpenTelemetry GenAI semantic conventions (v1.41.0 as of May 2026) cover the LLM call cleanly — gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, tool spans. What they don't yet cover is automatic context propagation across MCP servers and message queues.[1] Single-agent traces are clean. Multi-agent traces are broken — until you close the gaps yourself.
This article maps the five gaps, the minimal code to close each, and the build order that produces a working trace tree at every phase.
- Why MCP servers surface as isolated root spans and the two lines that fix it, plus the new params._meta standard (SEP-414) for transport-agnostic propagation
- Async queue propagation: serialize context into message metadata before enqueue, extract before any span is created
- The GenAI spec's invoke_agent and invoke_workflow spans: CLIENT vs INTERNAL span kind, and why you still need the app.agent.* namespace
- Cost attribution: attribute tokens at the worker span, not the orchestrator; a 50-step task produces 250+ spans and per-worker cost falls out without post-processing
- Tail-based sampling: YAML config, memory sizing formula, and why head-based sampling produces false confidence on multi-agent workloads
Microservices propagated headers automatically. Agents do not. The default state is a disconnected tree.
Distributed tracing in a multi-agent system means preserving a single trace_id and a coherent parent-child span hierarchy across N agent processes, tool calls, message queues, and MCP servers — so every reasoning step, delegation, and invocation appears as one connected causal graph instead of orphaned fragments.[4]
Microservice tracing solved this for HTTP services: propagate a traceparent header on every request and spans self-assemble into a tree. It works because instrumented HTTP clients propagate headers by default and service boundaries are explicit synchronous calls.
Multi-agent systems break that assumption in four ways. Some calls go through MCP servers that don't receive HTTP headers automatically. Some handoffs go through message queues with no headers at all. Parallel agents running concurrently create span relationships that are DAGs, not trees — parent-child is the wrong model. And a head-based sampler decides before knowing whether the trace will matter, discarding the slow, looping, or failing traces that were the entire reason for instrumenting.
The consistent symptom: a 4-agent workflow that fails halfway produces 3–10 orphaned root spans in Jaeger or Zipkin with no way to reconstruct the causal chain.[1] Each agent did something. Nobody can name what triggered it.
Note: these are propagation gaps, not framework bugs. AG2, LangChain, and CrewAI emit OTel-compliant spans natively for LLM calls. The gaps live at the boundaries those frameworks don't own.
| Gap | Boundary Type | Default Behavior | Symptom in Backend |
|---|---|---|---|
| MCP server calls | HTTP/stdio, external process | traceparent not forwarded | Tool call as orphan root span |
| Async queue handoffs | Message metadata, cross-process | No context in payload | Worker spans as new root spans |
| Orchestrator hierarchy | In-process or cross-process | No invoke_agent span emitted | Flat span list, no DAG shape |
| Cost attribution level | Span attribute placement | Tokens on orchestrator span | Per-worker cost invisible |
| Head-based sampling | Sampler decision timing | Drops traces before completion | Error/slow traces missing from backend |
Map the boundary, name the failure mode, close the gap. None of them is optional.
The largest single source of orphaned root spans in 2026 agent systems — and the spec now has a standard fix.
When an agent invokes an MCP server to execute a tool, the calling agent's traceparent is not forwarded by default, so the server has no parent context to attach to. Every MCP server operation lands as an isolated root span unless you inject the context manually at the invocation site.[1]
Two things have changed in 2026 that make this easier but don't eliminate the gap. First, the OTel GenAI spec now defines MCP-specific span attributes: mcp.method.name, mcp.session.id, mcp.protocol.version, and gen_ai.tool.name — so MCP tool calls get proper semantic labeling when they do reach the backend.[10] Second, SEP-414 (merged into the MCP protocol spec) standardizes W3C trace context propagation via the params._meta field, locking down the traceparent, tracestate, and baggage key names so traces can correlate across SDKs and gateways.[11] But neither change makes propagation automatic. You still inject at the call site.
The fix is two lines: inject the current OTel context into a carrier dictionary, pass it as request headers. For HTTP transport, carriers become headers. For stdio transport (the common local MCP case), the carrier goes into params._meta — the protocol-standard location now that SEP-414 is merged.
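A sketch of the stdio case, assuming a standard JSON-RPC `tools/call` request. The function name `build_tool_call`, the tool name, and the fixed traceparent value are illustrative; in real code the carrier would be filled by the OTel SDK, as the comment shows.

```python
import json

W3C_KEYS = ("traceparent", "tracestate", "baggage")


def build_tool_call(request_id: int, tool: str, arguments: dict, carrier: dict) -> str:
    """Build an MCP tools/call request with trace context in params._meta."""
    # Forward only the W3C keys; anything else in the carrier is dropped.
    meta = {key: carrier[key] for key in W3C_KEYS if key in carrier}
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments, "_meta": meta},
    }
    return json.dumps(request)


# With the OTel Python SDK, the two lines inside the active span would be:
#     carrier = {}
#     opentelemetry.propagate.inject(carrier)
# A fixed value stands in here so the sketch runs without the SDK installed.
carrier = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
wire_message = build_tool_call(1, "search_docs", {"query": "refund policy"}, carrier)
```

For HTTP transport the same carrier dict is merged into the request headers instead of `params._meta`. Either way, the server side has to read the carrier back out before it creates its first span.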
Message-passing architectures lose the trace silently. The system keeps working. The tree does not.
Message queues are the second trace-severing boundary. When an orchestrator enqueues a task for a worker agent, the trace context — the current trace_id and span_id — has no automatic path to the consumer. The worker starts processing and emits its first span. That span becomes a new root span with no link to the orchestrator.
The gap is silent. No error fires. The queue sends and receives. The worker completes its task. The traces are permanently disconnected.[4]
The pattern is consistent across every queue implementation — SQS, Kafka, RabbitMQ, Redis Streams: serialize the OTel propagator state into the message metadata before enqueue, extract it before creating any spans on the consumer side. The propagator state is just a dict of strings (traceparent and optional tracestate), so it rides in any payload format.
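The pattern can be sketched with an in-memory queue standing in for the broker. The envelope layout (`metadata.trace_context`) and the function names are assumptions for illustration; brokers with native headers or message attributes can carry the same dict there instead of in the body.

```python
import json
import queue


def enqueue(q: queue.Queue, payload: dict, carrier: dict) -> None:
    """Producer side: ship the task and its trace context in one envelope."""
    # In real code, fill the carrier first: opentelemetry.propagate.inject(carrier)
    q.put(json.dumps({"metadata": {"trace_context": carrier}, "payload": payload}))


def dequeue(q: queue.Queue) -> tuple:
    """Consumer side: return (carrier, payload) before any span is created."""
    envelope = json.loads(q.get())
    carrier = envelope.get("metadata", {}).get("trace_context", {})
    # In real code, the next step would be:
    #     ctx = opentelemetry.propagate.extract(carrier)
    #     with tracer.start_as_current_span("invoke_agent worker", context=ctx): ...
    return carrier, envelope["payload"]


tasks = queue.Queue()
enqueue(
    tasks,
    {"task": "summarize"},
    {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"},
)
worker_carrier, worker_payload = dequeue(tasks)
```

The ordering on the consumer side is the part that matters: if the worker opens a span before the carrier is extracted, that span has already become a new root.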
invoke_agent and invoke_workflow are now in the GenAI semconv. CLIENT vs INTERNAL matters. app.agent.* still fills the gaps.
The OpenTelemetry GenAI semantic conventions now define four agent operation types: create_agent, invoke_agent, invoke_workflow, and execute_tool.[9] As of Semantic Conventions v1.41.0 (May 2026), these are still labeled Development — not Stable — but Datadog, Honeycomb, and New Relic already ingest them without SDK changes.[2]
The distinction that matters for span hierarchy: invoke_agent span kind is CLIENT when the agent runs in a remote process (OpenAI Assistants API, AWS Bedrock Agents, any agent-over-HTTP pattern) and INTERNAL when it runs in-process (LangChain agents, CrewAI agents, AutoGen within the same process). The span kind determines how backends render parent-child relationships and calculate latency attribution. Getting it wrong produces misleading waterfall views.
For orchestration frameworks that treat a "crew" or "workflow" as distinct from individual agents — CrewAI is the canonical example — invoke_workflow is the right span for the top-level coordinator. Underneath it, individual agent invocations get invoke_agent.
What the spec still doesn't cover: per-agent cost attribution attributes, delegation depth, and a standard gen_ai.agent.role attribute (orchestrator vs. specialist). Use app.agent.* for those. When the spec adds them — which the GenAI SIG is actively working on — the migration is find-and-replace on attribute names, not re-instrumentation.[1]
| Span Operation | Span Kind | When to Use | Example |
|---|---|---|---|
| invoke_agent | CLIENT | Agent runs in a remote process, called over HTTP/API | AWS Bedrock Agents, OpenAI Assistants API |
| invoke_agent | INTERNAL | Agent runs in-process, within the same framework | LangChain agent, CrewAI agent, AutoGen in-process |
| invoke_workflow | INTERNAL | Top-level coordinator wraps multiple agent invocations | CrewAI crew, LangGraph graph execution |
| execute_tool | CLIENT or INTERNAL | Tool execution, including MCP tool calls | MCP server call, function call, API wrapper |
| create_agent | INTERNAL | Agent lifecycle: initialization or instantiation | Agent factory, dynamic agent provisioning |
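One way to keep span kind and attributes consistent is to derive them in a single helper. This is a sketch under assumptions: `SpanSpec` and `agent_span_spec` are hypothetical names, and the `invoke_agent <name>` span-name format is an assumed convention rather than something this article specifies.

```python
from dataclasses import dataclass, field

VALID_ROLES = {"orchestrator", "worker"}


@dataclass
class SpanSpec:
    name: str
    kind: str  # "CLIENT" or "INTERNAL", matching the table above
    attributes: dict = field(default_factory=dict)


def agent_span_spec(agent_name: str, role: str, remote: bool) -> SpanSpec:
    """Describe the invoke_agent span for one agent invocation."""
    if role not in VALID_ROLES:
        raise ValueError(f"unknown agent role: {role!r}")
    return SpanSpec(
        name=f"invoke_agent {agent_name}",  # assumed naming convention
        # Remote agents (called over HTTP/API) are CLIENT; in-process agents are INTERNAL.
        kind="CLIENT" if remote else "INTERNAL",
        attributes={
            "gen_ai.operation.name": "invoke_agent",
            "app.agent.name": agent_name,  # custom namespace
            "app.agent.role": role,        # custom namespace
        },
    )


researcher = agent_span_spec("researcher", "worker", remote=False)
bedrock = agent_span_spec("billing-agent", "worker", remote=True)
```

With the OTel Python SDK, a spec like this would map onto `tracer.start_as_current_span(spec.name, kind=SpanKind[spec.kind], attributes=spec.attributes)`. Keeping the custom attributes in one place is also what makes the later rename to official attribute names a find-and-replace.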
Two design decisions that compound. Get them wrong and the rest of the stack is theater.
Cost attribution at the wrong level is the span design mistake most teams make on the first build. The instinct is to track token cost at the orchestrator span — the one that initiates the workflow. That gives you aggregate cost per request and nothing else. When a 5-agent workflow runs over budget, you need to know which worker is expensive.
The correct level is the invoke_agent span for each worker, with gen_ai.usage.input_tokens and gen_ai.usage.output_tokens recorded on the LLM call spans nested beneath it. A 50-step task with 5 agents and 10 tool calls each produces 250+ spans in a correctly instrumented trace.[6] Query by app.agent.name and per-worker cost is a single aggregation — no post-processing, no log correlation.
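The aggregation can be sketched over exported span records. The record shape (`span_id`, `parent_id`, `attributes`) is a hypothetical export format, not any particular backend's schema; in a real backend this is a group-by query.

```python
from collections import defaultdict

IN_TOKENS = "gen_ai.usage.input_tokens"
OUT_TOKENS = "gen_ai.usage.output_tokens"


def tokens_per_agent(spans: list) -> dict:
    """Sum token usage per worker by walking each LLM span up to its invoke_agent ancestor."""
    by_id = {span["span_id"]: span for span in spans}
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for span in spans:
        attrs = span["attributes"]
        if IN_TOKENS not in attrs and OUT_TOKENS not in attrs:
            continue  # not an LLM call span
        owner = by_id.get(span["parent_id"])
        while owner is not None and owner["attributes"].get("gen_ai.operation.name") != "invoke_agent":
            owner = by_id.get(owner["parent_id"])
        # Tokens with no invoke_agent ancestor are a symptom of a missing agent span.
        name = owner["attributes"].get("app.agent.name", "unknown") if owner else "unattributed"
        totals[name]["input"] += attrs.get(IN_TOKENS, 0)
        totals[name]["output"] += attrs.get(OUT_TOKENS, 0)
    return dict(totals)


def agent(span_id, parent_id, name):
    return {"span_id": span_id, "parent_id": parent_id,
            "attributes": {"gen_ai.operation.name": "invoke_agent", "app.agent.name": name}}


trace_spans = [
    agent("a", None, "orchestrator"),
    agent("b", "a", "researcher"),
    {"span_id": "c", "parent_id": "b", "attributes": {IN_TOKENS: 1200, OUT_TOKENS: 300}},
    agent("d", "a", "writer"),
    {"span_id": "e", "parent_id": "d", "attributes": {IN_TOKENS: 800, OUT_TOKENS: 500}},
    {"span_id": "f", "parent_id": "d", "attributes": {"gen_ai.tool.name": "search"}},
    {"span_id": "g", "parent_id": "f", "attributes": {IN_TOKENS: 100, OUT_TOKENS: 50}},
]
per_worker = tokens_per_agent(trace_spans)
```

In this sample the LLM call nested under the writer's tool span still rolls up to the writer, and the orchestrator gets no entry because it made no LLM calls of its own.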
Sampling is more consequential. Head-based sampling decides when the trace starts, before any outcome is known. For fast-completing requests, that's efficient. For multi-agent workflows that loop, retry, or cascade into failures, it cannot favor the traces you needed most: a 1% head-based sample keeps roughly one failure in a hundred and discards the rest. That can be worse than no sampling. It produces false confidence: the sample is almost entirely normal traces, and the rare slow or failed trace you went looking for is usually not there.
Tail-based sampling decides after the trace completes. Configure the OTel Collector's tailsamplingprocessor to retain 100% of traces with errors, traces that exceeded a latency threshold, and traces where total token cost exceeded the per-request budget. Clean, cheap traces get sampled aggressively.[13]
Memory sizing matters. The processor buffers spans until decision_wait elapses. The formula: num_traces = expected_new_traces_per_sec × decision_wait_seconds × safety_factor. At 100 traces/sec and decision_wait: 10s with a 2× safety factor, you need num_traces: 2000 and roughly 40–100 MB of buffer RAM (at 20–50 KB per trace).[12] Under-size the buffer and the processor starts evicting traces before they're sampled — silently, with no error.
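The arithmetic above, as a small helper you can rerun with your own numbers. The function name is hypothetical, and the 20–50 KB per-trace range is the article's estimate, not a measured value for your workload.

```python
import math


def tail_buffer_estimate(traces_per_sec, decision_wait_s, safety_factor=2.0, kb_per_trace=(20, 50)):
    """Return (num_traces, (low_mb, high_mb)) for the tail sampling buffer."""
    num_traces = math.ceil(traces_per_sec * decision_wait_s * safety_factor)
    low_kb, high_kb = kb_per_trace
    # Decimal megabytes: 1 MB = 1000 KB, matching the rough figures in the text.
    return num_traces, (num_traces * low_kb / 1000, num_traces * high_kb / 1000)


num_traces, (low_mb, high_mb) = tail_buffer_estimate(100, 10)
print(num_traces, low_mb, high_mb)  # 2000 40.0 100.0
```

Multi-agent traces with hundreds of spans may sit well above the per-trace range used here, so it is worth measuring real trace sizes before trusting the RAM estimate.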
| Aspect | Head-based sampling | Tail-based sampling |
|---|---|---|
| Decision point | At trace start, before any span completes | After trace completion, based on actual outcome |
| Memory | Low overhead: no buffering required | Higher: buffers spans for the decision_wait duration |
| Slow and failed traces | Dropped at the same rate as successful ones | 100% of error and latency-outlier traces retained |
| Outcome filtering | Cannot filter on trace outcome (error, cost, latency) | Configurable: keep failures and budget overruns, sample clean traces |
| Net effect | Baseline skewed toward success, false confidence on error rates | Accurate failure visibility at manageable storage cost |
3–10 orphaned root spans: without explicit propagation at MCP and queue boundaries (tianpan.co analysis, Apr 2026)
250+ spans: 50-step task, 5 agents, 10 tool calls each (Zylos Research, Apr 2026)
GenAI + MCP semconv still labeled Development — use app.agent.* for orchestration-specific attributes
Tail-based sampling retains all error and latency-outlier traces; samples clean traces at 5–10%
Each phase ships a useful trace. None of them is the final trace. Sequence matters — later phases produce nothing if Phase 1 is skipped.
Phase 1: Add propagate.inject(carrier) before every HTTP call to an external service: MCP tools, agent-to-agent HTTP, third-party APIs. For stdio MCP calls, inject into params._meta per SEP-414. Add propagate.extract(carrier) at the start of every consumer: queue workers, HTTP handlers, webhook receivers. Do not build span hierarchies until propagation works end-to-end. A span hierarchy layered on broken propagation is still a broken tree.
Phase 2: Wrap every agent's main execution in an invoke_agent or invoke_workflow span. Set gen_ai.operation.name to invoke_agent, and set span kind to CLIENT for remote agents and INTERNAL for in-process ones. Add app.agent.role (orchestrator or worker) and app.agent.name as custom attributes. For in-process agent calls, startActiveSpan (start_as_current_span in Python) handles parent-child automatically. For parallel workers, use span links, not parent-child. If Phase 1 is skipped, this phase produces nothing.
Phase 3: Confirm every LLM call span records gen_ai.usage.input_tokens and gen_ai.usage.output_tokens and is nested under the correct worker invoke_agent span, not floating under the orchestrator. If your framework emits these automatically (AG2, LangChain with OTel instrumentation), run a test trace with two workers and verify each worker span shows its own token breakdown. Set app.agent.budget_exceeded: true on any invoke_agent span that exceeds the per-agent cost threshold; this is what the tail sampler's high-cost policy filters on.
Phase 4: Configure the OTel Collector tail_sampling processor: decision_wait of 10s, num_traces sized to expected_rate × wait_seconds × 2 (safety factor). Policies: retain 100% of traces containing a span with ERROR status, retain 100% of traces exceeding the latency SLO (5s is a reasonable default for multi-agent tasks), retain 100% of traces with app.agent.budget_exceeded = true, sample clean traces at 5%. Place memory_limiter before tail_sampling in the pipeline. Under-sized buffers evict traces before sampling, silently.
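The Phase 4 policies amount to a keep-or-drop decision made once the trace is complete. The sketch below models that decision in plain Python for reasoning and testing; it is not the Collector's implementation, and the span record shape (`status`, `start_s`, `end_s`, `attributes`) is assumed for illustration.

```python
import random
from typing import Optional


def keep_trace(spans: list, latency_slo_s: float = 5.0, clean_rate: float = 0.05,
               rng: Optional[random.Random] = None) -> tuple:
    """Return (keep, reason) for a completed trace, mirroring the four policies."""
    rng = rng or random.Random()
    # Policy 1: any span with ERROR status keeps the whole trace.
    if any(span.get("status") == "ERROR" for span in spans):
        return True, "error"
    # Policy 2: trace duration, earliest start to latest end, over the SLO.
    duration = max(s["end_s"] for s in spans) - min(s["start_s"] for s in spans)
    if duration > latency_slo_s:
        return True, "latency"
    # Policy 3: any agent span flagged as over its cost budget.
    if any(s.get("attributes", {}).get("app.agent.budget_exceeded") is True for s in spans):
        return True, "budget"
    # Policy 4: everything else is clean and sampled at a low rate.
    return rng.random() < clean_rate, "probabilistic"


failed = [{"status": "ERROR", "start_s": 0.0, "end_s": 1.0}]
decision = keep_trace(failed)
```

Because the decision needs the whole trace, every span has to be buffered until it is made, which is where the decision_wait and num_traces sizing comes from.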
Production failure with orphaned spans. Three questions narrow the gap in under five minutes.
Opening Jaeger to a wall of disconnected root spans is the diagnostic equivalent of finding a stack trace with no line numbers. The gap is there — you just need to find where the context dropped.
Start with span count. A 4-agent workflow should produce one root span. If you see N > 1 root spans, you have N − 1 propagation gaps. The gaps are almost always at process boundaries. Count the boundaries in your architecture: each MCP server call, each queue consumer, each HTTP client call to another service. Check them in order.
Second: look at span attributes on the orphaned root span. A root span with app.agent.role: worker means the queue consumer extracted nothing. A root span with gen_ai.tool.name: <something> means an MCP call didn't receive the carrier header. An orphaned span with no gen_ai.* attributes means the agent framework emitted the span but the calling context was already lost upstream.
Third: check the sampling pipeline. If error traces are missing from the backend entirely (not just disconnected, but absent), head-based sampling is dropping them before they arrive. Switch to tail-based before investigating further. Debugging propagation with a 1% head-based sampler, which discards 99% of traces with failures included, is the equivalent of debugging a production outage with logs disabled.
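The first two questions can be turned into a quick script over spans exported for one request. This is a sketch: the record shape (`name`, `parent_id`, `attributes`) and the function names are hypothetical, and the classification rules are just the heuristics described above.

```python
def classify_root(span: dict) -> str:
    """Guess which boundary dropped the context, from a root span's attributes."""
    attrs = span.get("attributes", {})
    role = attrs.get("app.agent.role")
    if role == "orchestrator":
        return "expected root"
    if role == "worker":
        return "queue consumer created spans before extracting context"
    if "gen_ai.tool.name" in attrs:
        return "MCP call arrived without a carrier"
    if not any(key.startswith("gen_ai.") for key in attrs):
        return "context was lost upstream of this span"
    return "unclassified: inspect the nearest process boundary"


def diagnose(spans: list) -> dict:
    """Count root spans for one request and classify each of them."""
    roots = [span for span in spans if span.get("parent_id") is None]
    return {
        "root_spans": len(roots),
        "propagation_gaps": max(len(roots) - 1, 0),
        "findings": [(span["name"], classify_root(span)) for span in roots],
    }


report = diagnose([
    {"name": "invoke_workflow", "parent_id": None, "attributes": {"app.agent.role": "orchestrator"}},
    {"name": "invoke_agent writer", "parent_id": None, "attributes": {"app.agent.role": "worker"}},
    {"name": "execute_tool search", "parent_id": None, "attributes": {"gen_ai.tool.name": "search"}},
    {"name": "chat", "parent_id": "abc", "attributes": {"gen_ai.request.model": "m"}},
])
```

For this sample the report shows three root spans and two gaps, one at a queue consumer and one at an MCP call site. The gap count is a lower bound, since one missing inject can orphan several downstream spans.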
N root spans = N-1 propagation gaps. Identify boundaries before reading attributes.
The consumer started a new trace. Check that propagate.extract() runs before any span is created.
The MCP server received the call without a carrier. Check inject() at the call site, or params._meta for stdio.
Switch to tail-based before any other debugging. You cannot debug traces that don't exist in the backend.
In-process calls share the active context automatically. Manual inject/extract is only for process and service boundaries.
If any line is unchecked, your trace tree breaks somewhere you haven't seen yet.
Is OpenTelemetry auto-instrumentation enough for multi-agent systems?
For LLM calls, mostly yes — frameworks like AG2, CrewAI, and LangChain emit OTel-compliant spans natively or via instrumentation packages. For multi-agent coordination, no. Auto-instrumentation handles HTTP client calls for standard libraries. It doesn't know about your message queue payloads, your MCP server call sites with custom clients, or your agent delegation patterns. The split is clear: let auto-instrumentation handle LLM call spans and standard HTTP. Add manual propagation at queues, MCP boundaries, and cross-service agent calls. There's no shortcut at those boundaries.
What backend works best for multi-agent traces?
OTel-first tools — Phoenix by Arize, Langfuse, SigNoz — emit standard OTel format and export to any compatible backend. If you're not already locked into a vendor, OTel-first preserves the option to switch backends without changing agent code. Datadog and Honeycomb adopted the GenAI semantic conventions in 2025 and ingest agent spans without SDK changes. The choice is whether the platform's higher-level features — session replay, LLM-as-judge scoring, A/B prompt comparison — earn their cost, or whether raw trace visibility is sufficient.
How do I trace parallel agents without a tangled span tree?
Use OTel span links instead of parent-child relationships for genuinely parallel agents — workers running concurrently without sequential dependency. A span link connects two spans that are causally related but not in a direct hierarchy. The resulting view is a DAG, not a strict tree, which matches the actual execution structure. Wall-clock timestamps are insufficient to establish causality in parallel systems — clock skew can invert the apparent order of causally related events. Span links with explicit trace context are the only reliable mechanism for reconstructing execution order when agents run concurrently.
When will the GenAI semantic conventions cover multi-agent orchestration fully?
The OTel GenAI SIG has been actively working on orchestrator-to-worker span semantics since April 2024. As of Semantic Conventions v1.41.0 (May 2026), invoke_agent, invoke_workflow, and MCP attributes are in the spec but still labeled Development. The safe path: use existing GenAI conventions for what they cover, app.agent.* for what they don't. Watch the opentelemetry/semantic-conventions repository for GenAI SIG PRs; that's where orchestration attributes will appear before they reach the public docs.
Does the MCP spec now handle trace context propagation automatically?
Not automatically, but the mechanism is now standardized. SEP-414 (merged into the MCP protocol spec) locks down the traceparent, tracestate, and baggage key names in params._meta, so different SDKs and gateways correlate traces consistently. You still inject at the call site: the spec tells you where to put the carrier, not that it gets there for you. FastMCP's MCPTool.calltool() injects automatically when an active OTel span exists, which is the closest to 'automatic' you'll get today.
The five gaps don't require a complete observability overhaul. Two lines of Python close the MCP boundary. Four lines close the queue context. The GenAI spec gives you the span names; you just need the correct span kind and the right namespace for custom attributes. Tail-based sampling is a Collector config change.
The harder problem is knowing which gaps exist in your architecture before a production failure surfaces them. The checklist above catches them in staging, where debugging a disconnected trace tree is annoying but not urgent. Finding out in production — tracing a live failure across 4 agents, 20 steps, and 8 orphaned root spans — is the expensive version of the same lesson.
Run a test request through your system today. Count the root spans. More than one means you already know where to start.