AI Agent Tracing: Debug Multi-Agent LLM Workflows in Production

What agent tracing captures, how it differs from regular APM tracing, and how to instrument LangChain, LangGraph, CrewAI, and the Vercel AI SDK in one line.

An LLM agent is a workflow where a language model decides what to do next. That decision can be a tool call, a sub-agent invocation, another LLM call with a different prompt, or a return value to the caller. Production agents usually combine all four, branch on intermediate results, retry on failure, and run sub-tasks in parallel. The whole flow is non-deterministic, so re-running a failed request does not always reproduce the bug.

Agent tracing captures this entire flow as a span tree. Each LLM call, tool invocation, and sub-agent run becomes a span with parent/child links, exact timing, model variant, token counts, cost, and inputs/outputs. When something goes wrong (the bill triples, p99 latency doubles, the eval score collapses), the trace shows which step caused it.
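To make the span tree concrete, here is a minimal sketch of a span record and of linking spans into a tree by parent ID. It is not the Spanlens data model: the Span class, its field names, build_tree, render, and the sample values are all assumptions made for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """One unit of work in a trace. All field names are illustrative."""
    span_id: str
    name: str
    kind: str                         # "root", "step", "llm", or "tool"
    start_ms: float
    end_ms: float
    parent_id: Optional[str] = None   # None marks the trace root
    model: Optional[str] = None       # set on LLM spans only
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0
    children: list[Span] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def build_tree(spans: list[Span]) -> Span:
    """Attach each span to its parent and return the single root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            if root is not None:
                raise ValueError("trace has more than one root span")
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def render(span: Span, depth: int = 0) -> list[str]:
    """Return an indented text outline of the tree, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.name} [{span.kind}] {span.duration_ms:.0f}ms"]
    for child in sorted(span.children, key=lambda c: c.start_ms):
        lines.extend(render(child, depth + 1))
    return lines


# Hypothetical spans for one request; every value is made up.
spans = [
    Span("1", "handle_request", "root", 0, 1800),
    Span("2", "plan", "step", 0, 700, parent_id="1"),
    Span("3", "chat_completion", "llm", 10, 690, parent_id="2",
         model="example-model", input_tokens=900, output_tokens=120, cost_usd=0.004),
    Span("4", "act", "step", 700, 1800, parent_id="1"),
    Span("5", "search_docs", "tool", 710, 1500, parent_id="4"),
]
print("\n".join(render(build_tree(spans))))
```

Storing only parent_id on each span and linking afterwards means spans can arrive out of order, which is common when sub-tasks run in parallel.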

Anatomy of an agent trace

A typical agent trace has four layers. Spanlens renders all four in a single waterfall view with the critical path highlighted.

Layer 1: Trace root

User-facing task

One span per user request. Holds end-to-end latency, total cost, and the trace ID. Where most dashboards start.

Layer 2: Agent steps

Classify, retrieve, plan, summarize

One span per logical step. Includes the agent state at entry and exit. For LangGraph, one span per node. For CrewAI, one span per crew task.

Layer 3: LLM calls

Provider request and response

One span per upstream call. Captures model, tokens, cost, latency, streaming details, and tool-use arguments. The most common source of cost and latency surprises.

Layer 4: Tool calls

External fetches, DB queries, code execution

One span per tool invocation triggered by the LLM. Captures the arguments the LLM produced and the return value the tool sent back. Often where bugs hide because the LLM looks reasonable but the tool output was wrong.
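As a rough illustration of how the four layers nest, the sketch below rolls cost up from leaf spans to the trace root and counts spans per layer. The nested-dict shape, the kind labels, and the helper names total_cost and spans_per_layer are assumptions for this example, not a Spanlens schema.

```python
from collections import Counter

# A hypothetical trace: root -> agent steps -> LLM and tool calls.
trace = {
    "name": "answer_question", "kind": "root", "children": [
        {"name": "classify", "kind": "step", "children": [
            {"name": "llm_classify", "kind": "llm", "cost_usd": 0.001},
        ]},
        {"name": "retrieve", "kind": "step", "children": [
            {"name": "vector_search", "kind": "tool", "cost_usd": 0.0},
        ]},
        {"name": "summarize", "kind": "step", "children": [
            {"name": "llm_summarize", "kind": "llm", "cost_usd": 0.012},
        ]},
    ],
}


def total_cost(span):
    """Cost of a span plus the cost of everything beneath it."""
    own = span.get("cost_usd", 0.0)
    return own + sum(total_cost(child) for child in span.get("children", []))


def spans_per_layer(span):
    """Count spans of each kind in the subtree rooted at this span."""
    counts = Counter({span["kind"]: 1})
    for child in span.get("children", []):
        counts += spans_per_layer(child)
    return counts


print(f"total cost: ${total_cost(trace):.3f}")
print(dict(spans_per_layer(trace)))
```

The same recursion works for tokens or latency totals: the root span's headline numbers are aggregates of the layers below it.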

Critical path: the only spans that matter for latency

For an agent that runs four steps in parallel and then one step sequentially, total wall-clock time is the duration of the slowest of the four parallel steps plus the sequential one. Speeding up any of the three faster parallel steps has zero effect on total latency. The critical path identifies which spans actually drive total time.
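The four-parallel-plus-one example can be worked through in a few lines. This is a sketch of one way to compute a critical path when the dependencies between steps are known, not a description of how Spanlens computes it; the step names, durations, and the critical_path helper are hypothetical.

```python
# Hypothetical step durations in milliseconds.
durations = {
    "fetch_a": 300,
    "fetch_b": 1200,
    "fetch_c": 450,
    "fetch_d": 800,
    "summarize": 900,
}

# Which steps must finish before each step can start.
depends_on = {
    "fetch_a": [],
    "fetch_b": [],
    "fetch_c": [],
    "fetch_d": [],
    "summarize": ["fetch_a", "fetch_b", "fetch_c", "fetch_d"],
}


def critical_path(durations, depends_on):
    """Return (total_ms, path) for the longest dependency chain."""
    best = {}  # step -> (earliest finish time, predecessor on the longest chain)

    def finish(step):
        if step not in best:
            # A step starts when its slowest dependency finishes.
            prev = max(depends_on.get(step, []), key=finish, default=None)
            start = finish(prev) if prev is not None else 0
            best[step] = (start + durations[step], prev)
        return best[step][0]

    last = max(durations, key=finish)
    path, node = [], last
    while node is not None:
        path.append(node)
        node = best[node][1]
    return best[last][0], path[::-1]


total_ms, path = critical_path(durations, depends_on)
print(total_ms, path)

# Speeding up a step that is off the critical path changes nothing.
print(critical_path(dict(durations, fetch_a=100), depends_on)[0])

# Speeding up the slowest parallel step shortens the trace and moves the path.
print(critical_path(dict(durations, fetch_b=600), depends_on))
```

With these numbers the path is fetch_b followed by summarize. Once fetch_b drops below fetch_d, the critical path shifts to fetch_d, which is why the path has to be recomputed for every trace rather than assumed.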

Spanlens computes the critical path automatically on every trace. The view colors critical-path spans differently and lists them at the top of the trace detail. Most other tools render the waterfall but leave the critical-path calculation as a manual exercise.

Framework integrations

LangGraph

Native integration captures node executions and state transitions.

LlamaIndex

Drop-in handler instruments retrieval, synthesis, and query engines.

Vercel AI SDK

streamText and generateText calls become spans with tool details.

MCP (Model Context Protocol)

Capture MCP tool servers and the LLM that called them in one trace.

OpenAI Assistants

Threads, runs, and steps render as a parent/child span tree.

Anthropic + tool use

Multi-turn tool flows captured with per-tool spans.

Debugging checklist

When a production agent misbehaves, work the trace top-down.

1. Start at the trace root. Is total latency in the expected range? If yes, the bug is functional, not performance.
2. Look at the critical path. Which spans contributed to the bulk of wall-clock time? Are any of them retries?
3. For the slowest LLM span, check the model variant and prompt version. Did either change recently?
4. For a tool span with a wrong return value, capture both the LLM-generated arguments and the tool output. Was the LLM call reasonable but the tool wrong, or was the LLM hallucinating arguments?
5. For a workflow that took the wrong branch, look at the LLM call that made the routing decision. What was the input state? Add this case to your eval dataset.
6. For a cost spike, group spans by model and check whether one prompt version triggered a long-context retry.
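The cost-spike check at the end of the list can be sketched over exported span data. The example assumes each LLM span is available as a dict with model, prompt_version, input_tokens, cost_usd, and attempt fields; those field names, the sample values, and the helpers cost_by and retries are hypothetical, so adapt them to whatever your tracing tool exports.

```python
from collections import defaultdict

# Hypothetical LLM spans exported from a set of traces.
llm_spans = [
    {"model": "small-model", "prompt_version": "v7",
     "input_tokens": 1500, "cost_usd": 0.002, "attempt": 1},
    {"model": "small-model", "prompt_version": "v8",
     "input_tokens": 1600, "cost_usd": 0.002, "attempt": 1},
    {"model": "large-context-model", "prompt_version": "v8",
     "input_tokens": 90000, "cost_usd": 0.27, "attempt": 2},
    {"model": "large-context-model", "prompt_version": "v8",
     "input_tokens": 88000, "cost_usd": 0.26, "attempt": 2},
]


def cost_by(spans, *keys):
    """Sum cost per combination of the given keys, most expensive first."""
    totals = defaultdict(float)
    for span in spans:
        totals[tuple(span[k] for k in keys)] += span["cost_usd"]
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


def retries(spans):
    """Spans that were not the first attempt at their call."""
    return [s for s in spans if s.get("attempt", 1) > 1]


for group, cost in cost_by(llm_spans, "model", "prompt_version"):
    print(group, f"${cost:.3f}")

print("retried spans:", len(retries(llm_spans)))
```

In this made-up data, nearly all of the spend sits in retries on the large-context model under prompt version v8, which is the pattern step 6 tells you to look for.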

FAQ

What is agent tracing?

Agent tracing captures every step of a multi-step LLM workflow as a hierarchical span tree. Each LLM call, tool invocation, and sub-agent run becomes a span with parent/child links. Tracing lets you see which step took the most time, which step called which tool, and where the workflow diverged from the expected path.

How is agent tracing different from regular distributed tracing?

Regular distributed tracing tracks HTTP and database spans. Agent tracing adds LLM-specific attributes (model, tokens, cost, prompt version), captures tool-use arguments and results inline, and computes the critical path through non-deterministic flows where retries and parallel branches are common. OpenTelemetry semantic conventions for LLMs are still evolving, so agent-tracing tools usually layer their own attributes on top.

How do I trace a LangChain agent?

Three options. Spanlens drop-in: import the @spanlens/sdk LangChain callback handler and pass it to the executor — every chain, tool, and LLM call becomes a span. Proxy: route LangChain LLM calls through the Spanlens proxy URL and the trace is reconstructed server-side from request headers. OpenTelemetry: install the OTel instrumentation package and point the exporter at any OTLP/HTTP endpoint.

What is the critical path in agent tracing?

The critical path is the longest dependency chain through a trace: the actual bottleneck, not just the longest single span. For a 12-step agent where steps run in parallel, the critical path is the chain of steps that determines total wall-clock time. Optimizing a span that is not on the critical path has zero effect on total latency. Spanlens highlights the critical path automatically; most other tools require manual analysis.

Can I trace agents built with CrewAI, LangGraph, or AutoGen?

Yes. LangGraph has a native Spanlens integration that captures node executions and state transitions. CrewAI and AutoGen work through the proxy or the SDK callback pattern. Multi-agent frameworks generally produce traces with one root span per agent task and child spans per LLM call.

How much overhead does agent tracing add?

For Spanlens, p99 ingestion overhead is under 3ms because logging happens asynchronously in a worker after the LLM response has already been streamed to the client. Tracing does not sit on the request path. Span emission is fire-and-forget with a fallback queue if the ingest endpoint is briefly unreachable.
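For readers curious what fire-and-forget emission with a fallback queue can look like, here is a generic sketch using a background thread. It is not the Spanlens SDK: the SpanEmitter class, its parameters, and the flaky_send transport are all hypothetical and exist only to show the pattern.

```python
import queue
import threading
from collections import deque


class SpanEmitter:
    """Fire-and-forget span emitter with a fallback queue (illustrative only)."""

    def __init__(self, send, max_pending=1000, max_fallback=10000):
        self._send = send                            # callable(span); raises on failure
        self._pending = queue.Queue(maxsize=max_pending)
        self._fallback = deque(maxlen=max_fallback)  # spans awaiting a retry
        self.dropped = 0                             # spans discarded because the buffer was full
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, span):
        """Hand a span to the worker without ever blocking the caller."""
        try:
            self._pending.put_nowait(span)
        except queue.Full:
            self.dropped += 1

    def _run(self):
        while True:
            span = self._pending.get()
            try:
                # Drain spans that failed earlier before sending the new one.
                while self._fallback:
                    self._send(self._fallback[0])
                    self._fallback.popleft()
                self._send(span)
            except Exception:
                # Keep the span and retry it when the next span arrives.
                self._fallback.append(span)
            finally:
                self._pending.task_done()

    def flush(self):
        """Block until every span handed to emit() has been attempted once."""
        self._pending.join()


sent, calls = [], {"n": 0}


def flaky_send(span):
    # Hypothetical transport: the first call fails, later calls succeed.
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("ingest endpoint unreachable")
    sent.append(span)


emitter = SpanEmitter(flaky_send)
emitter.emit({"name": "llm_call"})
emitter.emit({"name": "tool_call"})
emitter.flush()
print([s["name"] for s in sent])
```

The caller only pays for a non-blocking queue insert, so request latency does not depend on the ingest endpoint. In this sketch a span that fails is retried when the next span arrives; a production emitter would more likely retry on a timer with backoff.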
