LlamaIndex integration

LlamaIndex funnels every framework event through a single BaseCallbackHandler contract, with a CBEventType discriminator identifying each event. SpanlensCallbackHandler subclasses that base, maps each event to a Spanlens span type, and threads parent/child relationships through the per-event UUIDs that LlamaIndex assigns. The trace tree on /traces therefore mirrors your RAG topology: QUERY at the root, with RETRIEVE, SYNTHESIZE, LLM, and FUNCTION_CALL nested underneath.

Install

pip install "spanlens[llama-index]"
# pulls in llama-index-core>=0.10.0 alongside the SDK

Minimal setup

import os
from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from spanlens import SpanlensClient
from spanlens.integrations.llama_index import SpanlensCallbackHandler

client = SpanlensClient(api_key=os.environ["SPANLENS_API_KEY"])
handler = SpanlensCallbackHandler(client=client)

# Register globally — every query engine / agent created after this
# will route callbacks through the handler.
Settings.callback_manager.add_handler(handler)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()

response = query_engine.query("What is RAG?")

The handler is safe to share across concurrent queries. LlamaIndex tags every event with a unique UUID, so a single handler instance per process can track parallel runs without mixing up their spans.
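To see why per-event IDs make a shared handler workable, here is a minimal sketch of threading parent/child relationships through event IDs. It is not the Spanlens implementation: SpanNode, SpanTreeBuilder, and render are hypothetical names, and the on_event_start signature is simplified (the LlamaIndex hook also receives the event type and payload).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SpanNode:
    """Hypothetical stand-in for a span; not the Spanlens span class."""
    event_id: str
    name: str
    parent_id: Optional[str] = None
    children: List["SpanNode"] = field(default_factory=list)


class SpanTreeBuilder:
    """Keeps every span in a dict keyed by event ID, so events from
    interleaved queries attach to their own parents."""

    def __init__(self) -> None:
        self._spans: Dict[str, SpanNode] = {}
        self.roots: List[SpanNode] = []

    def on_event_start(
        self, name: str, event_id: str, parent_id: Optional[str] = None
    ) -> SpanNode:
        node = SpanNode(event_id=event_id, name=name, parent_id=parent_id)
        self._spans[event_id] = node
        parent = self._spans.get(parent_id) if parent_id else None
        if parent is None:
            # No known parent: treat the event as the root of a new tree.
            self.roots.append(node)
        else:
            parent.children.append(node)
        return node


def render(node: SpanNode, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one per span."""
    lines = ["  " * depth + node.name]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines
```

Because the lookup key is the event ID rather than "the most recent open span", the order in which two concurrent queries emit their events does not affect where each span lands.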

What gets captured

CBEventType	Spanlens span	Default capture?
`QUERY`	`llama_index.query`, span_type `custom` at the trace root	yes
`LLM`	`llama_index.llm`, span_type `llm` with token counts + model	yes
`RETRIEVE` / `RERANKING`	`llama_index.retrieve`, span_type `retrieval` with node_count + top scores	yes
`EMBEDDING`	`llama_index.embedding`, span_type `embedding`	yes
`FUNCTION_CALL`	`llama_index.function_call`, span_type `tool`	yes
`AGENT_STEP` / `SUB_QUESTION` / `SYNTHESIZE`	`llama_index.*`, span_type `custom`	yes
`CHUNKING` / `NODE_PARSING` / `TEMPLATING`	not captured (preparation noise)	no
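The table above can be read as a lookup from event type to span name and span type. The sketch below restates it in code, purely as an illustration. EventType is a local stand-in for CBEventType (so the snippet runs without LlamaIndex installed), span_for is a hypothetical helper, and the expansion of `llama_index.*` to `llama_index.<event value>` is an assumption rather than something the table states.

```python
from enum import Enum
from typing import FrozenSet, Optional, Tuple


class EventType(str, Enum):
    """Local stand-in for llama_index's CBEventType (table rows only)."""
    QUERY = "query"
    LLM = "llm"
    RETRIEVE = "retrieve"
    RERANKING = "reranking"
    EMBEDDING = "embedding"
    FUNCTION_CALL = "function_call"
    AGENT_STEP = "agent_step"
    SUB_QUESTION = "sub_question"
    SYNTHESIZE = "synthesize"
    CHUNKING = "chunking"
    NODE_PARSING = "node_parsing"
    TEMPLATING = "templating"


# Span types taken from the table; anything not listed falls back to "custom".
SPAN_TYPES = {
    EventType.LLM: "llm",
    EventType.RETRIEVE: "retrieval",
    EventType.RERANKING: "retrieval",
    EventType.EMBEDDING: "embedding",
    EventType.FUNCTION_CALL: "tool",
}

DEFAULT_IGNORED = frozenset(
    {EventType.CHUNKING, EventType.NODE_PARSING, EventType.TEMPLATING}
)


def span_for(
    event: EventType, ignored: FrozenSet[EventType] = DEFAULT_IGNORED
) -> Optional[Tuple[str, str]]:
    """Return (span_name, span_type), or None if the event is ignored."""
    if event in ignored:
        return None
    # RERANKING shares the retrieve span name, per the table.
    suffix = "retrieve" if event is EventType.RERANKING else event.value
    return f"llama_index.{suffix}", SPAN_TYPES.get(event, "custom")
```

Passing an empty ignore set (`ignored=frozenset()`) corresponds to capturing the preparation events as well.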

Override the ignore set or input / output truncation limits at construction time:

handler = SpanlensCallbackHandler(
    client=client,
    trace_name="my_rag_pipeline",     # default "llama_index_run"
    event_starts_to_ignore=[],         # capture everything, including chunking
    event_ends_to_ignore=[],
    max_input_bytes=32_768,            # default 16 KB
    max_output_bytes=32_768,
)

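The byte limits cap how much of each input or output is stored. How the handler truncates is not documented here, so the following is only a sketch of one reasonable approach: cut at the byte limit without splitting a multi-byte UTF-8 character. truncate_utf8 is a hypothetical helper, not part of the SDK.

```python
def truncate_utf8(text: str, max_bytes: int) -> str:
    """Trim text so its UTF-8 encoding fits within max_bytes."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    # errors="ignore" drops a trailing partial character left by the slice.
    return data[:max_bytes].decode("utf-8", errors="ignore")
```

Measuring in bytes rather than characters matters for non-ASCII content: a limit of 32_768 bytes may hold far fewer than 32,768 characters of CJK text.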

Trace tree shape

A typical query engine run produces the tree below. QUERY wraps the whole call, RETRIEVE and the LLM call sit as siblings under it, and any embedding step is a child of the retrieval:

Trace: my_rag_pipeline  (1.8s)
└── llama_index.query                  (1.8s)
    ├── llama_index.retrieve           (320ms, 12 nodes, top_score=0.92)
    │   └── llama_index.embedding      (80ms, count=1)
    │
    └── llama_index.llm                (1.4s, gpt-4o-mini, 120/45 tokens, $0.0008)

Attaching to a long-lived trace

By default the handler opens a fresh trace on each top-level query and closes it when the run ends. To group multiple queries (every turn of a chat session, every step of a long agent loop) under one trace, pass an existing trace at construction. The handler then leaves the trace's lifecycle entirely to the caller:

trace = client.start_trace(
    "chat-session",
    metadata={"user_id": user.id, "session_id": session_id},
)

handler = SpanlensCallbackHandler(client=client, trace=trace)
Settings.callback_manager.add_handler(handler)

for user_message in conversation:
    query_engine.query(user_message)

trace.end(status="completed")   # caller owns lifecycle when trace is passed in

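Since the caller owns the lifecycle, it helps to make sure trace.end() runs even when a query raises. The sketch below wraps any object with an end(status=...) method in a context manager. managed_trace is a hypothetical helper, and the "error" status value is an assumption; only "completed" appears in the example above.

```python
from contextlib import contextmanager


@contextmanager
def managed_trace(trace, ok_status="completed", error_status="error"):
    """End the trace exactly once, whether the body succeeds or raises."""
    try:
        yield trace
    except BaseException:
        # Close the trace with the failure status, then re-raise unchanged.
        trace.end(status=error_status)
        raise
    else:
        trace.end(status=ok_status)


# Usage, assuming `trace`, `query_engine`, and `conversation` from above:
#
# with managed_trace(trace):
#     for user_message in conversation:
#         query_engine.query(user_message)
```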

Pairing with the proxy for accurate cost

The callback handler captures span structure and reads token counts from response.raw.usage on the OpenAI-compatible LLM backends that ship with LlamaIndex. For models where usage is missing, or unreliable when streaming, route the underlying LLM through the Spanlens proxy so that the linked Request always carries the authoritative cost:

import os

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    api_base="https://api.spanlens.io/proxy/openai/v1",
    api_key=os.environ["SPANLENS_API_KEY"],
)

Settings.llm = llm

Now every LLM call lands as a Request in ClickHouse with the canonical cost, and the matching llama_index.llm span links to it via request_id.
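For context on when the proxy fallback matters, here is a sketch of the kind of defensive lookup needed to read usage from a raw response. It is not the handler's actual code: extract_token_usage is a hypothetical helper. The prompt_tokens and completion_tokens field names follow the OpenAI usage object. A result of None is the case where the proxy becomes the reliable source.

```python
from typing import Any, Optional, Tuple


def extract_token_usage(raw: Any) -> Optional[Tuple[int, int]]:
    """Return (prompt_tokens, completion_tokens), or None if unavailable.

    Accepts either a dict-shaped or attribute-shaped raw response.
    """
    if isinstance(raw, dict):
        usage = raw.get("usage")
    else:
        usage = getattr(raw, "usage", None)
    if usage is None:
        return None

    def read(key: str) -> Any:
        if isinstance(usage, dict):
            return usage.get(key)
        return getattr(usage, key, None)

    prompt = read("prompt_tokens")
    completion = read("completion_tokens")
    if prompt is None or completion is None:
        return None
    return int(prompt), int(completion)
```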

Linking spans to prompt versions

To tag an LLM call inside the pipeline with a Spanlens prompt version, set the x-spanlens-prompt-version header on the underlying LLM client. With the proxy approach above, attach it as a default header:

import os

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    api_base="https://api.spanlens.io/proxy/openai/v1",
    api_key=os.environ["SPANLENS_API_KEY"],
    default_headers={"x-spanlens-prompt-version": "rag-system@7"},
)

The Request row now carries prompt_version_id, so the Prompt A/B view can compare versions on real query traffic.

Verifying the integration

1. Run one query through your engine.
2. Open /traces. A new trace appears with the configured trace_name (default llama_index_run).
3. Click into the trace. The waterfall mirrors the pipeline: query at the top, with retrieve and llm children showing their real start / end times.
4. On the llm row, the right panel shows prompt / completion token counts and computed cost. If request_id is set (proxy mode), the row links straight to the matching Request in /requests.

Troubleshooting

No spans show up

Confirm the handler is registered on Settings.callback_manager before you build the query engine or agent — LlamaIndex captures the callback list at construction time. If you build the engine first and add the handler later, that engine instance will not see it.

LLM spans missing token usage

Some LlamaIndex LLM backends omit usage on streaming responses or wrap it in a shape the handler can't introspect. The fix is to route that LLM through the Spanlens proxy (see Pairing with the proxy above); the proxy parses tokens from the raw stream and the linked Request always has them.

Chunking and templating events are too noisy

They are filtered out by default. If you turned them back on with event_starts_to_ignore=[] and want to silence them again, pass the defaults explicitly:

handler = SpanlensCallbackHandler(
    client=client,
    event_starts_to_ignore=["chunking", "node_parsing", "templating"],
    event_ends_to_ignore=["chunking", "node_parsing", "templating"],
)

Trace closes too early on background work

If your pipeline kicks off fire-and-forget work after the root query returns, the auto-managed trace will close before that work logs. Pass an external trace via the trace= argument and call trace.end() yourself when all work is done.
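One way to know when "all work is done" is to keep the futures for the background tasks and wait on them before closing the trace. The sketch below assumes the background work is submitted through concurrent.futures. end_trace_when_done is a hypothetical helper, and the trace only needs an end(status=...) method as shown earlier.

```python
from concurrent.futures import Future, wait
from typing import Iterable, Optional


def end_trace_when_done(
    trace, futures: Iterable[Future], timeout: Optional[float] = None
) -> int:
    """Block until the futures finish (or the timeout expires), then end the trace.

    Returns the number of futures still pending at the time the trace
    was closed, so the caller can log or alert on stragglers.
    """
    _done, not_done = wait(list(futures), timeout=timeout)
    trace.end(status="completed")
    return len(not_done)


# Usage, assuming `pool` is a ThreadPoolExecutor and `trace` was passed
# to the handler via trace=:
#
# futures = [pool.submit(log_feedback, response)]
# end_trace_when_done(trace, futures, timeout=30)
```

Setting a timeout keeps a stuck background task from holding the trace open indefinitely.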

Next: RAG chatbot tutorial for a runnable example, or data model for what ends up in the database.