LlamaIndex integration
LlamaIndex funnels every framework event through a single BaseCallbackHandler contract with a CBEventType discriminator. SpanlensCallbackHandler subclasses that base, maps each event to a Spanlens span type, and threads parent / child relationships via the per-event UUIDs LlamaIndex hands out. As a result, the trace tree on /traces mirrors your RAG topology: QUERY at the root, with RETRIEVE / SYNTHESIZE / LLM / FUNCTION_CALL nested underneath.
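The parent / child threading described above can be pictured with a small, dependency-free sketch. It is not the SpanlensCallbackHandler source: SpanTree, Span, and the "root" sentinel are hypothetical names chosen for illustration, and on_event_start only imitates the general shape of a callback that receives an event type, an event ID, and a parent ID.

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, List

ROOT = "root"  # assumed sentinel meaning "no parent" in this sketch


@dataclass
class Span:
    event_type: str
    event_id: str
    children: List["Span"] = field(default_factory=list)


class SpanTree:
    """Toy tracker that builds a span tree from start events keyed by event ID."""

    def __init__(self) -> None:
        self.spans: Dict[str, Span] = {}
        self.roots: List[Span] = []

    def on_event_start(self, event_type: str, event_id: str, parent_id: str = ROOT) -> str:
        span = Span(event_type, event_id)
        self.spans[event_id] = span
        parent = self.spans.get(parent_id)
        if parent is None:
            # Unknown or sentinel parent: treat the span as a trace root.
            self.roots.append(span)
        else:
            parent.children.append(span)
        return event_id

    def render(self, span: Span, depth: int = 0) -> List[str]:
        """Return the subtree as indented lines, one per span."""
        lines = ["  " * depth + span.event_type]
        for child in span.children:
            lines.extend(self.render(child, depth + 1))
        return lines


def new_id() -> str:
    return str(uuid.uuid4())


tree = SpanTree()
query = tree.on_event_start("query", new_id())
retrieve = tree.on_event_start("retrieve", new_id(), parent_id=query)
tree.on_event_start("embedding", new_id(), parent_id=retrieve)
tree.on_event_start("llm", new_id(), parent_id=query)

print("\n".join(tree.render(tree.roots[0])))
```

Because every span is looked up by its own ID, the tracker never depends on the order in which sibling events arrive, which is the property that lets one instance serve interleaved queries.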
Install
```bash
pip install "spanlens[llama-index]"
# pulls in llama-index-core>=0.10.0 alongside the SDK
```

Minimal setup
```python
import os

from llama_index.core import Settings, VectorStoreIndex, SimpleDirectoryReader
from spanlens import SpanlensClient
from spanlens.integrations.llama_index import SpanlensCallbackHandler

client = SpanlensClient(api_key=os.environ["SPANLENS_API_KEY"])
handler = SpanlensCallbackHandler(client=client)

# Register globally: every query engine / agent created after this
# will route callbacks through the handler.
Settings.callback_manager.add_handler(handler)

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine()
response = query_engine.query("What is RAG?")
```

The handler is safe to share across concurrent queries: LlamaIndex tags every event with a unique UUID, so one handler instance per process is fine for parallel work.
What gets captured
| CBEventType | Spanlens span | Default capture? |
|---|---|---|
| QUERY | llama_index.query, span_type custom at the trace root | yes |
| LLM | llama_index.llm, span_type llm with token counts + model | yes |
| RETRIEVE / RERANKING | llama_index.retrieve, span_type retrieval with node_count + top scores | yes |
| EMBEDDING | llama_index.embedding, span_type embedding | yes |
| FUNCTION_CALL | llama_index.function_call, span_type tool | yes |
| AGENT_STEP / SUB_QUESTION / SYNTHESIZE | llama_index.*, span_type custom | yes |
| CHUNKING / NODE_PARSING / TEMPLATING | not captured (preparation noise) | no |
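The table above can be read as a lookup plus an ignore set. The sketch below restates it in plain Python so the default behavior is easy to reason about. It is an illustration only: SPAN_TYPES, DEFAULT_IGNORED, and span_type_for are hypothetical names, event types are written as lowercase strings for convenience, and the "custom" fallback for events outside the table is an assumption rather than documented SDK behavior.

```python
from typing import FrozenSet, Optional

# Event type -> span_type, transcribed from the table above.
SPAN_TYPES = {
    "query": "custom",
    "llm": "llm",
    "retrieve": "retrieval",
    "reranking": "retrieval",
    "embedding": "embedding",
    "function_call": "tool",
    "agent_step": "custom",
    "sub_question": "custom",
    "synthesize": "custom",
}

# Events the table lists as "not captured" by default.
DEFAULT_IGNORED: FrozenSet[str] = frozenset({"chunking", "node_parsing", "templating"})


def span_type_for(event_type: str, ignored: FrozenSet[str] = DEFAULT_IGNORED) -> Optional[str]:
    """Return the span_type for an event, or None when the event is ignored."""
    if event_type in ignored:
        return None
    # Assumption: anything not listed in the table falls back to "custom".
    return SPAN_TYPES.get(event_type, "custom")


print(span_type_for("reranking"))                      # retrieval
print(span_type_for("chunking"))                       # None
print(span_type_for("chunking", ignored=frozenset()))  # custom
```

Passing an empty ignore set is the analogue of setting both ignore lists to [] on the real handler: nothing is filtered, so preparation events start producing spans.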
Override the ignore set or input / output truncation limits at construction time:
```python
handler = SpanlensCallbackHandler(
    client=client,
    trace_name="my_rag_pipeline",  # default "llama_index_run"
    event_starts_to_ignore=[],     # capture everything, including chunking
    event_ends_to_ignore=[],
    max_input_bytes=32_768,        # default 16 KB
    max_output_bytes=32_768,
)
```

Trace tree shape
A typical query engine run produces a tree like this — QUERY wraps the whole call, RETRIEVE and the LLM call sit as siblings under it, and any embedding step lives as a child of the retrieval:
```text
Trace: my_rag_pipeline (1.8s)
└── llama_index.query (1.8s)
    ├── llama_index.retrieve (320ms, 12 nodes, top_score=0.92)
    │   └── llama_index.embedding (80ms, count=1)
    │
    └── llama_index.llm (1.4s, gpt-4o-mini, 120/45 tokens, $0.0008)
```

Attaching to a long-lived trace
By default the handler opens a fresh trace on each top-level query and closes it when the run ends. To group multiple queries (every turn of a chat session, every step of a long agent loop) under one trace, pass an existing trace at construction — the handler will then leave its lifecycle entirely to the caller:
```python
trace = client.start_trace(
    "chat-session",
    metadata={"user_id": user.id, "session_id": session_id},
)
handler = SpanlensCallbackHandler(client=client, trace=trace)
Settings.callback_manager.add_handler(handler)

for user_message in conversation:
    query_engine.query(user_message)

trace.end(status="completed")  # caller owns lifecycle when trace is passed in
```

Pairing with the proxy for accurate cost
The callback handler captures span structure and reads token counts from response.raw.usage on the OpenAI-compatible LLM backends LlamaIndex ships with. For models where usage is missing or unreliable on streaming, route the underlying LLM through the Spanlens proxy and the linked Request will always carry the authoritative cost:
```python
import os

from llama_index.core import Settings
from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    api_base="https://server.spanlens.io/proxy/openai/v1",
    api_key=os.environ["SPANLENS_API_KEY"],
)
Settings.llm = llm
```

Now every LLM call lands as a Request in ClickHouse with the canonical cost, and the matching llama_index.llm span links to it via request_id.
Linking spans to prompt versions
To tag an LLM call inside the pipeline with a Spanlens prompt version, set the x-spanlens-prompt-version header on the underlying LLM client. With the proxy approach above, attach it as a default header:
```python
import os

from llama_index.llms.openai import OpenAI

llm = OpenAI(
    model="gpt-4o-mini",
    api_base="https://server.spanlens.io/proxy/openai/v1",
    api_key=os.environ["SPANLENS_API_KEY"],
    default_headers={"x-spanlens-prompt-version": "rag-system@7"},
)
```

The Request row now carries prompt_version_id, so the Prompt A/B view can compare versions on real query traffic.
Verifying the integration
- Run one query through your engine.
- Open /traces. A new trace appears with the configured trace_name (default llama_index_run).
- Click into the trace. The waterfall mirrors the pipeline: query at the top, retrieve and llm children with their real start / end times.
- On the llm row, the right panel shows prompt / completion token counts and computed cost. If request_id is set (proxy mode), the row links straight to the matching Request in /requests.
Troubleshooting
No spans show up
Confirm the handler is registered on Settings.callback_manager before you build the query engine or agent — LlamaIndex captures the callback list at construction time. If you build the engine first and add the handler later, that engine instance will not see it.
LLM spans missing token usage
Some LlamaIndex LLM backends omit usage on streaming responses or wrap it in a shape the handler can't introspect. The fix is to route that LLM through the Spanlens proxy (see Pairing with the proxy above); the proxy parses tokens from the raw stream and the linked Request always has them.
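To see why usage can go missing, it helps to look at how defensive the lookup has to be. The sketch below is not the handler's actual logic; extract_usage is a hypothetical helper that assumes an OpenAI-style payload with usage.prompt_tokens and usage.completion_tokens, and returns None whenever that shape is absent, which is the case where a span would show no token counts.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def _get(obj: Any, key: str) -> Any:
    """Read a field from either a dict or an attribute-style object."""
    if obj is None:
        return None
    if isinstance(obj, dict):
        return obj.get(key)
    return getattr(obj, key, None)


def extract_usage(raw: Any) -> Optional[Tuple[int, int]]:
    """Return (prompt_tokens, completion_tokens), or None if unavailable."""
    usage = _get(raw, "usage")
    prompt = _get(usage, "prompt_tokens")
    completion = _get(usage, "completion_tokens")
    if isinstance(prompt, int) and isinstance(completion, int):
        return prompt, completion
    return None


# Dict-shaped response with usage present.
print(extract_usage({"usage": {"prompt_tokens": 120, "completion_tokens": 45}}))
# Object-shaped response where the backend left usage empty (common when streaming).
print(extract_usage(SimpleNamespace(usage=None)))
# Response with no usage field at all.
print(extract_usage({"choices": []}))
```

Only the first call yields token counts; the other two return None, which is the situation the proxy route is meant to cover.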
Chunking and templating events are too noisy
They are filtered out by default. If you turned them back on with event_starts_to_ignore=[] and want to silence them again, pass the defaults explicitly:
```python
handler = SpanlensCallbackHandler(
    client=client,
    event_starts_to_ignore=["chunking", "node_parsing", "templating"],
    event_ends_to_ignore=["chunking", "node_parsing", "templating"],
)
```

Trace closes too early on background work
If your pipeline kicks off fire-and-forget work after the root query returns, the auto-managed trace will close before that work logs. Pass an external trace via the trace= argument and call trace.end() yourself when all work is done.
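The ordering that matters here is "wait for background work, then end the trace". The sketch below shows that ordering with the standard library only. FakeTrace is a stand-in invented for this example (its log method is hypothetical and exists only to make late writes visible); in a real pipeline the object would be the trace returned by client.start_trace and passed to the handler via trace=.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor, wait


class FakeTrace:
    """Stand-in trace: records span names and rejects writes after end()."""

    def __init__(self) -> None:
        self.spans = []
        self.status = None
        self._lock = threading.Lock()

    def log(self, name: str) -> None:
        with self._lock:
            if self.status is not None:
                raise RuntimeError("trace already ended")
            self.spans.append(name)

    def end(self, status: str = "completed") -> None:
        with self._lock:
            self.status = status


def background_job(trace: FakeTrace, name: str) -> None:
    time.sleep(0.05)  # simulate work that outlives the root query
    trace.log(name)


trace = FakeTrace()
with ThreadPoolExecutor(max_workers=2) as pool:
    trace.log("query")  # the root query itself returns immediately
    futures = [pool.submit(background_job, trace, n) for n in ("post_process", "audit_log")]
    wait(futures)       # block until all fire-and-forget work has finished
    for future in futures:
        future.result() # surface any exception raised in a worker

trace.end(status="completed")  # only now is it safe to close the trace
print(sorted(trace.spans), trace.status)
```

If trace.end() were called before wait(futures), the late log calls in this toy would raise, which mirrors the spans that go missing when an auto-managed trace closes first.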
Next: RAG chatbot tutorial for a runnable example, or data model for what ends up in the database.