Tutorial: add observability to a RAG chatbot
Forty minutes. We start with a minimal RAG chatbot (Pinecone + OpenAI), add Spanlens, and end with a dashboard that shows each user question as a single trace with retrieval, generation, and per-step cost broken out.
What you will end up with
- Every chat turn logged to /requests with model, tokens, cost, latency.
- Every chat turn shown as one trace in /traces with three spans: embedding (OpenAI), retrieval (Pinecone), and generation (OpenAI).
- End-user grouping in /users via x-spanlens-user.
- Conversation grouping via x-spanlens-session.
Starting point
This is a tiny TypeScript chatbot: an Express route that takes a question, embeds it, fetches relevant docs from Pinecone, and asks GPT-4o-mini for an answer. No observability yet.
// routes/chat.ts (BEFORE)
import type { Request, Response } from 'express'
import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const openai = new OpenAI()
const pinecone = new Pinecone()
const index = pinecone.index('kb')

export async function chat(req: Request, res: Response) {
  const { question, userId, sessionId } = req.body

  // 1. embed the question
  const embedRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  })
  const vector = embedRes.data[0].embedding

  // 2. retrieve the five nearest documents
  const matches = await index.query({ vector, topK: 5, includeMetadata: true })
  const context = matches.matches.map(m => m.metadata?.text).join('\n\n')

  // 3. generate the answer from the retrieved context
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Answer using the context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  })
  res.json({ answer: completion.choices[0].message.content })
}
Step 1. Add Spanlens to the project
pnpm add @spanlens/sdk
Get a project API key from /projects and add it to your env:
SPANLENS_API_KEY=sl_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
In the same dashboard, click + Add provider key on the project card and paste your real OpenAI key. Then remove OPENAI_API_KEY from your .env file. Your provider key lives server-side from now on.
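A leftover OPENAI_API_KEY is easy to miss after this step. The following is a minimal startup check, not something the SDK provides: checkSpanlensEnv is a hypothetical helper, and it only looks at the two variable names used in this tutorial.

```typescript
// Hypothetical startup check; not part of @spanlens/sdk.
type Env = Record<string, string | undefined>

export function checkSpanlensEnv(env: Env): string[] {
  const problems: string[] = []
  if (!env.SPANLENS_API_KEY) {
    problems.push('SPANLENS_API_KEY is not set')
  }
  if (env.OPENAI_API_KEY) {
    // After Step 1 the provider key should live in the dashboard only.
    problems.push('OPENAI_API_KEY is still set; remove it from .env')
  }
  return problems
}
```

Calling it with process.env at boot and refusing to start when the returned list is non-empty is one way to catch a half-finished migration early.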
Step 2. Swap the OpenAI client
A two-line change: the import and the constructor. Every other OpenAI call stays the same.
// BEFORE
import OpenAI from 'openai'
const openai = new OpenAI()
// AFTER
import { createOpenAI } from '@spanlens/sdk/openai'
const openai = createOpenAI() // reads SPANLENS_API_KEY from env
Both openai.embeddings.create() and openai.chat.completions.create() now flow through the proxy. Open /requests: you should see two rows per chat turn, one for the embedding and one for the completion, each with cost, tokens, and full body.
Step 3. Group the two calls under one trace
Right now embedding and generation are independent rows. To see them as one user interaction, wrap the route in a trace and add a retrieval span for the Pinecone call. The retrieval span is what lets Spanlens show how much of a turn was spent in the vector DB, for example 120 ms of a 1.4 s response.
import type { Request, Response } from 'express'
import { SpanlensClient, observe } from '@spanlens/sdk'
import { createOpenAI, withUser, withSession } from '@spanlens/sdk/openai'
import { Pinecone } from '@pinecone-database/pinecone'

const client = new SpanlensClient()
const openai = createOpenAI()
const pinecone = new Pinecone()
const index = pinecone.index('kb')

export async function chat(req: Request, res: Response) {
  const { question, userId, sessionId } = req.body
  const trace = client.startTrace({
    name: 'rag-chat-turn',
    metadata: { user_id: userId, session_id: sessionId },
  })
  try {
    // Sent with every OpenAI call so each Request row carries the user and session
    const headers = {
      ...withUser(userId).headers,
      ...withSession(sessionId).headers,
    }

    // 1. embed (LLM span happens automatically via the proxy)
    const embedRes = await openai.embeddings.create(
      { model: 'text-embedding-3-small', input: question },
      { headers },
    )
    const vector = embedRes.data[0].embedding

    // 2. retrieval span (Pinecone is not an LLM, so we wrap it ourselves)
    const matches = await observe(
      trace,
      { name: 'pinecone.query', spanType: 'retrieval', input: { topK: 5 } },
      async () => index.query({ vector, topK: 5, includeMetadata: true }),
    )
    const context = matches.matches.map(m => m.metadata?.text).join('\n\n')

    // 3. generate (LLM span happens automatically via the proxy)
    const completion = await openai.chat.completions.create(
      {
        model: 'gpt-4o-mini',
        messages: [
          { role: 'system', content: 'Answer using the context.' },
          { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
        ],
      },
      { headers },
    )
    res.json({ answer: completion.choices[0].message.content })
  } finally {
    // Close the trace even if a step above throws
    await trace.end()
  }
}
Three things to notice in this diff:
- The two OpenAI calls did not need observe() wrapping. The proxy itself emits LLM spans automatically.
- The Pinecone call needed observe() with spanType: 'retrieval' because it is not a Spanlens-aware client.
- withUser and withSession set request headers, which become user_id and session_id on each Request row.
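To make the second point concrete, here is a rough sketch of what a wrapper in the spirit of observe() plausibly does: time an async call and record whether it succeeded. This is not the SDK's implementation. timedSpan, SpanRecord, and the record callback are hypothetical names, and the injectable clock exists only to make the sketch testable.

```typescript
// Illustrative only: a hand-rolled span timer, not the Spanlens observe() implementation.
interface SpanRecord {
  name: string
  spanType: string
  durationMs: number
  status: 'ok' | 'error'
}

export async function timedSpan<T>(
  opts: { name: string; spanType: string },
  fn: () => Promise<T>,
  record: (span: SpanRecord) => void,
  now: () => number = Date.now,
): Promise<T> {
  const start = now()
  let status: 'ok' | 'error' = 'ok'
  try {
    return await fn()
  } catch (err) {
    status = 'error'
    throw err // rethrow so the caller's error handling is unchanged
  } finally {
    // Runs on both success and failure, so every call yields exactly one span
    record({ name: opts.name, spanType: opts.spanType, durationMs: now() - start, status })
  }
}
```

The design point is the finally block: the span is recorded whether the wrapped call resolves or throws, which is why a failed Pinecone query would still appear in the waterfall.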
Step 4. Verify in the dashboard
- Hit the route once with { question: 'What is the refund policy?', userId: 'u_123', sessionId: 's_abc' }.
- Open /traces. A new trace appears titled rag-chat-turn with three child spans: embedding (LLM), pinecone.query (retrieval), and the completion (LLM).
- Click the trace. The waterfall shows per-step time. The cost panel sums the two LLM calls.
- Open /users. u_123 shows up with one trace and the rolled-up cost.
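For the first step, a small helper can build the request so the same payload is reusable from a script or a test. The tutorial does not say where the route is mounted, so the /chat path and the http://localhost:3000 base URL below are assumptions, and buildChatRequest is a hypothetical helper.

```typescript
// Hypothetical helper; the /chat path and default base URL are assumptions.
interface ChatTurn {
  question: string
  userId: string
  sessionId: string
}

export function buildChatRequest(turn: ChatTurn, baseUrl = 'http://localhost:3000') {
  return {
    url: `${baseUrl}/chat`,
    init: {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(turn),
    },
  }
}

const { url, init } = buildChatRequest({
  question: 'What is the refund policy?',
  userId: 'u_123',
  sessionId: 's_abc',
})
// On Node 18+ or in a browser: const res = await fetch(url, init)
```

The JSON body assumes the Express app has a JSON body parser registered, since the route reads its fields from req.body.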
Step 5. Add prompt versioning so you can A/B test
The system prompt is the part you will iterate on. Register it as a Spanlens prompt version so future tweaks show up as a comparable A/B in /prompts.
- Open /prompts, create a prompt named rag-system, paste the system message as version 1.
- Reference the version on the completion call with withPromptVersion:
import { withPromptVersion } from '@spanlens/sdk/openai'

const completion = await openai.chat.completions.create(
  { ... },
  {
    headers: {
      ...withUser(userId).headers,
      ...withSession(sessionId).headers,
      ...withPromptVersion('rag-system@1').headers,
    },
  },
)
When you later ship version 2 of the prompt, change the header to rag-system@2 for half your traffic. The /prompts A/B view will then show whether v2 is statistically better on cost, latency, and (with an evaluator) quality.
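One way to send half your traffic to each version is to bucket by user ID, so that a given user always gets the same prompt. The sketch below assumes a stable hash-based split is acceptable for your experiment. pickPromptVersion is a hypothetical helper, and the FNV-1a hash is chosen here for illustration, not because Spanlens prescribes it.

```typescript
// 32-bit FNV-1a over the string's UTF-16 code units; good enough for bucketing.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0 // keep the value an unsigned 32-bit int
  }
  return hash
}

// Hypothetical helper: deterministic split by user ID.
// percentOnV2 is the share of users (0 to 100) routed to version 2.
export function pickPromptVersion(
  userId: string,
  percentOnV2 = 50,
): 'rag-system@1' | 'rag-system@2' {
  return fnv1a(userId) % 100 < percentOnV2 ? 'rag-system@2' : 'rag-system@1'
}
```

The result can be passed straight to the helper from the step above, as in withPromptVersion(pickPromptVersion(userId)).headers. Bucketing by user rather than by request keeps one conversation on one prompt, which makes the A/B comparison easier to read.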
What you skipped that you might want later
- Evals. See Nightly evals tutorial to score every chat turn for helpfulness on a 0..1 scale.
- PII redaction. Use x-spanlens-log-body=meta on requests where the body would carry user PII. Security has the full policy.
- LangChain RAG. If you migrate to LangChain RetrievalQA or LangGraph, the callback handler covers all of this with a single line. See LangGraph integration.
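If you need to set these headers without the SDK helpers, for example from another service, the header names mentioned in this tutorial can be assembled by hand. This is a sketch under two assumptions: that withUser and withSession emit x-spanlens-user and x-spanlens-session, and that x-spanlens-log-body takes the value meta as written above. spanlensHeaders is a hypothetical helper.

```typescript
// Hypothetical helper; header names are taken from this tutorial, not verified against the SDK.
interface SpanlensHeaderOptions {
  userId?: string
  sessionId?: string
  containsPii?: boolean
}

export function spanlensHeaders(opts: SpanlensHeaderOptions): Record<string, string> {
  const headers: Record<string, string> = {}
  if (opts.userId) headers['x-spanlens-user'] = opts.userId
  if (opts.sessionId) headers['x-spanlens-session'] = opts.sessionId
  // Assumption: 'meta' keeps request metadata but leaves the body out of the log.
  if (opts.containsPii) headers['x-spanlens-log-body'] = 'meta'
  return headers
}
```

Omitting a header when its value is missing, rather than sending an empty string, avoids grouping anonymous traffic under a blank user or session.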
Next tutorial: multi-step agent tracing.