Tutorial: add observability to a RAG chatbot

Forty minutes. We start with a minimal RAG chatbot (Pinecone + OpenAI), add Spanlens, and end with a dashboard that shows each user question as a single trace with retrieval, generation, and per-step cost broken out.

What you will end up with

Every chat turn logged to /requests with model, tokens, cost, latency.
Every chat turn shown as one trace in /traces with three spans: embedding (OpenAI), retrieval (Pinecone), and generation (OpenAI).
End-user grouping in /users via x-spanlens-user.
Conversation grouping via x-spanlens-session.

Starting point

This is a tiny TypeScript chatbot: an Express route that takes a question, embeds it, fetches relevant docs from Pinecone, and asks GPT-4o-mini for an answer. No observability yet.

// routes/chat.ts (BEFORE)
import OpenAI from 'openai'
import { Pinecone } from '@pinecone-database/pinecone'

const openai = new OpenAI()
const pinecone = new Pinecone()
const index = pinecone.index('kb')

export async function chat(req, res) {
  const { question, userId, sessionId } = req.body

  // 1. embed the question
  const embedRes = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: question,
  })
  const vector = embedRes.data[0].embedding

  // 2. retrieve
  const matches = await index.query({ vector, topK: 5, includeMetadata: true })
  const context = matches.matches.map(m => m.metadata?.text).join('\n\n')

  // 3. generate
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: 'Answer using the context.' },
      { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
    ],
  })

  res.json({ answer: completion.choices[0].message.content })
}

Step 1. Add Spanlens to the project

pnpm add @spanlens/sdk

Get a project API key from /projects and add it to your env:

SPANLENS_API_KEY=sl_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

In the same dashboard, click + Add provider key on the project card and paste your real OpenAI key. Then remove OPENAI_API_KEY from your .env file. Your provider key lives server-side from now on.
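
Because the provider key now lives in Spanlens, a leftover OPENAI_API_KEY in the environment is easy to miss. The sketch below is a hypothetical startup check, not part of the Spanlens SDK: checkSpanlensEnv is a name invented here, and it only looks at the two variables mentioned above.

```typescript
// Hypothetical startup check; checkSpanlensEnv is not part of @spanlens/sdk.
type Env = Record<string, string | undefined>

function checkSpanlensEnv(env: Env): string[] {
  const problems: string[] = []
  if (!env.SPANLENS_API_KEY) {
    problems.push('SPANLENS_API_KEY is not set')
  }
  if (env.OPENAI_API_KEY) {
    // The provider key should live in the Spanlens dashboard, not in .env.
    problems.push('OPENAI_API_KEY is still set; remove it from .env')
  }
  return problems
}
```

Calling checkSpanlensEnv(process.env) once at boot and logging any returned messages is usually enough to catch a half-finished migration.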

Step 2. Swap the OpenAI client

Two-line change: swap the import and the constructor. The rest of the OpenAI calls stay the same.

// BEFORE
import OpenAI from 'openai'
const openai = new OpenAI()

// AFTER
import { createOpenAI } from '@spanlens/sdk/openai'
const openai = createOpenAI()  // reads SPANLENS_API_KEY from env

Both openai.embeddings.create() and openai.chat.completions.create() now flow through the proxy. Open /requests: you should see two rows per chat turn, one for the embedding and one for the completion, each with cost, tokens, and full body.

Step 3. Group the two calls under one trace

Right now the embedding and generation calls are independent rows. To see them as one user interaction, wrap the route in a trace and add a retrieval span for the Pinecone call. The retrieval span is what lets Spanlens show how much of a turn's latency came from the vector DB, for example 120 ms of a 1.4 s turn.

import { SpanlensClient, observe } from '@spanlens/sdk'
import { createOpenAI, withUser, withSession } from '@spanlens/sdk/openai'
import { Pinecone } from '@pinecone-database/pinecone'

const client = new SpanlensClient()
const openai = createOpenAI()
const pinecone = new Pinecone()
const index = pinecone.index('kb')

export async function chat(req, res) {
  const { question, userId, sessionId } = req.body

  const trace = client.startTrace({
    name: 'rag-chat-turn',
    metadata: { user_id: userId, session_id: sessionId },
  })

  try {
    const headers = {
      ...withUser(userId).headers,
      ...withSession(sessionId).headers,
    }

    // 1. embed (LLM span happens automatically via the proxy)
    const embedRes = await openai.embeddings.create(
      { model: 'text-embedding-3-small', input: question },
      { headers },
    )
    const vector = embedRes.data[0].embedding

    // 2. retrieval span (Pinecone is not an LLM, so we wrap it ourselves)
    const matches = await observe(
      trace,
      { name: 'pinecone.query', spanType: 'retrieval', input: { topK: 5 } },
      async () => index.query({ vector, topK: 5, includeMetadata: true }),
    )
    const context = matches.matches.map(m => m.metadata?.text).join('\n\n')

    // 3. generate (LLM span happens automatically via the proxy)
    const completion = await openai.chat.completions.create(
      {
        model: 'gpt-4o-mini',
        messages: [
          { role: 'system', content: 'Answer using the context.' },
          { role: 'user', content: `Context:\n${context}\n\nQuestion: ${question}` },
        ],
      },
      { headers },
    )

    res.json({ answer: completion.choices[0].message.content })
  } finally {
    await trace.end()
  }
}

Three things to notice in this diff:

The two OpenAI calls did not need observe() wrapping. The proxy itself emits LLM spans automatically.
The Pinecone call needed observe() with spanType: 'retrieval' because it is not a Spanlens-aware client.
withUser and withSession set request headers, which become user_id and session_id on each Request row.
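
To make the second point concrete, here is a minimal sketch of what a span wrapper of this kind does conceptually: start a clock, run the wrapped call, and record the name, type, duration, and outcome. timedSpan and SpanRecord are hypothetical names used for illustration only. This is not how observe() is implemented in the Spanlens SDK, which presumably also reports the span to the Spanlens backend.

```typescript
// Conceptual sketch only; timedSpan and SpanRecord are hypothetical names,
// not part of @spanlens/sdk.
interface SpanRecord {
  name: string
  spanType: string
  durationMs: number
  status: 'ok' | 'error'
}

async function timedSpan<T>(
  sink: SpanRecord[],
  name: string,
  spanType: string,
  fn: () => Promise<T>,
): Promise<T> {
  const start = Date.now()
  let status: 'ok' | 'error' = 'ok'
  try {
    return await fn()
  } catch (err) {
    status = 'error'
    throw err // rethrow so the caller's error handling is unchanged
  } finally {
    // Record the span whether the wrapped call succeeded or failed.
    sink.push({ name, spanType, durationMs: Date.now() - start, status })
  }
}
```

The finally block is the important design choice: a failed Pinecone query still produces a span, so slow or erroring retrievals stay visible in the waterfall instead of silently disappearing.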

Step 4. Verify in the dashboard

Hit the route once with { question: 'What is the refund policy?', userId: 'u_123', sessionId: 's_abc' }.
Open /traces. A new trace appears titled rag-chat-turn with three child spans: embedding (LLM), pinecone.query (retrieval), and the completion (LLM).
Click the trace. The waterfall shows per-step time. The cost panel sums the two LLM calls.
Open /users. u_123 shows up with one trace and the rolled-up cost.
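
The first verification step can be scripted. The sketch below builds the same test payload and posts it to the route. The helper names are hypothetical, and the URL is left as a parameter because how the chat handler is mounted in your Express app is not shown in this tutorial. It also assumes the app parses JSON bodies (for example with express.json()), since the handler reads req.body.

```typescript
// Hypothetical smoke test; buildChatRequest and sendTestTurn are
// illustrative names, and the route URL is an assumption you supply.
interface ChatRequestBody {
  question: string
  userId: string
  sessionId: string
}

function buildChatRequest(body: ChatRequestBody) {
  return {
    method: 'POST',
    headers: { 'content-type': 'application/json' },
    body: JSON.stringify(body),
  }
}

async function sendTestTurn(url: string): Promise<unknown> {
  const res = await fetch(
    url,
    buildChatRequest({
      question: 'What is the refund policy?',
      userId: 'u_123',
      sessionId: 's_abc',
    }),
  )
  if (!res.ok) throw new Error(`chat route returned ${res.status}`)
  return res.json()
}
```

Running sendTestTurn against your local server (Node 18 or later for the built-in fetch) should produce exactly one new trace in /traces.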

Step 5. Add prompt versioning so you can A/B test

The system prompt is the part you will iterate on. Register it as a Spanlens prompt version so future tweaks show up as a comparable A/B in /prompts.

Open /prompts, create a prompt named rag-system, paste the system message as version 1.
Reference the version on the completion call with withPromptVersion:

import { withPromptVersion } from '@spanlens/sdk/openai'

const completion = await openai.chat.completions.create(
  { ... },
  {
    headers: {
      ...withUser(userId).headers,
      ...withSession(sessionId).headers,
      ...withPromptVersion('rag-system@1').headers,
    },
  },
)

When you ship version 2 of the prompt later, change the header to rag-system@2 for half your traffic. The /prompts A/B view will then show whether v2 is statistically better on cost, latency, and (with an evaluator) quality.
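
One way to route half your traffic is a deterministic split on userId, so a given user always sees the same prompt version across turns. The sketch below is an assumption about how you might do this in your own code. pickPromptVersion and hash32 are hypothetical helpers, not part of the Spanlens SDK, and splitting per user rather than per request is a design choice you may want to revisit.

```typescript
// Hypothetical helpers; not part of @spanlens/sdk.
// hash32: FNV-1a over UTF-16 code units, followed by the murmur3
// 32-bit finalizer so that similar ids (u_1, u_2, ...) spread out.
function hash32(input: string): number {
  let h = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    h ^= input.charCodeAt(i)
    h = Math.imul(h, 0x01000193)
  }
  h ^= h >>> 16
  h = Math.imul(h, 0x85ebca6b)
  h ^= h >>> 13
  h = Math.imul(h, 0xc2b2ae35)
  h ^= h >>> 16
  return h >>> 0 // force an unsigned 32-bit result
}

// v2Share is the fraction of users routed to version 2 (0.5 = half).
function pickPromptVersion(userId: string, v2Share = 0.5): string {
  const bucket = hash32(userId) / 0x100000000 // in [0, 1)
  return bucket < v2Share ? 'rag-system@2' : 'rag-system@1'
}
```

The result can be passed straight to the helper shown above, as in withPromptVersion(pickPromptVersion(userId)).headers. Keep the system message you actually send in sync with the version you tag, otherwise the comparison in /prompts is meaningless.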

What you skipped that you might want later

Evals. See the Nightly evals tutorial to score every chat turn for helpfulness on a 0 to 1 scale.
PII redaction. Use x-spanlens-log-body=meta on requests where the body would carry user PII. See Security for the full policy.
LangChain RAG. If you migrate to LangChain RetrievalQA or LangGraph, the callback handler covers all of this with a single line. See LangGraph integration.
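
For the PII redaction item, a per-request switch could look like the sketch below. The header name and value come from the x-spanlens-log-body=meta setting mentioned above. logBodyHeaders is a hypothetical helper, and the email regex is a deliberately crude stand-in for whatever PII detection you actually use.

```typescript
// Hypothetical helper; logBodyHeaders is not part of @spanlens/sdk.
// Crude email matcher used only to illustrate the switch.
const EMAIL_PATTERN = /[^\s@]+@[^\s@]+\.[^\s@]+/

function logBodyHeaders(question: string): Record<string, string> {
  // Ask for metadata-only logging when the question looks like it
  // contains an email address; otherwise add no extra header.
  return EMAIL_PATTERN.test(question) ? { 'x-spanlens-log-body': 'meta' } : {}
}
```

Spreading ...logBodyHeaders(question) into the headers object next to the user and session headers keeps the decision in one place.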

Next tutorial: multi-step agent tracing.