Prompts

Store your prompt templates as named, versioned assets. Every time you tweak a prompt, Spanlens creates a new immutable version. Then compare versions side-by-side with real production metrics — average latency, error rate, and cost per call.

Why it matters

Prompts get edited constantly: a line added here, an example rewritten there, a tone shift on Friday afternoon. The unanswered question is always the same — is this actually better, or does it just feel better?

Plain .replace() edits in your codebase give you no answers. Previous versions are lost, you can't roll back, and you never learn which version actually costs less or fails less. Spanlens Prompts fixes that without forcing you to adopt a new runtime or template engine.

How it works

Versioning

Save a prompt under a name (e.g. chatbot-system) in the dashboard. When you edit it later, a new version is created automatically with the next number. Old versions are immutable: their content is never overwritten. No manual version bumps, no schema migrations.

chatbot-system
  ├─ v1  (2 weeks ago)  "You are a helpful assistant..."
  ├─ v2  (1 week ago)   "You are a helpful Korean-speaking assistant..."
  └─ v3  (yesterday)    "You are a Korean assistant. Be concise..."

Each version stores:

content — the template body (up to 100K chars)
variables — typed placeholders like {{userName}} with description and required flag
metadata — free-form JSON for tags (team, task type, model target, etc.)
project_id — optional project scope
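
As a rough illustration of the fields listed above, a version record could be modeled as follows. The interface names, the exact field shapes, and the renderTemplate helper are assumptions made for this sketch, not the Spanlens schema or SDK. Because Spanlens does not impose a template engine, placeholder substitution is shown as plain application code.

```typescript
// Hypothetical shapes for illustration only; not the actual Spanlens schema.
interface PromptVariable {
  name: string // placeholder name, e.g. "userName" for {{userName}}
  description?: string
  required: boolean
}

interface PromptVersion {
  name: string
  version: number
  content: string // template body
  variables: PromptVariable[]
  metadata: Record<string, unknown> // free-form JSON tags
  project_id?: string // optional project scope
}

// Replace {{name}} placeholders with the supplied values.
// Throws if a required variable is missing; unknown placeholders are left as-is.
function renderTemplate(
  prompt: PromptVersion,
  values: Record<string, string>,
): string {
  for (const v of prompt.variables) {
    if (v.required && values[v.name] === undefined) {
      throw new Error(`Missing required variable: ${v.name}`)
    }
  }
  return prompt.content.replace(
    /\{\{(\w+)\}\}/g,
    (match: string, key: string) => values[key] ?? match,
  )
}
```

Failing fast on a missing required variable keeps a half-rendered prompt from ever reaching the model.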

A/B comparison on real traffic

Click a prompt in /prompts and you'll see a comparison table of every version that has received production traffic in the last 30 days:

Version	Samples	Avg latency	Error %	Avg cost	Total cost
v3	1,245	820ms	0.4%	$0.0012	$1.49
v2	3,102	1.2s	1.1%	$0.0018	$5.58
v1	890	1.4s	2.3%	$0.0023	$2.04

In this example v3 has 32% lower average latency, roughly a third of the error rate (0.4% vs. 1.1%), and a 33% lower cost per call than v2. That's a clear keep-v3, retire-v2 decision with actual numbers behind it.
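
The arithmetic behind such a comparison can be reproduced from the table values. The sketch below is illustrative only: the VersionRow shape and the pctReduction helper are names invented for this example, with the v2 and v3 figures copied from the table.

```typescript
// Illustrative row shape; field names are assumptions for this sketch.
interface VersionRow {
  version: string
  avgLatencyMs: number
  errorPct: number
  avgCostUsd: number
}

// Percent reduction of `candidate` relative to `baseline`
// (positive means the candidate is lower, i.e. better for these metrics).
function pctReduction(baseline: number, candidate: number): number {
  return ((baseline - candidate) / baseline) * 100
}

const v2: VersionRow = { version: 'v2', avgLatencyMs: 1200, errorPct: 1.1, avgCostUsd: 0.0018 }
const v3: VersionRow = { version: 'v3', avgLatencyMs: 820, errorPct: 0.4, avgCostUsd: 0.0012 }

const latencyGain = pctReduction(v2.avgLatencyMs, v3.avgLatencyMs) // about 31.7
const costGain = pctReduction(v2.avgCostUsd, v3.avgCostUsd) // about 33.3
const errorRatio = v3.errorPct / v2.errorPct // about 0.36
```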

Using it

Creating a prompt version via dashboard

Go to /prompts and click New prompt / version.
Enter a name (e.g. chatbot-system). Reusing an existing name creates a new version of that prompt.
Paste the content. Save.

Creating via API

curl https://spanlens-server.vercel.app/api/v1/prompts \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "chatbot-system",
    "content": "You are a Korean assistant. Be concise.",
    "metadata": { "team": "growth", "tested": true }
  }'

Response includes the auto-assigned version. See the full endpoint list below.
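
For application code, the same request can be assembled in TypeScript. This is a minimal sketch: buildCreatePromptRequest and the two interfaces are hypothetical helpers, not part of the Spanlens SDK. The URL, headers, and body fields are taken from the curl example above.

```typescript
// Hypothetical helper types; only the body fields shown in the curl example are used.
interface CreatePromptBody {
  name: string
  content: string
  metadata?: Record<string, unknown>
}

interface HttpRequest {
  url: string
  method: 'POST'
  headers: Record<string, string>
  body: string
}

// Builds the POST request without sending it, so it can be inspected or tested.
function buildCreatePromptRequest(jwt: string, body: CreatePromptBody): HttpRequest {
  return {
    url: 'https://spanlens-server.vercel.app/api/v1/prompts',
    method: 'POST',
    headers: {
      Authorization: `Bearer ${jwt}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(body),
  }
}

// Usage, in a runtime that provides fetch (e.g. Node 18+ or a browser):
//   const { url, ...init } = buildCreatePromptRequest(jwt, { name, content })
//   const created = await (await fetch(url, init)).json()
```

Separating request construction from sending keeps the helper free of network side effects, which makes it easy to unit test.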

Fetching the comparison data

GET /api/v1/prompts/:name/compare?sinceHours=720

# returns per-version metrics:
#   { version, sampleCount, avgLatencyMs, errorRate, avgCostUsd, totalCostUsd }
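
A typed client for this endpoint might look like the sketch below. The field names come from the comment above; everything else is an assumption, including the numeric field types, the unit of errorRate, and the compareUrl and sortByAvgCost helpers.

```typescript
// Field names follow the documented response; the types are assumed.
interface CompareRow {
  version: number
  sampleCount: number
  avgLatencyMs: number
  errorRate: number // unit (fraction vs. percent) is not specified here
  avgCostUsd: number
  totalCostUsd: number
}

// Build the compare URL for a window expressed in days (30 days = 720 hours).
function compareUrl(baseUrl: string, name: string, sinceDays: number): string {
  const sinceHours = sinceDays * 24
  return `${baseUrl}/api/v1/prompts/${encodeURIComponent(name)}/compare?sinceHours=${sinceHours}`
}

// Return a copy of the rows ordered from cheapest to most expensive per call.
function sortByAvgCost(rows: CompareRow[]): CompareRow[] {
  return [...rows].sort((a, b) => a.avgCostUsd - b.avgCostUsd)
}
```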

API reference

Method + Path	Description
`GET /api/v1/prompts`	List all prompts (latest version per name)
`GET /api/v1/prompts/:name`	Full version history for a prompt name
`GET /api/v1/prompts/:name/compare`	Per-version metrics for A/B comparison
`GET /api/v1/prompts/:name/:version`	Fetch one specific version
`POST /api/v1/prompts`	Create a new version (auto-increments version number)
`DELETE /api/v1/prompts/:name/:version`	Delete one version

Tagging requests with a prompt version

For the A/B table to fill up, each LLM request needs to declare which version it used. The SDK ships two ways to do that — pick whichever fits your call site.

Option 1 — `withPromptVersion()` per call

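import { createOpenAI, withPromptVersion } from '@spanlens/sdk/openai'

const openai = createOpenAI()

// promptV3Content (the text of chatbot-system v3) and userMessage are
// assumed to be defined elsewhere in your application.
const res = await openai.chat.completions.create(
  {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: promptV3Content },
      { role: 'user', content: userMessage },
    ],
  },
  // Tags this request as using version 3 of chatbot-system.
  withPromptVersion('chatbot-system@3'),
)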

The same helper is available from @spanlens/sdk/anthropic for Claude calls.

Option 2 — `observeOpenAI()` with promptVersion option

If you're already using agent tracing, just add one option:

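import { observeOpenAI } from '@spanlens/sdk'

// `trace` is your existing agent trace and `openai` your OpenAI client,
// both created elsewhere.
const res = await observeOpenAI(
  trace,
  { name: 'answer', promptVersion: 'chatbot-system@3' },
  // Forward the headers supplied by observeOpenAI on the underlying request.
  (headers) => openai.chat.completions.create({ /* ... */ }, { headers }),
)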

Accepted id formats

Format	Example	Notes
`name@version`	`chatbot-system@3`	Most common; explicit version pin
`name@latest`	`chatbot-system@latest`	Auto-resolves to the highest version server-side on every call
Raw UUID	`ae1c3c1e-99eb-...`	Use the `id` returned from POST `/api/v1/prompts`

On the server, the header value is looked up in prompt_versions, scoped to your organization. Invalid or unknown values silently resolve to null: the request still succeeds, it just isn't linked to a version.
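
Because a malformed id fails silently, it can be worth checking the string on the client before sending it. The following is a minimal sketch of such a check for the three formats above. parsePromptRef and the PromptRef type are hypothetical, not SDK exports, and the sketch assumes version numbers are positive integers.

```typescript
// Hypothetical client-side classification of a prompt version id.
type PromptRef =
  | { kind: 'pinned'; name: string; version: number }
  | { kind: 'latest'; name: string }
  | { kind: 'uuid'; id: string }

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i

// Returns null for anything that matches none of the three formats.
function parsePromptRef(value: string): PromptRef | null {
  if (UUID_RE.test(value)) return { kind: 'uuid', id: value }

  const at = value.lastIndexOf('@')
  if (at <= 0) return null // no "@", or empty name before it

  const name = value.slice(0, at)
  const tag = value.slice(at + 1)
  if (tag === 'latest') return { kind: 'latest', name }
  if (/^[1-9]\d*$/.test(tag)) return { kind: 'pinned', name, version: Number(tag) }
  return null
}
```

A null result here only means the string has an unexpected shape. A well-formed id can still be unknown to the server, so this check complements the server-side lookup rather than replacing it.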

Limitations

Honest view of what the feature does not do yet:

No editor affordances. The create/edit form is a plain textarea — no diff view, no syntax highlighting, no variable autocomplete. Good enough for now; polish deferred to post-launch.
Comparison window is fixed at 30 days in the UI. The API accepts a sinceHours query parameter; we just haven't wired a UI picker yet.
No statistical-significance hints. If v1 has 5 samples and v2 has 5,000, both show up the same way in the table. Significance flags are on the roadmap.
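
Until significance hints ship, a rough client-side check is possible. The sketch below uses a Wilson score interval for the error rate, which is a standard statistical technique and not something Spanlens computes. The wilsonInterval helper is hypothetical, and the error counts in the usage lines are made-up inputs chosen only to contrast a small sample with a large one.

```typescript
// Wilson score interval for a binomial proportion (here: error rate).
// Returns [lower, upper] as fractions between 0 and 1; z = 1.96 gives roughly 95% confidence.
function wilsonInterval(errors: number, samples: number, z = 1.96): [number, number] {
  if (samples <= 0) return [0, 1] // no data: the rate could be anything

  const p = errors / samples
  const z2 = z * z
  const denom = 1 + z2 / samples
  const center = (p + z2 / (2 * samples)) / denom
  const half =
    (z * Math.sqrt((p * (1 - p)) / samples + z2 / (4 * samples * samples))) / denom

  return [Math.max(0, center - half), Math.min(1, center + half)]
}

// Illustrative inputs: zero errors in 5 samples vs. 55 errors in 5,000 samples.
const smallSample = wilsonInterval(0, 5)
const largeSample = wilsonInterval(55, 5000)
```

With 5 samples the interval is very wide even at zero observed errors, while 5,000 samples give a much narrower one. That is the distinction the comparison table currently does not surface.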

Related: Savings (model substitution recommendations), Traces (agent span tree), /prompts dashboard.