# Prompts
Store your prompt templates as named, versioned assets. Every time you tweak a prompt, Spanlens creates a new immutable version. Then compare versions side-by-side with real production metrics — average latency, error rate, and cost per call.
## Why it matters
Prompts get edited constantly: a line added here, an example rewritten there, a tone shift on Friday afternoon. The unanswered question is always the same — is this actually better, or does it just feel better?
Plain .replace() edits in your codebase give you no answers. Previous versions are lost, you can't roll back, and you never learn which version actually costs less or fails less. Spanlens Prompts fixes that without forcing you to adopt a new runtime or template engine.
## How it works
### Versioning
Save a prompt under a name (e.g. chatbot-system) in the dashboard. Edit it later → a new version is auto-created with the next number. Old versions stay forever (immutable). No manual version bumps, no schema migrations.
```text
chatbot-system
├─ v1 (2 weeks ago)   "You are a helpful assistant..."
├─ v2 (1 week ago)    "You are a helpful Korean-speaking assistant..."
└─ v3 (yesterday)     "You are a Korean assistant. Be concise..."
```

Each version stores:

- content — the template body (up to 100K chars)
- variables — typed placeholders like {{userName}}, each with a description and a required flag
- metadata — free-form JSON for tags (team, task type, model target, etc.)
- project_id — optional project scope
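Put together, a stored version roughly has the shape sketched below. This is illustrative only: the four field names above come from the docs, everything else (nesting, optionality, exact types) is an assumption, not the official schema.

```ts
// Rough sketch of a prompt version record, based on the fields listed above.
// Property names beyond content/variables/metadata/project_id are assumptions.
interface PromptVersion {
  name: string                         // e.g. "chatbot-system"
  version: number                      // auto-assigned, increments per name
  content: string                      // template body, up to 100K chars
  variables?: Array<{
    name: string                       // e.g. "userName" for {{userName}}
    description?: string
    required: boolean
  }>
  metadata?: Record<string, unknown>   // free-form JSON tags (team, task type, ...)
  project_id?: string                  // optional project scope
}
```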
### A/B comparison on real traffic
Click a prompt in /prompts and you'll see a comparison table of every version that has received production traffic in the last 30 days:
| Version | Samples | Avg latency | Error % | Avg cost | Total cost |
|---|---|---|---|---|---|
| v3 | 1,245 | 820ms | 0.4% | $0.0012 | $1.49 |
| v2 | 3,102 | 1.2s | 1.1% | $0.0018 | $5.58 |
| v1 | 890 | 1.4s | 2.3% | $0.0023 | $2.04 |
In this example v3 is 32% faster, has roughly a third of the error rate, and costs 33% less per call than v2. That's a clear keep-v3, retire-v2 decision with actual numbers behind it.
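Those figures fall straight out of the table; for the v3-versus-v2 comparison:

```ts
// Derived from the table above (v3 vs v2)
const latencyGain = 1 - 820 / 1200     // ≈ 0.32 → ~32% faster
const errorRatio = 0.4 / 1.1           // ≈ 0.36 → roughly a third of v2's error rate
const costGain = 1 - 0.0012 / 0.0018   // ≈ 0.33 → ~33% cheaper per call
```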
## Using it
### Creating a prompt version via dashboard
- Go to /prompts and click New prompt / version.
- Enter a name (e.g. chatbot-system). Reusing a name → new version.
- Paste the content. Save.
### Creating via API
```bash
curl https://spanlens-server.vercel.app/api/v1/prompts \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "chatbot-system",
    "content": "You are a Korean assistant. Be concise.",
    "metadata": { "team": "growth", "tested": true }
  }'
```

Response includes the auto-assigned version. See the full endpoint list below.
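The same call from TypeScript, if curl isn't convenient. A sketch only: the response field name version is an assumption, so check the body you actually get back.

```ts
// Create a new version of "chatbot-system"; the server assigns the version number.
const res = await fetch('https://spanlens-server.vercel.app/api/v1/prompts', {
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.SPANLENS_JWT}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    name: 'chatbot-system',
    content: 'You are a Korean assistant. Be concise.',
    metadata: { team: 'growth', tested: true },
  }),
})

const created = await res.json()
console.log(created.version) // auto-assigned version number (field name assumed)
```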
### Fetching the comparison data
```bash
GET /api/v1/prompts/:name/compare?sinceHours=720
# returns per-version metrics:
# { version, sampleCount, avgLatencyMs, errorRate, avgCostUsd, totalCostUsd }
```
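A minimal TypeScript sketch that pulls the comparison and prints one line per version. It assumes the endpoint returns a JSON array of the per-version objects shown above; the field names come from the comment in the snippet, the array shape is an assumption.

```ts
// Per-version metrics for one prompt name over the last 30 days (720 hours).
type VersionMetrics = {
  version: number
  sampleCount: number
  avgLatencyMs: number
  errorRate: number
  avgCostUsd: number
  totalCostUsd: number
}

const res = await fetch(
  'https://spanlens-server.vercel.app/api/v1/prompts/chatbot-system/compare?sinceHours=720',
  { headers: { Authorization: `Bearer ${process.env.SPANLENS_JWT}` } },
)

const versions: VersionMetrics[] = await res.json() // assumed to be an array

for (const v of versions) {
  console.log(
    `v${v.version}: ${v.sampleCount} samples, ${v.avgLatencyMs}ms avg, ` +
      `errorRate=${v.errorRate}, $${v.avgCostUsd}/call`,
  )
}
```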
### API reference

| Method + Path | Description |
|---|---|
| GET /api/v1/prompts | List all prompts (latest version per name) |
| GET /api/v1/prompts/:name | Full version history for a prompt name |
| GET /api/v1/prompts/:name/compare | Per-version metrics for A/B comparison |
| GET /api/v1/prompts/:name/:version | Fetch one specific version |
| POST /api/v1/prompts | Create a new version (auto-increments the version number) |
| DELETE /api/v1/prompts/:name/:version | Delete one version |
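As one example, GET /api/v1/prompts/:name/:version is handy for loading a pinned template at request time. The content field is described under Versioning above; the rest of the response shape is assumed.

```ts
// Fetch one specific version and pull out its template body.
const res = await fetch(
  'https://spanlens-server.vercel.app/api/v1/prompts/chatbot-system/3',
  { headers: { Authorization: `Bearer ${process.env.SPANLENS_JWT}` } },
)

const promptV3 = await res.json()
const promptV3Content: string = promptV3.content // template body, up to 100K chars
```

This is one place the promptV3Content used in the Option 1 example below could come from.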
### Tagging requests with a prompt version
For the A/B table to fill up, each LLM request needs to declare which version it used. The SDK ships two ways to do that — pick whichever fits your call site.
#### Option 1 — withPromptVersion() per call
```ts
import { createOpenAI, withPromptVersion } from '@spanlens/sdk/openai'

const openai = createOpenAI()

const res = await openai.chat.completions.create(
  {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: promptV3Content },
      { role: 'user', content: userMessage },
    ],
  },
  withPromptVersion('chatbot-system@3'),
)
```

The same helper exists on @spanlens/sdk/anthropic for Claude calls.
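A rough sketch of the Claude-side equivalent. withPromptVersion living on '@spanlens/sdk/anthropic' is stated above; the createAnthropic export name and the wrapper's call shape are assumptions mirroring the OpenAI example.

```ts
// createAnthropic is an assumed export name, not confirmed by the docs above.
import { createAnthropic, withPromptVersion } from '@spanlens/sdk/anthropic'

const anthropic = createAnthropic()

// promptV3Content and userMessage are the same variables as in the OpenAI example.
const res = await anthropic.messages.create(
  {
    model: 'claude-3-5-haiku-latest',
    max_tokens: 1024,
    system: promptV3Content,
    messages: [{ role: 'user', content: userMessage }],
  },
  withPromptVersion('chatbot-system@3'),
)
```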
#### Option 2 — observeOpenAI() with promptVersion option
If you're already using agent tracing, just add one option:
```ts
import { observeOpenAI } from '@spanlens/sdk'

const res = await observeOpenAI(
  trace,
  { name: 'answer', promptVersion: 'chatbot-system@3' },
  (headers) => openai.chat.completions.create({ /* ... */ }, { headers }),
)
```

#### Accepted id formats
| Format | Example | Notes |
|---|---|---|
| name@version | chatbot-system@3 | Most common; explicit version pin |
| name@latest | chatbot-system@latest | Auto-resolves to the highest version server-side on every call |
| Raw UUID | ae1c3c1e-99eb-... | Use the id returned from POST /api/v1/prompts |
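All three formats are accepted by the same helpers, for example:

```ts
import { withPromptVersion } from '@spanlens/sdk/openai'

withPromptVersion('chatbot-system@3')       // explicit version pin
withPromptVersion('chatbot-system@latest')  // resolves to the highest version on every call
withPromptVersion('ae1c3c1e-99eb-...')      // raw UUID returned from POST /api/v1/prompts
```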
Server-side, the header value is looked up in prompt_versions, scoped to your organization. Invalid or unknown values silently resolve to null (the request still succeeds; it just isn't linked to a version).
## Limitations
Honest view of what the feature does not do yet:
- No editor affordances. The create/edit form is a plain textarea — no diff view, no syntax highlighting, no variable autocomplete. Good enough for now; polish deferred to post-launch.
- Comparison window is fixed at 30 days in the UI. The API accepts a sinceHours query parameter; we just haven't wired a UI picker yet.
- No statistical-significance hints. If v1 has 5 samples and v2 has 5,000, both show up the same way in the table. Significance flags are on the roadmap.
Related: Savings (model substitution recommendations), Traces (agent span tree), /prompts dashboard.