Prompts
Store your prompt templates as named, versioned assets. Every time you tweak a prompt, Spanlens creates a new immutable version. You can then compare versions side by side on real production metrics: average latency, error rate, and cost per call.
Why it matters
Prompts get edited constantly: a line added here, an example rewritten there, a tone shift on Friday afternoon. The unanswered question is always the same: is this actually better, or does it just feel better?
Plain .replace() edits in your codebase give you no answers. Previous versions are lost, you can't roll back, and you never learn which version actually costs less or fails less. Spanlens Prompts fixes that without forcing you to adopt a new runtime or template engine.
How it works
Versioning
Save a prompt under a name (e.g. chatbot-system) in the dashboard. Edit it later → a new version is auto-created with the next number. Old versions stay forever (immutable). No manual version bumps, no schema migrations.
chatbot-system
├─ v1 (2 weeks ago) "You are a helpful assistant..."
├─ v2 (1 week ago) "You are a helpful Korean-speaking assistant..."
└─ v3 (yesterday) "You are a Korean assistant. Be concise..."
Each version stores:
- content: the template body (up to 100K chars)
- variables: typed placeholders like {{userName}} with description and required flag
- metadata: free-form JSON for tags (team, task type, model target, etc.)
- project_id: optional project scope
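The versioning rules and stored fields above can be illustrated with a small in-memory model. This is only a sketch: `PromptStore`, `PromptVariable`, `PromptVersion`, and `MAX_CONTENT_CHARS` are hypothetical names, not part of the Spanlens SDK or schema, and copying variables and metadata on rollback is an assumption (the docs only state that content is copied).

```typescript
// Hypothetical in-memory model of the versioning rules; not the Spanlens schema or SDK.
interface PromptVariable {
  name: string; // "userName" for the {{userName}} placeholder
  description?: string;
  required: boolean;
}

interface PromptVersion {
  readonly name: string;
  readonly version: number;
  readonly content: string;
  readonly variables: readonly PromptVariable[];
  readonly metadata: Readonly<Record<string, unknown>>;
  readonly projectId?: string; // optional project scope
}

const MAX_CONTENT_CHARS = 100000; // "up to 100K chars"; the exact limit is assumed

class PromptStore {
  // Prototype-free object so names like "constructor" are safe keys.
  private readonly byName: Record<string, PromptVersion[]> = Object.create(null);

  history(name: string): readonly PromptVersion[] {
    return this.byName[name] ?? [];
  }

  // Saving under an existing name appends the next version; nothing is overwritten.
  save(
    name: string,
    content: string,
    variables: readonly PromptVariable[] = [],
    metadata: Record<string, unknown> = {},
  ): PromptVersion {
    if (content.length > MAX_CONTENT_CHARS) {
      throw new Error("content exceeds the assumed 100K character limit");
    }
    const previous = this.history(name);
    const created: PromptVersion = Object.freeze({
      name,
      version: previous.length + 1,
      content,
      variables: variables.slice(),
      metadata: { ...metadata },
    });
    this.byName[name] = previous.concat(created);
    return created;
  }

  // Rollback copies an old version forward as a new latest version.
  // Copying variables and metadata along with content is an assumption.
  rollback(name: string, version: number): PromptVersion {
    const old = this.history(name).filter((v) => v.version === version)[0];
    if (old === undefined) throw new Error(`unknown version ${name}@${version}`);
    return this.save(name, old.content, old.variables, old.metadata);
  }
}

const store = new PromptStore();
store.save("chatbot-system", "You are a helpful assistant...");
store.save("chatbot-system", "You are a helpful Korean-speaking assistant...");
const restored = store.rollback("chatbot-system", 1);
console.log(restored.version); // 3: the counter only moves forward
```

The append-only array is the point of the design: a rollback never rewrites history, so any version number seen in production metrics always refers to the same content.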
A/B comparison on real traffic
Click a prompt in /prompts and you'll see a comparison table of every version that has received production traffic in the last 30 days:
| Version | Samples | Avg latency | Error % | Avg cost | Total cost |
|---|---|---|---|---|---|
| v3 | 1,245 | 820ms | 0.4% | $0.0012 | $1.49 |
| v2 | 3,102 | 1.2s | 1.1% | $0.0018 | $5.58 |
| v1 | 890 | 1.4s | 2.3% | $0.0023 | $2.04 |
In this example v3 is 32% faster than v2, has roughly one-third the error rate (0.4% vs 1.1%), and costs 33% less per call. That's a clear keep-v3, retire-v2 decision with actual numbers behind it.
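The arithmetic behind a comparison like this is simple enough to script. The sketch below uses the field names listed for the compare endpoint; `VersionMetrics`, `percentChange`, and `compareVersions` are hypothetical helpers, and treating `errorRate` as a percentage (as in the table) is an assumption about the API's units.

```typescript
// Field names follow the compare endpoint's per-version metrics.
// errorRate is assumed to be a percentage, matching the dashboard table.
interface VersionMetrics {
  version: number;
  sampleCount: number;
  avgLatencyMs: number;
  errorRate: number;
  avgCostUsd: number;
  totalCostUsd: number;
}

// Relative change of `candidate` against `baseline`, in percent (negative = lower).
// The caller is responsible for a non-zero baseline.
function percentChange(candidate: number, baseline: number): number {
  return ((candidate - baseline) / baseline) * 100;
}

function compareVersions(candidate: VersionMetrics, baseline: VersionMetrics) {
  return {
    latencyChangePct: percentChange(candidate.avgLatencyMs, baseline.avgLatencyMs),
    costChangePct: percentChange(candidate.avgCostUsd, baseline.avgCostUsd),
    errorRateRatio: candidate.errorRate / baseline.errorRate,
  };
}

// Rows copied from the example table above.
const v3: VersionMetrics = {
  version: 3, sampleCount: 1245, avgLatencyMs: 820, errorRate: 0.4, avgCostUsd: 0.0012, totalCostUsd: 1.49,
};
const v2: VersionMetrics = {
  version: 2, sampleCount: 3102, avgLatencyMs: 1200, errorRate: 1.1, avgCostUsd: 0.0018, totalCostUsd: 5.58,
};

const delta = compareVersions(v3, v2);
console.log(delta.latencyChangePct.toFixed(1)); // "-31.7"
console.log(delta.costChangePct.toFixed(1)); // "-33.3"
console.log(delta.errorRateRatio.toFixed(2));
```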
Using it
Creating a prompt version via dashboard
- Go to /prompts and click New prompt / version.
- Enter a name (e.g. chatbot-system). Reusing a name → new version.
- Paste the content. Save.
Creating via API
curl https://server.spanlens.io/api/v1/prompts \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{
"name": "chatbot-system",
"content": "You are a Korean assistant. Be concise.",
"metadata": { "team": "growth", "tested": true }
}'
The response includes the auto-assigned version. See the full endpoint list below.
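For callers who prefer TypeScript over curl, the same request might be assembled as below. This is a sketch, not SDK code: `buildCreatePromptRequest`, `CreatePromptBody`, and `HttpRequestSpec` are hypothetical names, and the body fields are limited to the three shown in the curl example. The builder is kept separate from the network call so it can be tested offline.

```typescript
// Mirrors the curl example above; only name, content, and metadata are taken from the docs.
interface CreatePromptBody {
  name: string;
  content: string;
  metadata?: Record<string, unknown>;
}

// Plain description of an HTTP request, compatible with fetch(url, init).
interface HttpRequestSpec {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

const SPANLENS_API = "https://server.spanlens.io/api/v1";

function buildCreatePromptRequest(jwt: string, prompt: CreatePromptBody): HttpRequestSpec {
  if (prompt.name.trim() === "") throw new Error("prompt name must not be empty");
  return {
    url: SPANLENS_API + "/prompts",
    method: "POST",
    headers: {
      Authorization: "Bearer " + jwt,
      "Content-Type": "application/json",
    },
    body: JSON.stringify(prompt),
  };
}

// Usage with the global fetch (Node 18+ or a browser); left commented so the
// sketch performs no network I/O. The response shape is not documented here.
// const { url, ...init } = buildCreatePromptRequest(process.env.SPANLENS_JWT!, {
//   name: "chatbot-system",
//   content: "You are a Korean assistant. Be concise.",
//   metadata: { team: "growth", tested: true },
// });
// const created = await (await fetch(url, init)).json();
```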
Fetching the comparison data
GET /api/v1/prompts/:name/compare?sinceHours=720
# returns per-version metrics:
# { version, sampleCount, avgLatencyMs, errorRate, avgCostUsd, totalCostUsd }
API reference
| Method + Path | Description |
|---|---|
| GET /api/v1/prompts | List all prompts (latest version per name) |
| GET /api/v1/prompts/:name | Full version history for a prompt name |
| GET /api/v1/prompts/:name/compare | Per-version metrics for A/B comparison |
| GET /api/v1/prompts/:name/:version | Fetch one specific version |
| POST /api/v1/prompts | Create a new version (auto-increments version number) |
| POST /api/v1/prompts/:name/:version/rollback | Copy an older version's content as a new (latest) version. The old version is not modified; the version counter always increases. Returns the newly created version. |
| DELETE /api/v1/prompts/:name/:version | Delete one version |
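The paths in the table above can be built with a few small helpers, which keeps the URL-encoding and the `sinceHours` conversion in one place. `promptPath`, `comparePath`, and `rollbackPath` are hypothetical names, and the 30-day default simply mirrors the dashboard's window.

```typescript
const API_PREFIX = "/api/v1/prompts";

// Path for a prompt name, optionally pinned to one version.
function promptPath(name: string, version?: number): string {
  const base = API_PREFIX + "/" + encodeURIComponent(name);
  return version === undefined ? base : base + "/" + version;
}

// The compare endpoint takes its window in hours: 30 days -> sinceHours=720.
function comparePath(name: string, sinceDays: number = 30): string {
  const sinceHours = sinceDays * 24;
  return promptPath(name) + "/compare?sinceHours=" + sinceHours;
}

// POST to this path to copy an older version forward as the new latest version.
function rollbackPath(name: string, version: number): string {
  return promptPath(name, version) + "/rollback";
}

console.log(comparePath("chatbot-system")); // /api/v1/prompts/chatbot-system/compare?sinceHours=720
console.log(rollbackPath("chatbot-system", 1)); // /api/v1/prompts/chatbot-system/1/rollback
```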
Tagging requests with a prompt version
For the A/B table to fill up, each LLM request needs to declare which version it used. The SDK ships two ways to do that; pick whichever fits your call site.
Option 1 — withPromptVersion() per call
import { createOpenAI, withPromptVersion } from '@spanlens/sdk/openai'

const openai = createOpenAI()
const res = await openai.chat.completions.create(
  {
    model: 'gpt-4o-mini',
    messages: [
      { role: 'system', content: promptV3Content },
      { role: 'user', content: userMessage },
    ],
  },
  withPromptVersion('chatbot-system@3'),
)
The same helper exists on @spanlens/sdk/anthropic for Claude calls.
Option 2 — observeOpenAI() with promptVersion option
If you're already using agent tracing, just add one option:
import { observeOpenAI } from '@spanlens/sdk'

const res = await observeOpenAI(
  trace,
  { name: 'answer', promptVersion: 'chatbot-system@3' },
  (headers) => openai.chat.completions.create({ /* ... */ }, { headers }),
)
Accepted id formats
| Format | Example | Notes |
|---|---|---|
| name@version | chatbot-system@3 | Most common; explicit version pin |
| name@latest | chatbot-system@latest | Auto-resolves to the highest version server-side on every call |
| Raw UUID | ae1c3c1e-99eb-... | Use the id returned from POST /api/v1/prompts |
Server-side, the header value is looked up in prompt_versions, scoped to your organization. Invalid or unknown values silently resolve to null (the request still succeeds, it just isn't linked to a version).
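Because unknown values are dropped silently, a typo in the identifier can go unnoticed until the comparison table stays empty. A client-side check of the three accepted formats might look like the sketch below. `parsePromptVersionRef` and `PromptVersionRef` are hypothetical, the server's actual validation rules are not documented here, and the sketch assumes version numbers start at 1.

```typescript
type PromptVersionRef =
  | { kind: "pinned"; name: string; version: number }
  | { kind: "latest"; name: string }
  | { kind: "uuid"; id: string };

const UUID_RE = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

// Returns null for anything that matches none of the three formats,
// mirroring how the server resolves unknown values to null.
function parsePromptVersionRef(value: string): PromptVersionRef | null {
  if (UUID_RE.test(value)) return { kind: "uuid", id: value.toLowerCase() };

  const at = value.lastIndexOf("@");
  if (at <= 0) return null; // no "@", or an empty name before it

  const name = value.slice(0, at);
  const tag = value.slice(at + 1);
  if (tag === "latest") return { kind: "latest", name };
  if (/^[1-9]\d*$/.test(tag)) return { kind: "pinned", name, version: Number(tag) };
  return null;
}

console.log(parsePromptVersionRef("chatbot-system@3")); // { kind: "pinned", name: "chatbot-system", version: 3 }
console.log(parsePromptVersionRef("chatbot-system@latest")); // { kind: "latest", name: "chatbot-system" }
console.log(parsePromptVersionRef("chatbot-system@v3")); // null
```

Running a check like this in a unit test or at startup surfaces a malformed identifier before any traffic is sent, instead of after a day of unlinked requests.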
Sub-tabs inside the Prompts page
Clicking a prompt in the dashboard reveals six sub-tabs:
- Versions: full version list with expandable content preview. Each version has a Roll back button that copies its content as a new latest version (the old version is never deleted, and the version counter always increases).
- Diff: select any two versions for an LCS-based line-level diff (+/− colors).
- Traffic: per-version traffic share with quality color coding (≥90 green / 70–89 yellow / <70 red).
- Calls: per-version call count, latency, error rate, quality, cost, and token totals. Clicking a row drills down to /requests?promptVersionId=.... The Quality column shows the average eval_results score from Evals.
- A/B: live production traffic A/B routing. Different from offline Experiments (see table below). → Full Prompt A/B docs
- Playground: select a version, configure provider key, model, temperature, and variables, then run immediately. Results are not saved to the requests table. Rate limit: 20 req/min/user. → Full Playground docs
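For readers unfamiliar with the technique behind the Diff tab, an LCS-based line diff can be sketched as follows. This is a generic textbook implementation for illustration, not the dashboard's code; `diffLines` and `DiffLine` are hypothetical names.

```typescript
type DiffLine = { op: " " | "+" | "-"; text: string };

function diffLines(oldText: string, newText: string): DiffLine[] {
  const a = oldText.split("\n");
  const b = newText.split("\n");

  // lcs[i][j] = length of the longest common subsequence of a[i..] and b[j..].
  const lcs: number[][] = [];
  for (let i = 0; i <= a.length; i++) {
    const row: number[] = [];
    for (let j = 0; j <= b.length; j++) row.push(0);
    lcs.push(row);
  }
  for (let i = a.length - 1; i >= 0; i--) {
    for (let j = b.length - 1; j >= 0; j--) {
      lcs[i][j] = a[i] === b[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }

  // Walk the table from the start: shared lines are kept, the rest become - or +.
  const out: DiffLine[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      out.push({ op: " ", text: a[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      out.push({ op: "-", text: a[i] });
      i++;
    } else {
      out.push({ op: "+", text: b[j] });
      j++;
    }
  }
  while (i < a.length) out.push({ op: "-", text: a[i++] });
  while (j < b.length) out.push({ op: "+", text: b[j++] });
  return out;
}

const before = "You are a helpful assistant.\nAnswer in Korean.";
const after = "You are a helpful assistant.\nBe concise.";
for (const line of diffLines(before, after)) {
  console.log(line.op + " " + line.text);
}
```

The table costs O(n·m) time and memory for n and m lines, which is acceptable for prompt-sized texts.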
A/B routing vs Experiments
The word "experiment" appears in two places in Spanlens. Here is how they differ:
| | A/B (this tab) | Experiments |
|---|---|---|
| Data | Live production traffic | Offline dataset |
| Timing | Real-time, real users exposed | Runs immediately, no user exposure |
| Measurement | Statistical significance (Welch's t-test) | Direct output comparison + scores |
| Risk | A bad version reaches real users | None |
Complementary: pre-validate with Experiments → confirm in production with A/B.
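The table names Welch's t-test as the A/B measurement. As a rough illustration of what that test computes, the sketch below derives the t statistic and the Welch-Satterthwaite degrees of freedom from two samples. Which metric Spanlens feeds into the test, and how it turns the statistic into a significance verdict, is not documented here; `welchTTest` and its helpers are hypothetical, and a p-value would additionally require the Student t distribution's CDF.

```typescript
interface WelchResult {
  t: number;
  degreesOfFreedom: number;
}

function mean(xs: number[]): number {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum / xs.length;
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs);
  let squares = 0;
  for (const x of xs) squares += (x - m) * (x - m);
  return squares / (xs.length - 1);
}

// Welch's t-test: compares two means without assuming equal variances.
function welchTTest(a: number[], b: number[]): WelchResult {
  if (a.length < 2 || b.length < 2) {
    throw new Error("each sample needs at least 2 observations");
  }
  const va = sampleVariance(a) / a.length; // squared standard error of mean(a)
  const vb = sampleVariance(b) / b.length; // squared standard error of mean(b)
  if (va + vb === 0) throw new Error("both samples have zero variance");

  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb);
  // Welch-Satterthwaite approximation of the degrees of freedom.
  const degreesOfFreedom =
    ((va + vb) * (va + vb)) / ((va * va) / (a.length - 1) + (vb * vb) / (b.length - 1));
  return { t, degreesOfFreedom };
}

// Made-up latency samples (ms) for two versions, purely for illustration.
const result = welchTTest([810, 835, 790, 845, 820], [1180, 1250, 1150, 1300, 1120]);
console.log(result.t, result.degreesOfFreedom);
```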
Limitations
Honest view of what the feature does not do yet:
- No editor affordances. The create/edit form is a plain textarea: no diff view, no syntax highlighting, no variable autocomplete. Good enough for now; polish deferred to post-launch.
- Comparison window is fixed at 30 days in the UI. The API accepts a sinceHours query parameter; we just haven't wired a UI picker yet.
- No statistical-significance hints. If v1 has 5 samples and v2 has 5,000, both show up the same way in the table. Significance flags are on the roadmap.
Related: Evals (response quality scoring), Experiments (offline comparison), Savings (model substitution), Traces, /prompts dashboard.