Prompt management
Prompt management treats prompts as deployable artifacts with versions, rollout, and rollback — not strings hardcoded in your application. Spanlens provides versioning, side-by-side diff, Prompt A/B with statistical significance, gradual rollout, and one-click rollback without redeploying your code.
Prompt as a versioned entity
A prompt in Spanlens is a named template with an ordered list of prompt_versions. Each version has a UUID, a creation timestamp, a draft/published flag, and the full template body. Templates can include variables ({{customer_name}}) that are filled at request time.
```ts
import OpenAI from 'openai'
import { SpanlensClient } from '@spanlens/sdk'

const client = new SpanlensClient()
const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Resolve the latest published version of a prompt by name
const prompt = await client.prompts.resolve('classify_intent', { tag: 'production' })
const messages = prompt.render({ customer_name: 'Alex' })

// Then send to OpenAI/Anthropic as usual
const response = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages,
})
```
The resolved version ID flows into the response captured by Spanlens via the X-Spanlens-Prompt-Version header, so every request lands in the dashboard already tagged with the prompt version that produced it.
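To make the `{{variable}}` substitution concrete, here is a minimal sketch of how a template body could be filled at request time. `renderTemplate` and `TemplateVars` are hypothetical names used for illustration; this is not the Spanlens SDK's implementation, and the real `prompt.render` may differ (for example, it returns chat messages rather than a single string).

```typescript
// Hypothetical helper for illustration only; not part of @spanlens/sdk.
type TemplateVars = Record<string, string>

function renderTemplate(body: string, vars: TemplateVars): string {
  // Replace every {{name}} placeholder with its value from vars.
  return body.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, name: string) => {
    const value = Object.prototype.hasOwnProperty.call(vars, name) ? vars[name] : undefined
    if (value === undefined) {
      // Fail loudly instead of sending a half-filled prompt to the model.
      throw new Error(`Missing template variable: ${name}`)
    }
    return value
  })
}

const rendered = renderTemplate('Hello {{customer_name}}, how can we help?', {
  customer_name: 'Alex',
})
console.log(rendered)
```

Throwing on a missing variable is a design choice in this sketch: it surfaces template/caller mismatches before a request is sent.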
Prompt A/B with Welch t-test
Spanlens runs prompt versions side by side on a configurable traffic split (default 50/50) and reports statistical significance on four metrics:
| Metric | Test | Why this test |
|---|---|---|
| Latency (ms) | Welch's t-test | Unequal variances expected (one version may be much slower). |
| Cost (USD) | Welch's t-test | Same reasoning as latency: unequal variances are the norm. |
| Error rate (4xx/5xx + parse failures) | Two-proportion z-test | Errors are Bernoulli outcomes, so a t-test would be inappropriate. |
| Quality (eval score) | Welch's t-test | Continuous score on [0, 1]. |
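The two tests in the table can be sketched as follows. This is an illustrative implementation of the standard formulas (Welch's t statistic with Welch-Satterthwaite degrees of freedom, and the pooled two-proportion z statistic), not Spanlens's internal code; the function names and sample numbers are made up. Turning the statistics into p-values requires the t and normal distribution CDFs, which are omitted here.

```typescript
interface Sample { mean: number; variance: number; n: number }

// Sample mean and unbiased (n - 1) variance.
function summarize(xs: number[]): Sample {
  const n = xs.length
  const mean = xs.reduce((acc, x) => acc + x, 0) / n
  const variance = xs.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (n - 1)
  return { mean, variance, n }
}

// Welch's t-test: does not assume the two groups share a variance.
function welchT(a: Sample, b: Sample): { t: number; df: number } {
  const va = a.variance / a.n
  const vb = b.variance / b.n
  const t = (a.mean - b.mean) / Math.sqrt(va + vb)
  // Welch-Satterthwaite approximation for the degrees of freedom.
  const df = (va + vb) ** 2 / (va ** 2 / (a.n - 1) + vb ** 2 / (b.n - 1))
  return { t, df }
}

// Two-proportion z-test with a pooled standard error.
function twoProportionZ(errA: number, nA: number, errB: number, nB: number): number {
  const pA = errA / nA
  const pB = errB / nB
  const pooled = (errA + errB) / (nA + nB)
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB))
  return (pA - pB) / se
}

// Made-up latency samples (ms) and error counts for two prompt versions.
const latencyA = summarize([100, 110, 120, 130, 140])
const latencyB = summarize([90, 95, 100, 105, 110])
const latencyTest = welchT(latencyA, latencyB)
const errorZ = twoProportionZ(30, 1000, 50, 1000)
console.log(latencyTest, errorZ)
```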
The dashboard surfaces p-values and effect sizes. A green "significant improvement" verdict requires p < 0.05 with a directional improvement on the primary metric you picked at experiment setup.
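The verdict rule above can be expressed as a small gate. The sketch below assumes a hypothetical `MetricResult` shape; only the "significant improvement" condition (p < 0.05 plus a directional improvement on the primary metric) comes from the description above, and the other two labels are placeholders for illustration.

```typescript
// Hypothetical types; the real experiment result shape may differ.
type Direction = 'lower_is_better' | 'higher_is_better'

interface MetricResult {
  pValue: number
  controlMean: number
  candidateMean: number
  direction: Direction // e.g. latency and cost are lower_is_better, eval score is higher_is_better
}

type Verdict = 'significant_improvement' | 'significant_regression' | 'inconclusive'

function verdict(result: MetricResult, alpha = 0.05): Verdict {
  // Not significant: no claim either way.
  if (result.pValue >= alpha) return 'inconclusive'
  const delta = result.candidateMean - result.controlMean
  if (delta === 0) return 'inconclusive'
  // Significance alone is not enough; the change must point the right way.
  const improved = result.direction === 'lower_is_better' ? delta < 0 : delta > 0
  return improved ? 'significant_improvement' : 'significant_regression'
}

const latencyVerdict = verdict({
  pValue: 0.01,
  controlMean: 200,
  candidateMean: 180,
  direction: 'lower_is_better',
})
console.log(latencyVerdict)
```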
Gradual rollout
Once an A/B test reaches significance, promote the winner via gradual rollout. Spanlens routes 10% → 50% → 100% of traffic with configurable bake periods, and rolls back automatically if the error rate or eval score degrades past your thresholds during a bake. Rollback is a single API call (or button click) and does not require a redeploy.
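One way to picture staged routing and the automatic rollback check is the sketch below. It is an assumption-laden illustration, not how Spanlens routes traffic: `bucketFor`, `servesNewVersion`, `shouldRollBack`, and the `Metrics` shape are hypothetical, and hashing a request key into 100 buckets is just one common technique for stable percentage splits.

```typescript
// Hypothetical types for illustration.
interface RolloutStage { percent: number; bake_minutes?: number }
interface Thresholds { error_rate_increase: number; eval_score_drop: number }
interface Metrics { errorRate: number; evalScore: number }

// FNV-1a-style hash over UTF-16 code units, mapped to a bucket in 0..99.
function bucketFor(key: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return (hash >>> 0) % 100
}

// A key that gets the new version at 10% keeps getting it at 50% and 100%.
function servesNewVersion(requestKey: string, stage: RolloutStage): boolean {
  return bucketFor(requestKey) < stage.percent
}

// Compare metrics observed during the bake against the pre-rollout baseline.
function shouldRollBack(baseline: Metrics, current: Metrics, limits: Thresholds): boolean {
  const errorRise = current.errorRate - baseline.errorRate
  const scoreDrop = baseline.evalScore - current.evalScore
  return errorRise > limits.error_rate_increase || scoreDrop > limits.eval_score_drop
}

const limits: Thresholds = { error_rate_increase: 0.02, eval_score_drop: 0.05 }
const rollBack = shouldRollBack(
  { errorRate: 0.01, evalScore: 0.9 },
  { errorRate: 0.04, evalScore: 0.9 },
  limits,
)
console.log(servesNewVersion('user-123', { percent: 10, bake_minutes: 30 }), rollBack)
```

The threshold names mirror the `auto_rollback_on` fields in the promote call that follows.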
```ts
await client.prompts.promote('classify_intent', {
  to_version: 'pv_xxx',
  rollout: [
    { percent: 10, bake_minutes: 30 },
    { percent: 50, bake_minutes: 60 },
    { percent: 100 },
  ],
  auto_rollback_on: {
    error_rate_increase: 0.02, // roll back if errors rise by more than 2 percentage points
    eval_score_drop: 0.05, // roll back if the eval score drops by more than 0.05
  },
})
```
Diff and history
Each version stores its full body, so the diff view shows additions and deletions inline. The version history is append-only — published versions cannot be edited, only superseded by a new version. Rollback is achieved by promoting an older version, not by editing.
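The append-only model can be sketched with a small in-memory store. `PromptHistory` and its methods are hypothetical and exist only to illustrate the idea that published versions are frozen and rollback re-points the active version instead of editing anything; the version IDs and bodies are made up.

```typescript
// Hypothetical in-memory model; not the Spanlens storage layer.
interface PromptVersion {
  readonly id: string
  readonly createdAt: Date
  readonly body: string
}

class PromptHistory {
  private readonly versions: PromptVersion[] = []
  private activeId: string | null = null

  // Publishing appends a frozen version and makes it active.
  publish(id: string, body: string): PromptVersion {
    if (this.versions.some((v) => v.id === id)) {
      throw new Error(`Version ${id} already exists and cannot be edited`)
    }
    const version: PromptVersion = Object.freeze({ id, createdAt: new Date(), body })
    this.versions.push(version)
    this.activeId = id
    return version
  }

  // Rollback is just promoting an older version; history is untouched.
  promote(id: string): void {
    if (!this.versions.some((v) => v.id === id)) {
      throw new Error(`Unknown version ${id}`)
    }
    this.activeId = id
  }

  active(): PromptVersion | undefined {
    return this.versions.find((v) => v.id === this.activeId)
  }

  list(): readonly PromptVersion[] {
    return [...this.versions]
  }
}

const history = new PromptHistory()
history.publish('pv_1', 'Classify the intent of the message.')
history.publish('pv_2', 'Classify the intent of the message from {{customer_name}}.')
history.promote('pv_1') // roll back without deleting or editing pv_2
console.log(history.active()?.id, history.list().length)
```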
Where to go next
- Prompts feature page: dashboard surface.
- Prompt A/B: experiment setup.
- Playground: test prompts across models and inputs.
- Evals: how scores feed into A/B decisions.