Prompt management

Prompt management treats prompts as deployable artifacts with versions, rollout, and rollback, rather than strings hardcoded in your application. Spanlens provides versioning, side-by-side diffs, prompt A/B testing with statistical significance checks, gradual rollout, and one-click rollback, all without redeploying your code.

Prompt as a versioned entity

A prompt in Spanlens is a named template with an ordered list of prompt_versions. Each version has a UUID, a creation timestamp, a draft/published flag, and the full template body. Templates can include variables ({{customer_name}}) that are filled at request time.

import OpenAI from 'openai'
import { SpanlensClient } from '@spanlens/sdk'

const client = new SpanlensClient()
const openai = new OpenAI() // reads OPENAI_API_KEY from the environment

// Resolve the latest published version of a prompt by name
const prompt = await client.prompts.resolve('classify_intent', { tag: 'production' })
// Fill the template variables at request time
const messages = prompt.render({ customer_name: 'Alex' })

// Then send to OpenAI/Anthropic as usual
const response = await openai.chat.completions.create({
  model: 'gpt-4o-mini',
  messages,
})

The resolved version ID flows into the response captured by Spanlens via the X-Spanlens-Prompt-Version header, so every request lands in the dashboard already tagged with the prompt version that produced it.
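To make the version entity and its {{variable}} templates more concrete, here is a minimal sketch of how such a record and its render step could look. The `PromptVersion` interface, its field names, and `renderTemplate` are hypothetical illustrations, not the Spanlens schema or SDK internals.

```typescript
// Hypothetical shape of one prompt version (field names are assumptions).
interface PromptVersion {
  id: string         // UUID
  createdAt: string  // ISO-8601 creation timestamp
  published: boolean // false while the version is still a draft
  body: string       // full template text with {{variable}} placeholders
}

// Replace every {{name}} placeholder with the matching value.
// Throws on a missing variable instead of silently emitting an empty string.
function renderTemplate(body: string, vars: Record<string, string>): string {
  return body.replace(/\{\{\s*(\w+)\s*\}\}/g, (_match: string, name: string) => {
    const value = Object.prototype.hasOwnProperty.call(vars, name)
      ? vars[name]
      : undefined
    if (value === undefined) {
      throw new Error(`Missing template variable: ${name}`)
    }
    return value
  })
}

const version: PromptVersion = {
  id: '00000000-0000-0000-0000-000000000000',
  createdAt: '2024-01-01T00:00:00Z',
  published: true,
  body: 'Classify the intent of the message from {{customer_name}}.',
}

const rendered = renderTemplate(version.body, { customer_name: 'Alex' })
```

Failing loudly on a missing variable is a design choice in this sketch; it keeps a half-filled template from reaching the model unnoticed.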

Prompt A/B with Welch t-test

Spanlens runs prompt versions side by side on a configurable traffic split (default 50/50) and reports statistical significance on four metrics:

| Metric | Test | Why this test |
| --- | --- | --- |
| Latency (ms) | Welch's t-test | Unequal variances expected (one version may be much slower). |
| Cost (USD) | Welch's t-test | Same as latency: unequal variance is the norm. |
| Error rate (4xx/5xx + parse failures) | Two-proportion z-test | Errors are Bernoulli outcomes, so a t-test would be incorrect here. |
| Quality (eval score) | Welch's t-test | Continuous score on [0, 1]. |
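For readers who want to see what Welch's t-test computes for the continuous metrics in the table, the following is a minimal sketch of the test statistic and the Welch-Satterthwaite degrees of freedom. The function names are illustrative assumptions and this is not the Spanlens implementation. Turning `t` and `df` into a p-value needs the Student's t distribution, which is left to a statistics library.

```typescript
interface WelchResult {
  t: number  // test statistic
  df: number // Welch-Satterthwaite degrees of freedom
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length
}

// Unbiased sample variance (divides by n - 1).
function sampleVariance(xs: number[]): number {
  const m = mean(xs)
  return xs.reduce((sum, x) => sum + (x - m) ** 2, 0) / (xs.length - 1)
}

// Welch's t-test does not pool the variances, so each arm keeps its own.
function welchTTest(a: number[], b: number[]): WelchResult {
  if (a.length < 2 || b.length < 2) {
    throw new Error('Each sample needs at least two observations')
  }
  const va = sampleVariance(a) / a.length // squared standard error, arm A
  const vb = sampleVariance(b) / b.length // squared standard error, arm B
  const t = (mean(a) - mean(b)) / Math.sqrt(va + vb)
  const df =
    (va + vb) ** 2 /
    (va ** 2 / (a.length - 1) + vb ** 2 / (b.length - 1))
  return { t, df }
}

// Example: latencies (ms) from two hypothetical prompt versions.
const welch = welchTTest([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
```

Because the variances are not pooled, `df` is generally not an integer and shrinks when one arm is much noisier than the other.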

The dashboard surfaces p-values and effect sizes. A green "significant improvement" verdict requires p < 0.05 with a directional improvement on the primary metric you picked at experiment setup.
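The error-rate row and the verdict rule above can be sketched together as a two-proportion z-test followed by a direction check. This is a sketch under stated assumptions: the type and function names are hypothetical, the p-value is two-sided, and the normal CDF uses the Abramowitz and Stegun polynomial approximation of erf (accurate to roughly 1e-7), which may differ from what the dashboard uses.

```typescript
// Abramowitz & Stegun 7.1.26 approximation of the error function.
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1
  const ax = Math.abs(x)
  const t = 1 / (1 + 0.3275911 * ax)
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t
  return sign * (1 - poly * Math.exp(-ax * ax))
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2))
}

interface ArmCounts {
  errors: number
  requests: number
}

// Pooled two-proportion z-test; z > 0 means the candidate errs more often.
function twoProportionZTest(control: ArmCounts, candidate: ArmCounts) {
  const p1 = control.errors / control.requests
  const p2 = candidate.errors / candidate.requests
  const pooled =
    (control.errors + candidate.errors) / (control.requests + candidate.requests)
  const se = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.requests + 1 / candidate.requests),
  )
  if (se === 0) return { z: 0, pValue: 1 } // no errors (or only errors) in both arms
  const z = (p2 - p1) / se
  const pValue = 2 * (1 - normalCdf(Math.abs(z))) // two-sided
  return { z, pValue }
}

// "Significant improvement": p below alpha AND the candidate's error rate is lower.
function isSignificantImprovement(
  control: ArmCounts,
  candidate: ArmCounts,
  alpha = 0.05,
): boolean {
  const { pValue } = twoProportionZTest(control, candidate)
  const improved =
    candidate.errors / candidate.requests < control.errors / control.requests
  return improved && pValue < alpha
}

const verdict = isSignificantImprovement(
  { errors: 100, requests: 1000 },
  { errors: 50, requests: 1000 },
)
```

The direction check matters: a small p-value alone only says the two versions differ, not that the candidate is the better one.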

Gradual rollout

Once an A/B test reaches significance, promote the winner via gradual rollout. Spanlens shifts traffic in stages (for example 10% → 50% → 100%) with a configurable bake period at each stage, and rolls back automatically if the error rate or eval score degrades past the thresholds you set during a bake. Rollback is a single API call (or button click) and does not require a redeploy.

await client.prompts.promote('classify_intent', {
  to_version: 'pv_xxx',
  rollout: [
    { percent: 10, bake_minutes: 30 },
    { percent: 50, bake_minutes: 60 },
    { percent: 100 },
  ],
  auto_rollback_on: {
    error_rate_increase: 0.02, // roll back if errors rise by more than 2 percentage points
    eval_score_drop: 0.05,     // roll back if the eval score drops by more than 0.05
  },
})
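Two mechanics behind a staged rollout like the one configured above can be sketched in a few lines: deciding which requests go to the new version at a given percentage, and deciding when a bake has failed. Both functions below are illustrative assumptions, not Spanlens internals. In particular, hashing a request key into 100 buckets and treating the thresholds as absolute differences are choices made for this sketch.

```typescript
// FNV-1a style 32-bit hash over UTF-16 code units (illustrative choice).
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193) >>> 0
  }
  return hash >>> 0
}

// A key (for example a user or session ID) always lands in the same bucket
// in [0, 100), so raising the percentage only ever adds keys to the candidate.
function routesToCandidate(key: string, percent: number): boolean {
  return fnv1a(key) % 100 < percent
}

interface BakeStats {
  errorRate: number // fraction of failed requests, 0..1
  evalScore: number // mean eval score, 0..1
}

interface RollbackThresholds {
  error_rate_increase: number
  eval_score_drop: number
}

// Compare the candidate against the baseline using absolute differences.
function shouldRollBack(
  baseline: BakeStats,
  candidate: BakeStats,
  thresholds: RollbackThresholds,
): boolean {
  const errorIncrease = candidate.errorRate - baseline.errorRate
  const scoreDrop = baseline.evalScore - candidate.evalScore
  return (
    errorIncrease > thresholds.error_rate_increase ||
    scoreDrop > thresholds.eval_score_drop
  )
}

const thresholds = { error_rate_increase: 0.02, eval_score_drop: 0.05 }
const rollBack = shouldRollBack(
  { errorRate: 0.01, evalScore: 0.9 },
  { errorRate: 0.04, evalScore: 0.9 },
  thresholds,
)
```

Stable bucketing means a user who saw the new version at the 10% stage keeps seeing it at 50%, which avoids flip-flopping between prompt versions mid-rollout.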

Diff and history

Each version stores its full body, so the diff view shows additions and deletions inline. The version history is append-only — published versions cannot be edited, only superseded by a new version. Rollback is achieved by promoting an older version, not by editing.
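The append-only model described above can be illustrated with a small in-memory sketch. The `PromptHistory` class and its method names are hypothetical and exist only to show the idea: publishing appends a frozen version, and rollback moves the active pointer to an older version without touching any stored body.

```typescript
interface Version {
  readonly id: string
  readonly body: string
}

class PromptHistory {
  private readonly versions: Version[] = []
  private activeId: string | null = null

  // Publishing always appends; existing entries are never modified.
  publish(body: string): Version {
    const version: Version = Object.freeze({
      id: `v${this.versions.length + 1}`,
      body,
    })
    this.versions.push(version)
    this.activeId = version.id
    return version
  }

  // Rollback is just promotion of an older version.
  promote(id: string): void {
    if (!this.versions.some((v) => v.id === id)) {
      throw new Error(`Unknown version: ${id}`)
    }
    this.activeId = id
  }

  active(): Version | undefined {
    return this.versions.find((v) => v.id === this.activeId)
  }

  list(): readonly Version[] {
    return this.versions
  }
}

const history = new PromptHistory()
const v1 = history.publish('Classify the intent.')
const v2 = history.publish('Classify the intent in one word.')
history.promote(v1.id) // roll back to v1; v2 stays in the history
```

Because nothing is ever edited or removed, the history doubles as an audit trail, and any two entries can be diffed from their stored bodies.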

Where to go next