Prompt A/B

Split real production traffic across two prompt versions and compare latency, cost, and error rate with statistical tests. Use offline Experiments to validate first, then run A/B on real users to make the final call.

A/B vs Experiments — which one to use

Spanlens has two places where the word "experiment" appears. Here is how they differ:

| | A/B (this page) | Experiments (offline) |
|---|---|---|
| Data source | Real user traffic | Pre-defined dataset |
| Timing | Real-time (days to weeks) | Runs immediately (minutes) |
| User exposure | Yes (real users see it) | None |
| Measurement | Statistical significance (p-value) | Side-by-side output comparison + scores |
| Key metrics | Latency, cost, error rate | Response quality, score distribution |
| Risk | A bad version reaches real users | None |

Recommended order: validate with Experiments → confirm with A/B in production. The two tools are complementary, not alternatives.

How it works

Creating an A/B experiment tells the server to split incoming requests for a given prompt name between Version A (control) and Version B (challenger) according to the trafficSplit ratio. The split is applied automatically to requests that pass through the Spanlens proxy. Each request result is recorded in the requests table and accumulates until the experiment ends or is manually stopped.
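
To build intuition for what a given trafficSplit means, the sketch below mimics the assignment client-side. It is purely illustrative: the real routing decision is made by the Spanlens proxy, not by your code.

```bash
# Illustrative only: approximates the behavior of trafficSplit = 20.
# The actual assignment happens server-side in the Spanlens proxy.
TRAFFIC_SPLIT=20   # percentage of requests routed to Version B

for i in $(seq 1 10); do
  if (( RANDOM % 100 < TRAFFIC_SPLIT )); then
    echo "request $i -> Version B (challenger)"
  else
    echo "request $i -> Version A (control)"
  fi
done
```

Over many requests, roughly 20% land on Version B and 80% on Version A; because each assignment is random, short runs can deviate noticeably from the configured ratio.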

Creating an experiment

POST /api/v1/prompt-experiments

Auth: JWT (Authorization: Bearer $SPANLENS_JWT)

Request parameters

| Field | Type | Required | Description |
|---|---|---|---|
| promptName | string | Yes | Prompt name to run the experiment on (e.g. chatbot-system) |
| versionAId | string (UUID) | Yes | Prompt version ID for the control arm |
| versionBId | string (UUID) | Yes | Prompt version ID for the challenger arm |
| trafficSplit | integer | Optional (default 50) | Percentage of traffic sent to Version B (1–99). E.g. 20 means 20% to B and 80% to A. The default 50 is an even split. |
| endsAt | string (ISO 8601) | Optional | Auto-end date/time. If omitted, the experiment runs until stopped manually. |
| projectId | string (UUID) | Optional | Scopes the experiment to a specific project. Defaults to organization-wide. |

Example

```bash
curl https://spanlens-server.vercel.app/api/v1/prompt-experiments \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "promptName":   "chatbot-system",
    "versionAId":   "ae1c3c1e-99eb-4f2a-b821-000000000001",
    "versionBId":   "ae1c3c1e-99eb-4f2a-b821-000000000002",
    "trafficSplit": 20,
    "endsAt":       "2026-06-01T00:00:00Z"
  }'
```
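
When scripting the follow-up calls shown later on this page, it helps to capture the new experiment's id. The exact response shape is not documented here, so the jq filter below assumes the created experiment is returned as JSON with an id field; adjust it if your response differs.

```bash
# Same request as above, but capturing the new experiment's id for later calls.
# Assumption: the response body is the created experiment as JSON with an "id" field.
EXPERIMENT_ID=$(curl -s https://spanlens-server.vercel.app/api/v1/prompt-experiments \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "promptName":   "chatbot-system",
    "versionAId":   "ae1c3c1e-99eb-4f2a-b821-000000000001",
    "versionBId":   "ae1c3c1e-99eb-4f2a-b821-000000000002",
    "trafficSplit": 20
  }' | jq -r '.id')

echo "Created experiment: $EXPERIMENT_ID"
```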

Experiment status

| status | Meaning |
|---|---|
| running | Actively splitting traffic and accumulating data |
| concluded | Ended automatically when endsAt was reached, or a winner was set |
| stopped | Manually stopped before conclusion; no winner declared |

Statistical metrics

Three statistical tests are computed in real time for each experiment:

| Metric | Test | Significance threshold |
|---|---|---|
| Latency | Welch's t-test | p-value < 0.05 |
| Cost | Welch's t-test | p-value < 0.05 |
| Error rate | Fisher's exact test | p-value < 0.05 |

A p-value < 0.05 means the difference is statistically significant. With small sample sizes (tens of requests), p-values cluster near 1 — wait a few days for data to accumulate before drawing conclusions.

Welch's t-test is valid even when the two groups have unequal variances. Fisher's exact test is more appropriate for binary (success/failure) metrics like error rate.
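
To check how the tests are trending while the experiment runs, fetch the experiment by id. Per the API reference below, the detail endpoint returns the status, aggregated metrics, and p-values; the response field names are not documented on this page, so the example simply pretty-prints the JSON.

```bash
# Fetch status, aggregated metrics, and p-values for one experiment.
# <experiment-id> is the id of the experiment created earlier.
curl -s https://spanlens-server.vercel.app/api/v1/prompt-experiments/<experiment-id> \
  -H "Authorization: Bearer $SPANLENS_JWT" | jq .
```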

Declaring a winner

When one version is statistically better, declare it the winner. The experiment status changes to concluded and traffic splitting stops.

PATCH /api/v1/prompt-experiments/:id

```bash
curl -X PATCH https://spanlens-server.vercel.app/api/v1/prompt-experiments/<experiment-id> \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "winnerVersionId": "ae1c3c1e-99eb-4f2a-b821-000000000002"
  }'
```

To promote the winning version as the production default, use the Roll back button on the Prompts page, or create a new version with the winning content.

Duplicate experiment guard

If a running experiment already exists for the same promptName, creating a new one returns 409 Conflict. Stop the existing experiment first or wait for its endsAt before starting a new one.
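
If you hit the 409, stop the running experiment and retry. Based on the PATCH endpoint listed in the API reference below, a manual stop looks roughly like this; the body field is inferred from the status: "stopped" behavior, so treat it as an assumption.

```bash
# Manually stop the running experiment so a new one can be created for this prompt name.
# Assumption: the stop is requested by PATCHing {"status": "stopped"}.
curl -X PATCH https://spanlens-server.vercel.app/api/v1/prompt-experiments/<experiment-id> \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{"status": "stopped"}'
```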

API reference

| Method + Path | Description |
|---|---|
| POST /api/v1/prompt-experiments | Create experiment and start traffic split |
| GET /api/v1/prompt-experiments?promptName=... | List experiments for a prompt name (newest first) |
| GET /api/v1/prompt-experiments/:id | Experiment status, aggregated metrics, and p-values |
| PATCH /api/v1/prompt-experiments/:id | Set winner or manually stop (status: "stopped") |
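
For example, to list every experiment for a prompt name (quoting the URL so the shell does not interpret the ?), something like:

```bash
# List experiments for a prompt name, newest first; jq pretty-prints the JSON response.
curl -s "https://spanlens-server.vercel.app/api/v1/prompt-experiments?promptName=chatbot-system" \
  -H "Authorization: Bearer $SPANLENS_JWT" | jq .
```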

Limitations

  • Two arms only. Only A and B can be compared at once. To compare more than two versions, run additional experiments one after another (only one experiment can run per prompt name at a time).
  • One running experiment per prompt name. A new experiment cannot be created while one is already running.
  • Statistical significance requires sufficient samples. With only a few dozen calls per day, it may take weeks to reach a meaningful conclusion. Use Experiments for faster offline validation first.
  • Response quality is not measured. Only latency, cost, and error rate are tracked. Pair with Evals for quality scoring.

Related: Prompts (version management), Experiments (offline dataset comparison), Evals (LLM-as-judge quality scoring), /prompts dashboard.