Prompt A/B
Split real production traffic across two prompt versions and compare latency, cost, and error rate with statistical tests. Use offline Experiments to validate first, then run A/B on real users to make the final call.
A/B vs Experiments — which one to use
Spanlens has two places where the word "experiment" appears. Here is how they differ:
| | A/B (this page) | Experiments (offline) |
|---|---|---|
| Data source | Real user traffic | Pre-defined dataset |
| Timing | Real-time (days to weeks) | Runs immediately (minutes) |
| User exposure | Yes — real users see it | None |
| Measurement | Statistical significance (p-value) | Side-by-side output comparison + scores |
| Key metrics | Latency, cost, error rate | Response quality, score distribution |
| Risk | A bad version reaches real users | None |
Recommended order: validate with Experiments → confirm with A/B in production. The two tools are complementary, not alternatives.
How it works
Creating an A/B experiment tells the server to split incoming requests for a given prompt name between Version A (control) and Version B (challenger) according to the trafficSplit ratio. The split is applied automatically to requests that pass through the Spanlens proxy. Each request result is recorded in the requests table and accumulates until the experiment ends or is manually stopped.
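Conceptually, the split amounts to a weighted coin flip per request. A minimal sketch of that idea (illustrative only, not Spanlens internals; `assign_arm` is a hypothetical name):

```python
import random

def assign_arm(traffic_split_b: int) -> str:
    """Pick an experiment arm for one request.

    traffic_split_b is the percentage of traffic routed to Version B
    (the challenger), matching the trafficSplit field (1-99).
    """
    return "B" if random.randint(1, 100) <= traffic_split_b else "A"

# With trafficSplit=20, roughly 20% of requests land on Version B.
counts = {"A": 0, "B": 0}
for _ in range(10_000):
    counts[assign_arm(20)] += 1
```

Because assignment is random per request rather than exact, realized traffic only approximates the configured ratio; the approximation tightens as request volume grows.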
Creating an experiment
```bash
POST /api/v1/prompt-experiments
```
Auth: JWT (`Authorization: Bearer $SPANLENS_JWT`)
Request parameters
| Field | Type | Required | Description |
|---|---|---|---|
| `promptName` | string | Yes | Prompt name to run the experiment on (e.g. `chatbot-system`) |
| `versionAId` | string (UUID) | Yes | Prompt version ID for the control arm |
| `versionBId` | string (UUID) | Yes | Prompt version ID for the challenger arm |
| `trafficSplit` | integer | Optional (default 50) | Percentage of traffic sent to Version B (1–99). E.g. 20 means B: 20%, A: 80%. The default, 50, is an even split. |
| `endsAt` | string (ISO 8601) | Optional | Auto-end date/time. If omitted, the experiment runs until stopped manually. |
| `projectId` | string (UUID) | Optional | Scope the experiment to a specific project. Defaults to organization-wide. |
Example
```bash
curl https://spanlens-server.vercel.app/api/v1/prompt-experiments \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "promptName": "chatbot-system",
    "versionAId": "ae1c3c1e-99eb-4f2a-b821-000000000001",
    "versionBId": "ae1c3c1e-99eb-4f2a-b821-000000000002",
    "trafficSplit": 20,
    "endsAt": "2026-06-01T00:00:00Z"
  }'
```
Experiment status
| status | Meaning |
|---|---|
| `running` | Actively splitting traffic and accumulating data |
| `concluded` | Ended automatically when `endsAt` was reached, or a winner was declared |
| `stopped` | Manually stopped before conclusion; no winner declared |
Statistical metrics
Statistical tests are computed in real time for each experiment across three metrics:
| Metric | Test | Significance threshold |
|---|---|---|
| Latency | Welch's t-test | p-value < 0.05 |
| Cost | Welch's t-test | p-value < 0.05 |
| Error rate | Fisher's exact test | p-value < 0.05 |
A p-value < 0.05 means the difference is statistically significant. With small sample sizes (tens of requests), the tests have little power and p-values tend to stay high — wait a few days for data to accumulate before drawing conclusions.
Welch's t-test is valid even when the two groups have unequal variances. Fisher's exact test is more appropriate for binary (success/failure) metrics like error rate.
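For intuition, both tests can be sketched in pure Python. These helpers are illustrative, not Spanlens internals; converting the Welch t statistic to a p-value additionally requires the t-distribution CDF (e.g. from `scipy.stats`):

```python
from math import comb
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples (e.g. per-request latencies) with possibly
    unequal variances."""
    nx, ny = len(xs), len(ys)
    vx, vy = variance(xs), variance(ys)
    se2 = vx / nx + vy / ny
    t = (mean(xs) - mean(ys)) / se2 ** 0.5
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

def fisher_exact_p(a_err, a_ok, b_err, b_ok):
    """Two-sided Fisher's exact test on a 2x2 error-count table:
    the probability, assuming equal true error rates, of seeing a
    table at least as extreme as the one observed."""
    row_a, row_b = a_err + a_ok, b_err + b_ok
    col_err = a_err + b_err
    total = comb(row_a + row_b, col_err)

    def prob(k):  # P(version A has exactly k errors | fixed margins)
        return comb(row_a, k) * comb(row_b, col_err - k) / total

    p_obs = prob(a_err)
    lo, hi = max(0, col_err - row_b), min(row_a, col_err)
    # Sum the probabilities of every table no more likely than observed.
    return sum(p for k in range(lo, hi + 1)
               if (p := prob(k)) <= p_obs * (1 + 1e-9))
```

For example, `fisher_exact_p(2, 98, 10, 90)` — 2% vs. 10% error rates over 100 requests each — yields a p-value below 0.05, while identical error counts yield a p-value of 1.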
Declaring a winner
When one version is statistically better, declare it the winner. The experiment status changes to concluded and traffic splitting stops.
PATCH /api/v1/prompt-experiments/:id
```bash
curl -X PATCH https://spanlens-server.vercel.app/api/v1/prompt-experiments/<experiment-id> \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "winnerVersionId": "ae1c3c1e-99eb-4f2a-b821-000000000002"
  }'
```
To promote the winning version as the production default, use the Roll back button on the Prompts page or create a new version with the winning content.
Duplicate experiment guard
If a running experiment already exists for the same `promptName`, creating a new one returns `409 Conflict`. Stop the existing experiment first, or wait for its `endsAt` to pass, before starting a new one.
API reference
| Method + Path | Description |
|---|---|
| `POST /api/v1/prompt-experiments` | Create an experiment and start the traffic split |
| `GET /api/v1/prompt-experiments?promptName=...` | List experiments for a prompt name (newest first) |
| `GET /api/v1/prompt-experiments/:id` | Experiment status, aggregated metrics, and p-values |
| `PATCH /api/v1/prompt-experiments/:id` | Set the winner or manually stop (`status: "stopped"`) |
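As a sketch of calling these endpoints from code, the helper below builds (but does not send) the corresponding HTTP requests with the standard library. `build_request`, the token value, and the experiment id are illustrative, not part of any Spanlens SDK:

```python
import json
import urllib.request

BASE = "https://spanlens-server.vercel.app/api/v1/prompt-experiments"

def build_request(token, method="GET", path="", body=None):
    """Build an authenticated urllib Request for the endpoints above."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if body is not None:
        req.add_header("Content-Type", "application/json")
    return req

# Manually stop a running experiment without declaring a winner
# (the id "1234" is a placeholder):
req = build_request("my-jwt", "PATCH", "/1234", {"status": "stopped"})
```

Send the request with `urllib.request.urlopen(req)` and parse the JSON response body to read the experiment's aggregated metrics and p-values.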
Limitations
- Two arms only. Only A and B can be compared at once; to evaluate more than two versions, run separate experiments.
- One running experiment per prompt name. A new experiment cannot be created while one is already running.
- Statistical significance requires sufficient samples. With only a few dozen calls per day, it may take weeks to reach a meaningful conclusion. Use Experiments for faster offline validation first.
- Response quality is not measured. Only latency, cost, and error rate are tracked. Pair with Evals for quality scoring.
Related: Prompts (version management), Experiments (offline dataset comparison), Evals (LLM-as-judge quality scoring), /prompts dashboard.