Prompt A/B
Split real production traffic across two prompt versions and compare latency, cost, and error rate with statistical tests. Use offline Experiments to validate first, then run A/B on real users to make the final call.
A/B vs Experiments — which one to use
Spanlens has two places where the word "experiment" appears. Here is how they differ:
| | A/B (this page) | Experiments (offline) |
|---|---|---|
| Data source | Real user traffic | Pre-defined dataset |
| Timing | Real-time (days to weeks) | Runs immediately (minutes) |
| User exposure | Yes — real users see it | None |
| Measurement | Statistical significance (p-value) | Side-by-side output comparison + scores |
| Key metrics | Latency, cost, error rate | Response quality, score distribution |
| Risk | A bad version reaches real users | None |
Recommended order: validate with Experiments → confirm with A/B in production. The two tools are complementary, not alternatives.
How it works
Creating an A/B experiment tells the server to split incoming requests for a given prompt name between Version A (control) and Version B (challenger) according to the trafficSplit ratio. The split is applied automatically to requests that pass through the Spanlens proxy. Each request result is recorded in the requests table and accumulates until the experiment ends or is manually stopped.
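Conceptually, the split amounts to a weighted coin flip per request. A minimal sketch of that idea (illustrative only, not Spanlens internals; `assign_arm` is a hypothetical name):

```python
import random

def assign_arm(traffic_split_b: int) -> str:
    """Pick an experiment arm for one request.

    traffic_split_b is the percentage of traffic routed to Version B
    (the challenger), matching the trafficSplit field (1-99).
    """
    return "B" if random.randint(1, 100) <= traffic_split_b else "A"

# With trafficSplit=20, roughly 20% of requests land on Version B.
counts = {"A": 0, "B": 0}
for _ in range(10_000):
    counts[assign_arm(20)] += 1
```

Because assignment is random per request rather than exact, realized traffic only approximates the configured ratio; the approximation tightens as request volume grows.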
Creating an experiment
```bash
POST /api/v1/prompt-experiments
```
Auth: JWT (`Authorization: Bearer $SPANLENS_JWT`)
Request parameters
| Field | Type | Required | Description |
|---|---|---|---|
| `promptName` | string | Yes | Prompt name to run the experiment on (e.g. `chatbot-system`) |
| `versionAId` | string (UUID) | Yes | Prompt version ID for the control arm |
| `versionBId` | string (UUID) | Yes | Prompt version ID for the challenger arm |
| `trafficSplit` | integer | Optional (default 50) | Percentage of traffic sent to Version B (1–99). E.g. 20 means B: 20%, A: 80%. The default, 50, is an even split. |
| `endsAt` | string (ISO 8601) | Optional | Auto-end date/time. If omitted, the experiment runs until stopped manually. |
| `projectId` | string (UUID) | Optional | Scope the experiment to a specific project. Defaults to organization-wide. |
Example
```bash
curl https://spanlens-server.vercel.app/api/v1/prompt-experiments \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "promptName": "chatbot-system",
    "versionAId": "ae1c3c1e-99eb-4f2a-b821-000000000001",
    "versionBId": "ae1c3c1e-99eb-4f2a-b821-000000000002",
    "trafficSplit": 20,
    "endsAt": "2026-06-01T00:00:00Z"
  }'
```
Experiment status
| status | Meaning |
|---|---|
| `running` | Actively splitting traffic and accumulating data |
| `concluded` | Ended automatically when `endsAt` was reached, or a winner was declared |
| `stopped` | Manually stopped before conclusion; no winner declared |
Statistical metrics
Statistical tests are computed in real time for each experiment across three metrics:
| Metric | Test | Significance threshold |
|---|---|---|
| Latency | Welch's t-test | p-value < 0.05 |
| Cost | Welch's t-test | p-value < 0.05 |
| Error rate | Fisher's exact test | p-value < 0.05 |
A p-value < 0.05 means the difference is statistically significant. With small sample sizes (tens of requests), the tests have little power and p-values tend to stay high — wait a few days for data to accumulate before drawing conclusions.
Welch's t-test is valid even when the two groups have unequal variances. Fisher's exact test is more appropriate for binary (success/failure) metrics like error rate.
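For intuition, both tests can be sketched in pure Python. These helpers are illustrative, not Spanlens internals; converting the Welch t statistic to a p-value additionally requires the t-distribution CDF (e.g. from `scipy.stats`):

```python
from math import comb
from statistics import mean, variance

def welch_t(xs, ys):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two samples (e.g. per-request latencies) with possibly
    unequal variances."""
    nx, ny = len(xs), len(ys)
    vx, vy = variance(xs), variance(ys)
    se2 = vx / nx + vy / ny
    t = (mean(xs) - mean(ys)) / se2 ** 0.5
    df = se2 ** 2 / ((vx / nx) ** 2 / (nx - 1) + (vy / ny) ** 2 / (ny - 1))
    return t, df

def fisher_exact_p(a_err, a_ok, b_err, b_ok):
    """Two-sided Fisher's exact test on a 2x2 error-count table:
    the probability, assuming equal true error rates, of seeing a
    table at least as extreme as the one observed."""
    row_a, row_b = a_err + a_ok, b_err + b_ok
    col_err = a_err + b_err
    total = comb(row_a + row_b, col_err)

    def prob(k):  # P(version A has exactly k errors | fixed margins)
        return comb(row_a, k) * comb(row_b, col_err - k) / total

    p_obs = prob(a_err)
    lo, hi = max(0, col_err - row_b), min(row_a, col_err)
    # Sum the probabilities of every table no more likely than observed.
    return sum(p for k in range(lo, hi + 1)
               if (p := prob(k)) <= p_obs * (1 + 1e-9))
```

For example, `fisher_exact_p(2, 98, 10, 90)` — 2% vs. 10% error rates over 100 requests each — yields a p-value below 0.05, while identical error counts yield a p-value of 1.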
Declaring a winner
When one version is statistically better, declare it the winner. The experiment status changes to concluded and traffic splitting stops.
PATCH /api/v1/prompt-experiments/:id
```bash
curl -X PATCH https://spanlens-server.vercel.app/api/v1/prompt-experiments/<experiment-id> \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "winnerVersionId": "ae1c3c1e-99eb-4f2a-b821-000000000002"
  }'
```
To promote the winning version as the production default, use the Roll back button on the Prompts page or create a new version with the winning content.
Duplicate experiment guard
If a running experiment already exists for the same `promptName`, creating a new one returns `409 Conflict`. Stop the existing experiment first, or wait for its `endsAt` to pass, before starting a new one.
API reference
| Method + Path | Description |
|---|---|
| `POST /api/v1/prompt-experiments` | Create an experiment and start the traffic split |
| `GET /api/v1/prompt-experiments?promptName=...` | List experiments for a prompt name (newest first) |
| `GET /api/v1/prompt-experiments/:id` | Experiment status, aggregated metrics, and p-values |
| `PATCH /api/v1/prompt-experiments/:id` | Set the winner or manually stop (`status: "stopped"`) |
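As a sketch of calling these endpoints from code, the helper below builds (but does not send) the corresponding HTTP requests with the standard library. `build_request`, the token value, and the experiment id are illustrative, not part of any Spanlens SDK:

```python
import json
import urllib.request

BASE = "https://spanlens-server.vercel.app/api/v1/prompt-experiments"

def build_request(token, method="GET", path="", body=None):
    """Build an authenticated urllib Request for the endpoints above."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(BASE + path, data=data, method=method)
    req.add_header("Authorization", f"Bearer {token}")
    if body is not None:
        req.add_header("Content-Type", "application/json")
    return req

# Manually stop a running experiment without declaring a winner
# (the id "1234" is a placeholder):
req = build_request("my-jwt", "PATCH", "/1234", {"status": "stopped"})
```

Send the request with `urllib.request.urlopen(req)` and parse the JSON response body to read the experiment's aggregated metrics and p-values.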
Limitations
- Two arms only. Only A and B can be compared at once; to evaluate more than two versions, run separate experiments.
- One running experiment per prompt name. A new experiment cannot be created while one is already running.
- Statistical significance requires sufficient samples. With only a few dozen calls per day, it may take weeks to reach a meaningful conclusion. Use Experiments for faster offline validation first.
- Response quality is not measured. Only latency, cost, and error rate are tracked. Pair with Evals for quality scoring.
Related: Prompts (version management), Experiments (offline dataset comparison), Evals (LLM-as-judge quality scoring), /prompts dashboard.