Experiments
Run each input in a dataset through two prompt versions, compare outputs with a word-level diff, and optionally score both sides with an evaluator. Answer "is v3 actually better than v2?" in minutes, with no impact on production traffic.
A/B (Prompts) vs Experiments
Spanlens has two places where the word "experiment" appears. They serve different purposes:
| | A/B (inside Prompts tab) | Experiments (this page) |
|---|---|---|
| Data | Live production traffic | Offline dataset |
| Timing | Real-time (days to weeks) | Runs immediately (minutes) |
| Measurement | Statistical significance (Welch's t-test) | Direct output comparison + scores |
| Risk | A bad version reaches real users | None |
| Cost predictability | Hard to predict (days of live traffic) | Exact (items × 2 runs + items × 2 judge calls) |
They are complementary: use Experiments to pre-validate → use A/B to confirm in production.
Run flow
- Go to /experiments and click New experiment.
- Choose name → prompt → Version A (control) / Version B (challenger) → dataset → optional evaluator → run provider / model.
- The server runs both versions against each dataset item using the same model, with a concurrency of 3 (see the sketch after this list).
- If an evaluator is specified, both outputs are scored by the LLM judge.
- Results appear as: KPI cards (avg_A, avg_B, Δ, total_cost) + expandable rows with word-level diff highlighting.
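Conceptually, the per-item loop resembles the sketch below. This is illustrative only: `dataset_items.json` and `run_item.sh` are hypothetical stand-ins for server internals, where `run_item.sh` would send one item through one prompt version via your provider.

```bash
# Illustrative only: run every dataset item through both arms, at most 3 items in flight.
# run_item.sh and dataset_items.json are hypothetical; the real loop runs server-side.
jq -c '.[]' dataset_items.json |
  xargs -P 3 -I{} sh -c './run_item.sh version_a "$1" && ./run_item.sh version_b "$1"' _ {}
```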
Word-level diff highlighting
Expanding a result row shows both outputs side by side, with differences color-coded.
- Red — words present in A but not in B
- Green — words present in B but not in A
- Identical words have no highlight
This is a simple token-level comparison rather than a semantic diff, but it immediately shows which parts of the output changed.
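A similar word-level comparison can be reproduced locally, for example with git's word diff. This is only an approximation of the UI's highlighting, not necessarily the same tokenization; `output_a.txt` and `output_b.txt` are hypothetical files holding each arm's output.

```bash
# Compare two saved outputs word by word; removed words show in red, added words in green.
git diff --no-index --word-diff=color output_a.txt output_b.txt
```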
Cost visibility
All LLM calls are billed to your provider key (Spanlens does not cover them). Approximate breakdown:
- Prompt runs: dataset items × 2 (one per arm)
- Judge calls (if an evaluator is set): + dataset items × 2
- Total calls = items × 2 × (2 if evaluator, else 1)
Example: a 50-item dataset with an evaluator → 50 × 2 × 2 = 200 LLM calls. With gpt-4o-mini, that typically costs under $0.10.
Hard cap: dataset items limited to 200 per experiment.
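To sanity-check the call count before launching a run, the formula above can be evaluated directly. The numbers below are illustrative.

```bash
ITEMS=50          # dataset size (hard-capped at 200 per experiment)
HAS_EVALUATOR=1   # 1 if an evaluator is configured, 0 otherwise
CALLS=$(( ITEMS * 2 * (HAS_EVALUATOR ? 2 : 1) ))
echo "Total LLM calls: $CALLS"   # -> 200 for this example
```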
API
| Method + Path | Description |
|---|---|
| POST /api/v1/experiments | Create and start in the background (returns 202 immediately) |
| GET /api/v1/experiments?promptName=... | List experiments (max 50) |
| GET /api/v1/experiments/:id | Status and aggregated scores (poll while pending/running) |
| GET /api/v1/experiments/:id/results | Per-item results for both arms, with dataset_items joined |
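For example, listing recent experiments for a given prompt from the command line (the prompt name here is illustrative, matching the example below):

```bash
# List up to 50 experiments filtered by prompt name.
curl -s "https://spanlens-server.vercel.app/api/v1/experiments?promptName=support_reply" \
  -H "Authorization: Bearer $SPANLENS_JWT"
```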
Example
```bash
curl https://spanlens-server.vercel.app/api/v1/experiments \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support v2 vs v3",
    "promptName": "support_reply",
    "versionAId": "<v2-id>",
    "versionBId": "<v3-id>",
    "datasetId": "<dataset-id>",
    "evaluatorId": "<optional-evaluator-id>",
    "runProvider": "openai",
    "runModel": "gpt-4o-mini"
  }'
```
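The create call returns 202 right away; after that, poll the status endpoint until the run leaves pending/running, then fetch per-item results. A minimal sketch, assuming the response carries the experiment id in an `id` field and its state in a `status` field (actual field names may differ):

```bash
EXPERIMENT_ID="<id-from-create-response>"
BASE="https://spanlens-server.vercel.app/api/v1/experiments"

# Poll while the experiment is still pending or running.
while :; do
  STATUS=$(curl -s -H "Authorization: Bearer $SPANLENS_JWT" "$BASE/$EXPERIMENT_ID" | jq -r '.status')
  echo "status: $STATUS"
  case "$STATUS" in pending|running) sleep 5 ;; *) break ;; esac
done

# Per-item results for both arms, with dataset_items joined.
curl -s -H "Authorization: Bearer $SPANLENS_JWT" "$BASE/$EXPERIMENT_ID/results"
```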
Input handling rules
How a dataset item's input shape determines what gets sent to the model:
- `{ "variables": {...} }` — substitutes `{{var}}` placeholders in the prompt content and passes the result as the user message.
- `{ "messages": [...] }` — extracts the last user message and passes it in the user role (the prompt content becomes the system role).
Limitations
- Two arms only. Compare more than two versions by running separate experiments.
- Same model for both arms. Both versions run with the same `run_model`. To compare different models, run two separate experiments.
- No pause / resume. Once started, the experiment runs to completion or fails.
- 200-item hard cap. For large-scale regression testing, split the dataset across multiple experiments (see the sketch below).
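One way to stay under the cap is to split a large item list into chunks of at most 200 and create one experiment per chunk. A rough sketch, assuming a hypothetical `items.json` holding a JSON array of dataset items:

```bash
# Produces chunk_aa, chunk_ab, ... each holding at most 200 items (one JSON object per line).
jq -c '.[]' items.json | split -l 200 - chunk_
```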
Related: Datasets, Evals, Prompts (A/B live traffic routing), /experiments dashboard.