Experiments
Run each input in a dataset through two prompt versions, compare outputs with a word-level diff, and optionally score both sides with an evaluator. Answer "is v3 actually better than v2?" in minutes, with no impact on production traffic.
A/B (Prompts) vs Experiments
Spanlens has two places where the word "experiment" appears. They serve different purposes:
| | A/B (inside Prompts tab) | Experiments (this page) |
|---|---|---|
| Data | Live production traffic | Offline dataset |
| Timing | Real-time (days to weeks) | Runs immediately (minutes) |
| Measurement | Statistical significance (Welch's t-test) | Direct output comparison + scores |
| Risk | A bad version reaches real users | None |
| Cost predictability | Hard to predict (days of live traffic) | Exact (items × 2 runs + items × 2 judge calls) |
They are complementary: use Experiments to pre-validate → use A/B to confirm in production.
Run flow
- Go to /experiments and click New experiment.
- Choose name → prompt → Version A (control) / Version B (challenger) → dataset → optional evaluator → run provider / model.
- The server runs both versions against each dataset item using the same model, with a concurrency of 3 (see the sketch after this list).
- If an evaluator is specified, both outputs are scored by the LLM judge.
- Results appear as: KPI cards (avg_A, avg_B, Δ, total_cost) + expandable rows with word-level diff highlighting.
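Conceptually, the per-item loop resembles the sketch below. This is illustrative only: `dataset_items.json` and `run_item.sh` are hypothetical stand-ins for server internals, where `run_item.sh` would send one item through one prompt version via your provider.

```bash
# Illustrative only: run every dataset item through both arms, at most 3 items in flight.
# run_item.sh and dataset_items.json are hypothetical; the real loop runs server-side.
jq -c '.[]' dataset_items.json |
  xargs -P 3 -I{} sh -c './run_item.sh version_a "$1" && ./run_item.sh version_b "$1"' _ {}
```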
Word-level diff highlighting
Expanding a result row shows both outputs side by side, with differences color-coded.
- Red — words present in A but not in B
- Green — words present in B but not in A
- Identical words have no highlight
This is a simple token-level comparison rather than a semantic diff, but it immediately shows which parts of the output changed.
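A similar word-level comparison can be reproduced locally, for example with git's word diff. This is only an approximation of the UI's highlighting, not necessarily the same tokenization; `output_a.txt` and `output_b.txt` are hypothetical files holding each arm's output.

```bash
# Compare two saved outputs word by word; removed words show in red, added words in green.
git diff --no-index --word-diff=color output_a.txt output_b.txt
```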
Cost visibility
All LLM calls are billed to your provider key (Spanlens does not cover them). Approximate breakdown:
- Prompt runs: dataset items × 2 (one per arm)
- Judge calls (if an evaluator is set): + dataset items × 2
- Total calls = items × 2 × (2 if evaluator, else 1)
Example: a 50-item dataset with an evaluator → 50 × 2 × 2 = 200 LLM calls. With gpt-4o-mini, that typically costs under $0.10.
Hard cap: dataset items limited to 200 per experiment.
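To sanity-check the call count before launching a run, the formula above can be evaluated directly. The numbers below are illustrative.

```bash
ITEMS=50          # dataset size (hard-capped at 200 per experiment)
HAS_EVALUATOR=1   # 1 if an evaluator is configured, 0 otherwise
CALLS=$(( ITEMS * 2 * (HAS_EVALUATOR ? 2 : 1) ))
echo "Total LLM calls: $CALLS"   # -> 200 for this example
```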
API
| Method + Path | Description |
|---|---|
| POST /api/v1/experiments | Create and start in the background (returns 202 immediately) |
| GET /api/v1/experiments?promptName=... | List experiments (max 50) |
| GET /api/v1/experiments/:id | Status and aggregated scores (poll while pending/running) |
| GET /api/v1/experiments/:id/results | Per-item results for both arms, with dataset_items joined |
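For example, listing recent experiments for a given prompt from the command line (the prompt name here is illustrative, matching the example below):

```bash
# List up to 50 experiments filtered by prompt name.
curl -s "https://spanlens-server.vercel.app/api/v1/experiments?promptName=support_reply" \
  -H "Authorization: Bearer $SPANLENS_JWT"
```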
Example
```bash
curl https://spanlens-server.vercel.app/api/v1/experiments \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "support v2 vs v3",
    "promptName": "support_reply",
    "versionAId": "<v2-id>",
    "versionBId": "<v3-id>",
    "datasetId": "<dataset-id>",
    "evaluatorId": "<optional-evaluator-id>",
    "runProvider": "openai",
    "runModel": "gpt-4o-mini"
  }'
```
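The create call returns 202 right away; after that, poll the status endpoint until the run leaves pending/running, then fetch per-item results. A minimal sketch, assuming the response carries the experiment id in an `id` field and its state in a `status` field (actual field names may differ):

```bash
EXPERIMENT_ID="<id-from-create-response>"
BASE="https://spanlens-server.vercel.app/api/v1/experiments"

# Poll while the experiment is still pending or running.
while :; do
  STATUS=$(curl -s -H "Authorization: Bearer $SPANLENS_JWT" "$BASE/$EXPERIMENT_ID" | jq -r '.status')
  echo "status: $STATUS"
  case "$STATUS" in pending|running) sleep 5 ;; *) break ;; esac
done

# Per-item results for both arms, with dataset_items joined.
curl -s -H "Authorization: Bearer $SPANLENS_JWT" "$BASE/$EXPERIMENT_ID/results"
```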
Input handling rules
How a dataset item's input shape determines what gets sent to the model:
- `{ "variables": {...} }` — substitutes `{{var}}` placeholders in the prompt content and passes the result as the user message.
- `{ "messages": [...] }` — extracts the last user message and passes it in the user role (the prompt content becomes the system role).
Limitations
- Two arms only. Compare more than two versions by running separate experiments.
- Same model for both arms. Both versions run with the same `run_model`. To compare different models, run two separate experiments.
- No pause / resume. Once started, the experiment runs to completion or fails.
- 200-item hard cap. For large-scale regression testing, split the dataset across multiple experiments (see the sketch below).
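One way to stay under the cap is to split a large item list into chunks of at most 200 and create one experiment per chunk. A rough sketch, assuming a hypothetical `items.json` holding a JSON array of dataset items:

```bash
# Produces chunk_aa, chunk_ab, ... each holding at most 200 items (one JSON object per line).
jq -c '.[]' items.json | split -l 200 - chunk_
```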
Related: Datasets, Evals, Prompts (A/B live traffic routing), /experiments dashboard.