Evals
Automatically score production response quality using an LLM-as-judge. Cost and latency are already measured — Evals adds quality to the picture so you can answer "did this prompt actually get better?"
The problem it solves
What Spanlens already measures: cost, latency, error rate. What it couldn't measure: whether the response is actually good.
Even if one version is faster and cheaper than another, that comparison is meaningless if its response quality degraded. Evals is the infrastructure for assigning a 0..1 score to response content.
How it works
Define an evaluator
An evaluator is a reusable definition of how to score responses.
- `prompt_name` — which prompt this evaluator targets
- `name` — e.g. "Helpfulness check"
- `type` — `llm_judge` (the only type in this release)
- `config`:
  - `criterion` — scoring criterion sentence
  - `judge_provider` — `openai` / `anthropic`
  - `judge_model` — e.g. `gpt-4o-mini`
  - `scale_min`, `scale_max` — score range (normalized to 0..1 on save; see the example below)
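For example, assuming linear min-max normalization (an assumption — the docs only state that the range is normalized on save), a raw judge score of 4 on a `scale_min` = 1, `scale_max` = 5 evaluator would be stored as (4 - 1) / (5 - 1) = 0.75.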
Run flow
- Go to /evals and click New evaluator to define the criterion.
- Click Run on an evaluator and select version, time window, and sample size.
- The server samples N responses for the given `prompt_version_id` from the `requests` table and asks the judge LLM to score each one (using your provider key).
- Per-sample scores are written to `eval_results` and aggregated into `eval_runs.avg_score` (see the worked example after this list).
- The UI shows the score distribution and the 5 lowest-scoring samples as drilldowns.
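Aggregation is presumably a plain arithmetic mean (only the field name is documented, so treat this as an assumption): a run whose three samples score 0.8, 0.6, and 1.0 would yield `eval_runs.avg_score` = (0.8 + 0.6 + 1.0) / 3 = 0.8.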
Where samples come from
Unlike other evaluation tools, you don't need to build a separate dataset. Spanlens already logs every call, so it samples automatically from production responses that used the given prompt version.
To use a Dataset as the sample source instead, see the Datasets page. The dataset's `expected_output` field becomes the scoring target.
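A rough sketch of a dataset-backed run. Only `"source": "production"` appears in the documented example at the bottom of this page; the `"dataset"` value and the `datasetId` field are hypothetical here, so check the Datasets page for the actual contract:

```bash
# Hypothetical dataset-sourced run; the "dataset" source value and
# the "datasetId" field name are assumptions, not documented API.
curl https://spanlens-server.vercel.app/api/v1/eval-runs \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluatorId": "<evaluator-id>",
    "source": "dataset",
    "datasetId": "<dataset-id>",
    "sampleSize": 50
  }'
```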
How Evals differs from A/B
| | A/B (inside Prompts tab) | Evals (this tab) |
|---|---|---|
| When | Live production traffic routing | Offline scoring |
| Measures | Which version gets more traffic / fewer failures | Response quality score |
| Time to result | Days (waiting for statistical significance) | Minutes (50 samples ≈ 1–2 min) |
| User impact | Real users see the variation | None |
The two tools are complementary. Use Evals to pre-validate whether a version is worth an A/B test, then use A/B to confirm in production.
Quality column in the Calls tab
The Quality column in Prompts → a specific prompt → Calls sub-tab shows the average `eval_results` score from evaluator runs against each version. Versions that have never been evaluated show —.
Color thresholds:
- ≥70 — good (green)
- 40–69 — warn (yellow)
- <40 — bad (red)
LLM judge reliability
A judge score is only meaningful if it correlates with human judgment. If a team member scores responses manually via Annotation, a Pearson r correlation card appears automatically at the top of the Evals page.
- r ≥ 0.7 — Strong (judge can be trusted)
- 0.4 ≤ r < 0.7 — Moderate
- r < 0.4 — Revisit the criterion
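For reference, the Pearson r between paired judge scores x_i and human scores y_i is the standard correlation coefficient:

$$
r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}
$$

where x̄ and ȳ are the sample means. r = 1 is a perfect positive linear relationship; values near 0 mean the judge's scores carry essentially no signal about human judgment.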
Cost
Judge calls are billed to your provider key (Spanlens does not cover them). Approximate cost with `gpt-4o-mini`: ~$0.0005 per evaluation. 50 samples ≈ $0.025.
Guardrails:
- `sample_size` DB CHECK constraint: 1..1000
- Estimated cost card shown in the Run dialog before starting (a curl sketch of the estimate endpoint follows this list)
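To preview judge cost programmatically, a minimal sketch using the estimate endpoint from the API table below. The request body is assumed to mirror the `POST /api/v1/eval-runs` payload shown later; that is not confirmed here:

```bash
# Assumption: /estimate accepts the same body as POST /api/v1/eval-runs
curl https://spanlens-server.vercel.app/api/v1/eval-runs/estimate \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluatorId": "<evaluator-id>",
    "promptVersionId": "<version-id>",
    "source": "production",
    "sampleSize": 50
  }'
```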
API
| Method + Path | Description |
|---|---|
| `POST /api/v1/evaluators` | Create an evaluator |
| `GET /api/v1/evaluators?promptName=...` | List evaluators |
| `DELETE /api/v1/evaluators/:id` | Soft archive |
| `POST /api/v1/eval-runs` | Start a run (returns 202 immediately; runs in background) |
| `POST /api/v1/eval-runs/estimate` | Estimate cost before running |
| `GET /api/v1/eval-runs/:id` | Status and aggregated scores (poll while pending/running) |
| `GET /api/v1/eval-runs/:id/results` | Per-sample scores and reasoning |
Example — create and run an evaluator
```bash
# 1. Define the evaluator
curl https://spanlens-server.vercel.app/api/v1/evaluators \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{
"promptName": "support_reply",
"name": "Helpfulness check",
"type": "llm_judge",
"config": {
"criterion": "Does the response helpfully and clearly answer the customer question?",
"judge_provider": "openai",
"judge_model": "gpt-4o-mini",
"scale_min": 0,
"scale_max": 1
}
}'
# 2. Score v2 with 50 samples from the last 7 days
curl https://spanlens-server.vercel.app/api/v1/eval-runs \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{
"evaluatorId": "<evaluator-id>",
"promptVersionId": "<v2-id>",
"source": "production",
"sampleSize": 50,
"sampleFrom": "2026-05-06T00:00:00Z"
}'
# 3. Poll for results (status: pending → running → completed)
curl https://spanlens-server.vercel.app/api/v1/eval-runs/<run-id> \
-H "Authorization: Bearer $SPANLENS_JWT"bashLimitations
Limitations
- Only the `llm_judge` evaluator type. Heuristic evaluators (regex, JSON schema, length) are planned for a later release.
- One evaluator run at a time. Concurrent runs on the same evaluator are not supported.
- Rows with an empty `response_body` are skipped. Roughly 28% of rows may be skipped due to streaming parser failures, old data, or error responses; the UI reports this as e.g. "47/50 scored".
- The judge itself can be inaccurate. That's why Annotation exists — use it to validate the judge's reliability before relying on the scores.
Related: Datasets (test input sets), Experiments (offline side-by-side comparison), Annotation (human scoring), /evals dashboard.