Evals

Automatically score production response quality using an LLM-as-judge. Cost and latency are already measured; Evals adds quality to the picture so you can answer the question "did this prompt actually get better?"

The problem it solves

What Spanlens already measures: cost, latency, error rate. What it couldn't measure: whether the response is actually good.

Even if v1 is faster and cheaper than v2, that comparison is meaningless if the response quality degraded. Evals is the infrastructure for assigning a 0..1 score to response content.

How it works

Define an evaluator

An evaluator is a reusable definition of how to score responses.

prompt_name — which prompt this evaluator targets
name — e.g. "Helpfulness check"
type — llm_judge (the only type in this release)
config:
- criterion — scoring criterion sentence
- judge_provider — openai / anthropic
- judge_model — e.g. gpt-4o-mini
- scale_min, scale_max — score range (normalized to 0..1 on save)
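As an illustration of what "normalized to 0..1" could mean, here is a minimal sketch that assumes plain linear (min-max) normalization with clamping. The function name `normalize_score` and the clamping behavior are assumptions for illustration; the source does not specify the exact formula Spanlens applies.

```python
def normalize_score(raw: float, scale_min: float, scale_max: float) -> float:
    """Map a raw judge score on [scale_min, scale_max] onto 0..1.

    Assumption: linear min-max normalization; out-of-range scores are clamped.
    """
    if scale_max <= scale_min:
        raise ValueError("scale_max must be greater than scale_min")
    # Clamp first so a judge that answers outside the scale cannot escape 0..1.
    clamped = min(max(raw, scale_min), scale_max)
    return (clamped - scale_min) / (scale_max - scale_min)
```

Under this assumption, a judge answer of 3 on a 1..5 scale is stored as 0.5.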

Run flow

1. Go to /evals and click New evaluator to define the criterion.
2. Click Run on an evaluator and select version, time window, and sample size.
3. The server samples N responses for the given prompt_version_id from the requests table and asks the judge LLM to score each one (using your provider key).
4. Per-sample scores are written to eval_results and aggregated into eval_runs.avg_score.
5. The UI shows the score distribution and the 5 lowest-scoring samples as drilldowns.
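The aggregation in the last two steps can be sketched roughly as follows. This is not the server's implementation: the result shape (a list of dicts with a `score` key) and the function name `summarize_run` are hypothetical, and the average is assumed to be a plain arithmetic mean over scored samples.

```python
from statistics import mean

def summarize_run(results, worst_n=5):
    """Aggregate per-sample scores into an average and the lowest-scoring samples.

    `results` is a hypothetical list of dicts such as
    {"request_id": "r1", "score": 0.8}; samples without a score are ignored.
    """
    scored = [r for r in results if r.get("score") is not None]
    if not scored:
        return {"avg_score": None, "lowest": []}
    avg = mean(r["score"] for r in scored)
    # Lowest scores first, so the weakest samples can be shown as drilldowns.
    lowest = sorted(scored, key=lambda r: r["score"])[:worst_n]
    return {"avg_score": avg, "lowest": lowest}
```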

Where samples come from

Unlike other evaluation tools, you don't need to build a separate dataset. Spanlens already logs every call, so it samples automatically from production responses that used the given prompt version.

To use a Dataset as the sample source instead, see the Datasets page. The dataset's expected_output field becomes the scoring target.

How Evals differs from A/B

	A/B (inside Prompts tab)	Evals (this tab)
When	Live production traffic routing	Offline scoring
Measures	Which version gets more traffic / fewer failures	Response quality score
Time to result	Days (waiting for statistical significance)	Minutes (50 samples ≈ 1–2 min)
User impact	Real users see the variation	None

The two tools are complementary. Use Evals to pre-validate whether a version is worth an A/B test, then use A/B to confirm in production.

Quality column in the Calls tab

The Quality column in Prompts → a specific prompt → Calls sub-tab shows the average eval_results score from evaluators run on the Evals page. Versions that have never been evaluated show —.

Color thresholds:

≥70 — good (green)
40–69 — warn (yellow)
<40 — bad (red)
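The thresholds above read as percentages while stored scores are 0..1, so the sketch below assumes the column displays the score multiplied by 100. That mapping, the function name `quality_band`, and the `"unscored"` label are assumptions for illustration only.

```python
def quality_band(avg_score):
    """Classify a 0..1 average score into the color bands listed above.

    Assumption: the displayed number is avg_score * 100, so the cutoffs
    70 and 40 correspond to 0.70 and 0.40 on the stored scale.
    """
    if avg_score is None:
        return "unscored"  # the UI shows a dash placeholder in this case
    if avg_score >= 0.70:
        return "good"  # green
    if avg_score >= 0.40:
        return "warn"  # yellow
    return "bad"  # red
```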

LLM judge reliability

A judge score is only meaningful if it correlates with human judgment. If a team member scores responses manually via Annotation, a Pearson r correlation card appears automatically at the top of the Evals page.

r ≥ 0.7 — Strong (judge can be trusted)
0.4 ≤ r < 0.7 — Moderate
r < 0.4 — Revisit the criterion
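To make the correlation card concrete, here is a small sketch of the standard sample Pearson correlation between judge scores and human scores, followed by the banding listed above. The function names are hypothetical, and how Spanlens pairs annotations with judge scores is not described in the source.

```python
from math import sqrt

def pearson_r(judge_scores, human_scores):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(judge_scores) != len(human_scores) or len(judge_scores) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    n = len(judge_scores)
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((j - mean_j) * (h - mean_h)
              for j, h in zip(judge_scores, human_scores))
    var_j = sum((j - mean_j) ** 2 for j in judge_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)
    if var_j == 0 or var_h == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / sqrt(var_j * var_h)

def reliability_label(r):
    """Map r onto the bands from the text above."""
    if r >= 0.7:
        return "Strong"
    if r >= 0.4:
        return "Moderate"
    return "Revisit the criterion"
```

Note that a strongly negative r (the judge systematically disagrees with humans) also falls into the "Revisit the criterion" band under these cutoffs.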

Cost

Judge calls are billed to your provider key (Spanlens does not cover them). Approximate cost with gpt-4o-mini: ~$0.0005 per evaluation. 50 samples ≈ $0.025.

Guardrails:

sample_size DB CHECK constraint: 1..1000
Estimated cost card shown in the Run dialog before starting
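The arithmetic behind the estimate can be sketched as below. It uses the approximate per-evaluation figure quoted above for gpt-4o-mini and mirrors the 1..1000 sample-size limit; the function name is hypothetical, and the real estimate endpoint may compute cost differently (for example from token counts).

```python
APPROX_COST_PER_EVAL_USD = 0.0005  # rough gpt-4o-mini figure quoted in the text

def estimate_run_cost(sample_size, cost_per_eval=APPROX_COST_PER_EVAL_USD):
    """Rough cost of one run: sample_size judge calls at a flat per-call price."""
    # Mirrors the documented guardrail on sample_size (1..1000).
    if not 1 <= sample_size <= 1000:
        raise ValueError("sample_size must be between 1 and 1000")
    return sample_size * cost_per_eval
```

With these numbers, 50 samples come to about $0.025 and the maximum of 1000 samples to about $0.50.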

API

Method + Path	Description
`POST /api/v1/evaluators`	Create an evaluator
`GET /api/v1/evaluators?promptName=...`	List evaluators
`DELETE /api/v1/evaluators/:id`	Soft archive
`POST /api/v1/eval-runs`	Start a run (returns 202 immediately; runs in background)
`POST /api/v1/eval-runs/estimate`	Estimate cost before running
`GET /api/v1/eval-runs/:id`	Status and aggregated scores (poll while pending/running)
`GET /api/v1/eval-runs/:id/results`	Per-sample scores and reasoning

Example — create and run an evaluator

# 1. Define the evaluator
curl https://spanlens-server.vercel.app/api/v1/evaluators \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "promptName": "support_reply",
    "name": "Helpfulness check",
    "type": "llm_judge",
    "config": {
      "criterion": "Does the response helpfully and clearly answer the customer question?",
      "judge_provider": "openai",
      "judge_model": "gpt-4o-mini",
      "scale_min": 0,
      "scale_max": 1
    }
  }'

# 2. Score v2 with 50 samples from the last 7 days
curl https://spanlens-server.vercel.app/api/v1/eval-runs \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluatorId": "<evaluator-id>",
    "promptVersionId": "<v2-id>",
    "source": "production",
    "sampleSize": 50,
    "sampleFrom": "2026-05-06T00:00:00Z"
  }'

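# 3. Poll for results (status: pending → running → completed).
#    Replace <run-id> with your run's ID; the URL is quoted so the shell
#    does not treat < and > as redirections.
curl "https://spanlens-server.vercel.app/api/v1/eval-runs/<run-id>" \
  -H "Authorization: Bearer $SPANLENS_JWT"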

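Step 3 is usually run in a loop. The sketch below shows one way to poll until the run leaves the pending/running states. The `fetch_status` callable is a hypothetical stand-in for a GET on /api/v1/eval-runs/:id, and the assumption that the response is a dict with a `status` key goes beyond what the source states.

```python
import time

def wait_for_run(fetch_status, run_id, interval_s=2.0, timeout_s=300.0,
                 sleep=time.sleep):
    """Poll a run until it is no longer pending or running.

    fetch_status(run_id) is a caller-supplied function (hypothetical) that
    returns the run as a dict containing at least a "status" key.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        run = fetch_status(run_id)
        if run["status"] not in ("pending", "running"):
            return run  # e.g. completed, or any other terminal state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {run['status']} after {timeout_s}s")
        sleep(interval_s)
```

Injecting `fetch_status` and `sleep` keeps the loop independent of any particular HTTP client and makes it easy to exercise without network access.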

Limitations

Only llm_judge evaluator type. Heuristic evaluators (regex, JSON schema, length) are planned for a later release.
One evaluator run at a time. Concurrent runs on the same evaluator are not supported.
Rows with empty response_body are skipped. Roughly 28% of rows may be skipped due to streaming parser failures, old data, or error responses. The UI reports how many of the sampled rows were actually scored, for example "47/50 scored".
The judge itself can be inaccurate. That's why Annotation exists — use it to validate the judge's reliability before relying on the scores.
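The skipping behavior described in the list above can be pictured with a short sketch. The row shape (dicts with a string `response_body`) and both function names are assumptions for illustration; the source only states that rows with an empty response_body are skipped and that the UI reports a scored count.

```python
def split_scorable(rows):
    """Separate rows the judge can score from rows with an empty response_body.

    Assumption: response_body is a string or None; whitespace-only counts as empty.
    """
    scorable = [r for r in rows if (r.get("response_body") or "").strip()]
    skipped = len(rows) - len(scorable)
    return scorable, skipped

def scored_label(scored, sampled):
    """Format the count the way the UI example does, e.g. '47/50 scored'."""
    return f"{scored}/{sampled} scored"
```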

Related: Datasets (test input sets), Experiments (offline side-by-side comparison), Annotation (human scoring), /evals dashboard.