Evals

Automatically score production response quality using an LLM-as-judge. Cost and latency are already measured — Evals adds quality to the picture so you can answer "did this prompt actually get better?"

The problem it solves

What Spanlens already measures: cost, latency, error rate. What it couldn't measure until now: whether a response is actually good.

Even if one version is faster and cheaper than another, the comparison is meaningless if response quality degraded. Evals is the infrastructure for assigning a 0..1 quality score to response content.

How it works

Define an evaluator

An evaluator is a reusable definition of how to score responses.

  • prompt_name — which prompt this evaluator targets
  • name — e.g. "Helpfulness check"
  • type — llm_judge (the only type in this release)
  • config:
    • criterion — scoring criterion sentence
    • judge_provider — openai / anthropic
    • judge_model — e.g. gpt-4o-mini
    • scale_min, scale_max — score range (normalized to 0..1 on save; see the worked example below)
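
For example, with scale_min 1 and scale_max 5, a raw judge score of 4 is saved as (4 - 1) / (5 - 1) = 0.75, assuming the standard min-max normalization that "normalized to 0..1 on save" implies.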

Run flow

  1. Go to /evals and click New evaluator to define the criterion.
  2. Click Run on an evaluator and select version, time window, and sample size.
  3. The server samples N responses for the given prompt_version_id from the requests table and asks the judge LLM to score each one (using your provider key).
  4. Per-sample scores are written to eval_results and aggregated into eval_runs.avg_score.
  5. The UI shows the score distribution and the five lowest-scoring samples as drilldowns (also reproducible over the API; see the sketch after this list).
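
Step 5's drilldown can also be reproduced over the API. A minimal sketch, assuming the results endpoint returns a JSON array of objects with a numeric score field (the exact response shape is not confirmed here):

```bash
# List the five lowest-scoring samples for a finished run.
# Assumes a top-level JSON array with a numeric "score" per item; adjust the
# jq path if the real payload nests results under a key.
curl -s https://spanlens-server.vercel.app/api/v1/eval-runs/<run-id>/results \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  | jq 'sort_by(.score) | .[:5]'
```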

Where samples come from

Unlike evaluation tools that require a separately curated test set, you don't need to build a dataset first. Spanlens already logs every call, so it samples automatically from production responses that used the given prompt version.

To use a Dataset as the sample source instead, see the Datasets page. The dataset's expected_output field becomes the scoring target.
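
A hypothetical run request using a dataset as the source. The "source": "dataset" value and the datasetId field are assumptions modeled on the production example further down, not confirmed field names:

```bash
# Hypothetical: score against a Dataset instead of production traffic.
# "source": "dataset" and "datasetId" are assumed names; see the Datasets page.
curl https://spanlens-server.vercel.app/api/v1/eval-runs \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluatorId": "<evaluator-id>",
    "promptVersionId": "<v2-id>",
    "source": "dataset",
    "datasetId": "<dataset-id>",
    "sampleSize": 50
  }'
```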

How Evals differs from A/B

|                | A/B (inside Prompts tab)                          | Evals (this tab)               |
| -------------- | ------------------------------------------------- | ------------------------------ |
| When           | Live production traffic routing                   | Offline scoring                |
| Measures       | Which version gets more traffic / fewer failures  | Response quality score         |
| Time to result | Days (waiting for statistical significance)       | Minutes (50 samples ≈ 1–2 min) |
| User impact    | Real users see the variation                      | None                           |

The two tools are complementary. Use Evals to pre-validate whether a version is worth an A/B test, then use A/B to confirm in production.

Quality column in the Calls tab

The Quality column in Prompts → a specific prompt → Calls sub-tab shows the average eval_results score from evaluators run on the Evals page. Versions that have never been evaluated show no score.

Color thresholds:

  • ≥70 — good (green)
  • 40–69 — warn (yellow)
  • <40 — bad (red)
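
As a sketch of the mapping (the helper name is ours, and it assumes the column renders scores on a 0–100 scale, which the thresholds imply):

```bash
# Mirror of the documented color thresholds. quality_color is a hypothetical
# helper; assumes integer scores on a 0-100 scale.
quality_color() {
  local score=$1
  if   (( score >= 70 )); then echo "green (good)"
  elif (( score >= 40 )); then echo "yellow (warn)"
  else                         echo "red (bad)"
  fi
}

quality_color 82   # -> green (good)
```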

LLM judge reliability

A judge score is only meaningful if it correlates with human judgment. If a team member scores responses manually via Annotation, a Pearson r correlation card appears automatically at the top of the Evals page.

  • r ≥ 0.7 — Strong (judge can be trusted)
  • 0.4 ≤ r < 0.7 — Moderate
  • r < 0.4 — Revisit the criterion
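
For reference, the card's r is the standard Pearson correlation between judge scores x_i and human scores y_i over the same responses:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$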

Cost

Judge calls are billed to your provider key (Spanlens does not cover them). Approximate cost with gpt-4o-mini: ~$0.0005 per evaluation. 50 samples ≈ $0.025.

Guardrails:

  • sample_size DB CHECK constraint: 1..1000
  • Estimated cost card shown in the Run dialog before starting
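
The same estimate is available over the API. A sketch assuming POST /api/v1/eval-runs/estimate accepts the same body as POST /api/v1/eval-runs (the request shape is an assumption):

```bash
# Hypothetical: dry-run cost estimate before starting. Assumes the estimate
# endpoint takes the same body as POST /api/v1/eval-runs.
curl https://spanlens-server.vercel.app/api/v1/eval-runs/estimate \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluatorId": "<evaluator-id>",
    "promptVersionId": "<v2-id>",
    "sampleSize": 50
  }'
```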

API

| Method + Path                          | Description                                               |
| -------------------------------------- | --------------------------------------------------------- |
| POST /api/v1/evaluators                | Create an evaluator                                        |
| GET /api/v1/evaluators?promptName=...  | List evaluators                                            |
| DELETE /api/v1/evaluators/:id          | Soft archive                                               |
| POST /api/v1/eval-runs                 | Start a run (returns 202 immediately; runs in background)  |
| POST /api/v1/eval-runs/estimate        | Estimate cost before running                               |
| GET /api/v1/eval-runs/:id              | Status and aggregated scores (poll while pending/running)  |
| GET /api/v1/eval-runs/:id/results      | Per-sample scores and reasoning                            |

Example — create and run an evaluator

```bash
# 1. Define the evaluator
curl https://spanlens-server.vercel.app/api/v1/evaluators \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "promptName": "support_reply",
    "name": "Helpfulness check",
    "type": "llm_judge",
    "config": {
      "criterion": "Does the response helpfully and clearly answer the customer question?",
      "judge_provider": "openai",
      "judge_model": "gpt-4o-mini",
      "scale_min": 0,
      "scale_max": 1
    }
  }'

# 2. Score v2 with 50 samples from the last 7 days
curl https://spanlens-server.vercel.app/api/v1/eval-runs \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluatorId": "<evaluator-id>",
    "promptVersionId": "<v2-id>",
    "source": "production",
    "sampleSize": 50,
    "sampleFrom": "2026-05-06T00:00:00Z"
  }'

# 3. Poll for results (status: pending → running → completed)
curl https://spanlens-server.vercel.app/api/v1/eval-runs/<run-id> \
  -H "Authorization: Bearer $SPANLENS_JWT"
```
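
If you're scripting step 3, a minimal polling loop can wait for completion. This is a sketch: it assumes the run payload exposes a top-level status field (per the pending → running → completed flow above), that a failed terminal state exists, and that jq is installed.

```bash
# Poll every 5 seconds until the run reaches a terminal state.
# Assumptions: top-level "status" field; "failed" as a terminal state; jq installed.
RUN_ID="<run-id>"
while true; do
  status=$(curl -s "https://spanlens-server.vercel.app/api/v1/eval-runs/$RUN_ID" \
    -H "Authorization: Bearer $SPANLENS_JWT" | jq -r '.status')
  echo "status: $status"
  case "$status" in
    completed|failed) break ;;
  esac
  sleep 5
done
```

Once the run completes, GET /api/v1/eval-runs/<run-id>/results returns the per-sample scores and reasoning.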

Limitations

  • Only llm_judge evaluator type. Heuristic evaluators (regex, JSON schema, length) are planned for a later release.
  • One evaluator run at a time. Concurrent runs on the same evaluator are not supported.
  • Rows with an empty response_body are skipped (streaming parser failures, old data, or error responses can account for roughly 28% of rows). The UI reports per-run coverage as, e.g., "47/50 scored".
  • The judge itself can be inaccurate. That's why Annotation exists — use it to validate the judge's reliability before relying on the scores.

Related: Datasets (test input sets), Experiments (offline side-by-side comparison), Annotation (human scoring), /evals dashboard.