Evals
Automatically score production response quality using an LLM-as-judge. Cost and latency are already measured; Evals adds quality to the picture so you can answer the question: did this prompt actually get better?
The problem it solves
What Spanlens already measures: cost, latency, error rate. What it couldn't measure: whether the response is actually good.
Even if v1 is faster and cheaper than v2, that comparison is meaningless if the response quality degraded. Evals is the infrastructure for assigning a 0..1 score to response content.
How it works
Quick-start with a template
Visit /evals on a fresh workspace and the empty state shows ten built-in evaluator templates grouped into three categories. Click Use template to pre-fill the New evaluator dialog with a curated criterion and a recommended judge model. You only need to pick which prompt the evaluator targets.
| Category | Templates | Default judge |
|---|---|---|
| Quality (5) | Response quality · Readability · Completeness · Persona match · Conciseness | gpt-4o-mini |
| Safety (4) | No PII leak · Toxicity · Hallucination · Prompt injection | gpt-4o-mini (Hallucination uses claude-3-5-sonnet for reasoning depth) |
| Cost (1) | Cost vs quality (could a cheaper model have produced this answer?) | claude-3-5-sonnet |
Templates are stored in evaluator_templates on the server, not hard-coded in the dashboard, so new templates can ship without a frontend deploy. The catalogue is global (every workspace sees the same suggestions) and read-only from the dashboard; every field on a template (criterion, judge provider/model, score range) is still editable in the New evaluator dialog after you load it.
Define an evaluator
An evaluator is a reusable definition of how to score responses.
- prompt_name: which prompt this evaluator targets
- name: e.g. "Helpfulness check"
- type: one of llm_judge, regex, json_schema, exact_match, contains, or embedding. The regex type checks the response body against a configured pattern and scores 1 on match. The json_schema type validates the response body against a JSON Schema document via Ajv and scores 1 on valid. exact_match (config value, optional caseSensitive / trim) scores 1 when the response equals the value; contains (config substring, optional caseSensitive) scores 1 when the substring is present. The four deterministic types are free of LLM cost. embedding (config provider / model, optional reference_text / threshold) scores the cosine similarity (0–1) of the response vs a reference answer. The reference is the dataset item's expected_output when present, otherwise reference_text; it calls an embeddings API on your provider key.
- config:
  - criterion: scoring criterion sentence
  - judge_provider: openai, anthropic, or gemini. Gemini uses responseMimeType: application/json with responseSchema for strict JSON output (matches OpenAI's response_format: json_object strictness).
  - judge_model: any model in model_prices for that provider. The Evals UI picker reads from /api/v1/models so newly seeded models appear automatically.
  - scale_min, scale_max: score range (normalized to 0..1 on save)
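The deterministic types and the scale normalization described above are simple enough to sketch. This is a minimal illustration, not Spanlens's implementation: the function names and the defaults (case-sensitive, no trimming) are assumptions made for the example.

```typescript
// Hypothetical config shapes. Field names follow the evaluator description
// above; the defaults (caseSensitive = true, trim = false) are assumptions.
interface ExactMatchConfig { value: string; caseSensitive?: boolean; trim?: boolean }
interface ContainsConfig { substring: string; caseSensitive?: boolean }

// Scores 1 when the response equals the configured value, otherwise 0.
function scoreExactMatch(response: string, cfg: ExactMatchConfig): number {
  let actual = cfg.trim ? response.trim() : response;
  let expected = cfg.trim ? cfg.value.trim() : cfg.value;
  if (cfg.caseSensitive === false) {
    actual = actual.toLowerCase();
    expected = expected.toLowerCase();
  }
  return actual === expected ? 1 : 0;
}

// Scores 1 when the configured substring is present, otherwise 0.
function scoreContains(response: string, cfg: ContainsConfig): number {
  if (cfg.caseSensitive === false) {
    return response.toLowerCase().includes(cfg.substring.toLowerCase()) ? 1 : 0;
  }
  return response.includes(cfg.substring) ? 1 : 0;
}

// Scores 1 when the pattern matches anywhere in the response, otherwise 0.
function scoreRegex(response: string, pattern: string): number {
  return new RegExp(pattern).test(response) ? 1 : 0;
}

// Maps a raw judge score on [scaleMin, scaleMax] onto 0..1, clamped.
function normalizeScore(raw: number, scaleMin: number, scaleMax: number): number {
  if (scaleMax <= scaleMin) throw new Error("scale_max must exceed scale_min");
  const unit = (raw - scaleMin) / (scaleMax - scaleMin);
  return Math.min(1, Math.max(0, unit));
}
```

For example, a judge score of 4 on a 1..5 scale normalizes to 0.75.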
Run flow
- Go to /evals and click New evaluator to define the criterion.
- Click Run on an evaluator and select version, time window, and sample size.
- The server samples N responses for the given prompt_version_id from the requests table and asks the judge LLM to score each one (using your provider key).
- Per-sample scores are written to eval_results and aggregated into eval_runs.avg_score.
- The UI shows the score distribution and the 5 lowest-scoring samples as drilldowns.
Where samples come from
Unlike other evaluation tools, you don't need to build a separate dataset. Spanlens already logs every call, so it samples automatically from production responses that used the given prompt version.
To use a Dataset as the sample source instead, see the Datasets page. In dataset mode the runner does two things back to back:
- For each item, run the chosen prompt version against its input using runProvider + runModel (an active provider key of that provider must exist on the workspace). This produces a fresh response.
- Send the fresh response to the judge, which scores it against the criterion.
That means dataset mode measures how the prompt actually performs on the curated inputs, not how friendly the static expected_output text is. The expected_output field is reference only in this release; a later release may feed it to the judge as a target for similarity checks.
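The two back-to-back steps of dataset mode can be sketched as a small loop. This is only an illustration under stated assumptions: generate and judge are hypothetical stand-ins for the provider call (runProvider + runModel) and the judge call, and they are synchronous here for brevity, whereas real calls would be asynchronous.

```typescript
// Hypothetical dataset item shape; expected_output is carried but unused here.
interface DatasetItem { input: string; expected_output?: string }

type Generate = (input: string) => string; // stands in for runProvider + runModel
type Judge = (response: string) => number; // returns a score in 0..1

// Step 1: generate a fresh response per item. Step 2: score that response.
function runDatasetEval(items: DatasetItem[], generate: Generate, judge: Judge) {
  const scores = items.map((item) => judge(generate(item.input)));
  const avgScore =
    scores.length === 0 ? null : scores.reduce((a, b) => a + b, 0) / scores.length;
  return { scores, avgScore };
}
```

The judge only ever sees the freshly generated response, which is why the score reflects how the prompt version performs on the curated inputs.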
On the Eval Run dialog, switching the Sample source toggle to Dataset exposes a Plus Upload button next to the dataset picker. Picking a JSON or CSV file creates a fresh dataset with an auto-generated name (for example upload-2026-05-22-2245), bulk-inserts every parsed item, and pre-selects it. Rename or delete the dataset from /datasets later.
How Evals differs from A/B
| | A/B (inside Prompts tab) | Evals (this tab) |
|---|---|---|
| When | Live production traffic routing | Offline scoring |
| Measures | Which version gets more traffic / fewer failures | Response quality score |
| Time to result | Days (waiting for statistical significance) | Minutes (50 samples ≈ 1–2 min) |
| User impact | Real users see the variation | None |
The two tools are complementary. Use Evals to pre-validate whether a version is worth an A/B test, then use A/B to confirm in production.
Quality column in the Calls tab
The Quality column in Prompts → a specific prompt → Calls sub-tab shows the average eval_results score from evaluators run on this page. Versions that have never been evaluated show a dash instead of a score.
Color thresholds:
- ≥70: good (green)
- 40–69: warn (yellow)
- <40: bad (red)
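As a rough sketch, the thresholds above map to a band like this. The function name and the "unscored" band are hypothetical, and the input is assumed to be on the same scale as the thresholds listed above.

```typescript
type QualityBand = "good" | "warn" | "bad" | "unscored";

// displayScore is assumed to be on the same scale as the thresholds above;
// null stands for a version that has never been evaluated.
function qualityBand(displayScore: number | null): QualityBand {
  if (displayScore === null) return "unscored";
  if (displayScore >= 70) return "good"; // green
  if (displayScore >= 40) return "warn"; // yellow
  return "bad"; // red
}
```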
LLM judge reliability
A judge score is only meaningful if it correlates with human judgment. If a team member scores responses manually via Annotation, the Evals page shows a judge-human agreement card at the top of the run summary with the right statistic for the evaluator type.
For numeric scores (judge returns a 0..1 or 1..5 scale) Spanlens computes Pearson r. For categorical labels (PASS / FAIL, A / B / C) it computes Cohen's kappa instead, which accounts for chance agreement and is the right measure when r would treat label distance as numeric. Both surface the same traffic-light bands so you can read them the same way.
- r ≥ 0.7 or κ ≥ 0.6, Strong (judge can be trusted)
- 0.4 ≤ r < 0.7 or 0.4 ≤ κ < 0.6, Moderate
- r < 0.4 or κ < 0.4, Revisit the criterion
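For reference, both statistics and the bands above can be computed in a few lines. This is a textbook sketch with hypothetical function names, not the code Spanlens runs; it assumes paired judge and human scores for the same samples.

```typescript
// Pearson r between paired numeric scores. Returns NaN if either side is constant.
function pearsonR(judge: number[], human: number[]): number {
  const n = judge.length;
  if (n !== human.length || n < 2) throw new Error("need two equal-length samples, n >= 2");
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length;
  const mj = mean(judge);
  const mh = mean(human);
  let cov = 0;
  let vj = 0;
  let vh = 0;
  judge.forEach((j, i) => {
    const h = human[i] as number;
    cov += (j - mj) * (h - mh);
    vj += (j - mj) * (j - mj);
    vh += (h - mh) * (h - mh);
  });
  return cov / Math.sqrt(vj * vh);
}

// Cohen's kappa between paired categorical labels: (po - pe) / (1 - pe).
function cohensKappa(judge: string[], human: string[]): number {
  const n = judge.length;
  if (n !== human.length || n === 0) throw new Error("need two equal-length, non-empty samples");
  const judgeCounts = new Map<string, number>();
  const humanCounts = new Map<string, number>();
  let agree = 0;
  judge.forEach((j, i) => {
    const h = human[i] as string;
    if (j === h) agree += 1;
    judgeCounts.set(j, (judgeCounts.get(j) ?? 0) + 1);
    humanCounts.set(h, (humanCounts.get(h) ?? 0) + 1);
  });
  const po = agree / n; // observed agreement
  let pe = 0; // agreement expected by chance
  judgeCounts.forEach((count, label) => {
    pe += (count / n) * ((humanCounts.get(label) ?? 0) / n);
  });
  return pe === 1 ? 1 : (po - pe) / (1 - pe);
}

type Agreement = "strong" | "moderate" | "revisit";

// Applies the bands listed above: r >= 0.7 or kappa >= 0.6 is strong.
function agreementBand(stat: number, kind: "pearson" | "kappa"): Agreement {
  const strong = kind === "pearson" ? 0.7 : 0.6;
  if (stat >= strong) return "strong";
  if (stat >= 0.4) return "moderate";
  return "revisit";
}
```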
Judge result caching
Re-running an evaluator on the same prompt + sample set is common during prompt tuning, and most of those re-runs ask the judge the exact same question twice. Spanlens caches judge verdicts keyed by (evaluator_config_hash, response_hash), so identical (evaluator settings, model output) pairs reuse the prior verdict instead of paying for another LLM call.
- Cache hits skip the judge API call entirely. Latency drops, cost goes to zero on the hit.
- Editing the rubric, anchors, judge model, temperature, or prompt template changes the config hash and invalidates the cache for that evaluator. New verdicts are computed and stored.
- Every run reports a cache_hits counter so you can see how much was reused vs. paid for. Entries are pruned daily via /cron/prune-judge-cache after 30 days without a hit.
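The caching behaviour above can be pictured with a small in-memory stand-in. Everything here is hypothetical (the class, its method, the synchronous judge callback); it only illustrates how a key built from the config hash and the response hash lets identical pairs skip the judge call.

```typescript
// Hypothetical verdict shape for the sketch.
interface Verdict { score: number; reasoning: string }

class JudgeCache {
  private entries = new Map<string, Verdict>();
  cacheHits = 0;

  // Key mirrors the (evaluator_config_hash, response_hash) pair.
  private key(configHash: string, responseHash: string): string {
    return `${configHash}:${responseHash}`;
  }

  // Returns the cached verdict, or calls the judge once and stores the result.
  getOrJudge(configHash: string, responseHash: string, judge: () => Verdict): Verdict {
    const k = this.key(configHash, responseHash);
    const cached = this.entries.get(k);
    if (cached !== undefined) {
      this.cacheHits += 1; // hit: no judge call, no cost
      return cached;
    }
    const verdict = judge(); // miss: pay for one judge call
    this.entries.set(k, verdict);
    return verdict;
  }
}
```

Changing any setting that feeds the config hash produces a different key, so the old entries are simply never matched again; that is the invalidation described above.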
Cost
Judge calls are billed to your provider key (Spanlens does not cover them). Approximate cost with gpt-4o-mini: ~$0.0005 per evaluation. 50 samples ≈ $0.025.
Guardrails:
- sample_size DB CHECK constraint: 1..1000
- Estimated cost card shown in the Run dialog before starting
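The arithmetic behind the estimate is a single multiplication. A minimal sketch, assuming the approximate gpt-4o-mini figure quoted above and a hypothetical function name; the real estimate endpoint may account for token counts and other models.

```typescript
// Approximate per-evaluation cost quoted above for gpt-4o-mini.
const COST_PER_EVAL_USD = 0.0005;

// Rough run cost; enforces the same 1..1000 bound as the sample_size constraint.
function estimateRunCostUsd(sampleSize: number, costPerEvalUsd = COST_PER_EVAL_USD): number {
  if (!Number.isInteger(sampleSize) || sampleSize < 1 || sampleSize > 1000) {
    throw new RangeError("sampleSize must be an integer in 1..1000");
  }
  return sampleSize * costPerEvalUsd;
}
```

With the default figure, 50 samples come to about $0.025 and the 1000-sample maximum to about $0.50.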
API
| Method + Path | Description |
|---|---|
| POST /api/v1/evaluators | Create an evaluator |
| GET /api/v1/evaluators?promptName=... | List evaluators |
| DELETE /api/v1/evaluators/:id | Soft archive |
| POST /api/v1/eval-runs | Start a run (returns 202 immediately; runs in background) |
| POST /api/v1/eval-runs/estimate | Estimate cost before running |
| GET /api/v1/eval-runs/:id | Status and aggregated scores (poll while pending/running) |
| GET /api/v1/eval-runs/:id/results | Per-sample scores and reasoning |
These endpoints accept either a dashboard session (Supabase JWT) or a full-access Spanlens API key (sl_live_*), so you can drive evals from CI as well as the dashboard. Read endpoints also accept a public key (sl_live_pub_*); the write endpoints (create evaluator, start a run) require a full key and reject a public key with PUBLIC_KEY_WRITE_FORBIDDEN, since a run spends your provider key.
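In a CI job it can help to check the key type before starting a run, so a misconfigured public key fails fast instead of at the first write call. A small sketch based only on the prefixes named above; the function names are hypothetical and the server remains the authority on what a key may do.

```typescript
type KeyKind = "public" | "full" | "unknown";

// Check the longer public prefix first: it also starts with "sl_live_".
function classifyKey(apiKey: string): KeyKind {
  if (apiKey.startsWith("sl_live_pub_")) return "public";
  if (apiKey.startsWith("sl_live_")) return "full";
  return "unknown";
}

// Write endpoints (create evaluator, start a run) need a full key.
function canUseWriteEndpoints(apiKey: string): boolean {
  return classifyKey(apiKey) === "full";
}
```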
Example: create and run an evaluator
# 1. Define the evaluator
curl https://server.spanlens.io/api/v1/evaluators \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{
"promptName": "support_reply",
"name": "Helpfulness check",
"type": "llm_judge",
"config": {
"criterion": "Does the response helpfully and clearly answer the customer question?",
"judge_provider": "openai",
"judge_model": "gpt-4o-mini",
"scale_min": 0,
"scale_max": 1
}
}'
# 2. Score v2 with 50 samples from the last 7 days
curl https://server.spanlens.io/api/v1/eval-runs \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{
"evaluatorId": "<evaluator-id>",
"promptVersionId": "<v2-id>",
"source": "production",
"sampleSize": 50,
"sampleFrom": "2026-05-06T00:00:00Z"
}'
# 2b. Dataset mode also accepts runProvider + runModel.
# The runner generates a response per item before scoring.
curl https://server.spanlens.io/api/v1/eval-runs \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{
"evaluatorId": "<evaluator-id>",
"promptVersionId": "<v2-id>",
"source": "dataset",
"datasetId": "<dataset-id>",
"sampleSize": 50,
"runProvider": "openai",
"runModel": "gpt-4o-mini"
}'
# 3. Poll for results (status: pending → running → completed)
curl https://server.spanlens.io/api/v1/eval-runs/<run-id> \
-H "Authorization: Bearer $SPANLENS_JWT"
Run from CI (prompt CI)
Gate a prompt change on its eval score. The SDK's client.evals.run() triggers a run with a full sl_live_* key, polls until it finishes, and returns the scored run so the job can fail the build when quality regresses. Unlike tracing (fire-and-forget), this call blocks and throws on failure.
import { SpanlensClient } from '@spanlens/sdk'
// Use a full-access key (sl_live_*), not a public key.
const client = new SpanlensClient({ apiKey: process.env.SPANLENS_API_KEY! })
const run = await client.evals.run({
evaluatorId: process.env.EVALUATOR_ID!,
promptVersionId: process.env.PROMPT_VERSION_ID!,
sampleSize: 50,
})
console.log(`scored ${run.scored_count}/${run.attempted_count}, avg ${run.avg_score}`)
// Quality gate: fail the build if the average drops below the bar.
if (run.status !== 'completed' || (run.avg_score ?? 0) < 0.8) {
console.error('Eval gate failed')
process.exit(1)
}
Pass { wait: false } to return immediately after the run is queued, or tune pollIntervalMs / timeoutMs. Use client.evals.getResults(run.id) to read the lowest-scoring samples for a CI log.
Confidence intervals
A score is only as trustworthy as its sample size: 0.82 from 8 samples and 0.82 from 200 are not the same evidence, and “version B scored 0.84 vs A's 0.81” can be noise. Each completed run stores score_stddev (the sample standard deviation of the scores behind avg_score), and the dashboard renders a 95% confidence interval (avg ± 1.96·σ/√n) next to the average. It is populated for numeric and pass-rate (boolean) evaluators; categorical and text types have no mean, so it stays empty.
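To make the formula concrete, here is a standalone sketch that computes the same avg ± 1.96·σ/√n interval from raw scores. It is not the SDK's scoreConfidenceInterval helper; the function names are hypothetical, and it assumes the sample standard deviation uses the n − 1 denominator.

```typescript
interface ScoreInterval { mean: number; margin: number; low: number; high: number }

// Sample standard deviation (n - 1 denominator).
function sampleStddev(scores: number[]): number {
  const n = scores.length;
  if (n < 2) throw new Error("need at least two scores");
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const sumSq = scores.reduce((acc, s) => acc + (s - mean) * (s - mean), 0);
  return Math.sqrt(sumSq / (n - 1));
}

// 95% interval under a normal approximation: mean ± 1.96 · σ / √n.
function confidenceInterval95(scores: number[]): ScoreInterval | null {
  const n = scores.length;
  if (n < 2) return null; // no spread can be estimated from a single score
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const margin = (1.96 * sampleStddev(scores)) / Math.sqrt(n);
  return { mean, margin, low: mean - margin, high: mean + margin };
}
```

Because the margin shrinks with √n, quadrupling the sample size roughly halves the width of the interval for the same spread of scores.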
In CI, gate on the interval instead of the point estimate so the build fails only on a meaningful regression, not sampling jitter:
import { SpanlensClient, scoreConfidenceInterval } from '@spanlens/sdk'
const client = new SpanlensClient({ apiKey: process.env.SPANLENS_API_KEY! })
const run = await client.evals.run({
evaluatorId: process.env.EVALUATOR_ID!,
promptVersionId: process.env.PROMPT_VERSION_ID!,
sampleSize: 100,
})
const ci = scoreConfidenceInterval(run) // { mean, margin, low, high } | null
const GATE = 0.8
// Fail only when even the optimistic bound is below the bar. A wide
// interval (small / noisy sample) does not fail the build; collect more data instead.
if (run.status !== 'completed' || (ci?.high ?? run.avg_score ?? 0) < GATE) {
console.error(`gate failed: ${ci?.mean.toFixed(2)} ±${ci?.margin.toFixed(2)}`)
process.exit(1)
}
Tuning the judge: rubric & calibration anchors
A bare criterion leaves the judge to invent its own scale, so scores drift run to run. Two optional fields on an LLM-judge evaluator make scoring consistent (set them under Advanced when creating the evaluator, or pass them in config to POST /api/v1/evaluators):
- rubric: free-form guidance injected into the prompt, e.g. 1.0 = fully correct and complete · 0.5 = partially correct · 0 = wrong. Applies to every score type.
- anchors: up to 10 few-shot calibration examples, each an example response paired with the score it should get (and an optional reasoning). The judge anchors its scale to these. Numeric judges only.
// POST /api/v1/evaluators (config excerpt)
{
"criterion": "Is the answer factually correct and complete?",
"judge_provider": "openai",
"judge_model": "gpt-4o-mini",
"scale_min": 0,
"scale_max": 1,
"rubric": "1.0 = correct and complete · 0.5 = correct but missing detail · 0 = wrong",
"anchors": [
{ "response": "Paris is the capital of France.", "score": 1, "reasoning": "correct and complete" },
{ "response": "I think it's somewhere in Europe.", "score": 0.3, "reasoning": "vague, no answer" }
]
}
Long responses are truncated to a character cap before judging, but middle-out: the start and the end are both kept (the actual answer often lives in the conclusion) with the middle elided.
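Middle-out truncation can be sketched as follows. The function name, the character cap, and the marker text are illustrative assumptions; the actual cap and marker Spanlens uses are not stated here.

```typescript
// Keeps the head and the tail of a long response and elides the middle.
function truncateMiddleOut(
  text: string,
  maxChars: number,
  marker = "\n[... truncated ...]\n",
): string {
  if (text.length <= maxChars) return text;
  const budget = maxChars - marker.length;
  if (budget <= 0) return text.slice(0, maxChars); // cap too small for a marker
  const head = Math.ceil(budget / 2);
  const tail = budget - head;
  return text.slice(0, head) + marker + text.slice(text.length - tail);
}
```

The result never exceeds maxChars, and a response that ends with its conclusion keeps that conclusion in front of the judge.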
Pairwise comparison (A vs B)
Absolute scores drift, and a 0.84-vs-0.81 gap is often noise. A pairwise run instead shows the judge BOTH versions' responses to the same input and asks which one wins. Relative judgments are far more consistent, so a win-rate is a more trustworthy signal than two separate averages. Pick Pairwise (A vs B) when running an evaluator, then choose a baseline (A), a candidate (B), and a dataset.
- Each item is run through both versions, then judged head-to-head.
- Position bias is counterbalanced. Judges tend to favour whichever response they see first, so Spanlens alternates the A/B presentation order across the sample and un-swaps the verdict.
- The run reports B's win-rate as avg_score (1 = B wins, 0 = A wins, 0.5 = tie) plus a b_wins / a_wins / ties tally, and the 95% confidence interval applies to the win-rate.
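The counterbalancing and the win-rate arithmetic can be illustrated with a short sketch. The even/odd alternation scheme, the type names, and the function names are assumptions for the example; only the scoring convention (1 = B wins, 0 = A wins, 0.5 = tie) comes from the description above.

```typescript
type SlotVerdict = "first" | "second" | "tie"; // what the judge picked, by position
type Outcome = "a" | "b" | "tie"; // the un-swapped result

// Assumed alternation: even items show A first, odd items show B first.
function bShownFirst(index: number): boolean {
  return index % 2 === 1;
}

// Translates a positional verdict back to the version that actually won.
function unswapVerdict(verdict: SlotVerdict, index: number): Outcome {
  if (verdict === "tie") return "tie";
  const firstIsB = bShownFirst(index);
  const winnerIsB = verdict === "first" ? firstIsB : !firstIsB;
  return winnerIsB ? "b" : "a";
}

// Tallies outcomes and computes B's win-rate (1 = B wins, 0 = A wins, 0.5 = tie).
function tallyPairwise(verdicts: SlotVerdict[]) {
  let aWins = 0;
  let bWins = 0;
  let ties = 0;
  verdicts.forEach((v, i) => {
    const outcome = unswapVerdict(v, i);
    if (outcome === "a") aWins += 1;
    else if (outcome === "b") bWins += 1;
    else ties += 1;
  });
  const n = verdicts.length;
  const avgScore = n === 0 ? null : (bWins + 0.5 * ties) / n;
  return { aWins, bWins, ties, avgScore };
}
```

Under this scheme a judge that always picks the first response, regardless of content, ends up with a win-rate of 0.5 on an even-sized sample, so pure position bias cancels out rather than favouring one version.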
import { SpanlensClient, scoreConfidenceInterval } from '@spanlens/sdk'
const client = new SpanlensClient({ apiKey: process.env.SPANLENS_API_KEY! })
const run = await client.evals.run({
evaluatorId: process.env.EVALUATOR_ID!,
mode: 'pairwise',
promptVersionId: BASELINE_VERSION_ID, // A
promptVersionBId: CANDIDATE_VERSION_ID, // B
source: 'dataset',
datasetId: process.env.DATASET_ID!,
runProvider: 'openai',
runModel: 'gpt-4o-mini',
sampleSize: 100,
})
const ci = scoreConfidenceInterval(run) // CI on B's win-rate
// Ship B only if it beats A with the interval clearing 50%.
if ((ci?.low ?? run.avg_score ?? 0) > 0.5) {
console.log(`B wins ${run.b_wins}/${run.scored_count} — promote it`)
}
Agent trajectory evaluation
Every other evaluator scores a single response. A trajectory evaluator scores the whole agent trace, the ordered sequence of spans (LLM calls, tool calls, intermediate steps), against a criterion. It reuses the tracing data you already send, so you can judge how the agent worked, not just its final answer.
- A trajectory evaluator binds to a trace name (the name your SDK passes to createTrace()), not a prompt. Create it under Type → Agent trajectory and give it the trace name + a criterion.
- Running it samples the most recent N traces with that name, serializes each one's steps in execution order, and the judge scores the trajectory 0..1. The run reports the average + the same 95% confidence interval.
- From the SDK, a trajectory run needs only the evaluator id (no prompt version): client.evals.run({ evaluatorId, sampleSize: 50 }).
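Serializing a trace's steps in execution order might look like the sketch below. The TraceSpan shape and its field names are hypothetical, chosen only to show the idea; they are not the Spanlens span schema, and the real serialization format is not documented here.

```typescript
// Hypothetical span shape for illustration only.
interface TraceSpan {
  name: string;
  kind: "llm" | "tool" | "step";
  startedAt: number; // epoch milliseconds
  outcome: "ok" | "error";
}

// Orders spans by start time and renders one numbered line per step.
function serializeTrajectory(spans: TraceSpan[]): string {
  return spans
    .slice() // copy so the caller's array is not reordered
    .sort((x, y) => x.startedAt - y.startedAt)
    .map((s, i) => `${i + 1}. [${s.kind}] ${s.name} (${s.outcome})`)
    .join("\n");
}
```

A text form like this gives the judge the order of tool calls and any failures, which is what process-oriented criteria need.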
Write criteria about the process: “did the agent call the search tool before answering”, “were there redundant or failed tool calls”, “did it follow the required steps in order”.
Reproducibility & reliability options
- sampleStrategy (production source): recent (default) scores the latest N requests; random draws a representative sample (ORDER BY rand()) without recency bias.
- generationTemperature (dataset source): the temperature used to generate each response before judging. Defaults to 0 so a re-run produces the same answers; raise it to sample variability on purpose.
- Golden-set scoring. Dataset items with an expected_output now have it injected into the judge prompt as a reference, so the judge compares the response against the expected answer instead of scoring on the criterion alone.
- Retries. Judge and generation calls retry transient failures (429 / 5xx / network) with exponential backoff. Concurrency and retry counts are tunable via EVAL_JUDGE_CONCURRENCY, EVAL_GENERATION_CONCURRENCY, and EVAL_MAX_RETRIES.
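The retry behaviour can be sketched generically. The base delay, the cap, and all function names below are assumptions; the sketch treats errors without a numeric status as network failures, and it omits jitter, which a production client would usually add.

```typescript
// Retry on rate limits and server errors; other statuses are treated as permanent.
function isRetryableStatus(httpStatus: number): boolean {
  return httpStatus === 429 || (httpStatus >= 500 && httpStatus <= 599);
}

// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Reads a numeric `status` off an error-like value, if present.
function statusOf(err: unknown): number | undefined {
  if (typeof err === "object" && err !== null && "status" in err) {
    const s = (err as { status: unknown }).status;
    return typeof s === "number" ? s : undefined;
  }
  return undefined;
}

// Calls `call`, retrying transient failures up to maxRetries times.
async function withRetries<T>(
  call: () => Promise<T>,
  maxRetries: number,
  sleep: (ms: number) => Promise<void> = (ms) => new Promise<void>((r) => setTimeout(r, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt += 1) {
    try {
      return await call();
    } catch (err) {
      const httpStatus = statusOf(err);
      // No status is treated as a network error and retried.
      const retryable = httpStatus === undefined || isRetryableStatus(httpStatus);
      if (!retryable || attempt >= maxRetries) throw err;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

With these defaults the waits are 500 ms, 1 s, 2 s, 4 s, and then 8 s for every later attempt.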
Auto-run on a new version (golden regression suite)
Turn an evaluator into a regression gate: enable Auto-run on each new version on the evaluator (or pass autoRunOnVersion + autoRunDatasetId / autoRunProvider / autoRunModel to POST /api/v1/evaluators). Whenever a new version of that evaluator's prompt is created, Spanlens automatically runs the evaluator against the chosen dataset and scores it, with no manual trigger.
It is a dataset run, not production: a brand-new version has no traffic yet, so the run generates responses for the golden dataset with the configured model and scores them. Pair it with an eval_score alert to be notified when a version regresses, and with expected_output on the dataset items for golden-set scoring. Auto-runs spend your provider key, so they are opt-in per evaluator.
Limitations
- Six evaluator types ship today: llm_judge (model scores 0–1), regex, json_schema, exact_match, contains (the four deterministic types run with no LLM cost), and embedding (cosine similarity via your provider key). Custom-code (JS) evaluators are planned for a later release.
- One evaluator run at a time. Concurrent runs on the same evaluator are not supported.
- Rows with empty response_body are skipped. Roughly 28% of rows may be skipped due to streaming parser failures, old data, or error responses. The UI shows this as, for example, "47/50 scored".
- The judge itself can be inaccurate. That's why Annotation exists; use it to validate the judge's reliability before relying on the scores.
Related: Datasets (test input sets), Experiments (offline side-by-side comparison), Annotation (human scoring), /evals dashboard.