Evals

Evals score LLM responses on a 0 to 1 scale per prompt version, so you can tell whether v8 is actually better than v7 instead of just cheaper. Spanlens supports LLM-as-judge automated scoring, human annotation, and the correlation between the two as a first-class drift signal.

The three score sources

Source	Entity	Update cadence
LLM-as-judge	`evals`	Async on ingest; caches by content hash so re-running is free.
Human annotation	`annotations`	Manual via /annotation.
Programmatic (custom)	`evals` with `source='custom'`	Push via the SDK from your own scripts or CI.

LLM-as-judge in practice

The default judge is a frontier model (configurable per project) scoring on a rubric you write in plain English. Scores are stored with the judge model name and the exact rubric used, so when you change the rubric, old scores do not silently shift meaning. The judge result is cached by a hash of (response, rubric, judge_model), so re-running an eval on the same data is instant and free.

import { SpanlensClient } from '@spanlens/sdk'

const client = new SpanlensClient()

await client.evals.run({
  prompt_version_id: 'pv_xxx',
  rubric:
    'Score 1.0 if the response correctly cites the source document, 0.5 if cited but inaccurate, 0.0 otherwise.',
  judge_model: 'claude-3-5-sonnet',
  sample_size: 200, // pulls 200 recent responses for this prompt version
})
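The content-hash caching described above can be sketched as follows. This is a minimal illustration, not Spanlens's actual implementation: the `JudgeCacheInput` type, `judgeCacheKey`, `scoreWithCache`, the choice of SHA-256, and the in-memory `Map` are all assumptions made for the example.

```typescript
import { createHash } from 'node:crypto'

// Hypothetical shape: the fields mirror the (response, rubric, judge_model) tuple.
interface JudgeCacheInput {
  response: string
  rubric: string
  judgeModel: string
}

// Build a deterministic key. Serializing the fields as a JSON array keeps
// their boundaries unambiguous, so ("ab", "c") and ("a", "bc") hash differently.
function judgeCacheKey(input: JudgeCacheInput): string {
  const payload = JSON.stringify([input.response, input.rubric, input.judgeModel])
  return createHash('sha256').update(payload).digest('hex')
}

// In-memory stand-in for whatever store actually backs the cache (assumption).
const cache = new Map<string, number>()

// Return the cached score when the same tuple was judged before;
// otherwise call the judge once and remember the result.
async function scoreWithCache(
  input: JudgeCacheInput,
  callJudge: (input: JudgeCacheInput) => Promise<number>,
): Promise<number> {
  const key = judgeCacheKey(input)
  const hit = cache.get(key)
  if (hit !== undefined) return hit // cache hit: no judge call
  const score = await callJudge(input)
  cache.set(key, score)
  return score
}
```

Because the rubric and judge model are part of the key, editing either one produces a new key and therefore a fresh judge call, which is consistent with old scores keeping their original meaning.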

Judge-to-human correlation

Spanlens computes the Pearson correlation between LLM judge scores and human annotation scores per prompt version. The number lives in the eval detail view next to the average score. When correlation drops below your threshold (default 0.7), Spanlens flags it as judge drift and suggests re-grounding the judge rubric against fresh human labels.

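This metric is the difference between "our eval score is 0.85" and "our eval score is 0.85, and the judge's scores track human scores closely enough to trust that number." Note that a Pearson correlation measures how closely the two sets of scores move together, not the percentage of cases where a human would agree. Without it, judge scores can drift away from human judgment with no signal that they have.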
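To make the metric concrete, here is a minimal sketch of a Pearson correlation over paired judge and human scores, with the drift check applied on top. The function names `pearson` and `isJudgeDrift` are hypothetical, and how Spanlens pairs and filters scores internally is not specified here; only the 0.7 default comes from the text above.

```typescript
// Pearson correlation between paired judge and human scores.
// Returns NaN when the correlation is undefined
// (fewer than two pairs, or one side has zero variance).
function pearson(judge: number[], human: number[]): number {
  if (judge.length !== human.length) {
    throw new Error('score arrays must be the same length')
  }
  const n = judge.length
  if (n < 2) return NaN

  const meanJ = judge.reduce((a, b) => a + b, 0) / n
  const meanH = human.reduce((a, b) => a + b, 0) / n

  let cov = 0
  let varJ = 0
  let varH = 0
  for (let i = 0; i < n; i++) {
    const dj = judge[i] - meanJ
    const dh = human[i] - meanH
    cov += dj * dh
    varJ += dj * dj
    varH += dh * dh
  }
  const denom = Math.sqrt(varJ * varH)
  return denom === 0 ? NaN : cov / denom
}

// Flag drift when the correlation falls below the threshold (default 0.7).
// An undefined correlation (NaN) is not flagged in this sketch.
function isJudgeDrift(r: number, threshold = 0.7): boolean {
  return r < threshold
}
```

A judge can score every response 0.1 higher than the human labels and still have a correlation of 1, which is why the correlation should be read alongside the average score rather than instead of it.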

Experiments and the eval feedback loop

Evals integrate with experiments so you can replay a fixed dataset across prompt versions and judge each output on the same rubric. The experiment table shows quality, cost, and latency side by side, which lets you make the "cheaper but as good?" decision on evidence rather than vibes.

Eval-driven anomaly detection

Anomalies (see /docs/features/anomalies) fire when eval scores deviate more than 3σ from the rolling 7-day baseline. A score collapse usually means a prompt regression, a model variant change, or drift in your input distribution. Each anomaly lists its contributing factors so you can jump straight to the prompt version or customer responsible.
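The 3σ rule above can be illustrated with a short sketch. It assumes the baseline is an array of eval scores (for example, daily averages over the last 7 days) and uses the sample standard deviation; how Spanlens actually aggregates the window is not stated here, and `isScoreAnomaly` is a hypothetical name.

```typescript
// Flag a score that deviates more than `sigmas` standard deviations
// from the mean of a baseline window.
function isScoreAnomaly(baseline: number[], current: number, sigmas = 3): boolean {
  if (baseline.length < 2) return false // not enough history to estimate spread

  const mean = baseline.reduce((a, b) => a + b, 0) / baseline.length
  // Sample variance (divides by n - 1).
  const variance =
    baseline.reduce((acc, x) => acc + (x - mean) ** 2, 0) / (baseline.length - 1)
  const std = Math.sqrt(variance)

  if (std === 0) return current !== mean // flat baseline: any change stands out
  return Math.abs(current - mean) > sigmas * std
}

// Example: seven daily averages hovering around 0.85.
const lastSevenDays = [0.84, 0.86, 0.85, 0.83, 0.87, 0.85, 0.85]
const collapsed = isScoreAnomaly(lastSevenDays, 0.6) // far outside the 3σ band
const normal = isScoreAnomaly(lastSevenDays, 0.86) // within normal variation
```

With a baseline this tight, a drop to 0.6 is flagged while a day at 0.86 is not.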

Where to go next

Evals feature page, dashboard surface.
Experiments, replay datasets across prompt versions and models.
Annotation, build human-labeled golden sets from real traffic.
Prompt management, how versions relate to eval scores.