Annotation

Team members rate production responses with 1–5 stars. Comparing those ratings to LLM judge scores produces a single number (Pearson r) that tells you whether the judge can actually be trusted.

Why it matters

An Evals judge score is meaningless if it doesn't correlate with human judgment. If the judge gives 70 but a human gives 30, the criterion needs rethinking.

Annotation is where you build that validation dataset. The ratings can also serve as ground truth for future fine-tuning.

Rating flow

  1. Go to REVIEW → Annotation in the sidebar.
  2. Use the top filters: select a prompt, toggle Unscored only (show only rows you haven't rated yet), or toggle Low judge score (judge scored below 50, the highest validation priority).
  3. Each card shows the user input and response in two columns. Click expand to read the full content.
  4. Click a star rating (1–5), optionally add a comment, and click Save rating.
  5. Already-rated rows show your previous score (e.g. "You: 75") in the header. Rating the same row again overwrites the previous score.

Score normalization

Users click 1–5 stars, but the database stores the value normalized to 0..1 as (stars - 1) / 4. This makes it directly comparable to eval_results.score (which is already 0..1) for Pearson r calculation.

| Stars | Normalized score | UI display (×100) |
|-------|------------------|-------------------|
| 1     | 0.00             | 0                 |
| 2     | 0.25             | 25                |
| 3     | 0.50             | 50                |
| 4     | 0.75             | 75                |
| 5     | 1.00             | 100               |

Both raw_score (original star count) and score (normalized) are stored, so the UI can display the original star rating.
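The mapping above can be sketched as a pair of helpers (Python, illustrative only; the function names are not part of the SpanLens API):

```python
def normalize_stars(stars: int) -> float:
    """Map a 1-5 star rating to the 0..1 scale stored in `score`."""
    if not 1 <= stars <= 5:
        raise ValueError("stars must be between 1 and 5")
    return (stars - 1) / 4

def display_score(stars: int) -> int:
    """UI display value (x100), e.g. 4 stars -> 75."""
    return round(normalize_stars(stars) * 100)
```

Because `eval_results.score` is already on the 0..1 scale, no further transformation is needed before computing Pearson r.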

Duplicate prevention

A UNIQUE (request_id, reviewer_id) constraint ensures each user leaves at most one score per request. Rating the same row again performs an upsert — it updates raw_score, score, and comment.

Multiple reviewers can rate the same request — each gets their own row.
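The upsert semantics can be sketched with an in-memory model (Python, illustrative only; `ratings` and `upsert_rating` are hypothetical names, and the real guarantee comes from the database constraint):

```python
# In-memory stand-in for the table: the dict key mirrors the
# UNIQUE (request_id, reviewer_id) constraint.
ratings: dict[tuple[str, str], dict] = {}

def upsert_rating(request_id: str, reviewer_id: str,
                  raw_score: int, comment: str = "") -> dict:
    row = {
        "raw_score": raw_score,
        "score": (raw_score - 1) / 4,  # normalized 0..1
        "comment": comment,
    }
    # A second rating by the same reviewer overwrites, never duplicates.
    ratings[(request_id, reviewer_id)] = row
    return row
```

A different reviewer rating the same request uses a different key, so both rows coexist.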

Correlation card on the Evals page

When a request has both an LLM judge score and a human score, it forms a paired sample. A per-prompt Pearson r card appears automatically at the top of the /evals page.

  • r ≥ 0.7 — Strong: judge can be trusted
  • 0.4 ≤ r < 0.7 — Moderate
  • r < 0.4 — Revisit the judge criterion

The card includes a 120×120 SVG scatter plot with a diagonal reference line (perfect agreement) so you can see visually where the divergence occurs.
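Since the correlation endpoint returns raw (judgeScore, humanScore) pairs and leaves Pearson r to the client, the computation behind the card can be sketched as (Python, illustrative only; `pearson_r` and `label` are hypothetical names):

```python
import math

def pearson_r(pairs):
    """Pearson correlation for (judge_score, human_score) pairs, both 0..1."""
    n = len(pairs)
    if n < 2:
        return None  # not enough paired samples
    xs, ys = zip(*pairs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return None  # no variance on one axis, r is undefined
    return cov / (sx * sy)

def label(r):
    """Bucket r into the card's three bands."""
    if r >= 0.7:
        return "Strong"
    if r >= 0.4:
        return "Moderate"
    return "Revisit the judge criterion"
```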

RLS policy

  • SELECT — any org member (you can see others' scores)
  • INSERT — any org member
  • UPDATE / DELETE — own rows only (reviewer_id = auth.uid())

API

| Method + Path | Description |
|---------------|-------------|
| GET /api/v1/annotation/queue | Rating queue (filters: promptName, promptVersionId, unscoredOnly, lowJudgeScoreOnly) |
| POST /api/v1/human-evals | Save a rating (upsert) |
| GET /api/v1/human-evals?promptVersionId=... | List ratings for a specific version |
| DELETE /api/v1/human-evals/:id | Delete your own rating |
| GET /api/v1/human-evals/correlation?promptName=... | Returns (judgeScore, humanScore) pairs; the client computes Pearson r |

Example — save a rating

```bash
# 4 stars + comment
curl https://spanlens-server.vercel.app/api/v1/human-evals \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "requestId": "<request-uuid>",
    "score": 0.75,
    "rawScore": 4,
    "comment": "Friendly but a bit long"
  }'
```

Limitations

  • No keyboard shortcuts. Rating is currently mouse-only; j/k navigation and 1–5 number-key shortcuts are planned.
  • No multi-reviewer averaging. The correlation card uses the most recent score per request, not an average across reviewers.
  • No reviewer permission management. Any org member can rate any request.
  • experiment_results / eval_results are not ratable. Only direct requests can be annotated. A UI for human pairwise comparison of experiment arms is planned.

Related: Evals (LLM judge infrastructure), /annotation dashboard.