Tutorial: nightly evals on production traffic
Forty-five minutes. We create an LLM-as-judge evaluator, run it once by hand on a sample of yesterday's requests, then schedule the same run nightly via cron. The output is a quality score per prompt version, sitting next to cost and latency in the dashboard.
What you will end up with
- One Evaluator named Helpfulness check targeting your rag-system prompt.
- One Eval Run per night, sampling 100 production responses, scoring each 0..1.
- Trend line in /evals showing average score per day.
- Drill-down to the five lowest-scoring samples each night for manual review.
Prerequisites
- You have at least one prompt version logged (the RAG tutorial sets this up).
- You have at least one provider key (OpenAI or Anthropic) registered. The judge LLM uses that key.
- A scheduler that can hit an HTTPS endpoint with a header (Vercel Cron, GitHub Actions, a real cron box, Modal, Inngest, anything).
Step 1. Create the Evaluator
Open /evals and click New evaluator. Fill in:
- Targets prompt: rag-system (the name you registered).
- Name: Helpfulness check.
- Type: llm_judge (the only type today).
- Criterion: write the rubric. Keep it specific. Example:

    Rate how helpful this response is to the user's question, on a scale of 1 to 5.
    - 5: directly answers the question with correct, cited information from the context.
    - 4: directly answers but does not cite, or has a minor inaccuracy.
    - 3: partially answers; misses something important.
    - 2: largely off-topic or hallucinates.
    - 1: refuses, errors out, or returns gibberish.
    Reply with: {"score": <1-5>, "reasoning": "<one sentence>"}

- Judge provider / model: pick a stronger model than the one being judged. If rag-system uses gpt-4o-mini, judge with gpt-4o.
- Scale: 1 to 5. Spanlens normalizes to 0..1 on save.
See Evals reference for the full evaluator config and the schema used to validate the judge's response.
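To make the rubric's reply format and the 0..1 normalization concrete, here is a minimal sketch of what validating a judge reply could look like. The function names are hypothetical, the integer check is an assumption drawn from the rubric's `<1-5>` placeholder, and the linear rescaling is an assumption too: the text above only says scores are normalized to 0..1, not which formula Spanlens uses.

```typescript
// Shape the rubric asks the judge to reply with.
interface JudgeReply {
  score: number
  reasoning: string
}

// Parse the judge's raw text and check it against the rubric's scale.
// Throws on anything that is not the expected JSON object.
// ASSUMPTION: scores are integers, as the rubric's <1-5> placeholder suggests.
function parseJudgeReply(raw: string, min = 1, max = 5): JudgeReply {
  const parsed: unknown = JSON.parse(raw)
  if (typeof parsed !== 'object' || parsed === null) {
    throw new Error('judge reply is not a JSON object')
  }
  const { score, reasoning } = parsed as Record<string, unknown>
  if (typeof score !== 'number' || !Number.isInteger(score) || score < min || score > max) {
    throw new Error(`score must be an integer in ${min}..${max}`)
  }
  if (typeof reasoning !== 'string') {
    throw new Error('reasoning must be a string')
  }
  return { score, reasoning }
}

// ASSUMPTION: linear min-max rescaling, so 1 -> 0, 3 -> 0.5, 5 -> 1.
// The actual formula used on save may differ.
function normalizeScore(score: number, min = 1, max = 5): number {
  return (score - min) / (max - min)
}
```

Under that assumption, a judge reply of `{"score": 4, "reasoning": "..."}` would be stored as 0.75.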
Step 2. Run it once by hand
Confirm the evaluator works before automating. From the evaluator detail page, click Run and select:
- Source: production.
- Prompt version: rag-system@1.
- Time window: last 24 hours.
- Sample size: 25 (small for a smoke test).
The server samples 25 responses tagged with prompt_version_id = rag-system@1, asks the judge to score each, and writes one eval_results row per sample. Total run cost is shown on completion; for 25 samples on gpt-4o the bill is usually under $0.10.
When the run shows Completed, the average score and the five lowest-scoring drilldowns appear. Read the low ones. If the judge is being nitpicky in ways you do not care about, tighten the criterion and re-run.
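If you want to reproduce that summary outside the dashboard (say, in a script over exported results), the computation is small. This is a sketch only: the `EvalResult` field names are hypothetical and do not reflect the real eval_results schema.

```typescript
// ASSUMPTION: hypothetical shape for one scored sample; the real
// eval_results row almost certainly has more (and differently named) fields.
interface EvalResult {
  sampleId: string
  score: number // normalized 0..1
}

interface RunSummary {
  average: number | null // null when there are no results
  lowest: EvalResult[]   // the worstN lowest-scoring samples, ascending
}

// Average score plus the lowest-scoring samples for manual review.
function summarize(results: EvalResult[], worstN = 5): RunSummary {
  if (results.length === 0) return { average: null, lowest: [] }
  const total = results.reduce((sum, r) => sum + r.score, 0)
  // Copy before sorting so the caller's array is left untouched.
  const lowest = results.slice().sort((a, b) => a.score - b.score).slice(0, worstN)
  return { average: total / results.length, lowest }
}
```

Reading the `lowest` entries alongside the judge's one-sentence reasoning is usually the fastest way to spot a rubric that penalizes the wrong thing.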
Step 3. Trigger runs via the REST API
The same UI action is also a REST endpoint. Authenticate with your project Spanlens key (sl_live_*).
# Look up the evaluator id once
curl -H "Authorization: Bearer $SPANLENS_API_KEY" \
  https://server.spanlens.io/api/v1/evaluators

# Trigger a new run
curl -X POST https://server.spanlens.io/api/v1/eval-runs \
  -H "Authorization: Bearer $SPANLENS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "evaluator_id": "<id-from-above>",
    "source": "production",
    "prompt_version_id": "<rag-system@1 uuid>",
    "sample_from": "2026-05-30T00:00:00Z",
    "sample_to": "2026-05-31T00:00:00Z",
    "sample_size": 100
  }'

Response includes the new eval_run.id. The run is async; poll GET /api/v1/eval-runs/<id> for status: completed.
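The polling step can be sketched as a small loop with a cap on attempts. The endpoint path and the `completed` status come from the text above; everything else is an assumption, in particular the `failed` status value, the default interval, and the helper name `waitForRun`. The fetch function is passed in so the loop can be exercised without a network.

```typescript
// Minimal structural type so a real fetch or a test double can be passed in.
type FetchLike = (
  url: string,
  init?: { headers?: Record<string, string> },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>

interface PollOptions {
  intervalMs?: number  // wait between polls
  maxAttempts?: number // give up after this many polls
}

// Poll GET /api/v1/eval-runs/<id> until the run reports status "completed".
async function waitForRun(
  fetchFn: FetchLike,
  runId: string,
  apiKey: string,
  { intervalMs = 5000, maxAttempts = 60 }: PollOptions = {},
): Promise<Record<string, unknown>> {
  const url = `https://server.spanlens.io/api/v1/eval-runs/${encodeURIComponent(runId)}`
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await fetchFn(url, { headers: { Authorization: `Bearer ${apiKey}` } })
    if (!res.ok) throw new Error(`poll failed with HTTP ${res.status}`)
    const run = (await res.json()) as Record<string, unknown>
    if (run.status === 'completed') return run
    // ASSUMPTION: a terminal failure is reported as status "failed".
    if (run.status === 'failed') throw new Error(`eval run ${runId} failed`)
    if (attempt < maxAttempts) {
      await new Promise<void>((resolve) => setTimeout(resolve, intervalMs))
    }
  }
  throw new Error(`eval run ${runId} did not complete after ${maxAttempts} polls`)
}
```

On Node 18+ the global `fetch` should satisfy `FetchLike`, so a call could look like `await waitForRun(fetch, runId, apiKey)`.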
Estimate cost before kicking off: POST /api/v1/eval-runs/estimate accepts the same body and returns the expected judge cost based on average sample size.
Step 4. Schedule it
Now make it run automatically every night at 2am UTC. Three common patterns; pick what your stack already has.
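Whichever scheduler you pick, the request needs a sample_from / sample_to pair. The examples in this step use a rolling 24 hours from the moment the job fires, which shifts slightly whenever the scheduler runs late. If you would rather score exactly one calendar day, as in the Step 3 request, a helper like the following computes the previous UTC day. The name `previousUtcDayWindow` is hypothetical, and whether the API treats sample_to as inclusive or exclusive is not stated here, so treat the boundary as an assumption.

```typescript
// Returns the previous full UTC calendar day relative to `now`,
// formatted like the timestamps in the Step 3 request body.
function previousUtcDayWindow(now: Date): { sample_from: string; sample_to: string } {
  // Midnight UTC at the start of the current day.
  const todayStart = Date.UTC(now.getUTCFullYear(), now.getUTCMonth(), now.getUTCDate())
  const dayMs = 24 * 60 * 60 * 1000 // UTC has no DST, so a day is always 24h
  // toISOString() always emits milliseconds; drop ".000" to match Step 3.
  const iso = (ms: number) => new Date(ms).toISOString().replace('.000Z', 'Z')
  return { sample_from: iso(todayStart - dayMs), sample_to: iso(todayStart) }
}

// A job firing at 02:00:07 UTC on 31 May 2026 yields:
// { sample_from: '2026-05-30T00:00:00Z', sample_to: '2026-05-31T00:00:00Z' }
const nightlyWindow = previousUtcDayWindow(new Date('2026-05-31T02:00:07Z'))
```

Spreading the result into the request body keeps each night's sample aligned to the same daily buckets the trend line shows.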
Vercel Cron
Add a route handler in your app and declare a cron in vercel.json.
// vercel.json
{
  "crons": [
    { "path": "/api/cron/nightly-eval", "schedule": "0 2 * * *" }
  ]
}

// app/api/cron/nightly-eval/route.ts
export async function GET(req: Request) {
  // Vercel sends an Authorization header you can verify against CRON_SECRET
  if (req.headers.get('authorization') !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response('unauthorized', { status: 401 })
  }
  const now = new Date()
  const yesterday = new Date(now.getTime() - 24 * 60 * 60 * 1000)
  const isoDay = (d: Date) => d.toISOString()
  const res = await fetch('https://server.spanlens.io/api/v1/eval-runs', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${process.env.SPANLENS_API_KEY}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      evaluator_id: process.env.SPANLENS_EVALUATOR_ID,
      source: 'production',
      prompt_version_id: process.env.SPANLENS_PROMPT_VERSION_ID,
      sample_from: isoDay(yesterday),
      sample_to: isoDay(now),
      sample_size: 100,
    }),
  })
  if (!res.ok) return new Response(await res.text(), { status: 500 })
  return Response.json(await res.json())
}

GitHub Actions
# .github/workflows/nightly-eval.yml
name: Nightly Spanlens eval
on:
  schedule:
    - cron: '0 2 * * *'
  workflow_dispatch:
jobs:
  trigger:
    runs-on: ubuntu-latest
    steps:
      - name: Trigger eval run
        env:
          SPANLENS_API_KEY: ${{ secrets.SPANLENS_API_KEY }}
          EVALUATOR_ID: ${{ vars.SPANLENS_EVALUATOR_ID }}
          PV_ID: ${{ vars.SPANLENS_PROMPT_VERSION_ID }}
        run: |
          NOW=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
          FROM=$(date -u -d '1 day ago' +"%Y-%m-%dT%H:%M:%SZ")
          curl -sf -X POST https://server.spanlens.io/api/v1/eval-runs \
            -H "Authorization: Bearer $SPANLENS_API_KEY" \
            -H "Content-Type: application/json" \
            -d "{
              \"evaluator_id\": \"$EVALUATOR_ID\",
              \"source\": \"production\",
              \"prompt_version_id\": \"$PV_ID\",
              \"sample_from\": \"$FROM\",
              \"sample_to\": \"$NOW\",
              \"sample_size\": 100
            }"

Plain cron
# crontab -e
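0 2 * * * /usr/local/bin/nightly-eval.sh >> /var/log/nightly-eval.log 2>&1

The shell script is the same curl from the GitHub Actions job. cron evaluates the schedule in the machine's local time zone, so this fires at 2am UTC only if the box runs on UTC.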
Step 5. Alert when scores drop
Set up an alert on the evaluator's average score metric. When the average drops below 0.7 (or whatever your baseline is), the alert fires a webhook to Slack / PagerDuty, so you catch a prompt regression before users complain.
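The threshold rule itself is simple enough to sketch, which can help when choosing a baseline or when mirroring the check in your own cron handler. This is an illustration, not the built-in alert: the function name and message wording are hypothetical. The `{ text }` shape is the minimal payload a Slack incoming webhook accepts.

```typescript
// Minimal Slack incoming-webhook payload.
interface SlackMessage {
  text: string
}

// Returns a message when the average is strictly below the threshold,
// or null when the score is healthy and nothing should be sent.
function buildScoreAlert(
  promptVersion: string,
  averageScore: number,
  threshold = 0.7,
): SlackMessage | null {
  if (averageScore >= threshold) return null
  return {
    text:
      `Eval alert: ${promptVersion} averaged ${averageScore.toFixed(2)}, ` +
      `below the ${threshold.toFixed(2)} threshold.`,
  }
}
```

A score exactly at the threshold does not alert here, matching the "below 0.7" wording above.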
Cost guardrails
- The judge cost is paid through your provider key (the OpenAI / Anthropic key you registered). It shows up as normal requests in /requests tagged with x-spanlens-internal=eval.
- A sample of 100 against gpt-4o costs roughly $0.30 to $0.80 per run. Nightly, that comes to roughly $9 to $25 per month per evaluator. Use a cheaper judge model if cost matters more than calibration.
- The POST /api/v1/eval-runs/estimate endpoint returns a pre-run cost estimate. Useful as a sanity check in the cron handler before firing the real request.
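A budget check in the cron handler could look like the sketch below. The endpoint and the fact that it accepts the same body as a real run come from this tutorial; the response field `estimated_cost_usd` is a hypothetical name, so check the Evals reference for the actual response shape before relying on it.

```typescript
// Minimal structural type so a real fetch or a test double can be passed in.
type PostFetch = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number; json(): Promise<unknown> }>

// Ask for a cost estimate using the same body as the real run and
// report whether it fits under the given budget (in USD).
async function estimateWithinBudget(
  fetchFn: PostFetch,
  apiKey: string,
  runBody: Record<string, unknown>,
  maxUsd: number,
): Promise<boolean> {
  const res = await fetchFn('https://server.spanlens.io/api/v1/eval-runs/estimate', {
    method: 'POST',
    headers: { Authorization: `Bearer ${apiKey}`, 'Content-Type': 'application/json' },
    body: JSON.stringify(runBody),
  })
  if (!res.ok) throw new Error(`estimate failed with HTTP ${res.status}`)
  const data = (await res.json()) as Record<string, unknown>
  // ASSUMPTION: hypothetical field name for the estimated judge cost.
  const cost = data.estimated_cost_usd
  if (typeof cost !== 'number') {
    throw new Error('estimate response has no numeric estimated_cost_usd')
  }
  return cost <= maxUsd
}
```

In the handler, skipping the POST to /api/v1/eval-runs when this returns false (and logging why) keeps a misconfigured sample size from turning into a surprise bill.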
Tuning the rubric
Two failure modes to watch for:
- Judge is too lenient. Average score stuck at 0.95 forever, even on bad responses. Make the rubric stricter (more explicit deductions for each failure type).
- Judge is too strict. Average score never above 0.6, lots of 2s on perfectly good answers. Add explicit examples of what counts as a 5.
Iterate on the rubric the same way you iterate on a prompt: version it, score the same N samples with v1 vs v2 of the rubric, eyeball the deltas.
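The v1 vs v2 comparison can be sketched as a per-sample diff. This assumes you have both sets of normalized scores keyed by a sample id; the function name and data shape are hypothetical.

```typescript
interface RubricDelta {
  sampleId: string
  v1: number
  v2: number
  delta: number // v2 - v1; negative means the new rubric scored lower
}

// Compare two rubric versions on the samples both of them scored.
// Deltas are sorted by magnitude so the biggest disagreements come first.
function compareRubrics(
  v1: Record<string, number>,
  v2: Record<string, number>,
): { deltas: RubricDelta[]; meanDelta: number | null } {
  const deltas: RubricDelta[] = []
  for (const sampleId of Object.keys(v1)) {
    const s1 = v1[sampleId]
    const s2 = v2[sampleId]
    // Skip samples that only one rubric version scored.
    if (s1 === undefined || s2 === undefined) continue
    deltas.push({ sampleId, v1: s1, v2: s2, delta: s2 - s1 })
  }
  deltas.sort((a, b) => Math.abs(b.delta) - Math.abs(a.delta))
  const meanDelta =
    deltas.length === 0 ? null : deltas.reduce((sum, d) => sum + d.delta, 0) / deltas.length
  return { deltas, meanDelta }
}
```

The mean tells you whether the new rubric is stricter or more lenient overall; the top few deltas are the samples worth reading by hand.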
Next: Evals reference for full config options, or Prompt A/B to compare two prompt versions on the same evaluator.