LLM-as-judge scores production responses against a criterion you define. Cost is billed to your provider key.
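As a rough illustration of what scoring a response against a criterion can look like, here is a minimal sketch. It is not this product's implementation: the prompt wording, the 0 to 100 scale, and the names `build_judge_prompt`, `judge`, and `fake_model` are all assumptions made for the example. The model call is a plain callable so the sketch runs without any provider key.

```python
import re
from typing import Callable, Optional


def build_judge_prompt(criterion: str, response: str) -> str:
    """Assemble a grading prompt (wording is hypothetical)."""
    return (
        "You are grading a response against one criterion.\n"
        f"Criterion: {criterion}\n"
        f"Response: {response}\n"
        "Reply with a single score from 0 to 100."
    )


def judge(
    criterion: str,
    response: str,
    call_model: Callable[[str], str],
) -> Optional[float]:
    """Ask the judge model for a score and parse the first number in its reply.

    Returns None when the reply contains no number. Scores are clamped to
    the assumed 0-100 range.
    """
    reply = call_model(build_judge_prompt(criterion, response))
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        return None
    return min(100.0, max(0.0, float(match.group())))


def fake_model(prompt: str) -> str:
    """Stand-in for a real provider call (hypothetical, always answers 92)."""
    return "Score: 92"


if __name__ == "__main__":
    print(judge("Tone is polite and professional", "Happy to help!", fake_model))
```

In a real setup, `call_model` would wrap a request to the judge model using your own provider key, which is where the cost arises.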
LLM judge vs Human agreement
customer-support-v2
Pearson r: n/a
1 paired sample
Dot = one request judged by both. Dashed line = perfect agreement.
data-extraction
Pearson r: -1.00 · Strong
2 paired samples
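The agreement figures above can be reproduced with a plain Pearson correlation over paired judge and human scores. The sketch below uses the standard formula. The function name `pearson_r` and the sample scores are hypothetical, and returning `None` for undefined cases is a choice made for this example rather than documented behavior of the dashboard.

```python
import math
from typing import Optional, Sequence


def pearson_r(
    judge_scores: Sequence[float],
    human_scores: Sequence[float],
) -> Optional[float]:
    """Pearson correlation between paired judge and human scores.

    Returns None when the correlation is undefined: fewer than two pairs,
    or zero variance in either series.
    """
    if len(judge_scores) != len(human_scores):
        raise ValueError("scores must be paired one-to-one")
    n = len(judge_scores)
    if n < 2:
        return None
    mean_j = sum(judge_scores) / n
    mean_h = sum(human_scores) / n
    cov = sum((j - mean_j) * (h - mean_h) for j, h in zip(judge_scores, human_scores))
    var_j = sum((j - mean_j) ** 2 for j in judge_scores)
    var_h = sum((h - mean_h) ** 2 for h in human_scores)
    if var_j == 0 or var_h == 0:
        return None
    return cov / math.sqrt(var_j * var_h)


if __name__ == "__main__":
    # One paired sample: correlation is undefined.
    print(pearson_r([92.0], [80.0]))
    # Two paired samples (made-up values) that move in opposite directions.
    print(pearson_r([60.0, 80.0], [90.0, 70.0]))
```

With exactly two paired samples whose values differ, r is always exactly +1 or -1, so a "Strong" label at that sample size says more about the sample count than about how well the judge and the human reviewer agree.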
Evaluator | Avg score | Runs
Tone (customer-support-v2 · judge: gpt-4o-mini) | 92.0 | 1 run
JSON validity (data-extraction · judge: claude-3-5-haiku-20241022) | 71.0 | 1 run
This is sample data.