Offline side-by-side: runs both prompt versions on a dataset and compares outputs. Unlike A/B (Prompts), no production traffic is affected.
| Name | Status | A score | B score | Δ | Cost |
|---|---|---|---|---|---|
| Support v6 vs v7 (customer-support-v2 · gpt-4o-mini) | Completed | 76.0 | 84.0 | +8.0 | $0.180 |
| Extraction haiku vs sonnet (data-extraction · claude-3-5-haiku-20241022) | Completed | 68.0 | 71.0 | +3.0 | $0.090 |
| Classifier prompt tweak (email-classifier · gpt-4o-mini) | Running | — | — | — | $0.011 |
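The offline comparison described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual implementation: the names `compare_offline`, `exact_match`, `version_a`, and `version_b` are hypothetical, the two "prompt versions" are plain functions standing in for model calls, and averaging per-example scores on a 0-100 scale is an assumption. The only part taken from the table is that Δ is the B score minus the A score.

```python
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical example shape: {"input": ..., "expected": ...}
Example = Dict[str, str]


def exact_match(expected: str, output: str) -> float:
    """Hypothetical scorer: 100 for an exact match, 0 otherwise."""
    return 100.0 if output.strip() == expected.strip() else 0.0


def compare_offline(
    dataset: List[Example],
    run_a: Callable[[str], str],
    run_b: Callable[[str], str],
    scorer: Callable[[str, str], float] = exact_match,
) -> Dict[str, float]:
    """Run both versions over the same dataset and compare mean scores.

    Nothing here touches live traffic: both versions only see dataset rows.
    """
    a_scores = [scorer(ex["expected"], run_a(ex["input"])) for ex in dataset]
    b_scores = [scorer(ex["expected"], run_b(ex["input"])) for ex in dataset]
    a_score, b_score = mean(a_scores), mean(b_scores)
    # Delta follows the table's convention: B score minus A score.
    return {"a_score": a_score, "b_score": b_score, "delta": b_score - a_score}


# Stand-ins for two prompt versions (assumptions; a real run would call a model).
def version_a(text: str) -> str:
    return "spam" if "free" in text else "ham"


def version_b(text: str) -> str:
    return "spam" if ("free" in text or "winner" in text) else "ham"


dataset = [
    {"input": "free gift inside", "expected": "spam"},
    {"input": "you are a winner", "expected": "spam"},
    {"input": "meeting at 3pm", "expected": "ham"},
    {"input": "lunch tomorrow?", "expected": "ham"},
]

result = compare_offline(dataset, version_a, version_b)
print(result)  # {'a_score': 75.0, 'b_score': 100.0, 'delta': 25.0}
```

Because both versions are scored on identical rows, the difference between the two averages reflects the prompt change rather than a difference in inputs, which is the main reason to run the comparison offline before an A/B test.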