# Datasets
Datasets are named collections of (input, expected_output?) pairs. Evals and Experiments can use a dataset instead of sampling from live production traffic when you want to evaluate against a fixed, controlled input set.
## When to use a dataset
- No production traffic yet — you want to evaluate a prompt before the first real calls accumulate.
- Sensitive production data — healthcare, finance, or other regulated domains where you need an anonymized set.
- Regression test set — a curated golden set of 30 past failure cases that every new prompt version must handle correctly.
## Schema
### `datasets` table
- `name` — unique within the organization
- `description` — free text
- `archived_at` — soft delete
### `dataset_items` table
- `input` (jsonb) — two shapes are accepted:

  ```json
  { "variables": { "company_name": "Acme", "customer_name": "Alice" } }
  { "messages": [{ "role": "user", "content": "..." }] }
  ```

- `expected_output` — reference answer text (optional). Used as the scoring target when running Evals in dataset mode; items without a value are skipped.
- `source_request_id` — set when the item was imported from a production request.
## Three ways to add items
### 1. Manual entry (dashboard)
Go to `/datasets`, select a dataset, click **Add item**, then toggle between two input modes:
- **User message** — a single chat-style user message
- **Variables JSON** — for prompts with `{{var}}` placeholders
### 2. Import from production requests (API)
```bash
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items/import-requests \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "requestIds": ["uuid-1", "uuid-2", ...] }'
```

The server extracts `request_body.messages` as `input` and the response text as `expected_output`, saving them in bulk (max 200 per request).
### 3. Single item (API)
```bash
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "input": { "variables": { "name": "Alice" } },
    "expectedOutput": "Hello Alice, how can I help?"
  }'
```
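For scripted setups, the dataset can be created and populated in one pass using the endpoints from the API table below. A minimal sketch; reading `.id` out of the create response is an assumption about the response shape:

```bash
# Sketch: create a dataset, then add one item to it.
# The ".id" path into the create response is an assumption.
DATASET_ID=$(curl -s https://spanlens-server.vercel.app/api/v1/datasets \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "name": "greeting-regressions", "description": "golden set" }' |
  jq -r '.id')

curl -s "https://spanlens-server.vercel.app/api/v1/datasets/$DATASET_ID/items" \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "variables": { "name": "Alice" } }, "expectedOutput": "Hello Alice, how can I help?" }'
```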
## Connection to Evals (replay mode)

When running an Eval, select **Source: Dataset** to score the dataset's `expected_output` values instead of live production responses. Items without an `expected_output` are skipped.
This is called "replay mode" — scoring already-generated outputs. Fresh-run mode (run the prompt against each dataset input, then score the new outputs) is handled by Experiments.
## API
| Method + Path | Description |
|---|---|
| `POST /api/v1/datasets` | Create a dataset |
| `GET /api/v1/datasets` | List datasets with `item_count` |
| `GET /api/v1/datasets/:id` | Dataset with all items |
| `DELETE /api/v1/datasets/:id` | Soft archive |
| `POST /api/v1/datasets/:id/items` | Add a single item |
| `POST /api/v1/datasets/:id/items/import-requests` | Bulk import from request IDs (max 200) |
| `DELETE /api/v1/datasets/:id/items/:itemId` | Delete a single item |
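As a usage example, listing and archiving from a shell might look like the sketch below; the list response shape (an array of objects with `id`, `name`, and `item_count`) is assumed, not documented:

```bash
# Sketch: list datasets with their item counts, then soft-archive one.
# The array-of-objects response shape is an assumption.
curl -s https://spanlens-server.vercel.app/api/v1/datasets \
  -H "Authorization: Bearer $SPANLENS_JWT" |
  jq -r '.[] | "\(.id)\t\(.name)\t\(.item_count)"'

# DELETE performs a soft archive (sets archived_at), per the table above.
curl -s -X DELETE https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id> \
  -H "Authorization: Bearer $SPANLENS_JWT"
```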
## Limitations
- No CSV upload. Items must be added via the dashboard form or the API. CSV import is planned for a later release.
- Evals dataset source is replay mode only. Fresh-run evaluation (running the prompt live against dataset inputs) is handled by Experiments.
- No item edit UI. To correct a wrongly entered item, delete it and add a new one (see the sketch below).
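Until an edit UI exists, the workaround for the last limitation is a delete-and-recreate round trip over the documented item endpoints. A sketch with placeholder IDs:

```bash
# Sketch: "edit" an item by deleting it and re-adding the corrected
# version, since there is no item edit endpoint. IDs are placeholders.
curl -s -X DELETE \
  https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items/<item-id> \
  -H "Authorization: Bearer $SPANLENS_JWT"

curl -s https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "variables": { "name": "Alice" } }, "expectedOutput": "Corrected expected output" }'
```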
Related: Evals, Experiments, the `/datasets` dashboard.