Datasets
Named collections of (input, expected_output?) pairs. Evals and Experiments can use a dataset instead of sampling from live production traffic when you want evaluations against a fixed, controlled input set.
When to use a dataset
- No production traffic yet: you want to evaluate a prompt before the first real calls accumulate.
- Sensitive production data: healthcare, finance, or other regulated domains where you need an anonymized set.
- Regression test set: a curated golden set of 30 past failure cases that every new prompt version must handle correctly.
Schema
datasets table
- name: unique within the organization
- description: free text
- archived_at: soft delete
dataset_items table
- input (jsonb): two shapes are accepted:
  { "variables": { "company_name": "Acme", "customer_name": "Alice" } }
  { "messages": [{ "role": "user", "content": "..." }] }
  Bulk upload also accepts a plain string for input; the server wraps it as a single user message automatically.
- expected_output: reference answer text (optional). Stored alongside the item but not consumed by the eval runner in this release. The runner generates a fresh response per item and scores that, so prompt quality is what gets measured.
- source_request_id: set when the item was imported from a production request.
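The two accepted input shapes, plus the plain-string wrapping described for bulk upload, can be sketched as a small normalizer. This is an illustration only: `normalize_input` is a hypothetical helper, not part of any Spanlens SDK, and the server's actual validation rules may differ.

```python
from typing import Any


def normalize_input(raw: Any) -> dict:
    """Return a dataset item input in one of the two documented shapes.

    Hypothetical helper for illustration; not a Spanlens API.
    """
    # A plain string becomes a single user message, mirroring the wrapping
    # the docs describe for bulk upload.
    if isinstance(raw, str):
        return {"messages": [{"role": "user", "content": raw}]}

    if isinstance(raw, dict):
        if isinstance(raw.get("variables"), dict):
            return {"variables": raw["variables"]}
        if isinstance(raw.get("messages"), list):
            return {"messages": raw["messages"]}

    raise ValueError(
        "input must be a string, {variables: {...}}, or {messages: [...]}"
    )
```

For example, `normalize_input("What is 2+2?")` yields a `messages` list holding one user message, while a `variables` object passes through unchanged.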
Four ways to add items
1. Manual entry (dashboard)
Go to /datasets, select a dataset, click Add item, then toggle between two input modes:
- User message: a single chat-style user message
- Variables JSON: for prompts with {{var}} placeholders
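To make the Variables JSON mode concrete, the sketch below fills `{{var}}` placeholders in a prompt from an item's `variables` object. The substitution rules here (word-character names, unknown placeholders left as-is) are assumptions for illustration; the page does not specify how Spanlens renders templates, and `render_template` is a hypothetical name.

```python
import re


def render_template(template: str, variables: dict) -> str:
    """Fill {{name}} placeholders from a variables mapping (illustrative only)."""

    def substitute(match: "re.Match") -> str:
        key = match.group(1)
        # Leave unknown placeholders untouched so missing variables stay visible.
        return str(variables[key]) if key in variables else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


# Example item input in the "variables" shape, with a made-up prompt.
item_input = {"variables": {"company_name": "Acme", "customer_name": "Alice"}}
prompt = "Welcome to {{company_name}}, {{customer_name}}!"
rendered = render_template(prompt, item_input["variables"])
```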
2. Import from production requests (API)
curl https://server.spanlens.io/api/v1/datasets/<dataset-id>/items/import-requests \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{ "requestIds": ["uuid-1", "uuid-2", ...] }'
The server extracts request_body.messages as input and the response text as expected_output, then saves the items in bulk (max 200 per request).
3. Single item (API)
curl https://server.spanlens.io/api/v1/datasets/<dataset-id>/items \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{
"input": { "variables": { "name": "Alice" } },
"expectedOutput": "Hello Alice, how can I help?"
}'
4. File upload (dashboard)
Two entry points accept a file and bulk-insert its rows in one step:
- New dataset dialog — on the /datasets list page, click New dataset, then switch to the Upload file tab. Pick a file, confirm the name (auto-filled from the filename), and click Create. A new dataset is created and all rows are inserted immediately.
- Import items button — on an existing dataset's detail page, click Import items in the top-right and pick a file. Rows are added to the existing dataset and a confirmation count appears inline.
All parsing is done in the browser: the file itself is never sent to the server, only the parsed rows are. Accepted formats:
- JSON array (.json): an array of { input, expected_output? } objects where input is { messages: [...] } or { variables: {...} }.
  [
    { "input": { "messages": [{ "role": "user", "content": "What is 2+2?" }] }, "expected_output": "4" },
    { "input": { "variables": { "name": "Alice" } }, "expected_output": "Hello Alice" }
  ]
- JSONL (.jsonl): one JSON object per line, same shape as above.
- CSV (.csv): header row with input (required) and expected_output (optional). Quoted fields are supported. Plain-text input values are automatically wrapped as a single user message.
  input,expected_output
  What is the capital of France?,Paris
  "What is 2+2?",4
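The mapping from each file format to item rows can be sketched as follows. This is not the dashboard's parser (which runs in the browser); `parse_items` is a hypothetical helper that only illustrates the row shapes described above. The output keeps the file's `expected_output` field name; the field name the bulk endpoint expects is not stated on this page.

```python
import csv
import io
import json


def parse_items(text: str, fmt: str) -> list:
    """Parse file contents into {input, expected_output?} rows.

    fmt is "json", "jsonl", or "csv". Hypothetical helper for illustration.
    """
    if fmt == "json":
        rows = json.loads(text)
    elif fmt == "jsonl":
        # One JSON object per non-empty line.
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
    elif fmt == "csv":
        # csv.DictReader handles quoted fields and uses the header row as keys.
        rows = list(csv.DictReader(io.StringIO(text)))
    else:
        raise ValueError(f"unsupported format: {fmt}")

    items = []
    for row in rows:
        value = row["input"]
        # Plain-text input is wrapped as a single user message.
        if isinstance(value, str):
            value = {"messages": [{"role": "user", "content": value}]}
        item = {"input": value}
        if row.get("expected_output"):
            item["expected_output"] = row["expected_output"]
        items.append(item)
    return items
```

Running it on the CSV sample above produces two items, each with a single user message as `input` and the second column as `expected_output`.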
Behind the scenes the upload calls:
curl https://server.spanlens.io/api/v1/datasets/<dataset-id>/items/bulk \
-H "Authorization: Bearer $SPANLENS_JWT" \
-H "Content-Type: application/json" \
-d '{ "items": [ { "input": { "messages": [...] } }, ... ] }'
Max 5000 items per request.
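Because each bulk request is capped at 5000 items, a larger set has to be split into several request bodies. A minimal sketch of that batching, assuming the `{ "items": [...] }` body shape shown in the curl call (`bulk_payloads` and `MAX_BULK_ITEMS` are hypothetical names):

```python
from typing import Iterator

MAX_BULK_ITEMS = 5000  # per-request limit stated for the bulk endpoint


def bulk_payloads(items: list, limit: int = MAX_BULK_ITEMS) -> Iterator[dict]:
    """Yield one {"items": [...]} request body per batch of at most `limit` items."""
    if limit < 1:
        raise ValueError("limit must be positive")
    for start in range(0, len(items), limit):
        yield {"items": items[start:start + limit]}
```

Each yielded body would then be sent to the bulk endpoint with the same headers as the curl call; a set of 12,001 items, for instance, becomes three requests of 5000, 5000, and 2001 items.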
Connection to Evals
When running an Eval, select Source: Dataset, then pick a Run provider and Run model. The eval runner does this per item: take the dataset input, run it through the chosen prompt version with that run model, then send the generated response to the judge for scoring.
Older releases scored the dataset's expected_output text directly. That measured how friendly the curated reference text was, not how the prompt actually behaves, so it was replaced. expected_output is stored but unused by the runner in the current release.
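The per-item flow can be sketched as a simple loop. This is an assumption-laden illustration, not the runner's implementation: `generate` stands in for "run the input through the chosen prompt version with the run model", `judge` stands in for the judge call, and passing the input to the judge alongside the response is a guess. Note that `expected_output` is never read.

```python
from typing import Callable


def run_eval(
    items: list,
    generate: Callable[[dict], str],
    judge: Callable[[dict, str], float],
) -> list:
    """Score a freshly generated response for every dataset item.

    generate and judge are hypothetical stand-ins for the run model call
    and the judge call; expected_output is deliberately never read.
    """
    results = []
    for item in items:
        response = generate(item["input"])      # prompt version + run model
        score = judge(item["input"], response)  # judge scores the new response
        results.append(
            {"input": item["input"], "response": response, "score": score}
        )
    return results
```

With stub callables in place of real model calls, the loop returns one result per item, each holding the generated response and its score.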
API
| Method + Path | Description |
|---|---|
| POST /api/v1/datasets | Create a dataset |
| GET /api/v1/datasets | List datasets with item_count |
| GET /api/v1/datasets/:id | Dataset with all items |
| DELETE /api/v1/datasets/:id | Soft archive |
| POST /api/v1/datasets/:id/items | Add a single item |
| POST /api/v1/datasets/:id/items/import-requests | Bulk import from request IDs (max 200) |
| POST /api/v1/datasets/:id/items/bulk | Bulk insert pre-parsed items (max 5000). Used by the Eval Run dialog file upload. |
| DELETE /api/v1/datasets/:id/items/:itemId | Delete a single item |
Limitations
- No item edit UI. To correct a wrongly entered item, delete it and add a new one.
- Bulk upload limit is 5000 items. Larger sets must be split across multiple calls.
- expected_output is reference only. The eval runner does not currently feed it to the judge as a target. A future release may add a similarity mode that uses it.
Related: Evals, Experiments, /datasets dashboard.