Datasets

A dataset is a named collection of (input, expected_output) pairs, where expected_output is optional. Evals and Experiments can run against a dataset instead of sampling live production traffic when you want a fixed, controlled input set.

When to use a dataset

No production traffic yet — you want to evaluate a prompt before the first real calls accumulate.
Sensitive production data — healthcare, finance, or other regulated domains where you need an anonymized set.
Regression test set — a curated golden set of 30 past failure cases that every new prompt version must handle correctly.

Schema

`datasets` table

name — unique within the organization
description — free text
archived_at — soft delete

`dataset_items` table

input (jsonb) — two shapes are accepted:

```json
{ "variables": { "company_name": "Acme", "customer_name": "Alice" } }
{ "messages": [{ "role": "user", "content": "..." }] }
```

expected_output — reference answer text (optional). Used as the scoring target when running Evals in dataset mode. Items without a value are skipped.
source_request_id — set when the item was imported from a production request.
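
The two accepted input shapes can be told apart before an item is submitted. The following is a minimal client-side sketch, not the server's validation logic: the function name `input_shape` is hypothetical, and it assumes an input must contain exactly one of the two keys and that each message carries `role` and `content`, neither of which is stated above.

```python
def input_shape(item_input):
    """Return "variables" or "messages" for a recognised input, else raise ValueError.

    Assumption (not from the docs): exactly one of the two keys must be present.
    """
    variables = item_input.get("variables")
    messages = item_input.get("messages")

    has_variables = isinstance(variables, dict)
    # Assumption: every message is an object with "role" and "content" keys.
    has_messages = isinstance(messages, list) and all(
        isinstance(m, dict) and "role" in m and "content" in m for m in messages
    )

    if has_variables == has_messages:
        raise ValueError("input must contain exactly one of 'variables' or 'messages'")
    return "variables" if has_variables else "messages"
```

Running a check like this before calling the API catches malformed items early, though the server remains the authority on what it accepts.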

Three ways to add items

1. Manual entry (dashboard)

Go to /datasets, select a dataset, click Add item, then toggle between two input modes:

User message — a single chat-style user message
Variables JSON — for prompts with {{var}} placeholders

2. Import from production requests (API)

```bash
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items/import-requests \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "requestIds": ["uuid-1", "uuid-2", ...] }'
```

The server extracts request_body.messages as the input and the response text as expected_output, then saves the items in bulk. Each call accepts at most 200 request IDs.
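
Because one import call accepts at most 200 request IDs, a longer list has to be split across several calls. A minimal sketch of that batching, assuming the `requestIds` body shown in the curl example; the helper name `import_payloads` is hypothetical and the code only builds the JSON bodies, it does not send them.

```python
import json

MAX_IDS_PER_IMPORT = 200  # per-call limit stated in the docs


def import_payloads(request_ids, batch_size=MAX_IDS_PER_IMPORT):
    """Return a list of JSON bodies, each holding at most batch_size request IDs."""
    if not 1 <= batch_size <= MAX_IDS_PER_IMPORT:
        raise ValueError("batch_size must be between 1 and 200")
    ids = list(request_ids)
    return [
        json.dumps({"requestIds": ids[start:start + batch_size]})
        for start in range(0, len(ids), batch_size)
    ]
```

Each returned string can be sent as the body of one `POST .../items/import-requests` call.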

3. Single item (API)

```bash
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "input": { "variables": { "name": "Alice" } },
    "expectedOutput": "Hello Alice, how can I help?"
  }'
```
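
The same request can be prepared from a script. This sketch uses only Python's standard library and mirrors the URL, headers, and body fields of the curl example; the function name `build_add_item_request` is hypothetical, and leaving out `expectedOutput` when no reference answer exists is an assumption about how the API treats the optional field.

```python
import json
import urllib.request

BASE_URL = "https://spanlens-server.vercel.app/api/v1"


def build_add_item_request(dataset_id, item_input, expected_output=None, jwt=""):
    """Build (but do not send) the POST request that adds one dataset item."""
    body = {"input": item_input}
    if expected_output is not None:
        # Assumption: the field may simply be omitted when there is no reference answer.
        body["expectedOutput"] = expected_output
    return urllib.request.Request(
        url=f"{BASE_URL}/datasets/{dataset_id}/items",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {jwt}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would perform the call; the sketch stops short of that so it can be inspected without network access.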

Connection to Evals (replay mode)

When running an Eval, select Source: Dataset to score the dataset's expected_output values instead of live production responses. Items without an expected_output are skipped.

This is called "replay mode": the Eval scores outputs that already exist rather than generating new ones. Fresh-run mode, which runs the prompt against each dataset input and then scores the new outputs, is handled by Experiments.
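
Replay mode can be pictured as a filter followed by a scoring pass over stored outputs. This is an illustrative sketch only, not the Evals implementation: `replay_scores`, the `scorer` callback, and the item dictionary layout (including the `id` key) are all hypothetical, and treating an empty string as "no value" is an assumption.

```python
def replay_scores(items, scorer):
    """Score each item's stored expected_output; skip items that have none.

    `scorer` is a hypothetical callable taking (input, output) and returning a score.
    No prompt is executed here, which is what distinguishes replay from a fresh run.
    """
    results = []
    for item in items:
        output = item.get("expected_output")
        if not output:  # missing, None, or empty: skipped (empty-string case is an assumption)
            continue
        results.append((item["id"], scorer(item["input"], output)))
    return results
```

A fresh run would differ in one step: it would call the model with `item["input"]` to produce a new output before scoring, which is the part Experiments take care of.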

API

Method + Path	Description
`POST /api/v1/datasets`	Create a dataset
`GET /api/v1/datasets`	List datasets with item_count
`GET /api/v1/datasets/:id`	Get a dataset with all its items
`DELETE /api/v1/datasets/:id`	Archive a dataset (soft delete)
`POST /api/v1/datasets/:id/items`	Add a single item
`POST /api/v1/datasets/:id/items/import-requests`	Bulk import from request IDs (max 200)
`DELETE /api/v1/datasets/:id/items/:itemId`	Delete a single item

Limitations

No CSV upload. Items must be added via the dashboard form or the API. CSV import is planned for a later release.
Evals dataset source is replay mode only. Fresh-run evaluation (running the prompt live against dataset inputs) is handled by Experiments.
No item edit UI. To correct a wrongly entered item, delete it and add a new one.
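
The delete-and-re-add workaround for correcting an item maps onto two calls from the API table. A small sketch that only plans those calls, with the hypothetical helper name `replace_item_plan`; it assumes the delete is issued first and does not send anything.

```python
def replace_item_plan(dataset_id, item_id, new_input, new_expected_output=None):
    """Return the (method, path, body) steps that stand in for an item edit."""
    body = {"input": new_input}
    if new_expected_output is not None:
        body["expectedOutput"] = new_expected_output
    return [
        # Step 1: remove the wrongly entered item.
        ("DELETE", f"/api/v1/datasets/{dataset_id}/items/{item_id}", None),
        # Step 2: add the corrected item as a new one.
        ("POST", f"/api/v1/datasets/{dataset_id}/items", body),
    ]
```

The two steps are separate requests, so a failure between them can leave the dataset without either version of the item; checking the dataset afterwards is a sensible precaution.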

Related: Evals, Experiments, /datasets dashboard.