Datasets

Datasets are named collections of (input, expected_output?) pairs. Evals and Experiments can use a dataset instead of sampling from live production traffic when you want to evaluate against a fixed, controlled input set.

When to use a dataset

  • No production traffic yet — you want to evaluate a prompt before the first real calls accumulate.
  • Sensitive production data — healthcare, finance, or other regulated domains where you need an anonymized set.
  • Regression test set — a curated golden set of 30 past failure cases that every new prompt version must handle correctly.

Schema

datasets table

  • name — unique within the organization
  • description — free text
  • archived_at — soft delete

dataset_items table

  • input (jsonb) — two shapes are accepted:

    ```json
    { "variables": { "company_name": "Acme", "customer_name": "Alice" } }
    { "messages": [{ "role": "user", "content": "..." }] }
    ```
  • expected_output — reference answer text (optional). Used as the scoring target when running Evals in dataset mode. Items without a value are skipped.
  • source_request_id — set when the item was imported from a production request.
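
To see these columns in practice, fetch a dataset with its items (the GET /api/v1/datasets/:id endpoint listed under API below). The response shape in the trailing comment is an assumption based on the columns above, not a documented contract:

```bash
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id> \
  -H "Authorization: Bearer $SPANLENS_JWT"
# Assumed response shape:
# { "name": "...", "description": "...",
#   "items": [ { "input": { "variables": { ... } },
#                "expected_output": "...", "source_request_id": null } ] }
```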

Three ways to add items

1. Manual entry (dashboard)

Go to /datasets, select a dataset, click Add item, then toggle between two input modes:

  • User message — a single chat-style user message
  • Variables JSON — for prompts with {{var}} placeholders
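
For example, a prompt containing {{customer_name}} pairs with a Variables JSON entry of "customer_name": "Alice", matching the variables shape shown in the schema above.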

2. Import from production requests (API)

```bash
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items/import-requests \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "requestIds": ["uuid-1", "uuid-2", ...] }'
```

For each request ID, the server extracts request_body.messages as the input and the response text as the expected_output, then saves the items in bulk (max 200 IDs per call).
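
Since each call accepts at most 200 IDs, larger imports need batching. A minimal sketch, assuming a hypothetical request-ids.txt file (one request ID per line) and jq on the PATH:

```bash
# Split the ID list into files of 200 lines each (batch-aa, batch-ab, ...).
split -l 200 request-ids.txt batch-

for f in batch-*; do
  # Turn the lines of one batch file into a JSON array of strings.
  ids=$(jq -R -s -c 'split("\n") | map(select(length > 0))' "$f")
  curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items/import-requests \
    -H "Authorization: Bearer $SPANLENS_JWT" \
    -H "Content-Type: application/json" \
    -d "{ \"requestIds\": $ids }"
done
```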

3. Single item (API)

```bash
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "input": { "variables": { "name": "Alice" } },
    "expectedOutput": "Hello Alice, how can I help?"
  }'
```
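
The same endpoint also accepts the messages input shape from the schema; illustrative values:

```bash
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "input": { "messages": [{ "role": "user", "content": "Where is my order?" }] },
    "expectedOutput": "Let me look that up for you."
  }'
```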

Connection to Evals (replay mode)

When running an Eval, select Source: Dataset to score the dataset's expected_output values instead of live production responses. Items without an expected_output are skipped.

This is called "replay mode": scoring already-generated outputs. Fresh-run mode (running the prompt against each dataset input, then scoring the new outputs) is handled by Experiments.

API

| Method + Path | Description |
| --- | --- |
| POST /api/v1/datasets | Create a dataset |
| GET /api/v1/datasets | List datasets with item_count |
| GET /api/v1/datasets/:id | Dataset with all items |
| DELETE /api/v1/datasets/:id | Soft archive |
| POST /api/v1/datasets/:id/items | Add a single item |
| POST /api/v1/datasets/:id/items/import-requests | Bulk import from request IDs (max 200) |
| DELETE /api/v1/datasets/:id/items/:itemId | Delete a single item |
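
The list above includes no create example; here is a sketch, assuming the POST body mirrors the schema columns (name, description):

```bash
curl https://spanlens-server.vercel.app/api/v1/datasets \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "name": "golden-regressions", "description": "Curated past failure cases" }'
```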

Limitations

  • No CSV upload. Items must be added via the dashboard form or the API. CSV import is planned for a later release.
  • Evals dataset source is replay mode only. Fresh-run evaluation (running the prompt live against dataset inputs) is handled by Experiments.
  • No item edit UI. To correct a wrongly entered item, delete it and add a new one (see the sketch below).
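
A sketch of that delete-and-re-add workaround using the endpoints from the API table (all IDs and values are placeholders):

```bash
# Remove the wrong item...
curl -X DELETE https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items/<item-id> \
  -H "Authorization: Bearer $SPANLENS_JWT"

# ...then add the corrected version.
curl https://spanlens-server.vercel.app/api/v1/datasets/<dataset-id>/items \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "input": { "variables": { "name": "Alice" } }, "expectedOutput": "Corrected reference answer" }'
```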

Related: Evals, Experiments, /datasets dashboard.