Datasets

A dataset is a named collection of (input, expected_output?) pairs. Evals and Experiments can read from a dataset instead of sampling live production traffic, which gives you a fixed, controlled input set to evaluate against.

When to use a dataset

  • No production traffic yet: you want to evaluate a prompt before the first real calls accumulate.
  • Sensitive production data: healthcare, finance, or other regulated domains where you need an anonymized set.
  • Regression test set: a curated golden set of 30 past failure cases that every new prompt version must handle correctly.

Schema

DatasetItem
  id               string         auto
  input            object         OpenAI messages shape
  expected_output  string | null  grading target (optional)
  metadata         jsonb          tags, source, weights

Consumed by: Evals (LLM-as-judge score) and Experiments (diff v1 vs v2).

Each dataset item carries an input, an optional expected output for grading, and free-form metadata. Evals and Experiments consume the same shape.

datasets table

  • name, unique within the organization
  • description, free text
  • archived_at, soft delete

dataset_items table

  • input (jsonb). Two shapes are accepted:
    { "variables": { "company_name": "Acme", "customer_name": "Alice" } }
    { "messages": [{ "role": "user", "content": "..." }] }
    Bulk upload also accepts a plain string for input; the server wraps it as a single user message automatically.
  • expected_output, reference answer text (optional). Stored alongside the item but not consumed by the eval runner in this release. The runner generates a fresh response per item and scores that, so prompt quality is what gets measured.
  • source_request_id, set when the item was imported from a production request.
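
The input rules above can be checked on the client before an item is sent. The following is a minimal sketch, not part of any Spanlens SDK: `normalize_input` is a hypothetical helper name, and the plain-string branch mirrors the wrapping that the text says the server performs for bulk upload.

```python
from typing import Any, Dict, Union


def normalize_input(raw: Union[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Return an input object in one of the two accepted shapes.

    Hypothetical client-side helper; the server-side validation rules
    are not documented here and may differ.
    """
    if isinstance(raw, str):
        # A plain string becomes a single user message.
        return {"messages": [{"role": "user", "content": raw}]}
    if isinstance(raw, dict):
        if isinstance(raw.get("messages"), list):
            return {"messages": raw["messages"]}
        if isinstance(raw.get("variables"), dict):
            return {"variables": raw["variables"]}
    raise ValueError(
        "input must be a string, {'messages': [...]}, or {'variables': {...}}"
    )
```

Rejecting anything else early keeps malformed rows out of a dataset, which matters because items cannot be edited after they are added.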

Four ways to add items

1. Manual entry (dashboard)

Go to /datasets, select a dataset, click Add item, then toggle between two input modes:

  • User message, a single chat-style user message
  • Variables JSON, for prompts with {{var}} placeholders
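
To illustrate how a Variables JSON item relates to a prompt with {{var}} placeholders, here is a small sketch of placeholder substitution. The exact substitution rules used by the product are not described in this page, so `render_template`, the regular expression, and the choice to fail on a missing variable are all assumptions made for illustration.

```python
import re
from typing import Dict

# Matches {{name}} with optional inner whitespace (assumed syntax).
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render_template(template: str, variables: Dict[str, str]) -> str:
    """Fill {{var}} placeholders from a variables mapping (illustrative only)."""

    def substitute(match):
        name = match.group(1)
        if name not in variables:
            # Assumed behavior: fail loudly rather than leave a placeholder behind.
            raise KeyError(f"missing variable: {name}")
        return str(variables[name])

    return PLACEHOLDER.sub(substitute, template)
```

With the variables from the schema example, `render_template("Hello {{customer_name}}, welcome to {{company_name}}.", {"company_name": "Acme", "customer_name": "Alice"})` fills in both names.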

2. Import from production requests (API)

curl https://server.spanlens.io/api/v1/datasets/<dataset-id>/items/import-requests \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "requestIds": ["uuid-1", "uuid-2", ...] }'

The server extracts request_body.messages as input and the response text as expected_output and saves them in bulk (max 200 per request).
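
The mapping described above can be pictured roughly as follows. This is only a sketch of the idea, not the server's code: the text names request_body.messages, while the `id` and `response_text` keys on the logged request are hypothetical field names chosen for this example.

```python
from typing import Any, Dict


def item_from_request(record: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one logged production request into a dataset item (illustrative).

    `record` is a hypothetical dict; only request_body.messages comes from
    the documentation. `id` and `response_text` are assumed names.
    """
    return {
        "input": {"messages": record["request_body"]["messages"]},
        "expected_output": record.get("response_text"),
        "source_request_id": record.get("id"),
    }
```

Keeping source_request_id on the item makes it possible to trace a dataset row back to the production request it came from.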

3. Single item (API)

curl https://server.spanlens.io/api/v1/datasets/<dataset-id>/items \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "input": { "variables": { "name": "Alice" } },
    "expectedOutput": "Hello Alice, how can I help?"
  }'

4. File upload (dashboard)

Two entry points accept a file and bulk-insert its rows in one step:

  • New dataset dialog — on the /datasets list page, click New dataset, then switch to the Upload file tab. Pick a file, confirm the name (auto-filled from the filename), and click Create. A new dataset is created and all rows are inserted immediately.
  • Import items button — on an existing dataset's detail page, click Import items in the top-right and pick a file. Rows are added to the existing dataset and a confirmation count appears inline.

All parsing is done in the browser: the raw file is never sent to the server, only the parsed rows are. Accepted formats:

  • JSON array (.json), an array of { input, expected_output? } objects where input is { messages: [...] } or { variables: {...} }.
    [
      { "input": { "messages": [{ "role": "user", "content": "What is 2+2?" }] }, "expected_output": "4" },
      { "input": { "variables": { "name": "Alice" } }, "expected_output": "Hello Alice" }
    ]
  • JSONL (.jsonl), one JSON object per line, same shape as above.
  • CSV (.csv), header row with input (required) and expected_output (optional). Quoted fields are supported. Plain-text input values are automatically wrapped as a single user message.
    input,expected_output
    What is the capital of France?,Paris
    "What is 2+2?",4
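
For preparing or validating a file offline, the CSV and JSONL formats above can be parsed with the Python standard library. This is a sketch under assumptions: the dashboard does its own parsing in the browser, and `parse_csv` and `parse_jsonl` are hypothetical helper names that only follow the rules stated in the list.

```python
import csv
import io
import json
from typing import Any, Dict, List


def parse_csv(text: str) -> List[Dict[str, Any]]:
    """Parse CSV with an `input` column and an optional `expected_output` column."""
    reader = csv.DictReader(io.StringIO(text, newline=""))
    if reader.fieldnames is None or "input" not in reader.fieldnames:
        raise ValueError("CSV needs a header row with an 'input' column")
    items = []
    for row in reader:
        # Plain-text input is wrapped as a single user message.
        item: Dict[str, Any] = {
            "input": {"messages": [{"role": "user", "content": row["input"]}]}
        }
        expected = row.get("expected_output")
        if expected:
            item["expected_output"] = expected
        items.append(item)
    return items


def parse_jsonl(text: str) -> List[Dict[str, Any]]:
    """Parse one JSON object per line, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

Both functions return a list of items in the { input, expected_output? } shape, which is the same shape the JSON array format uses directly.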

Behind the scenes the upload calls:

curl https://server.spanlens.io/api/v1/datasets/<dataset-id>/items/bulk \
  -H "Authorization: Bearer $SPANLENS_JWT" \
  -H "Content-Type: application/json" \
  -d '{ "items": [ { "input": { "messages": [...] } }, ... ] }'

Max 5000 items per request.
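
Because of the per-request cap, a larger set has to be split into several calls. A minimal sketch of that batching follows; `bulk_payloads` is a hypothetical helper, and only the { "items": [...] } body shape and the 5000 limit come from the text above.

```python
from typing import Any, Dict, Iterator, List

BULK_LIMIT = 5000  # documented maximum items per bulk request


def bulk_payloads(
    items: List[Dict[str, Any]], limit: int = BULK_LIMIT
) -> Iterator[Dict[str, List[Dict[str, Any]]]]:
    """Yield request bodies of at most `limit` items each, preserving order."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    for start in range(0, len(items), limit):
        yield {"items": items[start:start + limit]}
```

Each yielded body can be sent as one POST to the bulk endpoint. An empty list yields nothing, so no request is made for an empty file.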

Connection to Evals

When running an Eval, select Source: Dataset, then pick a Run provider and Run model. The eval runner does this per item: take the dataset input, run it through the chosen prompt version with that run model, then send the generated response to the judge for scoring.
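
The per-item flow can be sketched as a simple loop. This is an illustration of the steps described above, not the actual runner: `run_eval`, `generate`, and `judge` are hypothetical names, and the two callables stand in for the run model call and the judge call.

```python
from typing import Any, Callable, Dict, List


def run_eval(
    items: List[Dict[str, Any]],
    generate: Callable[[Dict[str, Any]], str],
    judge: Callable[[Dict[str, Any], str], float],
) -> List[Dict[str, Any]]:
    """For each item: generate a fresh response, then have the judge score it."""
    results = []
    for item in items:
        # Step 1: run the dataset input through the prompt version and run model.
        response = generate(item["input"])
        # Step 2: score the generated response. expected_output is not passed on.
        score = judge(item["input"], response)
        results.append({"input": item["input"], "response": response, "score": score})
    return results
```

Note that the loop never reads expected_output, which matches the statement that the runner scores the freshly generated response only.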

Older releases scored the dataset's expected_output text directly. That measured the curated reference text (for example, how friendly it read) rather than how the prompt actually behaves, so it was replaced. expected_output is still stored, but the runner in the current release does not use it.

API

Method + Path                                    Description
POST /api/v1/datasets                            Create a dataset
GET /api/v1/datasets                             List datasets with item_count
GET /api/v1/datasets/:id                         Dataset with all items
DELETE /api/v1/datasets/:id                      Soft archive
POST /api/v1/datasets/:id/items                  Add a single item
POST /api/v1/datasets/:id/items/import-requests  Bulk import from request IDs (max 200)
POST /api/v1/datasets/:id/items/bulk             Bulk insert pre-parsed items (max 5000). Used by the Eval Run dialog file upload.
DELETE /api/v1/datasets/:id/items/:itemId        Delete a single item
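
The endpoints above can also be called from a script instead of curl. The sketch below only builds the request with the Python standard library and leaves sending to the caller. `build_request` is a hypothetical helper, and the base URL and Bearer header are copied from the curl examples on this page.

```python
import json
import urllib.request
from typing import Any, Optional

BASE_URL = "https://server.spanlens.io"  # host used in the curl examples


def build_request(
    method: str, path: str, token: str, body: Optional[Any] = None
) -> urllib.request.Request:
    """Build an authenticated request for a datasets endpoint without sending it."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Authorization": f"Bearer {token}"}
    if data is not None:
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, headers=headers, method=method
    )
```

Passing the result to `urllib.request.urlopen` would send it. Response shapes are not documented on this page, so parsing the reply is left out.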

Limitations

  • No item edit UI. To correct a wrongly entered item, delete it and add a new one.
  • Bulk upload limit is 5000 items. Larger sets must be split across multiple calls.
  • expected_output is reference only. The eval runner does not currently feed it to the judge as a target. A future release may add a similarity mode that uses it.

Related: Evals, Experiments, /datasets dashboard.