Cost tracking

Every request row carries a cost_usd field computed at ingest time from your actual token usage and the provider's current list price. No approximations and no post-hoc math in the dashboard: the number is deterministic and auditable.

Why it matters

Provider dashboards show spend a day late and at the billing-period level. That's useless for the daily question, “is my new feature's LLM usage going to blow up the budget?” Spanlens computes cost the moment the response lands, so you can track per-minute, per-project, per-user cost without waiting for an invoice.

How it works

Figure: input vs. output price per 1M tokens for popular models (May 2026 list rates). Output is typically 4–5× input, so your prompt-to-completion ratio drives total spend more than model choice alone.

Price table

Prices live in the model_prices Supabase table, USD per 1M tokens, separately for prompt, completion, and cache read/write. The proxy reads them through an in-memory cache (5-minute stale-while-revalidate) with a hardcoded fallback in apps/server/src/lib/model-prices-cache.ts for cold-start. Updates to the table take effect within 5 minutes per Vercel function instance, no redeploy needed. Snapshot as of 2026-05:

// OpenAI
'gpt-4o':                         { prompt: 2.5,   completion: 10   }
'gpt-4o-mini':                    { prompt: 0.15,  completion: 0.6  }
'gpt-4.1':                        { prompt: 2.0,   completion: 8.0  }
'gpt-4.1-mini':                   { prompt: 0.4,   completion: 1.6  }
'gpt-4.1-nano':                   { prompt: 0.1,   completion: 0.4  }
'gpt-4-turbo':                    { prompt: 10,    completion: 30   }
'gpt-4':                          { prompt: 30,    completion: 60   }
// Anthropic
'claude-opus-4-7':                { prompt: 5,     completion: 25   }
'claude-sonnet-4-6':              { prompt: 3,     completion: 15   }
'claude-haiku-4-5':               { prompt: 1,     completion: 5    }
'claude-3-5-haiku-20241022':      { prompt: 0.8,   completion: 4    }
// Gemini
'gemini-2.5-pro':                 { prompt: 1.25,  completion: 10   }
'gemini-2.5-flash':               { prompt: 0.3,   completion: 2.5  }
'gemini-2.5-flash-lite':          { prompt: 0.1,   completion: 0.4  }
'gemini-2.0-flash':               { prompt: 0.1,   completion: 0.4  }
// ...

The formula

// nonCachedPromptTokens = promptTokens - cacheReadTokens - cacheWriteTokens
promptCost      = (nonCachedPromptTokens / 1_000_000) * price.prompt
cacheReadCost   = (cacheReadTokens       / 1_000_000) * price.cacheRead   // ≈ 0.1× input on Anthropic, 0.5× on OpenAI
cacheWriteCost  = (cacheWriteTokens      / 1_000_000) * price.cacheWrite  // ≈ 1.25× input on Anthropic (5min); n/a on OpenAI
completionCost  = (completionTokens      / 1_000_000) * price.completion
totalCost       = promptCost + cacheReadCost + cacheWriteCost + completionCost
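
The formula above can be sketched as a single function. This is a minimal illustration, not the code in cost.ts: the type names, the computeCostUsd name, and the fallback to the regular prompt rate when a cache rate is missing are all assumptions made for the example.

```typescript
// Hypothetical shapes; field names mirror the formula above, not the real source.
interface ModelPrice {
  prompt: number;       // USD per 1M non-cached prompt tokens
  completion: number;   // USD per 1M completion tokens
  cacheRead?: number;   // USD per 1M cache-read tokens
  cacheWrite?: number;  // USD per 1M cache-write tokens
}

interface TokenUsage {
  promptTokens: number;       // total prompt tokens, cached ones included
  completionTokens: number;
  cacheReadTokens?: number;
  cacheWriteTokens?: number;
}

const PER_MILLION = 1_000_000;

function computeCostUsd(usage: TokenUsage, price: ModelPrice): number {
  const cacheRead = usage.cacheReadTokens ?? 0;
  const cacheWrite = usage.cacheWriteTokens ?? 0;
  // Clamp at zero so inconsistent usage counts never produce a negative cost.
  const nonCached = Math.max(0, usage.promptTokens - cacheRead - cacheWrite);
  // Assumption: a missing cache rate falls back to the regular prompt rate.
  const cacheReadRate = price.cacheRead ?? price.prompt;
  const cacheWriteRate = price.cacheWrite ?? price.prompt;
  const total =
    (nonCached * price.prompt +
      cacheRead * cacheReadRate +
      cacheWrite * cacheWriteRate +
      usage.completionTokens * price.completion) /
    PER_MILLION;
  // Round to 8 decimal places, the scale of a numeric(14,8) column.
  return Math.round(total * 1e8) / 1e8;
}

// gpt-4o-mini, 1,000 prompt tokens and 500 completion tokens:
// (1000 * 0.15 + 500 * 0.6) / 1_000_000
console.log(computeCostUsd(
  { promptTokens: 1000, completionTokens: 500 },
  { prompt: 0.15, completion: 0.6 },
)); // 0.00045
```

Dividing once at the end, rather than per term, keeps the intermediate values as large as possible before the final rounding step.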

Embeddings (2026-06-12+)

OpenAI embedding calls land on /requests with a real cost_usd, not NULL. The proxy was always forwarding POST /v1/embeddings and the parser was always extracting usage.prompt_tokens; the only missing piece was a price row. The three current models (text-embedding-3-small at $0.020 / 1M, text-embedding-3-large at $0.130 / 1M, text-embedding-ada-002 at $0.100 / 1M) are now seeded in model_prices and the in-memory fallback, so cold-start instances calculate the right cost before the cache refreshes from Supabase.

Embeddings are input-only: the completion-side price stays 0, so the completion term of the formula above contributes nothing. For a RAG workload, pricing embedding calls can move the daily spend figure by 30–50%. Only calls made on or after 2026-06-12 are priced; earlier rows keep their original NULL cost.

Prompt caching (2026-05-14+)

Anthropic cache_read_input_tokens / cache_creation_input_tokens and OpenAI prompt_tokens_details.cached_tokens are now extracted and charged at each provider's reduced cache rate. The original behaviour was to fold them into promptTokens at the regular rate, which over-counted cost by 2–10× for cache-heavy workloads. requests.cache_read_tokens / cache_write_tokens store the breakdown alongside every new row.

Historical rows (pre-2026-05-14) keep their original cost_usd and have cache_*_tokens = 0. Backfill isn't possible because the raw breakdown wasn't recorded.
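
A sketch of how the two providers' usage payloads could be normalized into one shape before pricing. The provider field names are the ones quoted above; the NormalizedUsage type and both function names are hypothetical. One detail worth noting: Anthropic's input_tokens excludes cached tokens, while OpenAI's prompt_tokens includes them, so this sketch adds the Anthropic cache counts back in to get a total that matches the nonCachedPromptTokens subtraction in the formula. Whether the Spanlens parser does exactly this is an assumption.

```typescript
// Hypothetical normalized shape used only in this example.
interface NormalizedUsage {
  promptTokens: number;      // total prompt tokens, cached ones included
  completionTokens: number;
  cacheReadTokens: number;
  cacheWriteTokens: number;
}

interface AnthropicUsage {
  input_tokens: number;      // excludes cache reads and cache writes
  output_tokens: number;
  cache_read_input_tokens?: number | null;
  cache_creation_input_tokens?: number | null;
}

interface OpenAIUsage {
  prompt_tokens: number;     // includes cached tokens
  completion_tokens?: number; // absent on embedding responses
  prompt_tokens_details?: { cached_tokens?: number } | null;
}

function normalizeAnthropicUsage(u: AnthropicUsage): NormalizedUsage {
  const cacheRead = u.cache_read_input_tokens ?? 0;
  const cacheWrite = u.cache_creation_input_tokens ?? 0;
  return {
    // Add cached counts back so promptTokens is the full prompt size.
    promptTokens: u.input_tokens + cacheRead + cacheWrite,
    completionTokens: u.output_tokens,
    cacheReadTokens: cacheRead,
    cacheWriteTokens: cacheWrite,
  };
}

function normalizeOpenAIUsage(u: OpenAIUsage): NormalizedUsage {
  return {
    promptTokens: u.prompt_tokens,
    completionTokens: u.completion_tokens ?? 0,
    cacheReadTokens: u.prompt_tokens_details?.cached_tokens ?? 0,
    cacheWriteTokens: 0, // OpenAI reports no separate cache-write count
  };
}
```

With both providers reduced to the same four counters, a single pricing function can handle every row, and the stored cache_read_tokens / cache_write_tokens columns fall out of the same step.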

The dated-variant problem (critical gotcha)

OpenAI returns dated variants in the model field of the response body (e.g. you request gpt-4o-mini and get back gpt-4o-mini-2024-07-18). That dated string is what lands in requests.model. A naive lookup against a price map keyed by gpt-4o-mini would miss and return null.

calculateCost() handles this by:

  1. Exact match first: if gpt-4o-mini-2024-07-18 is in the table, use it.
  2. Otherwise, longest boundary-aware prefix match. The model id must start with a registered key followed by -, so gpt-4 does not accidentally match gpt-4o-mini.

The same matching pattern is reused by Savings and any future feature that keys on model family.
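
The two-step lookup can be sketched as follows. This is an illustration of the rule as described here, not the body of calculateCost(); the resolvePriceKey name and signature are hypothetical.

```typescript
// Returns the price-table key to use for a model id, or null if none applies.
function resolvePriceKey(modelId: string, keys: readonly string[]): string | null {
  let best: string | null = null;
  for (const key of keys) {
    // Step 1: an exact match always wins.
    if (key === modelId) return key;
    // Step 2: the key must be followed by "-" in the model id (boundary-aware),
    // and among all such keys the longest one is kept.
    if (modelId.startsWith(key + "-") && (best === null || key.length > best.length)) {
      best = key;
    }
  }
  return best;
}

const keys = ["gpt-4", "gpt-4o", "gpt-4o-mini"];
console.log(resolvePriceKey("gpt-4o-mini-2024-07-18", keys)); // "gpt-4o-mini"
console.log(resolvePriceKey("gpt-4o-2024-08-06", keys));      // "gpt-4o"
console.log(resolvePriceKey("gpt-4o-mini", ["gpt-4"]));       // null
```

The last call shows why the "-" boundary matters: with a plain startsWith(key), gpt-4o-mini would be billed at gpt-4 rates whenever its own row was missing.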

Graceful degradation on unknown models

If a request comes in for a model we don't have pricing for (brand-new release, fine-tuned custom model), calculateCost() returns null and the row's cost_usd is NULL. The dashboard filters these rows out of cost aggregates; we never estimate or fabricate. The gap is visible, not hidden.
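
If you aggregate cost yourself from request rows, the same rule is easy to mirror: sum only the priced rows and report the unpriced ones separately. A minimal sketch, assuming a row shape with just model and cost_usd (the real row has more fields) and a hypothetical summarizeCost helper:

```typescript
// Assumed minimal row shape for this example.
interface RequestRow {
  model: string;
  cost_usd: number | null; // null when no price row matched the model
}

interface CostSummary {
  totalCostUsd: number;     // sum over priced rows only
  pricedCount: number;
  unpricedCount: number;
  unpricedModels: string[]; // distinct models with no pricing, in first-seen order
}

function summarizeCost(rows: readonly RequestRow[]): CostSummary {
  let totalCostUsd = 0;
  let pricedCount = 0;
  let unpricedCount = 0;
  const unpriced = new Set<string>();
  for (const row of rows) {
    if (row.cost_usd === null) {
      // Never substitute an estimate; just record the gap.
      unpricedCount += 1;
      unpriced.add(row.model);
    } else {
      totalCostUsd += row.cost_usd;
      pricedCount += 1;
    }
  }
  return { totalCostUsd, pricedCount, unpricedCount, unpricedModels: Array.from(unpriced) };
}
```

Surfacing unpricedModels next to the total makes it obvious when the figure is a lower bound and which price rows are missing.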

Fix: a Spanlens operator can add the row directly to model_prices via the admin API (POST /api/v1/admin/model-prices). The cache picks it up within 5 minutes, no deploy required. The fix isn't retroactive; cost appears on new requests only.

Using it

Programmatic access

Cost is a first-class field on every request. Fetch via:

# All requests with cost, last 7 days
GET /api/v1/requests?sinceHours=168

# Aggregate cost by model
GET /api/v1/stats?sinceHours=720&groupBy=model
# → [
#     { "model": "gpt-4o", "requestCount": 1204, "totalCostUsd": 12.84 },
#     { "model": "gpt-4o-mini", "requestCount": 42103, "totalCostUsd": 6.32 }
#   ]

Per-project rollup

Pass projectId to scope. Useful for chargeback when multiple teams share one Spanlens org:

GET /api/v1/stats?projectId=<uuid>&sinceHours=720&groupBy=model
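
For a chargeback script, the query string can be built and the response totalled in a few lines. This is a sketch under assumptions: the base URL is a placeholder, the buildStatsUrl and sumTotalCost helpers are hypothetical, and the row shape is taken from the example response shown earlier on this page. Authentication is left out because it isn't covered here.

```typescript
// Row shape as it appears in the example /api/v1/stats response.
interface ModelStat {
  model: string;
  requestCount: number;
  totalCostUsd: number;
}

interface StatsQuery {
  sinceHours: number;
  groupBy?: "model";
  projectId?: string; // omit for an org-wide rollup
}

function buildStatsUrl(baseUrl: string, query: StatsQuery): string {
  const url = new URL("/api/v1/stats", baseUrl);
  if (query.projectId) url.searchParams.set("projectId", query.projectId);
  url.searchParams.set("sinceHours", String(query.sinceHours));
  if (query.groupBy) url.searchParams.set("groupBy", query.groupBy);
  return url.toString();
}

// Total spend across all models in one stats response.
function sumTotalCost(stats: readonly ModelStat[]): number {
  return stats.reduce((sum, row) => sum + row.totalCostUsd, 0);
}

// Placeholder host; substitute your own Spanlens server.
console.log(buildStatsUrl("https://spanlens.example", {
  projectId: "1234",
  sinceHours: 720,
  groupBy: "model",
}));
// https://spanlens.example/api/v1/stats?projectId=1234&sinceHours=720&groupBy=model
```

Running the same query once per projectId and feeding each response through sumTotalCost gives one figure per team for the 30-day window.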

Design choices

  • Computed at ingest, not at read time. Freezes the price at the moment of the call. If OpenAI drops gpt-4o prices tomorrow, your historical cost data doesn't retroactively change. Audit-friendly.
  • Stored as numeric(14,8). 8 decimal places of precision, enough to represent fractional cents on very cheap models without rounding error.
  • Table, not API. We don't fetch live prices from provider APIs because they don't expose one. A hand-maintained table is the reality for the industry.

Limitations

  • Price table drifts. When a provider changes prices, our table needs a PR. Tracked as a monthly maintenance item. If you're self-hosting, pin a specific commit or expect to cherry-pick updates.
  • No batch API discount. OpenAI and Anthropic both offer ~50% off batch calls. Spanlens treats a batch response as a normal request. Mark batch traffic with a custom tag and filter manually for now.

Note: this page is about per-request USD cost (how much each LLM call cost you at the provider). For Spanlens subscription billing, plan quotas, overage rates, invoicing, see Billing & quotas.

Related: Requests, Savings, /dashboard. Source: apps/server/src/lib/cost.ts.