Savings

Spanlens analyzes your LLM traffic over a configurable window (7 / 14 / 30 days) and suggests specific (provider, model) pairs that can be swapped for cheaper alternatives at the same task quality. Recommendations come with an estimated monthly savings figure in USD, a confidence tier, and automatic detection when you've already made the switch.

Why it matters

The most common LLM cost mistake is using GPT-4 for everything. On extraction, classification, short-form generation, and intent detection, GPT-4o-mini's output is indistinguishable from GPT-4's at 1/15 the price, but teams default to the most capable model out of caution and never revisit the choice.

Savings is a cold look at your actual usage: “You sent 42,000 gpt-4o calls last week with an average prompt of 180 tokens and output of 85 tokens. That pattern fits the gpt-4o-mini envelope. Switching would save ~$380/month.”

How it works

Aggregation

Spanlens aggregates the requests table over your chosen analysis window (see Analysis window below), grouped by (provider, model), computing:

  • sampleCount: how many requests in the bucket
  • avgPromptTokens, avgCompletionTokens
  • totalCostUsdLastNDays: actual spend over the window
  • Extrapolated monthly cost = window total ÷ window days × 30
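As a rough illustration of the aggregation above, the following sketch groups request rows in memory. The RequestRow shape and the aggregateBuckets function are hypothetical names chosen for this example; the real aggregation runs against the requests table and may differ in detail.

```typescript
// Hypothetical row shape for illustration; the real requests table has more columns.
interface RequestRow {
  provider: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  costUsd: number;
}

interface Bucket {
  provider: string;
  model: string;
  sampleCount: number;
  avgPromptTokens: number;
  avgCompletionTokens: number;
  totalCostUsdLastNDays: number;
  monthlyCostUsd: number; // window total ÷ window days × 30
}

interface Running {
  provider: string;
  model: string;
  n: number;
  prompt: number;
  completion: number;
  cost: number;
}

function aggregateBuckets(rows: RequestRow[], windowDays: number): Bucket[] {
  const acc = new Map<string, Running>();
  for (const r of rows) {
    const key = `${r.provider}:${r.model}`;
    const b = acc.get(key) ?? { provider: r.provider, model: r.model, n: 0, prompt: 0, completion: 0, cost: 0 };
    b.n += 1;
    b.prompt += r.promptTokens;
    b.completion += r.completionTokens;
    b.cost += r.costUsd;
    acc.set(key, b);
  }
  // Turn running sums into averages and the extrapolated monthly figure.
  return Array.from(acc.values()).map((b) => ({
    provider: b.provider,
    model: b.model,
    sampleCount: b.n,
    avgPromptTokens: b.prompt / b.n,
    avgCompletionTokens: b.completion / b.n,
    totalCostUsdLastNDays: b.cost,
    monthlyCostUsd: (b.cost / windowDays) * 30,
  }));
}
```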

Substitute matching

Each bucket is matched against a curated SUBSTITUTES rule table in lib/model-recommend-rules.ts. Current rules (subject to change as models release):

| Current model | Suggested substitute | Cost ratio | Token envelope |
|---|---|---|---|
| openai:gpt-4o | openai:gpt-4o-mini | 6% | prompt ≤ 500, completion ≤ 150 |
| openai:gpt-4.1 | openai:gpt-4.1-mini | 20% | prompt ≤ 500, completion ≤ 150 |
| openai:gpt-4-turbo | openai:gpt-4o | 25% | prompt ≤ 2000, completion ≤ 500 |
| openai:gpt-4 | openai:gpt-4o | 8.3% | prompt ≤ 4000, completion ≤ 1000 |
| anthropic:claude-opus-4-7 | anthropic:claude-haiku-4.5 | 20% | prompt ≤ 500, completion ≤ 200 |
| anthropic:claude-3-opus-20240229 | anthropic:claude-haiku-4.5 | 6.7% | prompt ≤ 500, completion ≤ 200 |
| anthropic:claude-sonnet-4-6 | anthropic:claude-haiku-4.5 | 33.3% | prompt ≤ 800, completion ≤ 250 |
| anthropic:claude-haiku-4-5 | anthropic:claude-haiku-4.5 | 33.3% | prompt ≤ 800, completion ≤ 250 |
| gemini:gemini-2.5-pro | gemini:gemini-2.5-flash | 25% | prompt ≤ 1000, completion ≤ 300 |
| gemini:gemini-2.5-pro | gemini:gemini-2.5-flash | 6% | prompt ≤ 1000, completion ≤ 300 |

A recommendation fires only if both avg-prompt and avg-completion fit inside the envelope. This is the conservative guard: if your average request exceeds the envelope, the suggested cheaper model will probably underperform on your actual workload, so we don't show it.
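The envelope guard amounts to two comparisons. A minimal sketch, assuming the envelope is carried as the maxPromptTokens / maxCompletionTokens pair that the API response exposes; the fitsEnvelope name is hypothetical:

```typescript
interface Envelope {
  maxPromptTokens: number;
  maxCompletionTokens: number;
}

// True only when BOTH averages fit; one oversized dimension suppresses the recommendation.
function fitsEnvelope(avgPromptTokens: number, avgCompletionTokens: number, env: Envelope): boolean {
  return avgPromptTokens <= env.maxPromptTokens && avgCompletionTokens <= env.maxCompletionTokens;
}

// Envelope of the openai:gpt-4o rule from the table above.
const gpt4oEnvelope: Envelope = { maxPromptTokens: 500, maxCompletionTokens: 150 };

const shortWorkload = fitsEnvelope(180, 85, gpt4oEnvelope); // both averages inside the envelope
const longOutputs = fitsEnvelope(180, 400, gpt4oEnvelope); // completion average too large
```

With these inputs, shortWorkload is true and longOutputs is false, so only the first workload would produce a recommendation.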

Longest-prefix matching for dated variants

OpenAI returns dated model strings (gpt-4o-mini-2024-07-18) in the response body, and that's what ends up in requests.model. The matcher does an exact lookup first, then falls back to longest boundary-aware prefix match so a dated variant correctly resolves to its family rule:

```ts
matchSubstitute('openai:gpt-4o-2024-08-06')
// → resolves to 'openai:gpt-4o' rule
// → suggests gpt-4o-mini
```
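One way to implement an exact lookup with a longest-prefix fallback is sketched below. This is not the actual matchSubstitute source: the matchRuleKey name is hypothetical, and treating "-" as the only boundary character is an assumption made for this example.

```typescript
// Returns the rule key that a model id resolves to, or undefined when nothing matches.
function matchRuleKey(modelId: string, ruleKeys: string[]): string | undefined {
  // 1. An exact match wins outright.
  if (ruleKeys.includes(modelId)) return modelId;

  // 2. Otherwise keep the longest key that is a prefix ending at a "-" boundary.
  //    The boundary check stops "openai:gpt-4" from swallowing "openai:gpt-4o-...".
  let best: string | undefined;
  for (const key of ruleKeys) {
    const endsAtBoundary = modelId.startsWith(key) && modelId.charAt(key.length) === "-";
    if (endsAtBoundary && (best === undefined || key.length > best.length)) {
      best = key;
    }
  }
  return best;
}

const exampleKeys = ["openai:gpt-4", "openai:gpt-4o", "openai:gpt-4-turbo"];
const datedGpt4o = matchRuleKey("openai:gpt-4o-2024-08-06", exampleKeys);
const datedTurbo = matchRuleKey("openai:gpt-4-turbo-2024-04-09", exampleKeys);
```

Here datedGpt4o resolves to "openai:gpt-4o", and datedTurbo resolves to "openai:gpt-4-turbo" rather than the shorter "openai:gpt-4". Under this simple boundary definition a string such as gpt-4o-mini-2024-07-18 would also match a gpt-4o key unless gpt-4o-mini is itself a known key. How the real matcher handles that case is not shown here.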

Savings calculation

```text
monthlyCostCurrent   = totalCostUsdLastNDays * (30 / windowDays)
monthlyCostSuggested = monthlyCostCurrent * substitute.costRatio
estimatedMonthlySavingsUsd = monthlyCostCurrent - monthlyCostSuggested
```

Only recommendations with estimatedMonthlySavingsUsd ≥ $5 surface in the dashboard by default; below that, the signal-to-noise isn't worth your attention. You can override this with the ?minSavings= query parameter.
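As a worked example of the formula, the sketch below plugs in $96.25 of spend over a 7-day window and the 6% cost ratio of the gpt-4o rule. The estimateMonthlySavingsUsd function name is hypothetical; the arithmetic follows the three lines above.

```typescript
function estimateMonthlySavingsUsd(
  totalCostUsdLastNDays: number,
  windowDays: number,
  costRatio: number,
): number {
  const monthlyCostCurrent = totalCostUsdLastNDays * (30 / windowDays);
  const monthlyCostSuggested = monthlyCostCurrent * costRatio;
  return monthlyCostCurrent - monthlyCostSuggested;
}

const DEFAULT_MIN_SAVINGS_USD = 5; // default threshold; overridable via ?minSavings=

// 96.25 × (30 / 7) = 412.50 current; × 0.06 = 24.75 suggested; savings ≈ 387.75
const savingsUsd = estimateMonthlySavingsUsd(96.25, 7, 0.06);
const surfacesByDefault = savingsUsd >= DEFAULT_MIN_SAVINGS_USD;
```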

Confidence tiers

Each recommendation is assigned a confidence tier based on projected savings and sample volume. Higher volume = more representative average tokens = more trustworthy envelope match.

| Tier | Criteria | Signal |
|---|---|---|
| High | ≥ $40/mo projected savings and ≥ 100 samples | 3-bar indicator (green) |
| Medium | ≥ $10/mo projected savings and ≥ 30 samples | 2-bar indicator (neutral) |
| Low | Below medium threshold | 1-bar indicator (muted) |
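The tier criteria translate directly into a small classifier. A minimal sketch, with the confidenceTier name and the lowercase tier labels chosen for this example:

```typescript
type ConfidenceTier = "high" | "medium" | "low";

// Both conditions of a tier must hold; otherwise fall through to the next tier down.
function confidenceTier(estimatedMonthlySavingsUsd: number, sampleCount: number): ConfidenceTier {
  if (estimatedMonthlySavingsUsd >= 40 && sampleCount >= 100) return "high";
  if (estimatedMonthlySavingsUsd >= 10 && sampleCount >= 30) return "medium";
  return "low";
}
```

Note that large savings on a thin sample still land in a lower tier: $50/mo projected from 40 samples is medium, not high.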

The hero tile at the top of the Savings dashboard surfaces the highest tier available and its aggregate savings figure. High-confidence recommendations also trigger automatic email alerts once per recommendation.

Achieved tracking

Spanlens automatically detects when you've acted on a recommendation by comparing spend in two equal-length windows:

```text
currentWindow  = [now - N days  …  now]
priorWindow    = [now - 2N days …  now - N days]

dropPct = (priorWindowCost − currentWindowCost) / priorWindowCost
```

When dropPct ≥ 70%, the recommendation is marked achieved. The dashboard shows:

  • A green ACHIEVED badge instead of SWAP
  • ACTUAL / MO: the realized monthly savings based on the observed drop
  • The drop percentage vs. the prior window (“usage dropped 72% vs prior 7d”)
  • The originally projected savings for comparison

Achieved recommendations live in a separate collapsible section below the open list. They bypass the minSavings filter (you've already made the change, so hiding the row would be confusing) but still require the prior window spend to be meaningful (≥ the minSavings threshold annualized).
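The drop test can be sketched as follows. The detectAchieved name and result shape are hypothetical, and the sketch leaves out the "meaningful prior spend" check and the realized-savings figure, since their exact formulas are not spelled out here.

```typescript
interface AchievedResult {
  achieved: boolean;
  dropPct: number | null; // null when there is no usable prior window
}

const ACHIEVED_DROP_THRESHOLD = 0.7; // 70% drop vs. the prior window

function detectAchieved(priorWindowCostUsd: number | null, currentWindowCostUsd: number): AchievedResult {
  // Without prior spend there is nothing to compare against (and no division by zero).
  if (priorWindowCostUsd === null || priorWindowCostUsd <= 0) {
    return { achieved: false, dropPct: null };
  }
  const dropPct = (priorWindowCostUsd - currentWindowCostUsd) / priorWindowCostUsd;
  return { achieved: dropPct >= ACHIEVED_DROP_THRESHOLD, dropPct };
}

const switched = detectAchieved(100, 28); // 72% drop
const unchanged = detectAchieved(98.1, 96.25); // roughly 2% drop
```

In this example switched.achieved is true and unchanged.achieved is false.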

The hero tile gains an achieved X/mo ✓ line when any achieved swaps exist, and the fourth stat tile switches from showing the best confidence tier to showing total realized monthly savings and the count of adopted swaps.

The sort order places open recommendations first (by estimated savings desc), then achieved (by actual savings desc).

Analysis window

The topbar of the Savings page has a 7d / 14d / 30d selector that controls how far back Spanlens looks when computing averages. The selection is per-session (not persisted) and defaults to 7 days.

  • 7 days: most responsive to recent model usage changes; default.
  • 14 days: smooths out weekly spikes; useful when traffic is seasonal.
  • 30 days: highest sample counts, most stable confidence tiers.

Changing the window re-fetches the API with a different ?hours= value and recomputes all savings estimates in-page; no page reload is needed. The prior window used for achieved tracking always has the same duration as the selected analysis window.
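To make the relationship between the selector, the ?hours= value, and the two comparison windows concrete, here is a small sketch. The windowToHours and analysisWindows helpers are hypothetical names for illustration only.

```typescript
type WindowDays = 7 | 14 | 30;

// Value sent as the ?hours= query parameter.
function windowToHours(days: WindowDays): number {
  return days * 24;
}

interface WindowPair {
  current: { start: Date; end: Date };
  prior: { start: Date; end: Date };
}

// The prior window has the same length and ends where the current one starts.
function analysisWindows(now: Date, days: WindowDays): WindowPair {
  const ms = windowToHours(days) * 60 * 60 * 1000;
  const currentStart = new Date(now.getTime() - ms);
  const priorStart = new Date(now.getTime() - 2 * ms);
  return {
    current: { start: currentStart, end: now },
    prior: { start: priorStart, end: currentStart },
  };
}
```

For the 14-day selection, windowToHours returns 336, matching the ?hours=336 form of the API call.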

Using it

Dashboard

Visit /savings in the sidebar. The page has five zones:

  • Hero tile: shows your total projected open savings, best confidence tier, and (when present) achieved monthly savings. The three stat tiles to the right show spend in the window, open count, and achieved savings / best confidence tier. Hero tile totals are computed from the unfiltered open set, so active filters don't shrink the headline number.
  • Sort & filter bar: sort by Savings (default), Confidence, or Name; filter by Provider (OpenAI / Anthropic / Gemini) or Confidence tier (High / Medium / Low). Sort and filter state is persisted in localStorage across page reloads. A filter badge shows how many recommendations are visible vs. total when filters are active. A “Clear filters” empty state appears if filters exclude all open rows.
  • Open recommendations: each row shows current model → suggested model, the confidence bar, rationale text, projected monthly savings, and three action buttons: Compare (opens the Compare dialog), Simulate (opens the Simulate dialog), and Hide.
  • Achieved section: a collapsible section (click to expand) listing recommendations where Spanlens detected a ≥ 70% spend drop. Rows show a green ACHIEVED badge, the realized monthly savings, and the drop percentage vs. the prior window. There is no Simulate button, since you already switched.
  • Hidden section: recommendations you've dismissed live here. A “Show hidden” toggle expands the section; each row has an Unhide button to restore it.

Compare dialog

Click Compare on any open recommendation to run both models side by side against a real prompt, without touching your application code.

  1. Pick a prompt version: choose any saved Prompts version as the input. The playground will fill in the prompt template automatically.
  2. Choose provider keys: select the provider key to use for each model (current and suggested). Both models must have an active key registered in /projects.
  3. Run: both models execute in parallel. While running, each column shows a loading spinner.

Each result card shows:

  • The model's full text response
  • Latency (ms) and total tokens
  • Cost in USD for that single call

Below the two cards, a delta summary line shows the cost and latency difference between the suggested and current model (e.g. “−$0.0004 / call · +120 ms”). This gives you a quick sanity check on whether the savings estimate matches real-world behavior on your prompt.

Compare calls are executed via the Playground endpoint (POST /api/v1/prompts/playground/run) and are not logged as production requests.

Simulate dialog

Click Simulate on any open recommendation to open the Simulate dialog. It contains:

  • Cost summary: spend in the current window, projected monthly savings, and the projection formula for transparency.
  • Token distribution: a P50 / P95 / P99 table for prompt and completion tokens, computed from your actual requests (not just the average). Each row is compared against the substitute rule's envelope, the maximum average token count the cheaper model is rated for.

If P95 exceeds the envelope for prompt or completion tokens, a yellow warning box appears: “P95 exceeds the substitute envelope for prompt tokens. Some requests may degrade in quality; run a shadow comparison first.” This is the key signal that the average looks safe but the tail of your distribution might not be.

Token distribution data is fetched lazily: the API call fires only when you open the dialog, not on page load.
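The P95 check behind the warning box can be sketched as below. Field names follow the percentiles and recommendations API responses; the envelopeWarnings function and its return shape are hypothetical.

```typescript
interface TokenPercentiles {
  p95PromptTokens: number;
  p95CompletionTokens: number;
}

interface TokenEnvelope {
  maxPromptTokens: number;
  maxCompletionTokens: number;
}

type WarningDimension = "prompt" | "completion";

// Lists every dimension whose P95 lies outside the substitute rule's envelope.
function envelopeWarnings(p: TokenPercentiles, env: TokenEnvelope): WarningDimension[] {
  const warnings: WarningDimension[] = [];
  if (p.p95PromptTokens > env.maxPromptTokens) warnings.push("prompt");
  if (p.p95CompletionTokens > env.maxCompletionTokens) warnings.push("completion");
  return warnings;
}

const gpt4oRuleEnvelope: TokenEnvelope = { maxPromptTokens: 500, maxCompletionTokens: 150 };
const safeTail = envelopeWarnings({ p95PromptTokens: 340, p95CompletionTokens: 148 }, gpt4oRuleEnvelope);
const heavyTail = envelopeWarnings({ p95PromptTokens: 340, p95CompletionTokens: 210 }, gpt4oRuleEnvelope);
```

safeTail is empty, so no warning box would be shown; heavyTail contains "completion", the case where the average fits but the tail does not.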

Hiding recommendations

Click Hide on a row to dismiss a recommendation you've already evaluated and decided against. Dismissed rows move to the collapsible “Hidden recommendations” section at the bottom of the page and are stored in localStorage, so they persist across page reloads in the same browser. Click Unhide inside the hidden section to bring a row back.

When all visible recommendations have been hidden, the empty state message changes from the generic “no opportunities” copy to “All recommendations are hidden; use Show hidden to review them.”

High-confidence email alerts

Every day at 09:00 UTC, Spanlens runs the recommendation engine for every organization and checks for high-confidence swaps (≥ $40/mo + ≥ 100 samples) that haven't been notified before. When new high-confidence recommendations are found, the org owner receives a plain-text email listing:

  • The current and suggested (provider, model) pair
  • Projected monthly savings in USD
  • Sample count (for context on estimate quality)
  • A direct link to the Savings dashboard

Notifications are idempotent: each (org, swap pair) triggers at most one email, recorded in the recommendation_notifications table. Future cron runs skip already-notified pairs. A new notification fires only if a net-new high-confidence recommendation appears (e.g., more traffic builds confidence on a previously low-tier swap).
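The once-per-pair behavior can be illustrated with an in-memory sketch, where a Set stands in for the recommendation_notifications table. The Swap shape, swapKey, and selectNewNotifications are hypothetical names; the real job presumably relies on the database rather than process memory.

```typescript
interface Swap {
  currentProvider: string;
  currentModel: string;
  suggestedProvider: string;
  suggestedModel: string;
}

// One key per (org, swap pair).
function swapKey(orgId: string, s: Swap): string {
  return [orgId, s.currentProvider, s.currentModel, s.suggestedProvider, s.suggestedModel].join("|");
}

// Returns only the swaps that have not been notified yet and records them as notified.
function selectNewNotifications(orgId: string, highConfidence: Swap[], alreadyNotified: Set<string>): Swap[] {
  const fresh: Swap[] = [];
  for (const s of highConfidence) {
    const key = swapKey(orgId, s);
    if (alreadyNotified.has(key)) continue; // skip pairs handled by an earlier run
    alreadyNotified.add(key);
    fresh.push(s);
  }
  return fresh;
}

const notified = new Set<string>();
const gpt4oSwap: Swap = {
  currentProvider: "openai",
  currentModel: "gpt-4o",
  suggestedProvider: "openai",
  suggestedModel: "gpt-4o-mini",
};
const firstRun = selectNewNotifications("org_1", [gpt4oSwap], notified);
const secondRun = selectNewNotifications("org_1", [gpt4oSwap], notified);
```

The first run returns the swap and the second returns an empty list, which is the idempotence property described above.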

API

GET /api/v1/recommendations

```bash
GET /api/v1/recommendations
GET /api/v1/recommendations?hours=336          # 14-day window
GET /api/v1/recommendations?hours=720          # 30-day window
GET /api/v1/recommendations?minSavings=20      # only show ≥ $20/mo

# →
#   {
#     "data": [
#       {
#         "currentProvider": "openai",
#         "currentModel": "gpt-4o-2024-08-06",
#         "sampleCount": 42103,
#         "avgPromptTokens": 180,
#         "avgCompletionTokens": 85,
#         "totalCostUsdLastNDays": 96.25,
#         "suggestedProvider": "openai",
#         "suggestedModel": "gpt-4o-mini",
#         "estimatedMonthlySavingsUsd": 387.75,
#         "reason": "Short inputs/outputs suggest classification/extraction workload.",
#         "maxPromptTokens": 500,
#         "maxCompletionTokens": 150,
#         "priorWindowCostUsd": 98.10,
#         "achieved": false,
#         "actualMonthlySavingsUsd": null
#       }
#     ],
#     "meta": { "hours": 168, "minSavingsUsd": 5 }
#   }
```

Query parameters:

| Parameter | Default | Description |
|---|---|---|
| hours | 168 (7 days) | Analysis window in hours. Use 336 for 14 days, 720 for 30 days. |
| minSavings | 5 | Minimum projected monthly savings in USD. Recommendations below this threshold are excluded (achieved items bypass this filter). |

Response fields of note:

| Field | Type | Description |
|---|---|---|
| maxPromptTokens | number | The substitute rule's max avg prompt token envelope, used by the Simulate dialog. |
| maxCompletionTokens | number | The substitute rule's max avg completion token envelope. |
| priorWindowCostUsd | number or null | Spend in the prior equal-length window. null if no data exists for that period. |
| achieved | boolean | true when spend dropped ≥ 70% vs the prior window. |
| actualMonthlySavingsUsd | number or null | Realized monthly savings when achieved; null otherwise. |
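A client might build the request path from the two documented query parameters as sketched below. The recommendationsPath helper is a hypothetical name, and authentication is left out because it is not covered in this section.

```typescript
interface RecommendationQuery {
  hours?: number; // analysis window in hours; the server default is 168
  minSavings?: number; // minimum projected monthly savings in USD; the server default is 5
}

// Appends only the parameters the caller set, so server defaults apply otherwise.
function recommendationsPath(query: RecommendationQuery = {}): string {
  const parts: string[] = [];
  if (query.hours !== undefined) parts.push(`hours=${encodeURIComponent(String(query.hours))}`);
  if (query.minSavings !== undefined) parts.push(`minSavings=${encodeURIComponent(String(query.minSavings))}`);
  const base = "/api/v1/recommendations";
  return parts.length > 0 ? `${base}?${parts.join("&")}` : base;
}

const defaultPath = recommendationsPath();
const thirtyDayPath = recommendationsPath({ hours: 720, minSavings: 20 });
```

defaultPath is "/api/v1/recommendations" and thirtyDayPath is "/api/v1/recommendations?hours=720&minSavings=20".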

GET /api/v1/recommendations/percentiles

```bash
GET /api/v1/recommendations/percentiles?provider=openai&model=gpt-4o&hours=168

# →
#   {
#     "data": {
#       "p50PromptTokens": 180,
#       "p95PromptTokens": 340,
#       "p99PromptTokens": 520,
#       "p50CompletionTokens": 85,
#       "p95CompletionTokens": 148,
#       "p99CompletionTokens": 210,
#       "sampleCount": 42103
#     }
#   }
```

Query parameters:

| Parameter | Required | Description |
|---|---|---|
| provider | Yes | Provider slug, e.g. openai, anthropic, gemini, azure. |
| model | Yes | Model string as stored in requests.model, e.g. gpt-4o-2024-08-06. |
| hours | No | Analysis window in hours. Default 168 (7 days). Should match the window used for the parent recommendation. |

Returns null in data if fewer than 5 requests with non-null token counts exist in the window. Computed in-database via percentile_cont ordered-set aggregation, no row scanning in application code.
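For readers unfamiliar with percentile_cont, it interpolates linearly between the two nearest sorted values. The sketch below reproduces that behavior in application code purely for illustration (the endpoint itself computes it in the database). The tokenPercentile name is hypothetical, and applying the 5-request minimum inside the helper is an assumption about where that check lives.

```typescript
const MIN_SAMPLES = 5; // below this the endpoint returns null in data

// Continuous percentile with linear interpolation, mirroring percentile_cont semantics.
function tokenPercentile(values: number[], fraction: number): number | null {
  if (values.length < MIN_SAMPLES) return null;
  const sorted = values.slice().sort((a, b) => a - b);
  const pos = fraction * (sorted.length - 1); // zero-based fractional rank
  const lower = Math.floor(pos);
  const upper = Math.ceil(pos);
  return sorted[lower] + (sorted[upper] - sorted[lower]) * (pos - lower);
}

const promptTokens = [50, 10, 40, 20, 30];
const p50 = tokenPercentile(promptTokens, 0.5); // middle of the sorted values
const p95 = tokenPercentile(promptTokens, 0.95); // interpolated between 40 and 50
```

With five values, the P95 rank falls 80% of the way from the fourth to the fifth sorted value, so p95 is approximately 48, while p50 is exactly 30.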

Design choices

  • Rules are curated, not ML. Empirical cost ratios and token envelopes come from hand-tested substitutions. A learned recommender would drift as model prices change weekly; curated rules are easier to audit and correct.
  • Achieved tracking uses a spend signal, not a model-field signal. We compare spend in two consecutive windows rather than inspecting the requests.model field of future requests. This correctly handles partial rollouts, gradual traffic shifts, and cases where both models are still active.
  • No A/B auto-rollout. We show you the recommendation; you decide whether to switch. Automated multi-armed-bandit model routing is out of scope for launch; it's a different product surface.
  • Conservative envelope. Better to miss a borderline recommendation than to suggest a swap that degrades your UX. False negatives are recoverable; false positives break trust.
  • Email alerts are once-per-recommendation. Nagging users with the same recommendation every day would train them to ignore the emails. One notification per high-confidence finding; future findings on new pairs trigger fresh alerts.
  • P95 vs. P99 for envelope warnings. P95 was chosen as the warning threshold rather than P99 because P99 would suppress warnings for tails that are meaningfully out-of-envelope. P99 is shown for reference in the Simulate dialog.

Limitations

  • Token-based, not task-based. A 200-token prompt can be “summarize this article” (gpt-4o-mini is fine) or “generate the JSON schema for my domain model” (gpt-4o is better). The envelope catches most cases, but occasional false positives are possible, hence the manual-approval loop.
  • No cross-provider recommendations yet. We don't suggest “switch from gpt-4o-mini to claude-haiku” even when cheaper, because accuracy comparisons across providers are too workload-dependent to ship blind.
  • Dismiss and sort/filter state are browser-local. Both are stored in localStorage and do not sync across devices or team members.
  • Achieved tracking requires two full windows of data. If your organization has been on Spanlens for less than 2× the analysis window, the prior window will be empty and priorWindowCostUsd will be null, so achieved detection is not possible for that period.

Related: Prompts (A/B by cost), /savings dashboard. Source: apps/server/src/lib/model-recommend-rules.ts, apps/server/src/lib/model-recommend.ts, apps/server/src/lib/recommendation-notify.ts.