Savings

Spanlens analyzes your last 7 days of LLM traffic and suggests specific (provider, model) pairs that can be swapped for cheaper alternatives at the same task quality. Recommendations come with an estimated monthly savings figure in USD — no hand-waving.

Why it matters

The most common LLM cost mistake is using GPT-4 for everything. Extraction, classification, short-form generation, intent detection — on these workloads GPT-4o-mini's output is indistinguishable from GPT-4's at 1/15 the price, but teams default to the most capable model out of caution and never revisit the choice.

Savings is a cold look at your actual usage: “You sent 42,000 gpt-4o calls last week with an average prompt of 180 tokens and output of 85 tokens. That pattern fits the gpt-4o-mini envelope. Switching would save ~$380/month.”

How it works

Aggregation

Every 24 hours we aggregate the last N=7 days of requests grouped by (provider, model), computing:

  • sampleCount — how many requests in the bucket
  • avgPromptTokens, avgCompletionTokens
  • totalCostUsd — actual spend
  • Extrapolated monthly cost = 7-day total ÷ 7 × 30
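
The aggregation step can be sketched as follows. This is a minimal illustration, not the actual implementation — the RequestRow shape and the aggregate helper are assumptions, not Spanlens's real schema:

```ts
// Hypothetical request row — the real schema lives in Spanlens's database.
interface RequestRow {
  provider: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  costUsd: number;
}

interface Bucket {
  provider: string;
  model: string;
  sampleCount: number;
  avgPromptTokens: number;
  avgCompletionTokens: number;
  totalCostUsd: number;
  monthlyCostUsd: number; // 7-day total extrapolated to 30 days
}

function aggregate(rows: RequestRow[]): Bucket[] {
  // Group by (provider, model).
  const byKey = new Map<string, RequestRow[]>();
  for (const r of rows) {
    const key = `${r.provider}:${r.model}`;
    let group = byKey.get(key);
    if (!group) byKey.set(key, (group = []));
    group.push(r);
  }
  // Compute per-bucket stats.
  return [...byKey.entries()].map(([key, group]) => {
    const [provider, model] = key.split(":", 2);
    const totalCostUsd = group.reduce((s, r) => s + r.costUsd, 0);
    return {
      provider,
      model,
      sampleCount: group.length,
      avgPromptTokens:
        group.reduce((s, r) => s + r.promptTokens, 0) / group.length,
      avgCompletionTokens:
        group.reduce((s, r) => s + r.completionTokens, 0) / group.length,
      totalCostUsd,
      monthlyCostUsd: (totalCostUsd / 7) * 30,
    };
  });
}
```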

Substitute matching

Each bucket is matched against a curated SUBSTITUTES rule table in lib/model-recommend-rules.ts. Current rules (subject to change as models release):

| Current model | Suggested substitute | Cost ratio | Token envelope |
| --- | --- | --- | --- |
| openai:gpt-4o | openai:gpt-4o-mini | 6% | prompt ≤ 500, completion ≤ 150 |
| anthropic:claude-3-opus | anthropic:claude-haiku-4.5 | 4% | prompt ≤ 500, completion ≤ 200 |
| anthropic:claude-3-5-sonnet | anthropic:claude-haiku-4.5 | 25% | prompt ≤ 800, completion ≤ 250 |
| gemini:gemini-1.5-pro | gemini:gemini-1.5-flash | 6.7% | prompt ≤ 1000, completion ≤ 300 |

A recommendation fires only if both avg-prompt and avg-completion fit inside the envelope. This is the conservative guard — if your average request exceeds the envelope, the suggested cheaper model will probably underperform on your actual workload, so we don't show it.
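
As a sketch, a rule entry and the envelope guard might look like this (field names are assumptions — the real table lives in lib/model-recommend-rules.ts):

```ts
// Hypothetical shape of a SUBSTITUTES entry; field names are assumptions.
interface SubstituteRule {
  current: string;                // e.g. "openai:gpt-4o"
  suggest: string;                // e.g. "openai:gpt-4o-mini"
  costRatio: number;              // suggested cost / current cost, e.g. 0.06
  maxAvgPromptTokens: number;     // token envelope: prompt side
  maxAvgCompletionTokens: number; // token envelope: completion side
  reason: string;
}

const SUBSTITUTES: SubstituteRule[] = [
  {
    current: "openai:gpt-4o",
    suggest: "openai:gpt-4o-mini",
    costRatio: 0.06,
    maxAvgPromptTokens: 500,
    maxAvgCompletionTokens: 150,
    reason: "Short inputs/outputs suggest classification workload",
  },
];

// Conservative guard: BOTH averages must fit inside the envelope.
function fitsEnvelope(
  rule: SubstituteRule,
  avgPromptTokens: number,
  avgCompletionTokens: number,
): boolean {
  return (
    avgPromptTokens <= rule.maxAvgPromptTokens &&
    avgCompletionTokens <= rule.maxAvgCompletionTokens
  );
}
```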

Longest-prefix matching for dated variants

OpenAI returns dated model strings (gpt-4o-mini-2024-07-18) in the response body, and that's what ends up in requests.model. The matcher does an exact lookup first, then falls back to the longest boundary-aware prefix match, so a dated variant correctly resolves to its family rule:

```ts
matchSubstitute('openai:gpt-4o-2024-08-06')
// → resolves to 'openai:gpt-4o' rule
// → suggests gpt-4o-mini
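
One way to implement that fallback is sketched below — an illustration under assumed rule-table shapes, not the actual matcher:

```ts
// Minimal rule table keyed by "provider:model" family string.
const RULES: Record<string, { suggest: string }> = {
  "openai:gpt-4o": { suggest: "openai:gpt-4o-mini" },
  "anthropic:claude-3-opus": { suggest: "anthropic:claude-haiku-4.5" },
};

function matchSubstitute(key: string): { suggest: string } | undefined {
  // 1. Exact lookup first.
  const exact = RULES[key];
  if (exact) return exact;
  // 2. Longest boundary-aware prefix: the rule key must be followed by "-",
  //    so "openai:gpt-4o-2024-08-06" matches "openai:gpt-4o", but an
  //    unrelated model that merely shares leading characters does not.
  //    (A production table would also need to keep dated variants of the
  //    *cheaper* models, e.g. gpt-4o-mini dates, from resolving to the
  //    gpt-4o rule — that refinement is omitted here.)
  let best: string | undefined;
  for (const prefix of Object.keys(RULES)) {
    const boundaryOk = key.startsWith(prefix) && key[prefix.length] === "-";
    if (boundaryOk && (best === undefined || prefix.length > best.length)) {
      best = prefix;
    }
  }
  return best !== undefined ? RULES[best] : undefined;
}
```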

Savings calculation

```text
monthlyCostCurrent   = totalCostUsd * (30 / 7)
monthlyCostSuggested = monthlyCostCurrent * substitute.costRatio
estimatedSavingsUsd  = monthlyCostCurrent - monthlyCostSuggested
```

Only recommendations with estimatedSavingsUsd ≥ $5 surface in the dashboard — below that, the signal-to-noise isn't worth your attention.
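
The calculation and the $5 floor can be expressed directly; this is a sketch with assumed function names, mirroring the formulas above:

```ts
// Extrapolate a 7-day spend figure to a month and apply the substitute's
// cost ratio (suggested cost / current cost, e.g. 0.06 for gpt-4o-mini).
function computeSavings(totalCostUsd: number, costRatio: number) {
  const monthlyCostCurrent = totalCostUsd * (30 / 7);
  const monthlyCostSuggested = monthlyCostCurrent * costRatio;
  const estimatedSavingsUsd = monthlyCostCurrent - monthlyCostSuggested;
  return { monthlyCostCurrent, monthlyCostSuggested, estimatedSavingsUsd };
}

// Recommendations below this threshold are not surfaced in the dashboard.
const MIN_SAVINGS_USD = 5;

function shouldSurface(rec: { estimatedSavingsUsd: number }): boolean {
  return rec.estimatedSavingsUsd >= MIN_SAVINGS_USD;
}
```

For example, $70 of gpt-4o spend over 7 days extrapolates to $300/month; at a 6% cost ratio the swap would cost $18/month, for roughly $282/month in savings — well above the $5 floor.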

Using it

Dashboard

Visit /recommendations (labeled “Savings” in the sidebar). Each row shows:

  • Current model + sample count + monthly cost
  • Suggested model + estimated monthly cost after swap
  • Estimated savings in USD/month
  • Rationale — the rule's reason string (e.g. “Short inputs/outputs suggest classification workload”)

API

```bash
GET /api/v1/recommendations

# →
#   [
#     {
#       "currentProvider": "openai",
#       "currentModel": "gpt-4o-2024-08-06",
#       "sampleCount": 42103,
#       "avgPromptTokens": 180,
#       "avgCompletionTokens": 85,
#       "monthlyCostCurrentUsd": 412.50,
#       "suggestedProvider": "openai",
#       "suggestedModel": "gpt-4o-mini",
#       "monthlyCostSuggestedUsd": 24.75,
#       "estimatedSavingsUsd": 387.75,
#       "reason": "Short inputs/outputs suggest classification/extraction workload — gpt-4o-mini covers it at ~15x lower cost."
#     }
#   ]
```
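
A minimal typed client for this endpoint might look like the sketch below; the base URL and the topSavings helper are assumptions for illustration, only the response fields come from the example above:

```ts
// Response shape per the example payload above.
interface Recommendation {
  currentProvider: string;
  currentModel: string;
  sampleCount: number;
  avgPromptTokens: number;
  avgCompletionTokens: number;
  monthlyCostCurrentUsd: number;
  suggestedProvider: string;
  suggestedModel: string;
  monthlyCostSuggestedUsd: number;
  estimatedSavingsUsd: number;
  reason: string;
}

async function fetchRecommendations(baseUrl: string): Promise<Recommendation[]> {
  const res = await fetch(`${baseUrl}/api/v1/recommendations`);
  if (!res.ok) throw new Error(`recommendations request failed: ${res.status}`);
  return (await res.json()) as Recommendation[];
}

// Sort biggest savings first, e.g. for a report or Slack digest.
function topSavings(recs: Recommendation[]): Recommendation[] {
  return [...recs].sort((a, b) => b.estimatedSavingsUsd - a.estimatedSavingsUsd);
}
```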

Design choices

  • Rules are curated, not ML. Empirical cost ratios and token envelopes come from hand-tested substitutions. A learned recommender would drift as model prices change weekly; curated rules are easier to audit and correct.
  • No A/B auto-rollout. We show you the recommendation; you decide whether to switch. Automated multi-armed-bandit model routing is out of scope for launch — it's a different product surface.
  • Conservative envelope. Better to miss a borderline recommendation than to suggest a swap that degrades your UX. False negatives are recoverable; false positives break trust.

Limitations

  • Token-based, not task-based. A 200-token prompt can be “summarize this article” (gpt-4o-mini is fine) or “generate the JSON schema for my domain model” (gpt-4o is better). The envelope catches most cases but occasional false positives are possible — hence the manual-approval loop.
  • Rule table needs periodic refresh. New models (GPT-5, Claude 4.7) need rule entries added. Tracked as a Phase 3 maintenance item.
  • No cross-provider recommendations yet. We don't suggest “switch from gpt-4o-mini to claude-haiku” even when cheaper — accuracy comparisons across providers are too workload-dependent to ship blind.

Related: Prompts (A/B by cost), /recommendations dashboard. Source: apps/server/src/lib/model-recommend-rules.ts.