Savings
Spanlens analyzes your last 7 days of LLM traffic and suggests specific (provider, model) pairs that can be swapped for cheaper alternatives at the same task quality. Recommendations come with an estimated monthly savings figure in USD — no hand-waving.
Why it matters
The most common LLM cost mistake is using GPT-4 for everything. Extraction, classification, short-form generation, intent detection — on these workloads GPT-4o-mini's output is indistinguishable at 1/15 the price, but teams default to the most capable model out of caution and never revisit the choice.
Savings is a cold look at your actual usage: “You sent 42,000 gpt-4o calls last week with an average prompt of 180 tokens and output of 85 tokens. That pattern fits the gpt-4o-mini envelope. Switching would save ~$380/month.”
How it works
Aggregation
Every 24 hours we aggregate the last N=7 days of requests grouped by (provider, model), computing:
- `sampleCount` — how many requests in the bucket
- `avgPromptTokens`, `avgCompletionTokens`
- `totalCostUsd` — actual spend
- Extrapolated monthly cost = 7-day total ÷ 7 × 30
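A minimal sketch of that aggregation step, assuming an illustrative `RequestRow` shape (the field names here are assumptions, not the actual schema):

```ts
interface RequestRow {
  provider: string;
  model: string;
  promptTokens: number;
  completionTokens: number;
  costUsd: number;
}

interface UsageBucket {
  provider: string;
  model: string;
  sampleCount: number;
  avgPromptTokens: number;
  avgCompletionTokens: number;
  totalCostUsd: number;
  monthlyCostUsd: number; // 7-day total extrapolated to 30 days
}

function aggregate(rows: RequestRow[]): UsageBucket[] {
  const buckets = new Map<string, UsageBucket>();
  for (const r of rows) {
    const key = `${r.provider}:${r.model}`;
    const b = buckets.get(key) ?? {
      provider: r.provider,
      model: r.model,
      sampleCount: 0,
      avgPromptTokens: 0,
      avgCompletionTokens: 0,
      totalCostUsd: 0,
      monthlyCostUsd: 0,
    };
    // Accumulate sums first; averages are finalized below.
    b.sampleCount += 1;
    b.avgPromptTokens += r.promptTokens;
    b.avgCompletionTokens += r.completionTokens;
    b.totalCostUsd += r.costUsd;
    buckets.set(key, b);
  }
  for (const b of buckets.values()) {
    b.avgPromptTokens /= b.sampleCount;
    b.avgCompletionTokens /= b.sampleCount;
    b.monthlyCostUsd = b.totalCostUsd * (30 / 7); // extrapolate 7 days to 30
  }
  return [...buckets.values()];
}
```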
Substitute matching
Each bucket is matched against a curated `SUBSTITUTES` rule table in `lib/model-recommend-rules.ts`. Current rules (subject to change as new models are released):
| Current model | Suggested substitute | Cost ratio | Token envelope |
|---|---|---|---|
| `openai:gpt-4o` | `openai:gpt-4o-mini` | 6% | prompt ≤ 500, completion ≤ 150 |
| `anthropic:claude-3-opus` | `anthropic:claude-haiku-4.5` | 4% | prompt ≤ 500, completion ≤ 200 |
| `anthropic:claude-3-5-sonnet` | `anthropic:claude-haiku-4.5` | 25% | prompt ≤ 800, completion ≤ 250 |
| `gemini:gemini-1.5-pro` | `gemini:gemini-1.5-flash` | 6.7% | prompt ≤ 1000, completion ≤ 300 |
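In code, each row above becomes one entry in the rule table. A sketch of the shape, with field names assumed for illustration (the real definitions live in `lib/model-recommend-rules.ts`):

```ts
interface SubstituteRule {
  suggestedProvider: string;
  suggestedModel: string;
  costRatio: number;              // suggested cost as a fraction of current spend
  maxAvgPromptTokens: number;     // token envelope, prompt side
  maxAvgCompletionTokens: number; // token envelope, completion side
  reason: string;
}

const SUBSTITUTES: Record<string, SubstituteRule> = {
  'openai:gpt-4o': {
    suggestedProvider: 'openai',
    suggestedModel: 'gpt-4o-mini',
    costRatio: 0.06,
    maxAvgPromptTokens: 500,
    maxAvgCompletionTokens: 150,
    reason: 'Short inputs/outputs suggest classification workload',
  },
  // ...one entry per row of the table above
};
```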
A recommendation fires only if both the average prompt and the average completion fit inside the envelope. This is the conservative guard: if your average request exceeds the envelope, the suggested cheaper model will probably underperform on your actual workload, and we don't show it.
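The guard itself is a conjunction over the two averages. A sketch, reusing the `UsageBucket` and `SubstituteRule` shapes from above:

```ts
function fitsEnvelope(bucket: UsageBucket, rule: SubstituteRule): boolean {
  // Both sides must fit; exceeding either average suppresses the recommendation.
  return (
    bucket.avgPromptTokens <= rule.maxAvgPromptTokens &&
    bucket.avgCompletionTokens <= rule.maxAvgCompletionTokens
  );
}
```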
Longest-prefix matching for dated variants
OpenAI returns dated model strings (`gpt-4o-mini-2024-07-18`) in the response body, and that's what ends up in `requests.model`. The matcher does an exact lookup first, then falls back to a longest boundary-aware prefix match so a dated variant correctly resolves to its family rule:
```ts
matchSubstitute('openai:gpt-4o-2024-08-06')
// → resolves to 'openai:gpt-4o' rule
// → suggests gpt-4o-mini
```
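A sketch of that lookup order, assuming the `SUBSTITUTES` map from above (the real matcher lives in `lib/model-recommend-rules.ts`):

```ts
function matchSubstitute(key: string): SubstituteRule | undefined {
  // 1. Exact lookup.
  const exact = SUBSTITUTES[key];
  if (exact) return exact;

  // 2. Longest boundary-aware prefix: a rule key matches only if the next
  //    character in `key` is '-', so 'openai:gpt-4o' captures the dated
  //    'openai:gpt-4o-2024-08-06', while a dated mini variant resolves to
  //    the longer 'openai:gpt-4o-mini' rule when one exists.
  let best: string | undefined;
  for (const ruleKey of Object.keys(SUBSTITUTES)) {
    const boundaryPrefix =
      key.startsWith(ruleKey) && key[ruleKey.length] === '-';
    if (boundaryPrefix && (!best || ruleKey.length > best.length)) {
      best = ruleKey;
    }
  }
  return best ? SUBSTITUTES[best] : undefined;
}
```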
Savings calculation

```text
monthlyCostCurrent = totalCostUsd * (30 / 7)
monthlyCostSuggested = monthlyCostCurrent * substitute.costRatio
estimatedSavingsUsd = monthlyCostCurrent - monthlyCostSuggested
```

Only recommendations with `estimatedSavingsUsd` ≥ $5 surface in the dashboard — below that, the signal-to-noise isn't worth your attention.
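Worked through with the gpt-4o numbers from the API response below: a 7-day spend of $96.25 extrapolates to $412.50/month; at the 6% cost ratio the suggested cost is ~$24.75, leaving ~$387.75 in estimated savings. A sketch, including the $5 floor:

```ts
function estimateSavings(totalCostUsd: number, costRatio: number) {
  const monthlyCostCurrent = totalCostUsd * (30 / 7);
  const monthlyCostSuggested = monthlyCostCurrent * costRatio;
  const estimatedSavingsUsd = monthlyCostCurrent - monthlyCostSuggested;
  // Recommendations under the $5/month floor are suppressed as noise.
  return estimatedSavingsUsd >= 5
    ? { monthlyCostCurrent, monthlyCostSuggested, estimatedSavingsUsd }
    : undefined;
}

estimateSavings(96.25, 0.06);
// → { monthlyCostCurrent: ~412.50, monthlyCostSuggested: ~24.75, estimatedSavingsUsd: ~387.75 }
```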
Using it
Dashboard
Visit /recommendations (labeled “Savings” in the sidebar). Each row shows:
- Current model + sample count + monthly cost
- Suggested model + estimated monthly cost after swap
- Estimated savings in USD/month
- Rationale — the rule's `reason` string (e.g. “Short inputs/outputs suggest classification workload”)
API
```bash
GET /api/v1/recommendations
# →
# [
#   {
#     "currentProvider": "openai",
#     "currentModel": "gpt-4o-2024-08-06",
#     "sampleCount": 42103,
#     "avgPromptTokens": 180,
#     "avgCompletionTokens": 85,
#     "monthlyCostCurrentUsd": 412.50,
#     "suggestedProvider": "openai",
#     "suggestedModel": "gpt-4o-mini",
#     "monthlyCostSuggestedUsd": 24.75,
#     "estimatedSavingsUsd": 387.75,
#     "reason": "Short inputs/outputs suggest classification/extraction workload — gpt-4o-mini covers it at ~15x lower cost."
#   }
# ]
```

Design choices
- Rules are curated, not ML. Empirical cost ratios and token envelopes come from hand-tested substitutions. A learned recommender would drift as model prices change weekly; curated rules are easier to audit and correct.
- No A/B auto-rollout. We show you the recommendation; you decide whether to switch. Automated multi-armed-bandit model routing is out of scope for launch — it's a different product surface.
- Conservative envelope. Better to miss a borderline recommendation than to suggest a swap that degrades your UX. False negatives are recoverable; false positives break trust.
Limitations
- Token-based, not task-based. A 200-token prompt can be “summarize this article” (gpt-4o-mini is fine) or “generate the JSON schema for my domain model” (gpt-4o is better). The envelope catches most cases but occasional false positives are possible — hence the manual-approval loop.
- Rule table needs periodic refresh. New models (GPT-5, Claude 4.7) need rule entries added. Tracked as a Phase 3 maintenance item.
- No cross-provider recommendations yet. We don't suggest “switch from gpt-4o-mini to claude-haiku” even when cheaper — accuracy comparisons across providers are too workload-dependent to ship blind.
Related: Prompts (A/B by cost), /recommendations dashboard. Source: `apps/server/src/lib/model-recommend-rules.ts`.