LLM Cost Tracking: Monitor and Reduce Your AI API Spend
What to capture, how to slice it, and the five swaps that typically cut OpenAI + Anthropic + Gemini bills by 30 to 60 percent without quality regression.
LLM API bills surprise teams because cost is a moving target. Token counts change with prompt edits, model prices change with provider releases, and a single hot user can multiply your spend overnight. Without per-request tracking, the first signal you get is the monthly invoice — by which point the damage is already booked.
This guide covers what cost data to capture, how to aggregate it, and the handful of high-leverage swaps that actually move the bill. For interactive estimation, use the LLM cost calculator. For provider-specific pricing breakdowns, see /pricing/gpt-4o, /pricing/claude-3-5-sonnet, and /pricing/gemini-2-0-flash.
What to capture per request
Model variant (exact dated name)
gpt-4o-mini-2024-07-18 is priced differently from gpt-4o-2024-08-06. Always capture the exact returned name, not just the requested model alias.
Input tokens (with cache split)
Anthropic and OpenAI both return cache_creation and cache_read counts separately. Cached tokens are cheaper and worth tracking independently.
Output tokens
Reasoning models (o1, o3-mini, Claude extended thinking) emit reasoning tokens that are billed as output but produce no user-visible text. Capture them as a separate field.
Provider returned cost (where available)
Anthropic now returns usage cost in the response. Capture it as cost_usd_provider for reconciliation against your own calculation.
Latency split (TTFT and total)
For cost-vs-quality decisions, time-to-first-token often matters more than total time. Streaming captures both.
Customer / session / endpoint tags
Three dimensions almost everyone wants to slice by. Tag at request time with X-Spanlens-User, X-Spanlens-Session, and a custom tag header.
Five swaps that cut bills 30 to 60 percent
- 1
Route routing calls to a small model
Classification, intent detection, and routing decisions almost never need a frontier model. Replace GPT-4o with GPT-4o-mini for these — same quality for narrow tasks at one-fifteenth the cost. Confirm with an eval comparison before rollout.
- 2
Enable prompt caching for shared system prompts
Anthropic prompt caching is 90% off cache reads. OpenAI cache reads are 50% off. If your system prompt is more than 1K tokens and shared across many requests, caching pays back immediately.
- 3
Cap max_tokens
A failed structured-output retry that runs to 4096 tokens costs 100x a successful one. Set max_tokens explicitly. Detect failures earlier with stricter output schemas.
- 4
Pre-summarize long context once
For an agent with a growing conversation history, summarize the older turns once and replace them with the summary. A 10-turn conversation that sends the full history every turn costs O(n²) — summarizing makes it O(n).
- 5
Switch reasoning models for the right tasks only
o1 and o3-mini are 6x to 60x more expensive than GPT-4o on output. Use them for hard reasoning steps only, not as a default. Most agent workflows have at most one or two reasoning-heavy steps.
Model price quick reference
| Model | Input / 1M | Output / 1M | Detail |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | See breakdown → |
| GPT-4o-mini | $0.15 | $0.60 | See breakdown → |
| o3-mini | $1.10 | $4.40 | — |
| Claude 3.5 Sonnet | $3.00 | $15.00 | See breakdown → |
| Claude 3.5 Haiku | $0.80 | $4.00 | — |
| Gemini 2.0 Flash | $0.10 | $0.40 | See breakdown → |
| Gemini 1.5 Pro | $1.25 | $5.00 | — |
Prices in USD as of 2026-06. See each model page for region, cache, and batch discounts.
Related
FAQ
How do I track LLM API costs?
Calculate cost per request from input + output tokens against the current model price table. Aggregate by model, prompt version, customer, or endpoint. Spanlens does this automatically — every captured request includes a cost_usd field computed from the provider response and the latest published price.
How is GPT-4o priced?
GPT-4o costs $2.50 per 1M input tokens and $10 per 1M output tokens at the standard tier. Cached inputs cost $1.25 per 1M. GPT-4o-mini is $0.15 per 1M input and $0.60 per 1M output — about 15x cheaper. See /pricing/gpt-4o for a fuller breakdown including monthly cost scenarios.
How do I reduce my OpenAI bill?
Five high-leverage moves. (1) Route classification and routing calls to GPT-4o-mini instead of GPT-4o — same quality for narrow tasks at one-fifteenth the cost. (2) Enable prompt caching for shared system prompts. (3) Set max_tokens so retries cannot run away with output cost. (4) Switch to JSON mode and trim the response schema. (5) Pre-summarize long context once instead of resending it per turn.
What is a model savings recommender?
A tool that analyzes your captured traffic and suggests model swaps with dollar figures attached, like "swap these gpt-4o classification calls to gpt-4o-mini, save $412/mo." Spanlens has one built in. The recommendation includes evidence: sample requests, expected accuracy delta from your own eval data, and the exact filter to apply.
How do budget alerts work?
Budget alerts fire when projected spend exceeds a configured threshold. Spanlens supports daily, weekly, and monthly budgets with optional alerts at 50%, 80%, and 100%. Alerts go to email on Pro and to Slack + webhooks on Team. The projection uses a rolling 7-day rate, not just a linear extrapolation of MTD spend.
Can I bill my customers for their LLM usage?
Yes. Tag each request with a customer ID using the X-Spanlens-User header. Spanlens then aggregates per-customer cost in /users — feed that to your billing system. Same applies for sessions, projects, or any other dimension you tag.
See your real cost per request, per model, per customer.