LLM Cost Tracking: Monitor and Reduce Your AI API Spend

What to capture, how to slice it, and the five swaps that typically cut OpenAI + Anthropic + Gemini bills by 30 to 60 percent without quality regression.

LLM API bills surprise teams because cost is a moving target. Token counts change with prompt edits, model prices change with provider releases, and a single hot user can multiply your spend overnight. Without per-request tracking, the first signal you get is the monthly invoice — by which point the damage is already booked.

This guide covers what cost data to capture, how to aggregate it, and the handful of high-leverage swaps that actually move the bill. For interactive estimation, use the LLM cost calculator. For provider-specific pricing breakdowns, see /pricing/gpt-4o, /pricing/claude-3-5-sonnet, and /pricing/gemini-2-0-flash.

What to capture per request

Model variant (exact dated name)

gpt-4o-mini-2024-07-18 is priced differently from gpt-4o-2024-08-06. Always capture the exact returned name, not just the requested model alias.

Input tokens (with cache split)

Anthropic and OpenAI both return cache_creation and cache_read counts separately. Cached tokens are cheaper and worth tracking independently.

Output tokens

Reasoning models (o1, o3-mini, Claude extended thinking) emit reasoning tokens that are billed as output but produce no user-visible text. Capture them as a separate field.

Provider returned cost (where available)

Anthropic now returns usage cost in the response. Capture it as cost_usd_provider for reconciliation against your own calculation.

Latency split (TTFT and total)

For cost-vs-quality decisions, time-to-first-token often matters more than total time. Streaming captures both.

Customer / session / endpoint tags

Three dimensions almost everyone wants to slice by. Tag at request time with X-Spanlens-User, X-Spanlens-Session, and a custom tag header.

Five swaps that cut bills 30 to 60 percent

1
Route routing calls to a small model
Classification, intent detection, and routing decisions almost never need a frontier model. Replace GPT-4o with GPT-4o-mini for these — same quality for narrow tasks at one-fifteenth the cost. Confirm with an eval comparison before rollout.
2
Enable prompt caching for shared system prompts
Anthropic prompt caching is 90% off cache reads. OpenAI cache reads are 50% off. If your system prompt is more than 1K tokens and shared across many requests, caching pays back immediately.
3
Cap max_tokens
A failed structured-output retry that runs to 4096 tokens costs 100x a successful one. Set max_tokens explicitly. Detect failures earlier with stricter output schemas.
4
Pre-summarize long context once
For an agent with a growing conversation history, summarize the older turns once and replace them with the summary. A 10-turn conversation that sends the full history every turn costs O(n²) — summarizing makes it O(n).
5
Switch reasoning models for the right tasks only
o1 and o3-mini are 6x to 60x more expensive than GPT-4o on output. Use them for hard reasoning steps only, not as a default. Most agent workflows have at most one or two reasoning-heavy steps.

Model price quick reference

Model	Input / 1M	Output / 1M	Detail
GPT-4o	$2.50	$10.00	See breakdown →
GPT-4o-mini	$0.15	$0.60	See breakdown →
o3-mini	$1.10	$4.40	—
Claude 3.5 Sonnet	$3.00	$15.00	See breakdown →
Claude 3.5 Haiku	$0.80	$4.00	—
Gemini 2.0 Flash	$0.10	$0.40	See breakdown →
Gemini 1.5 Pro	$1.25	$5.00	—

Prices in USD as of 2026-06. See each model page for region, cache, and batch discounts.

LLM Cost Calculator

Estimate monthly spend across providers in your browser.

LLM Observability Hub

The broader monitoring picture cost is a subset of.

Agent Tracing

Per-step cost breakdown for multi-step workflows.

Spanlens Pricing

How Spanlens itself prices request logging.

FAQ

How do I track LLM API costs?

Calculate cost per request from input + output tokens against the current model price table. Aggregate by model, prompt version, customer, or endpoint. Spanlens does this automatically — every captured request includes a cost_usd field computed from the provider response and the latest published price.

How is GPT-4o priced?

GPT-4o costs $2.50 per 1M input tokens and $10 per 1M output tokens at the standard tier. Cached inputs cost $1.25 per 1M. GPT-4o-mini is $0.15 per 1M input and $0.60 per 1M output — about 15x cheaper. See /pricing/gpt-4o for a fuller breakdown including monthly cost scenarios.

How do I reduce my OpenAI bill?

Five high-leverage moves. (1) Route classification and routing calls to GPT-4o-mini instead of GPT-4o — same quality for narrow tasks at one-fifteenth the cost. (2) Enable prompt caching for shared system prompts. (3) Set max_tokens so retries cannot run away with output cost. (4) Switch to JSON mode and trim the response schema. (5) Pre-summarize long context once instead of resending it per turn.

What is a model savings recommender?

A tool that analyzes your captured traffic and suggests model swaps with dollar figures attached, like "swap these gpt-4o classification calls to gpt-4o-mini, save $412/mo." Spanlens has one built in. The recommendation includes evidence: sample requests, expected accuracy delta from your own eval data, and the exact filter to apply.

How do budget alerts work?

Budget alerts fire when projected spend exceeds a configured threshold. Spanlens supports daily, weekly, and monthly budgets with optional alerts at 50%, 80%, and 100%. Alerts go to email on Pro and to Slack + webhooks on Team. The projection uses a rolling 7-day rate, not just a linear extrapolation of MTD spend.

Can I bill my customers for their LLM usage?

Yes. Tag each request with a customer ID using the X-Spanlens-User header. Spanlens then aggregates per-customer cost in /users — feed that to your billing system. Same applies for sessions, projects, or any other dimension you tag.

See your real cost per request, per model, per customer.

Start free →Try the calculator

LLM Cost Tracking: Monitor and Reduce Your AI API Spend

What to capture per request

Model variant (exact dated name)

Input tokens (with cache split)

Output tokens

Provider returned cost (where available)

Latency split (TTFT and total)

Customer / session / endpoint tags

Five swaps that cut bills 30 to 60 percent

Route routing calls to a small model

Enable prompt caching for shared system prompts

Cap max_tokens

Pre-summarize long context once

Switch reasoning models for the right tasks only

Model price quick reference

Related

FAQ