LLM Cost Tracking: Monitor and Reduce Your AI API Spend

What to capture, how to slice it, and the five swaps that typically cut OpenAI + Anthropic + Gemini bills by 30 to 60 percent without quality regression.

LLM API bills surprise teams because cost is a moving target. Token counts change with prompt edits, model prices change with provider releases, and a single hot user can multiply your spend overnight. Without per-request tracking, the first signal you get is the monthly invoice — by which point the damage is already booked.

This guide covers what cost data to capture, how to aggregate it, and the handful of high-leverage swaps that actually move the bill. For interactive estimation, use the LLM cost calculator. For provider-specific pricing breakdowns, see /pricing/gpt-4o, /pricing/claude-3-5-sonnet, and /pricing/gemini-2-0-flash.

What to capture per request

Model variant (exact dated name)

gpt-4o-mini-2024-07-18 is priced differently from gpt-4o-2024-08-06. Always capture the exact returned name, not just the requested model alias.

Input tokens (with cache split)

Anthropic and OpenAI both return cache_creation and cache_read counts separately. Cached tokens are cheaper and worth tracking independently.

Output tokens

Reasoning models (o1, o3-mini, Claude extended thinking) emit reasoning tokens that are billed as output but produce no user-visible text. Capture them as a separate field.

Provider returned cost (where available)

Anthropic now returns usage cost in the response. Capture it as cost_usd_provider for reconciliation against your own calculation.

Latency split (TTFT and total)

For cost-vs-quality decisions, time-to-first-token often matters more than total time. Streaming captures both.

Customer / session / endpoint tags

Three dimensions almost everyone wants to slice by. Tag at request time with X-Spanlens-User, X-Spanlens-Session, and a custom tag header.

Five swaps that cut bills 30 to 60 percent

  1. 1

    Route routing calls to a small model

    Classification, intent detection, and routing decisions almost never need a frontier model. Replace GPT-4o with GPT-4o-mini for these — same quality for narrow tasks at one-fifteenth the cost. Confirm with an eval comparison before rollout.

  2. 2

    Enable prompt caching for shared system prompts

    Anthropic prompt caching is 90% off cache reads. OpenAI cache reads are 50% off. If your system prompt is more than 1K tokens and shared across many requests, caching pays back immediately.

  3. 3

    Cap max_tokens

    A failed structured-output retry that runs to 4096 tokens costs 100x a successful one. Set max_tokens explicitly. Detect failures earlier with stricter output schemas.

  4. 4

    Pre-summarize long context once

    For an agent with a growing conversation history, summarize the older turns once and replace them with the summary. A 10-turn conversation that sends the full history every turn costs O(n²) — summarizing makes it O(n).

  5. 5

    Switch reasoning models for the right tasks only

    o1 and o3-mini are 6x to 60x more expensive than GPT-4o on output. Use them for hard reasoning steps only, not as a default. Most agent workflows have at most one or two reasoning-heavy steps.

Model price quick reference

ModelInput / 1MOutput / 1MDetail
GPT-4o$2.50$10.00See breakdown →
GPT-4o-mini$0.15$0.60See breakdown →
o3-mini$1.10$4.40
Claude 3.5 Sonnet$3.00$15.00See breakdown →
Claude 3.5 Haiku$0.80$4.00
Gemini 2.0 Flash$0.10$0.40See breakdown →
Gemini 1.5 Pro$1.25$5.00

Prices in USD as of 2026-06. See each model page for region, cache, and batch discounts.

Related

FAQ

How do I track LLM API costs?

Calculate cost per request from input + output tokens against the current model price table. Aggregate by model, prompt version, customer, or endpoint. Spanlens does this automatically — every captured request includes a cost_usd field computed from the provider response and the latest published price.

How is GPT-4o priced?

GPT-4o costs $2.50 per 1M input tokens and $10 per 1M output tokens at the standard tier. Cached inputs cost $1.25 per 1M. GPT-4o-mini is $0.15 per 1M input and $0.60 per 1M output — about 15x cheaper. See /pricing/gpt-4o for a fuller breakdown including monthly cost scenarios.

How do I reduce my OpenAI bill?

Five high-leverage moves. (1) Route classification and routing calls to GPT-4o-mini instead of GPT-4o — same quality for narrow tasks at one-fifteenth the cost. (2) Enable prompt caching for shared system prompts. (3) Set max_tokens so retries cannot run away with output cost. (4) Switch to JSON mode and trim the response schema. (5) Pre-summarize long context once instead of resending it per turn.

What is a model savings recommender?

A tool that analyzes your captured traffic and suggests model swaps with dollar figures attached, like "swap these gpt-4o classification calls to gpt-4o-mini, save $412/mo." Spanlens has one built in. The recommendation includes evidence: sample requests, expected accuracy delta from your own eval data, and the exact filter to apply.

How do budget alerts work?

Budget alerts fire when projected spend exceeds a configured threshold. Spanlens supports daily, weekly, and monthly budgets with optional alerts at 50%, 80%, and 100%. Alerts go to email on Pro and to Slack + webhooks on Team. The projection uses a rolling 7-day rate, not just a linear extrapolation of MTD spend.

Can I bill my customers for their LLM usage?

Yes. Tag each request with a customer ID using the X-Spanlens-User header. Spanlens then aggregates per-customer cost in /users — feed that to your billing system. Same applies for sessions, projects, or any other dimension you tag.

See your real cost per request, per model, per customer.