LLM Observability: The 2026 Guide for Production AI Apps

What to monitor, how to instrument, and how to pick a tool when your LLM app stops being a prototype.

LLM observability is the practice of capturing every call your application makes to a large language model, then surfacing the cost, latency, token usage, and behavioral signals that decide whether the app stays profitable, fast, and safe. It is the LLM-specific equivalent of APM (application performance monitoring) for traditional services, but the failure modes are different enough that you cannot just point Datadog or New Relic at the problem.

This page is a hub. It links out to deeper guides on agent tracing, LLM cost tracking, the major tool comparisons, and provider-specific integration walkthroughs for OpenAI, Anthropic, and Gemini.

What to monitor

A production LLM app generates five categories of signal worth capturing. Most teams start with cost because the bill is the first thing that surprises them, but the other four catch issues earlier.

Cost (USD per request)

Calculated from input + output tokens against the current model price table. Aggregate by model, prompt version, customer, or endpoint. Surfaces the gpt-4o classification calls that should have been gpt-4o-mini.

Latency (p50, p95, p99)

Split by model and prompt version. Streaming responses report time-to-first-token and time-to-last-token separately. A p99 spike usually means a hot prompt got long, not that the provider is down.

Quality (eval scores)

Score every response with an LLM-as-judge or a stored human label. Track drift by prompt version. Pair with experiments so you can replay a fixed dataset across versions and compare before rollout.

Reliability (error rates and retries)

Upstream 429/500, network timeouts, structured-output parse failures. Catch the silent retry loop that doubles your bill without doubling your output.

Security (PII, injection, secret leakage)

Scan request bodies for SSN/credit card/email/IBAN/passport, prompt-injection patterns, and stray API keys. Flag for review without blocking the request, since blocking the LLM call to the user is usually worse than the security issue.

How to instrument

Three patterns dominate. Pick based on your stack rather than ideology.

1. Drop-in SDK

Swap the provider SDK import for an observability-instrumented version. Same surface area, same types. Fastest for single-language apps. Spanlens, Langfuse, and LangSmith all ship drop-in wrappers for OpenAI and Anthropic.

2. Proxy

Point the LLM baseURL at a logging endpoint and put your observability key in the Authorization header. Works in any language including Ruby, Go, and raw HTTP. Spanlens and Helicone are proxy-first; Langfuse can proxy via its gateway but it is not the default mode.

3. OpenTelemetry

Emit OTLP spans from your existing tracing setup. Best if you already have an OTel pipeline and want LLM spans to flow through it. Spanlens, Langfuse, and Arize Phoenix all accept OTLP/HTTP at /v1/traces. See /docs/otel.

Tool landscape (2026)

The space has consolidated around five tools. Each compares head-to-head with Spanlens on a dedicated page.

ToolModelLicenseCompare
SpanlensProxy + SDK + OTelMIT (full)
LangfuseSDK + OTelMIT + ee/ folderCompare →
HeliconeProxyApache 2.0 (in maintenance)Compare →
LangSmithLangChain callbacksClosed sourceCompare →
BraintrustSDKClosed sourceCompare →
Arize PhoenixSDK + OTelELv2Compare →

Related guides

Frequently asked questions

What is LLM observability?

LLM observability is the practice of capturing every call your application makes to a large language model, then surfacing cost, latency, token usage, and behavioral signals. It is the LLM-specific equivalent of APM (application performance monitoring) for traditional services.

Why is LLM observability different from regular APM?

Regular APM tracks HTTP requests and database queries. LLM calls have unique signals: token counts that drive cost, model variants with different quality, prompt versions that change behavior, tool use that branches into agent flows, and non-deterministic output. Standard APM misses cost-per-call, model-by-model breakdown, prompt drift, and PII in request bodies.

What should I monitor for an LLM application?

Five categories. Cost: per-request USD by model and customer. Latency: p50/p95/p99 split by model and prompt version. Quality: eval scores per output, drift over time, human vs judge correlation. Reliability: error rates, retry counts, timeouts. Security: PII matches, prompt injection patterns, API key leakage in logs.

How do I add LLM observability to an existing app?

Three common patterns. Drop-in SDK: swap the provider SDK import for an observability-instrumented version. Proxy: change the LLM baseURL so every call routes through a logging endpoint. OpenTelemetry: emit OTLP spans from your existing tracing setup. Drop-in is fastest for single-language apps, proxy is best for polyglot stacks, OTel is best if you already have an OTel pipeline.

What is the difference between Spanlens, Langfuse, Helicone, and LangSmith?

Spanlens is a drop-in proxy with built-in evals, agent tracing, and Prompt A/B, fully MIT licensed. Langfuse uses an SDK + OTel model with a commercial ee/ folder for enterprise add-ons. Helicone is closest architecturally (proxy-first) but entered maintenance after the 2026 Mintlify acquisition. LangSmith is LangChain-native and excels inside LangChain pipelines.

Can I self-host LLM observability?

Yes. Spanlens, Langfuse, and Arize Phoenix all offer self-hostable builds. Spanlens runs from one Docker compose file with no enterprise feature gating. Langfuse self-host omits the ee/ folder features (SCIM, audit logs, data masking). Helicone self-host is community-maintained after the acquisition.

Add LLM observability in one line. Free tier, no credit card.