Anomalies

Spanlens continuously watches your request stream for latency and cost spikes that fall outside normal variation. No thresholds to configure, no baselines to set: it uses textbook 3-sigma statistics against a rolling 7-day reference window, computed per (provider, model) bucket.

Why it matters

Alerts with hand-set thresholds are either too loud (“fires every day at 9am when traffic ramps”) or too quiet (“threshold was set last October, now misses real problems”). The root cause is the same: your workload's idea of “normal” changes, but static thresholds don't.

Anomaly detection sidesteps this by letting your own data define normal. Every bucket learns its baseline from itself.

How it works

The math (simple)

Pick an observation window (default: the last 1 hour) and a reference window (default: the preceding 7 days, excluding the observation window).
Group requests in both windows by (provider, model).
For each bucket with ≥ 30 reference samples, compute sample mean (μ) and sample standard deviation (σ) on latency and cost.
Flag buckets where the observation-window mean sits 3σ or more above baseline. (Configurable threshold per API call.)

deviations = (currentValue - baselineMean) / baselineStdDev

if deviations >= sigmaThreshold:
  flag as anomaly
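
The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the Spanlens implementation: the function name `detect_anomalies` and the row fields `latency_ms` and `cost_usd` are assumptions made for the example, while the output keys mirror the API response shown under "API".

```python
from collections import defaultdict
from statistics import mean, stdev


def group_by_bucket(rows):
    """Group request rows by their (provider, model) bucket."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[(row["provider"], row["model"])].append(row)
    return buckets


def detect_anomalies(observation, reference, sigma=3.0, min_samples=30):
    """Flag buckets whose observation-window mean is >= sigma deviations above baseline.

    Rows are dicts with hypothetical fields: provider, model, latency_ms, cost_usd.
    """
    obs, ref = group_by_bucket(observation), group_by_bucket(reference)
    flagged = []
    for (provider, model), ref_rows in ref.items():
        obs_rows = obs.get((provider, model), [])
        if not obs_rows or len(ref_rows) < min_samples:
            continue  # nothing to compare, or too little data for a baseline
        for kind, field in (("latency", "latency_ms"), ("cost", "cost_usd")):
            baseline = [r[field] for r in ref_rows]
            mu, sd = mean(baseline), stdev(baseline)  # stdev uses the n-1 denominator
            if sd == 0:
                continue  # perfectly flat baseline: deviations are undefined
            current = mean([r[field] for r in obs_rows])
            deviations = (current - mu) / sd
            if deviations >= sigma:  # one-sided: only spikes above baseline
                flagged.append({
                    "provider": provider, "model": model, "kind": kind,
                    "currentValue": current, "baselineMean": mu,
                    "baselineStdDev": sd, "deviations": deviations,
                    "sampleCount": len(obs_rows), "referenceCount": len(ref_rows),
                })
    return flagged
```

Each signal is evaluated against its own baseline inside the loop, so a latency spike and a cost spike in the same bucket would produce two separate entries.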


3σ corresponds to a one-sided false-positive rate of about 0.13% under a normal distribution: low enough to keep noise out of your inbox while still catching real spikes.
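
The false-positive figure can be checked with the standard normal tail probability. The sketch below assumes the compared value really is normally distributed around the baseline, which real latency and cost data only approximate; the helper name `one_sided_tail` is made up for this example.

```python
import math


def one_sided_tail(sigma: float) -> float:
    """P(Z >= sigma) for a standard normal variable Z."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))


for s in (2, 3, 5):
    print(f"sigma={s}: {one_sided_tail(s):.5%}")
```

At `sigma=3` this gives roughly 0.135%, and at `sigma=2` roughly 2.3%, which is why lowering the threshold to 2 produces noticeably more false positives.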

Why per-bucket matters

gpt-4o and gpt-4o-mini have very different latency profiles (5-10× apart), and the same is true across Anthropic and Gemini models. A single global baseline would hide real anomalies: a fast model can triple its latency and still sit below the pooled mean. Each model learns its own normal.
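
A small made-up example shows the effect. The latency values below are invented for illustration and are not measurements of any real model; the point is only that the same spike looks very different against a per-bucket baseline than against a pooled one.

```python
from statistics import mean, stdev

# Invented reference latencies (ms) for a fast and a slow model.
fast = [280, 300, 320] * 20
slow = [2300, 2500, 2700] * 20


def z_score(value, baseline):
    """Number of sample standard deviations `value` sits above the baseline mean."""
    return (value - mean(baseline)) / stdev(baseline)


spike = 900  # the fast model suddenly takes three times as long

own_z = z_score(spike, fast)            # measured against its own bucket
pooled_z = z_score(spike, fast + slow)  # measured against one global baseline

print(f"per-bucket: {own_z:.1f} sigma, pooled: {pooled_z:.1f} sigma")
```

Against its own bucket the spike is far beyond 3σ, while against the pooled baseline it is below the mean and would never be flagged.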

Two signals tracked

Signal	What it catches
Latency	Provider slowdowns (OpenAI having a bad day), network issues, unusually long prompts in your workload, regional outages
Cost	Prompt bloat (retrieval returning too many docs), runaway completions, someone accidentally switching to a more expensive model in code

Both are computed against their own baselines — no coupling.

On-demand detection + daily history

The “right now” view runs on-demand when you open the dashboard or hit the API, always using the current time — so the view is always fresh. A background cron job also runs once a day at 04:00 UTC to persist a snapshot into the 30-day history log.
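
Because the live view is anchored to the current time, both windows can be derived from a single timestamp. The sketch below assumes half-open intervals and uses a hypothetical helper name, `detection_windows`; the defaults match the documented `observationHours` and `referenceHours`.

```python
from datetime import datetime, timedelta, timezone


def detection_windows(now, observation_hours=1, reference_hours=168):
    """Return ((obs_start, obs_end), (ref_start, ref_end)) as half-open intervals.

    The reference window ends where the observation window begins,
    so the two never overlap.
    """
    obs_start = now - timedelta(hours=observation_hours)
    ref_start = obs_start - timedelta(hours=reference_hours)
    return (obs_start, now), (ref_start, obs_start)


observation, reference = detection_windows(datetime.now(timezone.utc))
print("observation:", observation)
print("reference:  ", reference)
```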

Using it

Dashboard

Visit /anomalies. Flagged buckets show:

provider + model
Signal (latency / cost)
Current value (last hour mean)
Baseline mean ± stddev
Deviations (how many σ above normal)
Sample counts (both windows)

No anomalies? The page tells you — that's the good state. Your infrastructure is behaving predictably.

API

GET /api/v1/anomalies?observationHours=1&referenceHours=168&sigma=3

# → array of flagged buckets:
# [
#   {
#     "provider": "openai",
#     "model": "gpt-4o",
#     "kind": "latency",
#     "currentValue": 8200,        // ms
#     "baselineMean": 1100,
#     "baselineStdDev": 180,
#     "deviations": 39.4,
#     "sampleCount": 42,
#     "referenceCount": 18420
#   }
# ]
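
A minimal client sketch using only the Python standard library. The base URL is a placeholder, and authentication is left to a caller-supplied `headers` dict because this page does not describe it; the helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def anomalies_url(base_url, observation_hours=1, reference_hours=168, sigma=3):
    """Build the request URL with the documented query parameters."""
    query = urlencode({
        "observationHours": observation_hours,
        "referenceHours": reference_hours,
        "sigma": sigma,
    })
    return f"{base_url.rstrip('/')}/api/v1/anomalies?{query}"


def fetch_anomalies(base_url, headers=None, **params):
    """GET the flagged buckets; pass whatever auth headers your deployment needs."""
    request = Request(anomalies_url(base_url, **params), headers=headers or {})
    with urlopen(request, timeout=10) as response:
        return json.load(response)


def worst_first(anomalies):
    """Sort flagged buckets by how many sigma they sit above baseline."""
    return sorted(anomalies, key=lambda a: a["deviations"], reverse=True)
```

For example, `worst_first(fetch_anomalies("https://spanlens.example.com", sigma=2))` would list the most extreme buckets first, assuming that placeholder host were a real deployment.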


Tuning

Query parameters let you adjust sensitivity:

Param	Default	When to change
`observationHours`	1	Bigger (6, 24) if you have low traffic — avoids small-sample noise
`referenceHours`	168 (7d)	Shorter if your workload changed recently and old data is unrepresentative
`sigma`	3	Lower to 2 for more sensitive detection (more false positives); higher for quieter
`minSamples`	30	Don't usually touch — below this, stats are meaningless

Design choices

Sample stddev (n−1 denominator). Bessel's correction — unbiased estimator for a finite sample.
No seasonal decomposition. A 7-day rolling baseline already captures weekly rhythm implicitly. More sophisticated models (STL, Prophet, LSTM) are overkill at current scale and harder to explain.
One-sided detection. Only “spike above baseline” triggers — drops in latency or cost are good news, not incidents.
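
Two of these choices are easy to see in code. The sketch below uses invented latency values and a hypothetical helper, `is_spike`; Python's `statistics.stdev` applies Bessel's correction while `statistics.pstdev` does not.

```python
from statistics import mean, pstdev, stdev

baseline = [1000, 1100, 1200]  # invented latencies in ms

sample_sd = stdev(baseline)       # n-1 denominator (Bessel's correction)
population_sd = pstdev(baseline)  # n denominator, smaller for the same data


def is_spike(current, baseline, sigma=3.0):
    """One-sided check: only values above the baseline can trigger."""
    return (current - mean(baseline)) / stdev(baseline) >= sigma


print(sample_sd, population_sd)
print(is_spike(1500, baseline))  # four sigma above the mean
print(is_spike(700, baseline))   # four sigma below the mean
```

The drop to 700 ms is just as far from the mean as the spike to 1500 ms, but only the spike is treated as an anomaly.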

Limitations

High-severity anomalies (≥5σ) auto-notify via your configured notification channels (Slack, email, Discord). Configure channels in Alerts. Medium-severity anomalies (3–5σ) are dashboard-only — use threshold-based alert rules for finer-grained routing.
Latency / cost detection uses success-only rows. Failed requests usually return fast and would distort the latency baseline; we filter them out for those signals. Error-rate detection includes ALL rows since that's the point.
History is daily-snapshot, not real-time. New anomalies appear in the live view immediately but take up to 24 hours to land in the 30-day history log.
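
The severity split described in the first item can be sketched as a small classifier. The function names and the returned labels are assumptions for illustration; only the 3σ and 5σ boundaries come from the text above.

```python
def severity(deviations, sigma=3.0, notify_at=5.0):
    """Classify a deviation score: 'high' (>= 5 sigma), 'medium' (3 to 5 sigma), else None."""
    if deviations >= notify_at:
        return "high"
    if deviations >= sigma:
        return "medium"
    return None


def should_notify(deviations):
    """Only high-severity anomalies go to notification channels."""
    return severity(deviations) == "high"


for score in (2.9, 3.5, 39.4):
    print(score, severity(score), should_notify(score))
```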

Related: Alerts (threshold + notification), Cost tracking, /anomalies dashboard.