Anomalies
Spanlens continuously watches your request stream for latency and cost spikes that fall outside normal variation. No thresholds to configure, no baselines to set — it uses textbook 3-sigma statistics against a rolling 7-day reference window, computed per (provider, model) bucket.
Why it matters
Alerts with hand-set thresholds are either too loud (“fires every day at 9am when traffic ramps”) or too quiet (“threshold was set last October, now misses real problems”). The root cause is the same: your workload's idea of “normal” changes, but static thresholds don't.
Anomaly detection sidesteps this by letting your own data define normal. Every bucket learns its baseline from itself.
How it works
The math (simple)
- Pick an observation window (default: the last 1 hour) and a reference window (default: the preceding 7 days, excluding the observation window).
- Group requests in both windows by (provider, model).
- For each bucket with ≥ 30 reference samples, compute the sample mean (μ) and sample standard deviation (σ) on latency and cost.
- Flag buckets where the observation-window mean sits 3σ or more above baseline. (Configurable threshold per API call.)
```text
deviations = (currentValue - baselineMean) / baselineStdDev
if deviations >= sigmaThreshold:
    flag as anomaly
```
3σ corresponds to ~0.13% false-positive rate under a normal distribution — generous enough to catch real spikes without flooding your inbox.
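For a concrete picture of the whole computation, here is a minimal Python sketch of the per-bucket detection. It assumes request rows are plain dicts with provider, model and a numeric value field; the field and function names are illustrative, not the Spanlens schema:

```python
from collections import defaultdict
from statistics import mean, stdev  # stdev uses the n-1 (sample) denominator

MIN_SAMPLES = 30
SIGMA_THRESHOLD = 3.0

def detect_anomalies(observation_rows, reference_rows, value_key="latency_ms"):
    """Flag (provider, model) buckets whose observation mean sits >= 3 sigma above baseline.

    Illustrative only: row dicts with 'provider', 'model' and a numeric value
    field are assumed here, not the actual Spanlens storage schema.
    """
    def bucketed(rows):
        buckets = defaultdict(list)
        for row in rows:
            buckets[(row["provider"], row["model"])].append(row[value_key])
        return buckets

    obs, ref = bucketed(observation_rows), bucketed(reference_rows)
    flagged = []
    for bucket, ref_values in ref.items():
        if len(ref_values) < MIN_SAMPLES or bucket not in obs:
            continue  # too few reference samples for meaningful stats
        baseline_mean = mean(ref_values)
        baseline_std = stdev(ref_values)
        if baseline_std == 0:
            continue  # perfectly flat baseline, nothing sensible to divide by
        current = mean(obs[bucket])
        deviations = (current - baseline_mean) / baseline_std
        if deviations >= SIGMA_THRESHOLD:  # one-sided: only spikes above baseline
            flagged.append({
                "bucket": bucket,
                "currentValue": current,
                "baselineMean": baseline_mean,
                "baselineStdDev": baseline_std,
                "deviations": deviations,
            })
    return flagged
```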
Why per-bucket matters
gpt-4o and gpt-4o-mini have latency profiles that differ by 5-10×, as do different Anthropic and Gemini models. Computing one global baseline would hide real anomalies. Each model learns its own normal.
Two signals tracked
| Signal | What it catches |
|---|---|
| Latency | Provider slowdowns (OpenAI having a bad day), network issues, unusually long prompts in your workload, regional outages |
| Cost | Prompt bloat (retrieval returning too many docs), runaway completions, someone accidentally switching to a more expensive model in code |
Both are computed against their own baselines — no coupling.
On-demand detection + daily history
The “right now” view runs on-demand when you open the dashboard or hit the API, always using the current time — so the view is always fresh. A background cron job also runs once a day at 04:00 UTC to persist a snapshot into the 30-day history log.
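As a rough sketch of how the two modes could fit together, reusing the detect_anomalies() sketch above: fetch_rows, the JSONL history file, and the scheduling hook are all hypothetical stand-ins, not Spanlens internals.

```python
import json
from datetime import datetime, timedelta, timezone

def live_view(fetch_rows, observation_hours=1, reference_hours=168):
    """On-demand detection: both windows are anchored to the current time."""
    now = datetime.now(timezone.utc)
    obs_start = now - timedelta(hours=observation_hours)
    ref_start = obs_start - timedelta(hours=reference_hours)
    observation = fetch_rows(obs_start, now)       # e.g. the last hour
    reference = fetch_rows(ref_start, obs_start)   # preceding 7 days, observation window excluded
    return detect_anomalies(observation, reference)  # sketch from "The math (simple)"

def daily_snapshot(fetch_rows, history_path="anomaly_history.jsonl"):
    """Run by a scheduler (e.g. cron at 04:00 UTC) to persist one snapshot per day."""
    snapshot = {
        "takenAt": datetime.now(timezone.utc).isoformat(),
        "anomalies": live_view(fetch_rows),
    }
    with open(history_path, "a") as f:
        f.write(json.dumps(snapshot) + "\n")
```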
Using it
Dashboard
Visit /anomalies. Flagged buckets show:
- provider + model
- Signal (latency / cost)
- Current value (last hour mean)
- Baseline mean ± stddev
- Deviations (how many σ above normal)
- Sample counts (both windows)
No anomalies? The page tells you — that's the good state. Your infrastructure is behaving predictably.
API
```bash
GET /api/v1/anomalies?observationHours=1&referenceHours=168&sigma=3
# → array of flagged buckets:
# [
# {
# "provider": "openai",
# "model": "gpt-4o",
# "kind": "latency",
# "currentValue": 8200, // ms
# "baselineMean": 1100,
# "baselineStdDev": 180,
# "deviations": 39.4,
# "sampleCount": 42,
# "referenceCount": 18420
# }
# ]
```
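If you want to script against the endpoint, a minimal Python client could look like the sketch below; the base URL is a placeholder for your own deployment, and any authentication your instance requires is omitted:

```python
import requests  # third-party: pip install requests

BASE_URL = "https://your-spanlens-host"  # placeholder

def fetch_anomalies(observation_hours=1, reference_hours=168, sigma=3):
    resp = requests.get(
        f"{BASE_URL}/api/v1/anomalies",
        params={
            "observationHours": observation_hours,
            "referenceHours": reference_hours,
            "sigma": sigma,
        },
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()  # list of flagged buckets

for a in fetch_anomalies():
    print(f'{a["provider"]}/{a["model"]} {a["kind"]}: '
          f'{a["deviations"]:.1f} sigma above baseline')
```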
Tuning
Query parameters let you adjust sensitivity:
| Param | Default | When to change |
|---|---|---|
| observationHours | 1 | Bigger (6, 24) if you have low traffic — avoids small-sample noise |
| referenceHours | 168 (7d) | Shorter if your workload changed recently and old data is unrepresentative |
| sigma | 3 | Lower to 2 for more sensitive detection (more false positives); higher for quieter |
| minSamples | 30 | Don't usually touch — below this, stats are meaningless |
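For example, a low-traffic service might widen the observation window and lower sigma, reusing the fetch_anomalies() helper sketched above:

```python
# Low-traffic service: a 6-hour observation window smooths small-sample noise,
# and sigma=2 trades a quieter inbox for earlier detection.
anomalies = fetch_anomalies(observation_hours=6, sigma=2)
```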
Design choices
- Sample stddev (n−1 denominator). Bessel's correction gives an unbiased variance estimate from a finite sample (see the sketch after this list).
- No seasonal decomposition. A 7-day rolling baseline already captures weekly rhythm implicitly. More sophisticated (STL, Prophet, LSTM) models are overkill at current scale and harder to explain.
- One-sided detection. Only “spike above baseline” triggers — drops in latency or cost are good news, not incidents.
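To make the first and third choices concrete, using only Python's standard library (the baseline figures echo the API example above; the sample list itself is made up):

```python
from statistics import pstdev, stdev

reference = [1100, 1180, 1050, 1220, 990, 1140]  # made-up latency samples (ms)
print(stdev(reference))   # sample stddev, n - 1 denominator (what the docs describe)
print(pstdev(reference))  # population stddev, n denominator, shown for contrast

# One-sided: only spikes above the baseline flag, drops never do.
baseline_mean, baseline_std = 1100, 180
assert (400 - baseline_mean) / baseline_std < 3    # unusually fast: ignored
assert (8200 - baseline_mean) / baseline_std >= 3  # unusually slow: flagged
```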
Limitations
- High-severity anomalies (≥5σ) auto-notify via your configured notification channels (Slack, email, Discord). Configure channels in Alerts. Medium-severity anomalies (3–5σ) are dashboard-only — use threshold-based alert rules for finer-grained routing.
- Latency / cost detection uses success-only rows. Failed requests usually return fast and would distort the latency baseline; we filter them out for those signals. Error-rate detection includes ALL rows since that's the point.
- History is daily-snapshot, not real-time. New anomalies appear in the live view immediately but take up to 24 hours to land in the 30-day history log.
Related: Alerts (threshold + notification), Cost tracking, /anomalies dashboard.