Anomalies
Spanlens continuously watches your request stream for latency spikes, cost spikes, and error-rate increases that fall outside normal variation. There are no thresholds to configure and no baselines to set: it uses textbook 3-sigma statistics against a rolling 7-day reference window, computed per (provider, model) bucket.
Why it matters
Alerts with hand-set thresholds are either too loud (“fires every day at 9am when traffic ramps”) or too quiet (“threshold was set last October, now misses real problems”). The root cause is the same: your workload's idea of “normal” changes, but static thresholds don't.
Anomaly detection sidesteps this by letting your own data define normal. Every bucket learns its baseline from itself.
How it works
The math (simple)
- Pick an observation window (default: the last 1 hour) and a reference window (default: the preceding 7 days, excluding the observation window).
- Group requests in both windows by (provider, model).
- For each bucket with ≥ 10 reference samples, compute the sample mean (μ) and sample standard deviation (σ) of the signal. Each anomaly is tagged with a confidence label based on how many samples the baseline is built from; see Confidence tiers below.
- Flag buckets where the observation-window mean sits 3σ or more above baseline. (Configurable threshold per API call.)
deviations = (currentValue - baselineMean) / baselineStdDev
if deviations >= sigmaThreshold:
    flag as anomaly
3σ corresponds to a ~0.13% false-positive rate under a normal distribution, generous enough to catch real spikes without flooding your inbox.
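The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not Spanlens's actual implementation: the function name `detect_spike` is hypothetical, the result keys are borrowed from the Live API response shown later on this page, and skipping a zero-variance baseline is an assumption the page does not address.

```python
from statistics import mean, stdev


def detect_spike(reference, observation, sigma_threshold=3.0, min_samples=10):
    """Return a finding dict if the observation mean is a spike, else None.

    `reference` and `observation` hold one signal (e.g. latency in ms)
    for a single (provider, model) bucket.
    """
    # Too little data for a baseline (stdev also needs at least 2 points).
    if len(reference) < max(min_samples, 2) or not observation:
        return None

    baseline_mean = mean(reference)
    baseline_std = stdev(reference)  # sample stddev, n-1 denominator
    if baseline_std == 0:
        # Assumption: a perfectly flat baseline is skipped to avoid dividing by zero.
        return None

    current = mean(observation)
    deviations = (current - baseline_mean) / baseline_std
    if deviations < sigma_threshold:
        return None  # one-sided: only values above baseline are flagged

    return {
        "currentValue": current,
        "baselineMean": baseline_mean,
        "baselineStdDev": baseline_std,
        "deviations": deviations,
        "sampleCount": len(observation),
        "referenceCount": len(reference),
    }
```

Calling `detect_spike(reference_latencies, last_hour_latencies)` once per bucket and per signal mirrors the grouping described in the list.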
Confidence tiers
The baseline's reliability scales with the number of samples in the reference window. New organisations (and rarely-used model buckets) need some directional signal in their first week, but a 12-sample standard deviation is much noisier than a 1,000-sample one. The confidence label tells you which regime you're in:
| Confidence | Reference samples | How to read it |
|---|---|---|
| low | 10 – 29 | Directional only. The σ estimate is noisy; treat it as an early warning and verify against the underlying requests before paging. |
| medium | 30 – 99 | The classic 3σ threshold regime. False-positive rate is approximately as advertised. |
| high | 100+ | Statistically robust. Use as the gate when wiring anomalies into a paging integration; see minSamples below for the API parameter. |
Below 10 reference samples the bucket is suppressed entirely (no detection, regardless of observation). Buckets ingested before this tier system was introduced are surfaced with confidence null for back-compat.
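The tier boundaries in the table can be expressed as a small lookup. This is a sketch only; `confidence_tier` is a hypothetical helper name, and returning `None` here stands for "bucket suppressed", which is distinct from the `null` confidence shown on legacy rows.

```python
def confidence_tier(reference_count):
    """Map a reference-window sample count to the confidence label in the table."""
    if reference_count < 10:
        return None  # suppressed: no detection runs for this bucket
    if reference_count < 30:
        return "low"
    if reference_count < 100:
        return "medium"
    return "high"
```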
Why per-bucket matters
gpt-4o and gpt-4o-mini have totally different latency profiles (by 5-10×), as do different Anthropic and Gemini models. Computing one global baseline would hide real anomalies. Each model learns its own normal.
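A small numeric illustration of why a global baseline hides anomalies. The latency values below are made up for the example (a fast model around 200 ms and a slow one around 1,500 ms), and `z_score` is a hypothetical helper.

```python
from statistics import mean, stdev


def z_score(reference, current):
    """Number of sample standard deviations `current` sits above the reference mean."""
    return (current - mean(reference)) / stdev(reference)


# Hypothetical reference latencies in ms for two models.
fast_model = [190, 200, 210, 195, 205, 200, 190, 210, 205, 195]
slow_model = [1450, 1500, 1550, 1480, 1520, 1500, 1450, 1550, 1520, 1480]

spike = 600  # the fast model suddenly takes 3x its usual latency

per_bucket_z = z_score(fast_model, spike)            # far above 3: flagged
global_z = z_score(fast_model + slow_model, spike)   # below the pooled mean: missed
```

Against its own baseline the fast model's spike is unmistakable; pooled with the slow model, 600 ms sits below the combined mean and would never be flagged.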
Three signals tracked
| Signal | What it catches |
|---|---|
| Latency | Provider slowdowns (OpenAI having a bad day), network issues, unusually long prompts in your workload, regional outages |
| Cost | Prompt bloat (retrieval returning too many docs), runaway completions, someone accidentally switching to a more expensive model in code |
| Error rate | Provider outages, quota exhaustion, auth misconfigurations, upstream changes that silently start returning 4xx/5xx. Measured as fraction of requests with status ≥ 400. |
Each signal is computed against its own baseline, with no coupling between signals. Latency and cost baselines use success-only rows (failed requests are fast and would distort the latency baseline). Error-rate detection intentionally includes all rows.
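The row-filtering rule can be sketched like this. The row fields `status` and `latency_ms` are hypothetical names chosen for the example; the page does not document the underlying schema.

```python
def error_rate(rows):
    """Fraction of requests with HTTP status >= 400, computed over all rows."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r["status"] >= 400) / len(rows)


def success_latencies(rows):
    """Latencies of successful requests only, as used for the latency baseline."""
    return [r["latency_ms"] for r in rows if r["status"] < 400]


rows = [
    {"status": 200, "latency_ms": 1000},
    {"status": 200, "latency_ms": 1200},
    {"status": 429, "latency_ms": 20},   # fast failure: would drag the mean down
    {"status": 500, "latency_ms": 30},
]
```

Here the success-only mean latency is 1,100 ms, while averaging all four rows would give 562.5 ms and make the baseline look faster than real traffic.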
On-demand detection + daily history
The “right now” view runs on-demand when you open the dashboard or hit the API, always using the current time, so the view is always fresh. A background cron job also runs once a day at 01:00 UTC to persist a snapshot into the 30-day history log.
Using it
Dashboard
Visit /anomalies. Flagged buckets show:
- provider + model
- Signal (latency / cost / error rate)
- Current value (last hour mean)
- Baseline mean ± stddev
- Deviations (how many σ above normal)
- Sample counts (both windows)
- Confidence badge: low / medium / high based on reference-window size (see Confidence tiers)
- Contributing factors: a why · hint explaining the likely root cause
- Acknowledged state (if you've silenced it)
No anomalies? The page tells you so, and that's the good state: your infrastructure is behaving predictably.
Understanding why: Contributing factors
When a bucket is flagged, Spanlens automatically fetches root-cause context so you don't have to dig through raw logs first. The why · line appears beneath each anomaly entry, powered by a single additional DB scan per unique (provider, model).
| Signal | What the hint shows | How to interpret it |
|---|---|---|
| Latency or Cost | The token type that changed most between the observation and reference windows, e.g. Prompt tokens ↑ 3,200 (was 890, +259%) | Prompt token spike → retrieval returning too many chunks, or context growth. Completion token spike → verbose outputs, runaway generation, or a model switch. |
| Error rate | Top HTTP status codes in the observation window, ranked by frequency, e.g. 429: 45 req · 500: 8 req | 429 → quota exhaustion or rate limiting. 500/503 → provider outage. 401/403 → auth misconfiguration or key rotation. 400/422 → request format change or upstream API shift. |
Contributing factors are fetched for the same time windows as the anomaly detection run. If the obs window has no data yet (e.g. you just deployed), the hint is omitted rather than showing misleading nulls.
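As an illustration of how such hints could be derived from the `factors` object returned by the Live API, here is a hedged sketch. The helper names are hypothetical, and two details are assumptions: "changed most" is read as largest relative change, and the percentage is truncated rather than rounded so that the output matches the example in the table.

```python
def token_hint(factors):
    """Describe the token type with the largest relative change, or None."""
    candidates = [
        ("Prompt tokens", factors["obsPromptTokensMean"], factors["refPromptTokensMean"]),
        ("Completion tokens", factors["obsCompletionTokensMean"], factors["refCompletionTokensMean"]),
    ]
    # Skip token types with no reference data to avoid dividing by zero.
    scored = [(abs(obs - ref) / ref, label, obs, ref) for label, obs, ref in candidates if ref]
    if not scored:
        return None
    _, label, obs, ref = max(scored)
    arrow = "↑" if obs > ref else "↓"
    pct = int((obs - ref) / ref * 100)  # assumption: truncated, not rounded
    return f"{label} {arrow} {obs:,.0f} (was {ref:,.0f}, {pct:+d}%)"


def status_hint(distribution):
    """Rank status codes by frequency, most common first."""
    ranked = sorted(distribution, key=lambda d: d["count"], reverse=True)
    return " · ".join(f"{d['code']}: {d['count']} req" for d in ranked)
```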
Acknowledging an anomaly
If you've investigated a flagged bucket and determined it's expected (a deliberate model switch, a batch job, a known provider incident), you can acknowledge it. Acknowledged anomalies are still shown but visually muted so you can focus on new ones.
# Acknowledge
POST /api/v1/anomalies/ack
Content-Type: application/json
{
"provider": "openai",
"model": "gpt-4o",
"kind": "latency",
"projectId": "proj_xxx" // optional, omit for org-wide ack
}
# Un-acknowledge
DELETE /api/v1/anomalies/ack?provider=openai&model=gpt-4o&kind=latency
Requires admin or editor role. Acks are scoped per (org, project, provider, model, kind): acknowledging a bucket org-wide doesn't silence it inside a specific project, and vice versa.
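For scripted acknowledgements, the POST request above could be assembled with Python's standard library as sketched below. The host in `BASE_URL` is a placeholder, `build_ack_request` is a hypothetical helper, and authentication is left out because this page does not describe it; the request is only built here, not sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://spanlens.example"  # placeholder host, not a real endpoint


def build_ack_request(provider, model, kind, project_id=None):
    """Build (but do not send) the acknowledge request for one bucket."""
    body = {"provider": provider, "model": model, "kind": kind}
    if project_id is not None:
        body["projectId"] = project_id  # omit for an org-wide ack
    return Request(
        BASE_URL + "/api/v1/anomalies/ack",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would send it once the required credentials have been added.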
Live API
GET /api/v1/anomalies?observationHours=1&referenceHours=168&sigma=3
# → array of flagged buckets:
# [
# {
# "provider": "openai",
# "model": "gpt-4o",
# "kind": "latency",
# "currentValue": 8200, // ms
# "baselineMean": 1100,
# "baselineStdDev": 180,
# "deviations": 39.4,
# "sampleCount": 42,
# "referenceCount": 18420,
# "confidence": "high", // low | medium | high, reliability of the baseline
# "acknowledgedAt": null, // ISO string if acked, null otherwise
# "factors": { // root-cause contributing factors
# "obsPromptTokensMean": 3200,
# "refPromptTokensMean": 890,
# "obsCompletionTokensMean": 410,
# "refCompletionTokensMean": 390,
# "obsTotalTokensMean": 3610,
# "refTotalTokensMean": 1280,
# "obsStatusDistribution": [] // e.g. [{code:429,count:5},{code:500,count:2}]
# }
# }
# ]
Add projectId=<id> to scope detection to a single project.
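A consumer of this response might filter it before acting on it. The sketch below keeps only unacknowledged findings at or above a chosen confidence; `actionable` is a hypothetical helper, and the field names are taken from the response shape above.

```python
CONFIDENCE_ORDER = {"low": 0, "medium": 1, "high": 2}


def actionable(findings, min_confidence="high"):
    """Keep unacknowledged findings whose confidence meets the minimum.

    Findings with a null confidence (legacy rows) are excluded.
    """
    floor = CONFIDENCE_ORDER[min_confidence]
    return [
        f for f in findings
        if f.get("acknowledgedAt") is None
        and CONFIDENCE_ORDER.get(f.get("confidence"), -1) >= floor
    ]
```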
30-day history
The history view shows past daily snapshots, useful for spotting recurring patterns (“every Monday morning, latency spikes on gpt-4o”).
GET /api/v1/anomalies/history?days=30
# → same shape as the live response, without acknowledgedAt.
# Results cover the last N days, excluding today
# (today is shown in the live view above).
High-severity auto-notifications (≥5σ)
Anomalies that reach 5σ or more are automatically delivered to your configured notification channels (Slack, email, Discord) by the daily snapshot job; no alert rule is needed. Medium-severity anomalies (3–5σ) are dashboard-only; use threshold-based alert rules for finer-grained routing.
Configure channels in Settings → Notifications.
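The severity split described above amounts to a simple classification on the `deviations` value. A minimal sketch, with `severity` as a hypothetical helper name:

```python
def severity(deviations, sigma_threshold=3.0):
    """Classify a finding by how many sigma it sits above baseline."""
    if deviations >= 5:
        return "high"    # auto-delivered to notification channels
    if deviations >= sigma_threshold:
        return "medium"  # dashboard-only
    return None          # below the detection threshold: not an anomaly
```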
Export
Download historical anomaly events as CSV or JSON for offline analysis:
GET /api/v1/exports/anomalies?format=csv&days=30
# format: csv (default) | json
# days: 1–365 (default 30)
Tuning
Query parameters let you adjust sensitivity:
| Param | Default | When to change |
|---|---|---|
| observationHours | 1 | Bigger (6, 24) if you have low traffic; avoids small-sample noise |
| referenceHours | 168 (7d) | Shorter if your workload changed recently and old data is unrepresentative |
| sigma | 3 | Lower to 2 for more sensitive detection (more false positives); higher for quieter |
| projectId | (none) | Scope detection to a single project instead of the whole org |
minSamples | 10 | Raise to 30 or 100 to suppress low/medium-confidence findings when wiring into paging or noisy channels. |
Below minSamples the bucket is suppressed entirely (no row, no notification). The default 10 surfaces directional signal for new orgs in their first week; the dashboard tags each finding with a confidence badge so you can scan past low-confidence rows visually.
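To show how the parameters in the table combine into a request, here is a sketch that builds the query URL. The host is a placeholder and `anomalies_url` is a hypothetical helper; only the path and parameter names come from this page.

```python
from urllib.parse import urlencode


def anomalies_url(base_url, **params):
    """Build the live-detection URL, dropping parameters left as None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    path = "/api/v1/anomalies"
    return f"{base_url}{path}?{query}" if query else f"{base_url}{path}"


# A stricter configuration for a paging integration on a low-traffic org.
paging_url = anomalies_url(
    "https://spanlens.example",  # placeholder host
    observationHours=6,
    sigma=3,
    minSamples=100,
)
```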
Design choices
- Sample stddev (n−1 denominator). Bessel's correction gives an unbiased variance estimate for a finite sample.
- No seasonal decomposition. A 7-day rolling baseline already captures weekly rhythm implicitly. More sophisticated (STL, Prophet, LSTM) models are overkill at current scale and harder to explain.
- One-sided detection. Only “spike above baseline” triggers; drops in latency or cost are good news, not incidents.
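Two of these choices can be checked numerically with the standard library. The sample values are made up, and `upper_tail_probability` is a hypothetical helper that evaluates the one-sided tail of a standard normal distribution.

```python
import math
from statistics import pstdev, stdev


def upper_tail_probability(sigma):
    """P(Z >= sigma) for a standard normal variable (one-sided tail)."""
    return 0.5 * math.erfc(sigma / math.sqrt(2))


samples = [100, 110, 90, 105, 95]  # hypothetical latencies in ms
sample_sd = stdev(samples)         # n-1 denominator (Bessel's correction)
population_sd = pstdev(samples)    # n denominator, always slightly smaller

false_positive_rate = upper_tail_probability(3)  # about 0.00135, i.e. ~0.13%
```

Because detection is one-sided, only the upper tail counts, which is where the ~0.13% figure for 3σ comes from.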
Limitations
- History is daily-snapshot, not real-time. New anomalies appear in the live view immediately but take up to 24 hours to land in the 30-day history log (cron runs at 01:00 UTC).
- Sparse buckets are skipped. Any (provider, model, kind) combination with fewer than minSamples requests (default 10) in the reference window produces no signal: there is not enough data for any baseline. Buckets between 10 and 29 samples surface with low confidence so you can decide whether to act.
- No anomaly-level alert routing. You can't route “only latency anomalies for gpt-4o” to a specific channel. High-severity (≥5σ) goes to all active channels; for finer routing, create a threshold-based alert rule instead.
Related: Alerts (threshold + notification), Cost tracking, /anomalies dashboard.