Anomalies

Spanlens continuously watches your request stream for latency spikes, cost spikes, and error-rate increases that fall outside normal variation. There are no thresholds to configure and no baselines to set: it uses textbook 3-sigma statistics against a rolling 7-day reference window, computed per (provider, model) bucket.

Why it matters

Alerts with hand-set thresholds are either too loud (“fires every day at 9am when traffic ramps”) or too quiet (“threshold was set last October, now misses real problems”). The root cause is the same: your workload's idea of “normal” changes, but static thresholds don't.

Anomaly detection sidesteps this by letting your own data define normal. Every bucket learns its baseline from itself.

How it works

Anomaly = a value 3σ or more above the rolling baseline. Under a normal distribution only about 0.13% of values land that far above the mean by chance, so persistent breaches signal a real shift, not noise.

The math (simple)

  1. Pick an observation window (default: the last 1 hour) and a reference window (default: the preceding 7 days, excluding the observation window).
  2. Group requests in both windows by (provider, model).
  3. For each bucket with ≥ 10 reference samples, compute the sample mean (μ) and sample standard deviation (σ) of the signal. Each anomaly is tagged with a confidence label based on how many samples the baseline is built from (see Confidence tiers below).
  4. Flag buckets where the observation-window mean sits 3σ or more above baseline. (Configurable threshold per API call.)
deviations = (currentValue - baselineMean) / baselineStdDev

if deviations >= sigmaThreshold:
  flag as anomaly

The default threshold of 3σ corresponds to a ~0.13% false-positive rate under a normal distribution, generous enough to catch real spikes without flooding your inbox.
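The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Spanlens's actual implementation: the function names and the sample latencies are hypothetical, and skipping a bucket whose baseline has zero standard deviation is an added assumption that the text does not specify.

```python
from statistics import NormalDist, mean, stdev

MIN_REFERENCE_SAMPLES = 10  # buckets below this are suppressed (step 3)


def deviations(observation, reference):
    """Return how many sample standard deviations the observation-window
    mean sits above the reference-window mean, or None if undetectable."""
    if len(reference) < MIN_REFERENCE_SAMPLES or not observation:
        return None
    baseline_mean = mean(reference)
    baseline_std = stdev(reference)  # sample stddev, n-1 denominator
    if baseline_std == 0:
        return None  # assumption: a perfectly flat baseline is skipped
    return (mean(observation) - baseline_mean) / baseline_std


def is_anomaly(observation, reference, sigma_threshold=3.0):
    """One-sided check: only values above the baseline are flagged."""
    d = deviations(observation, reference)
    return d is not None and d >= sigma_threshold


def false_positive_rate(sigma_threshold=3.0):
    """Upper-tail probability of a standard normal beyond the threshold."""
    return 1.0 - NormalDist().cdf(sigma_threshold)


# Hypothetical latencies in ms for one (provider, model) bucket.
reference = [1000, 1100, 1200, 1050, 1150, 900, 1300, 1100, 1000, 1200]
observation = [1900, 2100, 2000]

print(is_anomaly(observation, reference))  # True
print(round(false_positive_rate(3.0), 5))  # 0.00135
```

The second printed value is the one-sided normal tail beyond 3σ, which is where the ~0.13% figure comes from.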

Confidence tiers

The baseline's reliability scales with the number of samples in the reference window. New organisations (and rarely-used model buckets) need some directional signal in their first week, but a 12-sample standard deviation is much noisier than a 1,000-sample one. The confidence label tells you which regime you're in:

| Confidence | Reference samples | How to read it |
|------------|-------------------|----------------|
| low        | 10 – 29           | Directional only. The σ estimate is noisy; treat as an early warning and verify against the underlying requests before paging. |
| medium     | 30 – 99           | The classic 3σ threshold regime. False-positive rate is approximately as advertised. |
| high       | 100+              | Statistically robust. Use as the gate when wiring anomalies into a paging integration (see minSamples below for the API parameter). |

Below 10 reference samples the bucket is suppressed entirely (no detection, regardless of observation). Buckets ingested before this tier system was introduced are surfaced with confidence null for back-compat.
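As a sketch, the tier boundaries above map to a small lookup. The function name `confidence_label` is hypothetical; here `None` stands for a suppressed bucket, which is a separate case from the `null` confidence that legacy buckets carry.

```python
from typing import Optional


def confidence_label(reference_samples: int) -> Optional[str]:
    """Map a reference-window sample count to its confidence tier.

    Returns None when the bucket would be suppressed (< 10 samples).
    """
    if reference_samples < 10:
        return None
    if reference_samples < 30:
        return "low"
    if reference_samples < 100:
        return "medium"
    return "high"


for n in (9, 12, 30, 99, 100, 18420):
    print(n, confidence_label(n))
```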

Why per-bucket matters

gpt-4o and gpt-4o-mini have latency profiles that differ by 5-10×, as do different Anthropic and Gemini models. Computing one global baseline would hide real anomalies. Each model learns its own normal.
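A small sketch shows why a pooled baseline hides problems. The latencies below are made up for illustration, and `sigmas_above` is a hypothetical helper: a gpt-4o-mini request that takes three times its usual latency is far outside its own bucket's normal, yet it sits below the mean of a baseline that also includes gpt-4o.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical reference-window rows: (provider, model, latency_ms).
mini_ms = (190, 200, 210, 195, 205, 200, 198, 202, 207, 193)
big_ms = (1900, 2000, 2100, 1950, 2050, 2000, 1980, 2020, 2070, 1930)
reference_rows = (
    [("openai", "gpt-4o-mini", ms) for ms in mini_ms]
    + [("openai", "gpt-4o", ms) for ms in big_ms]
)


def sigmas_above(value, samples):
    """Number of sample standard deviations `value` sits above the mean."""
    return (value - mean(samples)) / stdev(samples)


# Per-bucket baselines: one list of latencies per (provider, model).
buckets = defaultdict(list)
for provider, model, ms in reference_rows:
    buckets[(provider, model)].append(ms)

# gpt-4o-mini suddenly takes 600 ms, three times its usual latency.
current = 600
per_bucket = sigmas_above(current, buckets[("openai", "gpt-4o-mini")])
pooled = sigmas_above(current, [ms for _, _, ms in reference_rows])

print(per_bucket >= 3)  # True: far outside gpt-4o-mini's own normal
print(pooled >= 3)      # False: below the pooled mean, so nothing fires
```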

Three signals tracked

| Signal     | What it catches |
|------------|-----------------|
| Latency    | Provider slowdowns (OpenAI having a bad day), network issues, unusually long prompts in your workload, regional outages |
| Cost       | Prompt bloat (retrieval returning too many docs), runaway completions, someone accidentally switching to a more expensive model in code |
| Error rate | Provider outages, quota exhaustion, auth misconfigurations, upstream changes that silently start returning 4xx/5xx. Measured as the fraction of requests with status ≥ 400. |

Each signal is computed against its own baseline, with no coupling between signals. Latency and cost baselines use success-only rows (failed requests are fast and would distort the latency baseline). Error-rate detection intentionally includes all rows.
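The row-filtering rule can be illustrated as follows. This is a sketch only: the row fields (`status`, `latency_ms`, `cost_usd`) and the `signals` helper are hypothetical names, not the Spanlens schema.

```python
from statistics import mean

# Hypothetical request rows; field names are illustrative only.
rows = [
    {"status": 200, "latency_ms": 1200, "cost_usd": 0.012},
    {"status": 200, "latency_ms": 1000, "cost_usd": 0.010},
    {"status": 429, "latency_ms": 40, "cost_usd": 0.0},
    {"status": 500, "latency_ms": 60, "cost_usd": 0.0},
]


def signals(rows):
    """Compute the three signal values for one bucket and one window."""
    ok = [r for r in rows if r["status"] < 400]  # success-only rows
    return {
        "latency": mean(r["latency_ms"] for r in ok) if ok else None,
        "cost": mean(r["cost_usd"] for r in ok) if ok else None,
        # Error rate uses every row: fraction with status >= 400.
        "error_rate": (
            sum(r["status"] >= 400 for r in rows) / len(rows) if rows else None
        ),
    }


result = signals(rows)
print(result["latency"])
print(result["error_rate"])  # 0.5
```

Including the two fast failures would have dragged the latency mean from 1,100 ms down to 575 ms, which is the distortion the success-only filter avoids.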

On-demand detection + daily history

The “right now” view runs on-demand when you open the dashboard or hit the API, always using the current time, so the view is always fresh. A background cron job also runs once a day at 01:00 UTC to persist a snapshot into the 30-day history log.

Using it

Dashboard

Visit /anomalies. Flagged buckets show:

  • provider + model
  • Signal (latency / cost / error rate)
  • Current value (last hour mean)
  • Baseline mean ± stddev
  • Deviations (how many σ above normal)
  • Sample counts (both windows)
  • Confidence badge: low / medium / high, based on reference-window size (see Confidence tiers)
  • Contributing factors: a why · hint explaining the likely root cause
  • Acknowledged state (if you've silenced it)

No anomalies? The page tells you so, and that's the good state: your infrastructure is behaving predictably.

Understanding why: Contributing factors

When a bucket is flagged, Spanlens automatically fetches root-cause context so you don't have to dig through raw logs first. The why · line appears beneath each anomaly entry, powered by a single additional DB scan per unique (provider, model).

| Signal          | What the hint shows | How to interpret it |
|-----------------|---------------------|---------------------|
| Latency or Cost | The token type that changed most between obs and reference windows, e.g. Prompt tokens ↑ 3,200 (was 890, +259%) | Prompt token spike → retrieval returning too many chunks, or context growth. Completion token spike → verbose outputs, runaway generation, or a model switch. |
| Error rate      | Top HTTP status codes in the observation window, ranked by frequency, e.g. 429: 45 req · 500: 8 req | 429 → quota exhaustion or rate limiting. 500/503 → provider outage. 401/403 → auth misconfiguration or key rotation. 400/422 → request format change or upstream API shift. |

Contributing factors are fetched for the same time windows as the anomaly detection run. If the obs window has no data yet (e.g. you just deployed), the hint is omitted rather than showing misleading nulls.
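The two kinds of hint could be derived along these lines. This is a hedged sketch rather than the product's logic: the helper names are hypothetical, and ranking token types by relative change is an assumption, since the text only says "changed most". The `obs…`/`ref…` keys follow the `factors` object in the API response.

```python
from collections import Counter


def top_token_change(factors):
    """Return (token_type, obs_mean, ref_mean, pct_change) for the token
    type with the largest relative change between the two windows."""
    candidates = []
    for name, key in (("prompt", "PromptTokensMean"),
                      ("completion", "CompletionTokensMean")):
        obs, ref = factors["obs" + key], factors["ref" + key]
        if ref:  # skip types with no reference data (avoids dividing by 0)
            candidates.append((name, obs, ref, (obs - ref) / ref * 100))
    return max(candidates, key=lambda c: abs(c[3]), default=None)


def status_hint(status_codes, top=3):
    """Rank error status codes (>= 400) by frequency, most common first."""
    counts = Counter(code for code in status_codes if code >= 400)
    return " · ".join(f"{code}: {n} req" for code, n in counts.most_common(top))


factors = {  # token means taken from the examples on this page
    "obsPromptTokensMean": 3200, "refPromptTokensMean": 890,
    "obsCompletionTokensMean": 410, "refCompletionTokensMean": 390,
}
print(top_token_change(factors)[0])                       # prompt
print(status_hint([429] * 45 + [500] * 8 + [200] * 100))  # 429: 45 req · 500: 8 req
```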

Acknowledging an anomaly

If you've investigated a flagged bucket and determined it's expected (a deliberate model switch, a batch job, a known provider incident), you can acknowledge it. Acknowledged anomalies are still shown but visually muted so you can focus on new ones.

# Acknowledge
POST /api/v1/anomalies/ack
Content-Type: application/json

{
  "provider": "openai",
  "model": "gpt-4o",
  "kind": "latency",
  "projectId": "proj_xxx"   // optional, omit for org-wide ack
}

# Un-acknowledge
DELETE /api/v1/anomalies/ack?provider=openai&model=gpt-4o&kind=latency

Requires admin or editor role. Acks are scoped per (org, project, provider, model, kind): acknowledging a bucket org-wide doesn't silence it inside a specific project, and vice versa.
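For scripting, the two calls might be built like this with Python's standard library. Treat it as a sketch: the host in `BASE_URL`, the Bearer-token header, and the function names are assumptions that this page does not specify, so check them against your own deployment. The requests are only constructed here, not sent.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import Request

BASE_URL = "https://spanlens.example.com"  # hypothetical host


def ack_request(api_key: str, provider: str, model: str, kind: str,
                project_id: Optional[str] = None) -> Request:
    """Build (but do not send) the POST that acknowledges one bucket."""
    body = {"provider": provider, "model": model, "kind": kind}
    if project_id is not None:  # omit projectId for an org-wide ack
        body["projectId"] = project_id
    return Request(
        f"{BASE_URL}/api/v1/anomalies/ack",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def unack_request(api_key: str, provider: str, model: str, kind: str) -> Request:
    """Build the DELETE that removes an acknowledgement."""
    query = urlencode({"provider": provider, "model": model, "kind": kind})
    return Request(
        f"{BASE_URL}/api/v1/anomalies/ack?{query}",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed auth scheme
        method="DELETE",
    )


req = ack_request("sk_test", "openai", "gpt-4o", "latency")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Send with urllib.request.urlopen(req) once BASE_URL and the key are real.
```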

Live API

GET /api/v1/anomalies?observationHours=1&referenceHours=168&sigma=3

# → array of flagged buckets:
# [
#   {
#     "provider": "openai",
#     "model": "gpt-4o",
#     "kind": "latency",
#     "currentValue": 8200,        // ms
#     "baselineMean": 1100,
#     "baselineStdDev": 180,
#     "deviations": 39.4,
#     "sampleCount": 42,
#     "referenceCount": 18420,
#     "confidence": "high",        // low | medium | high, reliability of the baseline
#     "acknowledgedAt": null,      // ISO string if acked, null otherwise
#     "factors": {                 // root-cause contributing factors
#       "obsPromptTokensMean": 3200,
#       "refPromptTokensMean": 890,
#       "obsCompletionTokensMean": 410,
#       "refCompletionTokensMean": 390,
#       "obsTotalTokensMean": 3610,
#       "refTotalTokensMean": 1280,
#       "obsStatusDistribution": [] // e.g. [{code:429,count:5},{code:500,count:2}]
#     }
#   }
# ]

Add projectId=<id> to scope detection to a single project.
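A minimal sketch of consuming the live endpoint: one helper builds the query string from the documented parameters, another keeps only unacknowledged findings at a chosen confidence tier. The host name and both helper names are hypothetical, and the sample payload is a trimmed copy of the response shown above.

```python
import json
from urllib.parse import urlencode


def anomalies_url(base_url, observation_hours=1, reference_hours=168,
                  sigma=3, project_id=None):
    """Build the live-detection URL from the documented query parameters."""
    params = {
        "observationHours": observation_hours,
        "referenceHours": reference_hours,
        "sigma": sigma,
    }
    if project_id is not None:
        params["projectId"] = project_id
    return f"{base_url}/api/v1/anomalies?{urlencode(params)}"


def unacknowledged(anomalies, confidence=("high",)):
    """Keep findings that are not acknowledged and match a confidence tier."""
    return [a for a in anomalies
            if a.get("acknowledgedAt") is None
            and a.get("confidence") in confidence]


# Trimmed copy of the sample response.
sample = json.loads("""[
  {"provider": "openai", "model": "gpt-4o", "kind": "latency",
   "currentValue": 8200, "baselineMean": 1100, "baselineStdDev": 180,
   "deviations": 39.4, "confidence": "high", "acknowledgedAt": null}
]""")

print(anomalies_url("https://spanlens.example.com", project_id="proj_xxx"))
for a in unacknowledged(sample):
    print(a["provider"], a["model"], a["kind"], a["deviations"])
```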

30-day history

The history view shows past daily snapshots, useful for spotting recurring patterns (“every Monday morning, latency spikes on gpt-4o”).

GET /api/v1/anomalies/history?days=30

# → same shape as the live response, without acknowledgedAt.
# Results cover the last N days, excluding today
# (today is shown in the live view above).

High-severity auto-notifications (≥5σ)

Anomalies that reach 5σ or more are automatically delivered to your configured notification channels (Slack, email, Discord) by the daily snapshot job; no alert rule is needed. Medium-severity anomalies (at least 3σ but below 5σ) are dashboard-only; use threshold-based alert rules for finer-grained routing.

Configure channels in Settings → Notifications.
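The severity split can be written as a small classifier. This is a sketch: the `severity` function and its parameter names are hypothetical, and the page does not say whether the 5σ notification cut-off is configurable.

```python
def severity(deviations, sigma_threshold=3.0, notify_threshold=5.0):
    """Classify a finding by its deviation count.

    "high" (>= 5 sigma) is auto-delivered to notification channels,
    "medium" (>= 3 sigma, below 5 sigma) is dashboard-only,
    and None means the bucket is not flagged at all.
    """
    if deviations >= notify_threshold:
        return "high"
    if deviations >= sigma_threshold:
        return "medium"
    return None


for d in (2.9, 3.0, 4.99, 5.0, 39.4):
    print(d, severity(d))
```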

Export

Download historical anomaly events as CSV or JSON for offline analysis:

GET /api/v1/exports/anomalies?format=csv&days=30

# format: csv (default) | json
# days: 1–365 (default 30)

Tuning

Query parameters let you adjust sensitivity:

| Param            | Default  | When to change |
|------------------|----------|----------------|
| observationHours | 1        | Bigger (6, 24) if you have low traffic; avoids small-sample noise |
| referenceHours   | 168 (7d) | Shorter if your workload changed recently and old data is unrepresentative |
| sigma            | 3        | Lower to 2 for more sensitive detection (more false positives); raise it for quieter detection |
| projectId        | (none)   | Scope detection to a single project instead of the whole org |
| minSamples       | 10       | Raise to 30 or 100 to suppress low/medium-confidence findings when wiring into paging or noisy channels |

Below minSamples the bucket is suppressed entirely (no row, no notification). The default 10 surfaces directional signal for new orgs in their first week; the dashboard tags each finding with a confidence badge so you can scan past low-confidence rows visually.

Design choices

  • Sample stddev (n−1 denominator). Bessel's correction, which makes the variance estimate unbiased for a finite sample.
  • No seasonal decomposition. A 7-day rolling baseline already captures weekly rhythm implicitly. More sophisticated models (STL, Prophet, LSTM) are overkill at current scale and harder to explain.
  • One-sided detection. Only “spike above baseline” triggers; drops in latency or cost are good news, not incidents.
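To see what the n−1 denominator changes, compare Python's two standard-deviation functions on a small hypothetical sample. The gap is largest at small sample counts, which is part of why low-sample baselines are treated with less confidence.

```python
from statistics import pstdev, stdev

samples = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

print(pstdev(samples))           # 2.0   (population: divides by n)
print(round(stdev(samples), 3))  # 2.138 (sample: divides by n - 1)
```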

Limitations

  • History is daily-snapshot, not real-time. New anomalies appear in the live view immediately but take up to 24 hours to land in the 30-day history log (cron runs at 01:00 UTC).
  • Sparse buckets are skipped. Any (provider, model, kind) combination with fewer than minSamples requests (default 10) in the reference window produces no signal, because there is not enough data for any baseline. Buckets between 10 and 29 samples surface with low confidence so you can decide whether to act.
  • No anomaly-level alert routing. You can't route “only latency anomalies for gpt-4o” to a specific channel. High-severity (≥5σ) goes to all active channels; for finer routing, create a threshold-based alert rule instead.

Related: Alerts (threshold + notification), Cost tracking, /anomalies dashboard.