Alerts

Define simple threshold rules on your LLM traffic. When a rule fires, Spanlens sends a notification to your chosen channel: email, Slack, or Discord. Evaluation runs on a 15-minute cron, honors cooldowns, and logs every delivery so you can audit what fired when.

Why it matters

You don't want to manually check the dashboard every morning to see if last night's deploy caused a cost explosion. You want a Slack message at 3am if something's wrong, and quiet otherwise. Alerts give you that with four rule types that cover 90% of what teams actually watch.

How it works

Four rule types

| Type | What it watches | Example rule |
| --- | --- | --- |
| budget | Total spend over a rolling window | “Alert if cost > $50 in the last 60 minutes” |
| error_rate | Fraction of non-2xx responses | “Alert if error rate > 5% in the last 30 minutes” |
| latency_p95 | 95th percentile response time | “Alert if p95 > 5000ms in the last 15 minutes” |
| eval_score | Mean of completed eval runs' average score over the window (0–1). A quality floor: fires when the score drops to or below the threshold, the opposite direction from the other rule types. | “Alert if avg eval score < 0.8 in the last 24 hours” |
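To make the first three rule types concrete, here is a minimal sketch of how each metric could be computed over a rolling window. The `RequestRow` shape and its field names are assumptions for illustration (this page does not show the `requests` table schema), and the percentile uses the nearest-rank method, which may differ from what the evaluator actually does.

```typescript
// Hypothetical row shape; the real `requests` table columns are not shown on this page.
interface RequestRow {
  timestamp: number;  // epoch milliseconds
  statusCode: number; // HTTP status of the LLM call
  latencyMs: number;
  costUsd: number;
}

type MetricType = "budget" | "error_rate" | "latency_p95";

// Keep only rows inside the rolling window (now - windowMinutes, now].
function inWindow(rows: RequestRow[], now: number, windowMinutes: number): RequestRow[] {
  const start = now - windowMinutes * 60_000;
  return rows.filter((r) => r.timestamp > start && r.timestamp <= now);
}

function computeMetric(
  type: MetricType,
  rows: RequestRow[],
  now: number,
  windowMinutes: number,
): number {
  const recent = inWindow(rows, now, windowMinutes);
  if (recent.length === 0) return 0; // assumption: an empty window reports 0

  switch (type) {
    case "budget":
      // Total spend in the window.
      return recent.reduce((sum, r) => sum + r.costUsd, 0);
    case "error_rate": {
      // Fraction of non-2xx responses.
      const errors = recent.filter((r) => r.statusCode < 200 || r.statusCode >= 300).length;
      return errors / recent.length;
    }
    case "latency_p95": {
      // Nearest-rank 95th percentile.
      const sorted = recent.map((r) => r.latencyMs).sort((a, b) => a - b);
      const rank = Math.ceil(0.95 * sorted.length);
      return sorted[rank - 1] ?? 0;
    }
  }
}
```

Note that `error_rate` comes back as a fraction here, so a "5%" rule would compare against 0.05 in this sketch; whether the real threshold field expects a fraction or a percentage is not stated on this page.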

Evaluation loop

[Diagram: Cron tick (every 60s) → Query metrics (ClickHouse) → Compare threshold (N consecutive?) → Fire (webhook + email) → cooldown (default 10 min)]
Alerts evaluate on a 1-minute tick. A rule that breaches its threshold for N consecutive evaluations fires once, then enters a cooldown to prevent storms.

GitHub Actions fires cron-evaluate-alerts every 15 minutes. For each active rule, the evaluator:

  1. Computes the metric over the rule's window (from the requests table)
  2. Compares against the threshold
  3. If triggered AND the rule is outside its cooldown_minutes from the last fire, sends notifications via lib/notifiers.ts
  4. Logs each channel delivery into alert_deliveries (success or error)
  5. Updates the rule's last_triggered_at
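The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual evaluator: the `AlertRule` and `Delivery` shapes are hypothetical, the metric value from step 1 is passed in already computed, and `send` is a stand-in for whatever lib/notifiers.ts exposes. Retrying failed deliveries on a later tick is not modeled.

```typescript
type RuleType = "budget" | "error_rate" | "latency_p95" | "eval_score";

// Hypothetical in-memory shapes, loosely mirroring the API fields on this page.
interface AlertRule {
  name: string;
  type: RuleType;
  threshold: number;
  cooldownMinutes: number;
  channelIds: string[];
  lastTriggeredAt: number | null; // epoch ms; null if the rule has never fired
}

interface Delivery {
  channelId: string;
  status: "success" | "error";
  error?: string;
}

// Step 2: eval_score is a quality floor (fires at or below the threshold);
// the other types fire when the value rises above the threshold.
function isBreached(rule: AlertRule, value: number): boolean {
  return rule.type === "eval_score" ? value <= rule.threshold : value > rule.threshold;
}

// Step 3 guard. Assumption: the cooldown ends once exactly cooldownMinutes have elapsed.
function outsideCooldown(rule: AlertRule, now: number): boolean {
  if (rule.lastTriggeredAt === null) return true;
  return now - rule.lastTriggeredAt >= rule.cooldownMinutes * 60_000;
}

// `send` is a placeholder notifier that throws when a delivery fails.
function evaluateRule(
  rule: AlertRule,
  value: number,
  now: number,
  send: (channelId: string, rule: AlertRule, value: number) => void,
): { rule: AlertRule; deliveries: Delivery[] } {
  if (!isBreached(rule, value) || !outsideCooldown(rule, now)) {
    return { rule, deliveries: [] };
  }
  // Steps 3 and 4: notify every channel and record the outcome of each attempt.
  const deliveries = rule.channelIds.map((channelId): Delivery => {
    try {
      send(channelId, rule, value);
      return { channelId, status: "success" };
    } catch (err) {
      return { channelId, status: "error", error: String(err) };
    }
  });
  // Step 5: remember when the rule last fired.
  return { rule: { ...rule, lastTriggeredAt: now }, deliveries };
}
```

Returning an updated copy of the rule instead of mutating it keeps the function easy to test; a real implementation would persist `lastTriggeredAt` and the delivery rows instead.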

Cooldowns prevent alert storms. If you set cooldown_minutes: 60, a sustained error condition fires once, stays quiet for an hour, then fires again if still above threshold. Tune it to your noise tolerance.
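To see how a cooldown interacts with the 15-minute evaluation interval, the following small simulation lists the minutes at which a rule would fire during a sustained breach. It assumes the rule fires again as soon as at least `cooldownMinutes` have passed since the last fire; the function name and that boundary condition are illustrative, not taken from the evaluator.

```typescript
// Minutes (from the start of a sustained breach) at which the rule would fire.
function fireTimes(tickMinutes: number, cooldownMinutes: number, totalMinutes: number): number[] {
  const fired: number[] = [];
  let last: number | null = null;
  for (let t = 0; t <= totalMinutes; t += tickMinutes) {
    // Fire on the first breach, then only once the cooldown has elapsed.
    if (last === null || t - last >= cooldownMinutes) {
      fired.push(t);
      last = t;
    }
  }
  return fired;
}

// A 60-minute cooldown on a 15-minute tick: one notification per hour.
console.log(fireTimes(15, 60, 180));
// A 10-minute cooldown is shorter than the tick, so every tick fires.
console.log(fireTimes(15, 10, 45));
```

Under these assumptions a cooldown is effectively rounded up to the next tick, so values below 15 minutes behave the same as no cooldown at all.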

Supported channels

| Channel | How it sends | Required config |
| --- | --- | --- |
| Email | Resend API | RESEND_API_KEY env + recipient email |
| Slack | Incoming webhook | Webhook URL (channel-level or workspace-level) |
| Discord | Webhook | Webhook URL |

Each channel renders a sensible default message: alert name, threshold, current value, window size, and (if set) a dashboard link.
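As a rough illustration of what such a default message might look like, the sketch below renders the listed fields into plain text and wraps it for the two webhook channels. The exact wording and the `AlertMessage` shape are assumptions; the only parts taken from the platforms themselves are that Slack incoming webhooks accept a JSON body with a `text` field and Discord webhooks accept one with a `content` field.

```typescript
// Hypothetical message model built from the fields named above.
interface AlertMessage {
  name: string;
  threshold: number;
  currentValue: number;
  windowMinutes: number;
  dashboardUrl?: string; // optional dashboard link
}

function renderText(m: AlertMessage): string {
  const lines = [
    `Alert: ${m.name}`,
    `Threshold: ${m.threshold}`,
    `Current value: ${m.currentValue}`,
    `Window: last ${m.windowMinutes} minutes`,
  ];
  if (m.dashboardUrl) lines.push(`Dashboard: ${m.dashboardUrl}`);
  return lines.join("\n");
}

// Slack incoming webhooks take { text }, Discord webhooks take { content }.
function toSlackPayload(m: AlertMessage): { text: string } {
  return { text: renderText(m) };
}

function toDiscordPayload(m: AlertMessage): { content: string } {
  return { content: renderText(m) };
}
```

Keeping the rendering separate from the per-channel wrapper means one message body can be reused across channels, with only the envelope differing.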

Using it

1. Add a notification channel

In /alerts, create a channel first. Channels are stored per-org and can be reused across multiple rules.

POST /api/v1/notification-channels
Content-Type: application/json

{
  "name": "#ops-alerts",
  "type": "slack",
  "config": {
    "webhookUrl": "https://hooks.slack.com/services/..."
  }
}

2. Create an alert rule

POST /api/v1/alerts
Content-Type: application/json

{
  "name": "Cost spike guard",
  "type": "budget",
  "threshold": 50,              // $50
  "windowMinutes": 60,
  "cooldownMinutes": 60,
  "channelIds": ["<channel-uuid>"]
}
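If you prefer to script steps 1 and 2, the request bodies can be built in TypeScript along these lines. This is a sketch: `BASE_URL` is a placeholder, authentication is omitted because this page does not describe it, and the assumption that the channel-creation response contains an `id` you can pass as `channelId` is not confirmed here.

```typescript
// Placeholder origin; substitute your own Spanlens deployment.
const BASE_URL = "https://spanlens.example.com";

interface JsonPost {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

// Build a JSON POST request description for the given API path.
function jsonPost(path: string, body: unknown): JsonPost {
  return {
    url: `${BASE_URL}${path}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Step 1: a Slack notification channel.
function createChannelRequest(name: string, webhookUrl: string): JsonPost {
  return jsonPost("/api/v1/notification-channels", {
    name,
    type: "slack",
    config: { webhookUrl },
  });
}

// Step 2: a budget rule that alerts when spend exceeds $50 in 60 minutes.
function createRuleRequest(channelId: string): JsonPost {
  return jsonPost("/api/v1/alerts", {
    name: "Cost spike guard",
    type: "budget",
    threshold: 50, // dollars
    windowMinutes: 60,
    cooldownMinutes: 60,
    channelIds: [channelId],
  });
}
```

Each result can be passed to `fetch(req.url, req.init)` in any runtime that provides `fetch`. Building the body with `JSON.stringify` also guarantees valid JSON, which matters because JSON itself does not allow inline comments.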

3. Verify it

The dashboard shows each rule's last_triggered_at and its recent deliveries. You can also manually trigger evaluation via POST /api/v1/alerts/evaluate to confirm wiring before the next cron tick.

Architectural notes

  • Delivery is at-least-once. If Resend/Slack/Discord returns an error, we log it and retry on the next cron. Exactly-once semantics would require per-channel idempotency keys, which is not worth the complexity for ops alerts.
  • Cron runs on GitHub Actions, not Vercel Cron. Why: easier to audit, cheaper on Hobby/Pro plans, and decoupled from Vercel function timeouts.
  • Rule evaluation is stateless. Each cron tick recomputes from the requests table. No separate aggregation store; Postgres handles the aggregations in a single query.

Limitations

  • No PagerDuty / OpsGenie integration yet. Slack webhooks can be piped through those services if you need escalation, but we don't natively integrate.
  • Fixed metric set. Only budget / error_rate / latency_p95 / eval_score today. Custom SQL or anomaly-based rules are roadmap items. Note that eval_score is org-level: eval_runs has no project_id, so a project filter on an eval_score rule is ignored.
  • Quota-overage warning emails run on a separate cron (hourly). Org owners get automatic emails at 80% and 100% of the monthly request quota, no setup required. Content is context-aware: at 100% with overage billing enabled, the email tells the user that overage charges are now active (not that their requests are being rejected). Toggle in /settings.

Related: Anomalies (unsupervised), /alerts dashboard. Cron: .github/workflows/cron-evaluate-alerts.yml.