Alerts

Define simple threshold rules on your LLM traffic. When a rule fires, Spanlens sends a notification to your chosen channel: email, Slack, or Discord. Evaluation runs on a 15-minute cron, honors cooldowns, and logs every delivery so you can audit what fired and when.

Why it matters

You don't want to manually check the dashboard every morning to see if last night's deploy caused a cost explosion. You want a Slack message at 3am if something's wrong, and quiet otherwise. Alerts give you that with three common rule types that cover 90% of what teams actually watch.

How it works

Three rule types

Type          What it watches                      Example rule
budget        Total spend over a rolling window    “Alert if cost > $50 in the last 60 minutes”
error_rate    Fraction of non-2xx responses        “Alert if error rate > 5% in the last 30 minutes”
latency_p95   95th percentile response time        “Alert if p95 > 5000ms in the last 15 minutes”
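
For reference, the rule shape as a TypeScript type. This is an illustrative sketch, not the actual internal model; the field names mirror the POST /api/v1/alerts payload shown later on this page.

// Illustrative rule shape; field names follow the create-rule API payload below.
interface AlertRule {
  id: string;
  name: string;
  type: 'budget' | 'error_rate' | 'latency_p95';
  threshold: number;          // $ for budget, % for error_rate, ms for latency_p95
  windowMinutes: number;      // rolling window the metric is computed over
  cooldownMinutes: number;    // minimum quiet period between fires
  channelIds: string[];       // notification channels to deliver to
  lastTriggeredAt: Date | null;
}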

Evaluation loop

GitHub Actions fires cron-evaluate-alerts every 15 minutes. For each active rule, the evaluator:

  1. Computes the metric over the rule's window (from the requests table)
  2. Compares against the threshold
  3. If the threshold is breached and the rule is outside its cooldown_minutes since the last fire, sends notifications via lib/notifiers.ts
  4. Logs each channel delivery to alert_deliveries (success or error)
  5. Updates the rule's last_triggered_at

Cooldowns prevent alert storms. If you set cooldown_minutes: 60, a sustained error condition fires once, stays quiet for an hour, then fires again if still above threshold. Tune it to your noise tolerance.
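
Put together, one evaluation tick looks roughly like this. A minimal sketch assuming hypothetical helpers computeMetric, sendNotifications, and markTriggered, plus the AlertRule shape sketched above; the real evaluator lives behind the cron workflow and lib/notifiers.ts.

// Sketch of one cron tick for a single rule. Helper names are hypothetical.
async function evaluateRule(rule: AlertRule, now = new Date()): Promise<void> {
  // 1. Compute the metric over the rule's window (reads the requests table)
  const value = await computeMetric(rule.type, rule.windowMinutes);

  // 2. Compare against the threshold
  if (value <= rule.threshold) return;

  // 3. Honor the cooldown: skip if the rule fired within the last cooldownMinutes
  const inCooldown =
    rule.lastTriggeredAt !== null &&
    now.getTime() - rule.lastTriggeredAt.getTime() < rule.cooldownMinutes * 60_000;
  if (inCooldown) return;

  // 4. Deliver to every channel and log each attempt to alert_deliveries
  await sendNotifications(rule, value);

  // 5. Record the fire so the next tick can enforce the cooldown
  await markTriggered(rule.id, now);
}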

Supported channels

Channel   How it sends       Required config
Email     Resend API         RESEND_API_KEY env + recipient email
Slack     Incoming webhook   Webhook URL (channel-level or workspace-level)
Discord   Webhook            Webhook URL

Each channel renders a sensible default message: alert name, threshold, current value, window size, and (if set) a dashboard link.
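
For the Slack path, delivery boils down to a single webhook POST. A simplified sketch; the actual renderer lives in lib/notifiers.ts and the message layout here is only indicative.

// Simplified Slack delivery: an incoming webhook accepts a JSON body with a "text" field.
async function notifySlack(
  webhookUrl: string,
  alert: { name: string; threshold: number; value: number; windowMinutes: number; dashboardUrl?: string },
): Promise<void> {
  const lines = [
    `Alert: ${alert.name}`,
    `Current value ${alert.value} exceeds threshold ${alert.threshold} over the last ${alert.windowMinutes} minutes.`,
  ];
  if (alert.dashboardUrl) lines.push(`Dashboard: ${alert.dashboardUrl}`);

  const res = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: lines.join('\n') }),
  });
  if (!res.ok) throw new Error(`Slack webhook responded with ${res.status}`);
}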

Using it

1. Add a notification channel

In /alerts, create a channel first. Channels are stored per-org and can be reused across multiple rules.

POST /api/v1/notification-channels
Content-Type: application/json

{
  "name": "#ops-alerts",
  "type": "slack",
  "config": {
    "webhookUrl": "https://hooks.slack.com/services/..."
  }
}
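
The same call from code, as a sketch: auth headers are omitted, and the assumption that the response carries the new channel's id should be checked against the actual API response.

// Sketch: create the Slack channel record and keep its id for step 2.
// Auth headers omitted; the `id` field on the response is an assumption.
const channelRes = await fetch('https://<your-spanlens-host>/api/v1/notification-channels', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: '#ops-alerts',
    type: 'slack',
    config: { webhookUrl: 'https://hooks.slack.com/services/...' },
  }),
});
const channel = await channelRes.json();
// channel.id is reused as channelIds[0] when creating the rule below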

2. Create an alert rule

POST /api/v1/alerts
Content-Type: application/json

{
  "name": "Cost spike guard",
  "type": "budget",
  "threshold": 50,              // $50
  "windowMinutes": 60,
  "cooldownMinutes": 60,
  "channelIds": ["<channel-uuid>"]
}
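
And the matching call for the rule, reusing the channel id from step 1 (again a sketch with auth omitted):

// Sketch: create the budget rule and point it at the channel created above.
await fetch('https://<your-spanlens-host>/api/v1/alerts', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'Cost spike guard',
    type: 'budget',
    threshold: 50,        // $50
    windowMinutes: 60,
    cooldownMinutes: 60,
    channelIds: [channel.id],
  }),
});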

3. Verify it

The dashboard shows each rule's last_triggered_at and its recent deliveries. You can also trigger evaluation manually via POST /api/v1/alerts/evaluate to confirm the wiring before the next cron tick, as in the sketch below.
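
A sketch of forcing a tick from a script or REPL (auth omitted):

// Sketch: run an evaluation pass immediately instead of waiting for the 15-minute cron.
const res = await fetch('https://<your-spanlens-host>/api/v1/alerts/evaluate', { method: 'POST' });
console.log(res.status); // expect a 2xx once rules and channels are wired up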

Architectural notes

  • Delivery is at-least-once. If Resend/Slack/Discord returns an error, we log it and retry on the next cron. Exactly-once semantics would require per-channel idempotency keys, which isn't worth the complexity for ops alerts.
  • Cron runs on GitHub Actions, not Vercel Cron. Why: easier to audit, cheaper on Hobby/Pro plans, and decoupled from Vercel function timeouts.
  • Rule evaluation is stateless. Each cron tick recomputes from the requests table. No separate aggregation store; Postgres handles the aggregations in a single query.
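
For example, the error-rate metric can be computed in a single pass over requests. A sketch using a generic tagged-template Postgres client; the column names (status, created_at) are assumptions about the schema.

// Sketch: error rate over the last `windowMinutes`, aggregated entirely in Postgres.
// `sql` is any tagged-template Postgres client; the result shape depends on the library.
const rows = await sql`
  SELECT COALESCE(
           AVG(CASE WHEN status BETWEEN 200 AND 299 THEN 0.0 ELSE 1.0 END),
           0
         ) AS error_rate
  FROM requests
  WHERE created_at >= now() - ${windowMinutes} * interval '1 minute'
`;
// rows[0].error_rate is the fraction of non-2xx requests in the window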

Limitations

  • No PagerDuty / OpsGenie integration yet. Slack webhooks can be piped through those services if you need escalation — but we don't natively integrate.
  • Fixed metric set. Only budget / error_rate / latency_p95 today. Custom SQL or anomaly-based rules are roadmap items.
  • Quota-overage warning emails run on a separate cron (hourly). Org owners get automatic emails at 80% and 100% of the monthly request quota — no setup required. Content is context-aware: at 100% with overage billing enabled, the email tells the user that overage charges are now active (not that their requests are being rejected). Toggle in /settings.

Related: Anomalies (unsupervised), /alerts dashboard. Cron: .github/workflows/cron-evaluate-alerts.yml.