# Alerts
Define simple threshold rules on your LLM traffic. When a rule fires, Spanlens sends a notification to your chosen channel — email, Slack, or Discord. Runs on a 15-minute cron, honors cooldowns, and logs every delivery so you can audit what fired when.
## Why it matters
You don't want to manually check the dashboard every morning to see if last night's deploy caused a cost explosion. You want a Slack message at 3am if something's wrong, and quiet otherwise. Alerts give you that with three common rule types that cover 90% of what teams actually watch.
## How it works

### Three rule types
| Type | What it watches | Example rule |
|---|---|---|
| `budget` | Total spend over a rolling window | “Alert if cost > $50 in the last 60 minutes” |
| `error_rate` | Fraction of non-2xx responses | “Alert if error rate > 5% in the last 30 minutes” |
| `latency_p95` | 95th percentile response time | “Alert if p95 > 5000ms in the last 15 minutes” |
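Each rule type reduces to a small aggregation over the request rows in the window. A TypeScript sketch of the three metrics — the `RequestRow` shape and the nearest-rank p95 definition are illustrative assumptions, not the actual Spanlens schema:

```typescript
// Illustrative row shape; field names are assumptions, not the real schema.
interface RequestRow {
  statusCode: number;
  costUsd: number;
  latencyMs: number;
}

// budget: total spend over the window, in dollars.
function budget(rows: RequestRow[]): number {
  return rows.reduce((sum, r) => sum + r.costUsd, 0);
}

// error_rate: fraction of non-2xx responses (0 for an empty window).
function errorRate(rows: RequestRow[]): number {
  if (rows.length === 0) return 0;
  const errors = rows.filter((r) => r.statusCode < 200 || r.statusCode >= 300);
  return errors.length / rows.length;
}

// latency_p95: nearest-rank 95th percentile of response times.
function latencyP95(rows: RequestRow[]): number {
  if (rows.length === 0) return 0;
  const sorted = rows.map((r) => r.latencyMs).sort((a, b) => a - b);
  return sorted[Math.ceil(0.95 * sorted.length) - 1];
}
```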
### Evaluation loop

GitHub Actions fires `cron-evaluate-alerts` every 15 minutes. For each active rule, the evaluator:

- Computes the metric over the rule's window (from the `requests` table)
- Compares it against the threshold
- If triggered AND the rule is outside `cooldown_minutes` from its last fire, sends notifications via `lib/notifiers.ts`
- Logs each channel delivery into `alert_deliveries` (success or error)
- Updates the rule's `last_triggered_at`
Cooldowns prevent alert storms. If you set `cooldown_minutes: 60`, a sustained error condition fires once, stays quiet for an hour, then fires again if still above threshold. Tune it to your noise tolerance.
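The cooldown check itself is a one-liner. A minimal sketch in TypeScript — the rule shape here is an assumption, not Spanlens's actual model:

```typescript
// Hypothetical rule shape for illustration.
interface AlertRule {
  cooldownMinutes: number;
  lastTriggeredAt: Date | null; // null = never fired
}

// A triggered rule may notify only if it has never fired before,
// or its cooldown window has fully elapsed.
function isOutsideCooldown(rule: AlertRule, now: Date): boolean {
  if (rule.lastTriggeredAt === null) return true;
  const elapsedMs = now.getTime() - rule.lastTriggeredAt.getTime();
  return elapsedMs >= rule.cooldownMinutes * 60_000;
}
```

Note that with a 15-minute cron and a 60-minute cooldown, a sustained condition fires at most once per hour regardless of how many ticks observe it.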
### Supported channels
| Channel | How it sends | Required config |
|---|---|---|
| Email | Resend API | `RESEND_API_KEY` env + recipient email |
| Slack | Incoming webhook | Webhook URL (channel-level or workspace-level) |
| Discord | Webhook | Webhook URL |
Each channel renders a sensible default message: alert name, threshold, current value, window size, and (if set) a dashboard link.
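The default message is easy to picture as a formatter plus a webhook POST. A sketch, assuming the payload keys Slack (`text`) and Discord (`content`) expect; the `AlertEvent` shape and message layout are hypothetical, not the exact Spanlens rendering:

```typescript
// Hypothetical event shape carrying what the docs say a message includes.
interface AlertEvent {
  name: string;
  threshold: number;
  currentValue: number;
  windowMinutes: number;
  dashboardUrl?: string; // included only if set
}

function renderMessage(e: AlertEvent): string {
  const lines = [
    `Alert: ${e.name}`,
    `Threshold: ${e.threshold} | Current: ${e.currentValue} | Window: ${e.windowMinutes}m`,
  ];
  if (e.dashboardUrl) lines.push(e.dashboardUrl);
  return lines.join("\n");
}

// Slack and Discord both accept a JSON POST to a webhook URL;
// only the payload key differs ("text" vs "content").
async function sendWebhook(
  url: string,
  channelType: "slack" | "discord",
  e: AlertEvent,
): Promise<boolean> {
  const payload =
    channelType === "slack" ? { text: renderMessage(e) } : { content: renderMessage(e) };
  const res = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(payload),
  });
  return res.ok;
}
```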
## Using it

### 1. Add a notification channel

In `/alerts`, create a channel first. Channels are stored per-org and can be reused across multiple rules.
```http
POST /api/v1/notification-channels
Content-Type: application/json

{
  "name": "#ops-alerts",
  "type": "slack",
  "config": {
    "webhookUrl": "https://hooks.slack.com/services/..."
  }
}
```

### 2. Create an alert rule
```http
POST /api/v1/alerts
Content-Type: application/json

{
  "name": "Cost spike guard",
  "type": "budget",
  "threshold": 50,
  "windowMinutes": 60,
  "cooldownMinutes": 60,
  "channelIds": ["<channel-uuid>"]
}
```

For a `budget` rule, `threshold` is in dollars — here, $50.

### 3. Verify it
The dashboard shows each rule's `last_triggered_at` plus recent deliveries. You can also manually trigger evaluation via `POST /api/v1/alerts/evaluate` to confirm the wiring before the next cron tick.
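A small helper makes the manual trigger copy-pasteable. The endpoint is from the docs above; the Bearer-token auth header is an assumption about how the API authenticates:

```typescript
// Build the manual-evaluation request. Auth scheme is assumed, not confirmed.
function buildEvaluateRequest(
  baseUrl: string,
  apiKey: string,
): { url: string; init: { method: string; headers: Record<string, string> } } {
  return {
    url: new URL("/api/v1/alerts/evaluate", baseUrl).toString(),
    init: {
      method: "POST",
      headers: { Authorization: `Bearer ${apiKey}` },
    },
  };
}

// Usage: const { url, init } = buildEvaluateRequest("https://your-deployment", key);
//        await fetch(url, init);
```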
## Architectural notes
- Delivery is at-least-once. If Resend/Slack/Discord returns an error, we log it and retry on the next cron. At-most-once semantics would require per-channel idempotency keys — not worth the complexity for ops alerts.
- Cron runs on GitHub Actions, not Vercel Cron. Why: easier to audit, cheaper on Hobby/Pro plans, and decoupled from Vercel function timeouts.
- Rule evaluation is stateless. Each cron tick recomputes from the `requests` table. No separate aggregation store; Postgres handles the aggregations in a single query.
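Concretely, each rule type can map to one aggregate query. A sketch of that mapping — the table and column names are assumptions about the schema, not the actual Spanlens queries:

```typescript
// Illustrative single-pass aggregations per rule type ($1 = window in minutes).
// Column names (cost_usd, status_code, latency_ms) are assumed for illustration.
const METRIC_SQL: Record<"budget" | "error_rate" | "latency_p95", string> = {
  budget: `
    SELECT COALESCE(SUM(cost_usd), 0) AS value
    FROM requests
    WHERE created_at >= now() - make_interval(mins => $1)`,
  error_rate: `
    SELECT COALESCE(AVG((status_code < 200 OR status_code >= 300)::int), 0) AS value
    FROM requests
    WHERE created_at >= now() - make_interval(mins => $1)`,
  latency_p95: `
    SELECT COALESCE(percentile_cont(0.95) WITHIN GROUP (ORDER BY latency_ms), 0) AS value
    FROM requests
    WHERE created_at >= now() - make_interval(mins => $1)`,
};
```

Postgres's `percentile_cont` ordered-set aggregate is what makes the p95 rule a one-query computation with no rollup tables.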
## Limitations
- No PagerDuty / OpsGenie integration yet. Slack webhooks can be piped through those services if you need escalation — but we don't natively integrate.
- Fixed metric set. Only `budget` / `error_rate` / `latency_p95` today. Custom SQL or anomaly-based rules are roadmap items.
- Quota-overage warning emails run on a separate cron (hourly). Org owners get automatic emails at 80% and 100% of the monthly request quota — no setup required. Content is context-aware: at 100% with overage billing enabled, the email tells the user that overage charges are now active (not that their requests are being rejected). Toggle in `/settings`.
Related: Anomalies (unsupervised), the `/alerts` dashboard. Cron: `.github/workflows/cron-evaluate-alerts.yml`.