Disaster recovery

A runbook for the person on call. Each failure mode below lists what data is at risk, what protects it automatically, and the steps to recover. Pair this with Reliability (how the system degrades) and Backup and restore (the restore commands).

Recovery objectives

Spanlens is designed so a dependency outage never fails your end users' LLM calls: the proxy returns the provider response before any logging happens. The risk in an outage is observability data (request logs, traces, usage), not your application traffic. The targets below are what the queues and backups are sized for.

Data                    | Store             | Protection                                      | Recovery point
Request logs            | ClickHouse        | requests_fallback queue in Supabase (7 day TTL) | 0 while the queue holds
Trace events            | ClickHouse        | events_fallback queue in Supabase (7 day TTL)   | 0 while the queue holds
Accounts, keys, billing | Supabase Postgres | Managed daily backups + PITR                    | Provider backup cadence
Outbound webhooks       | Supabase Postgres | 5 retries with backoff, then dead-lettered      | At-least-once while the endpoint is up

ClickHouse is down or paused

This is the most common incident on the ClickHouse Cloud Development tier, which auto-pauses when idle. The proxy keeps serving traffic. Log inserts that fail are written to the requests_fallback / events_fallback tables in Supabase instead of being lost.

Automatic recovery: the /cron/replay-fallback job drains those queues back into ClickHouse every 5 minutes, in batches, skipping any row already present (idempotent). Once ClickHouse is reachable the backlog clears on its own.

Manual steps if the backlog is not draining:

  1. Confirm ClickHouse is reachable and un-paused (ClickHouse Cloud console, or GET /health/ready which pings it).
  2. Check the queue depth in GET /health/deep under fallback.queue and fallback.eventsQueue. A rising number means the replay cron is not firing (see Scheduled jobs stop firing below).
  3. Trigger a drain by hand:
    curl -X GET https://server.spanlens.io/cron/replay-fallback \
      -H "Authorization: Bearer $CRON_SECRET"
  4. If the outage exceeds the 7 day queue TTL, rows older than the TTL are expired to bound Supabase storage. During a ClickHouse-only outage, that is the only window in which request-log data is permanently lost. To remove auto-pause incidents entirely, move ClickHouse off the Development tier.

When the queue exceeds 1000 rows an internal_alerts row (kind fallback_queue_high) is raised and shown at /admin/alerts.
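The queue check in step 2 can be scripted for quick triage. The following is a minimal sketch, not the project's own tooling. It assumes jq is installed, that GET /health/deep returns JSON with numeric fallback.queue and fallback.eventsQueue fields, and that the endpoint needs no auth header. The function names are hypothetical.

```shell
# Sketch only: triage fallback queue depth from a /health/deep response.
# Assumptions: jq is installed; the JSON body has numeric fallback.queue
# and fallback.eventsQueue fields; the endpoint needs no auth header.

ALERT_THRESHOLD=1000  # rows; the fallback_queue_high alert fires above this

# Read a /health/deep JSON body on stdin, print "<requests> <events>".
queue_depths() {
  jq -r '"\(.fallback.queue // 0) \(.fallback.eventsQueue // 0)"'
}

# classify_depth DEPTH -> prints ok, backlog, or high
classify_depth() {
  if [ "$1" -eq 0 ]; then
    echo ok
  elif [ "$1" -gt "$ALERT_THRESHOLD" ]; then
    echo high
  else
    echo backlog
  fi
}

# Example (not run here):
#   curl -s https://server.spanlens.io/health/deep | queue_depths
#   classify_depth 1500
```

A backlog result that keeps growing between checks is the signal to trigger a drain by hand; high means the alert row should already be visible at /admin/alerts.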

Supabase is down

Supabase holds accounts, API keys, provider keys, billing, and the fallback queues themselves. While it is down:

  • The dashboard and REST API are unavailable. Proxy auth uses a short in-memory cache, so keys that are already cached keep working briefly, but new key lookups fail closed.
  • The fallback queues cannot absorb ClickHouse failures, because they live in Supabase. A simultaneous ClickHouse + Supabase outage is the one case where new request logs can be lost (see below).

Recovery:

  1. Restore Supabase from the managed backup or point-in-time recovery. See Restore Postgres.
  2. Because migrations are additive and the deploy pipeline runs migrate before deploy, the server code tolerates a schema that is briefly behind. Verify the schema version after restore and re-run supabase db push --linked if needed.
  3. After restore, watch /health/deep for the fallback queues to begin draining again.

ClickHouse and Supabase both down

This is the only total-loss window for new request logs: there is nowhere to queue a failed insert. Your application traffic is unaffected because the proxy still returns provider responses. The mitigation is to keep the two on independent providers (they already are) so a correlated outage is unlikely, and to run managed backups on both. There is no in-app queue that survives losing both stores at once; do not design new write paths that assume one is always available.

Scheduled jobs stop firing

Vercel's cron scheduler is known to silently drop short-interval jobs; fire rates as low as a few percent have been seen for */5 schedules. If the replay, self-monitor, or pending-deletion crons stop, backlogs build up with no error.

Detection: query how often each job actually ran in the last day.

SELECT job_name, count(*) AS runs, max(ran_at) AS last_run
FROM cron_job_runs
WHERE ran_at > now() - interval '24 hours'
GROUP BY job_name
ORDER BY runs;

Compare the run counts to the schedule in apps/server/vercel.json. A job that is defined but missing from this list, or running far below its schedule, is being dropped.
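To make "far below its schedule" concrete, the comparison can be reduced to a fire rate. This is an illustrative sketch: the interval is read by hand from apps/server/vercel.json, the function names are hypothetical, and the 50 percent cut-off is an assumption rather than a documented threshold.

```shell
# Sketch only: compare a job's observed runs in 24 hours with its schedule.
# The 50 percent default cut-off is an illustrative assumption.

# expected_runs INTERVAL_MINUTES -> runs expected in 24 hours
expected_runs() {
  echo $(( 24 * 60 / $1 ))
}

# fire_rate RUNS INTERVAL_MINUTES -> integer percent of scheduled runs that fired
fire_rate() {
  echo $(( $1 * 100 / $(expected_runs "$2") ))
}

# is_dropped RUNS INTERVAL_MINUTES [MIN_PERCENT] -> exit 0 if below the cut-off
is_dropped() {
  [ "$(fire_rate "$1" "$2")" -lt "${3:-50}" ]
}
```

For example, a */5 job is expected to run 288 times a day, so a runs value of 12 from the query above is a fire rate of about 4 percent and is_dropped 12 5 succeeds.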

Mitigation (defense in depth):

  • GitHub Actions re-fires the critical routes on a schedule (.github/workflows/cron-server.yml). GitHub also throttles short intervals, so this is a partial backstop, not a full replacement.
  • An external heartbeat monitor is the reliable fix. Register a monitor (for example Better Stack) that calls the critical endpoints on a fixed interval with the Authorization: Bearer $CRON_SECRET header. Because it runs outside Vercel and GitHub, it is unaffected by their scheduler gaps and fires at close to 100%. Cover at least /cron/replay-fallback (3 min) and /cron/self-monitor (30 min).

Keep CRON_SECRET synchronized across the three schedulers (Vercel env, GitHub Actions secret, and the external monitor header) whenever it is rotated.
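Before registering a hosted monitor, the same heartbeat can be tried from any machine outside Vercel and GitHub. This is a rough sketch of a single ping, not a replacement for a monitoring service. BASE_URL, the 30 second timeout, the function names, and the expectation that a successful cron route answers with a 2xx status are all assumptions.

```shell
# Sketch only: one heartbeat ping as an external monitor might send it.
# Assumptions: BASE_URL, the 30 second timeout, and a 2xx status on
# success. CRON_SECRET must be set in the environment before calling.

BASE_URL="${BASE_URL:-https://server.spanlens.io}"

# ping_cron PATH -> prints the HTTP status code (000 if the request failed)
ping_cron() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 30 \
    -H "Authorization: Bearer $CRON_SECRET" "$BASE_URL$1"
}

# status_ok CODE -> exit 0 for any 2xx status
status_ok() {
  case "$1" in
    2??) return 0 ;;
    *)   return 1 ;;
  esac
}

# Example (not run here):
#   for path in /cron/replay-fallback /cron/self-monitor; do
#     code=$(ping_cron "$path")
#     status_ok "$code" || echo "heartbeat failed: $path returned $code" >&2
#   done
```

An unexpected 401 or 403 from one scheduler but not the others is a hint that CRON_SECRET was rotated in one place only.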

A background migration is stuck

Large backfills run as chunked background migrations with a Postgres advisory lock and a heartbeat, driven by /cron/run-background-migrations. If that cron stops firing (see above) the queue stalls with no error.

  1. Check the queue:
    SELECT name, status, progress_current, progress_total, last_heartbeat_at
    FROM background_migrations
    WHERE status IN ('pending', 'running')
    ORDER BY created_at;
  2. A row stuck in running with a stale last_heartbeat_at (older than a few minutes) means the worker died mid-chunk. The next cron tick reclaims the lock and resumes from where it left off, so the usual fix is simply to make the cron fire again.
  3. Trigger one run by hand to resume:
    curl -X GET https://server.spanlens.io/cron/run-background-migrations \
      -H "Authorization: Bearer $CRON_SECRET"

Webhook deliveries are dead-lettering

Outbound webhooks retry 5 times with exponential backoff. A delivery that exhausts its retries, or whose endpoint was deleted, is dead-lettered: marked with dlq_at and a dlq_reason instead of retrying forever. A dead-letter count that climbs means a customer endpoint has been down long enough to burn through every retry.

  1. Watch webhooks.dlq_count in GET /health/deep. When it crosses the threshold an internal_alerts row (kind webhook_backlog) is raised at /admin/alerts.
  2. Inspect what is dead-lettered and why:
    -- Count dead-lettered deliveries per webhook and reason, largest first.
    SELECT webhook_id, dlq_reason, count(*) AS deliveries
    FROM webhook_deliveries
    WHERE dlq_at IS NOT NULL
    GROUP BY webhook_id, dlq_reason
    ORDER BY deliveries DESC;
  3. A dlq_reason of exhausted means the endpoint returned errors or timed out for the full retry window (contact the customer). webhook_deleted and payload_missing are terminal and need no action.
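When talking to a customer about an exhausted delivery, it helps to know roughly how long the endpoint had to be failing. The sketch below computes a retry window under exponential backoff. BASE_SECONDS and FACTOR are hypothetical values chosen for illustration; this document states only that there are 5 retries with exponential backoff, so substitute the real settings before relying on the numbers.

```shell
# Sketch only: delay before each retry under exponential backoff.
# BASE_SECONDS and FACTOR are hypothetical; only MAX_RETRIES=5 comes
# from this document.

BASE_SECONDS=60
FACTOR=2
MAX_RETRIES=5

# retry_delay N -> seconds to wait before retry N (1-based)
retry_delay() {
  delay=$BASE_SECONDS
  i=1
  while [ "$i" -lt "$1" ]; do
    delay=$(( delay * FACTOR ))
    i=$(( i + 1 ))
  done
  echo "$delay"
}

# retry_window -> total seconds of waiting from first failure to final retry
retry_window() {
  total=0
  n=1
  while [ "$n" -le "$MAX_RETRIES" ]; do
    total=$(( total + $(retry_delay "$n") ))
    n=$(( n + 1 ))
  done
  echo "$total"
}
```

With these assumed values the delays are 60, 120, 240, 480, and 960 seconds, so an endpoint would need to fail for about 31 minutes before a delivery is dead-lettered as exhausted.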

Restore drills

Backups are only real if a restore has been tested. On a schedule (quarterly is a reasonable default), restore the latest Supabase backup and a ClickHouse backup into a throwaway environment and confirm the dashboard renders, using the exact commands in Backup and restore. Record how long the restore took; that is your real recovery time, not an estimate.