Reliability

The Spanlens proxy sits in the critical path of your LLM calls. This page covers what we guarantee, what degrades when, and how to detect each failure mode from your side without waiting for our status page.

What the proxy is on the critical path for

The proxy passes your request to OpenAI / Anthropic / Gemini and streams the response back. Logging to ClickHouse happens after the response leaves for the client, via Vercel's waitUntil(). Concretely:

Critical for your user-facing latency: proxy auth, provider key decrypt, upstream fetch, stream pump back to your client.
Not critical for your user: writing the log row, computing cost, parsing usage. These happen after the bytes are on the wire.

So even when ClickHouse is unhappy, your application keeps returning responses to end users. The visible symptom is missing rows in /requests, not failed API calls.
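
The split between critical and non-critical work can be sketched as below. This is an illustration of the respond-first, log-later pattern, not the Spanlens source: `handleProxyRequest`, `callUpstream`, `writeLog`, and the injected `waitUntil` are hypothetical names standing in for the real handler and the platform's background-work hook.

```typescript
// Hypothetical types and names for illustration; not the actual Spanlens code.
type ProxyResult = { status: number; body: string };

interface Deps {
  // Forwards the request to the provider (critical path).
  callUpstream: (prompt: string) => Promise<ProxyResult>;
  // Writes the log row (off the critical path).
  writeLog: (result: ProxyResult) => Promise<void>;
  // Registers background work that may finish after the response is returned.
  waitUntil: (work: Promise<unknown>) => void;
}

async function handleProxyRequest(prompt: string, deps: Deps): Promise<ProxyResult> {
  const result = await deps.callUpstream(prompt);
  // Schedule logging without awaiting it, and swallow its errors so that a
  // logging failure can never turn into a failed API call.
  deps.waitUntil(
    deps.writeLog(result).catch(() => {
      /* a real logger would hand the row to a fallback queue here */
    }),
  );
  return result;
}
```

Because the log write is never awaited, a slow or failing store only affects what shows up in /requests, not what the caller receives.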

Failure modes and what happens

Failure	End-user impact	Dashboard impact	Auto-recovery
Upstream provider 5xx (OpenAI down)	Same as direct call: the SDK surfaces the 5xx.	Request still logged with the 5xx status_code.	Provider SDKs retry by default.
Provider 429 rate limit	Same as direct call: 429 returned.	Logged with status_code=429.	Provider SDKs retry with backoff.
Stream exceeds 290s budget	Stream closes gracefully; client sees an end-of-stream without the final sentinel.	Logged with `truncated: true`, partial response body kept.	None automatic. Keep `stream: true` but lower `max_tokens`, or self-host (no Vercel 300s limit).
Non-streaming > 35s	504 returned.	Logged with status_code=504.	Switch to streaming; first byte still arrives in ~200ms.
ClickHouse unreachable	None. Response already streamed.	Log row queued in Supabase `requests_fallback`.	Cron drains the queue every 5 min once ClickHouse is healthy.
Supabase Postgres down	None for the proxy itself. /api/v1/* endpoints (dashboard, key management) return 5xx.	Dashboard reads fail; proxy keeps logging to ClickHouse.	Supabase managed availability (cloud) or your HA setup (self-host).
Both ClickHouse and Supabase down	None. Response already streamed.	Log row LOST (no queue to land in).	Manual replay impossible. Self-host with HA Postgres + ClickHouse to avoid.

The fallback queue

When a ClickHouse insert throws, the logger catches the error and INSERTs the row into a Supabase table named requests_fallback. A cron route, POST /cron/replay-fallback, runs every 5 minutes, pulls up to 50 rows from the queue, and tries to insert them into ClickHouse. Rows that insert successfully are deleted from the queue; rows that fail have their retry_count incremented and stay queued.

Expiry: rows are dropped after 7 days or 100 retries, whichever comes first.
Ordering: the queue is FIFO by created_at across all organizations; ordering is not strict per organization.
Duplicates: ClickHouse has no UNIQUE constraint on the requests table, so race conditions can produce duplicate rows. We accept this trade-off today: we would rather lose fewer rows than dedupe in the hot path.

Source: apps/server/src/lib/fallback-replay.ts and apps/server/src/lib/logger.ts.
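
One replay pass, as described above, can be modeled in memory like this. It is a sketch under assumptions, not a copy of fallback-replay.ts: the `QueuedRow` shape, the `replayOnce` name, and the choice to apply expiry at the start of each pass are all illustrative. The batch size, retry cap, and 7-day limit come from the text.

```typescript
// In-memory model of one replay pass; field and function names are assumptions.
interface QueuedRow {
  id: number;
  createdAt: number; // epoch milliseconds
  retryCount: number;
}

const BATCH_SIZE = 50;
const MAX_RETRIES = 100;
const MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000;

function replayOnce(
  queue: QueuedRow[],
  insert: (row: QueuedRow) => boolean, // true when the ClickHouse insert succeeds
  now: number,
): QueuedRow[] {
  // Expiry: drop rows older than 7 days or with 100 retries.
  const live = queue
    .filter((r) => now - r.createdAt < MAX_AGE_MS && r.retryCount < MAX_RETRIES)
    .sort((a, b) => a.createdAt - b.createdAt); // FIFO by created_at
  const batch = live.slice(0, BATCH_SIZE);
  const rest = live.slice(BATCH_SIZE);
  const failed: QueuedRow[] = [];
  for (const row of batch) {
    // Success: the row leaves the queue. Failure: it stays, with one more retry.
    if (!insert(row)) failed.push({ ...row, retryCount: row.retryCount + 1 });
  }
  return failed.concat(rest);
}
```

The function returns the queue as it stands after the pass, so calling it repeatedly with an `insert` that always succeeds drains 50 rows per call.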

Health endpoints

Two endpoints, two purposes. Both are public; no auth required.

Endpoint	Purpose	Returns
`GET /health`	Process liveness. Cheap; safe to poll every 10s.	`200` always (if process is up).
`GET /health/deep`	Component health. Pings ClickHouse, checks fallback queue size.	`200` if all healthy, `503` if ClickHouse is unreachable.

Sample response from /health/deep:

```json
{
  "status": "ok",
  "timestamp": "2026-05-31T03:14:22.000Z",
  "clickhouse": { "ok": true, "latencyMs": 42 },
  "fallback": { "queue": 0 }
}
```

Monitor these from your own observability stack (Better Stack, UptimeRobot, Pingdom, Sentry Crons, anything that supports HTTP probes). We recommend two probes:

GET /health every 60s, alert if 2 consecutive failures.
GET /health/deep every 5 min, alert on 503 OR if fallback.queue > 1000 (queue not draining).
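
If your monitoring tool lets you script the check, the deep-health rule above might look like the following. The `DeepHealth` interface mirrors the sample response; `shouldAlert` is a hypothetical helper, and treating any status other than 200 or 503 as alert-worthy is an assumption of this sketch rather than documented behavior.

```typescript
// Shape follows the sample /health/deep response; the function name is hypothetical.
interface DeepHealth {
  status: string;
  timestamp: string;
  clickhouse: { ok: boolean; latencyMs: number };
  fallback: { queue: number };
}

const QUEUE_ALERT_THRESHOLD = 1000;

// Returns a human-readable alert reason, or null when nothing needs attention.
function shouldAlert(httpStatus: number, body: DeepHealth | null): string | null {
  if (httpStatus === 503) return "clickhouse unreachable";
  // Assumption: anything that is neither 200 nor 503 is also worth a look.
  if (httpStatus !== 200) return `unexpected status ${httpStatus}`;
  if (body !== null && body.fallback.queue > QUEUE_ALERT_THRESHOLD) {
    return `fallback queue not draining (${body.fallback.queue} rows)`;
  }
  return null;
}
```

Keeping the rule a pure function of status code and body makes it easy to reuse across whichever probe runner you already have.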

Status page

Public status: status.spanlens.io (when the service is down our marketing pages may be down too; bookmark this URL directly). The page tracks the proxy (liveness + deep health) and the dashboard independently, and posts incident updates within 15 minutes of first detection.

Subscribe by email or RSS directly on the status page (Subscribe button, top right). For real-time pages on critical work, also set up your own probe against /health/deep; the status page lags real detection by minutes.

What you should do client-side

Retry on 5xx and 429 from the proxy

The official OpenAI / Anthropic SDKs already do this. If you wrote a raw HTTP client, add at least 2 retries with exponential backoff on 5xx and 429.

Do not retry on 401 / 403 / 400

401 means your Spanlens key is wrong. 403 means the key lacks permission (e.g. wrong project). 400 typically means missing provider key for the requested provider. None of these benefit from a retry; surface to the user.
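
For a raw HTTP client, the two rules above (retry 5xx and 429, never retry 400 / 401 / 403) could be combined as in this sketch. `withRetry`, `isRetryable`, and the 500 ms base delay are illustrative choices, not part of any Spanlens SDK; `send` is whatever function performs your request and resolves with an object carrying a `status`.

```typescript
// Minimal response shape so the sketch does not depend on a specific HTTP client.
interface HttpResult {
  status: number;
}

// Retry only on rate limits and server errors; 400 / 401 / 403 fall through.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

async function withRetry<T extends HttpResult>(
  send: () => Promise<T>,
  maxRetries = 2,
  baseDelayMs = 500, // assumed starting delay; tune for your workload
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let attempt = 0;
  for (;;) {
    const res = await send();
    if (!isRetryable(res.status) || attempt >= maxRetries) return res;
    await sleep(baseDelayMs * 2 ** attempt); // exponential backoff: 500 ms, 1000 ms, ...
    attempt++;
  }
}
```

The sketch only inspects status codes; errors thrown by `send` (for example a dropped connection) propagate to the caller unchanged. Adding random jitter to the delay is a common refinement when many clients retry at once.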

Tolerate missing logs

Your application code should not block waiting for a Spanlens log to appear. A request returns to the user before the log is written; downstream features that depend on the log (e.g. real-time cost display) should poll with a small delay or accept eventual consistency.
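
A feature that needs the log row can poll for it off the request path, along these lines. `waitForLog` and its defaults (5 attempts, 1 s apart) are hypothetical, and `lookup` stands for however your code reads the row; the sketch does not assume any particular Spanlens endpoint.

```typescript
// Polls an injected lookup until it yields a row or the attempts run out.
async function waitForLog<T>(
  lookup: () => Promise<T | null>, // resolves to null while the row is not visible yet
  attempts = 5,
  delayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T | null> {
  for (let i = 0; i < attempts; i++) {
    const row = await lookup();
    if (row !== null) return row;
    if (i < attempts - 1) await sleep(delayMs);
  }
  // Still not visible: treat as "pending", not as an error.
  return null;
}
```

Run this in a background task or a follow-up UI request, and render a "pending" state when it returns null.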

Self-host if data residency matters more than ops effort

Self-hosting removes our cloud as a failure mode. You take on running Postgres + ClickHouse, but in exchange the latency budget is entirely under your control. See Self-hosting.

Incident response checklist

If you see missing rows in /requests:

Check status.spanlens.io.
curl https://api.spanlens.io/health/deep. If fallback.queue > 0, the rows are queued and will replay automatically; no action needed.
Verify your application is hitting the proxy (Network tab in the browser, or your APM trace). If requests are not reaching api.spanlens.io, the gap is on your side.
If status page is green AND /health/deep returns 200 AND your requests are reaching us, email support@spanlens.io with the request id (x-spanlens-request-id response header) and we will trace the missing row.

SLOs (cloud, hobby and paid)

Metric	Target	How measured
Proxy availability	99.9% monthly	`GET /health` success rate from external probe.
Logging completeness	99.95% of calls eventually logged	Compared against upstream provider invoice token counts daily.
Proxy overhead (p95)	< 50 ms	`proxy_overhead_ms` column on every Request row.
Fallback drain (p95)	< 15 min after ClickHouse recovers	Time between queue size peak and queue size 0.

Targets above are for the cloud product. Self-host SLOs are whatever you achieve; the code is the same.
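
To translate the percentage targets into concrete budgets, a small calculation helps. The `errorBudget` helper below is illustrative, and the 30-day month and one million calls are example inputs, not measured figures.

```typescript
// Portion of `total` that may fail while still meeting `target` (e.g. 0.999).
function errorBudget(total: number, target: number): number {
  return total * (1 - target);
}

const MINUTES_PER_30_DAY_MONTH = 30 * 24 * 60; // 43,200 minutes

// 99.9% availability leaves roughly 43.2 minutes of downtime per 30-day month.
const downtimeMinutes = errorBudget(MINUTES_PER_30_DAY_MONTH, 0.999);

// 99.95% logging completeness allows roughly 500 unlogged calls per million.
const unloggedPerMillion = errorBudget(1000000, 0.9995);

console.log(downtimeMinutes.toFixed(1), Math.round(unloggedPerMillion));
```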

Next: scaling for high throughput.