Reliability
The Spanlens proxy sits in the critical path of your LLM calls. This page covers what we guarantee, what degrades when, and how to detect each failure mode from your side without waiting for our status page.
What the proxy is on the critical path for
The proxy passes your request to OpenAI / Anthropic / Gemini and streams the response back. Logging to ClickHouse happens after the response leaves for the client, via Vercel's waitUntil(). Concretely:
- Critical for your user-facing latency: proxy auth, provider key decrypt, upstream fetch, stream pump back to your client.
- Not critical for your user: writing the log row, computing cost, parsing usage. These happen after the bytes are on the wire.
So even when ClickHouse is unhappy, your application keeps returning responses to end users. The visible symptom is missing rows in /requests, not failed API calls.
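The split between critical and non-critical work can be illustrated with a minimal sketch. It is not the proxy's actual handler: `handleProxyCall`, `callUpstream`, and `writeLog` are hypothetical names, the response is simplified to a string rather than a stream, and `waitUntil` is injected so the example runs anywhere.

```typescript
// Anything passed to waitUntil keeps running after the response is returned.
type WaitUntil = (task: Promise<unknown>) => void;

async function handleProxyCall(
  callUpstream: () => Promise<string>,
  writeLog: (body: string) => Promise<void>,
  waitUntil: WaitUntil,
): Promise<string> {
  // On the critical path: the caller waits for this.
  const body = await callUpstream();

  // Off the critical path: scheduled, never awaited here. A logging
  // failure is swallowed so it cannot turn into a failed API call.
  waitUntil(writeLog(body).catch(() => undefined));

  return body;
}
```

In this sketch a rejected `writeLog` only costs a log row, which matches the symptom described above: missing rows, not failed calls.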
Failure modes and what happens
| Failure | End-user impact | Dashboard impact | Auto-recovery |
|---|---|---|---|
| Upstream provider 5xx (OpenAI down) | Same as direct call: the SDK surfaces the 5xx. | Request still logged with the 5xx status_code. | Provider SDKs retry by default. |
| Provider 429 rate limit | Same as direct call: 429 returned. | Logged with status_code=429. | Provider SDKs retry with backoff. |
| Stream exceeds 290s budget | Stream closes gracefully; client sees an end-of-stream without sentinel. | Logged with truncated: true, partial response body kept. | Use stream: true with smaller max_tokens, or self-host (no Vercel 300s limit). |
| Non-streaming > 35s | 504 returned. | Logged with status_code=504. | Switch to streaming; first byte still arrives in ~200ms. |
| ClickHouse unreachable | None. Response already streamed. | Log row queued in Supabase requests_fallback. | Cron drains the queue every 5 min once ClickHouse is healthy. |
| Supabase Postgres down | None for the proxy itself. /api/v1/* endpoints (dashboard, key management) return 5xx. | Dashboard reads fail; proxy keeps logging to ClickHouse. | Supabase managed availability (cloud) or your HA setup (self-host). |
| Both ClickHouse and Supabase down | None. Response already streamed. | Log row LOST (no queue to land in). | Manual replay impossible. Self-host with HA Postgres + ClickHouse to avoid. |
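For the "Stream exceeds 290s budget" row, a client can detect the truncation itself by checking whether the stream ended with its terminating sentinel. The sketch below assumes an OpenAI-style server-sent-events stream whose last event is `data: [DONE]`; other providers signal completion differently, and `streamWasTruncated` is a hypothetical helper name.

```typescript
// Returns true when an OpenAI-style SSE transcript ended without the
// terminating `data: [DONE]` event (assumption: the full raw text of the
// stream has been collected into one string).
function streamWasTruncated(sseText: string): boolean {
  const dataLines = sseText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.startsWith("data:"));

  const last = dataLines.pop();
  if (last === undefined) return true; // no data events arrived at all

  return last.slice("data:".length).trim() !== "[DONE]";
}
```

When this returns true, the partial text is still usable; the application can decide whether to show it as-is or re-issue the request with a smaller max_tokens.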
The fallback queue
When a ClickHouse insert throws, the logger catches the error and inserts the row into a Supabase table named requests_fallback. A cron route, POST /cron/replay-fallback, runs every 5 minutes, pulls up to 50 rows from the queue, and tries to insert them into ClickHouse. Successful inserts are deleted from the queue; failed ones increment retry_count and stay queued.
- Expiry: rows are dropped after 7 days or 100 retries, whichever comes first.
- Ordering: queue is FIFO by created_at, not strict per-organization.
- Duplicates: ClickHouse has no UNIQUE constraint on the requests table. Race conditions can produce duplicate rows. Trade-off is accepted today; we prefer to lose fewer rows than dedupe in the hot path.
Source: apps/server/src/lib/fallback-replay.ts and apps/server/src/lib/logger.ts.
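The replay rules above can be modeled in a few lines. This is an in-memory illustration only, not the code in fallback-replay.ts: the `QueuedRow` shape, `replayBatch`, and the injected `insert` function are assumptions made for the example.

```typescript
interface QueuedRow {
  id: string;
  createdAt: number; // epoch milliseconds
  retryCount: number;
}

const BATCH_SIZE = 50;
const MAX_RETRIES = 100;
const MAX_AGE_MS = 7 * 24 * 60 * 60 * 1000;

// Runs one replay pass and returns the rows that remain queued afterwards.
async function replayBatch(
  queue: QueuedRow[],
  insert: (row: QueuedRow) => Promise<void>,
  now: number,
): Promise<QueuedRow[]> {
  // Expiry: drop rows older than 7 days or already retried 100 times.
  const live = queue.filter(
    (row) => now - row.createdAt <= MAX_AGE_MS && row.retryCount < MAX_RETRIES,
  );

  // FIFO by created_at, at most 50 rows per run.
  const ordered = [...live].sort((a, b) => a.createdAt - b.createdAt);
  const batch = ordered.slice(0, BATCH_SIZE);
  const untouched = ordered.slice(BATCH_SIZE);

  const stillQueued: QueuedRow[] = [];
  for (const row of batch) {
    try {
      await insert(row); // success: the row leaves the queue
    } catch {
      // failure: the row stays queued with an incremented retry count
      stillQueued.push({ ...row, retryCount: row.retryCount + 1 });
    }
  }
  return [...stillQueued, ...untouched];
}
```

At 50 rows per 5-minute run, a backlog of N rows needs roughly N / 50 runs to drain, which is worth keeping in mind when reading the queue size from /health/deep.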
Health endpoints
Two endpoints, two purposes. Both are public; no auth required.
| Endpoint | Purpose | Returns |
|---|---|---|
| GET /health | Process liveness. Cheap; safe to poll every 10s. | 200 always (if process is up). |
| GET /health/deep | Component health. Pings ClickHouse, checks fallback queue size. | 200 if all healthy, 503 if ClickHouse is unreachable. |
Sample response from /health/deep:
{
  "status": "ok",
  "timestamp": "2026-05-31T03:14:22.000Z",
  "clickhouse": { "ok": true, "latencyMs": 42 },
  "fallback": { "queue": 0 }
}
Monitor these from your own observability stack (Better Stack, UptimeRobot, Pingdom, Sentry Crons, anything that supports HTTP probes). We recommend two probes:
- GET /health every 60s, alert if 2 consecutive failures.
- GET /health/deep every 5 min, alert on 503 OR if fallback.queue > 1000 (queue not draining).
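If your monitoring tool supports custom evaluation logic, the two alert rules might be expressed as below. This is a sketch under assumptions: the `DeepHealth` type is inferred from the sample response rather than from a published schema, and `deepHealthAlert` and `livenessAlert` are hypothetical helper names.

```typescript
// Shape inferred from the sample /health/deep response.
interface DeepHealth {
  status: string;
  timestamp: string;
  clickhouse: { ok: boolean; latencyMs: number };
  fallback: { queue: number };
}

const QUEUE_ALERT_THRESHOLD = 1000;

// Returns a reason string when the deep probe should alert, or null when healthy.
function deepHealthAlert(httpStatus: number, body: DeepHealth | null): string | null {
  if (httpStatus === 503) return "health/deep returned 503";
  if (body !== null && body.fallback.queue > QUEUE_ALERT_THRESHOLD) {
    return `fallback queue not draining (${body.fallback.queue} rows)`;
  }
  return null;
}

// Alerts only when the two most recent liveness probes both failed.
// `recentResults` is ordered oldest to newest; true means the probe succeeded.
function livenessAlert(recentResults: boolean[]): boolean {
  const lastTwo = recentResults.slice(-2);
  return lastTwo.length === 2 && lastTwo.every((ok) => !ok);
}
```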
Status page
Public status: status.spanlens.io (when the service is down our marketing pages may be down too; bookmark this URL directly). The page tracks the proxy (liveness + deep health) and the dashboard independently, and posts incident updates within 15 minutes of first detection.
Subscribe by email or RSS directly on the status page (Subscribe button, top right). For real-time pages on critical work, set up your own probe against /health/deep as well; the status page lags real detection by minutes.
What you should do client-side
Retry on 5xx and 429 from the proxy
The official OpenAI / Anthropic SDKs already do this. If you wrote a raw HTTP client, add at least 2 retries with exponential backoff on 5xx and 429.
Do not retry on 401 / 403 / 400
401 means your Spanlens key is wrong. 403 means the key lacks permission (e.g. wrong project). 400 typically means missing provider key for the requested provider. None of these benefit from a retry; surface to the user.
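For a raw HTTP client, the two rules above (retry 5xx and 429, never retry 400 / 401 / 403) might look like the following sketch. The request itself is injected as `send` so the example stays independent of any particular HTTP library; the 500 ms base delay and the helper names are assumptions, not Spanlens recommendations.

```typescript
interface HttpResult {
  status: number;
}

// 5xx and 429 are worth retrying; 400, 401, and 403 are not.
function isRetryable(status: number): boolean {
  return status === 429 || (status >= 500 && status <= 599);
}

// Exponential backoff: baseMs, 2 * baseMs, 4 * baseMs, ...
function backoffDelayMs(attempt: number, baseMs = 500): number {
  return baseMs * 2 ** attempt;
}

// Calls `send` once, then up to `maxRetries` more times on retryable statuses.
// Exceptions thrown by `send` (e.g. network errors) are not caught here.
async function sendWithRetry<T extends HttpResult>(
  send: () => Promise<T>,
  maxRetries = 2,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    const result = await send();
    if (!isRetryable(result.status) || attempt >= maxRetries) return result;
    await sleep(backoffDelayMs(attempt));
  }
}
```

A production client would usually add jitter to the delay and honor a Retry-After header on 429 when the provider sends one.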
Tolerate missing logs
Your application code should not block waiting for a Spanlens log to appear. A request returns to the user before the log is written; downstream features that depend on the log (e.g. real-time cost display) should poll with a small delay or accept eventual consistency.
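One way to accept eventual consistency is a bounded poll that gives up quietly. In this sketch `lookup` stands in for however your application reads a log row; it is a hypothetical callback, not a documented Spanlens API, and the attempt count and delay are illustrative defaults.

```typescript
// Polls `lookup` until it yields a value or the attempts run out.
// Returns null on give-up so the caller can render a "pending" state
// instead of blocking the user-facing path.
async function pollForLog<T>(
  lookup: () => Promise<T | null>,
  attempts = 5,
  delayMs = 1000,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T | null> {
  for (let i = 0; i < attempts; i++) {
    const found = await lookup();
    if (found !== null) return found;
    if (i < attempts - 1) await sleep(delayMs); // no wait after the last try
  }
  return null;
}
```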
Self-host if data residency matters more than ops effort
Self-hosting removes our cloud as a failure mode. You take on running Postgres + ClickHouse, and in return the latency budget is entirely under your control. See Self-hosting.
Incident response checklist
If you see missing rows in /requests:
- Check status.spanlens.io.
- curl https://server.spanlens.io/health/deep. If fallback.queue > 0, the rows are queued and will replay automatically; no action needed.
- Verify your application is hitting the proxy (Network tab in the browser, or your APM trace). If requests are not reaching server.spanlens.io, the gap is on your side.
- If status page is green AND /health/deep returns 200 AND your requests are reaching us, email support@spanlens.io with the request id (x-spanlens-request-id response header) and we will trace the missing row.
SLOs (cloud, hobby and paid)
| Metric | Target | How measured |
|---|---|---|
| Proxy availability | 99.9% monthly | GET /health success rate from external probe. |
| Logging completeness | 99.95% of calls eventually logged | Compared against upstream provider invoice token counts daily. |
| Proxy overhead (p95) | < 50 ms | proxy_overhead_ms column on every Request row. |
| Fallback drain (p95) | < 15 min after ClickHouse recovers | Time between queue size peak and queue size 0. |
Targets above are for the cloud product. Self-host SLOs are whatever you achieve; the code is the same.
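To make the percentage targets concrete, the arithmetic below converts them into budgets. It is a worked example only: the 30-day month and the helper names are assumptions, and the official measurement window may differ.

```typescript
// Minutes of allowed downtime in a month for a given availability target.
function downtimeBudgetMinutes(targetPercent: number, daysInMonth = 30): number {
  const totalMinutes = daysInMonth * 24 * 60;
  return (totalMinutes * (100 - targetPercent)) / 100;
}

// Number of calls that may go unlogged for a given completeness target.
function allowedMissingLogs(totalCalls: number, targetPercent: number): number {
  return (totalCalls * (100 - targetPercent)) / 100;
}

// 99.9% over a 30-day month leaves about 43.2 minutes of downtime.
const availabilityBudget = downtimeBudgetMinutes(99.9);

// 99.95% completeness allows about 5 unlogged calls per 10,000.
const loggingBudget = allowedMissingLogs(10_000, 99.95);
```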
Next: scaling for high throughput.