Scaling
At small volume Spanlens runs at default settings without thought. Past a few requests per second you start hitting trade-offs between log fidelity, latency, and cost. This page is the explicit map of those trade-offs and the levers available.
Latency budget
| Step | Typical | p95 | Notes |
|---|---|---|---|
| DNS + TLS handshake (first call) | ~80 ms | ~250 ms | Amortized to ~0 ms for keep-alive connections. |
| Auth (API key lookup) | ~5 ms | ~15 ms | Hash + DB lookup, cached in-process per warm container. |
| Provider key decrypt | ~2 ms | ~5 ms | AES-256-GCM via Web Crypto. |
| Upstream provider call | varies | varies | This is your model latency. Spanlens does not add to it. |
| Stream pump (per chunk) | < 1 ms | ~3 ms | tee() into our log buffer, passthrough to client. |
| Log write (async, off critical path) | ~30 ms | ~150 ms | Runs after response leaves; does not delay your user. |
Bottom line: for warm connections the user-visible overhead is ~10 ms typical, ~50 ms p95. That number is recorded on every Request row in the proxy_overhead_ms column so you can audit it directly in /requests.
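To audit the overhead yourself, you can compute percentiles over exported rows. The following is a minimal sketch, assuming you have already pulled rows out of /requests into memory; the `RequestRow` shape and the sample values are hypothetical, and only the proxy_overhead_ms column name comes from this page. It uses the nearest-rank percentile definition, which may differ slightly from what the dashboard displays.

```typescript
// Hypothetical row shape; only the proxy_overhead_ms column is taken from the docs.
interface RequestRow {
  proxy_overhead_ms: number
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples')
  const sorted = [...values].sort((a, b) => a - b)
  const rank = Math.ceil((p / 100) * sorted.length)
  return sorted[Math.max(rank, 1) - 1]
}

// Made-up sample: nine warm calls and one slow outlier.
const rows: RequestRow[] = [8, 9, 10, 11, 12, 9, 10, 48, 10, 11].map(
  (ms) => ({ proxy_overhead_ms: ms }),
)

const overheads = rows.map((r) => r.proxy_overhead_ms)
const p50 = percentile(overheads, 50) // 10
const p95 = percentile(overheads, 95) // 48
```

With only ten samples the p95 is simply the slowest call, so use a window large enough for the tail to be meaningful.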
Lever 1: log body mode
The biggest driver of log size is the request and response body. By default both are kept in full. For high-throughput apps where bodies are large but you only need cost and trace structure, drop to meta or none.
| Mode | request_body / response_body | user_id / session_id | When to use |
|---|---|---|---|
| full (default) | kept | kept | Most teams. Bodies are essential for debugging. |
| meta | empty string | kept | You only need cost / latency / token / user-level analytics. |
| none | empty string | null | Strict PII zones. You still get cost and structure, no identifying data. |
Set per-call via header or SDK helper:
```ts
import { createOpenAI, withLogBody } from '@spanlens/sdk/openai'

const openai = createOpenAI()

const res = await openai.chat.completions.create(
  { ... },
  { headers: withLogBody('meta').headers },
)
```

Or set process-wide via the SDK:
```ts
import { observeOpenAI } from '@spanlens/sdk/openai'

// `openai` is the client created in the previous snippet
await observeOpenAI(openai, { logBody: 'meta' }, async () => { ... })
```

Storage impact: full mode adds ~2 KB per call (ZSTD compressed). meta mode is ~150 bytes per call. At 10M calls/month the difference is roughly 20 GB vs 1.5 GB of column-store space.
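The storage figures are plain multiplication, so you can plug in your own volume. This is a back-of-the-envelope sketch, assuming decimal units (1 KB = 1000 bytes, 1 GB = 10^9 bytes) and the per-call averages quoted above; `monthlyStorageGB` is a hypothetical helper, not part of the SDK.

```typescript
// Average compressed bytes per call, taken from the figures on this page.
// Assumption: decimal units (1 KB = 1000 bytes).
const BYTES_PER_CALL = { full: 2000, meta: 150 } as const

type BodyMode = keyof typeof BYTES_PER_CALL

// Estimated column-store space consumed per month, in GB (1 GB = 1e9 bytes).
function monthlyStorageGB(callsPerMonth: number, mode: BodyMode): number {
  return (callsPerMonth * BYTES_PER_CALL[mode]) / 1e9
}

const fullGB = monthlyStorageGB(10_000_000, 'full') // 20
const metaGB = monthlyStorageGB(10_000_000, 'meta') // 1.5
```

Your real average depends on prompt and completion sizes, so treat the result as an order-of-magnitude estimate.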
Lever 2: sampling
For very high volumes (10k+ rps), even compressed bodies add up. Sample at the SDK level:
```ts
import { createOpenAI } from '@spanlens/sdk/openai'

const openai = createOpenAI({
  sampling: {
    rate: 0.1, // log 10% of calls
    alwaysLogErrors: true, // override sampling for status >= 400
  },
})
```

Sampled-out calls still flow through the proxy normally; they are just not persisted. Aggregate metrics (cost, latency) are scaled up by the inverse of the sample rate when displayed.
The sampling decision is made per call, before the request is sent. With alwaysLogErrors: true, any response with status >= 400 is persisted regardless of that decision, so you never miss a 5xx and debugging stays intact.
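The interaction between the up-front sampling decision, the error override, and inverse-rate scaling can be sketched as follows. This is an illustration under stated assumptions, not the SDK's internals: all function names and the `LoggedCall` shape are hypothetical. This page only says aggregates are scaled by the inverse sample rate; giving force-logged errors a weight of 1 is an assumption made here, on the grounds that they are never dropped.

```typescript
// Hypothetical shape of a persisted call.
interface LoggedCall {
  status: number
  costUsd: number
}

// Decided before the request is sent. `rand` is injectable so the decision is testable.
function inSample(rate: number, rand: () => number = Math.random): boolean {
  return rand() < rate
}

// After the response: persist if sampled in, or if it is an error and the override is on.
function shouldPersist(sampledIn: boolean, status: number, alwaysLogErrors: boolean): boolean {
  return sampledIn || (alwaysLogErrors && status >= 400)
}

// Assumption: errors are always persisted, so each stands for itself (weight 1).
// Every other persisted call stands for 1 / rate calls.
function weight(call: LoggedCall, rate: number, alwaysLogErrors: boolean): number {
  return alwaysLogErrors && call.status >= 400 ? 1 : 1 / rate
}

// Scaled-up estimate of total cost across logged and sampled-out calls.
function estimateTotalCost(logged: LoggedCall[], rate: number, alwaysLogErrors: boolean): number {
  return logged.reduce((sum, c) => sum + c.costUsd * weight(c, rate, alwaysLogErrors), 0)
}

const logged: LoggedCall[] = [
  { status: 200, costUsd: 0.02 },
  { status: 200, costUsd: 0.03 },
  { status: 500, costUsd: 0.01 },
]

// Two successes scaled by 10 plus one error at face value: approximately 0.51.
const estimatedCost = estimateTotalCost(logged, 0.1, true)
```

If every logged row were scaled by 1 / rate, always-logged errors would be over-counted by that factor, which is why the sketch weights them separately.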
Lever 3: trace sampling
Traces are usually orders of magnitude lighter than request bodies, so most teams log every trace. If you need to sample, do it at the trace-creation site:
```ts
const shouldTrace = Math.random() < 0.2 // 20% sample
const trace = shouldTrace
  ? client.startTrace({ name: 'agent' })
  : null
// ... wrap observe() calls only if trace is non-null
```

Sampled-out traces simply do not exist. There is no per-span sampling; if a trace is on, all its spans are kept.
Lever 4: streaming for long generations
Long generations should use stream: true. The proxy streams chunks straight through; first byte arrives in ~200 ms regardless of total duration. Streams run under a 290 s deadline (the Vercel Pro cap); if a stream does hit it, the Request is logged with truncated: true and the partial response body is kept.
For non-streaming requests, the upstream fetch is gated at UPSTREAM_TIMEOUT_MS = 35000 for initial headers. Bigger jobs should use streaming, period.
Connection reuse
Most provider SDKs maintain an HTTPS keep-alive pool. Make sure yours does:
- OpenAI Node SDK: keep-alive on by default.
- Anthropic Node SDK: keep-alive on by default.
- Raw fetch: no keep-alive by default in some runtimes. In Node, use undici's default Agent.
TLS handshake cost dominates first-call latency. With keep-alive, the per-call overhead drops to the <15 ms numbers in the table above.
Concurrency on the proxy
Spanlens cloud runs on Vercel Pro with a per-region invocation pool. There is no per-account concurrency limit we enforce at the proxy layer; the upstream provider's rate limit is the real ceiling.
If you are bursting hard enough to saturate Vercel's pool, you will see 503 from the proxy. The provider SDKs retry these. For sustained high traffic, talk to us about a dedicated deployment, or self-host on infra you control.
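If you call the proxy with raw fetch rather than a provider SDK, nothing retries a 503 for you. The following is a minimal sketch of capped exponential backoff; `backoffDelayMs`, `withRetry`, and the default delays are hypothetical choices for illustration, and it omits jitter, which you would normally add so that bursting clients do not retry in lockstep.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped.
function backoffDelayMs(attempt: number, baseMs = 250, capMs = 8000): number {
  return Math.min(capMs, baseMs * 2 ** attempt)
}

// Calls `send` up to `maxAttempts` times in total, retrying only on HTTP 503.
// `sleep` is injectable so the loop can be exercised without real delays.
async function withRetry<T extends { status: number }>(
  send: () => Promise<T>,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let res = await send()
  for (let attempt = 0; attempt < maxAttempts - 1 && res.status === 503; attempt++) {
    await sleep(backoffDelayMs(attempt))
    res = await send()
  }
  return res
}

const firstDelay = backoffDelayMs(0) // 250
const fourthDelay = backoffDelayMs(3) // 2000
const cappedDelay = backoffDelayMs(10) // 8000
```

A call such as `withRetry(() => fetch(url, init))` works because a fetch Response exposes `status`. If the last attempt still returns 503, the caller receives that response and decides what to do with it.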
Self-host tuning
When you self-host, the bottlenecks shift to your Postgres + ClickHouse setup. Defaults work for thousands of req/s on a single ClickHouse node; past that:
ClickHouse
- Partition by month is plenty up to ~100M rows per month per project.
- ORDER BY (organization_id, project_id, created_at, id) is tuned for tenant-scoped time queries. Do not change without re-bench.
- ZSTD(3) on bodies is the sweet spot. ZSTD(9) buys ~15% more compression at 2x more CPU.
- Asynchronous inserts with async_insert=1 reduce write amplification on bursty workloads. Trade-off: up to 1s additional log latency, no data loss.
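One way to turn on asynchronous inserts per request is to pass the setting as a URL parameter on ClickHouse's HTTP interface. This is a sketch under assumptions: the host, port, and `requests` table name are hypothetical, and how the Spanlens self-host build actually talks to ClickHouse is not described here. `wait_for_async_insert=1` is included on the assumption that you want the server to acknowledge only after the buffer is flushed.

```typescript
// Builds an INSERT URL for the ClickHouse HTTP interface with per-query settings.
// The row data itself is sent as the POST body in JSONEachRow format.
function clickhouseInsertUrl(
  baseUrl: string,
  table: string,
  settings: Record<string, string | number>,
): string {
  const params: string[][] = [
    ['query', `INSERT INTO ${table} FORMAT JSONEachRow`],
    ...Object.entries(settings).map(([key, value]) => [key, String(value)]),
  ]
  const queryString = params
    .map(([key, value]) => `${encodeURIComponent(key)}=${encodeURIComponent(value)}`)
    .join('&')
  return `${baseUrl}/?${queryString}`
}

// Hypothetical local node and table name.
const insertUrl = clickhouseInsertUrl('http://localhost:8123', 'requests', {
  async_insert: 1,
  wait_for_async_insert: 1,
})
```

The same settings can instead be applied in a user profile on the server, which avoids repeating them on every request.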
Supabase Postgres
Postgres handles traces, spans, prompts, and evals. None of these is an append-only log growing at request volume; even at 1M traces/month one Supabase project handles it comfortably.
- RLS adds ~2 ms per query. Worth it for the multi-tenant isolation.
- The spans_refresh_trace_aggregates trigger fires on every span INSERT / UPDATE, so heavy span churn on a single trace amplifies the work. If you measure this as a hotspot, switch to a periodic recompute.
Replay queue
The requests_fallback queue drains 50 rows per 5-minute cron tick by default (~10 rows/min). For higher recovery throughput, change the cron schedule in vercel.json or run the replay handler as a long-lived worker instead.
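To judge whether the default drain rate is enough, estimate how long a given backlog takes to clear. This is a rough sketch, assuming no new rows arrive while the queue drains and that every tick processes a full batch; `drainMinutes` is a hypothetical helper.

```typescript
// Minutes needed to drain a backlog at a fixed batch size per cron tick.
// Defaults mirror the documented 50 rows per 5-minute tick.
function drainMinutes(backlogRows: number, rowsPerTick = 50, tickMinutes = 5): number {
  return Math.ceil(backlogRows / rowsPerTick) * tickMinutes
}

const defaultDrain = drainMinutes(1000) // 100 minutes
const fasterCron = drainMinutes(1000, 50, 1) // 20 minutes with a 1-minute schedule
```

A backlog of 1000 rows therefore takes well over an hour to replay at the defaults, which is worth knowing before an incident rather than during one.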
What to monitor
| Metric | Where | Alert threshold |
|---|---|---|
| proxy_overhead_ms p95 | aggregate /requests column | > 80 ms for 10 minutes |
| 5xx rate | group requests by status_code | > 1% for 5 minutes |
| Fallback queue size | GET /health/deep | > 1000 sustained for 30 min |
| Truncated streams | requests where truncated=true | > 5% of streaming requests |
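The thresholds in the table can be checked mechanically once you have per-window aggregates. The following is a minimal sketch: the `WindowStats` shape and `breachedThresholds` are hypothetical, and it evaluates a single window only, so the "for N minutes" duration conditions still have to be enforced by whatever alerting system calls it.

```typescript
// Hypothetical aggregates for one evaluation window.
interface WindowStats {
  proxyOverheadP95Ms: number
  totalRequests: number
  serverErrors: number // responses with status 5xx
  streamingRequests: number
  truncatedStreams: number
  fallbackQueueSize: number
}

// Returns the names of the metrics whose threshold is exceeded in this window.
function breachedThresholds(s: WindowStats): string[] {
  const breached: string[] = []
  if (s.proxyOverheadP95Ms > 80) breached.push('proxy_overhead_ms p95')
  if (s.totalRequests > 0 && s.serverErrors / s.totalRequests > 0.01) breached.push('5xx rate')
  if (s.fallbackQueueSize > 1000) breached.push('fallback queue size')
  if (s.streamingRequests > 0 && s.truncatedStreams / s.streamingRequests > 0.05) {
    breached.push('truncated streams')
  }
  return breached
}

// Made-up window: 1.5% of requests are 5xx, everything else is within limits.
const breached = breachedThresholds({
  proxyOverheadP95Ms: 62,
  totalRequests: 20000,
  serverErrors: 300,
  streamingRequests: 4000,
  truncatedStreams: 100,
  fallbackQueueSize: 40,
}) // ['5xx rate']
```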
Cost optimization tactics
- Switch to meta for chatty internal services. A customer support bot that gets the same 50 messages over and over does not need bodies stored 50 times.
- Use sampling for high-volume embeddings. Embedding calls are often 100x more frequent than completions and contain less debugging value. Sample at 10% and you keep all the signal at 1/10 the storage.
- Self-host if you do millions of calls per day. Cloud pricing crosses over with self-host TCO somewhere around 10M calls/month for most teams.
Next: reliability for failure modes and recovery, or self-hosting for full control.