Scaling

At small volume Spanlens runs at default settings without thought. Past a few requests per second you start hitting trade-offs between log fidelity, latency, and cost. This page is the explicit map of those trade-offs and the levers available.

Latency budget

Step	Typical	p95	Notes
DNS + TLS handshake (first call)	~80 ms	~250 ms	Amortized to ~0 ms for keep-alive connections.
Auth (API key lookup)	~5 ms	~15 ms	Hash + DB lookup, cached in-process per warm container.
Provider key decrypt	~2 ms	~5 ms	AES-256-GCM via Web Crypto.
Upstream provider call	varies	varies	This is your model latency. Spanlens does not add to it.
Stream pump (per chunk)	< 1 ms	~3 ms	tee() into our log buffer, passthrough to client.
Log write (async, off critical path)	~30 ms	~150 ms	Runs after response leaves; does not delay your user.

Bottom line: for warm connections the user-visible overhead is ~10 ms typical, ~50 ms p95. That number is recorded on every Request row in the proxy_overhead_ms column so you can audit it directly in /requests.
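To check those numbers against your own traffic, you can compute percentiles over exported rows. The following is a minimal sketch, assuming you have already pulled rows into memory; the RequestRow shape and the function names are illustrative, and only the proxy_overhead_ms column name comes from this page.

```typescript
// Hypothetical shape of an exported Request row; only proxy_overhead_ms is from the docs.
interface RequestRow {
  proxy_overhead_ms: number
}

// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new Error('no samples')
  const sorted = [...values].sort((a, b) => a - b)
  // Multiply before dividing to keep the rank exact for whole-number percentiles.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100))
  const value = sorted[rank - 1]
  if (value === undefined) throw new RangeError('p must be between 0 and 100')
  return value
}

// Summarizes proxy overhead for a batch of rows.
function overheadSummary(rows: RequestRow[]): { p50: number; p95: number } {
  const ms = rows.map((r) => r.proxy_overhead_ms)
  return { p50: percentile(ms, 50), p95: percentile(ms, 95) }
}
```

Run it over a window that contains only warm-connection traffic if you want a like-for-like comparison with the figures above.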

Lever 1: log body mode

The biggest driver of log size is the request and response body. By default we keep both in full. For high-throughput apps where bodies are large but you only need cost and trace structure, drop to meta or none.

Mode	request_body / response_body	user_id / session_id	When to use
`full` (default)	kept	kept	Most teams. Bodies are essential for debugging.
`meta`	empty string	kept	You only need cost / latency / token / user-level analytics.
`none`	empty string	null	Strict PII zones. You still get cost and structure, no identifying data.

Set per-call via header or SDK helper:

import { createOpenAI, withLogBody } from '@spanlens/sdk/openai'

const openai = createOpenAI()

const res = await openai.chat.completions.create(
  { ... },
  { headers: withLogBody('meta').headers },
)

Or set process-wide via the SDK:

import { observeOpenAI } from '@spanlens/sdk/openai'

await observeOpenAI(openai, { logBody: 'meta' }, async () => { ... })

Storage impact: full mode adds ~2 KB per call (ZSTD compressed). meta mode is ~150 bytes per call. At 10M calls/month the difference is roughly 20 GB vs 1.5 GB of column-store space.
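The arithmetic behind those figures is simple enough to rerun for your own volume. This is a rough sketch using the per-call sizes quoted above with decimal units (1 KB = 1000 bytes); the constant and function names are illustrative, and real sizes depend on how well your bodies compress.

```typescript
// Approximate compressed bytes per call, taken from the figures on this page.
const BYTES_PER_CALL = { full: 2_000, meta: 150 } as const

type LogBodyMode = keyof typeof BYTES_PER_CALL

// Estimated column-store space added per month, in decimal gigabytes.
function monthlyStorageGB(callsPerMonth: number, mode: LogBodyMode): number {
  return (callsPerMonth * BYTES_PER_CALL[mode]) / 1e9
}

// 10M calls/month: 20 GB in full mode, 1.5 GB in meta mode.
const fullGB = monthlyStorageGB(10_000_000, 'full')
const metaGB = monthlyStorageGB(10_000_000, 'meta')
```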

Lever 2: sampling

For very high volumes (10k+ rps), even compressed bodies add up. Sample at the SDK level:

import { createOpenAI } from '@spanlens/sdk/openai'

const openai = createOpenAI({
  sampling: {
    rate: 0.1,           // log 10% of calls
    alwaysLogErrors: true, // override sampling for status >= 400
  },
})

Sampled-out calls still flow through the proxy normally; they are just not persisted. With alwaysLogErrors: true, error responses (status >= 400) are logged regardless of sample rate, so debugging stays intact. Aggregate metrics (cost, latency) are scaled up by the inverse of the sample rate when displayed.

Sampling is a per-call decision made before the request is sent. Use alwaysLogErrors: true to ensure you never miss an error response, including every 5xx.
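If you export sampled logs and want to reproduce scaled totals yourself, inverse-rate weighting looks roughly like this. The sketch below is an assumption about how such scaling can be done, not a description of the Spanlens dashboard: in particular, giving always-logged errors a weight of 1 is our choice here, and the LoggedCall shape is hypothetical.

```typescript
// Hypothetical shape of a persisted call.
interface LoggedCall {
  costUsd: number
  status: number
}

// Estimates true total cost from sampled logs.
// Sampled rows each stand in for 1/rate calls; errors that bypassed
// sampling were all logged, so they count exactly once.
function estimateTotalCost(
  logged: LoggedCall[],
  rate: number,
  alwaysLogErrors: boolean,
): number {
  if (rate <= 0 || rate > 1) throw new RangeError('rate must be in (0, 1]')
  let total = 0
  for (const call of logged) {
    const bypassedSampling = alwaysLogErrors && call.status >= 400
    const weight = bypassedSampling ? 1 : 1 / rate
    total += call.costUsd * weight
  }
  return total
}
```

At low rates and low volumes the estimate is noisy; it is only as good as the number of sampled rows behind it.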

Lever 3: trace sampling

Traces are usually orders of magnitude lighter than request bodies, so most teams log every trace. If you need to sample, do it at the trace-creation site:

const shouldTrace = Math.random() < 0.2  // 20% sample
const trace = shouldTrace
  ? client.startTrace({ name: 'agent' })
  : null

// ... wrap observe() calls only if trace is non-null

Sampled-out traces simply do not exist. There is no per-span sampling; if a trace is on, all its spans are kept.
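To keep the sampling decision in one place and make it testable, the pattern above can be wrapped in a small helper. This is a sketch only: sampleTrace is a hypothetical name, not part of the SDK, and the random source is injectable purely so tests can force either branch.

```typescript
// Hypothetical helper: starts a trace for roughly `rate` of calls, otherwise returns null.
// `start` is only invoked when the call is sampled in, so sampled-out traces never exist.
function sampleTrace<T>(
  rate: number,
  start: () => T,
  random: () => number = Math.random,
): T | null {
  return random() < rate ? start() : null
}

// Usage with the client from the snippet above (20% sample):
//   const trace = sampleTrace(0.2, () => client.startTrace({ name: 'agent' }))
//   if (trace !== null) { /* wrap observe() calls */ }
```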

Lever 4: streaming for long generations

Any generation that might run long should use stream: true. Streams are subject to a 290s deadline (Vercel Pro cap). The proxy streams chunks straight through; first byte arrives in ~200 ms regardless of total duration. If the stream does hit the deadline, the Request is logged with truncated: true and the partial response body is kept.

For non-streaming requests, the upstream fetch is gated at UPSTREAM_TIMEOUT_MS = 35000 for initial headers. Bigger jobs should use streaming, period.

Connection reuse

Most provider SDKs maintain an HTTPS keep-alive pool. Make sure yours does:

OpenAI Node SDK: keep-alive on by default.
Anthropic Node SDK: keep-alive on by default.
Raw fetch: no keep-alive by default in some runtimes. In Node, use undici's default Agent.

TLS handshake cost dominates first-call latency. With keep-alive, the per-call overhead drops to the <15 ms numbers in the table above.

Concurrency on the proxy

Spanlens cloud runs on Vercel Pro with a per-region invocation pool. There is no per-account concurrency limit we enforce at the proxy layer; the upstream provider's rate limit is the real ceiling.

If you are bursting hard enough to saturate Vercel's pool, you will see 503 from the proxy. The provider SDKs retry these. For sustained high traffic, talk to us about a dedicated deployment, or self-host on infra you control.
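If you call the proxy with raw fetch rather than a provider SDK, you need to handle those 503s yourself. Here is a minimal sketch of capped exponential backoff; the function names, the 250 ms base, the 8 s cap, and the retry count are all illustrative choices, not Spanlens defaults.

```typescript
// Capped exponential backoff: baseMs * 2^attempt, never above capMs.
function backoffDelayMs(attempt: number, baseMs = 250, capMs = 8_000): number {
  return Math.min(capMs, baseMs * 2 ** attempt)
}

// Retries `call` while it returns 503, up to maxRetries additional attempts.
// `sleep` is injectable so the loop can be tested without real delays.
async function retryOn503<R extends { status: number }>(
  call: () => Promise<R>,
  maxRetries = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<R> {
  let res = await call()
  for (let attempt = 0; attempt < maxRetries && res.status === 503; attempt++) {
    await sleep(backoffDelayMs(attempt))
    res = await call()
  }
  return res
}
```

In production you would usually add random jitter to the delay so that many clients retrying at once do not hit the pool in lockstep.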

Self-host tuning

When you self-host, the bottlenecks shift to your Postgres + ClickHouse setup. Defaults work for thousands of req/s on a single ClickHouse node; past that:

ClickHouse

Partition by month is plenty up to ~100M rows per month per project.
ORDER BY (organization_id, project_id, created_at, id) is tuned for tenant-scoped time queries. Do not change without re-bench.
ZSTD(3) on bodies is the sweet spot. ZSTD(9) buys ~15% more compression at roughly 2x the CPU.
Asynchronous inserts with async_insert=1 reduce write amplification on bursty workloads. Trade-off: up to 1s additional log latency, no data loss.

Supabase Postgres

Postgres handles traces, spans, prompts, and evals. None of these tables grows at request volume; even at 1M traces/month one Supabase project handles it comfortably.

RLS adds ~2 ms per query. Worth it for the multi-tenant isolation.
The spans_refresh_trace_aggregates trigger fires on every span INSERT / UPDATE, so heavy span churn on a single trace amplifies its cost. If you measure this as a hotspot, switch to a periodic recompute.

Replay queue

The requests_fallback queue drains 50 rows per 5-minute cron tick by default (~10 rows/min). For higher recovery throughput, change the cron schedule in vercel.json or run the replay handler as a long-lived worker instead.
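To judge whether the default schedule is enough, estimate how long a backlog takes to clear. The sketch below uses the default batch size and tick interval from this section and assumes no new rows arrive while draining; the function name is illustrative.

```typescript
// Minutes needed to drain a backlog at a fixed batch size per cron tick.
// Assumes the queue receives no new rows during the drain.
function drainMinutes(backlogRows: number, rowsPerTick = 50, tickMinutes = 5): number {
  const ticks = Math.ceil(backlogRows / rowsPerTick)
  return ticks * tickMinutes
}

// A 1000-row backlog at the defaults takes 20 ticks, i.e. 100 minutes.
const defaultDrain = drainMinutes(1000)
```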

What to monitor

Metric	Where	Alert threshold
`proxy_overhead_ms` p95	aggregate /requests column	> 80 ms for 10 minutes
5xx rate	group requests by status_code	> 1% for 5 minutes
Fallback queue size	`GET /health/deep`	> 1000 sustained for 30 min
Truncated streams	requests where `truncated=true`	> 5% of streaming requests
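The two rate-based thresholds in the table can be evaluated over a window of exported rows. This is a sketch under stated assumptions: status_code and truncated appear on this page, the streamed flag is a hypothetical field for telling streaming requests apart, and the duration conditions ("for 5 minutes") are left to whatever alerting system calls this per window.

```typescript
// Hypothetical row shape; `streamed` is an assumed field, not a documented column.
interface MonitoredRow {
  status_code: number
  truncated: boolean
  streamed: boolean
}

interface AlertCheck {
  errorRate: number
  truncatedRate: number
  errorAlert: boolean
  truncatedAlert: boolean
}

// Evaluates the 5xx-rate and truncated-stream thresholds for one window of rows.
function checkAlerts(rows: MonitoredRow[]): AlertCheck {
  const streaming = rows.filter((r) => r.streamed)
  const serverErrors = rows.filter((r) => r.status_code >= 500).length
  const truncated = streaming.filter((r) => r.truncated).length
  const errorRate = rows.length === 0 ? 0 : serverErrors / rows.length
  const truncatedRate = streaming.length === 0 ? 0 : truncated / streaming.length
  return {
    errorRate,
    truncatedRate,
    errorAlert: errorRate > 0.01, // > 1% 5xx
    truncatedAlert: truncatedRate > 0.05, // > 5% of streaming requests
  }
}
```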

Cost optimization tactics

Switch to meta for chatty internal services. A customer support bot that gets the same 50 messages over and over does not need bodies stored 50 times.
Use sampling for high-volume embeddings. Embedding calls are often 100x more frequent than completions and contain less debugging value. Sample at 10% and you keep the aggregate signal at 1/10 the storage.
Self-host if you do millions of calls per day. Cloud pricing crosses over with self-host TCO somewhere around 10M calls/month for most teams.

Next: reliability for failure modes and recovery, or self-hosting for full control.