Health & Metrics
Three operational probes at the version root, with aliases at the server root (/healthz,
/readyz) for load balancers. The health and readiness probes do not require auth by
default, so a load balancer can poll them without a key; /v0/metrics requires a
read-scope key whenever keys are configured. Set TOPICS_PROBE_AUTH=true to require auth
on all three.
Distinguish the two health states: liveness (/v0/health) answers “can this process
serve at all?” and is 200 whenever the process is up. Readiness (/v0/ready)
answers “should traffic be routed here now?” and returns 503 during WAL replay on
boot and while draining on shutdown. Route real traffic on readiness; restart on
liveness.
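The rule above can be sketched as a small decision function. This is only an illustration: `probe_action` and the action labels it returns are hypothetical names, not part of the server or of any orchestrator API. Only the status-code semantics come from this page.

```python
from typing import Optional


def probe_action(health_status: Optional[int], ready_status: Optional[int]) -> str:
    """Map one round of probe results to an orchestrator action.

    A status of None stands for a probe that timed out or could not connect.
    The returned labels are hypothetical and exist only for illustration.
    """
    # Liveness failed: the process cannot serve at all, so restart it.
    if health_status != 200:
        return "restart"
    # Alive but not ready (WAL replay on boot, or draining on shutdown):
    # leave the process running and keep traffic away from it.
    if ready_status != 200:
        return "hold_traffic"
    # Alive and ready: route real traffic here.
    return "route_traffic"
```

Note that a 503 from readiness never triggers a restart in this sketch; restarting a process that is still replaying its WAL would only restart the replay.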
GET /v0/health — liveness
Alias: GET /healthz. Returns 200 always while
the process can serve. Use it for the container/orchestrator liveness probe (restart on
failure).
    curl localhost:4000/v0/health

    { "status": "ok", "version": "0.1.0", "uptime_ms": 84012 }

| Field | Type | Meaning |
|---|---|---|
| status | string | Always "ok" while serving. |
| version | string | The running server version. |
| uptime_ms | u64 | Milliseconds since process start. |
Health is 200 even during WAL replay — the process is alive and will become ready. Use
/v0/ready (below) to gate traffic during replay; use /v0/health only to decide
whether to restart the process.
GET /v0/ready — readiness
Alias: GET /readyz. Returns 200 when the
server is serving normally, and 503 while it is not yet (or no longer) able to take
traffic. This is the load-balancer / Kubernetes readiness gate: on boot it stays 503
until WAL replay finishes, then flips to 200; on SIGINT/SIGTERM it returns 503 shutting_down while draining.
Ready (200)
    curl localhost:4000/v0/ready

    { "status": "ready", "wal_replay_complete": true, "topics": 42 }

| Field | Type | Meaning |
|---|---|---|
| status | string | "ready" when serving. |
| wal_replay_complete | bool | true once boot-time WAL replay has finished. |
| topics | u64 | Number of topics currently registered. |
Not ready — during WAL replay (503)
While replaying the write-ahead log on boot, the probe returns 503 not_ready with a
Retry-After header and a replay-progress fraction in the error detail:

    { "error": {
        "code": "not_ready",
        "message": "WAL replay in progress",
        "detail": { "replay_progress": 0.62 } } }

error.detail.replay_progress runs from 0.0 to 1.0. The probe flips to 200 the moment
replay completes. While draining on shutdown it returns 503 shutting_down (also with
Retry-After).
| Status | error.code | When |
|---|---|---|
| 200 | (none) | Serving normally. |
| 503 | not_ready | Boot-time WAL replay in progress (detail.replay_progress). |
| 503 | shutting_down | Graceful drain on SIGINT/SIGTERM. |
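A client that waits for the server to become ready could follow the table above as in this minimal sketch. The function name `wait_until_ready`, the injected `fetch` callable, and the attempt limit are all hypothetical. The sketch also assumes that Retry-After carries a number of seconds and that the header key is spelled exactly `Retry-After` in the mapping it receives; this page does not confirm either detail.

```python
import json
import time
from typing import Callable, Mapping, Tuple

# (status code, response headers, raw body) for one GET /v0/ready call.
ProbeResult = Tuple[int, Mapping[str, str], str]


def wait_until_ready(
    fetch: Callable[[], ProbeResult],
    sleep: Callable[[float], None] = time.sleep,
    max_attempts: int = 30,
    default_delay: float = 1.0,
) -> bool:
    """Poll readiness until it returns 200.

    Returns True once ready, False if the server reports shutting_down
    or max_attempts polls pass without a 200.
    """
    for _ in range(max_attempts):
        status, headers, body = fetch()
        if status == 200:
            return True
        # Read error.code from the 503 envelope; tolerate a malformed body.
        code = None
        try:
            code = json.loads(body)["error"]["code"]
        except (ValueError, KeyError, TypeError):
            pass
        # A draining server will not become ready again, so stop waiting.
        if code == "shutting_down":
            return False
        # Assumption: Retry-After is a delay in seconds, not an HTTP date.
        try:
            delay = float(headers.get("Retry-After", default_delay))
        except ValueError:
            delay = default_delay
        sleep(delay)
    return False
```

Passing `fetch` and `sleep` in as arguments keeps the loop independent of any HTTP library and makes it easy to exercise with canned responses.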
GET /v0/metrics — Prometheus / JSON metrics
Returns operational metrics: Prometheus text exposition (text/plain; version=0.0.4)
by default, or a JSON snapshot when you send Accept: application/json. Returns 200
always — even when the server is not yet ready, since metrics describe the recovering
process.
The metric surface is a full catalog, not a stub. /v0/metrics emits:
- Process/aggregate gauges: topics_topics, topics_topics_by_class{class=…}, topics_routers,
  topics_records_live, topics_bytes_live, topics_queue_topics,
  topics_queue_leases_in_flight, topics_sse_connections, topics_watch_sessions,
  topics_ready, topics_recovery_progress, topics_uptime_ms.
- Per-topic gauges, labelled {topic=…}: topics_topic_head_seq / _earliest_seq /
  _records_live / _bytes_live / _queue_ready / _queue_in_flight. The per-topic set is
  bounded; topics_topic_metrics_truncated flags when the cap is hit.
- WAL metrics: topics_wal_frames_total, _batches_total, _fsyncs_total,
  _bytes_written_total, _rotations_total, _queue_depth, _queue_depth_peak,
  _submit_full_total, _read_only.
- A fsync-latency histogram: topics_wal_fsync_latency_us (with _bucket{le=…} / _sum /
  _count).

There are no per-topic append/read/eviction/tombstone counters and no scheduler-throttle
metric. The performance block on every response remains the most detailed per-request
view.
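As an illustration of reading the histogram series, the sketch below parses a few lines of Prometheus text and computes the share of fsyncs that finished within a bucket bound. The helper names `parse_samples` and `fraction_at_or_below` are hypothetical, and the parser is deliberately simplified: it assumes no spaces inside label values and no trailing timestamps. Histogram buckets in the Prometheus format are cumulative, which is what makes the single division valid.

```python
from typing import Dict

# Illustrative input shaped like the series named above.
SAMPLE = """\
# HELP topics_wal_fsyncs_total WAL fsyncs.
# TYPE topics_wal_fsyncs_total counter
topics_wal_fsyncs_total 88241
topics_wal_fsync_latency_us_bucket{le="500"} 84120
topics_wal_fsync_latency_us_count 88241
"""


def parse_samples(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {series: value}.

    Simplified: skips comment lines, and splits each sample on its last space.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


def fraction_at_or_below(samples: Dict[str, float], metric: str, le: str) -> float:
    """Share of observations at or below the bucket bound `le`."""
    bucket = samples[f'{metric}_bucket{{le="{le}"}}']
    return bucket / samples[f"{metric}_count"]


samples = parse_samples(SAMPLE)
share = fraction_at_or_below(samples, "topics_wal_fsync_latency_us", "500")
print(f"{share:.1%} of fsyncs took 500 us or less")
```

For real scraping, an established Prometheus client or parser library is a better choice than this hand-rolled split.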
Prometheus text (default)
    curl localhost:4000/v0/metrics

    # HELP topics_topics Number of topics.
    # TYPE topics_topics gauge
    topics_topics 42
    # HELP topics_topic_head_seq Highest assigned seq per topic.
    # TYPE topics_topic_head_seq gauge
    topics_topic_head_seq{topic="orders"} 480231
    topics_topic_earliest_seq{topic="orders"} 468188
    # HELP topics_wal_fsyncs_total WAL fsyncs.
    # TYPE topics_wal_fsyncs_total counter
    topics_wal_fsyncs_total 88241
    topics_wal_fsync_latency_us_bucket{le="500"} 84120
    topics_wal_fsync_latency_us_count 88241

JSON snapshot (mirrors the same series in one object)
    curl localhost:4000/v0/metrics -H 'accept: application/json'

    { "topics": 42, "topics_memory": 3, "topics_disk": 30, "topics_fsync": 9,
      "routers": 5, "records_live": 1843201, "bytes_live": 734003200,
      "queue_topics": 2, "queue_leases_in_flight": 286,
      "sse_connections": 41, "watch_sessions": 44, "ready": true,
      "replay_progress": 1.0, "uptime_ms": 360123,
      "wal": { "fsyncs": 88241, "frames": 1843290, "batches": 90011,
               "bytes_written": 812340992, "rotations": 12, "queue_depth": 0,
               "queue_depth_peak": 1280, "submit_full_total": 0, "read_only": 0,
               "fsync_count": 88241, "fsync_micros_total": 441205000 } }

Metrics are auth-gated by default when keys are configured. Unlike /v0/health and
/v0/ready (which stay unauthenticated so a load balancer can poll liveness/readiness),
/v0/metrics exposes operational state and therefore needs a key with the read scope
when TOPICS_API_KEYS is set (a full-access key suffices). In dev mode (no keys) it is
open.
Auth on probes — TOPICS_PROBE_AUTH
By default the probes skip auth so an external load balancer can poll them without a credential:
- /v0/health and /v0/ready: always unauthenticated.
- /v0/metrics: auth-gated (read scope) when keys are configured; open in dev.
Set TOPICS_PROBE_AUTH=true to require auth on all three, including health and
readiness. Use this when even liveness/readiness should be behind a credential (for
example, the probes are reachable from an untrusted network). See
Configuration for the full env var set and
Observability for how to wire these into a monitoring stack.
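The auth rules in this section can be summarized as a small truth function, for example to decide whether a monitoring client must send a key. `requires_key` is a hypothetical helper, not server code. One case is an assumption rather than documented behavior: with no keys configured (dev mode), the sketch treats every endpoint as open even when TOPICS_PROBE_AUTH=true, since there is no key to check against.

```python
PROBE_PATHS = ("/v0/health", "/v0/ready")
METRICS_PATH = "/v0/metrics"


def requires_key(path: str, keys_configured: bool, probe_auth: bool) -> bool:
    """Return True if a request to `path` needs an API key.

    keys_configured: whether TOPICS_API_KEYS is set.
    probe_auth: whether TOPICS_PROBE_AUTH=true.
    """
    if path != METRICS_PATH and path not in PROBE_PATHS:
        raise ValueError(f"not an operational endpoint: {path}")
    # ASSUMPTION: in dev mode (no keys) nothing can be checked, so all
    # three endpoints are treated as open. The page only states this for
    # /v0/metrics.
    if not keys_configured:
        return False
    # Metrics need a read-scope key whenever keys are configured.
    if path == METRICS_PATH:
        return True
    # Health and readiness are open unless TOPICS_PROBE_AUTH=true.
    return probe_auth
```

Under these rules a load balancer polling /v0/ready needs a credential only when TOPICS_PROBE_AUTH=true, while a metrics scraper needs one in any deployment that configures keys.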
See also
- Observability: wiring probes and metrics into your stack.
- Configuration: TOPICS_PROBE_AUTH and the full env var set.
- Recovery: what WAL replay does behind the not_ready gate.
- Errors: the 503 envelope and the performance block.