
Health & Metrics

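Three operational probes at the version root: liveness, readiness, and metrics. The liveness and readiness probes also have aliases at the server root (/healthz and /readyz) for load balancers. Liveness and readiness do not require auth by default, so a load balancer can poll them without a key; /v0/metrics requires a key once keys are configured. Set TOPICS_PROBE_AUTH=true to require auth on all three.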

Distinguish the two health states: liveness (/v0/health) answers “can this process serve at all?” and is 200 whenever the process is up. Readiness (/v0/ready) answers “should traffic be routed here now?” and returns 503 during WAL replay on boot and while draining on shutdown. Route real traffic on readiness; restart on liveness.
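
The routing rule above can be sketched as a small decision function. This is an illustration only, not part of the server: the name probe_action and its return labels are hypothetical, and the sketch assumes nothing beyond the status codes described here (200 from /v0/health while the process is up, 200 or 503 from /v0/ready).

```python
from typing import Optional


def probe_action(health_status: Optional[int], ready_status: Optional[int]) -> str:
    """Map probe results to an orchestrator action.

    A status of None means the probe request itself failed (timeout,
    connection refused). The returned strings are illustrative labels,
    not values the server emits.
    """
    # Liveness decides restarts: anything other than 200 means the
    # process cannot serve at all.
    if health_status != 200:
        return "restart"
    # Readiness decides routing: 503 covers WAL replay on boot and
    # draining on shutdown, so hold traffic without restarting.
    if ready_status == 200:
        return "route"
    return "hold"
```

A real orchestrator would usually apply a failure threshold (several consecutive failed liveness probes) before restarting, rather than acting on a single result.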

GET /v0/health — liveness

GET /v0/health

Alias: GET /healthz. Returns 200 always while the process can serve. Use it for the container/orchestrator liveness probe (restart on failure).

curl localhost:4000/v0/health
{ "status": "ok", "version": "0.1.0", "uptime_ms": 84012 }
Field      Type    Meaning
status     string  Always "ok" while serving.
version    string  The running server version.
uptime_ms  u64     Milliseconds since process start.

Health is 200 even during WAL replay — the process is alive and will become ready. Use /v0/ready (below) to gate traffic during replay; use /v0/health only to decide whether to restart the process.

GET /v0/ready — readiness

GET /v0/ready

Alias: GET /readyz. Returns 200 when the server is serving normally, and 503 while it is not yet (or no longer) able to take traffic. This is the load-balancer / Kubernetes readiness gate: on boot it stays 503 until WAL replay finishes, then flips to 200; on SIGINT/SIGTERM it returns 503 shutting_down while draining.

Ready (200)

curl localhost:4000/v0/ready
{ "status": "ready", "wal_replay_complete": true, "topics": 42 }
Field                Type    Meaning
status               string  "ready" when serving.
wal_replay_complete  bool    true once boot-time WAL replay has finished.
topics               u64     Number of topics currently registered.

Not ready — during WAL replay (503)

While replaying the write-ahead log on boot, the probe returns 503 not_ready with a Retry-After header and a replay-progress fraction in the error detail:

{ "error": { "code": "not_ready", "message": "WAL replay in progress", "detail": { "replay_progress": 0.62 } } }

error.detail.replay_progress runs from 0.0 to 1.0. The probe flips to 200 the moment replay completes. While draining on shutdown it returns 503 shutting_down (also with Retry-After).

Status  error.code     When
200     (none)         Serving normally.
503     not_ready      Boot-time WAL replay in progress (detail.replay_progress).
503     shutting_down  Graceful drain on SIGINT/SIGTERM.
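
A client that waits for readiness has to tell these three outcomes apart. The following is a minimal sketch of that interpretation step, under stated assumptions: parse_ready and the Readiness tuple are hypothetical names, the shutting_down response is assumed to use the same error envelope as not_ready, and Retry-After is assumed to carry a whole number of seconds rather than an HTTP date.

```python
import json
from typing import Mapping, NamedTuple, Optional


class Readiness(NamedTuple):
    ready: bool
    code: Optional[str]                 # error.code on 503, None on 200
    replay_progress: Optional[float]    # only present during WAL replay
    retry_after_s: Optional[float]      # parsed Retry-After, if usable


def parse_ready(status: int, headers: Mapping[str, str], body: str) -> Readiness:
    """Interpret a /v0/ready response using the shapes documented above."""
    if status == 200:
        return Readiness(True, None, None, None)
    error = json.loads(body).get("error", {})
    detail = error.get("detail") or {}
    # Assumption: Retry-After is delta-seconds; anything else is ignored.
    raw = headers.get("Retry-After")
    retry = float(raw) if raw is not None and raw.strip().isdigit() else None
    return Readiness(False, error.get("code"), detail.get("replay_progress"), retry)
```

A polling loop can then sleep for retry_after_s (or a default) while code is not_ready, and stop sending traffic immediately when code is shutting_down.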

GET /v0/metrics — Prometheus / JSON metrics

GET /v0/metrics

Returns operational metrics: Prometheus text exposition (text/plain; version=0.0.4) by default, or a JSON snapshot when you send Accept: application/json. Returns 200 always — even when the server is not yet ready, since metrics describe the recovering process.

The metric surface is a full catalog, not a stub. /v0/metrics emits:

  • Process/aggregate gauges: topics_topics, topics_topics_by_class{class=…}, topics_routers, topics_records_live, topics_bytes_live, topics_queue_topics, topics_queue_leases_in_flight, topics_sse_connections, topics_watch_sessions, topics_ready, topics_recovery_progress, topics_uptime_ms.
  • Per-topic gauges, labelled {topic=…}: topics_topic_head_seq, topics_topic_earliest_seq, topics_topic_records_live, topics_topic_bytes_live, topics_topic_queue_ready, topics_topic_queue_in_flight. The per-topic series are bounded; topics_topic_metrics_truncated flags when the cap applies.
  • WAL metrics: topics_wal_frames_total, topics_wal_batches_total, topics_wal_fsyncs_total, topics_wal_bytes_written_total, topics_wal_rotations_total, topics_wal_queue_depth, topics_wal_queue_depth_peak, topics_wal_submit_full_total, topics_wal_read_only.
  • An fsync-latency histogram: topics_wal_fsync_latency_us (with _bucket{le=…}, _sum, and _count series).

There are no per-topic append/read/eviction/tombstone counters and no scheduler-throttle metric. The performance block on every response remains the most detailed per-request view.

Prometheus text (default)

curl localhost:4000/v0/metrics
# HELP topics_topics Number of topics.
# TYPE topics_topics gauge
topics_topics 42
# HELP topics_topic_head_seq Highest assigned seq per topic.
# TYPE topics_topic_head_seq gauge
topics_topic_head_seq{topic="orders"} 480231
topics_topic_earliest_seq{topic="orders"} 468188
# HELP topics_wal_fsyncs_total WAL fsyncs.
# TYPE topics_wal_fsyncs_total counter
topics_wal_fsyncs_total 88241
topics_wal_fsync_latency_us_bucket{le="500"} 84120
topics_wal_fsync_latency_us_count 88241
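
For ad hoc checks without a Prometheus server, the text format can be read with a few lines of Python. This is a simplified sketch, not a full exposition-format parser: parse_samples is a hypothetical helper, and it assumes label values contain no escaped quotes, which holds for the sample above but not for the format in general.

```python
import re
from typing import Dict, Tuple

# Metric name, optional {labels}, then the value. Anything after the
# value (such as an optional timestamp) is ignored.
_SAMPLE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')

Labels = Tuple[Tuple[str, str], ...]


def parse_samples(text: str) -> Dict[Tuple[str, Labels], float]:
    """Parse Prometheus text exposition into {(name, labels): value}."""
    samples: Dict[Tuple[str, Labels], float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = _SAMPLE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(_LABEL.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples
```

With the sample output above, dividing the le="500" bucket (84120) by topics_wal_fsync_latency_us_count (88241) shows that roughly 95% of fsyncs completed within 500 µs.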

JSON snapshot (mirrors the same series in one object)

curl localhost:4000/v0/metrics -H 'accept: application/json'
{
  "topics": 42,
  "topics_memory": 3,
  "topics_disk": 30,
  "topics_fsync": 9,
  "routers": 5,
  "records_live": 1843201,
  "bytes_live": 734003200,
  "queue_topics": 2,
  "queue_leases_in_flight": 286,
  "sse_connections": 41,
  "watch_sessions": 44,
  "ready": true,
  "replay_progress": 1.0,
  "uptime_ms": 360123,
  "wal": {
    "fsyncs": 88241,
    "frames": 1843290,
    "batches": 90011,
    "bytes_written": 812340992,
    "rotations": 12,
    "queue_depth": 0,
    "queue_depth_peak": 1280,
    "submit_full_total": 0,
    "read_only": 0,
    "fsync_count": 88241,
    "fsync_micros_total": 441205000
  }
}
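
The JSON snapshot carries enough to derive a mean fsync latency without reading histogram buckets. The sketch below assumes, from the field names alone, that wal.fsync_micros_total is the cumulative fsync time in microseconds and wal.fsync_count is the number of fsyncs it covers; mean_fsync_latency_us is a hypothetical helper.

```python
import json


def mean_fsync_latency_us(snapshot: dict) -> float:
    """Average fsync latency in microseconds from a /v0/metrics JSON snapshot.

    Assumes wal.fsync_micros_total is cumulative fsync time in microseconds
    and wal.fsync_count is the number of fsyncs it covers.
    """
    wal = snapshot["wal"]
    count = wal["fsync_count"]
    if count == 0:
        return 0.0  # no fsyncs yet, so there is nothing to average
    return wal["fsync_micros_total"] / count


# Values taken from the sample snapshot above.
snapshot = json.loads('{"wal": {"fsync_count": 88241, "fsync_micros_total": 441205000}}')
print(mean_fsync_latency_us(snapshot))  # 5000.0
```

Because both fields are cumulative, the mean over a time window is the difference in fsync_micros_total between two snapshots divided by the difference in fsync_count.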

Metrics are auth-gated by default when keys are configured. Unlike /v0/health and /v0/ready (which stay unauthenticated so a load balancer can poll liveness/readiness), /v0/metrics exposes operational state and therefore needs a key with the read scope when TOPICS_API_KEYS is set (a full-access key suffices). In dev mode (no keys) it is open.
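
A scraper therefore needs to send a credential and, for the JSON form, an Accept header. The sketch below only builds the request. The Accept negotiation follows this page, but the credential header is an assumption: this page does not say how a key is presented, so the Authorization: Bearer form and the name metrics_request are hypothetical and should be checked against the authentication documentation.

```python
import urllib.request
from typing import Optional


def metrics_request(base_url: str, api_key: Optional[str] = None,
                    as_json: bool = False) -> urllib.request.Request:
    """Build (but do not send) a GET /v0/metrics request."""
    req = urllib.request.Request(base_url.rstrip("/") + "/v0/metrics", method="GET")
    # Content negotiation as documented: JSON on request, Prometheus text otherwise.
    if as_json:
        req.add_header("Accept", "application/json")
    # ASSUMPTION: the key is sent as a bearer token. This page does not
    # specify the credential header; verify the real scheme before use.
    if api_key is not None:
        req.add_header("Authorization", f"Bearer {api_key}")
    return req
```

Passing the result to urllib.request.urlopen sends it; in dev mode (no keys configured) api_key can be left as None.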

Auth on probes — TOPICS_PROBE_AUTH

By default the liveness and readiness probes skip auth so an external load balancer can poll them without a credential; metrics are the exception:

  • /v0/health and /v0/ready: unauthenticated unless TOPICS_PROBE_AUTH=true.
  • /v0/metrics: auth-gated (read scope) when keys are configured; open in dev.

Set TOPICS_PROBE_AUTH=true to require auth on all three, including health and readiness. Use this when even liveness/readiness should be behind a credential (for example, the probes are reachable from an untrusted network). See Configuration for the full env var set and Observability for how to wire these into a monitoring stack.

See also

  • Observability: wiring probes and metrics into your stack.
  • Configuration: TOPICS_PROBE_AUTH and the full env var set.
  • Recovery: what WAL replay does behind the not_ready gate.
  • Errors: the 503 envelope and the performance block.