Health & Metrics

Three operational probes live at the version root, with aliases at the server root for load balancers. Liveness and readiness do not require auth by default, so a load balancer can poll them without a key; metrics requires a key with the read scope when keys are configured. Set TOPICS_PROBE_AUTH=true to require auth on all three.

Distinguish the two health states: liveness (/v0/health) answers “can this process serve at all?” and is 200 whenever the process is up. Readiness (/v0/ready) answers “should traffic be routed here now?” and returns 503 during WAL replay on boot and while draining on shutdown. Route real traffic on readiness; restart on liveness.
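The split between the two probes can be sketched as a small decision function. This is a minimal illustration rather than part of the server: the name `probe_action` and the three action labels are hypothetical, and the inputs are simply the HTTP status codes observed on /v0/health and /v0/ready (None when the request failed outright).

```python
from typing import Optional


def probe_action(health_status: Optional[int], ready_status: Optional[int]) -> str:
    """Map liveness/readiness probe results to an orchestrator action.

    The action labels are illustrative and not defined by the server.
    """
    if health_status != 200:
        # Liveness failed: the process cannot serve at all, so restart it.
        return "restart"
    if ready_status != 200:
        # Alive but replaying the WAL or draining: keep it out of rotation.
        return "hold_traffic"
    # Alive and ready: safe to route requests here.
    return "route_traffic"
```

A process that answers 200 on liveness but 503 on readiness is held out of rotation rather than restarted, which is the behaviour wanted during WAL replay.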

GET /v0/health — liveness

GET /v0/health

Alias: GET /healthz. Always returns 200 while the process can serve. Use it for the container/orchestrator liveness probe (restart on failure).


curl localhost:4000/v0/health


{ "status": "ok", "version": "0.1.0", "uptime_ms": 84012 }

Field	Type	Meaning
`status`	string	Always `"ok"` while serving.
`version`	string	The running server version.
`uptime_ms`	`u64`	Milliseconds since process start.

Health is 200 even during WAL replay — the process is alive and will become ready. Use /v0/ready (below) to gate traffic during replay; use /v0/health only to decide whether to restart the process.

GET /v0/ready — readiness

GET /v0/ready

Alias: GET /readyz. Returns 200 when the server is serving normally, and 503 while it is not yet (or no longer) able to take traffic. This is the load-balancer / Kubernetes readiness gate: on boot it stays 503 until WAL replay finishes, then flips to 200; on SIGINT/SIGTERM it returns 503 shutting_down while draining.

Ready (200)


curl localhost:4000/v0/ready


{ "status": "ready", "wal_replay_complete": true, "topics": 42 }

Field	Type	Meaning
`status`	string	`"ready"` when serving.
`wal_replay_complete`	bool	`true` once boot-time WAL replay has finished.
`topics`	`u64`	Number of topics currently registered.

Not ready — during WAL replay (503)

While replaying the write-ahead log on boot, the probe returns 503 not_ready with a Retry-After header and a replay-progress fraction in the error detail:


{ "error": {
    "code": "not_ready",
    "message": "WAL replay in progress",
    "detail": { "replay_progress": 0.62 } } }

error.detail.replay_progress runs 0.0–1.0. The probe flips to 200 the moment replay completes. While draining on shutdown it returns 503 shutting_down (also with Retry-After).

Status	`error.code`	When
`200`	—	Serving normally.
`503`	`not_ready`	Boot-time WAL replay in progress (`detail.replay_progress`).
`503`	`shutting_down`	Graceful drain on `SIGINT`/`SIGTERM`.
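A client that polls readiness might interpret the three outcomes in the table roughly as follows. This is a sketch under stated assumptions: `Readiness` and `parse_ready` are hypothetical names, and the code assumes Retry-After carries a number of seconds (the header may also be an HTTP date, which this sketch ignores by returning None).

```python
import json
from typing import Mapping, NamedTuple, Optional


class Readiness(NamedTuple):
    ready: bool
    reason: Optional[str]             # error.code when not ready
    replay_progress: Optional[float]  # 0.0-1.0 during WAL replay
    retry_after_s: Optional[float]    # parsed Retry-After, if numeric


def parse_ready(status: int, headers: Mapping[str, str], body: str) -> Readiness:
    """Interpret a /v0/ready response (status, headers, raw body)."""
    if status == 200:
        return Readiness(True, None, None, None)
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        payload = {}
    error = payload.get("error", {}) if isinstance(payload, dict) else {}
    detail = error.get("detail") or {}
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    raw = lowered.get("retry-after")
    try:
        # Assumption: the value is a delay in seconds.
        retry = float(raw) if raw is not None else None
    except ValueError:
        retry = None
    return Readiness(False, error.get("code"), detail.get("replay_progress"), retry)
```

A poller can then sleep for `retry_after_s` (or a default when it is None) before asking again, and log `replay_progress` while `reason` is `not_ready`.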

GET /v0/metrics — Prometheus / JSON metrics

GET /v0/metrics

Returns operational metrics: Prometheus text exposition (text/plain; version=0.0.4) by default, or a JSON snapshot when you send Accept: application/json. Returns 200 always — even when the server is not yet ready, since metrics describe the recovering process.

The metric surface is a full catalog, not a stub. /v0/metrics emits:

Process/aggregate gauges: topics_topics, topics_topics_by_class{class=…}, topics_routers, topics_records_live, topics_bytes_live, topics_queue_topics, topics_queue_leases_in_flight, topics_sse_connections, topics_watch_sessions, topics_ready, topics_recovery_progress, topics_uptime_ms.
Per-topic gauges, labelled {topic=…}: topics_topic_head_seq, topics_topic_earliest_seq, topics_topic_records_live, topics_topic_bytes_live, topics_topic_queue_ready, topics_topic_queue_in_flight. The per-topic set is bounded; topics_topic_metrics_truncated flags when the cap has been applied.
WAL metrics: topics_wal_frames_total, topics_wal_batches_total, topics_wal_fsyncs_total, topics_wal_bytes_written_total, topics_wal_rotations_total, topics_wal_queue_depth, topics_wal_queue_depth_peak, topics_wal_submit_full_total, topics_wal_read_only.
An fsync-latency histogram: topics_wal_fsync_latency_us (with _bucket{le=…} / _sum / _count).

There are no per-topic append/read/eviction/tombstone counters and no scheduler-throttle metric. The performance block on every response remains the most detailed per-request view.

Prometheus text (default)


curl localhost:4000/v0/metrics


# HELP topics_topics Number of topics.
# TYPE topics_topics gauge
topics_topics 42
# HELP topics_topic_head_seq Highest assigned seq per topic.
# TYPE topics_topic_head_seq gauge
topics_topic_head_seq{topic="orders"} 480231
topics_topic_earliest_seq{topic="orders"} 468188
# HELP topics_wal_fsyncs_total WAL fsyncs.
# TYPE topics_wal_fsyncs_total counter
topics_wal_fsyncs_total 88241
topics_wal_fsync_latency_us_bucket{le="500"} 84120
topics_wal_fsync_latency_us_count 88241
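For quick scripting against the text format, a few lines are enough to pull sample values out of output like the above. This is a deliberately minimal sketch, not a full Prometheus parser: `parse_prometheus` is a hypothetical helper that keys each sample by its series text exactly as written, and it assumes lines carry no trailing timestamp.

```python
from typing import Dict


def parse_prometheus(text: str) -> Dict[str, float]:
    """Parse 'series value' sample lines, skipping comments and blanks."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and empty lines
        # Split on the last space so label values containing spaces survive.
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # not a 'series value' pair
        samples[series] = float(value)
    return samples


def fraction_at_or_below(samples: Dict[str, float], metric: str, le: str) -> float:
    """Share of histogram observations at or below the given bucket bound."""
    bucket = samples[f'{metric}_bucket{{le="{le}"}}']
    return bucket / samples[f"{metric}_count"]
```

With the sample output above, `fraction_at_or_below(samples, "topics_wal_fsync_latency_us", "500")` gives the share of fsyncs that completed within 500 µs (84120 of 88241). For anything beyond ad hoc checks, a real Prometheus client library or scraper is the better tool.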

JSON snapshot (mirrors the same series in one object)


curl localhost:4000/v0/metrics -H 'accept: application/json'


{ "topics": 42, "topics_memory": 3, "topics_disk": 30, "topics_fsync": 9,
  "routers": 5, "records_live": 1843201, "bytes_live": 734003200,
  "queue_topics": 2, "queue_leases_in_flight": 286,
  "sse_connections": 41, "watch_sessions": 44, "ready": true,
  "replay_progress": 1.0, "uptime_ms": 360123,
  "wal": { "fsyncs": 88241, "frames": 1843290, "batches": 90011,
           "bytes_written": 812340992, "rotations": 12, "queue_depth": 0,
           "queue_depth_peak": 1280, "submit_full_total": 0, "read_only": 0,
           "fsync_count": 88241, "fsync_micros_total": 441205000 } }

Metrics are auth-gated by default when keys are configured. Unlike /v0/health and /v0/ready (which stay unauthenticated so a load balancer can poll liveness/readiness), /v0/metrics exposes operational state and therefore needs a key with the read scope when TOPICS_API_KEYS is set (a full-access key suffices). In dev mode (no keys) it is open.

Auth on probes — `TOPICS_PROBE_AUTH`

By default the liveness and readiness probes skip auth so an external load balancer can poll them without a credential:

/v0/health and /v0/ready: unauthenticated by default.
/v0/metrics: auth-gated (read scope) when keys are configured; open in dev.

Set TOPICS_PROBE_AUTH=true to require auth on all three, including health and readiness. Use this when even liveness/readiness should be behind a credential (for example, the probes are reachable from an untrusted network). See Configuration for the full env var set and Observability for how to wire these into a monitoring stack.
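The rules in this section can be summarised as a small truth function. It is a sketch of the documented behaviour, not the server's code: `probe_requires_auth` is a hypothetical name, and one case is an assumption rather than something stated here, namely that with no keys configured (dev mode) every probe stays open even when TOPICS_PROBE_AUTH=true.

```python
LIVENESS_READINESS = {"/v0/health", "/healthz", "/v0/ready", "/readyz"}


def probe_requires_auth(path: str, keys_configured: bool, probe_auth: bool = False) -> bool:
    """Return True if a request to the given probe path needs a key.

    keys_configured mirrors whether TOPICS_API_KEYS is set;
    probe_auth mirrors TOPICS_PROBE_AUTH=true.
    """
    if path not in LIVENESS_READINESS and path != "/v0/metrics":
        raise ValueError(f"not a probe path: {path}")
    if not keys_configured:
        # Dev mode. Metrics is documented as open here; treating health and
        # readiness the same way under TOPICS_PROBE_AUTH is an assumption.
        return False
    if path == "/v0/metrics":
        # Metrics needs a read-scope key whenever keys are configured.
        return True
    # Health and readiness are open unless TOPICS_PROBE_AUTH=true.
    return probe_auth
```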