WAL & Group Commit

The write-ahead log is the durability boundary of topics: an acknowledged write to a durable topic is a complete, checksum-valid frame on disk, and recovery loses only data that never reached the WAL. This page covers the frame format, the control-frame timeline, the sharded writer threads (each fed by its own MPSC channel), the adaptive group-commit window, and the off-lock fsync that lets many concurrent writers coalesce into one flush.

The WAL is sharded (TOPICS_WAL_SHARDS, default min(num_cpus, 8)): there are N independent shard writers, each an ordered append stream over its own file set with its own group-commit loop, so durable write throughput scales ~linearly with shard count. Each topic routes to exactly one shard by a stable hash of its interned id, so within a shard appends are still one ordered sequential stream (a deliberate match to the hardware — making group commit trivial and removing write-side lock contention), and per-topic ordering and every durability guarantee still hold because a topic always lives on the same shard. On disk each shard (when shards > 1) lives under wal/shard-NN/; TOPICS_WAL_SHARDS=1 is the flat single-writer layout. Recovery is shard-count-agnostic — it replays all shards by topic_id, so the shard count may change between restarts.

Frame format

Every record and every control mutation is a length-prefixed, checksum-protected frame. Multi-byte integers are little-endian. A multi-record write produces many frames committed as one batch.


 off  size  field
   0    4   frame_len   u32   bytes of this frame EXCLUDING this field
   4    1   type        u8    1=Append 2=TopicCreate 3=TopicDelete 4=RouterCreate
                                5=RouterDelete 6=Delete 7=EvictWatermark
                                8=CheckpointMark 9=ConfigUpdate 10=Lease
                                11=HeadWatermark
   5    1   flags       u8    bit0=has_tag bit1=has_node bit2=durable
   6    8   topic_id    u64   interned numeric topic id (name<->id in the meta store)
  14    8   seq         u64   server-assigned (0 for non-Append control frames)
  22    8   ts          u64   server commit ms
  30    2   node_len    u16
  32    2   tag_len     u16
  34    4   data_len    u32
  38    N   node        bytes (node_len)
   .    M   tag         bytes (tag_len)
   .    P   data+meta   bytes (data_len)   -- opaque JSON payload
   .    8   xxh3        u64   XXH3-64 over bytes [4 .. checksum_start)

Three properties make this crash-safe:

frame_len comes first. Recovery can validate frame boundaries without parsing the body. A frame_len that overruns the file is a torn tail — the write was interrupted — so the log ends there.
XXH3-64, not CRC32. The 8-byte trailer is an XXH3-64 hash over everything between frame_len and the checksum. It is fast and 64-bit, with roughly 2³² lower false-accept than a 32-bit CRC. A mismatch means a torn or partial frame — the logical end of the log. This is the crash-consistency anchor.
topic_id is an interned u64, not the string name, so frames stay small; the name↔id mapping lives in the metadata store.

The same XXH3-64 checksum protects WAL frames, segment frames, and metadata snapshots — one integrity scheme across the whole on-disk format. On the WAL a checksum mismatch means a torn tail and the log is truncated; on a sealed segment it means corruption, which is surfaced rather than silently truncated. See Crash Recovery.

Control frames: one ordered timeline

The WAL is not just record data. Config changes, deletes, and eviction watermarks are all control frames on the same ordered log, so there is exactly one truth — WAL order — for data and metadata together. The type byte selects the frame:

`type`	Frame	Records
`1`	`Append`	A record write (carries `seq`, `ts`, `node`, `tag`, payload).
`2`	`TopicCreate`	A topic created with its config.
`3`	`TopicDelete`	A topic dropped.
`4`	`RouterCreate`	A router defined.
`5`	`RouterDelete`	A router removed.
`6`	`Delete`	A permanent delete (`before_seq` and/or `match`), replayed deterministically on recovery.
`7`	`EvictWatermark`	A cap/TTL watermark advance, so the tombstone boundary survives restart.
`8`	`CheckpointMark`	A checkpoint barrier: which seqs are durably absorbed into segments (per-shard position).
`9`	`ConfigUpdate`	A topic config mutation.
`10`	`Lease`	A queue lease-log event (claim/ack/nack/extend), when `leases_durable`.
`11`	`HeadWatermark`	A `disk`-class fsynced seq-reservation ceiling, so an already-acked seq is never re-handed after a crash dropped the un-fsynced frame.

Because a Delete is logged as an operation (its before_seq / match predicate) rather than as a list of individual seqs, recovery re-derives the deleted set from the rebuilt index and tag index — the removal is replayed, not stored record-by-record.

The sharded writer threads

Appends funnel into the topic’s WAL-shard writer thread, each shard fed by its own bounded MPSC channel. The request path does the cheap, parallel work; the shard writer does the one thing that must be serialized per shard — touching that shard’s file.


write request (per HTTP task)
  -> validate; resolve topic_id; assign seq = head_seq.fetch_add(n)   (after a discard:"reject" cap check)
  -> serialize frame(s) into a reusable per-writer scratch buffer
  -> enqueue (frames, durability class, commit token) on the topic's WAL-shard MPSC channel
  -> [that shard's WAL writer] drain the channel, write() the batch, group-commit
  -> on commit: resolve the commit tokens; publish records into the in-memory index;
     Notify watchers
  -> respond { seqs, head_seq, performance }

The seq is assigned before the WAL commit so it can be returned in the response, but the record is only visible to readers and acked after its commit class is satisfied. The guarantee is one-directional and absolute: if a write was acked, it is in the WAL.

Adaptive group commit

Durability is per-topic across three commit classes, and each shard writer collapses them into one loop. See the durability classes for the consumer-facing semantics.

Topic class	Commit behavior
`memory`	The same group-committed WAL write path as `disk` (no special-casing) but best-effort: never fsync-gated and no durability guarantee — after a restart its records may survive or be lost. Lowest latency.
`disk` (`durable:false`)	The batch is `write()`-en into the page cache and acked immediately. A background `fdatasync` hardens the tail on a timer; writers never wait. Loss window on power loss = the un-fsynced tail.
`fsync` (`durable:true`)	The ack is held until the group `fdatasync()` returns. Frames are still group-committed: the shard writer coalesces all pending durable frames in the current window into one `write()` + one `fdatasync()`, then acks them all.

The commit window is adaptive between a floor and ceiling:

gc_min = 500 µs — a lone durable write fsyncs almost immediately, so group commit never penalizes a quiet workload.
gc_max = 10 ms — under load the window widens toward this ceiling, amortizing one fdatasync across hundreds of durable writes.

Three deliberate choices keep the fsync cost near the NVMe floor:

fdatasync, not fsync — no inode-metadata flush per commit.
64 MiB preallocated WAL files — the active file is fallocate’d to its full size up front, so an append never extends the inode; the next file is preallocated ahead of rotation, and rotation happens when a batch would exceed the active file’s size.
No O_DIRECT — the page cache is wanted for fast read-back and OS write coalescing; durability comes from the explicit fdatasync.

On NVMe a single fdatasync is roughly 50–500 µs, so a 500 µs floor barely registers when quiet and the 10 ms ceiling only matters under sustained durable load — exactly when one fsync is amortizing the most writes. On a laptop with an APFS NVMe the fdatasync floor is closer to ~5 ms; that is a hardware property of the disk, not a design cost. See Performance.

Off-lock fsync

The detail that makes durable throughput hold up under concurrency is where the fsync wait happens. The per-topic lock is held only for the microsecond-scale work; the slow part runs without it.

Under the per-topic lock (microseconds): assign the seq and enqueue the frame on the WAL channel. This is short, so it does not serialize unrelated writers for long.
Off the lock: the writer waits for the group fdatasync. Because the per-topic lock is already released, many concurrent writers’ fsync waits coalesce into one batch flush — the shard writer turns N pending durable frames into one write() + one fdatasync().
Publish in ticket order: each submission holds a commit token; tokens resolve in the order frames were enqueued, so a record becomes visible to readers and the response is returned in seq order, never out of order, even though the waits overlapped.

This is why a durable topic can sustain far more than 1 / fsync_latency writes per second: the fsyncs are shared, not serialized per-write. The measured coalescing factor is about 8.4× versus a per-write baseline (see Performance).

Each shard writer is fed by a bounded MPSC channel. If durable writes arrive faster than the disk can flush them, that bound is the backpressure point — it shows up as write-ack latency (and, at saturation, a 503), never as silent loss. The cardinal rule holds: degrade latency, never correctness.

What survives a crash

Because a frame is acked only after it is committed (and fdatasync’d, for a durable topic), and because a torn or partial frame fails its frame_len/XXH3 check on replay, the boundary is exact:

fsync topic: an acked write waited for fdatasync, so it is a complete checksum-valid frame ⇒ never lost.
disk topic: an acked write is in the page cache but may not be fsynced; a power loss can drop the un-fsynced tail — the documented fast-path tradeoff, surfacing on read as an ordinary eviction-style gap.
memory topic: written to the WAL on the same group-committed path as disk, but with no durability guarantee — after a restart its records may survive or be lost (best-effort, gradual recovery); the topic config always persists.

The full replay-and-truncate procedure is in Crash Recovery.