WAL & Group Commit
The write-ahead log is the durability boundary of topics: an acknowledged write to a durable topic is a complete, checksum-valid frame on disk, and recovery loses only data that never reached the WAL. This page covers the frame format, the control-frame timeline, the sharded writer threads (each fed by its own MPSC channel), the adaptive group-commit window, and the off-lock fsync that lets many concurrent writers coalesce into one flush.
The WAL is sharded (TOPICS_WAL_SHARDS, default min(num_cpus, 8)): there are N
independent shard writers, each an ordered append stream over its own file set with its own
group-commit loop, so durable write throughput scales ~linearly with shard count. Each topic
routes to exactly one shard by a stable hash of its interned id, so within a shard appends
are still one ordered sequential stream (a deliberate match to the hardware — making group
commit trivial and removing write-side lock contention), and per-topic ordering and every
durability guarantee still hold because a topic always lives on the same shard. On disk each
shard (when shards > 1) lives under wal/shard-NN/; TOPICS_WAL_SHARDS=1 is the flat
single-writer layout. Recovery is shard-count-agnostic — it replays all shards by
topic_id, so the shard count may change between restarts.
Frame format
Every record and every control mutation is a length-prefixed, checksum-protected frame. Multi-byte integers are little-endian. A multi-record write produces many frames committed as one batch.
off size field
0 4 frame_len u32 bytes of this frame EXCLUDING this field
4 1 type u8 1=Append 2=TopicCreate 3=TopicDelete 4=RouterCreate
5=RouterDelete 6=Delete 7=EvictWatermark
8=CheckpointMark 9=ConfigUpdate 10=Lease
11=HeadWatermark
5 1 flags u8 bit0=has_tag bit1=has_node bit2=durable
6 8 topic_id u64 interned numeric topic id (name<->id in the meta store)
14 8 seq u64 server-assigned (0 for non-Append control frames)
22 8 ts u64 server commit ms
30 2 node_len u16
32 2 tag_len u16
34 4 data_len u32
38 N node bytes (node_len)
. M tag bytes (tag_len)
. P data+meta bytes (data_len) -- opaque JSON payload
. 8 xxh3 u64 XXH3-64 over bytes [4 .. checksum_start)Three properties make this crash-safe:
frame_lencomes first. Recovery can validate frame boundaries without parsing the body. Aframe_lenthat overruns the file is a torn tail — the write was interrupted — so the log ends there.- XXH3-64, not CRC32. The 8-byte trailer is an XXH3-64
hash over everything between
frame_lenand the checksum. It is fast and 64-bit, with roughly 2³² lower false-accept than a 32-bit CRC. A mismatch means a torn or partial frame — the logical end of the log. This is the crash-consistency anchor. topic_idis an internedu64, not the string name, so frames stay small; the name↔id mapping lives in the metadata store.
The same XXH3-64 checksum protects WAL frames, segment frames, and metadata snapshots — one integrity scheme across the whole on-disk format. On the WAL a checksum mismatch means a torn tail and the log is truncated; on a sealed segment it means corruption, which is surfaced rather than silently truncated. See Crash Recovery.
Control frames: one ordered timeline
The WAL is not just record data. Config changes, deletes, and eviction watermarks are all
control frames on the same ordered log, so there is exactly one truth — WAL order — for
data and metadata together. The type byte selects the frame:
type | Frame | Records |
|---|---|---|
1 | Append | A record write (carries seq, ts, node, tag, payload). |
2 | TopicCreate | A topic created with its config. |
3 | TopicDelete | A topic dropped. |
4 | RouterCreate | A router defined. |
5 | RouterDelete | A router removed. |
6 | Delete | A permanent delete (before_seq and/or match), replayed deterministically on recovery. |
7 | EvictWatermark | A cap/TTL watermark advance, so the tombstone boundary survives restart. |
8 | CheckpointMark | A checkpoint barrier: which seqs are durably absorbed into segments (per-shard position). |
9 | ConfigUpdate | A topic config mutation. |
10 | Lease | A queue lease-log event (claim/ack/nack/extend), when leases_durable. |
11 | HeadWatermark | A disk-class fsynced seq-reservation ceiling, so an already-acked seq is never re-handed after a crash dropped the un-fsynced frame. |
Because a Delete is logged as an operation (its before_seq / match predicate) rather
than as a list of individual seqs, recovery re-derives the deleted set from the rebuilt
index and tag index — the removal is replayed, not stored record-by-record.
The sharded writer threads
Appends funnel into the topic’s WAL-shard writer thread, each shard fed by its own bounded MPSC channel. The request path does the cheap, parallel work; the shard writer does the one thing that must be serialized per shard — touching that shard’s file.
write request (per HTTP task)
-> validate; resolve topic_id; assign seq = head_seq.fetch_add(n) (after a discard:"reject" cap check)
-> serialize frame(s) into a reusable per-writer scratch buffer
-> enqueue (frames, durability class, commit token) on the topic's WAL-shard MPSC channel
-> [that shard's WAL writer] drain the channel, write() the batch, group-commit
-> on commit: resolve the commit tokens; publish records into the in-memory index;
Notify watchers
-> respond { seqs, head_seq, performance }The seq is assigned before the WAL commit so it can be returned in the response, but
the record is only visible to readers and acked after its commit class is satisfied. The
guarantee is one-directional and absolute: if a write was acked, it is in the WAL.
Adaptive group commit
Durability is per-topic across three commit classes, and each shard writer collapses them into one loop. See the durability classes for the consumer-facing semantics.
| Topic class | Commit behavior |
|---|---|
memory | The same group-committed WAL write path as disk (no special-casing) but best-effort: never fsync-gated and no durability guarantee — after a restart its records may survive or be lost. Lowest latency. |
disk (durable:false) | The batch is write()-en into the page cache and acked immediately. A background fdatasync hardens the tail on a timer; writers never wait. Loss window on power loss = the un-fsynced tail. |
fsync (durable:true) | The ack is held until the group fdatasync() returns. Frames are still group-committed: the shard writer coalesces all pending durable frames in the current window into one write() + one fdatasync(), then acks them all. |
The commit window is adaptive between a floor and ceiling:
gc_min= 500 µs — a lone durable write fsyncs almost immediately, so group commit never penalizes a quiet workload.gc_max= 10 ms — under load the window widens toward this ceiling, amortizing onefdatasyncacross hundreds of durable writes.
Three deliberate choices keep the fsync cost near the NVMe floor:
fdatasync, notfsync— no inode-metadata flush per commit.- 64 MiB preallocated WAL files — the active file is
fallocate’d to its full size up front, so an append never extends the inode; the next file is preallocated ahead of rotation, and rotation happens when a batch would exceed the active file’s size. - No
O_DIRECT— the page cache is wanted for fast read-back and OS write coalescing; durability comes from the explicitfdatasync.
On NVMe a single fdatasync is roughly 50–500 µs, so a 500 µs floor barely registers when
quiet and the 10 ms ceiling only matters under sustained durable load — exactly when one
fsync is amortizing the most writes. On a laptop with an APFS NVMe the fdatasync floor is
closer to ~5 ms; that is a hardware property of the disk, not a design cost. See
Performance.
Off-lock fsync
The detail that makes durable throughput hold up under concurrency is where the fsync wait happens. The per-topic lock is held only for the microsecond-scale work; the slow part runs without it.
- Under the per-topic lock (microseconds): assign the seq and enqueue the frame on the WAL channel. This is short, so it does not serialize unrelated writers for long.
- Off the lock: the writer waits for the group
fdatasync. Because the per-topic lock is already released, many concurrent writers’ fsync waits coalesce into one batch flush — the shard writer turns N pending durable frames into onewrite()+ onefdatasync(). - Publish in ticket order: each submission holds a commit token; tokens resolve in the order frames were enqueued, so a record becomes visible to readers and the response is returned in seq order, never out of order, even though the waits overlapped.
This is why a durable topic can sustain far more than 1 / fsync_latency writes per second:
the fsyncs are shared, not serialized per-write. The measured coalescing factor is about
8.4× versus a per-write baseline (see Performance).
Each shard writer is fed by a bounded MPSC channel. If durable writes arrive faster
than the disk can flush them, that bound is the backpressure point — it shows up as
write-ack latency (and, at saturation, a 503), never as silent loss. The
cardinal rule holds: degrade latency, never correctness.
What survives a crash
Because a frame is acked only after it is committed (and fdatasync’d, for a durable topic),
and because a torn or partial frame fails its frame_len/XXH3 check on replay, the boundary
is exact:
fsynctopic: an acked write waited forfdatasync, so it is a complete checksum-valid frame ⇒ never lost.disktopic: an acked write is in the page cache but may not be fsynced; a power loss can drop the un-fsynced tail — the documented fast-path tradeoff, surfacing on read as an ordinary eviction-style gap.memorytopic: written to the WAL on the same group-committed path asdisk, but with no durability guarantee — after a restart its records may survive or be lost (best-effort, gradual recovery); the topic config always persists.
The full replay-and-truncate procedure is in Crash Recovery.
See also
- Segments & Snapshots — how WAL frames are checkpointed into the long-term per-topic store.
- Crash Recovery — snapshot load, WAL replay, and torn-tail truncation.
- Durability — the
ephemeral/memory/disk/fsyncclasses from the consumer’s side. - Performance — measured fsync latency and group-commit throughput.