
Segments & Snapshots

The WAL is fast, append-ordered, and short-lived; it mixes every topic’s frames into one stream. To keep recovery bounded and reads efficient, a background compactor periodically applies WAL frames into per-topic segment files — the long-term store and the read source for diff reads. This page covers the checkpoint process, the .data/.idx segment format and its constant-time seq → entry arithmetic, the three sealing triggers, how reads choose mmap versus buffered I/O, the atomic metadata snapshot, and the full on-disk layout.

Segments are a derivable materialization of the WAL — the WAL remains the durability boundary. Losing a segment is recoverable by replay; losing a WAL frame is not.

Checkpointing the WAL into segments

The compactor runs on a timer (and on WAL rotation). For each topic with new frames since its last checkpoint:

    for each topic with new frames since last checkpoint:
        append those records, in seq order, to the topic's active segment file
            (segment frames are a buffered copy of contiguous WAL byte ranges,
             split by topic; no re-serialization)
        update the topic's .idx file
    fsync the touched .data + .idx files
    write a CheckpointMark frame to the WAL
        (per topic: highest seq checkpointed, watermarks, active-segment positions);
        fsync the WAL
    WAL files whose every frame's seq <= the global min checkpointed seq become deletable

Two things make this cheap. First, a segment record frame is byte-identical to a WAL Append frame minus the type byte (every segment frame is an Append), so the compactor copies contiguous byte ranges rather than re-serializing records. Second, the CheckpointMark is itself checksum-protected and fsynced, so a crash anywhere in the checkpoint is safe: WAL frames already absorbed into segments are replayed-and-skipped on restart (a seq already in the segment index is ignored). See Crash Recovery.
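The last step of the checkpoint, deciding which WAL files can go, can be sketched as follows. This is a minimal illustration, not the actual compactor: `WalFile`, `global_min_checkpointed`, and `deletable` are hypothetical names, and it assumes seqs are comparable across topics, as the "global min checkpointed seq" rule implies.

```rust
use std::collections::HashMap;

/// Hypothetical summary of one WAL file: its name and the highest seq of any frame in it.
#[derive(Debug, Clone, PartialEq)]
pub struct WalFile {
    pub name: String,
    pub max_seq: u64,
}

/// Global lower bound: the smallest "highest seq checkpointed" across all topics.
/// Keys are topic ids, values are checkpointed seqs. None if nothing is checkpointed yet.
pub fn global_min_checkpointed(per_topic: &HashMap<u64, u64>) -> Option<u64> {
    per_topic.values().copied().min()
}

/// A WAL file becomes deletable once every frame in it is at or below the bound,
/// which is equivalent to its highest seq being at or below the bound.
pub fn deletable<'a>(files: &'a [WalFile], per_topic: &HashMap<u64, u64>) -> Vec<&'a WalFile> {
    match global_min_checkpointed(per_topic) {
        Some(bound) => files.iter().filter(|f| f.max_seq <= bound).collect(),
        None => Vec::new(),
    }
}
```

Taking the minimum across topics is the conservative choice: a single topic that lags behind keeps every WAL file that might still hold one of its unabsorbed frames.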

Segment format: .data + .idx

Each topic owns a directory of numbered segment pairs, named by the segment’s first seq (seg-<first_seq>, zero-padded so they sort into seq order). A segment covers a contiguous range [first_seq, end_seq].

    seg-<first_seq>.data   append-ordered record frames (the §2.1 WAL Append frame minus the type byte:
                           frame_len + flags + seq/ts + node/tag/payload + XXH3-64)
    seg-<first_seq>.idx    fixed-stride 20 bytes/entry; entry i <=> seq (first_seq + i)
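To see why zero-padding matters for the naming scheme, here is a small sketch. The 20-digit width is an assumption (it is simply wide enough for any u64; the text does not state the real width), and `segment_stem` and `parse_first_seq` are hypothetical helpers.

```rust
/// Assumed padding width: u64::MAX has 20 decimal digits.
const SEQ_WIDTH: usize = 20;

/// Build the shared stem of a segment pair, e.g. "seg-00000000000000010000".
pub fn segment_stem(first_seq: u64) -> String {
    format!("seg-{:0width$}", first_seq, width = SEQ_WIDTH)
}

/// Recover first_seq from a stem; None if the name does not match the scheme.
pub fn parse_first_seq(stem: &str) -> Option<u64> {
    stem.strip_prefix("seg-")?.parse().ok()
}
```

With a fixed width, a plain lexicographic sort of the directory listing is also a sort by seq, so no numeric parsing is needed just to order segments.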

The .idx is the on-disk twin of the in-memory location vector — a fixed 20-byte stride per entry:

    // One .idx entry: 20 bytes, one per seq in the segment's range.
    struct IdxEntry {
        offset: u32, // byte offset of the record's frame within the .data file
        len: u32,    // framed length (read one record without touching its neighbors)
        ts: u64,     // server commit ms, kept inline for the TTL boundary binary search
        flags: u8,   // has_tag / has_node / deleted probe, without touching .data
        // 3 bytes of padding to the 20-byte stride
    }

Because the stride is fixed, seq → entry is pure arithmetic: entry i lives at byte offset (seq - first_seq) * 20. No scan, no per-record index structure — a direct seek. Three consequences fall out of that:

  • Fast restart. Rebuilding the in-memory index is a bulk sequential read of the .idx files, not a re-parse of all record data.
  • Cheap TTL boundary. The inline ts lets eviction binary-search the TTL crossing without touching .data.
  • Cheap presence probes. The inline flags answer tag/node/deleted questions without a data read.
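The arithmetic and the TTL search can be sketched over a raw .idx byte buffer. This is an illustration under stated assumptions: fields are taken to be little-endian and laid out in the order of the struct above, ts is taken to be non-decreasing in seq, and `entry_offset`, `read_entry`, `encode_entry`, and `ttl_boundary` are hypothetical names.

```rust
use std::convert::TryFrom;

pub const IDX_STRIDE: u64 = 20;

#[derive(Debug, Clone, Copy, PartialEq)]
pub struct IdxEntry {
    pub offset: u32,
    pub len: u32,
    pub ts: u64,
    pub flags: u8,
}

/// Byte offset of seq's entry in the .idx file: (seq - first_seq) * 20.
/// None if seq precedes the segment or the multiplication overflows.
pub fn entry_offset(first_seq: u64, seq: u64) -> Option<u64> {
    seq.checked_sub(first_seq)?.checked_mul(IDX_STRIDE)
}

/// Encode one entry (assumed little-endian); bytes 17..20 are zero padding.
pub fn encode_entry(e: &IdxEntry) -> [u8; 20] {
    let mut out = [0u8; 20];
    out[0..4].copy_from_slice(&e.offset.to_le_bytes());
    out[4..8].copy_from_slice(&e.len.to_le_bytes());
    out[8..16].copy_from_slice(&e.ts.to_le_bytes());
    out[16] = e.flags;
    out
}

/// Decode the entry for seq directly, with no scan. None if it is out of range.
pub fn read_entry(idx: &[u8], first_seq: u64, seq: u64) -> Option<IdxEntry> {
    let start = usize::try_from(entry_offset(first_seq, seq)?).ok()?;
    let raw = idx.get(start..start.checked_add(IDX_STRIDE as usize)?)?;
    let (mut a, mut b, mut c) = ([0u8; 4], [0u8; 4], [0u8; 8]);
    a.copy_from_slice(&raw[0..4]);
    b.copy_from_slice(&raw[4..8]);
    c.copy_from_slice(&raw[8..16]);
    Some(IdxEntry {
        offset: u32::from_le_bytes(a),
        len: u32::from_le_bytes(b),
        ts: u64::from_le_bytes(c),
        flags: raw[16],
    })
}

/// First seq whose ts is >= cutoff_ms, found by binary search over the inline ts.
/// Returns first_seq + entry_count when every entry is older than the cutoff.
pub fn ttl_boundary(idx: &[u8], first_seq: u64, cutoff_ms: u64) -> u64 {
    let (mut lo, mut hi) = (0u64, idx.len() as u64 / IDX_STRIDE);
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        let ts = read_entry(idx, first_seq, first_seq + mid).map_or(u64::MAX, |e| e.ts);
        if ts < cutoff_ms { lo = mid + 1 } else { hi = mid }
    }
    first_seq + lo
}
```

Neither function reads the .data file: the lookup is one multiplication plus a 20-byte read, and the TTL search touches O(log n) index entries.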

Every .data frame carries an XXH3-64 checksum, the same crash anchor as the WAL. A sealed segment is immutable, so a checksum mismatch there is genuine corruption and is surfaced — not silently truncated the way a torn WAL tail is. Only the active (still appended) segment can have a torn tail.

Sealing triggers

The newest segment is active (still appended); older segments are sealed and immutable. A segment seals when any one of three triggers fires:

Trigger                          Env var                      Default
Record count                     TOPICS_SEGMENT_MAX_EVENTS    10000
Byte size of .data               TOPICS_SEGMENT_MAX_BYTES     67108864 (64 MiB)
Age of an idle/partial segment   TOPICS_SEGMENT_MAX_AGE_MS    3600000 (1 h); 0 disables
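The "any one of three" rule can be sketched as a single check. The defaults come from the table above; everything else is an assumption made for illustration: the names `SealLimits`, `SealReason`, and `seal_reason`, the use of >= for each threshold, measuring age from when the segment was opened, and never sealing an empty segment on age alone.

```rust
#[derive(Debug, Clone, Copy)]
pub struct SealLimits {
    pub max_events: u64, // TOPICS_SEGMENT_MAX_EVENTS
    pub max_bytes: u64,  // TOPICS_SEGMENT_MAX_BYTES
    pub max_age_ms: u64, // TOPICS_SEGMENT_MAX_AGE_MS; 0 disables the age trigger
}

impl Default for SealLimits {
    fn default() -> Self {
        SealLimits { max_events: 10_000, max_bytes: 64 * 1024 * 1024, max_age_ms: 3_600_000 }
    }
}

#[derive(Debug, PartialEq)]
pub enum SealReason {
    Events,
    Bytes,
    Age,
}

/// Returns the first trigger that fires for the active segment, or None to keep appending.
pub fn seal_reason(l: &SealLimits, events: u64, data_bytes: u64, age_ms: u64) -> Option<SealReason> {
    if events >= l.max_events {
        Some(SealReason::Events)
    } else if data_bytes >= l.max_bytes {
        Some(SealReason::Bytes)
    } else if l.max_age_ms != 0 && events > 0 && age_ms >= l.max_age_ms {
        // Assumed: an empty segment is not sealed just because it is old.
        Some(SealReason::Age)
    } else {
        None
    }
}
```

The age trigger is what lets a quiet topic's partial segment become sealed, and therefore eligible for eviction and relocation, without waiting for the count or size limits.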

Sealing matters because it is the granularity of two lazy operations: cap/TTL eviction drops whole sealed segments whose highest seq is below earliest_seq (the active segment is never dropped), and the hot→cold relocator only ever moves sealed segments. See Storage & Tiering.

Serving reads: mmap vs buffered

A diff read is served from whichever layer still holds the record, chosen for locality:

  • Sealed segments → mmap (memmap2). The .data file is mapped once; each record is a zero-copy slice [offset .. offset+len], page-cache-backed. A diff bound-checks against evict_floor (tombstone?) and earliest_seq (live floor), slices the index range, and copies framed bytes out of the mmap, skipping deleted/expired/own-node slots, bounded by limit.
  • Active segment → buffered pread. The growing file is usually still in the page cache from the write, and mapping past EOF is unsafe, so the active segment is read with buffered reads rather than mmap.
  • Newest records (written, not yet checkpointed) → straight from WAL bytes, via the same location mechanism. A consumer a few milliseconds behind head reads from the WAL/page cache and never waits for a checkpoint — essential to the latency target.
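The three-way choice above can be sketched as a pure function of a seq and a small view of the topic. This is a simplified model, not the real read path: `ReadSource`, `TopicView`, and `source_for` are hypothetical, and the view is assumed to hold segment first_seqs in ascending order with the last one being the active segment.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum ReadSource {
    /// Sealed, immutable segment: serve as a slice of the mmap.
    SealedMmap { first_seq: u64 },
    /// Active, still-growing segment: serve with buffered positional reads.
    ActivePread { first_seq: u64 },
    /// Written but not yet checkpointed: serve from WAL bytes.
    Wal,
    /// Below the live floor: nothing to serve.
    Gone,
}

pub struct TopicView<'a> {
    /// first_seq of every segment, ascending; the last entry is the active segment.
    pub segment_first_seqs: &'a [u64],
    /// Highest seq already absorbed into segments.
    pub checkpointed_seq: u64,
    /// Live floor; seqs below it are evicted.
    pub earliest_seq: u64,
}

pub fn source_for(view: &TopicView, seq: u64) -> ReadSource {
    if seq < view.earliest_seq {
        return ReadSource::Gone;
    }
    if seq > view.checkpointed_seq {
        return ReadSource::Wal;
    }
    // Count segments starting at or before seq; the last of those contains it.
    let n = view.segment_first_seqs.partition_point(|&f| f <= seq);
    if n == 0 {
        return ReadSource::Gone;
    }
    let first_seq = view.segment_first_seqs[n - 1];
    if n == view.segment_first_seqs.len() {
        ReadSource::ActivePread { first_seq }
    } else {
        ReadSource::SealedMmap { first_seq }
    }
}
```

A real diff read spans a range of seqs and may cross all three layers in one call; the per-seq view here only shows how the layer is decided.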

Metadata snapshots

Topic config, the name↔id map, routers, per-topic watermarks, and the checkpoint lower bound live in a metadata store that mirrors the WAL philosophy: every mutation is a control frame in the WAL (ordered and crash-consistent with data), and a periodic snapshot lets recovery start without replaying the WAL from time zero.

    // The snapshotted metadata (compact bincode). Tiny and changes rarely.
    struct Meta {
        topics: HashMap<String, TopicId>,           // name -> interned u64 id (stable across restart)
        topic_cfg: HashMap<TopicId, TopicConfig>,
        watermarks: HashMap<TopicId, (u64, u64)>,   // persisted (evict_floor, earliest_seq) per topic
        delete_below: HashMap<TopicId, u64>,        // max before_seq applied (snapshot delete)
        routers: Vec<Router>,
        epochs: HashMap<TopicId, u64>,              // delete+recreate detection
        next_topic_id: u64,
        current_wal: String,
        last_checkpoint_seq: u64,                   // global lower bound for WAL replay
    }

A snapshot is triggered when either of two thresholds is crossed:

  • 64 MiB of WAL bytes written since the last snapshot, or
  • 60 seconds elapsed.

The write is atomic, so a crash anywhere in the sequence falls back cleanly to the previous snapshot:

    encode Meta
      -> write to snapshot-<n>.bin.tmp
      -> fsync the tmp file
      -> rename over the final name
      -> fsync the directory
      -> remove the previous snapshot (kept until the new one is durable)

The directory fsync after the rename is what makes the swap crash-atomic: until it returns, the rename may roll back on power loss, and recovery simply loads the older snapshot. The previous file is removed only after the new one is durably in place.
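The write sequence maps onto the Rust standard library roughly as follows. This is a sketch rather than the actual implementation: `write_snapshot` and `sync_dir` are hypothetical, the already-encoded Meta is passed in as bytes, and the directory fsync is done by opening the directory and calling `sync_all`, which works on Unix-like systems and is stubbed out elsewhere.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// fsync a directory so a rename inside it becomes durable (Unix-like systems).
#[cfg(unix)]
fn sync_dir(dir: &Path) -> io::Result<()> {
    File::open(dir)?.sync_all()
}

/// Placeholder for platforms where a directory cannot be opened this way.
#[cfg(not(unix))]
fn sync_dir(_dir: &Path) -> io::Result<()> {
    Ok(())
}

/// Write snapshot-<n>.bin atomically, then remove the previous snapshot if one is given.
pub fn write_snapshot(
    dir: &Path,
    n: u64,
    encoded: &[u8],
    previous: Option<&Path>,
) -> io::Result<PathBuf> {
    let final_path = dir.join(format!("snapshot-{}.bin", n));
    let tmp_path = dir.join(format!("snapshot-{}.bin.tmp", n));

    // 1. Write the encoded metadata to the tmp file and make its contents durable.
    let mut tmp = File::create(&tmp_path)?;
    tmp.write_all(encoded)?;
    tmp.sync_all()?;
    drop(tmp);

    // 2. Swap it into place, then make the rename itself durable.
    fs::rename(&tmp_path, &final_path)?;
    sync_dir(dir)?;

    // 3. Only now is it safe to drop the older snapshot.
    if let Some(prev) = previous {
        fs::remove_file(prev)?;
    }
    Ok(final_path)
}
```

If any step before the directory fsync fails or the process dies, the previous snapshot file is still present and untouched, which is the fallback described above.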

Deletes are not standing state. A permanent delete leaves no per-topic rule to persist — it is a one-shot Delete control frame, reflected immediately in the index and in the two persisted watermarks plus delete_below. The only deletion-related runtime structure is the per-topic tag index used to find matching seqs; it is rebuilt from live records on recovery, never snapshotted.

On-disk layout

Putting it together: the WAL is sharded (TOPICS_WAL_SHARDS, default min(num_cpus, 8)) into N independent ordered topics. Each topic maps to one shard, and each shard (when shards > 1) is its own wal/shard-NN/ directory with its own CURRENT + files. Segments are per-topic (independent eviction, per-topic mmap, locality for diff reads). Topic directories are named by the interned topic_id (hex), never the topic name, so there is no path traversal from a user-supplied name.

Snapshot files:

  • snapshot-7.bin (latest atomic snapshot)
  • snapshot-6.bin (previous, kept until next fsync)

When TOPICS_COLD_DIR is set, older sealed segments relocate to a mirror of the per-topic layout under that directory, keeping the same seg-<first_seq> identity. When it is unset (the default), tiering is disabled and nothing relocates — everything stays hot, behavior unchanged by construction. See Storage & Tiering.
