Storage & Tiering
topics persists everything under one data directory: a sharded write-ahead log, periodic metadata snapshots, and per-topic segment files that are the long-term store and read source. Segments seal (become immutable) after a size, count, or age trigger. An optional cold tier relocates older sealed segments to a second directory — disabled by default, and governed by one hard invariant: cold reads never affect writes or live delivery.
The WAL is the durability boundary — “only data not yet in the WAL is lost.” Everything downstream (the in-memory index, segments, snapshots) is a derivable cache of the WAL plus checkpoints. See WAL & Group Commit and Recovery for the mechanics.
On-disk layout
Everything lives under TOPICS_DATA_DIR (default ./topics-data), the hot tier. Three
top-level subtrees: meta/, wal/, and topics/.
- .topics.lock
- meta/
  - snapshot.0007.bin
  - snapshot.0006.bin
- wal/
- topics/<topic-id>/
  - seg-0000000000000001.data
  - seg-0000000000000001.idx
  - seg-0000000000010001.data
  - seg-0000000000010001.idx
- wal/ — the WAL is sharded (TOPICS_WAL_SHARDS, default min(num_cpus, 8)): N independent, ordered, append-only, mixed-topic logs, each a single sequential stream (so group commit stays trivial per shard), with each topic mapped to exactly one shard by a stable hash of its id. When shards > 1, each shard is its own wal/shard-NN/ directory; TOPICS_WAL_SHARDS=1 is the flat wal/ layout. Within a shard, files are named wal-<first-frame-seq>.log (zero-padded); the highest-numbered is active, and the tiny CURRENT file (atomically renamed) names it. Files are preallocated at 64 MiB so appends never have to extend the file. Recovery is shard-count-agnostic (it replays all shards by topic_id), so the shard count may change between restarts.
- meta/ — atomic metadata snapshots (snapshot.<n>.bin): the topic name↔id mapping, per-topic config, the dual watermarks (evict_floor, earliest_seq), delete_below, routers, epochs, and the WAL replay floor. The previous snapshot is kept until the next is durably written. Snapshots are written via temp → fsync → rename → dir-fsync for a crash-atomic swap.
- topics/ — one directory per topic, named by the interned numeric topic id (hex), not the topic name (this is what makes path traversal impossible). Segments are per-topic so eviction, mmap, and read locality are independent across topics.
- .topics.lock — an advisory exclusive lock held by the running process. A second process pointed at the same data directory fails startup instead of appending to the same WAL.
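The temp → fsync → rename → dir-fsync sequence used for snapshots can be sketched as follows. This is a minimal illustration of the general technique, not the topics implementation: `write_atomic` and the `.tmp` suffix are hypothetical names, and fsyncing a directory by opening it as a file assumes a Unix-like platform.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Crash-atomic file replacement: temp -> fsync -> rename -> dir-fsync.
/// `write_atomic` is a hypothetical helper name, not part of topics.
pub fn write_atomic(dir: &Path, name: &str, bytes: &[u8]) -> io::Result<()> {
    // Assumed temp-file naming; the real suffix is not documented.
    let tmp = dir.join(format!("{}.tmp", name));
    let dst = dir.join(name);

    // 1. Write the full contents to a temp file in the same directory,
    //    so the later rename never crosses a filesystem boundary.
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;

    // 2. fsync the temp file so its data is durable before it becomes visible.
    f.sync_all()?;
    drop(f);

    // 3. Atomically swap it into place under the final name.
    fs::rename(&tmp, &dst)?;

    // 4. fsync the directory so the rename itself survives a crash.
    //    Opening a directory as a File works on Unix; other platforms differ.
    File::open(dir)?.sync_all()?;
    Ok(())
}
```

A reader that only ever opens the final name sees either the old snapshot or the complete new one, never a partial write, which is why the previous snapshot can be kept until the next is durable.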
The data directory holds every record’s payload in the clear (WAL frames and segment
.data files). topics does not encrypt payloads. Protect it with filesystem permissions and
at-rest disk encryption — see Security.
Segments
WAL frames are periodically checkpointed into per-topic segment files to keep recovery bounded and reads efficient. Each segment is a numbered pair, named by its first seq so the files sort into seq order:
| File | Contents |
|---|---|
| seg-<first_seq>.data | Append-ordered record frames (a close variant of the WAL frame: every frame is an Append, so there is no type byte). Each frame ends in an XXH3-64 checksum — the same crash anchor as the WAL. |
| seg-<first_seq>.idx | A fixed-stride index, 20 bytes per entry: [offset:u32, len:u32, ts:u64, flags:u8, pad:3]. Entry i ↔ seq first_seq + i. |
The .idx is the on-disk twin of the in-memory topic index. Because it is fixed-stride,
seq → entry is direct arithmetic — (seq − first_seq) × 20 — a seek, never a scan. That makes
rebuilding the in-memory index on restart a bulk read of .idx files rather than a re-parse
of all payload data. The inline ts enables binary search for the TTL boundary, and the inline
flags a cheap tag/node presence probe, without touching the .data file.
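The fixed-stride lookup and the timestamp binary search can be sketched as below. The field order and the 20-byte stride come from the table above; the little-endian byte order, the function names, and the assumption that ts never decreases with seq are illustrative guesses rather than documented facts.

```rust
/// Bytes per `.idx` entry: offset:u32 + len:u32 + ts:u64 + flags:u8 + 3 pad bytes.
pub const IDX_ENTRY_SIZE: u64 = 20;

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct IdxEntry {
    pub offset: u32,
    pub len: u32,
    pub ts: u64,
    pub flags: u8,
}

/// Byte position of `seq`'s entry inside the `.idx` file:
/// (seq - first_seq) * 20. Returns None if `seq` precedes the segment.
pub fn idx_entry_pos(first_seq: u64, seq: u64) -> Option<u64> {
    seq.checked_sub(first_seq)?.checked_mul(IDX_ENTRY_SIZE)
}

/// Decode one 20-byte entry. Little-endian encoding is an assumption.
pub fn decode_idx_entry(b: &[u8; 20]) -> IdxEntry {
    IdxEntry {
        offset: u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        len: u32::from_le_bytes([b[4], b[5], b[6], b[7]]),
        ts: u64::from_le_bytes([b[8], b[9], b[10], b[11], b[12], b[13], b[14], b[15]]),
        flags: b[16],
        // b[17..20] is padding and is ignored.
    }
}

/// Index of the first entry with ts >= cutoff_ts, found by binary search.
/// Assumes timestamps are non-decreasing in seq order.
pub fn first_live_index(entries: &[IdxEntry], cutoff_ts: u64) -> usize {
    entries.partition_point(|e| e.ts < cutoff_ts)
}
```

For a segment with first_seq 10001, seq 10004 sits at byte 60 of the `.idx` file, so a single positioned read of 20 bytes yields its offset and length in the `.data` file.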
Sealing triggers
The newest segment for a topic is “active” — still being appended. A segment seals (becomes immutable) when any of three triggers fires:
| Trigger | Default | Env knob |
|---|---|---|
| Record count | 10000 records | TOPICS_SEGMENT_MAX_EVENTS |
| .data size | 64 MiB | TOPICS_SEGMENT_MAX_BYTES |
| Age (idle/partial) | 1 h (3600000 ms) | TOPICS_SEGMENT_MAX_AGE_MS (0 disables) |
The byte trigger is a guard for topics with large payloads (the count trigger alone could let a segment grow huge); the age trigger seals an idle topic’s partial segment so it can be relocated or reclaimed.
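The three triggers combine with a logical OR. The sketch below uses the documented defaults; the struct and function names are hypothetical, and two details are assumptions: that thresholds are inclusive (`>=`), and that the age trigger only applies to a segment that already holds at least one record.

```rust
/// Sealing thresholds, mirroring the three env knobs above.
pub struct SealLimits {
    pub max_events: u64, // TOPICS_SEGMENT_MAX_EVENTS
    pub max_bytes: u64,  // TOPICS_SEGMENT_MAX_BYTES
    pub max_age_ms: u64, // TOPICS_SEGMENT_MAX_AGE_MS, 0 disables
}

impl Default for SealLimits {
    fn default() -> Self {
        SealLimits {
            max_events: 10_000,
            max_bytes: 64 * 1024 * 1024,
            max_age_ms: 3_600_000,
        }
    }
}

/// True when any trigger fires for the active segment.
/// Inclusive comparisons and the "non-empty" age rule are assumptions.
pub fn should_seal(limits: &SealLimits, events: u64, data_bytes: u64, age_ms: u64) -> bool {
    let by_count = events >= limits.max_events;
    let by_bytes = data_bytes >= limits.max_bytes;
    let by_age = limits.max_age_ms != 0 && events > 0 && age_ms >= limits.max_age_ms;
    by_count || by_bytes || by_age
}
```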
A sealed segment is immutable, and a checksum mismatch on it is treated as corruption — surfaced, not silently truncated. This differs from the WAL’s torn tail, where a checksum mismatch at the end is the expected logical end-of-log and is truncated cleanly on recovery.
Cap/TTL eviction and deletion reclaim disk at segment granularity, lazily: a whole sealed
segment whose highest seq is below the live floor is dropped (its .data + .idx unlinked); a
partially-deleted segment is reclaimed by a background rewrite. The active segment is never
dropped. A topic may therefore briefly retain slightly more than its cap (only whole sealed
segments drop) — the documented, accepted approximation. See
Segments & Snapshots.
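Whole-segment reclaim reduces to a filter over segment metadata. In this sketch `SegmentMeta` and `droppable_segments` are hypothetical names, and "below the live floor" is read as a strict `<` comparison, which is an assumption.

```rust
/// Minimal per-segment metadata for the reclaim decision (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SegmentMeta {
    pub first_seq: u64,
    pub last_seq: u64, // highest seq stored in the segment
    pub sealed: bool,  // false for the active segment
}

/// Returns the first_seq of every segment that can be unlinked whole:
/// sealed, and with its highest seq strictly below the live floor.
/// The active segment is never returned.
pub fn droppable_segments(segments: &[SegmentMeta], live_floor: u64) -> Vec<u64> {
    segments
        .iter()
        .filter(|s| s.sealed && s.last_seq < live_floor)
        .map(|s| s.first_seq)
        .collect()
}
```

With sealed segments covering seqs 1 to 10000 and 10001 to 20000 and a live floor of 15000, only the first segment drops; the second still holds live records, which is where the "slightly more than its cap" approximation comes from.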
Hot/cold tiering
A topic’s segments can be split across two tiers:
- HOT — the active segment plus recent sealed segments, on fast local NVMe (the per-topic directory under TOPICS_DATA_DIR). The live tail (active segment + in-memory index + a bounded recent-record cache) is always hot and independent of cold access.
- COLD — older sealed segments, on a slower tier. Today the cold tier is simply a different configured folder, TOPICS_COLD_DIR. (An object-store backend such as S3 is explicitly future work; only the local-folder cold store exists now. Both tiers sit behind one SegmentStore trait, so S3 drops in later without touching the engine.)
Tiering is disabled by default. When TOPICS_COLD_DIR is unset, nothing relocates and
behavior is unchanged by construction — every segment stays hot. You opt in by setting the cold
directory.
When enabled, the cold tier mirrors the per-topic layout under TOPICS_COLD_DIR using the same
seg-<first_seq> naming, so a relocated segment keeps its identity:
- seg-0000000000000001.data
- seg-0000000000000001.idx
Hot retention
Sealed segments beyond the hot-retention bound relocate to cold:
| Knob | Default | Meaning |
|---|---|---|
| TOPICS_HOT_RETAIN_SEGMENTS | 4 | Keep this many newest sealed segments hot before relocating older ones to cold. |
| TOPICS_HOT_RETAIN_BYTES | 0 (off) | Optional hot sealed-byte bound. When both are set, the stricter wins. |
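One way to read "the stricter wins" is: walk the sealed segments from newest to oldest and keep them hot until either bound would be exceeded, then relocate everything older. The sketch below encodes that reading; the type and function names are hypothetical, and the exact boundary behaviour of the real relocator is not documented here.

```rust
/// A sealed segment as seen by the retention check (hypothetical type).
pub struct SealedSegment {
    pub first_seq: u64,
    pub data_bytes: u64,
}

/// `sealed` must be ordered oldest -> newest. Returns the first_seq of each
/// segment that should relocate to cold, oldest first.
/// `retain_bytes == 0` means the byte bound is off.
pub fn segments_to_relocate(
    sealed: &[SealedSegment],
    retain_segments: usize,
    retain_bytes: u64,
) -> Vec<u64> {
    let mut kept = 0usize;
    let mut kept_bytes = 0u64;
    // Walk newest -> oldest, keeping segments hot until either bound is hit.
    for s in sealed.iter().rev() {
        let over_count = kept >= retain_segments;
        let over_bytes =
            retain_bytes != 0 && kept_bytes.saturating_add(s.data_bytes) > retain_bytes;
        if over_count || over_bytes {
            break;
        }
        kept += 1;
        kept_bytes = kept_bytes.saturating_add(s.data_bytes);
    }
    // Everything older than the kept suffix relocates.
    sealed[..sealed.len() - kept]
        .iter()
        .map(|s| s.first_seq)
        .collect()
}
```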
Relocation is crash-safe and idempotent: copy the segment to cold → fsync → durably flip the tier pointer → delete the hot copy. If interrupted, restart prefers the surviving copy (a per-topic tier resolver favors the HOT copy when both exist during the transient mid-relocation window), so a segment is never lost. Cap/TTL/delete reclaim drops a whole segment file in either tier. The WAL remains the durability boundary throughout; segments are a derivable materialization.
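The resolver rule for the mid-relocation window can be stated as a small decision function. This is an illustrative sketch: `Tier`, `resolve_tier`, and `resolve_segment` are hypothetical names, and the real resolver also consults the durable tier pointer, which is not modelled here.

```rust
use std::path::Path;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Tier {
    Hot,
    Cold,
}

/// Decide which copy of a segment to serve. When both copies exist
/// (relocation was interrupted before the hot copy was deleted),
/// the hot copy wins.
pub fn resolve_tier(hot_exists: bool, cold_exists: bool) -> Option<Tier> {
    match (hot_exists, cold_exists) {
        (true, _) => Some(Tier::Hot),
        (false, true) => Some(Tier::Cold),
        (false, false) => None,
    }
}

/// Convenience wrapper that probes both per-topic directories for a
/// segment's `.data` file name.
pub fn resolve_segment(hot_dir: &Path, cold_dir: &Path, data_name: &str) -> Option<Tier> {
    resolve_tier(
        hot_dir.join(data_name).exists(),
        cold_dir.join(data_name).exists(),
    )
}
```

Because the hot copy is deleted only after the cold copy is durable, at least one copy exists at every point in the sequence, and re-running the relocation after a crash repeats the same steps without harm.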
The hard invariant
Cold reads MAY degrade getDifference / historical reads, but MUST NOT affect writes or
live delivery (SSE / tail). Cold I/O and the relocator run on a separate blocking/IO
pool; they never hold a topic write lock or block an SSE push during a slow cold fetch. The
live tail is always hot.
This is the load-bearing rule of the tiering design. A consumer reading deep history from a slow
cold segment can be slower, and that latency surfaces honestly in the response (a
cold_segments_read field appears in the performance block when a read touched cold storage —
see Observability). But a writer appending to the topic, and a watcher
tailing the live head, are never slowed by that cold read. The in-memory index keeps memory
bounded by the index entry count, not the payload volume — older payloads are read from
segments on demand.
Segment config knobs
The complete set of storage-related environment variables. All are also documented in Configuration.
| Env var | Default | Meaning |
|---|---|---|
| TOPICS_DATA_DIR | ./topics-data | Hot tier + WAL + meta root. |
| TOPICS_WAL_SHARDS | min(num_cpus, 8) | Number of independent WAL shards (each its own writer thread / file set / group commit). Each topic maps to one shard; recovery is shard-count-agnostic. 1 = the flat single-writer layout. |
| TOPICS_COLD_DIR | (unset) | Cold tier root. Unset ⇒ tiering disabled (all hot). |
| TOPICS_SEGMENT_MAX_EVENTS | 10000 | Seal a segment after this many records. |
| TOPICS_SEGMENT_MAX_BYTES | 67108864 (64 MiB) | Seal after this many .data bytes. |
| TOPICS_SEGMENT_MAX_AGE_MS | 3600000 (1 h) | Seal an idle/partial segment after this age; 0 disables. |
| TOPICS_HOT_RETAIN_SEGMENTS | 4 | Newest sealed segments kept hot before relocating to cold. |
| TOPICS_HOT_RETAIN_BYTES | 0 (off) | Optional hot sealed-byte bound; stricter of the two wins. |
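As a reading aid, the table can be restated as a loader that applies the documented defaults. The struct, the function names, and the fall-back-to-default behaviour on unparsable values are assumptions for illustration; `std::thread::available_parallelism` stands in for `num_cpus`. The lookup is passed in as a closure so the defaults can be checked without touching the process environment.

```rust
use std::path::PathBuf;
use std::str::FromStr;

#[derive(Debug, PartialEq, Eq)]
pub struct StorageConfig {
    pub data_dir: PathBuf,
    pub wal_shards: usize,
    pub cold_dir: Option<PathBuf>, // None => tiering disabled
    pub segment_max_events: u64,
    pub segment_max_bytes: u64,
    pub segment_max_age_ms: u64, // 0 disables the age trigger
    pub hot_retain_segments: u64,
    pub hot_retain_bytes: u64, // 0 => byte bound off
}

/// Parse a numeric knob, falling back to `default` when unset or invalid
/// (the fallback-on-invalid behaviour is an assumption).
fn num<T: FromStr>(get: &dyn Fn(&str) -> Option<String>, key: &str, default: T) -> T {
    get(key).and_then(|v| v.trim().parse().ok()).unwrap_or(default)
}

/// Build the storage config from a key -> value lookup.
pub fn load_storage_config(get: &dyn Fn(&str) -> Option<String>) -> StorageConfig {
    let cpus = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    StorageConfig {
        data_dir: get("TOPICS_DATA_DIR")
            .map(PathBuf::from)
            .unwrap_or_else(|| PathBuf::from("./topics-data")),
        wal_shards: num(get, "TOPICS_WAL_SHARDS", cpus.min(8)),
        cold_dir: get("TOPICS_COLD_DIR")
            .filter(|s| !s.is_empty())
            .map(PathBuf::from),
        segment_max_events: num(get, "TOPICS_SEGMENT_MAX_EVENTS", 10_000),
        segment_max_bytes: num(get, "TOPICS_SEGMENT_MAX_BYTES", 67_108_864),
        segment_max_age_ms: num(get, "TOPICS_SEGMENT_MAX_AGE_MS", 3_600_000),
        hot_retain_segments: num(get, "TOPICS_HOT_RETAIN_SEGMENTS", 4),
        hot_retain_bytes: num(get, "TOPICS_HOT_RETAIN_BYTES", 0),
    }
}
```

Reading from the real environment is then `load_storage_config(&|k: &str| std::env::var(k).ok())`.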
See also
- Segments & Snapshots — how checkpointing, sealing, and reclaim work internally.
- WAL & Group Commit — the durability boundary and the commit path.
- Recovery — how the data directory is replayed on restart.
- Configuration — the full TOPICS_* env-var set.
- Durability — the ephemeral/memory/disk/fsync commit classes.