Storage & Tiering

topics persists everything under one data directory: a sharded write-ahead log, periodic metadata snapshots, and per-topic segment files that are the long-term store and read source. Segments seal (become immutable) after a size, count, or age trigger. An optional cold tier relocates older sealed segments to a second directory — disabled by default, and governed by one hard invariant: cold reads never affect writes or live delivery.

The WAL is the durability boundary — “only data not yet in the WAL is lost.” Everything downstream (the in-memory index, segments, snapshots) is a derivable cache of the WAL plus checkpoints. See WAL & Group Commit and Recovery for the mechanics.

On-disk layout

Everything lives under TOPICS_DATA_DIR (default ./topics-data), the hot tier. Three top-level subtrees: meta/, wal/, and topics/.

  • topics-data/ (TOPICS_DATA_DIR)
    • .topics.lock
    • meta/
      • snapshot.0007.bin
      • snapshot.0006.bin
    • wal/
    • topics/
      • <topic-id-hex>/
        • seg-0000000000000001.data
        • seg-0000000000000001.idx
        • seg-0000000000010001.data
        • seg-0000000000010001.idx

What each entry holds:
  • wal/: the WAL is sharded (TOPICS_WAL_SHARDS, default min(num_cpus, 8)) into N independent, ordered, append-only, mixed-topic logs. Each shard is a single sequential stream, so group commit stays trivial per shard, and each topic maps to exactly one shard by a stable hash of its id. When shards > 1, each shard has its own wal/shard-NN/ directory; TOPICS_WAL_SHARDS=1 uses the flat wal/ layout. Within a shard, files are named wal-<first-frame-seq>.log (zero-padded); the highest-numbered file is active, and the tiny CURRENT file (atomically renamed) names it. Files are preallocated at 64 MiB so appends never grow the file. Recovery is shard-count-agnostic (it replays all shards by topic_id), so the shard count may change between restarts.
  • meta/: atomic metadata snapshots (snapshot.<n>.bin) holding the topic name↔id mapping, per-topic config, the dual watermarks (evict_floor, earliest_seq), delete_below, routers, epochs, and the WAL replay floor. The previous snapshot is kept until the next is durably written. Snapshots are written via temp → fsync → rename → dir-fsync for a crash-atomic swap.
  • topics/: one directory per topic, named by the interned numeric topic id (hex), not the topic name, which is what makes path traversal impossible. Segments are per-topic so eviction, mmap, and read locality are independent across topics.
  • .topics.lock: an advisory exclusive lock held by the running process. A second process pointed at the same data directory fails at startup instead of appending to the same WAL.
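
Two of the mechanisms in the list above can be sketched in Python. This is an illustration under stated assumptions, not the project's implementation: the source says only that topics map to shards by "a stable hash" and that snapshots go through temp → fsync → rename → dir-fsync. The CRC32 hash, the function names, and the `.tmp` suffix are hypothetical choices, and the directory fsync assumes a POSIX system.

```python
import os
import zlib


def shard_for_topic(topic_id: int, shards: int) -> int:
    """Map a topic id to exactly one shard.

    CRC32 is a stand-in for the unspecified stable hash (assumption).
    """
    return zlib.crc32(topic_id.to_bytes(8, "little")) % shards


def write_snapshot_atomically(path: str, payload: bytes) -> None:
    """temp -> fsync -> rename -> dir-fsync.

    A reader sees either the old file or the complete new one, never a
    partially written snapshot.
    """
    directory = os.path.dirname(os.path.abspath(path))
    tmp_path = path + ".tmp"  # hypothetical temp-file name
    with open(tmp_path, "wb") as f:
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())      # contents are durable before they become visible
    os.replace(tmp_path, path)    # atomic rename within one filesystem
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)          # make the rename itself durable
    finally:
        os.close(dir_fd)
```

Because the shard is a pure function of the topic id and the shard count, changing the count between restarts remaps topics, which is consistent with recovery replaying all shards by topic_id rather than trusting the old mapping.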

The data directory holds every record’s payload in the clear (WAL frames and segment .data files). topics does not encrypt payloads. Protect it with filesystem permissions and at-rest disk encryption — see Security.

Segments

WAL frames are periodically checkpointed into per-topic segment files to keep recovery bounded and reads efficient. Each segment is a numbered pair, named by its first seq so the files sort into seq order:

| File | Contents |
| --- | --- |
| seg-<first_seq>.data | Append-ordered record frames (a close variant of the WAL frame: every frame is an Append, so there is no type byte). Each frame ends in an XXH3-64 checksum, the same crash anchor as the WAL. |
| seg-<first_seq>.idx | A fixed-stride index, 20 bytes per entry: [offset:u32, len:u32, ts:u64, flags:u8, pad:3]. Entry i ↔ seq first_seq + i. |

The .idx is the on-disk twin of the in-memory topic index. Because it is fixed-stride, seq → entry is direct arithmetic — (seq − first_seq) × 20 — a seek, never a scan. That makes rebuilding the in-memory index on restart a bulk read of .idx files rather than a re-parse of all payload data. The inline ts enables binary search for the TTL boundary, and the inline flags a cheap tag/node presence probe, without touching the .data file.
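
The fixed-stride lookup can be sketched as follows. The field order and widths come from the entry layout above; little-endian byte order and the function names are assumptions, since the source does not state them.

```python
import struct

# u32 offset, u32 len, u64 ts, u8 flags, 3 pad bytes.
# "<" means little-endian with no implicit alignment (byte order is an assumption).
ENTRY = struct.Struct("<IIQB3x")
assert ENTRY.size == 20


def entry_offset(seq: int, first_seq: int) -> int:
    """Byte position of the entry for `seq` inside seg-<first_seq>.idx."""
    if seq < first_seq:
        raise ValueError("seq precedes this segment")
    return (seq - first_seq) * ENTRY.size


def read_entry(idx_bytes: bytes, seq: int, first_seq: int):
    """Return (offset, len, ts, flags) for `seq` by direct arithmetic, no scan."""
    pos = entry_offset(seq, first_seq)
    if pos + ENTRY.size > len(idx_bytes):
        raise IndexError("seq is past the end of this segment")
    return ENTRY.unpack_from(idx_bytes, pos)
```

For example, with first_seq = 10001, the entry for seq 10002 starts at byte (10002 − 10001) × 20 = 20.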

Sealing triggers

The newest segment for a topic is “active”: it is still being appended to. A segment seals (becomes immutable) when any of three triggers fires:

| Trigger | Default | Env knob |
| --- | --- | --- |
| Record count | 10000 records | TOPICS_SEGMENT_MAX_EVENTS |
| .data size | 64 MiB | TOPICS_SEGMENT_MAX_BYTES |
| Age (idle/partial) | 1 h (3600000 ms) | TOPICS_SEGMENT_MAX_AGE_MS (0 disables) |

The byte trigger is a guard for topics with large payloads (the count trigger alone could let a segment grow huge); the age trigger seals an idle topic’s partial segment so it can be relocated or reclaimed.
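
A minimal sketch of the "any of three triggers" check, using the documented defaults. The names `SealLimits` and `should_seal` are hypothetical, and two details are assumptions the source does not settle: thresholds are treated as inclusive (>=), and the age trigger only applies to a segment that already holds at least one record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SealLimits:
    max_events: int = 10_000             # TOPICS_SEGMENT_MAX_EVENTS
    max_bytes: int = 64 * 1024 * 1024    # TOPICS_SEGMENT_MAX_BYTES
    max_age_ms: int = 3_600_000          # TOPICS_SEGMENT_MAX_AGE_MS, 0 disables


def should_seal(events: int, data_bytes: int, age_ms: int,
                limits: SealLimits = SealLimits()) -> bool:
    """True when any one trigger fires for the active segment."""
    if events >= limits.max_events:
        return True
    if data_bytes >= limits.max_bytes:
        return True
    # Age trigger: off when 0; assumed to skip empty segments.
    if limits.max_age_ms > 0 and events > 0 and age_ms >= limits.max_age_ms:
        return True
    return False
```

A topic with large payloads hits the byte trigger long before 10000 records, while an idle topic with a handful of records seals on age.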

A sealed segment is immutable, and a checksum mismatch on it is treated as corruption — surfaced, not silently truncated. This differs from the WAL’s torn tail, where a checksum mismatch at the end is the expected logical end-of-log and is truncated cleanly on recovery.

Cap/TTL eviction and deletion reclaim disk at segment granularity, lazily: a whole sealed segment whose highest seq is below the live floor is dropped (its .data + .idx unlinked); a partially-deleted segment is reclaimed by a background rewrite. The active segment is never dropped. A topic may therefore briefly retain slightly more than its cap (only whole sealed segments drop) — the documented, accepted approximation. See Segments & Snapshots.
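
The segment-granularity rule can be illustrated with a short sketch. The `Segment` type and `droppable` function are hypothetical names; the rule itself (drop only sealed segments whose highest seq is below the live floor) is the one described above.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    first_seq: int
    last_seq: int   # highest seq stored in the segment
    sealed: bool


def droppable(segments: List[Segment], live_floor: int) -> List[Segment]:
    """Whole sealed segments that lie entirely below the live floor.

    The active (unsealed) segment is never returned, and a segment that
    straddles the floor is kept, which is why a topic can briefly hold
    slightly more than its cap.
    """
    return [s for s in segments if s.sealed and s.last_seq < live_floor]
```

With segments covering 1..10000, 10001..20000 (both sealed) and an active 20001..20500, a live floor of 15000 drops only the first; the second stays until the floor passes 20000.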

Hot/cold tiering

A topic’s segments can be split across two tiers:

  • HOT — the active segment plus recent sealed segments, on fast local NVMe (the per-topic directory under TOPICS_DATA_DIR). The live tail (active segment + in-memory index + a bounded recent-record cache) is always hot and independent of cold access.
  • COLD — older sealed segments, on a slower tier. Today the cold tier is simply a different configured folder, TOPICS_COLD_DIR. (An object-store backend such as S3 is explicitly future work; only the local-folder cold store exists now. Both tiers sit behind one SegmentStore trait, so S3 drops in later without touching the engine.)

Tiering is disabled by default. When TOPICS_COLD_DIR is unset, nothing relocates and behavior is unchanged by construction — every segment stays hot. You opt in by setting the cold directory.

When enabled, the cold tier mirrors the per-topic layout under TOPICS_COLD_DIR using the same seg-<first_seq> naming, so a relocated segment keeps its identity:

  • <TOPICS_COLD_DIR>/
    • topics/
      • <topic-id-hex>/
        • seg-0000000000000001.data
        • seg-0000000000000001.idx

Hot retention

Sealed segments beyond the hot-retention bound relocate to cold:

| Knob | Default | Meaning |
| --- | --- | --- |
| TOPICS_HOT_RETAIN_SEGMENTS | 4 | Keep this many newest sealed segments hot before relocating older ones to cold. |
| TOPICS_HOT_RETAIN_BYTES | 0 (off) | Optional hot sealed-byte bound. When both are set, the stricter wins. |
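
One plausible reading of "the stricter wins" is sketched below: walk sealed segments from newest to oldest and keep them hot until either bound would be exceeded. The function name and the exact byte-bound semantics (a segment that would push the hot total past the bound goes cold) are assumptions, not documented behavior.

```python
from typing import List, Tuple


def split_hot_cold(sealed: List[Tuple[int, int]],
                   retain_segments: int = 4,
                   retain_bytes: int = 0) -> Tuple[List[int], List[int]]:
    """Split sealed segments into (hot, cold) lists of first_seq values.

    `sealed` holds (first_seq, data_bytes) pairs, oldest first.
    retain_bytes == 0 means the byte bound is off.
    """
    hot: List[int] = []
    hot_bytes = 0
    for first_seq, size in reversed(sealed):   # newest first
        over_count = len(hot) >= retain_segments
        over_bytes = retain_bytes > 0 and hot_bytes + size > retain_bytes
        if over_count or over_bytes:
            break   # this segment and everything older relocates to cold
        hot.append(first_seq)
        hot_bytes += size
    hot.reverse()
    cold = [first_seq for first_seq, _ in sealed[: len(sealed) - len(hot)]]
    return hot, cold
```

With six 10-byte sealed segments and the defaults, the newest four stay hot. Adding a 25-byte bound is stricter than the count of four, so only the newest two stay hot.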

Relocation is crash-safe and idempotent: copy the segment to cold → fsync → durably flip the tier pointer → delete the hot copy. If interrupted, restart prefers the surviving copy (a per-topic tier resolver favors the HOT copy when both exist during the transient mid-relocation window), so a segment is never lost. Cap/TTL/delete reclaim drops a whole segment file in either tier. The WAL remains the durability boundary throughout; segments are a derivable materialization.
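
The relocation order and the hot-preferring resolver can be sketched as below. This is a simplified model, not the engine's code: the function names are hypothetical, the tier pointer is represented by a caller-supplied `flip_pointer` callback (the source does not say how the pointer is stored, and a real one must be durable), and directory fsync assumes POSIX.

```python
import os
import shutil
from typing import Callable

EXTS = (".data", ".idx")


def _fsync_dir(path: str) -> None:
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def relocate_segment(hot_dir: str, cold_dir: str, stem: str,
                     flip_pointer: Callable[[str], None]) -> None:
    """copy -> fsync -> flip tier pointer -> delete hot copy. Safe to re-run."""
    os.makedirs(cold_dir, exist_ok=True)
    for ext in EXTS:
        src = os.path.join(hot_dir, stem + ext)
        dst = os.path.join(cold_dir, stem + ext)
        if not os.path.exists(src):
            continue  # hot copy already removed by an earlier, interrupted run
        with open(src, "rb") as s, open(dst, "wb") as d:
            shutil.copyfileobj(s, d)
            d.flush()
            os.fsync(d.fileno())
    _fsync_dir(cold_dir)
    flip_pointer(stem)  # hypothetical hook; must be durable in a real system
    for ext in EXTS:
        src = os.path.join(hot_dir, stem + ext)
        if os.path.exists(src):
            os.remove(src)
    _fsync_dir(hot_dir)


def resolve_segment(hot_dir: str, cold_dir: str, stem: str) -> str:
    """Directory to read from; HOT wins while both copies exist."""
    if os.path.exists(os.path.join(hot_dir, stem + ".data")):
        return hot_dir
    return cold_dir
```

The hot copy is deleted only after the cold copy is durable and the pointer has flipped, so a crash at any step leaves at least one complete copy. A half-written cold file is harmless because the resolver still picks the intact hot copy, and a re-run overwrites it.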

The hard invariant

Cold reads MAY degrade getDifference / historical reads, but MUST NOT affect writes or live delivery (SSE / tail). Cold I/O and the relocator run on a separate blocking/IO pool; they never hold a topic write lock or block an SSE push during a slow cold fetch. The live tail is always hot.

This is the load-bearing rule of the tiering design. A consumer reading deep history from a slow cold segment can be slower, and that latency surfaces honestly in the response (a cold_segments_read field appears in the performance block when a read touched cold storage — see Observability). But a writer appending to the topic, and a watcher tailing the live head, are never slowed by that cold read. The in-memory index keeps memory bounded by the index entry count, not the payload volume — older payloads are read from segments on demand.
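
The isolation described above can be illustrated with a toy model: cold fetches run on their own thread pool and never execute under the topic's write lock. The `Topic` class, its method names, and the pool size are hypothetical; the real engine's structure is not shown in the source.

```python
import threading
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, List


class Topic:
    def __init__(self, cold_fetch: Callable[[str], bytes]) -> None:
        self._write_lock = threading.Lock()
        self._tail: List[bytes] = []          # stands in for the hot live tail
        self._cold_fetch = cold_fetch
        # Dedicated pool: slow cold I/O cannot occupy the write path.
        self._cold_pool = ThreadPoolExecutor(max_workers=2)

    def append(self, record: bytes) -> int:
        """Write path: holds the lock only for the in-memory append."""
        with self._write_lock:
            self._tail.append(record)
            return len(self._tail)

    def read_history(self, segment: str) -> Future:
        """Historical read: submitted to the cold pool, lock never held."""
        return self._cold_pool.submit(self._cold_fetch, segment)
```

An append issued while a cold fetch is still in flight returns immediately, because the fetch is waiting on the cold pool rather than on anything the writer needs.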

Segment config knobs

The table below lists the complete set of storage-related environment variables. All are also documented in Configuration.

| Env var | Default | Meaning |
| --- | --- | --- |
| TOPICS_DATA_DIR | ./topics-data | Hot tier + WAL + meta root. |
| TOPICS_WAL_SHARDS | min(num_cpus, 8) | Number of independent WAL shards (each its own writer thread / file set / group commit). Each topic maps to one shard; recovery is shard-count-agnostic. 1 = the flat single-writer layout. |
| TOPICS_COLD_DIR | (unset) | Cold tier root. Unset ⇒ tiering disabled (all hot). |
| TOPICS_SEGMENT_MAX_EVENTS | 10000 | Seal a segment after this many records. |
| TOPICS_SEGMENT_MAX_BYTES | 67108864 (64 MiB) | Seal after this many .data bytes. |
| TOPICS_SEGMENT_MAX_AGE_MS | 3600000 (1 h) | Seal an idle/partial segment after this age; 0 disables. |
| TOPICS_HOT_RETAIN_SEGMENTS | 4 | Newest sealed segments kept hot before relocating to cold. |
| TOPICS_HOT_RETAIN_BYTES | 0 (off) | Optional hot sealed-byte bound; stricter of the two wins. |
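
As a quick reference, the defaults above can be expressed as a small loader. This is a hypothetical helper for tooling or tests, not part of topics itself; the variable names and defaults are the documented ones, while `StorageConfig` and `load_storage_config` are invented for illustration. An empty TOPICS_COLD_DIR is treated as unset, which is an assumption.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class StorageConfig:
    data_dir: str
    wal_shards: int
    cold_dir: Optional[str]
    segment_max_events: int
    segment_max_bytes: int
    segment_max_age_ms: int
    hot_retain_segments: int
    hot_retain_bytes: int

    @property
    def tiering_enabled(self) -> bool:
        return self.cold_dir is not None


def load_storage_config(env: Mapping[str, str] = os.environ) -> StorageConfig:
    def as_int(name: str, default: int) -> int:
        raw = env.get(name)
        return default if raw is None else int(raw)

    return StorageConfig(
        data_dir=env.get("TOPICS_DATA_DIR", "./topics-data"),
        wal_shards=as_int("TOPICS_WAL_SHARDS", min(os.cpu_count() or 1, 8)),
        cold_dir=env.get("TOPICS_COLD_DIR") or None,  # empty treated as unset
        segment_max_events=as_int("TOPICS_SEGMENT_MAX_EVENTS", 10_000),
        segment_max_bytes=as_int("TOPICS_SEGMENT_MAX_BYTES", 67_108_864),
        segment_max_age_ms=as_int("TOPICS_SEGMENT_MAX_AGE_MS", 3_600_000),
        hot_retain_segments=as_int("TOPICS_HOT_RETAIN_SEGMENTS", 4),
        hot_retain_bytes=as_int("TOPICS_HOT_RETAIN_BYTES", 0),
    )
```

Calling `load_storage_config({})` yields the all-defaults configuration with tiering disabled; passing `{"TOPICS_COLD_DIR": "/mnt/cold"}` is the only change needed to opt in.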
