Architecture Overview
topics is a single Rust process on a single machine: an append-only log engine that keeps a write-ahead log as its durability boundary and materializes everything else — the in-memory index, the per-topic segment files, the metadata snapshots — as a derivable cache of that log. This page covers the design principles the whole system is built on and the layered shape that enforces them, then points you at the deep dives.
The assumptions are deliberate and narrow: one process, one machine, a good CPU, and local NVMe (not a spinning disk, not a network volume). Everything below trades the generality of a distributed system for the latency and crash-consistency you can get when the disk is a single sequential resource you fully control.
Design principles
Four ideas drive every layer. They are not aspirations — they are invariants the storage layer is structured to make cheap to enforce.
1. Never silent loss
Every involuntary loss that passes a consumer’s cursor — a cap eviction or a TTL
expiry — surfaces as an in-band tombstone carrying the exact
[gap_from, gap_to] range, at HTTP 200, never a silent skip. The engine keeps two floors
per topic and makes both cheaply queryable as atomics: earliest_seq (the first live seq)
and evict_floor (the sole tombstone trigger, advanced only by involuntary cap/TTL loss).
A voluntary delete advances earliest_seq but never evict_floor,
so a purely-deleted gap reads silently while an evicted gap tombstones. That single
distinction is the reason the data model has two watermarks instead of one.
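The two-floor rule can be sketched in a few lines. This is a minimal illustration, not the engine's code: the `Floors` type, the `Resume` enum, and the `resume` method are hypothetical names, and it assumes that `evict_floor` holds the first seq not lost to eviction and that a cursor is the next seq a consumer wants.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-topic floors. Field names follow the text; the layout is assumed.
pub struct Floors {
    /// First live seq.
    earliest_seq: AtomicU64,
    /// Assumed here: the first seq NOT lost to cap/TTL eviction.
    evict_floor: AtomicU64,
}

/// What a reader should do when it resumes at a cursor (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum Resume {
    /// Cursor is at or past both floors: read normally.
    Live { from: u64 },
    /// Only voluntary deletes were skipped: jump ahead, no tombstone.
    SilentSkip { from: u64 },
    /// Evicted records were skipped: report the exact inclusive range first.
    Tombstone { gap_from: u64, gap_to: u64, from: u64 },
}

impl Floors {
    pub fn new(earliest_seq: u64, evict_floor: u64) -> Self {
        Floors {
            earliest_seq: AtomicU64::new(earliest_seq),
            evict_floor: AtomicU64::new(evict_floor),
        }
    }

    /// Classify a cursor against the two floors with two atomic loads.
    pub fn resume(&self, cursor: u64) -> Resume {
        let earliest = self.earliest_seq.load(Ordering::Acquire);
        let evicted = self.evict_floor.load(Ordering::Acquire);
        if cursor < evicted {
            // Involuntary loss passed the cursor: it must surface in-band.
            Resume::Tombstone {
                gap_from: cursor,
                gap_to: evicted - 1, // safe: cursor < evicted implies evicted >= 1
                from: earliest.max(evicted),
            }
        } else if cursor < earliest {
            // Only voluntary deletes lie between the cursor and the first live seq.
            Resume::SilentSkip { from: earliest }
        } else {
            Resume::Live { from: cursor }
        }
    }
}
```

With `earliest_seq = 10` and `evict_floor = 5`, a cursor at 2 gets a tombstone for [2, 4], while a cursor at 7 skips silently to 10. Neither answer is possible with a single watermark.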
2. The WAL is the durability boundary
“Only data not yet in the WAL is lost.” The write-ahead log is the one source of truth; the in-memory index, the segment files, and the metadata snapshots are all materializations of WAL frames plus checkpoints. An acked write on a durable topic is a complete, checksum-valid frame on disk — so it is recoverable by replaying the WAL on the next restart. See WAL & Group Commit and durability classes.
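To make "a complete, checksum-valid frame" concrete, here is a rough sketch of frame append and replay. The frame layout (`len`, then checksum, then payload), the function names, and the FNV-1a checksum are all assumptions chosen to keep the example std-only. The real format and its checksum are described in WAL & Group Commit.

```rust
/// Stand-in checksum (FNV-1a 64). Not the engine's checksum; chosen to avoid dependencies.
fn checksum(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

fn read_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(b);
    u32::from_le_bytes(a)
}

fn read_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b);
    u64::from_le_bytes(a)
}

/// Append one frame using an assumed layout: [len: u32 LE][checksum: u64 LE][payload].
pub fn append_frame(log: &mut Vec<u8>, payload: &[u8]) {
    log.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    log.extend_from_slice(&checksum(payload).to_le_bytes());
    log.extend_from_slice(payload);
}

/// Replay the log. Returns the recovered payloads and the byte length of the
/// valid prefix; anything past that offset is a torn tail to truncate.
pub fn replay(log: &[u8]) -> (Vec<Vec<u8>>, usize) {
    const HEADER: usize = 12;
    let mut recovered = Vec::new();
    let mut pos = 0;
    while log.len() - pos >= HEADER {
        let len = read_u32(&log[pos..pos + 4]) as usize;
        let sum = read_u64(&log[pos + 4..pos + 12]);
        let start = pos + HEADER;
        // Stop at the first frame whose payload is incomplete.
        let end = match start.checked_add(len) {
            Some(e) if e <= log.len() => e,
            _ => break,
        };
        // Stop at the first frame whose checksum does not match.
        if checksum(&log[start..end]) != sum {
            break;
        }
        recovered.push(log[start..end].to_vec());
        pos = end;
    }
    (recovered, pos)
}
```

Replay stops at the first incomplete or corrupt frame, so a write is recoverable exactly when its whole frame reached the disk.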
3. Trim at segment granularity, lazily
Cap/TTL eviction never rewrites data or deletes individual records on the hot path. It
advances a watermark (an atomic store plus a front-drain of the in-memory index) and drops
whole sealed segment files in the background. A topic may therefore briefly retain
slightly more than its cap — only whole sealed segments drop — which is the documented,
accepted approximation (the same one Redis makes with approximate ~ trimming and Kafka makes with segment-based retention). Voluntary deletion is
likewise logically immediate but physically lazy: the payload is freed and earliest_seq
advances synchronously, while disk reclaim happens off the hot path. See
Segments & Snapshots.
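A sketch of cap trimming at segment granularity, under stated assumptions: `TopicLog`, `SealedSegment`, and `trim_to_cap` are hypothetical names, the floor is a plain integer rather than an atomic, and the policy shown (drop the oldest sealed segment only if the remainder still meets the cap) is one reading of "may briefly retain slightly more than its cap".

```rust
use std::collections::VecDeque;

/// Hypothetical descriptor of one sealed, immutable segment.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SealedSegment {
    pub first_seq: u64,
    pub last_seq: u64,
    pub bytes: u64,
}

/// Hypothetical per-topic view: sealed segments oldest-first, plus the active one.
pub struct TopicLog {
    pub sealed: VecDeque<SealedSegment>,
    pub active_bytes: u64,
    /// Plain integer in this sketch; an atomic in the engine.
    pub evict_floor: u64,
}

impl TopicLog {
    /// Advance the watermark past whole sealed segments and return them so a
    /// background task can unlink the files. No record is rewritten.
    pub fn trim_to_cap(&mut self, cap_bytes: u64) -> Vec<SealedSegment> {
        let mut dropped = Vec::new();
        let mut total: u64 =
            self.active_bytes + self.sealed.iter().map(|s| s.bytes).sum::<u64>();
        loop {
            let (bytes, last_seq) = match self.sealed.front() {
                Some(s) => (s.bytes, s.last_seq),
                None => break,
            };
            // Dropping this segment would dip below the cap: keep it, and
            // accept retaining slightly more than the cap.
            if total - bytes < cap_bytes {
                break;
            }
            total -= bytes;
            self.evict_floor = last_seq + 1;
            if let Some(seg) = self.sealed.pop_front() {
                dropped.push(seg);
            }
        }
        dropped
    }
}
```

With three 100-byte sealed segments, a 50-byte active segment, and a 200-byte cap, only the oldest segment drops and the topic keeps 250 bytes.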
4. Single machine, sequential disk
Seqs are mostly-sequential u64, so the seq→location map is a base+offset vector, not a
hash map — O(1) lookup with no hashing on the read path. The WAL is sharded
(TOPICS_WAL_SHARDS, default min(num_cpus, 8)) into independent ordered shard writers —
each shard is a single sequential stream matching the hardware (trivial group commit, no
write-side lock contention), and a topic maps to one shard so per-topic ordering still holds.
Topics are also sharded across cores for delivery work.
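The base+offset idea is small enough to sketch. The `SeqIndex` and `Location` types below are hypothetical, and representing a missing seq as a `None` slot is an assumption about how "mostly-sequential" is handled.

```rust
use std::collections::VecDeque;

/// Hypothetical record location: which segment, and the byte offset inside it.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Location {
    pub segment: u32,
    pub offset: u64,
}

/// Base+offset index: slot i holds the location of seq `base + i`.
pub struct SeqIndex {
    base: u64,
    /// `None` marks a hole, since seqs are only mostly sequential (assumed).
    slots: VecDeque<Option<Location>>,
}

impl SeqIndex {
    pub fn new(base: u64) -> Self {
        SeqIndex { base, slots: VecDeque::new() }
    }

    /// Append a location for `seq`, padding any skipped seqs with holes.
    pub fn insert(&mut self, seq: u64, loc: Location) {
        let next = self.base + self.slots.len() as u64;
        assert!(seq >= next, "seqs must be appended in increasing order");
        for _ in next..seq {
            self.slots.push_back(None);
        }
        self.slots.push_back(Some(loc));
    }

    /// O(1) lookup: one subtraction and one bounds-checked index, no hashing.
    pub fn get(&self, seq: u64) -> Option<Location> {
        let i = seq.checked_sub(self.base)?;
        self.slots.get(i as usize).copied().flatten()
    }

    /// Front-drain used when a floor advances: drop every slot below `new_base`.
    pub fn drain_front(&mut self, new_base: u64) {
        while self.base < new_base && self.slots.pop_front().is_some() {
            self.base += 1;
        }
        if self.base < new_base {
            self.base = new_base; // the index was emptied before reaching new_base
        }
    }
}
```

The cost of this layout is one wasted slot per hole, which is cheap as long as seqs stay dense.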
The load-bearing invariant: involuntary loss (which you didn’t ask for) tombstones; voluntary removal (which you did ask for) is silent. Every structural choice (the dual floor, the tag index, and in-place delete-flag marking with no per-record compaction or reclaim, where only whole cleared segments drop) exists to keep that distinction crisp. See the Core Guarantees.
The layered shape
A write flows down through four layers; a read is served from whichever layer still holds the bytes.
HTTP / SSE (axum + hyper)
│ validate, resolve topic_id, assign seq
▼
Engine (per-topic index, dual floor, tag index, scheduler)
│ enqueue frame under the per-topic lock
▼
WAL (sharded writer threads, adaptive group commit, fdatasync)
│ checkpoint: copy contiguous byte ranges per topic
▼
Segments (.data + .idx, mmap reads) + metadata snapshots
- HTTP / SSE. A typed /v0 API over axum/hyper. The request path validates the body, resolves the topic name to an interned u64 id, and assigns the monotonic $seq before the frame is handed off, so the seq can be returned even though the record is not yet visible until its commit class is satisfied.
- Engine. Each topic holds a base+offset location vector, the dual floor as atomics (head_seq, earliest_seq, evict_floor), a per-topic tag index for efficient match deletes, and the wakeup primitive that drives SSE and long-poll diffs without polling. A priority scheduler governs the post-write propagation that has to hit the 1–5 ms target.
- WAL. A sharded append-only log of length-prefixed, XXH3-64-checksummed frames (N independent shard writers; each topic maps to one shard). Seq assignment and frame enqueue happen under the per-topic lock (microseconds); the fsync wait happens off the lock, so many writers’ waits coalesce into one group flush per shard.
- Segments & snapshots. A background compactor copies committed WAL bytes into per-topic segment files (.data + a fixed-stride .idx), and a metadata snapshot lets recovery start without replaying the WAL from time zero. Sealed segments are immutable and served by mmap.
Reads are served from the layer that still holds the record: the newest records (written,
not yet checkpointed) come straight from the WAL/page cache; the active segment is read with
buffered pread; sealed segments are mmap’d zero-copy. A consumer a few milliseconds behind
head never waits for a checkpoint.
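The tier choice for a read can be pictured as a comparison against a few per-topic watermarks. The sketch below is illustrative only: `Watermarks`, `ReadTier`, and the `active_base`, `checkpointed`, and `next_seq` fields are hypothetical names, not part of the engine's documented state.

```rust
/// Where a record's bytes are served from (hypothetical enum).
#[derive(Debug, PartialEq, Eq)]
pub enum ReadTier {
    /// Immutable sealed segment, read through mmap.
    SealedSegment,
    /// Active segment, read with buffered pread.
    ActiveSegment,
    /// Written but not yet checkpointed: served from the WAL / page cache.
    WalTail,
}

/// Hypothetical per-topic watermarks, all plain integers in this sketch.
pub struct Watermarks {
    /// First live seq.
    pub earliest_seq: u64,
    /// First seq stored in the active segment (assumed).
    pub active_base: u64,
    /// First seq not yet copied out of the WAL (assumed).
    pub checkpointed: u64,
    /// Next seq to be assigned (assumed).
    pub next_seq: u64,
}

impl Watermarks {
    /// Pick the tier that still holds `seq`, or `None` if it is below the
    /// first live seq or has not been written yet.
    pub fn tier_for(&self, seq: u64) -> Option<ReadTier> {
        if seq < self.earliest_seq || seq >= self.next_seq {
            None
        } else if seq < self.active_base {
            Some(ReadTier::SealedSegment)
        } else if seq < self.checkpointed {
            Some(ReadTier::ActiveSegment)
        } else {
            Some(ReadTier::WalTail)
        }
    }
}
```

A consumer reading just behind head lands in the `WalTail` branch, which is why it never waits on the checkpointer.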
Concurrency in one paragraph
State is sharded, not globally locked. Topics are partitioned across cores via a
DashMap registry; each topic has its own RwLock so two writers on two topics proceed in
parallel, and each shard has a small ready-set mutex held only for an O(1) splice. The
serialization points are the per-shard WAL writer threads, each fed by its own MPSC
channel — correct because each shard is one sequential stream. Cold-tier I/O and the segment
relocator run on a separate blocking pool under a hard invariant: cold reads may degrade
historical reads but never block writes or live SSE delivery.
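The shard routing and the single-writer loop can be sketched with the standard library's MPSC channel. Everything here is an assumption made for illustration: `shard_for` uses a plain modulo, `Frame` and `ShardStats` are hypothetical types, and "take whatever is already queued, then flush once" is a simple stand-in for the adaptive group commit described in WAL & Group Commit.

```rust
use std::sync::mpsc::Receiver;

/// Route a topic to one shard so its frames stay in one ordered stream.
/// Modulo routing is an assumption; any stable mapping preserves ordering.
pub fn shard_for(topic_id: u64, shards: usize) -> usize {
    (topic_id % shards as u64) as usize
}

/// Hypothetical frame handed to a shard writer.
pub struct Frame {
    pub topic_id: u64,
    pub seq: u64,
    pub payload: Vec<u8>,
}

/// Counters returned when the writer exits (hypothetical).
#[derive(Debug, Default, PartialEq, Eq)]
pub struct ShardStats {
    pub frames: usize,
    pub flushes: usize,
    pub bytes: usize,
}

/// Body of one shard's writer thread. It is the only writer to its log, so
/// appends need no lock. It runs until every sender has been dropped.
pub fn run_shard_writer(rx: Receiver<Frame>) -> ShardStats {
    let mut stats = ShardStats::default();
    let mut log: Vec<u8> = Vec::new(); // stands in for the shard's file
    // Block for the first frame of a batch...
    while let Ok(first) = rx.recv() {
        let mut batch = vec![first];
        // ...then take everything already queued without blocking.
        while let Ok(next) = rx.try_recv() {
            batch.push(next);
        }
        for frame in &batch {
            log.extend_from_slice(&frame.payload);
        }
        // A real writer would sync the file here, once for the whole batch,
        // and then wake every producer waiting on a frame in it.
        stats.frames += batch.len();
        stats.flushes += 1;
    }
    stats.bytes = log.len();
    stats
}
```

In use, each shard would own one channel and one thread running `run_shard_writer`, and producers would send to the shard chosen by `shard_for(topic_id, shard_count)`. When several frames are already queued, they share a single flush, which is the coalescing the paragraph above relies on.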
Where to read the internals
- WAL & Group Commit. The frame format, the sharded writer threads, adaptive group commit, and off-lock fsync.
- Segments & Snapshots. Checkpointing the WAL into per-topic .data/.idx segments and atomic metadata snapshots.
- Crash Recovery. Snapshot load, WAL replay, torn-tail truncation, and orphan-segment reclaim.
- Scheduler & Backpressure. The banded weighted-fair scheduler and the elastic throttle that sheds latency, not data.
- Performance. Measured engine, HTTP, fsync, and SSE numbers, framed honestly with their caveats.