Scheduler & Backpressure
After a write commits, topics still has to propagate it — wake the topic’s SSE watchers, run its routers, flush its commit batch. That post-write work is what the scheduler governs, and it is what has to hit the 1–5 ms delivery target. The scheduler is a banded weighted-fair queue over dirty topics, with aging to prevent starvation and a recency-based priority so recently-consumed topics are served first. When the machine is under CPU pressure, an elastic governor degrades latency along a defined ladder — and the cardinal rule is that it degrades latency, never correctness.
Scheduling affects delivery only — never seq order, retention, which records are
visible, or whether a write is durable. A deprioritized or deferred topic stays fully
consistent: its records are durably stored and immediately readable with
POST /v0/topics/:topic/diff. Priority changes when the push happens,
nothing else.
The schedulable unit is a dirty topic
The schedulable entity is a topic, not a record or a watcher. A write (or a router forward) marks its topic dirty and inserts it into its shard’s ready set — at most once; a membership bit prevents duplicates. This bounds the queue to the number of dirty topics and coalesces a burst of writes to one topic into a single unit of work.
A worker that picks up topic B drains B fully — wakes all its SSE watchers, forwards to all its router destinations, flushes its commit batch — before moving on. Draining one topic at a time under its own lock preserves per-topic ordering and amortizes the lock acquisition. Under unsaturated load the ready set is near-empty, so a topic is serviced within microseconds of being marked dirty.
Banded weighted-fair queue (DWRR)
A plain max-heap on priority would starve low-priority topics. Instead, effective priority buckets into five bands, drained by deficit weighted round-robin (DWRR):
| Band | Effective priority | Weight |
|---|---|---|
B4 | >= 750 | 8 |
B3 | 500..749 | 4 |
B2 | 250..499 | 2 |
B1 | 0..249 | 1 |
B0 | < 0 (explicitly deprioritized) | 1 |
Within a band, topics are served FIFO by enqueue time. Across bands, each round grants
credit proportional to weight: with these defaults, for every one B1/B0 topic serviced,
up to eight top-band topics may be. High priority is strongly favored, but B1 and B0
always make forward progress every round — no band is ever frozen out.
Aging prevents starvation
A topic stuck at the bottom of a busy band shouldn’t wait forever, so it climbs over time:
age_boost = +100 per second waited, capped at +1000 after 10 secondsA 50 ms aging tick promotes topics across band boundaries, and enqueued_at resets only when
the topic is actually serviced — so a continuously-rewritten topic still ages. The combined
guarantee: no topic waits more than 10 s before reaching the top band, and DWRR drains the
top band every round, so worst-case scheduling latency is bounded even under sustained
high-priority load.
Effective priority
Every topic has an effective priority combining a manual component, an automatic recency component, and the aging boost above:
P_eff = clamp(priority, -1000, 1000) # manual, when set — overrides
+ auto_recency # recency, when priority is null
+ age_boost # anti-starvation climb
auto_recency = 500 * 2^( -(now - last_consumed_ms) / 30000 ) # 0 after a 5-min floor- Manual (
priorityset) — clamped to[-1000, 1000]and overrides the auto term. An operator can force a topic above all auto traffic withpriority: 1000(or below it with-1000). - Auto recency (
priority: null, the default withauto_priority: true) — a freshly-consumed topic scores ~500(≈ a+500manual topic); the bonus halves every 30 s of inactivity, so an actively-polling consumer keeps its topic “hot” with negligible upkeep while a quiet topic sheds priority within a couple of minutes. After a 5-minute floor (300 s untouched) the auto term is forced to0— the math is skipped entirely.
P_eff is never stored as ground truth; it is computed on demand at enqueue time and on the
50 ms aging tick, using fixed-point math (a lookup table for 2^-x, no powf on the hot
path). Higher P_eff is served first.
The pressure governor
A governor task samples three cheap signals every 100 ms into a single
pressure ∈ [0, 1]: ready-set depth versus the worker count, EWMA scheduling latency versus
the 5 ms ceiling, and the blocking/compute-pool busy ratio. pressure is published as a
lock-free atomic and drives an escalating, composable ladder:
| Trigger | What the governor does | Loss? |
|---|---|---|
pressure > 0.2 | Coalesce batches. Stop waking watchers per-record; pack a topic’s pending records into one multi-record frame/diff. Cheap, lossless, often improves throughput. Window grows 0–20 ms with pressure. | No |
pressure > 0.4 | Widen the group-commit window. commit_window_ms lerps from 0.5 ms up toward 10 ms — fewer fsyncs/sec, more headroom; the cost is up to +9.5 ms write-ack latency, observed as latency, never loss. | No |
pressure > 0.8 (sustained) | Defer the lowest-value work. Routers are enqueued one band lower; B0/negative-priority topics stop receiving DWRR credit (hysteresis releases at pressure < 0.6) — their data is still durably stored and fully pollable, only the push is paused. At pressure ≈ 1.0 with a full ingest channel, the write endpoint returns 429 + Retry-After. | No |
The cardinal rule: throttling degrades latency and push-eagerness, never correctness.
A deferred topic is always fully consistent on the next
diff. All data loss remains the explicit, configured cap/TTL path with
in-band tombstones; a full-write rejection is synchronous
(422 topic_full / 429 throttled), never ack-then-drop.
What a client sees, by pressure level
| Condition | Client-visible effect | Loss? |
|---|---|---|
Healthy (pressure < 0.2) | 1–5 ms delivery, per-record frames | No |
| Mild pressure | Coalesced multi-record frames, ~5–15 ms | No |
| Heavy pressure | Slower write-acks; low-priority pushes paused but pollable | No |
| Saturation on write | 429 + Retry-After (write rejected synchronously) | No |
| Cap/TTL crosses a cursor (independent of pressure) | In-band tombstone with [gap_from, gap_to] | Explicit, never silent |
A writer that must not be throttled can bypass the 429 with disable_backpressure: true
on the write. The 429 carries detail.retry_after_ms for CPU pressure (and
detail.limit when a resource cap is the cause); see Errors.
Slow-consumer isolation
Each SSE connection has a bounded outbound channel. If a worker can’t enqueue a frame, it
does not block the topic drain: the connection is marked lagged, the server stops
buffering for it, and on the next successful send it emits a tombstone for the skipped range
(on lossy topics) so the client catches up via diff. One slow client is contained to its
own connection — the topic and every other watcher proceed at full speed. This is the same
boundedness that lets the governor trade latency for throughput without unbounded memory
growth.
See also
- WAL & Group Commit — the adaptive commit window the governor widens under pressure.
- Performance — measured delivery latency and the throughput ceilings the scheduler operates within.
- Watch (SSE) — the live-delivery surface the scheduler drains topics into.
- Errors — the
429 throttledbody and itsRetry-After/detailfields.