Skip to Content
How It WorksScheduler & Backpressure

Scheduler & Backpressure

After a write commits, topics still has to propagate it — wake the topic’s SSE watchers, run its routers, flush its commit batch. That post-write work is what the scheduler governs, and it is what has to hit the 1–5 ms delivery target. The scheduler is a banded weighted-fair queue over dirty topics, with aging to prevent starvation and a recency-based priority so recently-consumed topics are served first. When the machine is under CPU pressure, an elastic governor degrades latency along a defined ladder — and the cardinal rule is that it degrades latency, never correctness.

Scheduling affects delivery only — never seq order, retention, which records are visible, or whether a write is durable. A deprioritized or deferred topic stays fully consistent: its records are durably stored and immediately readable with POST /v0/topics/:topic/diff. Priority changes when the push happens, nothing else.

The schedulable unit is a dirty topic

The schedulable entity is a topic, not a record or a watcher. A write (or a router forward) marks its topic dirty and inserts it into its shard’s ready set — at most once; a membership bit prevents duplicates. This bounds the queue to the number of dirty topics and coalesces a burst of writes to one topic into a single unit of work.

A worker that picks up topic B drains B fully — wakes all its SSE watchers, forwards to all its router destinations, flushes its commit batch — before moving on. Draining one topic at a time under its own lock preserves per-topic ordering and amortizes the lock acquisition. Under unsaturated load the ready set is near-empty, so a topic is serviced within microseconds of being marked dirty.

Banded weighted-fair queue (DWRR)

A plain max-heap on priority would starve low-priority topics. Instead, effective priority buckets into five bands, drained by deficit weighted round-robin (DWRR):

BandEffective priorityWeight
B4>= 7508
B3500..7494
B2250..4992
B10..2491
B0< 0 (explicitly deprioritized)1

Within a band, topics are served FIFO by enqueue time. Across bands, each round grants credit proportional to weight: with these defaults, for every one B1/B0 topic serviced, up to eight top-band topics may be. High priority is strongly favored, but B1 and B0 always make forward progress every round — no band is ever frozen out.

Aging prevents starvation

A topic stuck at the bottom of a busy band shouldn’t wait forever, so it climbs over time:

age_boost = +100 per second waited, capped at +1000 after 10 seconds

A 50 ms aging tick promotes topics across band boundaries, and enqueued_at resets only when the topic is actually serviced — so a continuously-rewritten topic still ages. The combined guarantee: no topic waits more than 10 s before reaching the top band, and DWRR drains the top band every round, so worst-case scheduling latency is bounded even under sustained high-priority load.

Effective priority

Every topic has an effective priority combining a manual component, an automatic recency component, and the aging boost above:

P_eff = clamp(priority, -1000, 1000) # manual, when set — overrides + auto_recency # recency, when priority is null + age_boost # anti-starvation climb auto_recency = 500 * 2^( -(now - last_consumed_ms) / 30000 ) # 0 after a 5-min floor
  • Manual (priority set) — clamped to [-1000, 1000] and overrides the auto term. An operator can force a topic above all auto traffic with priority: 1000 (or below it with -1000).
  • Auto recency (priority: null, the default with auto_priority: true) — a freshly-consumed topic scores ~500 (≈ a +500 manual topic); the bonus halves every 30 s of inactivity, so an actively-polling consumer keeps its topic “hot” with negligible upkeep while a quiet topic sheds priority within a couple of minutes. After a 5-minute floor (300 s untouched) the auto term is forced to 0 — the math is skipped entirely.

P_eff is never stored as ground truth; it is computed on demand at enqueue time and on the 50 ms aging tick, using fixed-point math (a lookup table for 2^-x, no powf on the hot path). Higher P_eff is served first.

The pressure governor

A governor task samples three cheap signals every 100 ms into a single pressure ∈ [0, 1]: ready-set depth versus the worker count, EWMA scheduling latency versus the 5 ms ceiling, and the blocking/compute-pool busy ratio. pressure is published as a lock-free atomic and drives an escalating, composable ladder:

TriggerWhat the governor doesLoss?
pressure > 0.2Coalesce batches. Stop waking watchers per-record; pack a topic’s pending records into one multi-record frame/diff. Cheap, lossless, often improves throughput. Window grows 0–20 ms with pressure.No
pressure > 0.4Widen the group-commit window. commit_window_ms lerps from 0.5 ms up toward 10 ms — fewer fsyncs/sec, more headroom; the cost is up to +9.5 ms write-ack latency, observed as latency, never loss.No
pressure > 0.8 (sustained)Defer the lowest-value work. Routers are enqueued one band lower; B0/negative-priority topics stop receiving DWRR credit (hysteresis releases at pressure < 0.6) — their data is still durably stored and fully pollable, only the push is paused. At pressure ≈ 1.0 with a full ingest channel, the write endpoint returns 429 + Retry-After.No

The cardinal rule: throttling degrades latency and push-eagerness, never correctness. A deferred topic is always fully consistent on the next diff. All data loss remains the explicit, configured cap/TTL path with in-band tombstones; a full-write rejection is synchronous (422 topic_full / 429 throttled), never ack-then-drop.

What a client sees, by pressure level

ConditionClient-visible effectLoss?
Healthy (pressure < 0.2)1–5 ms delivery, per-record framesNo
Mild pressureCoalesced multi-record frames, ~5–15 msNo
Heavy pressureSlower write-acks; low-priority pushes paused but pollableNo
Saturation on write429 + Retry-After (write rejected synchronously)No
Cap/TTL crosses a cursor (independent of pressure)In-band tombstone with [gap_from, gap_to]Explicit, never silent

A writer that must not be throttled can bypass the 429 with disable_backpressure: true on the write. The 429 carries detail.retry_after_ms for CPU pressure (and detail.limit when a resource cap is the cause); see Errors.

Slow-consumer isolation

Each SSE connection has a bounded outbound channel. If a worker can’t enqueue a frame, it does not block the topic drain: the connection is marked lagged, the server stops buffering for it, and on the next successful send it emits a tombstone for the skipped range (on lossy topics) so the client catches up via diff. One slow client is contained to its own connection — the topic and every other watcher proceed at full speed. This is the same boundedness that lets the governor trade latency for throughput without unbounded memory growth.

See also

  • WAL & Group Commit — the adaptive commit window the governor widens under pressure.
  • Performance — measured delivery latency and the throughput ceilings the scheduler operates within.
  • Watch (SSE) — the live-delivery surface the scheduler drains topics into.
  • Errors — the 429 throttled body and its Retry-After / detail fields.
Last updated on