AllocDB Architecture
Scope
This document defines the execution model, logical-time handling, boundedness rules, and trusted-core system shape.
Current Target
The implementation target is intentionally narrow:
- single process
- single shard
- single writer and executor thread
- deterministic TTL expiration through logged events
Design Constraints
- The WAL is the source of truth.
- Only the executor mutates allocation state.
- Every state transition must be replayable from persisted input.
- The state machine must not read wall-clock time, random numbers, or thread interleavings.
- All hot-path queues, maps, tables, buffers, and retention windows must have explicit bounds.
- Expected operating failures return deterministic result codes. Assertions are for programmer error, data corruption, and broken invariants.
- Correctness beats reclaim latency. A resource may become reusable late, but never early.
- The trusted core targets allocation-free steady-state execution after startup.
Core Components
- API ingress
- bounded submission queue
- sequencer and WAL writer
- executor
- expiration scheduler
- snapshot writer
Ingress or networking code may be async if needed. The trusted core boundary is synchronous and explicit: once a command enters the sequencer and executor path, no async suspension or lock-based interleaving is allowed in the core state machine.
Write Path
client
-> ingress validation
-> submission queue
-> sequencer assigns lsn and request_slot
-> append WAL record
-> fsync or group commit
-> executor applies state transition
-> publish result
Rules:
- the executor is single-threaded for one shard
- live execution and replay use the same apply logic
- publish never rewrites the result after the command is applied
Current implementation anchor:
- `allocdb_node::SingleNodeEngine` in `crates/allocdb-node/src/engine.rs`
- `allocdb_node::api` in `crates/allocdb-node/src/api.rs` for the current transport-neutral alpha request/response boundary
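The write-path rules above can be sketched as a single-threaded loop. This is a minimal illustration under stated assumptions, not the real crate API: `Engine`, `Command`, and `Outcome` are hypothetical names, fsync/group commit is elided, and the in-memory `Vec` stands in for the durable WAL.

```rust
// Hypothetical types; names are illustrative, not the real allocdb API.
#[derive(Clone, Debug, PartialEq)]
enum Command {
    Reserve { resource_id: u64, ttl_slots: u64 },
}

#[derive(Debug, PartialEq)]
enum Outcome {
    Applied { lsn: u64 },
}

struct Engine {
    next_lsn: u64,
    wal: Vec<(u64, Command)>, // stand-in for the durable WAL
    applied_lsn: u64,
}

impl Engine {
    // Single-threaded write path: sequence, append, (fsync elided), apply, publish.
    fn submit(&mut self, cmd: Command) -> Outcome {
        let lsn = self.next_lsn; // sequencer assigns the lsn
        self.next_lsn += 1;
        self.wal.push((lsn, cmd.clone())); // append WAL record
        self.apply(lsn, &cmd); // executor applies the state transition
        Outcome::Applied { lsn } // publish the result exactly once
    }

    // The same apply logic serves live execution and replay.
    fn apply(&mut self, lsn: u64, _cmd: &Command) {
        self.applied_lsn = lsn;
    }

    // Replay rebuilds state by re-running apply over the persisted WAL.
    fn replay(wal: &[(u64, Command)]) -> Engine {
        let mut e = Engine { next_lsn: 0, wal: Vec::new(), applied_lsn: 0 };
        for (lsn, cmd) in wal {
            e.apply(*lsn, cmd);
            e.next_lsn = lsn + 1;
        }
        e
    }
}
```

Because the WAL is the source of truth and apply is shared, replaying the log reproduces exactly the state the live path produced.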
Read Path
client
-> read request
-> check applied_lsn >= required_lsn
-> answer from in-memory state or return fence_not_applied
For the trusted core, this is enough for strict reads in the current implementation.
If the live engine halts after a WAL-path ambiguity, reads must fail closed until recovery reconstructs memory from durable state.
Current implementation anchor:
`SingleNodeEngine::enforce_read_fence(required_lsn)` in `crates/allocdb-node/src/engine_observe.rs`
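A minimal sketch of the read-fence check, assuming hypothetical `ReadView` and `ReadError` types (the real check lives in `SingleNodeEngine::enforce_read_fence`); it also shows the fail-closed behavior after a halt:

```rust
// Illustrative types; names are assumptions, not the real crate API.
#[derive(Debug, PartialEq)]
enum ReadError {
    FenceNotApplied,
    HaltedFailClosed,
}

struct ReadView {
    applied_lsn: u64,
    halted: bool,
}

impl ReadView {
    fn enforce_read_fence(&self, required_lsn: u64) -> Result<(), ReadError> {
        if self.halted {
            // After a WAL-path ambiguity the engine fails closed:
            // no reads until recovery reconstructs memory from durable state.
            return Err(ReadError::HaltedFailClosed);
        }
        if self.applied_lsn >= required_lsn {
            Ok(()) // safe to answer from in-memory state
        } else {
            Err(ReadError::FenceNotApplied)
        }
    }
}
```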
Expiration Path
slot ticker
-> inspect due reservations
-> enqueue internal expire commands
-> append to WAL
-> executor applies expire
Rules:
- the scheduler never mutates allocation state directly
- at most `MAX_EXPIRATIONS_PER_TICK` expirations are enqueued per tick
- lag must be observable as an explicit metric outside the trusted core
Current implementation anchor:
`SingleNodeEngine::tick_expirations(current_wall_clock_slot)` in `crates/allocdb-node/src/engine.rs`
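The bounded tick can be sketched as a pure function from due reservations to the expire commands the scheduler would enqueue; the function name, the pair layout, and the constant value are illustrative, not the real implementation:

```rust
// Illustrative per-tick bound; the real value comes from configuration.
const MAX_EXPIRATIONS_PER_TICK: usize = 4;

// Given (reservation_id, deadline_slot) pairs, return the ids to enqueue as
// internal expire commands. The scheduler never mutates state directly; any
// due reservations beyond the bound simply lag until a later tick.
fn tick_expirations(reservations: &[(u64, u64)], current_slot: u64) -> Vec<u64> {
    reservations
        .iter()
        .filter(|(_, deadline)| *deadline <= current_slot)
        .take(MAX_EXPIRATIONS_PER_TICK) // bound the work done per tick
        .map(|(id, _)| *id)
        .collect()
}
```

Bounding the tick keeps scheduler work constant under backlog; the resulting lag is safe because a resource is only reused after the expire command is applied.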
Time and TTL Model
The state machine never reads the system clock directly.
Required configuration:
slot_duration_ms : u64
max_ttl_slots : u64
max_client_retry_window_slots : u64
reservation_history_window_slots : u64
max_expiration_bucket_len : u32
Rules:
- external APIs may accept `ttl_ms`, but the WAL and executor operate only on slots
- `max_ttl_slots * slot_duration_ms <= 3_600_000`
- `reservation_history_window_slots <= max_ttl_slots`
Crossing a deadline does not instantly free the resource. A resource becomes reusable only after
the corresponding expire command is committed and applied.
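The `ttl_ms`-to-slot conversion at the API boundary might look like the following sketch. The error names are assumptions, and `slot_duration_ms` is assumed nonzero (guaranteed by configuration validation); rounding up ensures a deadline never lands earlier than the requested TTL, matching the rule that late reclaim is acceptable but early reuse is not:

```rust
// Illustrative error type; names are assumptions, not the real crate API.
#[derive(Debug, PartialEq)]
enum TtlError {
    ZeroTtl,
    TtlTooLong,
}

// Convert an external ttl_ms into the slot count the WAL and executor use.
// Assumes slot_duration_ms > 0 (enforced by configuration validation).
fn ttl_ms_to_slots(
    ttl_ms: u64,
    slot_duration_ms: u64,
    max_ttl_slots: u64,
) -> Result<u64, TtlError> {
    if ttl_ms == 0 {
        return Err(TtlError::ZeroTtl);
    }
    // Round up: the deadline must never be earlier than the requested TTL.
    let slots = (ttl_ms + slot_duration_ms - 1) / slot_duration_ms;
    if slots > max_ttl_slots {
        return Err(TtlError::TtlTooLong);
    }
    Ok(slots)
}
```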
Retention and Capacity Model
The design uses one fixed-capacity reservation table for both active and recently terminal reservations.
Rules:
- active reservations occupy entries until they terminate
- terminal reservations keep their entry until `retire_after_slot`
- retirement frees the slot for reuse
- retirement also advances a bounded retired-lookup watermark so later reservation lookups stay distinct from `not_found` after the full record is dropped
This keeps history bounded and prevents the product-level history policy from silently making the core unbounded.
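The entry lifecycle in the shared table can be sketched with a small enum; `Entry` and `can_reuse` are hypothetical names for illustration only:

```rust
// Illustrative lifecycle of one entry in the fixed-capacity reservation table.
#[derive(Debug, PartialEq)]
enum Entry {
    Active,
    // Terminal entries linger until retire_after_slot so lookups inside the
    // history window stay distinguishable from not_found.
    Terminal { retire_after_slot: u64 },
}

// An entry's slot may be reclaimed only once its retention window has elapsed.
fn can_reuse(entry: &Entry, current_slot: u64) -> bool {
    match entry {
        Entry::Active => false,
        Entry::Terminal { retire_after_slot } => current_slot > *retire_after_slot,
    }
}
```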
Backpressure and Bounds
At minimum define:
- `MAX_SUBMISSION_QUEUE`
- `MAX_BATCH_SIZE`
- `MAX_COMMAND_BYTES`
- `MAX_RESOURCES`
- `MAX_RESERVATIONS`
- `MAX_OPERATION_RECORDS`
- `MAX_TTL_SLOTS`
- `RESERVATION_HISTORY_WINDOW_SLOTS`
- `MAX_EXPIRATION_BUCKET_LEN`
- `MAX_EXPIRATIONS_PER_TICK`
Expected behavior under pressure:
- new writes fail fast with `overloaded` or a more specific capacity error
- reads remain available where possible
- expirations may lag, but lag must be observable
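Fail-fast admission at the submission queue can be sketched as follows; the function, error names, and constant values are illustrative assumptions, not the real crate API:

```rust
// Illustrative error type and bounds; real values come from configuration.
#[derive(Debug, PartialEq)]
enum SubmitError {
    Overloaded,
    CommandTooLarge,
}

const MAX_SUBMISSION_QUEUE: usize = 2;
const MAX_COMMAND_BYTES: usize = 64;

// Reject work before it enters the core: never block, never grow unbounded.
fn admit(queue: &mut Vec<Vec<u8>>, cmd: Vec<u8>) -> Result<(), SubmitError> {
    if cmd.len() > MAX_COMMAND_BYTES {
        return Err(SubmitError::CommandTooLarge); // specific capacity error
    }
    if queue.len() >= MAX_SUBMISSION_QUEUE {
        return Err(SubmitError::Overloaded); // fail fast under pressure
    }
    queue.push(cmd);
    Ok(())
}
```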
Required operational signals:
- `logical_slot_lag = max(0, current_wall_clock_slot - last_request_slot)`
- expiration backlog, for example the number of due expirations not yet applied
- `operation_table_utilization`, so dedupe-window pressure is visible before `operation_table_full`
- recovery and checkpoint status, including:
  - how the current process started (`fresh_start`, `wal_only`, `snapshot_only`, or `snapshot_and_wal`)
  - which snapshot LSN was loaded at startup, if any
  - how many WAL frames were replayed at startup
  - what snapshot LSN is currently the active durable anchor
Current implementation anchor:
- `AllocDb::health_metrics(current_wall_clock_slot)` in `crates/allocdb-core/src/state_machine_metrics.rs`
- `SingleNodeEngine::metrics(current_wall_clock_slot)` in `crates/allocdb-node/src/engine_observe.rs`
Delayed expiration is acceptable. Premature reuse is not.
Expiration Index
The expiration index is a fixed-capacity timing wheel keyed by deadline_slot.
Rules:
- each slot holds a bounded list of reservation references
- the wheel size is derived from `MAX_TTL_SLOTS`
- if a slot bucket reaches `MAX_EXPIRATION_BUCKET_LEN`, new reserves fail fast with `expiration_index_full`
This is a fundamental design decision, not an open question.
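A minimal fixed-capacity timing wheel along these lines is sketched below. The struct, methods, and sizes are illustrative assumptions; in the real design the wheel is sized from `MAX_TTL_SLOTS`, which also guarantees that deadlines inside the TTL window cannot collide across wheel laps:

```rust
// Illustrative sizes; the real wheel is sized from MAX_TTL_SLOTS.
const WHEEL_SLOTS: usize = 8;
const MAX_EXPIRATION_BUCKET_LEN: usize = 2;

#[derive(Debug, PartialEq)]
enum IndexError {
    ExpirationIndexFull,
}

struct ExpirationWheel {
    // Each slot holds a bounded list of reservation references.
    buckets: [Vec<u64>; WHEEL_SLOTS],
}

impl ExpirationWheel {
    fn new() -> Self {
        ExpirationWheel {
            buckets: std::array::from_fn(|_| Vec::new()),
        }
    }

    // Reject the reserve up front if the bucket is full: fail fast, stay bounded.
    fn insert(&mut self, reservation_id: u64, deadline_slot: u64) -> Result<(), IndexError> {
        let bucket = &mut self.buckets[(deadline_slot as usize) % WHEEL_SLOTS];
        if bucket.len() >= MAX_EXPIRATION_BUCKET_LEN {
            return Err(IndexError::ExpirationIndexFull);
        }
        bucket.push(reservation_id);
        Ok(())
    }

    // Drain the bucket for a slot when the ticker reaches it.
    fn take_due(&mut self, slot: u64) -> Vec<u64> {
        std::mem::take(&mut self.buckets[(slot as usize) % WHEEL_SLOTS])
    }
}
```

Failing the reserve at insert time, rather than growing the bucket, is what keeps the index capacity a hard invariant instead of a soft target.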