M0 semantics freeze: complete enough for core work
M1 pure state machine: implemented
M1H constant-time core hardening: complete
M2 durability and recovery: implemented
M3 submission pipeline: implemented
M4 simulation: implemented
M5 single-node alpha surface: implemented
M6 replication design: implemented
M7 replicated core prototype: in progress
M8 external cluster validation: in progress
M9 generic lease-kernel follow-on: implementation merged on main
M10 second-engine proof: merged on main; shared runtime extraction deferred
M11 third-engine proof: merged on main; broad shared runtime still deferred, first micro-extraction now justified
M12 first internal runtime extractions: planned
Latest completed implementation chunks:
4156a80 Bootstrap AllocDB core and docs
f84a641 Add WAL file and snapshot recovery primitives
d87c9a7 Add repo guardrails and status tracking
79ae34f Add snapshot persistence and replay recovery
1583d67 Use fixed-capacity maps in allocator core
3d6ff0f Fail closed on WAL corruption
39f103b Defer conditional confirm and add health metrics
82cb8d8 Add single-node submission engine crate
current validated chunk: seeded crash-point and WAL-fault coverage across submit, checkpoint,
and recovery boundaries; checked slot and LSN overflow handling; deterministic simulation over
contention, retry timing, and due-expiration ordering; replicated metadata bootstrap and
fail-closed faulted-state entry; majority-backed quorum writes with primary-only reads,
quorum-loss demotion, and higher-view takeover; suffix and snapshot-based stale-replica rejoin
with divergent prepared-suffix discard; promoted partition and primary-crash scenarios that
preserve fail-closed behavior and retry/read continuity after failover; the local
three-replica cluster runner, fault-control harness, and QEMU testbed around the real replica
daemon; the first trusted-core bundle-commit slice with bundle membership, bundle-aware
confirm/release/expire, and bundle regression coverage; the first fencing slice with
lease-epoch propagation, stale-holder rejection, and epoch-aware retry/read coverage; explicit
revoke/reclaim with late-not-early reuse preserved across replay and failover; lease-shaped
node API exposure for bundle membership and authority state; replicated preservation for
committed bundle membership and stale-holder rejection across failover and suffix/snapshot
rejoin; and live KubeVirt Jepsen lease-safety control and 1800s crash-restart runs with
blockers=0
explicit restart-and-retry handling for ambiguous WAL failures within the dedupe window
explicit lsn_exhausted write rejection after the engine commits the last representable LSN
node-level metrics for queue pressure, write acceptance, startup recovery status, and active
snapshot anchor
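As a sketch of the lsn_exhausted guard above: the engine can refuse to hand out a further LSN once the last representable one has been committed, rather than wrapping. The type and error names below are illustrative, not the real engine API:

    // Illustrative only: the real engine's LSN and error types differ.
    #[derive(Debug, PartialEq)]
    enum WriteError {
        LsnExhausted,
    }

    struct LsnAllocator {
        last_committed: u64,
    }

    impl LsnAllocator {
        /// Hand out the next LSN, failing closed instead of wrapping once the
        /// last representable LSN has been committed.
        fn next_lsn(&mut self) -> Result<u64, WriteError> {
            let next = self.last_committed.checked_add(1).ok_or(WriteError::LsnExhausted)?;
            self.last_committed = next;
            Ok(next)
        }
    }

    fn main() {
        let mut alloc = LsnAllocator { last_committed: u64::MAX - 1 };
        assert_eq!(alloc.next_lsn(), Ok(u64::MAX));                  // last representable LSN
        assert_eq!(alloc.next_lsn(), Err(WriteError::LsnExhausted)); // later writes rejected
    }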
Deterministic benchmark harness:
CLI entrypoint at cargo run -p allocdb-bench -- --scenario all
one-resource-many-contenders scenario for hot-spot reserve contention
high-retry-pressure scenario for duplicate replay, conflict replay, full dedupe table
rejection, and post-window recovery
scenario reports include elapsed time, throughput, metrics snapshots, and WAL byte counts
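A rough sketch of the shape such a scenario report might take, with throughput derived from the elapsed run time; the field and type names here are made up, not the real allocdb-bench types:

    // Illustrative report shape only; the real allocdb-bench report differs.
    use std::time::Duration;

    struct ScenarioReport {
        scenario: &'static str,
        elapsed: Duration,
        operations: u64,
        wal_bytes_written: u64,
    }

    impl ScenarioReport {
        /// Throughput derived from the elapsed wall time of the scenario run.
        fn ops_per_sec(&self) -> f64 {
            self.operations as f64 / self.elapsed.as_secs_f64()
        }
    }

    fn main() {
        let report = ScenarioReport {
            scenario: "one-resource-many-contenders",
            elapsed: Duration::from_millis(2_500),
            operations: 100_000,
            wal_bytes_written: 12_345_678,
        };
        println!(
            "{}: {:.0} ops/s, {} WAL bytes",
            report.scenario,
            report.ops_per_sec(),
            report.wal_bytes_written
        );
    }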
Alpha API surface:
transport-neutral request and response types in crates/allocdb-node::api
binary request and response codec with fixed-width little-endian encoding
explicit wire-level mapping for definite vs indefinite submission failures
strict-read fence responses plus halt-safe read rejection for resource and reservation queries
retired reservation lookups remain distinct from not_found across later writes and snapshot
restore through bounded retired-watermark metadata
bounded tick_expirations maintenance request for live TTL enforcement
metrics exposure through the same API boundary
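As an illustration of the fixed-width little-endian framing and the definite/indefinite failure split described in this list, a minimal request-header codec might look like the following; the field layout, widths, and names are assumptions, not the real crates/allocdb-node::api wire format:

    // Illustrative framing only; the real codec differs in layout and fields.
    #[derive(Debug, PartialEq)]
    struct RequestHeader {
        request_kind: u16,
        client_id: u64,
        request_id: u64,
    }

    /// Definite failures mean the write certainly did not take effect; indefinite
    /// failures mean the outcome is unknown and the client must retry the same
    /// request within the dedupe window. Wire codes here are placeholders.
    #[derive(Debug)]
    enum SubmitFailure {
        Definite(u16),
        Indefinite(u16),
    }

    impl RequestHeader {
        fn encode(&self) -> [u8; 18] {
            let mut buf = [0u8; 18];
            buf[0..2].copy_from_slice(&self.request_kind.to_le_bytes());
            buf[2..10].copy_from_slice(&self.client_id.to_le_bytes());
            buf[10..18].copy_from_slice(&self.request_id.to_le_bytes());
            buf
        }

        fn decode(buf: &[u8; 18]) -> RequestHeader {
            RequestHeader {
                request_kind: u16::from_le_bytes(buf[0..2].try_into().unwrap()),
                client_id: u64::from_le_bytes(buf[2..10].try_into().unwrap()),
                request_id: u64::from_le_bytes(buf[10..18].try_into().unwrap()),
            }
        }
    }

    fn main() {
        let header = RequestHeader { request_kind: 1, client_id: 42, request_id: 7 };
        assert_eq!(RequestHeader::decode(&header.encode()), header);
        let _ambiguous = SubmitFailure::Indefinite(0xffff);
    }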
Operator documentation:
operator-facing runbook for the single-node alpha, local replicated cluster runner, local QEMU testbed, and first Kubernetes deployment shape
Kubernetes deployment packaging:
one container build, one DNS-backed layout generator for cluster-layout.txt, and one first deploy/kubernetes install shape with a bootstrap-primary service and per-replica PVCs
one GitHub Actions image-publish workflow for Docker Hub staging and release tags
Follow-on planning:
one draft lease-kernel follow-on plan that narrows the next trusted-core additions to bundle
ownership, fencing, revoke, and an explicit liveness boundary, framed as generic
scarce-resource semantics rather than product-specific behavior
one draft lease-kernel design-decision document that chooses a first-class lease authority
object, bundle size 1 as the single-resource semantic special case, a lease-scoped fencing
token, and a two-stage revoke -> reclaim safety model
one merged authoritative-docs pass under issue #80 that rewrote semantics, API,
architecture, and fault-model docs to the approved lease-centric contract while keeping the
current reservation-centric implementation explicitly marked as compatibility surface
one merged M9-T08 planning note that narrows revoke/reclaim implementation scope before the
code-bearing revoke branch
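To make the design-decision item above concrete, here is a minimal sketch of a lease authority with a lease-scoped fencing epoch and the two-stage revoke -> reclaim model, at bundle size 1 so a single resource; the types and method names are illustrative, not the merged implementation:

    // Illustrative semantics sketch only; the real lease-authority types differ.
    #[derive(Debug, PartialEq)]
    enum LeaseState {
        Free,
        Held { epoch: u64 },
        Revoked { epoch: u64 }, // stage one: holder fenced out, resource not yet reusable
    }

    struct LeaseAuthority {
        state: LeaseState,
        next_epoch: u64,
    }

    impl LeaseAuthority {
        fn grant(&mut self) -> Result<u64, &'static str> {
            match self.state {
                LeaseState::Free => {
                    let epoch = self.next_epoch;
                    self.next_epoch += 1;
                    self.state = LeaseState::Held { epoch };
                    Ok(epoch) // lease-scoped fencing token handed to the holder
                }
                _ => Err("lease not free"),
            }
        }

        /// Holder operations carry the fencing epoch; stale holders are rejected.
        fn holder_op(&self, epoch: u64) -> Result<(), &'static str> {
            match self.state {
                LeaseState::Held { epoch: current } if current == epoch => Ok(()),
                _ => Err("stale holder rejected"),
            }
        }

        /// Stage one: revoke fences the holder but does not yet allow reuse.
        fn revoke(&mut self) {
            if let LeaseState::Held { epoch } = self.state {
                self.state = LeaseState::Revoked { epoch };
            }
        }

        /// Stage two: reclaim makes the resource reusable (late, never early, reuse).
        fn reclaim(&mut self) {
            if let LeaseState::Revoked { .. } = self.state {
                self.state = LeaseState::Free;
            }
        }
    }

    fn main() {
        let mut lease = LeaseAuthority { state: LeaseState::Free, next_epoch: 1 };
        let epoch = lease.grant().unwrap();
        assert!(lease.holder_op(epoch).is_ok());
        lease.revoke();
        assert!(lease.holder_op(epoch).is_err());         // fenced out after revoke
        assert_eq!(lease.grant(), Err("lease not free")); // no reuse before reclaim
        lease.reclaim();
        assert!(lease.grant().is_ok());                   // reuse only after reclaim
    }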
Replication design draft:
VSR-style primary/backup replicated log with fixed membership and majority quorums
primary-only reads in the first replicated release
protocol invariants that preserve single-node idempotency, strict-read, TTL, and
reservation-ID semantics across failover
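A minimal sketch of the quorum and read rules this draft describes, assuming the fixed three-replica membership used elsewhere in this document; the names are illustrative, not the real protocol code:

    // Illustrative only; names do not match the real replication implementation.
    #[derive(PartialEq)]
    enum Role {
        Primary,
        Backup,
    }

    struct ReplicaView {
        role: Role,
        membership_size: usize, // fixed membership, e.g. 3
    }

    impl ReplicaView {
        /// Majority quorum over the fixed membership: floor(n / 2) + 1 durable
        /// appends, counting the primary's own append.
        fn quorum(&self) -> usize {
            self.membership_size / 2 + 1
        }

        fn can_publish(&self, durable_acks: usize) -> bool {
            self.role == Role::Primary && durable_acks >= self.quorum()
        }

        /// Primary-only reads in the first replicated release: backups fail closed.
        fn can_serve_read(&self) -> bool {
            self.role == Role::Primary
        }
    }

    fn main() {
        let primary = ReplicaView { role: Role::Primary, membership_size: 3 };
        assert_eq!(primary.quorum(), 2);   // primary plus one backup
        assert!(primary.can_publish(2));
        assert!(!primary.can_publish(1));  // no majority: do not publish the result

        let backup = ReplicaView { role: Role::Backup, membership_size: 3 };
        assert!(!backup.can_serve_read()); // reads fail closed as "not primary"
    }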
Replicated validation planning:
deterministic cluster-simulation plan that extends seeded simulation to partitions, primary
crash, and rejoin without a mock semantics layer
Jepsen gate with explicit contention, ambiguity, failover, and expiration workloads
supplementary Jepsen lease-safety coverage for bundle reserve, revoke/reclaim, and stale-holder rejection without changing the documented release-gate matrix
retry-aware history interpretation and release-blocking invariants for duplicate execution,
stale successful reads, double allocation, early reuse, and stale-holder acceptance
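As one example of a retry-aware release-blocking invariant, a duplicate-execution check can require that every retry of the same (client, request) pair observes the same committed reservation id; the history and checker shapes below are illustrative, not the real analyzer:

    // Illustrative checker shape only; the real Jepsen analyzer is more involved.
    use std::collections::HashMap;

    struct SubmitOk {
        client_id: u64,
        request_id: u64,
        reservation_id: u64,
    }

    /// Retry-aware duplicate-execution check: two distinct committed reservation
    /// ids for the same (client, request) pair mean the command ran twice.
    fn duplicate_executions(history: &[SubmitOk]) -> Vec<(u64, u64)> {
        let mut seen: HashMap<(u64, u64), u64> = HashMap::new();
        let mut violations = Vec::new();
        for op in history {
            let key = (op.client_id, op.request_id);
            let committed = *seen.entry(key).or_insert(op.reservation_id);
            if committed != op.reservation_id {
                violations.push(key);
            }
        }
        violations
    }

    fn main() {
        let history = vec![
            SubmitOk { client_id: 1, request_id: 10, reservation_id: 100 },
            SubmitOk { client_id: 1, request_id: 10, reservation_id: 100 }, // benign retry
            SubmitOk { client_id: 2, request_id: 11, reservation_id: 101 },
            SubmitOk { client_id: 2, request_id: 11, reservation_id: 102 }, // duplicate execution
        ];
        assert_eq!(duplicate_executions(&history), vec![(2, 11)]);
    }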
Host-side Jepsen harness slice:
one release-gate matrix planner, one retry-aware history codec/analyzer, one host-side artifact bundler for duplicate-execution, double-allocation, stale-read, early-expiration, unresolved-ambiguity, and fetched external-cluster log checks, plus explicit verify-qemu-surface and verify-kubevirt-surface probes that exercise one real metrics round trip on every replica and one real primary submit/read round trip through the live replicated protocol surface
one supplementary lease_safety workload family with control and crash-restart runs that exercises bundle reserve, explicit revoke/reclaim, and stale-holder release against the live Jepsen surface without promoting that workload into the release-blocking matrix yet
one real run-qemu and one real run-kubevirt executor for the full documented release-gate matrix, with persisted histories and artifact bundles for control, crash-restart, partition-heal, and mixed-failover runs, plus host-side failover/rejoin orchestration built from replica workspace export/import and staged ReplicaNode::recover(...) rewrites
one capture-kubevirt-layout helper that records the live KubeVirt VM IPs, namespace, helper-pod settings, and SSH key path needed to drive the matrix from the host
Replicated node scaffolding:
dedicated replica metadata file with temp-write, rename, and directory-sync durability (see the sketch after this list)
persisted replica identity, role, view, commit point, snapshot anchor, last-normal view, and
optional durable vote metadata
startup bootstrap for missing metadata on both fresh-open and recover paths
fail-closed faulted state when metadata bytes are corrupt, identity is mismatched, or local
applied/snapshot state contradicts the persisted replicated metadata
configurable normal-mode primary and backup roles for one current view
explicit view_uncertain role plus durable higher-view voting for replicas that lost quorum
or are participating in failover
durable prepared-entry sidecar for pre-commit replicated client commands
prepare append, commit-through, and strict primary-read guards built around the existing
single-node executor rather than a second apply path
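A minimal sketch of the temp-write, rename, and directory-sync pattern behind the replica metadata file; the helper name, file names, metadata bytes, and Unix-style directory sync are assumptions for illustration:

    // Illustrative durability pattern only; paths and metadata layout are made up.
    use std::fs::{self, File, OpenOptions};
    use std::io::Write;
    use std::path::Path;

    /// Replace the replica metadata file atomically: write a temp file, sync it,
    /// rename it over the live file, then sync the containing directory so the
    /// rename itself survives a crash (directory sync as done on Unix-like systems).
    fn write_metadata_durably(dir: &Path, bytes: &[u8]) -> std::io::Result<()> {
        let tmp_path = dir.join("replica-metadata.tmp");
        let final_path = dir.join("replica-metadata");

        let mut tmp = OpenOptions::new()
            .create(true)
            .truncate(true)
            .write(true)
            .open(&tmp_path)?;
        tmp.write_all(bytes)?;
        tmp.sync_all()?;                     // data and length are on disk

        fs::rename(&tmp_path, &final_path)?; // atomic replacement of the old metadata
        File::open(dir)?.sync_all()?;        // directory sync: persist the rename entry
        Ok(())
    }

    fn main() -> std::io::Result<()> {
        let dir = std::env::temp_dir().join("allocdb-metadata-demo");
        fs::create_dir_all(&dir)?;
        write_metadata_durably(&dir, b"view=3 role=backup commit=42")?;
        Ok(())
    }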
Local multi-process cluster runner:
CLI entrypoint at cargo run -p allocdb-node --bin allocdb-local-cluster -- <start|stop|status|crash|restart|isolate|heal> ... with one persisted cluster-layout.txt
stable replica identities, local bounds, and three external replica processes from one command surface
per-replica loopback control, client, and protocol listeners with status and stop hooks on control
per-replica pid, log, WAL, snapshot, metadata, and prepared-log paths exposed through status, with restart through the real ReplicaNode::recover path and stable durable workspace reuse
one persisted cluster-faults.txt file that marks whole-replica client/protocol isolation without affecting control reachability, plus one append-only cluster-timeline.log for later checker/debug reuse
reserved client and protocol listeners now fail with explicit isolation errors when the local fault harness marks that replica isolated
real primary-side client/protocol transport for external submit, get_resource, get_reservation, get_metrics, and replicated tick_expirations, with majority append before publish and backup reads still failing closed as not primary
structured daemon-side logging for successful prepare quorum formation, commit-broadcast acknowledgements, accepted protocol prepare/commit traffic, expiration batch planning, and applied expiration commands
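A rough sketch of how a listener-side isolation check against the persisted fault file could work; the cluster-faults.txt format assumed here (one isolated replica id per line) and the helper are illustrative, not the real harness code:

    // Illustrative only; the real cluster-faults.txt format and error type differ.
    use std::collections::HashSet;
    use std::fs;
    use std::path::Path;

    #[derive(Debug, PartialEq, Clone, Copy)]
    enum Listener {
        Control,
        Client,
        Protocol,
    }

    /// Read the persisted fault file and decide whether a listener on `replica_id`
    /// may accept traffic. Control stays reachable so the harness can still
    /// inspect and heal the isolated replica.
    fn listener_allowed(faults_path: &Path, replica_id: &str, listener: Listener) -> Result<(), String> {
        let isolated: HashSet<String> = fs::read_to_string(faults_path)
            .unwrap_or_default()
            .lines()
            .map(|l| l.trim().to_string())
            .filter(|l| !l.is_empty())
            .collect();
        if listener != Listener::Control && isolated.contains(replica_id) {
            return Err(format!("replica {replica_id} is isolated by the local fault harness"));
        }
        Ok(())
    }

    fn main() {
        let path = std::env::temp_dir().join("cluster-faults.txt");
        fs::write(&path, "replica-b\n").unwrap();
        assert!(listener_allowed(&path, "replica-a", Listener::Client).is_ok());
        assert!(listener_allowed(&path, "replica-b", Listener::Client).is_err());
        assert!(listener_allowed(&path, "replica-b", Listener::Control).is_ok());
    }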
Durability primitives:
WAL frame codec and recovery scan
file-backed WAL append, sync, recovery, and torn-tail truncation (a recovery-scan sketch follows this list)
fail-closed recovery on middle-of-log corruption
fail-closed recovery on non-monotonic WAL replay metadata and malformed decoded snapshot
semantics
fail-closed recovery on replayed commands whose derived slot windows overflow configured
bounds
snapshot encode, decode, capture, restore
file-backed snapshot write and load
explicit WAL command payload encoding and live-path replay recovery
checkpoint path that writes the new snapshot first, then rewrites retained WAL history
one-checkpoint WAL overlap and snapshot_marker retention for safe checkpoint replacement
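The recovery-scan sketch referenced above: a torn final frame at the physical end of the file is truncated, while a fully present frame with a bad checksum fails recovery closed. The frame layout and checksum are toys for illustration, not the real WAL codec:

    // Toy frame layout: [payload_len: u32 LE][checksum: u32 LE][payload bytes].
    #[derive(Debug, PartialEq)]
    enum RecoveryOutcome {
        Clean { valid_len: usize },
        /// The last frame is cut off at end-of-file (a torn write): truncate back
        /// to `valid_len` and keep running.
        TornTail { valid_len: usize },
        /// A fully present frame fails its checksum: committed history may be
        /// damaged, so recovery fails closed instead of guessing.
        FailClosed { offset: usize },
    }

    fn checksum(payload: &[u8]) -> u32 {
        payload.iter().fold(0u32, |acc, b| acc.wrapping_add(*b as u32))
    }

    fn scan_wal(bytes: &[u8]) -> RecoveryOutcome {
        let mut offset = 0usize;
        while offset < bytes.len() {
            // A header or payload cut off at EOF is a torn tail, not corruption.
            if bytes.len() - offset < 8 {
                return RecoveryOutcome::TornTail { valid_len: offset };
            }
            let len = u32::from_le_bytes(bytes[offset..offset + 4].try_into().unwrap()) as usize;
            let sum = u32::from_le_bytes(bytes[offset + 4..offset + 8].try_into().unwrap());
            if bytes.len() - offset - 8 < len {
                return RecoveryOutcome::TornTail { valid_len: offset };
            }
            // The frame is fully present: a checksum mismatch here is real corruption.
            if checksum(&bytes[offset + 8..offset + 8 + len]) != sum {
                return RecoveryOutcome::FailClosed { offset };
            }
            offset += 8 + len;
        }
        RecoveryOutcome::Clean { valid_len: bytes.len() }
    }

    fn main() {
        let frame = |payload: &[u8]| {
            let mut f = (payload.len() as u32).to_le_bytes().to_vec();
            f.extend_from_slice(&checksum(payload).to_le_bytes());
            f.extend_from_slice(payload);
            f
        };
        let mut log = frame(b"reserve r1");
        log.extend(frame(b"confirm r1"));
        assert_eq!(scan_wal(&log), RecoveryOutcome::Clean { valid_len: log.len() });

        let torn_at = log.len();
        log.extend(&frame(b"reserve r2")[..5]); // torn final frame
        assert_eq!(scan_wal(&log), RecoveryOutcome::TornTail { valid_len: torn_at });
    }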
Deterministic simulation support:
reusable simulation harness in crates/allocdb-node/src/simulation.rs
explicit simulated slot advancement under test control, with no wall-clock reads in the
exercised engine path
seeded same-slot ready-set scheduling with reproducible transcripts
seeded labeled schedule actions that resolve candidate slot windows into replayable
submit/tick transcripts
seeded due-expiration selection over the real internal-expire path, bounded by the production
per-tick expiration limit
seeded one-shot crash plans over named client-submit, internal-apply, checkpoint, and
recovery boundaries
one-shot storage fault helpers over append failure, sync failure, checksum mismatch, and
torn-tail WAL mutation against real on-disk recovery
checkpoint, restart, and live write-fault helpers over the real SingleNodeEngine
regression coverage for crash-selected post-sync submit replay, crash-after-snapshot-write
checkpoint recovery, replay-interrupted recovery restart, sync-failure retry recovery,
checksum-corruption fail-closed restart, torn-tail truncation retry, ingress contention winner
order, same-deadline expiration order, mixed-deadline earliest-first expiration priority, and
retry timing across the dedupe window
reusable replicated cluster harness in crates/allocdb-node/src/replicated_simulation.rs
three real ReplicaNodes with independent WAL, snapshot, and metadata workspaces
explicit replica-to-replica and client-to-replica connectivity matrix under test control
explicit protocol-message queue plus replayable transcripts for queue, deliver, drop, crash,
and restart actions
real prepare, prepare_ack, and commit protocol payload delivery on that queue
configured-primary client submit flow with result publication only after majority durable
append
retry-aware client submit helper that returns one cached committed result on the current
primary instead of assigning a fresh replicated LSN
backup replicas that durably append prepares but do not apply allocator state until commit
primary-only resource reads guarded by the existing strict-read fence after local commit
automatic quorum-loss detection that demotes a stranded primary out of service
explicit higher-view takeover that records durable votes from a reachable majority,
reconstructs the safe committed prefix on the new primary, discards stale uncommitted suffix,
and drops old-view protocol messages
replica crash as loss of volatile state with restart through real ReplicaNode::recover
checkpoint-assisted rejoin that rewrites one stale replica from suffix-only WAL catch-up or
snapshot transfer, then restarts through the real recovery path before returning the replica
to backup mode
regression coverage for quorum-loss fail-closed reads and writes, higher-view takeover with
stale-primary read rejection, prepared-suffix recovery from another voter during takeover,
isolated-backup partition heal and catch-up, non-quorum split fail-closed behavior with later
rejoin convergence, primary crash before quorum append, primary crash after majority append,
primary crash after reply, suffix-only rejoin, snapshot-transfer rejoin, and faulted rejoin
rejection
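A small sketch of the decision behind the checkpoint-assisted rejoin item above: ship the missing committed suffix when the primary's retained WAL still covers the stale replica's commit point, otherwise transfer the snapshot; either way the replica then restarts through the real recovery path. The names and the exact rule are illustrative, not the real harness code:

    // Illustrative decision rule only; the real rejoin path works on real WAL and snapshot files.
    #[derive(Debug, PartialEq)]
    enum RejoinPlan {
        /// The stale replica's commit point is still covered by the primary's
        /// retained WAL: ship only the missing committed suffix.
        SuffixCatchUp { from_lsn: u64 },
        /// The stale replica fell behind the primary's snapshot anchor: transfer
        /// the snapshot, then the WAL retained after it.
        SnapshotTransfer { snapshot_anchor: u64 },
    }

    fn plan_rejoin(stale_commit_lsn: u64, primary_snapshot_anchor: u64) -> RejoinPlan {
        if stale_commit_lsn >= primary_snapshot_anchor {
            RejoinPlan::SuffixCatchUp { from_lsn: stale_commit_lsn + 1 }
        } else {
            RejoinPlan::SnapshotTransfer { snapshot_anchor: primary_snapshot_anchor }
        }
    }

    fn main() {
        // Replica missed a few commits but the primary still retains that WAL suffix.
        assert_eq!(plan_rejoin(90, 80), RejoinPlan::SuffixCatchUp { from_lsn: 91 });
        // Replica is older than the primary's snapshot anchor: full snapshot transfer.
        assert_eq!(plan_rejoin(40, 80), RejoinPlan::SnapshotTransfer { snapshot_anchor: 80 });
    }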
Validation:
core durability: cargo test -p allocdb-core wal -- --nocapture, cargo test -p allocdb-core snapshot -- --nocapture, cargo test -p allocdb-core recovery -- --nocapture, cargo test -p allocdb-core snapshot_restores_retired_lookup_watermark
node runtime: cargo test -p allocdb-node api_reservation_reports_retired_history, cargo test -p allocdb-node engine -- --nocapture, cargo test -p allocdb-node replica -- --nocapture
simulation: cargo test -p allocdb-node simulation -- --nocapture, cargo test -p allocdb-node replicated_simulation -- --nocapture
local cluster, qemu assets, Jepsen harness, and benchmarks: cargo test -p allocdb-node local_cluster -- --nocapture, cargo test -p allocdb-node qemu_testbed -- --nocapture, cargo test -p allocdb-node jepsen -- --nocapture, cargo test -p allocdb-node --bin allocdb-jepsen -- --nocapture, cargo run -p allocdb-node --bin allocdb-jepsen -- plan, cargo run -p allocdb-bench -- --scenario all
PR #82 merged the #70 maintainability follow-up, including live KubeVirt reservation_contention-control and full 1800s reservation_contention-crash-restart reruns on allocdb-a with blockers=0
M9-T01 through M9-T05 are merged on main via PR #81, and the planning issues are closed on the AllocDB project
PRs #89, #90, #92, #93, #94, and #95 merged the full M9-T06 through M9-T11
implementation chain on main: bundle commit, lease-epoch fencing, explicit revoke/reclaim,
lease-shaped node API exposure, replication-preserved failover behavior, and broader
simulation coverage are now all in the mainline implementation
PR #97 resolved issue #96, extending Jepsen history generation and analysis for bundle
reserve, revoke/reclaim, and stale-holder lease paths and closing the loop with live KubeVirt
lease_safety-control and full 1800s lease_safety-crash-restart evidence on allocdb-a with blockers=0
the next recommended step remains downstream real-cluster e2e work such as gpu_control_plane, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster StatefulSet shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish skel84/allocdb from GitHub Actions rather than relying on the local Docker engine
PR #107 merged the M10 quota-engine proof on main, and PRs #116, #117, and #118 merged the full M11 reservation-core chain on main: the repository now has second and third deterministic engines with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
PRs #132, #133, and #134 merged the first M12 runtime extractions on main: retire_queue, wal, and wal_file are now shared internal substrate instead of copied engine-local modules, while M12-T04 closed as a defer decision because snapshot_file is still only a clean seam inside the quota-core / reservation-core pair and allocdb-core keeps the simpler file format
the next roadmap step is now M13: define the internal engine authoring boundary in runtime-extraction-roadmap.md, keep shared runtime below the semantic line, keep command surfaces, snapshot schemas, recovery entry points, and state-machine meaning engine-local, publish the focused runtime-vs-engine-contract note as the shorter authoring reference, and narrow the next proof shape so M14 still matters but no longer defaults to a full fourth engine