AllocDB Replication Protocol
Status
This document chooses the protocol family, states the safety invariants replication must preserve, and narrows the replicated release. The replicated simulation plan and Jepsen gate now live in testing.md.
Scope
The first replicated design target is intentionally narrow:
- one shard
- one fixed membership replica group
- one primary at a time
- one replicated log order per shard
- one deterministic allocator executor per replica
Sharding, reconfiguration, follower reads, and flexible quorum rules are deferred.
Design Goal
Replication exists to add availability and failover without rewriting the single-node allocator
semantics already fixed in M0 through M5.
The replication layer is allowed to:
- replicate the log
- choose a primary
- recover and rejoin replicas
- delay visibility during failover
It is not allowed to:
- change command meanings
- change result-code meanings
- invent a second execution path that bypasses the trusted-core state machine
- make resources reusable earlier than the single-node rules permit
Single-Node Semantics That Must Stay Fixed
Replication must preserve these rules exactly:
- command application stays deterministic
- reservation IDs stay derived from committed log position
- `operation_id` retries remain the way clients resolve indefinite outcomes
- strict reads stay defined as "up to a specified applied LSN"
- TTL stays logical-slot based
- expiration may free a resource late, but never early
- bounded retention and bounded retired-history semantics stay part of the product contract
The current authoritative definitions of these rules remain unchanged.
Chosen Protocol Family
The first replicated AllocDB design is a viewstamped-replication-style primary/backup protocol with majority quorums.
This draft uses the VSR vocabulary:
`view`, `primary`, `backup`, `prepare`, `commit`, `view change`
Why this is the right fit for AllocDB:
- the current single-node engine already has one explicit sequencer and one explicit apply path
- the core safety boundary is already "one ordered WAL, one executor, one result per LSN"
- the product already treats log position as a first-class identifier because `reservation_id` derives from committed `lsn`
- view change and protocol-aware recovery matter more to AllocDB than follower-read ergonomics or configuration churn in the first replicated release
This is intentionally not a Flexible Paxos design and not a reconfiguration-heavy Raft variant in the first replicated version. Majority quorums and fixed membership are the simpler bounded choice.
Replica Group Shape
The first replicated release assumes:
- fixed odd-sized replica groups
- majority quorum for normal replication
- majority quorum for view change
- no witness nodes
- no learner-only replicas in the core protocol
- no online membership change
Recommended first deployment sizes:
- 3 replicas for the minimum fault-tolerant production shape
- 5 replicas only when the higher write-quorum latency is acceptable
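As a quick sanity check on those sizes, majority-quorum arithmetic can be sketched in a few lines (a minimal illustration; the function name is hypothetical, not from the implementation):

```python
def majority_quorum(replicas: int) -> int:
    """Smallest ack count such that any two quorums share a replica."""
    if replicas < 1 or replicas % 2 == 0:
        # the first replicated release assumes fixed odd-sized groups
        raise ValueError("replica group must be a fixed odd size")
    return replicas // 2 + 1

# a 3-replica group commits with 2 durable acks and tolerates 1 failure;
# a 5-replica group commits with 3 durable acks and tolerates 2 failures
```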
Replica Roles
Each replica is always in one of these protocol states:
- `primary`: accepts writes for the current view
- `backup`: durably appends and later applies committed log entries
- `recovering`: rebuilding local state from validated durable state plus cluster catch-up
- `faulted`: excluded from voting and catch-up because local durable state failed validation
Only the primary may:
- assign new log positions
- publish committed write results to clients
- synthesize internal `expire` commands
- advance the replicated commit point
Backups never mutate allocator state from local timers, local wall clock, or speculative client execution.
Persistent Replica State
Every replica must durably persist at least:
- `replica_id`
- `shard_id`
- `current_view`
- durable log entries keyed by `lsn`
- `commit_lsn`
- local snapshot anchor metadata
- enough election/view-change state to guarantee at most one durable vote per view
The exact persisted election record may look like `voted_for`, `last_normal_view`, or an
equivalent durable view-participation marker. The implementation detail can vary; the safety
obligation cannot.
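The at-most-one-durable-vote obligation can be sketched as an fsync-then-atomic-rename update of a local record (a hedged illustration; the JSON file layout and names here are assumptions, not the implementation's format):

```python
import json
import os


def grant_vote(state_path: str, view: int, candidate: str) -> bool:
    """Durably record at most one vote per view.

    Re-granting the same candidate is idempotent; a second candidate in the
    same view is refused, which is the safety obligation the text names.
    """
    record = {"votes": {}}
    if os.path.exists(state_path):
        with open(state_path) as f:
            record = json.load(f)
    prior = record["votes"].get(str(view))
    if prior is not None:
        return prior == candidate  # never two candidates in one view
    record["votes"][str(view)] = candidate
    tmp = state_path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(record, f)
        f.flush()
        os.fsync(f.fileno())       # vote is durable before it is visible
    os.replace(tmp, state_path)    # atomic publish of the vote record
    return True
```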
The first M7-T01 implementation persists this wrapper state in one dedicated local metadata file
alongside the replica workspace. The current persisted fields are:
- `replica_id`
- `shard_id`
- `current_view`
- `role`
- `commit_lsn`
- `active_snapshot_lsn`
- `last_normal_view`
- optional durable vote record `(view, voted_for)`
Replica startup validates identity, vote/view ordering, commit-versus-snapshot consistency, and
that the local applied LSN plus snapshot anchor exactly match this metadata before a replica can
join. Decode failure, metadata inconsistency, unreadable or permission-denied sidecars, applied
state that is ahead of or behind the persisted commit_lsn, or an explicitly persisted faulted
role leaves the replica in faulted state until repaired.
Replicated Log Contents
The replicated log contains allocator-relevant entries only:
- client commands
- internal `expire` commands
Each replicated client entry carries:
- `view`
- `lsn`
- `operation_id`
- `client_id`
- `request_slot`
- encoded command payload
Each replicated internal expiration entry carries:
- `view`
- `lsn`
- `reservation_id`
- `deadline_slot`
- `request_slot`
Local checkpoint markers, snapshot rewrite metadata, and other storage housekeeping remain local durability details. They must not become separate replicated state-machine commands.
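The two replicated entry shapes above can be sketched as plain records. Field names follow the lists; the class names and types are illustrative, not the wire encoding:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientEntry:
    """A replicated client command."""
    view: int
    lsn: int
    operation_id: str
    client_id: str
    request_slot: int
    payload: bytes  # encoded command payload


@dataclass(frozen=True)
class ExpireEntry:
    """A primary-synthesized internal expiration command."""
    view: int
    lsn: int
    reservation_id: int
    deadline_slot: int
    request_slot: int
```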
Normal Write Path
The normal replicated write path is:
client
-> current primary
-> deterministic ingress validation
-> assign next lsn in current view
-> append locally
-> send prepare(view, lsn, prev_lsn, commit_lsn, entry) to backups
-> majority durable append
-> mark committed
-> apply through allocator executor
-> reply to client
Rules:
- the primary must not publish a committed result before a majority has durably appended the entry
- backups must not apply an entry before it is known committed
- committed entries are applied in `lsn` order only
- live execution and replay still share the same allocator apply logic
- internal `expire` commands use the same replicated path as client commands
This keeps the single-node rule intact: one committed command, one log position, one replay result.
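The majority-durable-append rule reduces to a single predicate the primary evaluates per entry before publishing a result. A minimal sketch with assumed names; transport and the durable append itself are elided:

```python
def can_commit(acks: set[str], group: set[str], primary: str) -> bool:
    """True once a majority of the group has durably appended the entry.

    The primary's own local durable append counts toward the majority, so a
    3-replica group commits after one backup ack.
    """
    durable = acks | {primary}
    return len(durable & group) >= len(group) // 2 + 1
```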
Read Path
The first replicated release serves reads only from the current primary.
Follower reads are deliberately out of scope because they would force extra semantics around leases, read-index confirmation, or stale-read modes before the basic protocol has been validated.
Read rules:
- the primary serves a strict-read only from locally applied committed state
- the `required_lsn` fence keeps the same meaning as in single-node mode
- a replica in `backup`, `recovering`, `faulted`, or view-uncertain state does not serve API reads
- during view change or quorum ambiguity, reads fail closed instead of guessing
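Those read rules collapse into one fail-closed predicate. A minimal sketch, with names assumed for illustration:

```python
def may_serve_strict_read(role: str, view_certain: bool,
                          applied_lsn: int, required_lsn: int) -> bool:
    """Fail closed: only a view-certain primary that has locally applied at
    least required_lsn may answer a strict read; everything else refuses."""
    return role == "primary" and view_certain and applied_lsn >= required_lsn
```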
View Change And Failover
AllocDB needs explicit failover rules because client-visible ambiguity is already part of the single-node design.
The protocol rules are:
- there is at most one primary in normal mode for any given view
- a replica that observes a higher view immediately stops acting as primary in the older view
- a primary that loses quorum stops accepting writes and stops serving reads
- a new primary must gather enough state from a majority to reconstruct the latest safe log prefix before entering normal mode
- committed entries survive the view change unchanged
- uncommitted suffix entries from the old primary may be discarded
The current replicated prototype realizes those rules with one explicit `view_uncertain` role and
durable higher-view vote metadata:
- a replica that loses quorum or votes for a higher view leaves normal mode and stops serving reads or accepting fresh client writes
- the new primary records one durable vote from a reachable majority before it re-enters normal mode
- view change reconstructs the latest committed prefix on the new primary, discards stale uncommitted suffix state, and drops old-view protocol messages instead of trying to finish them
Client impact:
- if the old primary fails before replying, the client still has an indefinite outcome
- the client resolves that ambiguity by retrying the same `operation_id`
- the new primary must return the already committed result if the command committed in an earlier view
- the protocol must never create a second fresh execution for the same committed `operation_id`
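The retry contract can be sketched as a committed-result table keyed by `operation_id` (illustrative only; in the real system this state is derived from the replicated log and bounded by the retained-history contract):

```python
class ResultCache:
    """Committed results keyed by operation_id.

    A retry after failover finds the original committed result instead of
    triggering a second fresh execution; a miss means the outcome is still
    indefinite and the client keeps retrying.
    """

    def __init__(self) -> None:
        self._committed: dict[str, object] = {}

    def record(self, operation_id: str, result: object) -> None:
        # first committed result wins; later attempts must not overwrite it
        self._committed.setdefault(operation_id, result)

    def lookup(self, operation_id: str):
        return self._committed.get(operation_id)
```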
Recovery And Rejoin
Replication does not weaken the existing local durability rules.
A restarting replica first validates its own local durable state using the same fail-closed rules already required in single-node mode:
- malformed snapshot input is rejected
- invalid WAL framing or checksum failure is rejected or truncated only at valid tail boundaries
- semantically invalid recovered allocator state is rejected
After local validation:
- a valid but stale replica catches up from the primary by log suffix or snapshot-plus-suffix
- a replica with divergent uncommitted suffix may truncate that suffix during catch-up
- a replica must not discard committed history unless a validated snapshot replaces the same committed prefix
- a replica with irreconcilable corruption enters `faulted` state and must not vote or serve until repaired
Protocol-aware recovery rule:
- the primary may use knowledge of committed `lsn` and snapshot anchor to decide whether suffix catch-up is sufficient or snapshot transfer is required
- suffix-only catch-up is allowed only when the stale replica already retains one committed durable prefix recent enough for the primary's current retained WAL window
- when the target is older than that retained-WAL floor, the primary must transfer one validated snapshot plus its retained WAL suffix instead of assuming the missing prefix still exists locally
- rejoin clears any prepared-but-uncommitted suffix before the replica returns to backup mode, and it also drops stale protocol messages that still reference the rejoined replica's old state
- rejoin must not move one replica backward in durable view knowledge; a target that has already observed or voted in a higher view than the current primary stays out of service until a compatible higher-view recovery path is available
- a replica already in `faulted` state is not auto-repaired by rejoin; it stays out of service until an operator repairs or replaces it
- recovery must preserve the same committed prefix seen by healthy replicas
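The suffix-versus-snapshot decision can be sketched as a comparison against the primary's retained-WAL floor (the names and the exact boundary condition here are assumptions for illustration):

```python
def catch_up_plan(target_durable_lsn: int, retained_wal_floor: int) -> str:
    """Pick the rejoin transfer strategy for a stale but valid replica.

    Suffix-only catch-up works when the first entry the target is missing
    (target_durable_lsn + 1) is still inside the primary's retained WAL
    window; anything older needs a validated snapshot plus the retained
    suffix, because the missing prefix may no longer exist locally.
    """
    if target_durable_lsn + 1 >= retained_wal_floor:
        return "suffix"
    return "snapshot_plus_suffix"
```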
Expiration And Logical Time Under Replication
Replication must preserve the single-node TTL rule: late reuse is acceptable; early reuse is not.
That means:
- only the primary may inspect external logical time and decide which reservations are due
- the primary logs internal `expire` commands as ordinary replicated entries
- backups never expire reservations directly from local time
- a view change may delay expiration work, but it must not allow premature reuse
- a newly elected primary must inspect overdue reservations after it is in normal mode and append any needed `expire` commands through the replicated log
Logical time therefore remains an input to the primary-controlled scheduler, not a follower-side state-machine dependency.
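The primary-side due check can be sketched as a pure scan over deadline slots (hypothetical shapes; the real scheduler feeds these results into replicated `expire` entries rather than freeing anything directly):

```python
def overdue_expirations(now_slot: int, reservations: dict[int, int]) -> list[int]:
    """Return reservation_ids whose deadline slot has already passed.

    reservations maps reservation_id -> deadline_slot. Using strict '<'
    keeps the late-not-early rule: a reservation becomes due only once its
    deadline slot is behind the current logical slot.
    """
    return sorted(rid for rid, deadline in reservations.items()
                  if deadline < now_slot)
```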
Safety Invariants
Replication must satisfy all single-node invariants plus these protocol invariants:
- At most one log entry can become committed at a given `(view, lsn)` position.
- A committed `lsn` never changes payload across views.
- Any two majorities intersect, so two different primaries cannot both commit divergent entries at the same `lsn`.
- Every replica that applies the same committed log prefix reaches the same allocator state and command outcomes.
- `reservation_id = (shard_id << 64) | lsn` remains valid because committed `lsn` order is global within a shard.
- A client-visible write result is published only after the corresponding log entry is committed.
- Retrying the same `operation_id` after failover returns the original committed result or an indefinite outcome; it never creates a silent second execution.
- Reads are served only from replicas that know they are in the current view and have locally applied at least the requested `lsn`.
- Expiration remains log-driven. No replica may free a resource from local wall-clock observation alone.
- A replica with unvalidated or corrupted local durable state must not vote, lead, or serve until repaired.
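The `reservation_id` invariant above is plain bit packing; a minimal sketch (the 64-bit `lsn` width comes from the formula, the helper names are illustrative):

```python
def reservation_id(shard_id: int, lsn: int) -> int:
    """Pack the invariant reservation_id = (shard_id << 64) | lsn."""
    if not 0 <= lsn < (1 << 64):
        raise ValueError("lsn must fit in 64 bits")
    return (shard_id << 64) | lsn


def unpack(reservation: int) -> tuple[int, int]:
    """Recover (shard_id, lsn) from a packed reservation_id."""
    return reservation >> 64, reservation & ((1 << 64) - 1)
```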
What The First Replicated Release Deliberately Does Not Do
This draft intentionally leaves these areas out of the first replicated implementation:
- follower reads
- lease-read optimization
- flexible or asymmetric quorums
- online membership change
- cross-shard transactions or cross-shard reservations
- leaderless write paths
- background speculative execution on backups
Those may be revisited only if the simpler majority-primary design proves insufficient.
Follow-On Work
This draft now pairs with the replicated simulation plan and Jepsen gate in
testing.md. The M6 replication design gate is complete; further work should
reopen as tracked implementation tasks rather than silent protocol drift.
Research Anchors
This draft is shaped primarily by:
- Viewstamped Replication Revisited
- Flexible Paxos
- Protocol-Aware Recovery for Consensus-Based Storage
- Can Applications Recover from fsync Failures?
The rule is still the same as elsewhere in the repository: research informs the design, but boundedness, determinism, and the existing single-node semantics stay authoritative.