AllocDB Testing Strategy
Scope
This document defines the v1 testing model and the additional testing gate required before any replicated release.
Principle
Unit tests are necessary but not sufficient.
AllocDB should follow the TigerBeetle, FoundationDB, and Dropbox line of thinking:
- deterministic execution is a design property
- simulation should run the real code
- failures should be injected systematically, not only reproduced after the fact
v1 Testing Layers
State-Machine Tests
Required coverage:
- every legal transition
- every illegal transition
- resource and reservation state agreement
- terminal-state behavior
Replay and Recovery Tests
Required coverage:
- WAL replay equivalence
- crash during WAL append
- torn WAL tails
- snapshot plus WAL recovery
- corruption detection and fail-closed behavior
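The torn-tail and fail-closed cases can be sketched against a toy frame format. This is an illustrative sketch only: the frame layout (`[len | checksum | payload]`) and the `append`/`replay_wal` helpers are assumptions for demonstration, not AllocDB's real WAL format.

```rust
// Illustrative WAL frame: [len: u32 | checksum: u32 | payload], little-endian.
fn append(wal: &mut Vec<u8>, payload: &[u8]) {
    let sum: u32 = payload.iter().map(|b| *b as u32).sum();
    wal.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    wal.extend_from_slice(&sum.to_le_bytes());
    wal.extend_from_slice(payload);
}

// Replay truncates cleanly at a torn tail (incomplete final frame) but
// fails closed on mid-log corruption, matching the required coverage above.
fn replay_wal(bytes: &[u8]) -> Result<Vec<Vec<u8>>, String> {
    let mut records = Vec::new();
    let mut i = 0usize;
    while i < bytes.len() {
        if bytes.len() - i < 8 {
            break; // torn header at the tail: keep the committed prefix
        }
        let len = u32::from_le_bytes(bytes[i..i + 4].try_into().unwrap()) as usize;
        let sum = u32::from_le_bytes(bytes[i + 4..i + 8].try_into().unwrap());
        if bytes.len() - i - 8 < len {
            break; // torn payload at the tail: keep the committed prefix
        }
        let payload = &bytes[i + 8..i + 8 + len];
        let actual: u32 = payload.iter().map(|b| *b as u32).sum();
        if actual != sum {
            return Err("checksum mismatch: fail closed".to_string());
        }
        records.push(payload.to_vec());
        i += 8 + len;
    }
    Ok(records)
}
```

The key asymmetry the tests must pin down: a torn tail is an expected crash artifact and truncates, while a checksum mismatch before the tail is corruption and must refuse recovery.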
Idempotency and Submission Tests
Required coverage:
- duplicate `operation_id` submission
- `operation_id` reuse with a different payload
- indefinite outcomes resolved by retry with the same `operation_id`
- behavior at dedupe-window expiry
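The retry contract under test can be sketched as a cache keyed by `operation_id`. The `DedupeCache` type, its `Outcome` enum, and the payload-mismatch rule are illustrative assumptions, not AllocDB's real submission API.

```rust
use std::collections::HashMap;

/// Illustrative outcome of a client submission.
#[derive(Clone, Debug, PartialEq)]
enum Outcome {
    Committed(u64), // LSN of the original commit, replayed on retry
    PayloadMismatch, // same operation_id with a different payload: reject
}

#[derive(Default)]
struct DedupeCache {
    // operation_id -> (payload, lsn)
    entries: HashMap<u64, (Vec<u8>, u64)>,
    next_lsn: u64,
}

impl DedupeCache {
    fn submit(&mut self, operation_id: u64, payload: &[u8]) -> Outcome {
        if let Some((seen, lsn)) = self.entries.get(&operation_id) {
            if seen.as_slice() == payload {
                // Retry with the same id and payload replays the cached result.
                return Outcome::Committed(*lsn);
            }
            return Outcome::PayloadMismatch;
        }
        self.next_lsn += 1;
        self.entries.insert(operation_id, (payload.to_vec(), self.next_lsn));
        Outcome::Committed(self.next_lsn)
    }
}
```

A real dedupe-window-expiry test would additionally evict entries and assert the documented behavior after eviction.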
Capacity Tests
Required coverage:
- full submission queue
- full reservation table
- full expiration bucket
- maximum TTL and retention settings
Contention Tests
Required coverage:
- many contenders for one resource
- simultaneous expirations and confirms
- retry storms with reused `operation_id`
Deterministic Simulation
v1 should add a deterministic simulator around the trusted core as early as practical.
The simulator should control:
- slot advancement
- ingress scheduling order
- WAL write and fsync outcomes
- crash points
- restart timing
Properties:
- seeded and reproducible
- runs real state-machine and recovery code
- supports shrinking failures to minimal cases when possible
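The seeded-and-reproducible property can be sketched with a tiny deterministic PRNG that picks among already-ready actions, so the same seed always yields the same order. The `xorshift64` generator and the action labels here are illustrative, not the harness's real types.

```rust
/// Minimal deterministic PRNG (xorshift64); seed must be nonzero.
struct SeededRng(u64);

impl SeededRng {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Resolve an order among ready actions at one logical slot.
/// Same seed + same ready set => same order, which is what makes a
/// failing schedule replayable and shrinkable.
fn schedule(seed: u64, mut ready: Vec<&'static str>) -> Vec<&'static str> {
    let mut rng = SeededRng(seed);
    let mut order = Vec::new();
    while !ready.is_empty() {
        let idx = (rng.next() % ready.len() as u64) as usize;
        order.push(ready.remove(idx));
    }
    order
}
```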
M4-S01 Harness Direction
The M4-S01 spike selected one external scripted driver as the starting point for M4-T01
through M4-T04.
The selected shape is:
- wrap the real `allocdb_node::SingleNodeEngine` instead of adding simulator-only execution paths
- keep one explicit simulated current slot in the harness and pass it into real engine calls
- model ingress, tick, checkpoint, crash, restart, and injected persistence faults as explicit driver events
- use a seed only to choose ordering among ready actions at the same logical slot
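The event-driven shape above can be sketched as one explicit enum the test script owns. Variant and field names are assumptions for illustration; the real harness drives the real engine rather than a replay function.

```rust
/// Illustrative driver events: every fault and scheduling decision is an
/// explicit value the test script constructs, never an implicit side effect.
#[derive(Debug, Clone, PartialEq)]
enum DriverEvent {
    Ingress { client: u32, operation_id: u64 },
    Tick,
    Checkpoint,
    Crash,
    Restart,
    InjectPersistenceFault { boundary: &'static str },
    AdvanceSlot { to: u64 },
}

/// The harness owns the one simulated slot: nothing advances time except
/// the script itself, and the slot never moves backward.
fn replay(events: &[DriverEvent]) -> u64 {
    let mut slot = 0u64;
    for e in events {
        if let DriverEvent::AdvanceSlot { to } = e {
            assert!(*to >= slot, "simulated slot never moves backward");
            slot = *to;
        }
    }
    slot
}
```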
The promoted M4-T01 harness now lives in crates/allocdb-node/src/simulation.rs, with
regression coverage in crates/allocdb-node/src/simulation_tests.rs. The current evidence shows:
- the same seed reproduces the same same-slot action order and LSN transcript
- the same seed plus enabled crash-point set reproduces the same one-shot crash selection, independent of slice order
- advancing the simulated slot without ticking produces deterministic expiration backlog
- a checkpoint plus WAL-backed restart path still works when an expiration commit halts the live engine after append and before sync
- seeded crash plans now interrupt the real engine at client submit/apply, checkpoint, and recovery boundaries, with restart tests covering post-sync submit replay, snapshot-written before WAL rewrite, and replay-interrupted recovery
- seeded schedule actions can now resolve one labeled ingress or tick action into one candidate slot window, replay the same resolved schedule from seed, and record one transcript that captures both chosen slots and outcomes
- seeded schedule exploration now covers ingress contention order, same-deadline expiration selection while preserving earliest-deadline priority under bounded tick throughput, and retry timing across the dedupe window with replay from the same seed
- one-shot storage-fault helpers now cover append-failure halts, sync-failure ambiguity, checksum-mismatch fail-closed recovery, and torn-tail truncation against the real WAL and restart path
What to reuse in follow-up tasks:
- the external-driver architecture
- explicit slot advancement under test control
- seeded scheduling for same-slot ready work
- labeled schedule actions with candidate slot windows and replayable transcripts
- seeded due-expiration selection over the real internal-expire path while preserving earliest-deadline priority
- seeded one-shot crash plans over real engine and recovery boundaries
- one-shot storage-fault helpers over live WAL writes and post-crash WAL mutation
- restart helpers that reopen from snapshot plus WAL on disk
What not to promote directly:
- the original spike's ad hoc helper surface
- any scheduler choice that is not covered by deterministic transcript tests
- opaque randomized loops that do not record the resolved schedule and seed
- crash toggles that are not selected from a seed and named boundary set
- one-off layouts that hide the reusable harness from follow-on simulation tasks
This direction keeps trusted-core churn low because the real engine already exposes the slot, checkpoint, recovery, and failure-injection seams the simulator needs. It also avoids trait-heavy virtual clock or fake storage abstractions inside the core before the project has proven they are necessary.
Replicated Deterministic Simulation
M6-T02 extends the single-node simulation approach to replicated execution without introducing a
mock semantics layer.
The rule stays the same:
- run the real allocator and recovery code
- keep time, message delivery, crash, and restart under explicit test control
- make every schedule decision reproducible from seed plus transcript
Design Goal
The replicated simulator should answer one question before Jepsen exists:
can the chosen replication protocol preserve the single-node invariants under deterministic fault schedules?
That means the simulator is not a toy cluster model. It is a deterministic cluster driver around real replica state, real durable state transitions, and real retry semantics.
Cluster Harness Shape
The first replicated harness should model one fixed-membership shard with 3 real replicas.
The harness should:
- wrap one real replicated node per replica, each with its own WAL and snapshot workspace
- keep one explicit simulated slot counter shared by the cluster driver
- treat protocol messages as explicit driver-visible events
- treat timeout firing, view change, crash, restart, and rejoin as explicit driver-visible events
- use a seed only to choose among already-ready actions at the same simulated slot
- record one transcript containing seed, chosen actions, delivered messages, dropped messages, and resulting view and `lsn` observations
What must not happen:
- no simulator-only apply path
- no fake replica state that bypasses the real durable log and recovery path
- no hidden random network loop that cannot be replayed from transcript
Driver Actions
The replicated harness should choose among explicit labeled actions such as:
- `client_submit`
- `deliver_protocol_message`
- `drop_protocol_message`
- `advance_slot`
- `fire_timeout`
- `crash_replica`
- `restart_replica`
- `allow_rejoin`
- `complete_snapshot_transfer`
As with the current single-node schedule exploration, the harness should resolve one ready action into one recorded transcript step. The recorded step should include enough metadata to replay the same cluster schedule exactly.
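A recorded step can be sketched as a small record carrying the seed-resolved choice and its observed effect. Field names here are assumptions for illustration, not the harness's real transcript format.

```rust
/// Illustrative transcript step: enough metadata to replay the same
/// cluster schedule exactly and to diff a rerun against the recording.
#[derive(Debug, Clone, PartialEq)]
struct TranscriptStep {
    slot: u64,
    action: &'static str, // e.g. "deliver_protocol_message"
    target_replica: u8,
    delivered: bool,      // false when the action was drop_protocol_message
    observed_view: u64,
    observed_lsn: u64,
}

/// The replay rule: a rerun from the same seed and starting durable state
/// must reproduce the recorded transcript step for step.
fn transcripts_match(recorded: &[TranscriptStep], rerun: &[TranscriptStep]) -> bool {
    recorded == rerun
}
```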
The initial M7-T02 implementation keeps that surface intentionally narrow around the current
codebase:
- `queue_protocol_message`
- `deliver_protocol_message`
- `drop_protocol_message`
- `set_connectivity`
- `crash_replica`
- `restart_replica`
- `explore_schedule`
The harness already hosts three real `ReplicaNode`s with independent durable workspaces and one shared seeded slot driver. M7-T03 upgrades that queue from opaque labels to real `prepare`/`prepare_ack`/`commit` payloads while keeping the same deterministic delivery, partition, crash, and replay surface. M7-T05 extends that same harness with `checkpoint_replica` and `rejoin_replica` helpers so deterministic tests can force one retained WAL floor on the primary, then prove suffix-only catch-up, snapshot transfer, and faulted-rejoin rejection through the real on-disk restart path.
Network And Failure Model In Simulation
The deterministic cluster driver should model:
- connectivity as an explicit replica-to-replica and client-to-replica delivery matrix
- partitions as rule changes in that matrix, not as implicit timing guesses
- process crash as loss of volatile state with durable WAL and snapshot files left on disk
- restart as reopening the replica from its own durable state plus replicated catch-up
- rejoin as restoring connectivity and allowing suffix catch-up or snapshot-plus-suffix catch-up
The simulator does not need arbitrary packet corruption in the first replicated design pass. Message loss, partition, delay by non-delivery, crash, restart, and rejoin are the required first faults.
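The delivery-matrix idea can be sketched for the 3-replica shard. The `Connectivity` type and its methods are illustrative assumptions; the point is that partitions are explicit rule changes, not timing guesses.

```rust
/// Illustrative replica-to-replica delivery matrix for a 3-replica shard.
struct Connectivity {
    allowed: [[bool; 3]; 3], // allowed[from][to]
}

impl Connectivity {
    fn fully_connected() -> Self {
        Connectivity { allowed: [[true; 3]; 3] }
    }

    /// A partition is a rule change: sever both directions between sides.
    fn partition(&mut self, side_a: &[usize], side_b: &[usize]) {
        for &a in side_a {
            for &b in side_b {
                self.allowed[a][b] = false;
                self.allowed[b][a] = false;
            }
        }
    }

    /// Healing restores full delivery; catch-up then happens through the
    /// real protocol, not through the matrix.
    fn heal(&mut self) {
        self.allowed = [[true; 3]; 3];
    }

    fn can_deliver(&self, from: usize, to: usize) -> bool {
        self.allowed[from][to]
    }
}
```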
Required Scenario Families
The first replicated simulation plan must cover these families explicitly.
Partition Scenarios
- isolate the primary from one backup but keep quorum, so writes still commit and the minority replica later catches up
- isolate the primary from the majority, so the old primary stops serving and a new primary must win the higher view
- split the cluster into non-quorum minorities, so no side commits and reads fail closed
- heal the partition and verify that all healthy replicas converge on one committed prefix
Key checks:
- no split-brain commit
- no read served from a view-uncertain or quorum-lost replica
- retries with the same `operation_id` resolve ambiguity without duplicate execution
- a healed but stale replica catches up only after it locally restores the committed prefix already accepted by the healthy quorum
Primary Crash Scenarios
- crash the primary before quorum append, so the write remains uncommitted and clients see only an indefinite outcome
- crash the primary after quorum append, so retry must recover the committed result even if failover interrupts reply delivery or commit propagation
- crash the primary after reply and force later reads and retries through the new primary
- crash during expiration leadership, so overdue work may be delayed but never applied early
Key checks:
- committed entries survive failover unchanged
- uncommitted suffix does not become visible as committed history
- expiration remains log-driven and may be late but not early
Rejoin And Recovery Scenarios
- restart a stale backup and catch up by replicated suffix only
- restart a replica whose local state requires snapshot-plus-suffix transfer
- rejoin a replica that holds an uncommitted divergent suffix and verify that the suffix is discarded safely
- restart a replica with invalid local durable state and verify that it stays faulted instead of voting or serving
Key checks:
- rejoined replicas recover the committed prefix already accepted by the healthy quorum
- committed history is never rewritten during catch-up
- rejoin never regresses one replica's durable view knowledge behind the current primary
- corrupted replicas fail closed until repaired
Required Invariants In Simulation
Every promoted replicated simulation test should check some subset of these invariants:
- same seed plus same starting durable state yields the same transcript
- no two different payloads commit at the same `lsn`
- replicas that apply the same committed prefix reach the same allocator state and outputs
- a client-visible success is published only for a quorum-committed entry
- retry with the same `operation_id` never creates a second successful execution
- reads succeed only on the current primary after the requested `required_lsn` is locally applied
- resource reuse after expiration is never earlier than the single-node rules allow
- rejoin never lets a stale or corrupted replica serve, vote, or lead before validation completes
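One of these invariants, "no two different payloads commit at the same `lsn`", can be sketched as a transcript check. The `(lsn, payload)` record shape is an illustrative assumption.

```rust
use std::collections::HashMap;

/// Scan committed entries observed across the cluster transcript and
/// reject any lsn that carries two different payloads. Re-observing the
/// same (lsn, payload) pair from another replica is fine.
fn check_unique_lsn(commits: &[(u64, &str)]) -> Result<(), String> {
    let mut seen: HashMap<u64, &str> = HashMap::new();
    for &(lsn, payload) in commits {
        match seen.get(&lsn) {
            Some(prev) if *prev != payload => {
                return Err(format!("lsn {lsn} committed two payloads"));
            }
            _ => {
                seen.insert(lsn, payload);
            }
        }
    }
    Ok(())
}
```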
Promotion Path
The recommended implementation sequence is:
- add a deterministic cluster driver that can host 3 real replicas, one message queue, and one explicit connectivity map
- add seeded message-delivery and timeout scheduling with replayable transcripts
- add partition scenarios that prove fail-closed reads and no split-brain commit
- add primary-crash scenarios that exercise retry semantics around quorum commit boundaries
- add rejoin scenarios for suffix catch-up, snapshot transfer, and divergent-suffix truncation
- add replicated storage-fault combinations only after the basic cluster schedule is already replayable
This keeps the first replicated simulator narrow enough to validate protocol behavior before expanding into broader randomized search.
Current executable replicated coverage already proves:
- a quorum-lost primary fails closed for new writes and strict reads even when it still has local committed state
- a higher-view takeover can reconstruct the latest committed prefix on a new primary before that replica returns to normal mode
- stale or quorum-lost primaries reject reads after failover instead of serving stale success
- the primary can keep serving through one isolated-backup partition, then heal and catch that stale backup up to the committed prefix without duplicate execution
- a full split into non-quorum minorities fails closed until one majority reforms, after which the new primary can accept writes and rejoin the stale replica back onto the committed prefix
- a primary crash before quorum append preserves indefinite ambiguity until failover, after which a retry with the same `operation_id` commits exactly once on the new primary
- a primary crash after majority append lets the next primary reconstruct the committed prefix and resolve the retry from cache instead of executing it again
- a primary crash after reply preserves strict-primary reads and cached retry results on the next primary
- a stale replica can rejoin by replicated suffix only when it still holds one recent enough committed durable prefix
- a primary checkpoint can force snapshot transfer for older replicas whose local durable state falls behind the retained WAL floor
- rejoin now fails closed if the target has already observed a higher durable view than the current primary
- rejoin discards one divergent uncommitted suffix and rejects one replica forced into `faulted` state by corrupted durable metadata
Local Multi-Process Runner Smoke Test
M8-T01 adds the first external-process smoke gate on top of the replicated node library.
The required focused validation command is:
cargo test -p allocdb-node local_cluster -- --nocapture
What this smoke test proves today:
- one operator command surface can start 3 external replica processes
- the chosen workspace layout, loopback addresses, and per-replica bounds persist across restart
- each replica answers one live control `status` request with PID, role, view, recovery, address, and path details
- one operator `stop` command shuts the cluster down cleanly and removes the live pid files
What it does not claim yet:
- it does not prove external failover, rejoin, or stale-primary rejection yet
- it does not prove background expiration or checkpoint workers inside the replica daemons
- Jepsen and QEMU-backed validation remain follow-on gates after the local process surface is in place
Local Fault-Control Harness
M8-T02 adds the first reusable disruption surface on top of the local multi-process runner.
The required focused validation command remains:
cargo test -p allocdb-node local_cluster -- --nocapture
What this harness proves today:
- one operator command surface can crash and restart one replica process without changing its configured identity, addresses, or durable workspace
- one operator command surface can isolate one replica from external `client` and `protocol` traffic while preserving `control` reachability for later debug and recovery
- the reserved `client` and `protocol` listeners now fail with an explicit isolation error when the fault harness marks that replica isolated
- one persisted `cluster-timeline.log` records cluster start/stop plus replica `crash`, `restart`, `isolate`, and `heal` events in replayable order
What it still does not claim:
- the external process boundary still does not carry the real replicated client or protocol traffic yet
- the harness does not yet orchestrate VM-level network partitions or reboots
- Jepsen and QEMU-backed validation remain follow-on gates after this local disruption surface
Local QEMU Testbed
M8-T03 adds the first repeatable VM-backed cluster surface on top of the local process and
fault-control tools.
The focused validation commands are:
cargo test -p allocdb-node qemu_testbed -- --nocapture
cargo run -p allocdb-node --bin allocdb-qemu-testbed -- prepare --workspace <path> --base-image-path <cloudimg.qcow2> --local-cluster-bin <allocdb-local-cluster>
The host-side prerequisites for that command are:
- `QEMU_SHARE_DIR` can point at a non-default QEMU firmware directory when the host does not keep firmware templates under the standard search paths
- `QEMU_ACCEL` can override the accelerator embedded into the rendered QEMU command; the default is `hvf` on macOS and `kvm` on Linux
- `prepare` hard-fails unless the host has the arch-specific `qemu-system-*` binary, `qemu-img`, `ssh`, and `ssh-keygen` on `PATH`
- `prepare` uses `hdiutil` on macOS and `mkisofs` or `genisoimage` on Linux-class hosts when it builds NoCloud seed images, and it hard-fails if the platform-appropriate ISO builder is absent
- `prepare` hard-fails unless the configured `allocdb-local-cluster` binary already exists on the host
- if `--base-image-path` does not already exist, `prepare` hard-fails unless `curl` is on `PATH` and the base-image download succeeds
- `prepare` hard-fails if the QEMU firmware templates are not reachable either through the default search paths or `QEMU_SHARE_DIR`
What this testbed proves today:
- one host-side command surface can generate repeatable QEMU assets for 3 replica guests plus 1 control guest
- each replica guest uses one copy-on-write overlay, one generated NoCloud seed image, and one static local-cluster layout that runs the existing `allocdb-local-cluster replica-daemon`
- replica `control` moves onto one management network while replica `client` and `protocol` listeners move onto one separate cluster network
- one generated control-node script can drive `status`, `isolate`, `heal`, `crash`, `restart`, `reboot`, `export-replica`, `import-replica`, and `collect-logs` operations against the replica guests
- the generated workspace keeps overlay images, firmware vars, guest seeds, console logs, and SSH keys in stable paths suitable for scripted follow-on runs
What it still does not claim:
- the QEMU testbed still relies on the host-side Jepsen runner for failover/rejoin orchestration; the guest runtime itself does not contain a standalone distributed control plane
Jepsen Harness Slice
M8-T04 now adds the first host-side Jepsen harness tooling around that external replicated
surface:
cargo test -p allocdb-node jepsen -- --nocapture
cargo test -p allocdb-node --bin allocdb-jepsen -- --nocapture
cargo run -p allocdb-node --bin allocdb-jepsen -- plan
cargo run -p allocdb-node --bin allocdb-jepsen -- analyze --history-file <history.txt>
cargo run -p allocdb-node --bin allocdb-jepsen -- capture-kubevirt-layout --workspace <path> --kubeconfig <path> --namespace <name> --ssh-private-key <path>
cargo run -p allocdb-node --bin allocdb-jepsen -- verify-qemu-surface --workspace <path>
cargo run -p allocdb-node --bin allocdb-jepsen -- verify-kubevirt-surface --workspace <path>
cargo run -p allocdb-node --bin allocdb-jepsen -- run-qemu --workspace <path> --run-id <run-id> --output-root .artifacts/
cargo run -p allocdb-node --bin allocdb-jepsen -- run-kubevirt --workspace <path> --run-id <run-id> --output-root .artifacts/
cargo run -p allocdb-node --bin allocdb-jepsen -- watch-kubevirt --workspace <path> --output-root .artifacts/ [--run-id <run-id>] [--follow]
cargo run -p allocdb-node --bin allocdb-jepsen -- watch-kubevirt-fleet --lane <name,workspace,output-root> [--lane <name,workspace,output-root> ...] [--refresh-millis <ms>] [--follow]
cargo run -p allocdb-node --bin allocdb-jepsen -- archive-qemu --workspace <path> --run-id <run-id> --history-file <history.txt> --output-root .artifacts/
cargo run -p allocdb-node --bin allocdb-jepsen -- archive-kubevirt --workspace <path> --run-id <run-id> --history-file <history.txt> --output-root .artifacts/
Use the repo-local .artifacts/ directory as the default Jepsen output root so release-gate
evidence stays inside the workspace tree without polluting git history.
For local debugging only, faulted `run-qemu` and `run-kubevirt` runs also honor
`ALLOCDB_JEPSEN_FAULT_WINDOW_SECS_OVERRIDE=<secs>`. That override shortens the live fault window
for fast repros, but runs executed with it do not count toward the documented release gate.
For KubeVirt fault debugging, inspect the guest-local `/var/log/allocdb/replica-{1,2,3}.log`
files as well as the archived `journal.log`. The current `collect-logs` bundle captures the
systemd journal and control snapshots, but the newest replica role/view transition logs land in
the daemon log file today, together with successful quorum, commit-broadcast, protocol-accept,
and expiration-batch traces.
What this harness slice proves today:
- one command can materialize the exact 15 documented first-release gate runs as a stable matrix
- one retry-aware history analyzer folds ambiguous attempts and later retries by stable `operation_id`
- the analyzer automatically blocks on the current release-blocking outcomes the repo already documents: duplicate committed execution, double allocation, stale successful reads, early expiration release, unresolved ambiguity, and writes that are missing `operation_id`
- one archive command can bundle the analyzed history with one fetched external-cluster log archive and one manifest rooted on the host
- one surface-probe command can issue one real `get_metrics` request to every replica, then drive one real `create_resource` submit plus one fenced `get_resource` read through the configured primary against either the QEMU or KubeVirt cluster
- one `capture-kubevirt-layout` command can persist the live control/replica VM IPs, helper-pod settings, SSH key path, and per-replica network addresses needed to drive the Jepsen matrix against a running KubeVirt deployment
- one per-run request namespace now keeps client IDs and request slots distinct across separate `allocdb-jepsen` invocations, so operators can execute multiple Jepsen runs against one persistent external cluster without tripping monotonic-slot invariants from earlier runs
- one `run-qemu` and one `run-kubevirt` command can execute the full documented release-gate matrix against the live external cluster and persist one analyzed history plus one artifact bundle for each run
- one `watch-kubevirt` terminal command can follow the currently active or named KubeVirt run by reading one per-run status file, one per-run event log, and live replica control/metrics state while the run is in progress; `--follow` keeps the dashboard open across terminal run states so operators can leave one watcher running between repeated Jepsen invocations
- one `watch-kubevirt-fleet` terminal command can follow multiple KubeVirt Jepsen lanes at once by reading each lane's `allocdb-jepsen-latest-status.txt`, then combining lane summaries, replica health, and recent events into one dashboard suitable for a parallel 3-lane release-gate run
- the faulted Jepsen families now loop whole scenario iterations until the documented minimum fault window is actually satisfied, with one fresh request namespace per iteration and one monotonic history sequence across the full long-running run
- those faulted iterations now also repair the external lane back to one primary plus two backups before the next scenario slice starts, so repeated crash/partition loops do not inherit a stale partially recovered runtime from the previous iteration
- staged replica export/import now uses lane-scoped unique temporary workspaces on the host, so concurrent multi-lane failover and rejoin orchestration cannot collide on shared temp paths
- those runs now exercise one real hot-resource contention sequence, one real stable `operation_id` retry-cache replay, one real primary-versus-backup read-role check, one real replicated `tick_expirations` path through the external runtime, and one host-side crash/partition/mixed-failover cutover path built from replica workspace export/import plus the existing `ReplicaNode::recover(...)` logic on staged copies
- the KubeVirt path now has live proof for one captured layout, one successful `verify-kubevirt-surface` run, and one successful `reservation_contention-control` `run-kubevirt` history plus artifact bundle against the real `allocdb-{control,replica-*}` VMs
- the `failover_read_fences-partition-heal` scenario now retries its minority-partition ambiguous write after failover on the new primary, and a fresh short KubeVirt rerun with the debug fault window override passed with `release_gate=passed` and `blockers=0`
- the expiration-and-recovery scenarios now tolerate bounded expiration backlog on long-lived lanes: after one committed expiration tick, the harness will issue follow-up ticks until the specific reserved resource becomes `Available`, instead of assuming the first committed tick always drained the target reservation
- the full documented 15-run release-gate matrix has now completed on the rebuilt 3-lane KubeVirt profile that uses `longhorn-strict-local-wffc` plus replica-only anti-affinity, and every run finished with `release_gate_passed=true` and `blockers=0`
What it still does not claim:
- partition control still uses the existing whole-replica client/protocol isolation surface, not arbitrary packet loss or per-link delay injection
- the release gate still depends on operators running the full external-cluster matrix end to end; the unit and integration suite only proves the harness and orchestration code paths
Jepsen Validation Gate
M6-T03 defines the external validation required before any replicated release.
Design Goal
Jepsen should answer a different question than deterministic simulation:
does the deployed replicated system preserve the client contract under real network faults and client-visible ambiguity?
Deterministic simulation proves protocol behavior against explicit schedules. Jepsen validates the same contract through the real client surface, real routing mistakes, real timeout behavior, and real failover and recovery timing.
Minimum Testbed
The first Jepsen gate should target the same narrow release shape as the replication draft:
- one shard
- 3 replicas
- fixed membership
- the real external API planned for the replicated release
- clients that preserve `operation_id` across retries
- a pre-created bounded resource set large enough to force contention and expiration
- logical-slot advancement and `tick_expirations` driven through the public or operator-facing surface, not simulator internals
5-replica clusters, online reconfiguration, and multi-shard behavior are follow-on work, not
part of the first Jepsen gate.
Required Workload Families
Every release candidate should run these workload families.
Reservation Contention
- many clients reserve, confirm, and release from a hot resource set
- duplicate retries and conflicting `operation_id` reuse are injected deliberately
- reads sample resource and reservation state during load
Checks:
- successful operations remain linearizable
- no resource is committed to two holders at once
- one `operation_id` never produces two successful executions
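The double-allocation check can be sketched as a scan over a committed, per-resource event sequence. The `Ev` event shape and the helper name are illustrative assumptions, not the Jepsen checker's real model.

```rust
use std::collections::HashSet;

/// Illustrative committed events for one resource set.
#[derive(Clone, Copy, PartialEq)]
enum Ev {
    Reserve,
    Release,
}

/// Reject any second reserve on a resource before a release frees it:
/// "no resource is committed to two holders at once".
fn check_single_holder(events: &[(&str, Ev)]) -> Result<(), String> {
    let mut held: HashSet<&str> = HashSet::new();
    for &(resource, ev) in events {
        match ev {
            Ev::Reserve => {
                if !held.insert(resource) {
                    return Err(format!("double allocation on {resource}"));
                }
            }
            Ev::Release => {
                held.remove(resource);
            }
        }
    }
    Ok(())
}
```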
Ambiguous Write Retry
- crash or isolate the primary around quorum-commit boundaries for `reserve`, `confirm`, and `release`
- let clients observe timeouts or indefinite write outcomes, then force retries with the same `operation_id`
Checks:
- an ambiguous write resolves to at most one committed result
- retry returns the original committed result after failover when the first attempt already committed
- unresolved ambiguity after cleanup fails the gate
Failover And Read Fences
- mix writes with `required_lsn` reads while primaries fail, elections run, and some clients route to stale nodes
- keep both read-only traffic and read-after-write traffic active during failover
Checks:
- successful reads come only from the current primary after the requested `required_lsn` is locally applied
- stale or quorum-lost replicas fail closed instead of serving stale success
- committed state remains linearizable across primary change
Expiration And Recovery
- create expiring reservations, advance logical time, call `tick_expirations`, and combine that load with crash, partition, restart, and rejoin
- force rejoin after stale state, suffix catch-up, and snapshot-plus-suffix recovery paths
Checks:
- expiration may be delayed by failover but never frees a resource early
- restarted and rejoined replicas converge on committed history before serving or voting
- recovery preserves the same client-visible result for retried operations
Nemesis Families
The first Jepsen gate should explicitly cover:
- primary crash and restart
- majority-loss and minority-loss partitions plus heal
- stale-primary isolation with client misrouting to the old primary
- backup crash during catch-up and later rejoin
- mixed crash-plus-partition schedules around ambiguous writes
Clock skew, disk corruption, membership change, and multi-shard faults are not part of the first Jepsen gate. Those are either already covered by deterministic simulation and local durability testing or deferred until later replicated milestones.
History Interpretation
Jepsen must interpret AllocDB histories using the product's retry contract, not a generic "exactly once" assumption.
Rules:
- every mutating client operation carries a stable `operation_id`
- definite successes and definite failures enter the history directly
- timeouts, transport loss, and other indefinite outcomes are recorded as ambiguous client events, not as silent success or failure
- the checker folds an ambiguous event and all later retries with the same `operation_id` into one logical command
- cleanup retries every ambiguous command within the dedupe window; any ambiguity that remains unresolved after cleanup fails the gate
- fail-closed read rejection from stale or quorum-lost replicas is an allowed outcome; stale successful read is not
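The folding rule can be sketched as follows. The `Event` enum and the `(operation_id, event)` history shape are illustrative assumptions, not the harness's real history format.

```rust
use std::collections::HashMap;

/// Illustrative client-visible outcome of one attempt.
#[derive(Clone, Copy, PartialEq)]
enum Event {
    Ok,
    Fail,
    Ambiguous,
}

/// Fold all attempts sharing one operation_id into one logical command:
/// a later definite outcome resolves an earlier ambiguous attempt, and a
/// definite outcome is never regressed by a later retry. Returns the
/// operation_ids still ambiguous after folding; a nonempty result after
/// cleanup fails the gate.
fn unresolved_after_fold(history: &[(u64, Event)]) -> Vec<u64> {
    let mut folded: HashMap<u64, Event> = HashMap::new();
    for &(op_id, ev) in history {
        let prev = folded.get(&op_id).copied();
        match (prev, ev) {
            // first sighting, or a definite outcome resolving ambiguity
            (None, _) | (Some(Event::Ambiguous), _) => {
                folded.insert(op_id, ev);
            }
            // a definite outcome already stands
            _ => {}
        }
    }
    let mut unresolved: Vec<u64> = folded
        .iter()
        .filter(|(_, ev)| **ev == Event::Ambiguous)
        .map(|(op_id, _)| *op_id)
        .collect();
    unresolved.sort();
    unresolved
}
```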
Required Checkers
The Jepsen analysis should include at least:
- a linearizability checker over successful writes and successful reads captured from the QEMU-backed runtime histories
- an `operation_id` uniqueness checker that rejects duplicate committed execution
- a resource-safety checker that rejects double allocation
- a strict-read fence checker for successful `required_lsn` reads
- an expiration-safety checker that rejects early reuse
Release Gate
Jepsen is not a substitute for simulation. The deterministic replicated-simulation gate must pass before Jepsen begins.
The minimum release gate for the first replicated version is:
- one control run for each workload family with no nemesis
- one crash-restart run for each workload family
- one partition-heal run for each workload family
- one mixed failover run for the ambiguity, failover, and expiration workloads
- every faulted run lasts at least 30 minutes after the first injected fault
- every run archives Jepsen history, client logs, replica logs, and a cluster-timeline summary
That is 4 control runs, 4 crash-restart runs, 4 partition-heal runs, and 3 mixed failover runs, which is where the documented 15-run matrix comes from.
On the current KubeVirt cluster, the practical parallel split is 3 independent AllocDB clusters,
with one Jepsen run active per cluster. The recommended execution shape is:
- use 3 lanes such as `lane-a`, `lane-b`, and `lane-c`, each with its own VM names, helper pod, bootstrap workspace, Jepsen workspace, and artifact root
- run the 4 control scenarios first as a short smoke pass across those lanes
- then distribute the 11 faulted runs across the same 3 lanes, which gives a minimum of 4 fault-window waves instead of one serial 11-run queue
- reset or rebootstrap each lane between release-gate runs so one failed nemesis slice does not contaminate the next run
- use the guest-local replica log files during fault debugging to confirm which replica accepted a new role/view, which replica rejected prepare or commit traffic, and whether a rejoined replica restarted into the expected committed prefix
Any of these outcomes blocks release:
- linearizability violation
- duplicate committed execution for one `operation_id`
- double allocation
- stale successful read from a non-primary or under-applied replica
- early resource reuse after expiration
- unresolved ambiguous client outcome after retry cleanup
See replication.md for the protocol draft and the replicated-simulation section above for the deterministic pre-Jepsen gate.
Research Influence
This testing strategy is informed by:
- TigerBeetle's safety and simulation emphasis
- FoundationDB's deterministic simulation approach
- Dropbox Nucleus' single-control-thread and random fault-testing work
- Jepsen's analysis of real client-visible ambiguity and fault handling