AllocDB Testing Strategy

Scope

This document defines the v1 testing model and the additional testing gate required before any replicated release.

Principle

Unit tests are necessary but not sufficient.

AllocDB should follow the line of thinking established by TigerBeetle, FoundationDB, and Dropbox Nucleus:

  • deterministic execution is a design property
  • simulation should run the real code
  • failures should be injected systematically, not only reproduced after the fact

v1 Testing Layers

State-Machine Tests

Required coverage:

  • every legal transition
  • every illegal transition
  • resource and reservation state agreement
  • terminal-state behavior
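A state-machine test of this shape can be sketched as a transition table plus assertions over both legal and illegal edges. The `ReservationState`, `Event`, and `transition` names below are illustrative stand-ins, not the real trusted-core types:

```rust
// Hypothetical sketch: the real AllocDB state machine lives in the trusted
// core; these types exist only to illustrate the required coverage shape.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ReservationState {
    Pending,
    Held,
    Confirmed,
    Released, // terminal
    Expired,  // terminal
}

#[derive(Debug, Clone, Copy)]
enum Event {
    Grant,
    Confirm,
    Release,
    Expire,
}

/// Returns the next state for a legal transition, or None for an illegal one.
fn transition(s: ReservationState, e: Event) -> Option<ReservationState> {
    use Event::*;
    use ReservationState::*;
    match (s, e) {
        (Pending, Grant) => Some(Held),
        (Held, Confirm) => Some(Confirmed),
        (Held, Expire) => Some(Expired),
        (Confirmed, Release) => Some(Released),
        // terminal states accept no further events
        _ => None,
    }
}

fn main() {
    // a legal transition
    assert_eq!(
        transition(ReservationState::Pending, Event::Grant),
        Some(ReservationState::Held)
    );
    // an illegal transition, e.g. confirming from a terminal state
    assert_eq!(transition(ReservationState::Released, Event::Confirm), None);
}
```

Exhaustive coverage then means one assertion per (state, event) pair, not just the happy path.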

Replay and Recovery Tests

Required coverage:

  • WAL replay equivalence
  • crash during WAL append
  • torn WAL tails
  • snapshot plus WAL recovery
  • corruption detection and fail-closed behavior
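The torn-tail and corruption cases above can be illustrated with a tiny framed-WAL sketch. The frame layout and the toy checksum are assumptions for illustration; the real WAL presumably uses a stronger per-record checksum:

```rust
// Sketch of torn-tail handling, assuming a framed WAL layout of
// [len: u32][checksum: u32][payload]. The checksum is a toy rolling hash.
fn checksum(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0u32, |acc, b| acc.wrapping_mul(31).wrapping_add(*b as u32))
}

fn append_record(wal: &mut Vec<u8>, payload: &[u8]) {
    wal.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    wal.extend_from_slice(&checksum(payload).to_le_bytes());
    wal.extend_from_slice(payload);
}

/// Returns the byte offset where the valid prefix ends, so recovery can
/// truncate a torn tail (or stop fail-closed on corruption) instead of
/// replaying garbage.
fn valid_prefix_len(wal: &[u8]) -> usize {
    let mut off = 0;
    while wal.len() - off >= 8 {
        let len = u32::from_le_bytes(wal[off..off + 4].try_into().unwrap()) as usize;
        let sum = u32::from_le_bytes(wal[off + 4..off + 8].try_into().unwrap());
        let body_start = off + 8;
        let body_end = match body_start.checked_add(len) {
            Some(e) if e <= wal.len() => e,
            _ => break, // truncated length: torn tail
        };
        if checksum(&wal[body_start..body_end]) != sum {
            break; // checksum mismatch: stop replay here
        }
        off = body_end;
    }
    off
}

fn main() {
    let mut wal = Vec::new();
    append_record(&mut wal, b"reserve r1");
    append_record(&mut wal, b"confirm r1");
    let intact = wal.len();
    append_record(&mut wal, b"release r1");
    wal.truncate(wal.len() - 3); // simulate a crash mid-append
    assert_eq!(valid_prefix_len(&wal), intact);
}
```

WAL replay equivalence then reduces to: replaying the valid prefix yields the same state as the live run that produced it.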

Idempotency and Submission Tests

Required coverage:

  • duplicate operation_id
  • operation_id reuse with different payload
  • indefinite outcomes resolved by retry with the same operation_id
  • behavior at dedupe-window expiry
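The duplicate and payload-reuse cases can be sketched with a dedupe cache keyed by operation_id. The `DedupeCache` type and its result variants are illustrative, not the real AllocDB submission surface:

```rust
use std::collections::HashMap;

// Hypothetical dedupe-cache sketch keyed by operation_id.
type OperationId = u64;

#[derive(Debug, Clone, PartialEq)]
enum SubmitResult {
    Committed(String), // cached or fresh committed outcome
    PayloadMismatch,   // operation_id reuse with a different payload
}

#[derive(Default)]
struct DedupeCache {
    seen: HashMap<OperationId, (String, String)>, // payload, outcome
}

impl DedupeCache {
    fn submit(&mut self, id: OperationId, payload: &str) -> SubmitResult {
        if let Some((prior_payload, outcome)) = self.seen.get(&id) {
            if prior_payload.as_str() != payload {
                return SubmitResult::PayloadMismatch;
            }
            // retry with the same operation_id resolves to the original result
            return SubmitResult::Committed(outcome.clone());
        }
        let outcome = format!("applied:{payload}");
        self.seen.insert(id, (payload.to_string(), outcome.clone()));
        SubmitResult::Committed(outcome)
    }
}

fn main() {
    let mut cache = DedupeCache::default();
    let first = cache.submit(7, "reserve r1");
    let retry = cache.submit(7, "reserve r1");
    assert_eq!(first, retry); // duplicate operation_id: exactly one execution
    assert_eq!(cache.submit(7, "reserve r2"), SubmitResult::PayloadMismatch);
}
```

The dedupe-window-expiry case is the interesting boundary: once an entry ages out of `seen`, a retry with the same operation_id is indistinguishable from a fresh submission, which is exactly why that boundary needs explicit coverage.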

Capacity Tests

Required coverage:

  • full submission queue
  • full reservation table
  • full expiration bucket
  • maximum TTL and retention settings
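The expected behavior at each bound is explicit rejection, not growth or silent drops. A minimal bounded-queue sketch, with illustrative types:

```rust
// Bounded-queue sketch for capacity behavior: a full submission queue should
// reject new work explicitly. SubmissionQueue is an illustrative stand-in.
struct SubmissionQueue {
    capacity: usize,
    pending: Vec<String>,
}

#[derive(Debug, PartialEq)]
enum EnqueueError {
    QueueFull,
}

impl SubmissionQueue {
    fn enqueue(&mut self, op: String) -> Result<(), EnqueueError> {
        if self.pending.len() >= self.capacity {
            return Err(EnqueueError::QueueFull); // fail closed at the bound
        }
        self.pending.push(op);
        Ok(())
    }
}

fn main() {
    let mut q = SubmissionQueue { capacity: 2, pending: Vec::new() };
    assert!(q.enqueue("reserve r1".into()).is_ok());
    assert!(q.enqueue("reserve r2".into()).is_ok());
    assert_eq!(q.enqueue("reserve r3".into()), Err(EnqueueError::QueueFull));
}
```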

Contention Tests

Required coverage:

  • many contenders for one resource
  • simultaneous expirations and confirms
  • retry storms with reused operation_id

Deterministic Simulation

v1 should add a deterministic simulator around the trusted core as early as practical.

The simulator should control:

  • slot advancement
  • ingress scheduling order
  • WAL write and fsync outcomes
  • crash points
  • restart timing

Properties:

  • seeded and reproducible
  • runs real state-machine and recovery code
  • supports shrinking failures to minimal cases when possible
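The seeded-and-reproducible property can be demonstrated with a self-contained PRNG; any deterministic generator works, and xorshift64 here is only a stand-in for whatever the real simulator uses:

```rust
// Minimal seeded-scheduling sketch: same seed plus same ready set must
// yield the same resulting order. Rng and schedule are illustrative.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        // xorshift64; seed must be nonzero
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Drains ready actions one at a time, using only the seed to pick among them.
fn schedule(seed: u64, mut ready: Vec<&'static str>) -> Vec<&'static str> {
    let mut rng = Rng(seed);
    let mut order = Vec::new();
    while !ready.is_empty() {
        let idx = (rng.next() as usize) % ready.len();
        order.push(ready.remove(idx));
    }
    order
}

fn main() {
    let actions = vec!["ingress_a", "ingress_b", "tick", "checkpoint"];
    let run1 = schedule(42, actions.clone());
    let run2 = schedule(42, actions.clone());
    assert_eq!(run1, run2); // seeded and reproducible
}
```

Shrinking then amounts to replaying the same seed against progressively smaller ready sets until the failure disappears.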

M4-S01 Harness Direction

The M4-S01 spike selected one external scripted driver as the starting point for M4-T01 through M4-T04.

The selected shape is:

  • wrap the real allocdb_node::SingleNodeEngine instead of adding simulator-only execution paths
  • keep one explicit simulated current slot in the harness and pass it into real engine calls
  • model ingress, tick, checkpoint, crash, restart, and injected persistence faults as explicit driver events
  • use a seed only to choose ordering among ready actions at the same logical slot
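The selected shape can be sketched as an explicit driver-event enum plus a harness that owns the one simulated slot. `Engine` below is a placeholder for the real `allocdb_node::SingleNodeEngine`, whose actual API this sketch does not claim to reproduce:

```rust
// Sketch of the external scripted driver shape. Engine stands in for the
// real allocdb_node::SingleNodeEngine; the event set mirrors the list above.
#[derive(Debug, Clone)]
enum DriverEvent {
    Ingress { op: String },
    Tick,
    Checkpoint,
    Crash,
    Restart,
    InjectPersistenceFault { fault: String },
}

struct Engine; // placeholder for the real engine

impl Engine {
    fn apply(&mut self, slot: u64, event: &DriverEvent) -> String {
        format!("slot={slot} event={event:?}")
    }
}

struct Harness {
    engine: Engine,
    slot: u64, // the one explicit simulated current slot
    transcript: Vec<String>,
}

impl Harness {
    fn advance_slot(&mut self) {
        self.slot += 1;
    }

    fn drive(&mut self, event: DriverEvent) {
        // the harness passes its simulated slot into real engine calls
        let line = self.engine.apply(self.slot, &event);
        self.transcript.push(line);
    }
}

fn main() {
    let mut h = Harness { engine: Engine, slot: 0, transcript: Vec::new() };
    h.drive(DriverEvent::Ingress { op: "reserve r1".into() });
    h.advance_slot();
    h.drive(DriverEvent::Tick);
    assert_eq!(h.transcript.len(), 2);
    assert_eq!(h.slot, 1);
}
```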

The promoted M4-T01 harness now lives in crates/allocdb-node/src/simulation.rs, with regression coverage in crates/allocdb-node/src/simulation_tests.rs. The current evidence shows:

  • the same seed reproduces the same same-slot action order and LSN transcript
  • the same seed plus enabled crash-point set reproduces the same one-shot crash selection, independent of slice order
  • advancing the simulated slot without ticking produces deterministic expiration backlog
  • a checkpoint plus WAL-backed restart path still works when an expiration commit halts the live engine after append and before sync
  • seeded crash plans now interrupt the real engine at client submit/apply, checkpoint, and recovery boundaries, with restart tests covering post-sync submit replay, snapshot-written before WAL rewrite, and replay-interrupted recovery
  • seeded schedule actions can now resolve one labeled ingress or tick action into one candidate slot window, replay the same resolved schedule from seed, and record one transcript that captures both chosen slots and outcomes
  • seeded schedule exploration now covers ingress contention order, same-deadline expiration selection while preserving earliest-deadline priority under bounded tick throughput, and retry timing across the dedupe window with replay from the same seed
  • one-shot storage-fault helpers now cover append-failure halts, sync-failure ambiguity, checksum-mismatch fail-closed recovery, and torn-tail truncation against the real WAL and restart path

What to reuse in follow-up tasks:

  • the external-driver architecture
  • explicit slot advancement under test control
  • seeded scheduling for same-slot ready work
  • labeled schedule actions with candidate slot windows and replayable transcripts
  • seeded due-expiration selection over the real internal-expire path while preserving earliest-deadline priority
  • seeded one-shot crash plans over real engine and recovery boundaries
  • one-shot storage-fault helpers over live WAL writes and post-crash WAL mutation
  • restart helpers that reopen from snapshot plus WAL on disk

What not to promote directly:

  • the original spike's ad hoc helper surface
  • any scheduler choice that is not covered by deterministic transcript tests
  • opaque randomized loops that do not record the resolved schedule and seed
  • crash toggles that are not selected from a seed and named boundary set
  • one-off layouts that hide the reusable harness from follow-on simulation tasks

This direction keeps trusted-core churn low because the real engine already exposes the slot, checkpoint, recovery, and failure-injection seams the simulator needs. It also avoids trait-heavy virtual clock or fake storage abstractions inside the core before the project has proven they are necessary.

Replicated Deterministic Simulation

M6-T02 extends the single-node simulation approach to replicated execution without introducing a mock semantics layer.

The rule stays the same:

  • run the real allocator and recovery code
  • keep time, message delivery, crash, and restart under explicit test control
  • make every schedule decision reproducible from seed plus transcript

Design Goal

The replicated simulator should answer one question before Jepsen exists:

Can the chosen replication protocol preserve the single-node invariants under deterministic fault schedules?

That means the simulator is not a toy cluster model. It is a deterministic cluster driver around real replica state, real durable state transitions, and real retry semantics.

Cluster Harness Shape

The first replicated harness should model one fixed-membership shard with 3 real replicas.

The harness should:

  • wrap one real replicated node per replica, each with its own WAL and snapshot workspace
  • keep one explicit simulated slot counter shared by the cluster driver
  • treat protocol messages as explicit driver-visible events
  • treat timeout firing, view change, crash, restart, and rejoin as explicit driver-visible events
  • use a seed only to choose among already-ready actions at the same simulated slot
  • record one transcript containing seed, chosen actions, delivered messages, dropped messages, and resulting view and lsn observations

What must not happen:

  • no simulator-only apply path
  • no fake replica state that bypasses the real durable log and recovery path
  • no hidden random network loop that cannot be replayed from transcript

Driver Actions

The replicated harness should choose among explicit labeled actions such as:

  • client_submit
  • deliver_protocol_message
  • drop_protocol_message
  • advance_slot
  • fire_timeout
  • crash_replica
  • restart_replica
  • allow_rejoin
  • complete_snapshot_transfer

As with the current single-node schedule exploration, the harness should resolve one ready action into one recorded transcript step. The recorded step should include enough metadata to replay the same cluster schedule exactly.
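The labeled-action surface and the replayable transcript step can be sketched together. These shapes are illustrative; the action labels mirror the list above, and the metadata fields are one plausible reading of "enough to replay exactly":

```rust
// Illustrative shapes only: ClusterAction mirrors the driver actions above,
// and TranscriptStep carries the metadata a replayable step needs.
#[derive(Debug, Clone)]
enum ClusterAction {
    ClientSubmit { op_id: u64 },
    DeliverProtocolMessage { msg_id: u64, to: u8 },
    DropProtocolMessage { msg_id: u64 },
    AdvanceSlot,
    FireTimeout { replica: u8 },
    CrashReplica { replica: u8 },
    RestartReplica { replica: u8 },
    AllowRejoin { replica: u8 },
    CompleteSnapshotTransfer { replica: u8 },
}

#[derive(Debug)]
struct TranscriptStep {
    seed: u64,
    slot: u64,
    action: ClusterAction,
    // observations recorded so the same cluster schedule replays exactly
    view: u64,
    committed_lsn: u64,
}

fn record(seed: u64, slot: u64, action: ClusterAction, view: u64, lsn: u64) -> TranscriptStep {
    TranscriptStep { seed, slot, action, view, committed_lsn: lsn }
}

fn main() {
    let step = record(7, 3, ClusterAction::ClientSubmit { op_id: 1 }, 0, 0);
    assert_eq!(step.slot, 3);
    assert_eq!(step.seed, 7);
}
```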

The initial M7-T02 implementation keeps that surface intentionally narrow around the current codebase:

  • queue_protocol_message
  • deliver_protocol_message
  • drop_protocol_message
  • set_connectivity
  • crash_replica
  • restart_replica
  • explore_schedule

The harness already hosts three real ReplicaNodes with independent durable workspaces and one shared seeded slot driver. M7-T03 upgrades that queue from opaque labels to real prepare/prepare_ack/commit payloads while keeping the same deterministic delivery, partition, crash, and replay surface. M7-T05 extends that same harness with checkpoint_replica and rejoin_replica helpers so deterministic tests can force one retained WAL floor on the primary, then prove suffix-only catch-up, snapshot transfer, and faulted-rejoin rejection through the real on-disk restart path.

Network And Failure Model In Simulation

The deterministic cluster driver should model:

  • connectivity as an explicit replica-to-replica and client-to-replica delivery matrix
  • partitions as rule changes in that matrix, not as implicit timing guesses
  • process crash as loss of volatile state with durable WAL and snapshot files left on disk
  • restart as reopening the replica from its own durable state plus replicated catch-up
  • rejoin as restoring connectivity and allowing suffix catch-up or snapshot-plus-suffix catch-up
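The explicit delivery matrix can be sketched directly; replica indices and helper names here are illustrative, but the key property holds: a partition is a rule change, and healing is restoring the rules:

```rust
// Connectivity-matrix sketch: partitions are explicit rule changes in a
// delivery matrix, not implicit timing guesses. Names are illustrative.
struct Connectivity {
    allowed: [[bool; 3]; 3], // allowed[from][to] for replicas 0..3
}

impl Connectivity {
    fn fully_connected() -> Self {
        Connectivity { allowed: [[true; 3]; 3] }
    }

    /// Partition: cut every link between the two sides, both directions.
    fn partition(&mut self, side_a: &[usize], side_b: &[usize]) {
        for &a in side_a {
            for &b in side_b {
                self.allowed[a][b] = false;
                self.allowed[b][a] = false;
            }
        }
    }

    /// Heal: restore all links.
    fn heal(&mut self) {
        self.allowed = [[true; 3]; 3];
    }

    fn can_deliver(&self, from: usize, to: usize) -> bool {
        self.allowed[from][to]
    }
}

fn main() {
    let mut net = Connectivity::fully_connected();
    net.partition(&[0], &[1, 2]); // isolate the primary from the majority
    assert!(!net.can_deliver(0, 1));
    assert!(net.can_deliver(1, 2)); // the majority side still talks
    net.heal();
    assert!(net.can_deliver(0, 1));
}
```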

The simulator does not need arbitrary packet corruption in the first replicated design pass. Message loss, partition, delay modeled as deferred delivery, crash, restart, and rejoin are the required first faults.

Required Scenario Families

The first replicated simulation plan must cover these families explicitly.

Partition Scenarios

  • isolate the primary from one backup but keep quorum, so writes still commit and the minority replica later catches up
  • isolate the primary from the majority, so the old primary stops serving and a new primary must win the higher view
  • split the cluster into non-quorum minorities, so no side commits and reads fail closed
  • heal the partition and verify that all healthy replicas converge on one committed prefix

Key checks:

  • no split-brain commit
  • no read served from a view-uncertain or quorum-lost replica
  • retries with the same operation_id resolve ambiguity without duplicate execution
  • a healed but stale replica catches up only after the committed prefix already accepted by the healthy quorum is restored locally

Primary Crash Scenarios

  • crash the primary before quorum append, so the write remains uncommitted and clients see only an indefinite outcome
  • crash the primary after quorum append, so retry must recover the committed result even if failover interrupts reply delivery or commit propagation
  • crash the primary after reply and force later reads and retries through the new primary
  • crash during expiration leadership, so overdue work may be delayed but never applied early

Key checks:

  • committed entries survive failover unchanged
  • uncommitted suffix does not become visible as committed history
  • expiration remains log-driven and may be late but not early

Rejoin And Recovery Scenarios

  • restart a stale backup and catch up by replicated suffix only
  • restart a replica whose local state requires snapshot-plus-suffix transfer
  • rejoin a replica that holds an uncommitted divergent suffix and verify that the suffix is discarded safely
  • restart a replica with invalid local durable state and verify that it stays faulted instead of voting or serving

Key checks:

  • rejoined replicas recover the committed prefix already accepted by the healthy quorum
  • committed history is never rewritten during catch-up
  • rejoin never regresses one replica's durable view knowledge behind the current primary
  • corrupted replicas fail closed until repaired

Required Invariants In Simulation

Every promoted replicated simulation test should check some subset of these invariants:

  • same seed plus same starting durable state yields the same transcript
  • no two different payloads commit at the same lsn
  • replicas that apply the same committed prefix reach the same allocator state and outputs
  • a client-visible success is published only for a quorum-committed entry
  • retry with the same operation_id never creates a second successful execution
  • reads succeed only on the current primary after the requested required_lsn is locally applied
  • resource reuse after expiration is never earlier than the single-node rules allow
  • rejoin never lets a stale or corrupted replica serve, vote, or lead before validation completes

Promotion Path

The recommended implementation sequence is:

  1. add a deterministic cluster driver that can host 3 real replicas, one message queue, and one explicit connectivity map
  2. add seeded message-delivery and timeout scheduling with replayable transcripts
  3. add partition scenarios that prove fail-closed reads and no split-brain commit
  4. add primary-crash scenarios that exercise retry semantics around quorum commit boundaries
  5. add rejoin scenarios for suffix catch-up, snapshot transfer, and divergent-suffix truncation
  6. add replicated storage-fault combinations only after the basic cluster schedule is already replayable

This keeps the first replicated simulator narrow enough to validate protocol behavior before expanding into broader randomized search.

Current executable replicated coverage already proves:

  • a quorum-lost primary fails closed for new writes and strict reads even when it still has local committed state
  • a higher-view takeover can reconstruct the latest committed prefix on a new primary before that replica returns to normal mode
  • stale or quorum-lost primaries reject reads after failover instead of serving stale success
  • the primary can keep serving through one isolated-backup partition, then heal and catch that stale backup up to the committed prefix without duplicate execution
  • a full split into non-quorum minorities fails closed until one majority reforms, after which the new primary can accept writes and rejoin the stale replica back onto the committed prefix
  • a primary crash before quorum append preserves indefinite ambiguity until failover, after which a retry with the same operation_id commits exactly once on the new primary
  • a primary crash after majority append lets the next primary reconstruct the committed prefix and resolve the retry from cache instead of executing it again
  • a primary crash after reply preserves strict-primary reads and cached retry results on the next primary
  • a stale replica can rejoin by replicated suffix only when it still holds one recent enough committed durable prefix
  • a primary checkpoint can force snapshot transfer for older replicas whose local durable state falls behind the retained WAL floor
  • rejoin now fails closed if the target has already observed a higher durable view than the current primary
  • rejoin discards one divergent uncommitted suffix and rejects one replica forced into faulted state by corrupted durable metadata

Local Multi-Process Runner Smoke Test

M8-T01 adds the first external-process smoke gate on top of the replicated node library.

The required focused validation command is:

  • cargo test -p allocdb-node local_cluster -- --nocapture

What this smoke test proves today:

  • one operator command surface can start 3 external replica processes
  • the chosen workspace layout, loopback addresses, and per-replica bounds persist across restart
  • each replica answers one live control status request with PID, role, view, recovery, address, and path details
  • one operator stop command shuts the cluster down cleanly and removes the live pid files

What it does not claim yet:

  • it does not prove external failover, rejoin, or stale-primary rejection yet
  • it does not prove background expiration or checkpoint workers inside the replica daemons
  • Jepsen and QEMU-backed validation remain follow-on gates after the local process surface is in place

Local Fault-Control Harness

M8-T02 adds the first reusable disruption surface on top of the local multi-process runner.

The required focused validation command remains:

  • cargo test -p allocdb-node local_cluster -- --nocapture

What this harness proves today:

  • one operator command surface can crash and restart one replica process without changing its configured identity, addresses, or durable workspace
  • one operator command surface can isolate one replica from external client and protocol traffic while preserving control reachability for later debug and recovery
  • the reserved client and protocol listeners now fail with an explicit isolation error when the fault harness marks that replica isolated
  • one persisted cluster-timeline.log records cluster start/stop plus replica crash, restart, isolate, and heal events in replayable order

What it still does not claim:

  • the external process boundary still does not carry the real replicated client or protocol traffic yet
  • the harness does not yet orchestrate VM-level network partitions or reboots
  • Jepsen and QEMU-backed validation remain follow-on gates after this local disruption surface

Local QEMU Testbed

M8-T03 adds the first repeatable VM-backed cluster surface on top of the local process and fault-control tools.

The focused validation commands are:

  • cargo test -p allocdb-node qemu_testbed -- --nocapture
  • cargo run -p allocdb-node --bin allocdb-qemu-testbed -- prepare --workspace <path> --base-image-path <cloudimg.qcow2> --local-cluster-bin <allocdb-local-cluster>

The host-side prerequisites for that command are:

  • QEMU_SHARE_DIR can point at a non-default QEMU firmware directory when the host does not keep firmware templates under the standard search paths
  • QEMU_ACCEL can override the accelerator embedded into the rendered QEMU command; the default is hvf on macOS and kvm on Linux
  • prepare hard-fails unless the host has the arch-specific qemu-system-* binary, qemu-img, ssh, and ssh-keygen on PATH
  • prepare uses hdiutil on macOS and mkisofs or genisoimage on Linux-class hosts when it builds NoCloud seed images, and it hard-fails if the platform-appropriate ISO builder is absent
  • prepare hard-fails unless the configured allocdb-local-cluster binary already exists on the host
  • if --base-image-path does not already exist, prepare hard-fails unless curl is on PATH and the base-image download succeeds
  • prepare hard-fails if the QEMU firmware templates are not reachable either through the default search paths or QEMU_SHARE_DIR

What this testbed proves today:

  • one host-side command surface can generate repeatable QEMU assets for 3 replica guests plus 1 control guest
  • each replica guest uses one copy-on-write overlay, one generated NoCloud seed image, and one static local-cluster layout that runs the existing allocdb-local-cluster replica-daemon
  • replica control moves onto one management network while replica client and protocol listeners move onto one separate cluster network
  • one generated control-node script can drive status, isolate, heal, crash, restart, reboot, export-replica, import-replica, and collect-logs operations against the replica guests
  • the generated workspace keeps overlay images, firmware vars, guest seeds, console logs, and SSH keys in stable paths suitable for scripted follow-on runs

What it still does not claim:

  • the QEMU testbed still relies on the host-side Jepsen runner for failover/rejoin orchestration; the guest runtime itself does not contain a standalone distributed control plane

Jepsen Harness Slice

M8-T04 now adds the first host-side Jepsen harness tooling around that external replicated surface:

  • cargo test -p allocdb-node jepsen -- --nocapture
  • cargo test -p allocdb-node --bin allocdb-jepsen -- --nocapture
  • cargo run -p allocdb-node --bin allocdb-jepsen -- plan
  • cargo run -p allocdb-node --bin allocdb-jepsen -- analyze --history-file <history.txt>
  • cargo run -p allocdb-node --bin allocdb-jepsen -- capture-kubevirt-layout --workspace <path> --kubeconfig <path> --namespace <name> --ssh-private-key <path>
  • cargo run -p allocdb-node --bin allocdb-jepsen -- verify-qemu-surface --workspace <path>
  • cargo run -p allocdb-node --bin allocdb-jepsen -- verify-kubevirt-surface --workspace <path>
  • cargo run -p allocdb-node --bin allocdb-jepsen -- run-qemu --workspace <path> --run-id <run-id> --output-root .artifacts/
  • cargo run -p allocdb-node --bin allocdb-jepsen -- run-kubevirt --workspace <path> --run-id <run-id> --output-root .artifacts/
  • cargo run -p allocdb-node --bin allocdb-jepsen -- watch-kubevirt --workspace <path> --output-root .artifacts/ [--run-id <run-id>] [--follow]
  • cargo run -p allocdb-node --bin allocdb-jepsen -- watch-kubevirt-fleet --lane <name,workspace,output-root> [--lane <name,workspace,output-root> ...] [--refresh-millis <ms>] [--follow]
  • cargo run -p allocdb-node --bin allocdb-jepsen -- archive-qemu --workspace <path> --run-id <run-id> --history-file <history.txt> --output-root .artifacts/
  • cargo run -p allocdb-node --bin allocdb-jepsen -- archive-kubevirt --workspace <path> --run-id <run-id> --history-file <history.txt> --output-root .artifacts/

Use the repo-local .artifacts/ directory as the default Jepsen output root so release-gate evidence stays inside the workspace tree without polluting git history.

For local debugging only, faulted run-qemu and run-kubevirt runs also honor ALLOCDB_JEPSEN_FAULT_WINDOW_SECS_OVERRIDE=<secs>. That override shortens the live fault window for fast repros, but runs executed with it do not count toward the documented release gate.

For KubeVirt fault debugging, inspect the guest-local /var/log/allocdb/replica-{1,2,3}.log files as well as the archived journal.log. The current collect-logs bundle captures the systemd journal and control snapshots, but the newest replica role/view transition logs land in the daemon log file today, together with successful quorum, commit-broadcast, protocol-accept, and expiration-batch traces.

What this harness slice proves today:

  • one command can materialize the exact 15 documented first-release gate runs as a stable matrix
  • one retry-aware history analyzer folds ambiguous attempts and later retries by stable operation_id
  • the analyzer automatically blocks on the current release-blocking outcomes the repo already documents: duplicate committed execution, double allocation, stale successful reads, early expiration release, unresolved ambiguity, and writes that are missing operation_id
  • one archive command can bundle the analyzed history with one fetched external-cluster log archive and one manifest rooted on the host
  • one surface-probe command can issue one real get_metrics request to every replica, then drive one real create_resource submit plus one fenced get_resource read through the configured primary against either the QEMU or KubeVirt cluster
  • one capture-kubevirt-layout command can persist the live control/replica VM IPs, helper-pod settings, SSH key path, and per-replica network addresses needed to drive the Jepsen matrix against a running KubeVirt deployment
  • one per-run request namespace now keeps client IDs and request slots distinct across separate allocdb-jepsen invocations, so operators can execute multiple Jepsen runs against one persistent external cluster without tripping monotonic-slot invariants from earlier runs
  • one run-qemu and one run-kubevirt command can execute the full documented release-gate matrix against the live external cluster and persist one analyzed history plus one artifact bundle for each run
  • one watch-kubevirt terminal command can follow the currently active or named KubeVirt run by reading one per-run status file, one per-run event log, and live replica control/metrics state while the run is in progress; --follow keeps the dashboard open across terminal run states so operators can leave one watcher running between repeated Jepsen invocations
  • one watch-kubevirt-fleet terminal command can follow multiple KubeVirt Jepsen lanes at once by reading each lane's allocdb-jepsen-latest-status.txt, then combining lane summaries, replica health, and recent events into one dashboard suitable for a parallel 3-lane release-gate run
  • the faulted Jepsen families now loop whole scenario iterations until the documented minimum fault window is actually satisfied, with one fresh request namespace per iteration and one monotonic history sequence across the full long-running run
  • those faulted iterations now also repair the external lane back to one primary plus two backups before the next scenario slice starts, so repeated crash/partition loops do not inherit a stale partially recovered runtime from the previous iteration
  • staged replica export/import now uses lane-scoped unique temporary workspaces on the host, so concurrent multi-lane failover and rejoin orchestration cannot collide on shared temp paths
  • those runs now exercise one real hot-resource contention sequence, one real stable operation_id retry-cache replay, one real primary-versus-backup read-role check, one real replicated tick_expirations path through the external runtime, and one host-side crash/partition/mixed-failover cutover path built from replica workspace export/import plus the existing ReplicaNode::recover(...) logic on staged copies
  • the KubeVirt path now has live proof for one captured layout, one successful verify-kubevirt-surface run, and one successful reservation_contention-control run-kubevirt history plus artifact bundle against the real allocdb-{control,replica-*} VMs
  • the failover_read_fences-partition-heal scenario now retries its minority-partition ambiguous write after failover on the new primary, and a fresh short KubeVirt rerun with the debug fault window override passed with release_gate=passed and blockers=0
  • the expiration-and-recovery scenarios now tolerate bounded expiration backlog on long-lived lanes: after one committed expiration tick, the harness will issue follow-up ticks until the specific reserved resource becomes Available, instead of assuming the first committed tick always drained the target reservation
  • the full documented 15-run release-gate matrix has now completed on the rebuilt 3-lane KubeVirt profile that uses longhorn-strict-local-wffc plus replica-only anti-affinity, and every run finished with release_gate_passed=true and blockers=0

What it still does not claim:

  • partition control still uses the existing whole-replica client/protocol isolation surface, not arbitrary packet loss or per-link delay injection
  • the release gate still depends on operators running the full external-cluster matrix end to end; the unit and integration suite only proves the harness and orchestration code paths

Jepsen Validation Gate

M6-T03 defines the external validation required before any replicated release.

Design Goal

Jepsen should answer a different question than deterministic simulation:

Does the deployed replicated system preserve the client contract under real network faults and client-visible ambiguity?

Deterministic simulation proves protocol behavior against explicit schedules. Jepsen validates the same contract through the real client surface, real routing mistakes, real timeout behavior, and real failover and recovery timing.

Minimum Testbed

The first Jepsen gate should target the same narrow release shape as the replication draft:

  • one shard
  • 3 replicas
  • fixed membership
  • the real external API planned for the replicated release
  • clients that preserve operation_id across retries
  • a pre-created bounded resource set large enough to force contention and expiration
  • logical-slot advancement and tick_expirations driven through the public or operator-facing surface, not simulator internals

5-replica clusters, online reconfiguration, and multi-shard behavior are follow-on work, not part of the first Jepsen gate.

Required Workload Families

Every release candidate should run these workload families.

Reservation Contention

  • many clients reserve, confirm, and release from a hot resource set
  • duplicate retries and conflicting operation_id reuse are injected deliberately
  • reads sample resource and reservation state during load

Checks:

  • successful operations remain linearizable
  • no resource is committed to two holders at once
  • one operation_id never produces two successful executions

Ambiguous Write Retry

  • crash or isolate the primary around quorum-commit boundaries for reserve, confirm, and release
  • let clients observe timeouts or indefinite write outcomes, then force retries with the same operation_id

Checks:

  • an ambiguous write resolves to at most one committed result
  • retry returns the original committed result after failover when the first attempt already committed
  • unresolved ambiguity after cleanup fails the gate

Failover And Read Fences

  • mix writes with required_lsn reads while primaries fail, elections run, and some clients route to stale nodes
  • keep both read-only traffic and read-after-write traffic active during failover

Checks:

  • successful reads come only from the current primary after the requested required_lsn is locally applied
  • stale or quorum-lost replicas fail closed instead of serving stale success
  • committed state remains linearizable across primary change

Expiration And Recovery

  • create expiring reservations, advance logical time, call tick_expirations, and combine that load with crash, partition, restart, and rejoin
  • force rejoin after stale state, suffix catch-up, and snapshot-plus-suffix recovery paths

Checks:

  • expiration may be delayed by failover but never frees a resource early
  • restarted and rejoined replicas converge on committed history before serving or voting
  • recovery preserves the same client-visible result for retried operations

Nemesis Families

The first Jepsen gate should explicitly cover:

  • primary crash and restart
  • majority-loss and minority-loss partitions plus heal
  • stale-primary isolation with client misrouting to the old primary
  • backup crash during catch-up and later rejoin
  • mixed crash-plus-partition schedules around ambiguous writes

Clock skew, disk corruption, membership change, and multi-shard faults are not part of the first Jepsen gate. Those are either already covered by deterministic simulation and local durability testing or deferred until later replicated milestones.

History Interpretation

Jepsen must interpret AllocDB histories using the product's retry contract, not a generic "exactly once" assumption.

Rules:

  • every mutating client operation carries a stable operation_id
  • definite successes and definite failures enter the history directly
  • timeouts, transport loss, and other indefinite outcomes are recorded as ambiguous client events, not as silent success or failure
  • the checker folds an ambiguous event and all later retries with the same operation_id into one logical command
  • cleanup retries every ambiguous command within the dedupe window; any ambiguity that remains unresolved after cleanup fails the gate
  • fail-closed read rejection from stale or quorum-lost replicas is an allowed outcome; stale successful read is not
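The folding rule can be sketched as a pure function over the client history. The event and outcome shapes are illustrative; the real checker consumes Jepsen histories:

```rust
use std::collections::HashMap;

// History-folding sketch following the retry contract above.
#[derive(Debug, Clone, PartialEq)]
enum ClientEvent {
    Ok { op_id: u64, result: String },
    Fail { op_id: u64 },
    Ambiguous { op_id: u64 }, // timeout / transport loss: indefinite outcome
}

#[derive(Debug, PartialEq)]
enum LogicalOutcome {
    Committed(String),
    Failed,
    Unresolved, // ambiguity no later retry resolved: fails the gate
}

/// Folds an ambiguous event and all later retries with the same operation_id
/// into one logical command.
fn fold(history: &[ClientEvent]) -> HashMap<u64, LogicalOutcome> {
    let mut folded = HashMap::new();
    for event in history {
        match event {
            ClientEvent::Ok { op_id, result } => {
                folded.insert(*op_id, LogicalOutcome::Committed(result.clone()));
            }
            ClientEvent::Fail { op_id } => {
                // a definite failure resolves earlier ambiguity,
                // but never overwrites a committed result
                if !matches!(folded.get(op_id), Some(LogicalOutcome::Committed(_))) {
                    folded.insert(*op_id, LogicalOutcome::Failed);
                }
            }
            ClientEvent::Ambiguous { op_id } => {
                folded.entry(*op_id).or_insert(LogicalOutcome::Unresolved);
            }
        }
    }
    folded
}

fn main() {
    let history = vec![
        ClientEvent::Ambiguous { op_id: 1 },
        ClientEvent::Ok { op_id: 1, result: "reserved r1".into() }, // retry resolves it
        ClientEvent::Ambiguous { op_id: 2 },                        // never retried
    ];
    let folded = fold(&history);
    assert_eq!(folded[&1], LogicalOutcome::Committed("reserved r1".into()));
    assert_eq!(folded[&2], LogicalOutcome::Unresolved);
}
```

Any `Unresolved` entry remaining after cleanup retries is exactly the gate-failing outcome the rules describe.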

Required Checkers

The Jepsen analysis should include at least:

  • a linearizability checker over successful writes and successful reads captured from the QEMU-backed runtime histories
  • an operation_id uniqueness checker that rejects duplicate committed execution
  • a resource-safety checker that rejects double allocation
  • a strict-read fence checker for successful required_lsn reads
  • an expiration-safety checker that rejects early reuse
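The resource-safety checker can be sketched the same way: a scan that tracks the current holder per resource and flags any overlap. The event shape is illustrative; real input is the analyzed Jepsen history:

```rust
use std::collections::HashMap;

// Resource-safety checker sketch: rejects double allocation.
#[derive(Debug, Clone)]
enum ResourceEvent {
    Granted { resource: String, holder: String },
    Released { resource: String },
}

/// Returns the first resource committed to two holders at once, if any.
fn first_double_allocation(events: &[ResourceEvent]) -> Option<String> {
    let mut held: HashMap<&str, &str> = HashMap::new();
    for e in events {
        match e {
            ResourceEvent::Granted { resource, holder } => {
                if let Some(prior) = held.get(resource.as_str()) {
                    if *prior != holder.as_str() {
                        return Some(resource.clone()); // two holders at once
                    }
                }
                held.insert(resource.as_str(), holder.as_str());
            }
            ResourceEvent::Released { resource } => {
                held.remove(resource.as_str());
            }
        }
    }
    None
}

fn main() {
    let ok = vec![
        ResourceEvent::Granted { resource: "r1".into(), holder: "c1".into() },
        ResourceEvent::Released { resource: "r1".into() },
        ResourceEvent::Granted { resource: "r1".into(), holder: "c2".into() },
    ];
    assert_eq!(first_double_allocation(&ok), None);

    let bad = vec![
        ResourceEvent::Granted { resource: "r2".into(), holder: "c1".into() },
        ResourceEvent::Granted { resource: "r2".into(), holder: "c2".into() },
    ];
    assert_eq!(first_double_allocation(&bad), Some("r2".to_string()));
}
```

The expiration-safety checker is the same scan with one extra rule: a `Granted` following an expiration is a violation if it lands before the single-node expiration rules would allow reuse.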

Release Gate

Jepsen is not a substitute for simulation. The deterministic replicated-simulation gate must pass before Jepsen begins.

The minimum release gate for the first replicated version is:

  • one control run for each workload family with no nemesis
  • one crash-restart run for each workload family
  • one partition-heal run for each workload family
  • one mixed failover run for the ambiguity, failover, and expiration workloads
  • every faulted run lasts at least 30 minutes after the first injected fault
  • every run archives Jepsen history, client logs, replica logs, and a cluster-timeline summary

On the current KubeVirt cluster, the practical parallel split is 3 independent AllocDB clusters, with one Jepsen run active per cluster. The recommended execution shape is:

  • use 3 lanes such as lane-a, lane-b, and lane-c, each with its own VM names, helper pod, bootstrap workspace, Jepsen workspace, and artifact root
  • run the 4 control scenarios first as a short smoke pass across those lanes
  • then distribute the 11 faulted runs across the same 3 lanes, which gives a minimum of 4 fault-window waves instead of one serial 11-run queue
  • reset or rebootstrap each lane between release-gate runs so one failed nemesis slice does not contaminate the next run
  • use the guest-local replica log files during fault debugging to confirm which replica accepted a new role/view, which replica rejected prepare or commit traffic, and whether rejoin rewrites restarted into the expected committed prefix

Any of these outcomes blocks release:

  • linearizability violation
  • duplicate committed execution for one operation_id
  • double allocation
  • stale successful read from a non-primary or under-applied replica
  • early resource reuse after expiration
  • unresolved ambiguous client outcome after retry cleanup

See replication.md for the protocol draft and the replicated-simulation section above for the deterministic pre-Jepsen gate.

Research Influence

This testing strategy is informed by:

  • TigerBeetle's safety and simulation emphasis
  • FoundationDB's deterministic simulation approach
  • Dropbox Nucleus' single-control-thread and random fault-testing work
  • Jepsen's analysis of real client-visible ambiguity and fault handling