AllocDB Operator Runbook

Scope

This runbook covers the current operator surfaces implemented by crates/allocdb-node.

It is intentionally limited to behavior that exists on the current branch:

the single-node alpha engine
the first local multi-process replicated cluster runner
the first local QEMU replicated testbed
the first Kubernetes StatefulSet packaging for the replica daemon

Current operational constraints for the local replicated runner:

the cluster runner binds one loopback control, client, and protocol address per replica
control handles replica status and stop orchestration
client serves real replicated submit, strict-primary reads, get_metrics, and tick_expirations
protocol carries replica prepare and commit traffic between peers
there is a built-in local process fault-control harness and a QEMU control-node script
there is no built-in background expiration worker, checkpoint loop, or daemon-driven failover

Current operational constraints for the single-node alpha remain:

there is no standalone general-purpose node daemon beyond the local runner
the host process must choose WAL and snapshot paths, expose the API, set up logging, and drive tick_expirations

Current operational constraints for the Kubernetes deployment remain:

the deployed pods still run the existing replica daemon and do not add autonomous elections, background expiration ticks, or checkpoint loops
the bundled client service is one bootstrap-primary endpoint pinned to replica 1; failover still requires an operator or a downstream controller to repoint traffic deliberately
the per-pod layout is regenerated from StatefulSet peer DNS on every startup so peer addresses follow Kubernetes pod IP churn without changing the replicated runtime contract

Related docs:

Local Replicated Cluster Runner

Command surface:

cargo run -p allocdb-node --bin allocdb-local-cluster -- start --workspace <path>
cargo run -p allocdb-node --bin allocdb-local-cluster -- status --workspace <path>
cargo run -p allocdb-node --bin allocdb-local-cluster -- stop --workspace <path>
cargo run -p allocdb-node --bin allocdb-local-cluster -- crash --workspace <path> --replica-id <id>
cargo run -p allocdb-node --bin allocdb-local-cluster -- restart --workspace <path> --replica-id <id>
cargo run -p allocdb-node --bin allocdb-local-cluster -- isolate --workspace <path> --replica-id <id>
cargo run -p allocdb-node --bin allocdb-local-cluster -- heal --workspace <path> --replica-id <id>

What start does:

creates or reuses one stable workspace rooted at <path>
persists one cluster-layout.txt file with the chosen local bounds, replica identities, addresses, and file paths
launches 3 external replica processes from one command surface
recovers each replica through ReplicaNode::recover(...)
bootstraps a fresh workspace into view 1 with replica 1 as primary and replicas 2 and 3 as backup

What restart preserves:

replica IDs
the durable workspace layout under <path>/replica-{1,2,3}/
the persisted engine and core bounds stored in cluster-layout.txt
the chosen loopback control, client, and protocol addresses

Current workspace layout:

<path>/cluster-layout.txt
<path>/cluster-faults.txt
<path>/cluster-timeline.log
<path>/logs/replica-{1,2,3}.log
<path>/run/replica-{1,2,3}.pid
<path>/replica-{1,2,3}/replica.metadata
<path>/replica-{1,2,3}/replica.metadata.prepare
<path>/replica-{1,2,3}/state.snapshot
<path>/replica-{1,2,3}/state.wal

What status shows:

process ID
replica state, role, and current view
local network fault state for each replica
committed and active-snapshot lsn
startup recovery kind and write-acceptance flag when the wrapped engine is active
control, client, and protocol addresses
log, pid, metadata, prepare-log, snapshot, and WAL paths
fault-state and timeline file paths

What the fault-control commands do:

crash sends an abrupt local process kill to one replica and waits for the process to exit
restart spawns one replica back through the normal ReplicaNode::recover(...) path
isolate marks one replica isolated from external client and protocol traffic while leaving control reachable
heal restores that replica's external client and protocol connectivity
every command appends one ordered event to cluster-timeline.log

Current control hooks:

status
stop

Current limits:

the runner only auto-configures normal-mode roles on a fresh workspace when recovered metadata is still in recovering; after that it preserves recovered durable metadata instead of resetting views or roles
failover and rejoin still require external orchestration; replica daemons do not run autonomous elections or background catch-up loops
there is still no background expiration or checkpoint worker inside the replica daemons

Kubernetes Replica Deployment

Command surface:

docker build -t <image> .
kubectl apply -k deploy/kubernetes
kubectl -n allocdb-system set image statefulset/allocdb allocdb=<image>
kubectl -n allocdb-system get pods -l app.kubernetes.io/name=allocdb
kubectl -n allocdb-system logs statefulset/allocdb
kubectl -n allocdb-system port-forward svc/allocdb-primary-bootstrap 18000:18000

What the container image does:

builds allocdb-local-cluster and allocdb-k8s-layout
runs one small entrypoint that derives the local replica ID from the StatefulSet ordinal
resolves all peer pod DNS names from the headless service into concrete socket addresses
rewrites <workspace>/cluster-layout.txt on every pod startup
starts the existing replica-daemon path against that generated layout

What the manifest deploys:

one namespace allocdb-system
one headless service allocdb-internal for peer DNS and not-ready pod discovery
one bootstrap client service allocdb-primary-bootstrap pinned to pod allocdb-0
one 3-replica StatefulSet with per-replica PVCs mounted at /var/lib/allocdb
one emptyDir run directory for replica pid files under /run/allocdb
startup, readiness, and liveness probes driven through allocdb-local-cluster control-status

Current durable workspace layout per pod:

/var/lib/allocdb/cluster-layout.txt
/var/lib/allocdb/cluster-faults.txt
/var/lib/allocdb/replica-{1,2,3}/replica.metadata
/var/lib/allocdb/replica-{1,2,3}/replica.metadata.prepare
/var/lib/allocdb/replica-{1,2,3}/state.snapshot
/var/lib/allocdb/replica-{1,2,3}/state.wal
/run/allocdb/replica-{1,2,3}.pid

Recommended bootstrap flow:

build and publish one image from the repo root Dockerfile
update deploy/kubernetes/statefulset.yaml or use kubectl set image so the StatefulSet points at that image
apply deploy/kubernetes
wait for all 3 replica pods to become ready
send initial client traffic through allocdb-primary-bootstrap:18000
verify local runtime health with allocdb-local-cluster control-status --addr 127.0.0.1:17000 inside one pod before admitting downstream integration traffic

Current limits:

allocdb-primary-bootstrap is intentionally only a bootstrap-primary endpoint; it does not follow failover automatically
the deployed shape preserves one PVC per pod, but rejoin and failover remain external orchestration work
the replica daemon still does not drive tick_expirations or checkpoints in the background
the first manifest does not yet include NetworkPolicies, PodDisruptionBudgets, or automated upgrade/rollback orchestration

Local QEMU Testbed

Command surface:

cargo run -p allocdb-node --bin allocdb-qemu-testbed -- prepare --workspace <path> --base-image-path <cloudimg.qcow2> --local-cluster-bin <allocdb-local-cluster>
cargo run -p allocdb-node --bin allocdb-qemu-testbed -- start --workspace <path>
cargo run -p allocdb-node --bin allocdb-qemu-testbed -- status --workspace <path>
cargo run -p allocdb-node --bin allocdb-qemu-testbed -- stop --workspace <path>
cargo run -p allocdb-node --bin allocdb-qemu-testbed -- ssh-control --workspace <path>
cargo run -p allocdb-node --bin allocdb-qemu-testbed -- control --workspace <path> -- <status|isolate|heal|crash|restart|reboot|export-replica|import-replica|collect-logs> ...
cargo run -p allocdb-node --bin allocdb-jepsen -- plan
cargo run -p allocdb-node --bin allocdb-jepsen -- analyze --history-file <history.txt>
cargo run -p allocdb-node --bin allocdb-jepsen -- capture-kubevirt-layout --workspace <path> --kubeconfig <path> --namespace <name> --ssh-private-key <path>
cargo run -p allocdb-node --bin allocdb-jepsen -- verify-qemu-surface --workspace <path>
cargo run -p allocdb-node --bin allocdb-jepsen -- verify-kubevirt-surface --workspace <path>
cargo run -p allocdb-node --bin allocdb-jepsen -- run-qemu --workspace <path> --run-id <run-id> --output-root <artifacts>
cargo run -p allocdb-node --bin allocdb-jepsen -- run-kubevirt --workspace <path> --run-id <run-id> --output-root <artifacts>
cargo run -p allocdb-node --bin allocdb-jepsen -- watch-kubevirt --workspace <path> --output-root <artifacts> [--run-id <run-id>] [--follow]
cargo run -p allocdb-node --bin allocdb-jepsen -- watch-kubevirt-fleet --lane <name,workspace,output-root> [--lane <name,workspace,output-root> ...] [--follow]
cargo run -p allocdb-node --bin allocdb-jepsen -- archive-qemu --workspace <path> --run-id <run-id> --history-file <history.txt> --output-root <artifacts>
cargo run -p allocdb-node --bin allocdb-jepsen -- archive-kubevirt --workspace <path> --run-id <run-id> --history-file <history.txt> --output-root <artifacts>

For local diagnosis, operators can shorten one faulted run with ALLOCDB_JEPSEN_FAULT_WINDOW_SECS_OVERRIDE=<secs>. Use that only for debugging or quick repros; full release-gate evidence still requires the documented 1800s minimum fault window.

What prepare does:

persists one qemu-testbed-layout.txt file rooted at <path>
creates one copy-on-write overlay and one generated NoCloud seed image for the control guest and each replica guest
copies one per-guest firmware-vars file from the local QEMU firmware templates
writes one shared SSH keypair used by the host and control guest for replica orchestration
embeds the existing allocdb-local-cluster binary and one static local-cluster layout into the guest seed data

Host overrides and prerequisites:

QEMU_SHARE_DIR can point at a non-default QEMU firmware directory when the host does not keep firmware templates under the standard search paths
QEMU_ACCEL can override the accelerator embedded into the rendered QEMU command; the default is hvf on macOS and kvm on Linux
prepare hard-fails unless the host has the arch-specific qemu-system-* binary, qemu-img, ssh, and ssh-keygen on PATH
prepare uses hdiutil on macOS and mkisofs or genisoimage on Linux-class hosts when it builds NoCloud seed images, and it hard-fails if the platform-appropriate ISO builder is absent
prepare hard-fails unless the configured allocdb-local-cluster binary already exists on the host
if --base-image-path does not already exist, prepare hard-fails unless curl is on PATH and the base-image download succeeds
prepare hard-fails if the QEMU firmware templates are not reachable either through the default search paths or QEMU_SHARE_DIR

What start does:

boots 3 replica guests and 1 control guest under QEMU
places replica control on one management network and replica client/protocol on one separate cluster network
forwards host SSH only to the control guest; all replica orchestration then flows through the generated control-node script

Current workspace layout:

<path>/qemu-testbed-layout.txt
<path>/qemu/images/control-overlay.qcow2
<path>/qemu/images/replica-{1,2,3}-overlay.qcow2
<path>/qemu/seed/control-seed.iso
<path>/qemu/seed/replica-{1,2,3}-seed.iso
<path>/qemu/seed/allocdb-control/{meta-data,network-config,user-data}
<path>/qemu/seed/replica-{1,2,3}/{meta-data,network-config,user-data}
<path>/qemu/firmware/{control,replica-{1,2,3}}-vars.fd
<path>/qemu/run/{control,replica-{1,2,3}}.pid
<path>/qemu/logs/{control,replica-{1,2,3}}-console.log
<path>/qemu/ssh/id_ed25519
<path>/qemu/ssh/id_ed25519.pub

Current control hooks:

status queries each replica control socket from the control guest
isolate, heal, crash, and restart SSH into the target replica guest and reuse the existing local fault-control commands there
reboot reboots one target replica guest through the control node
export-replica and import-replica stream one replica workspace through the control guest so host-side tools can stage failover and rejoin rewrites without inventing a second recovery path
collect-logs gathers replica journalctl, cluster-faults.txt, cluster-timeline.log, and control-status snapshots under the control guest
the replica daemons also write structured startup, role/view transition, prepare/commit, and fail-closed rejection logs to /var/log/allocdb/replica-{1,2,3}.log; those files now also record successful prepare quorum formation, commit-broadcast acknowledgements, accepted protocol prepare/commit traffic, and expiration batch planning/application. Inspect them when debugging one failover or rejoin issue because the current archive path does not yet bundle them

Current limits:

the first QEMU testbed still depends on one supported cloud image already being available or downloadable on the host
verify-qemu-surface proves one metrics probe on every replica, one protocol reachability probe on every replica, and one primary submit/read round trip
capture-kubevirt-layout depends on one reachable kubeconfig plus one staged SSH key from the KubeVirt bootstrap path, and verify-kubevirt-surface and run-kubevirt auto-create one temporary helper pod when needed to bridge host orchestration into the KubeVirt guest network
run-qemu now executes the documented Jepsen control and nemesis runs, but it still drives one scripted fault-window loop rather than one independent-client Jepsen cluster
run-kubevirt currently uses the same scripted gate model and the same control-node allocdb-qemu-control surface as the QEMU path; it does not yet add one separate KubeVirt packet-level nemesis layer
run-kubevirt now writes one <run-id>-status.txt file, one <run-id>-events.log file, and one allocdb-jepsen-latest-status.txt pointer file under the chosen artifact root
faulted run-qemu and run-kubevirt runs now repeat whole scenario iterations until the configured minimum fault window is satisfied, while keeping one fresh request namespace per iteration and one monotonic Jepsen history sequence across the whole run
those faulted iterations now also heal isolation, restart stopped replicas, and wait for one primary plus two backups before the next scenario slice starts
during KubeVirt debugging, the authoritative per-replica transition trace is the guest-local /var/log/allocdb/replica-{1,2,3}.log file rather than the archived journal.log
watch-kubevirt --workspace <path> --output-root <artifacts> [--run-id <run-id>] [--follow] can stay open in another terminal and show the current phase, elapsed time, recent Jepsen events, and live replica health/metrics while the run is active; --follow keeps the dashboard open after completion so operators can watch repeated runs without restarting the command
watch-kubevirt-fleet --lane <name,workspace,output-root> ... [--follow] can aggregate multiple KubeVirt Jepsen lanes into one dashboard, so operators can watch one 3-lane release-gate batch without juggling one terminal per cluster
the stable homelab KubeVirt profile now uses longhorn-strict-local-wffc together with replica-only anti-affinity, so each 3-replica lane spreads one replica per host while the control VM can colocate where capacity allows
failover and rejoin now use host-side staged workspace rewrites, but those temp workspaces are lane-scoped on the host, so parallel KubeVirt lanes do not collide on shared temp paths during fetched/pushed replica archive handling
when debugging one failing faulted scenario, rebootstrap the affected lane first and then run the short repro with ALLOCDB_JEPSEN_FAULT_WINDOW_SECS_OVERRIDE; do not treat that shortened run as release-gate evidence

Single-Node Engine

Durable Files And Startup Path

The single-node alpha persists state in:

one append-only WAL file on the live path
one snapshot file written only when the host calls checkpoint

Recommended startup path:

provide the same core and engine bounds that were used to create the durable state
call SingleNodeEngine::recover(core_config, engine_config, snapshot_path, wal_path)
inspect get_metrics or metrics(current_wall_clock_slot) before admitting traffic

Why recover is the preferred restart path:

it tolerates a missing snapshot file
it tolerates an empty or missing WAL file
it reports which startup path actually happened through recovery metrics

Use SingleNodeEngine::open(...) only for an intentional fresh start when prior durable state is being discarded on purpose.

Startup States

The engine reports startup status through metrics.recovery.

`fresh_start`

Meaning:

no snapshot was loaded
no WAL frames were replayed

Expected cases:

open(...)
recover(...) against absent or empty durable state

`wal_only`

Meaning:

no snapshot was loaded
one or more WAL frames were replayed

Expected case:

restart from a live WAL without a snapshot

`snapshot_only`

Meaning:

a snapshot file was loaded
no later WAL frames were replayed

Current detail:

an empty durable snapshot still counts as snapshot_only
in that case loaded_snapshot_lsn remains none

`snapshot_and_wal`

Meaning:

a snapshot was loaded
one or more later WAL frames were replayed

Expected case:

restart after at least one successful checkpoint and at least one later committed write

Startup Checklist

After every process start or restart, verify:

accepting_writes = true
startup_kind matches the expected durable state
loaded_snapshot_lsn, replayed_wal_frame_count, and replayed_wal_last_lsn look plausible
active_snapshot_lsn matches the last successful checkpoint, if one was expected

If accepting_writes = false, do not admit traffic. The engine is in fail-closed mode and must be recovered first.

Normal Restart And Planned Checkpoint

For a normal planned restart:

stop new ingress in the host process
drain any queued work if the host uses enqueue_* plus process_next
call checkpoint(snapshot_path) if you want a fresh restart anchor and a shorter retained WAL
verify that metrics.recovery.active_snapshot_lsn advanced
stop the process
restart with recover(...)

Current checkpoint behavior:

checkpoint requires accepting_writes = true
checkpoint rejects a non-empty submission queue
checkpoint writes the new snapshot first
checkpoint then rewrites the WAL with one-checkpoint overlap
checkpoint appends a snapshot_marker at the active snapshot anchor in the retained WAL

Treat these checkpoint failures as operationally significant:

EngineHalted
QueueNotEmpty
WalNotClean
WalStateMismatch
snapshot or WAL file errors

Current checkpoint nuance:

snapshot write failure does not halt the live engine
WAL cleanliness/state failures and WAL rewrite failures can leave accepting_writes = false, which then requires recovery before traffic resumes

Do not claim a clean checkpoint if any of those occur.

Expiration Maintenance

The current alpha does not expire reservations in the background. The host process must call tick_expirations(current_wall_clock_slot) or send the matching API request.

Current tick behavior:

drain already-queued client submissions first
scan for overdue reserved reservations
sort them deterministically
truncate to max_expirations_per_tick
append internal expire commands to the WAL
sync the WAL
apply the expire commands through the normal executor path

Operational implications:

passing a reservation deadline does not free the resource by itself
the resource becomes reusable only after the internal expire command commits and applies
delayed expiration is acceptable; premature reuse is not
sustained non-zero expiration_backlog means the host is not ticking often enough or the current max_expirations_per_tick bound is too low for current load

Watch these metrics together:

core.logical_slot_lag
core.expiration_backlog
recovery.active_snapshot_lsn

Read Behavior

Strict reads are fail-fast in the current alpha.

If the caller supplies required_lsn, the engine does not wait. It checks the local in-memory state immediately and returns:

normal data when last_applied_lsn >= required_lsn
fence_not_applied(required_lsn, last_applied_lsn) when the fence has not been met yet
engine_halted when the node has entered fail-closed mode

Reservation lookups also distinguish:

found
not_found
retired

retired means the reservation existed but has already aged out of the bounded live-history window.

Overload And Bounded Failure Modes

The operator-facing distinction to preserve is:

submission-layer rejection before commit
committed allocator outcome after durable sequencing

Submission-Layer Rejections

submit and tick_expirations can return rejected(category, code).

Current definite failures:

invalid_request
command_too_large
overloaded

Current indefinite failures:

engine_halted
storage_failure

What overloaded means:

the bounded submission queue is full
the request was not sequenced
no WAL append happened for that request
the engine may still be otherwise healthy

What engine_halted or storage_failure means:

the node hit a WAL append or sync ambiguity
the write outcome is not safe to classify as definite
the host should stop admitting traffic and recover from durable state

Committed Capacity Outcomes

A committed response can still contain a bounded rejection in result_code. Current examples:

resource_busy
resource_table_full
reservation_table_full
expiration_index_full
operation_table_full
operation_conflict

These are not transport or storage failures. They are committed allocator outcomes and should be treated as definite business results.

Metrics To Watch Under Pressure

queue_depth and queue_capacity for ingress pressure
core.operation_table_utilization_pct for retry-window pressure before operation_table_full
core.logical_slot_lag and core.expiration_backlog for expiration maintenance lag

WAL Ambiguity And Fail-Closed Behavior

If the live engine hits a WAL append or sync failure while committing either a client command or an internal expiration command, it immediately enters fail-closed mode.

Current fail-closed behavior:

accepting_writes flips to false
later writes return engine_halted
strict reads also return engine_halted
the process must rebuild from durable state before it can serve traffic again

The code logs these events at error level with these exact signatures:

halting engine on WAL error, accepting_writes set to false: operation_id={} request_slot={} applied_lsn={} phase={} error={error:?}
halting engine on internal WAL error, accepting_writes set to false: request_slot={} applied_lsn={} phase={} error={error:?}

Ambiguous Write Retry Rule

Client retries after an indefinite write outcome must:

reuse the same operation_id
reuse the same payload
happen within the configured retry window

Current recovery behavior within the dedupe window:

if the failure happened before WAL append, the retry executes once after recovery
if the failure happened after WAL append but before sync returned, recovery may restore the committed write and the retry returns the cached result

Current limitation:

once the dedupe window has expired, the same operation_id is no longer guaranteed to resolve the earlier ambiguity

Recovery And Corruption Handling

recover(...) is the only supported way to clear fail-closed mode.

What Recovery Repairs Automatically

Recovery automatically truncates only one case:

an incomplete EOF torn tail

That means:

recovery scans the WAL to the last valid frame boundary
if the stop reason is torn_tail, it truncates the file to the valid prefix
it then replays the valid frames

What Recovery Treats As Fatal

Recovery fails closed and returns an error for:

invalid runtime configuration for the persisted state
snapshot file read or decode failure
semantically invalid decoded snapshot contents
middle-of-log WAL corruption such as checksum failure before EOF
checksum-valid WAL with non-monotonic lsn
checksum-valid WAL with rewound request_slot
WAL payload decode failure during replay

Current operational rule:

do not truncate or rewrite the WAL manually for any of those fatal cases
preserve the WAL file, snapshot file, config, and logs for offline investigation

Recovery Logs

Successful recovery logs one info line:

recovery complete: loaded_snapshot=... loaded_snapshot_lsn=... replayed_wal_frame_count=... replayed_wal_last_lsn=...

Failure logs use error level and distinguish:

recovery aborted due to invalid configuration
recovery failed while loading snapshot
recovery rejected semantically invalid snapshot
recovery failed while scanning wal
recovery rejected wal payload
recovery rejected wal replay ordering

Recommended Validation And Inspection Commands

Validate the repository path used by CI:

./scripts/preflight.sh

Re-run the core recovery coverage directly:

cargo test -p allocdb-core recovery -- --nocapture

Re-run node restart and fail-closed coverage directly:

cargo test -p allocdb-node recover_restores_state_and_retry_cache
cargo test -p allocdb-node halted_engine_rejects_reads_until_recovery
cargo test -p allocdb-node expiration_tick_post_append_failure_requires_recovery_for_expired_state
cargo test -p allocdb-node recovery_metrics_report_snapshot_and_wal_replay

Run the deterministic benchmark harness:

cargo run -p allocdb-bench -- --scenario all

Inspect the current recovery and halt log signatures in source:

rg -n "recovery complete|recovery aborted due to invalid configuration|recovery failed while loading snapshot|recovery rejected semantically invalid snapshot|recovery failed while scanning wal|recovery rejected wal payload|recovery rejected wal replay ordering|halting engine on WAL error|halting engine on internal WAL error" \
  crates/allocdb-core/src/recovery.rs \
  crates/allocdb-node/src/engine.rs

Inspect durable files on disk:

ls -lh /path/to/node.wal /path/to/node.snapshot
stat /path/to/node.wal /path/to/node.snapshot

Minimum Operator Checklist

start with recover(...), not open(...), for every normal restart
do not admit traffic until accepting_writes = true
drive tick_expirations from the host process
watch queue pressure, retry-window utilization, logical slot lag, and expiration backlog
treat engine_halted and storage_failure as indefinite outcomes
recover from durable state after any WAL-path ambiguity
allow automatic EOF torn-tail truncation, but fail closed on all other corruption classes

Scope​

Local Replicated Cluster Runner​

Kubernetes Replica Deployment​

Local QEMU Testbed​

Single-Node Engine​

Durable Files And Startup Path​

Startup States​

fresh_start​

wal_only​

snapshot_only​

snapshot_and_wal​

Startup Checklist​

Normal Restart And Planned Checkpoint​

Expiration Maintenance​

Read Behavior​

Overload And Bounded Failure Modes​

Submission-Layer Rejections​

Committed Capacity Outcomes​

Metrics To Watch Under Pressure​

WAL Ambiguity And Fail-Closed Behavior​

Ambiguous Write Retry Rule​

Recovery And Corruption Handling​

What Recovery Repairs Automatically​

What Recovery Treats As Fatal​

Recovery Logs​

Recommended Validation And Inspection Commands​

Minimum Operator Checklist​

Scope

Local Replicated Cluster Runner

Kubernetes Replica Deployment

Local QEMU Testbed

Single-Node Engine

Durable Files And Startup Path

Startup States

`fresh_start`

`wal_only`

`snapshot_only`

`snapshot_and_wal`

Startup Checklist

Normal Restart And Planned Checkpoint

Expiration Maintenance

Read Behavior

Overload And Bounded Failure Modes

Submission-Layer Rejections

Committed Capacity Outcomes

Metrics To Watch Under Pressure

WAL Ambiguity And Fail-Closed Behavior

Ambiguous Write Retry Rule

Recovery And Corruption Handling

What Recovery Repairs Automatically

What Recovery Treats As Fatal

Recovery Logs

Recommended Validation And Inspection Commands

Minimum Operator Checklist