A book of procedures. Incidents above the line — what to do when something breaks. Routines below — what to do on the days nothing's broken. Written calmly so it can be followed under pressure.
A book to grab when something goes wrong, and a checklist to follow on the days it doesn't. Procedural, not narrative. Written so that a tired person under pressure can follow it without having to work out afresh what the right thing to do is.
The runbook is procedural infrastructure. Its value comes from existing before it's needed. A procedure written calmly during a 40-minute window on a Wednesday afternoon is much better than the same procedure invented at 3am while a customer is complaining and the platform is down. This document captures the thinking now so the operator doesn't have to do it under pressure later.
Two categories of procedure live here. Incident procedures are bordered red and address things that go wrong. They follow a consistent shape — symptoms, severity, immediate actions, diagnosis, resolution, post-incident steps. Routine procedures are bordered green and address the operations that recur. Daily health checks, deployments, treasury sweeps, monthly reviews. Both are equally important; they just get used at different moments.
When something goes wrong, do not start by reading the relevant procedure carefully. Start by reading O-020 First Response Discipline, which captures the actions that come before any specific procedure. Stabilize, communicate, observe — then open the relevant runbook.
For routine operations, the procedure can be followed top to bottom without preamble. The numbered steps are designed to be executable in order without requiring inference about what comes next.
Not a list of every possible failure. Failures that haven't happened yet won't be in here. After any incident not covered, write a new procedure based on what was actually done. The runbook grows by accretion, not by upfront completeness.
Not a substitute for understanding the systems. The procedures assume the operator knows what x402, AO, HyperBEAM, and the audit relay are. They do not re-explain architectural concepts. Reference the A-series, M-series, or E-series blueprints when context is needed.
Not a deployment automation tool. The procedures describe what the operator does; many steps invoke tools (the audit relay, deployment scripts, AO message senders) that handle the actual mechanics. The runbook coordinates the operator's actions; it does not replace the tools.
What every incident response begins with, regardless of which procedure applies afterward. Three minutes of discipline buys the rest of the response.
Whatever just happened, taking 30 seconds before doing anything makes the next decision better. Bad responses come from panic. Most incidents are not as time-critical as they feel in the first minute. Even a SEV-1 generally tolerates 30 seconds of operator composure before the response begins.
Use the SEV-1 / SEV-2 / SEV-3 framework on the cover sheet. The severity determines the response cadence. Wrong-severity assessments are common and costly in both directions: SEV-1 treated as SEV-3 leaves customers exposed; SEV-3 treated as SEV-1 burns operator time on something that could have waited until morning.
Test for severity by asking: is a customer being harmed right now? Is money or identity at risk? Are downstream systems making decisions based on bad output from Paxiom? Yes to any → SEV-1. Service degraded but no immediate harm → SEV-2. Internal-only issue → SEV-3.
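The severity test above can be written down as a small decision function. This is a minimal sketch, not part of any existing Paxiom tooling; the `Triage` struct, its field names, and `classify` are hypothetical names introduced for illustration.

```rust
/// Severity levels from the cover-sheet framework.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    Sev1,
    Sev2,
    Sev3,
}

/// Answers to the triage questions. Field names are illustrative assumptions.
#[derive(Debug, Clone, Copy, Default)]
pub struct Triage {
    pub customer_harmed_now: bool,
    pub money_or_identity_at_risk: bool,
    pub downstream_using_bad_output: bool,
    pub service_degraded: bool,
}

/// Yes to any of the three harm questions is SEV-1. Degraded service
/// without immediate harm is SEV-2. Everything else is treated as SEV-3.
pub fn classify(t: Triage) -> Severity {
    if t.customer_harmed_now || t.money_or_identity_at_risk || t.downstream_using_bad_output {
        Severity::Sev1
    } else if t.service_degraded {
        Severity::Sev2
    } else {
        Severity::Sev3
    }
}
```

The harm questions are checked first on purpose: a degraded service that is also harming a customer is SEV-1, not SEV-2.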
For SEV-1 specifically: the priority is stopping the harm, not understanding the cause. If a service is producing wrong outputs, take it offline before investigating. If the settlement wallet is leaking, freeze it before forensics. If a key may be compromised, rotate before tracing. Diagnosis happens after the bleeding stops.
This feels backwards to engineers who want to understand before acting. The instinct is wrong for incident response. Customers can wait an hour for a service that's offline; they cannot recover from an hour of bad outputs that they relied on.
Open the specific runbook entry. Read it through once before starting. Note any preconditions or assumptions. Then execute step by step. Don't skip steps even if they seem unnecessary. The procedure is written to work for the tired operator at 3am; what seems unnecessary at 2pm may be load-bearing in the actual incident.
Keep a scratch file open during incident response. Note timestamps, observations, actions taken, and anything surprising. This file becomes the source for the post-incident write-up (O-900). Without it, the write-up reconstructs from memory and misses the details that mattered.
Format does not matter. Plain text in a terminal, notes in vim, scrawled on paper if necessary. The discipline of writing while acting matters more than the format.
During incident response, do not clean up logs, restart processes, delete files, or make any change that destroys evidence of what happened. Stabilizing the service does not require destroying the evidence trail.
Take the service offline by setting it to refuse new requests. Do not rm -rf the working directory. Capture log files before any restart that would rotate them. The post-incident review depends on having the evidence; an incident that fixes itself without leaving evidence is an incident that will recur unexplained.
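Capturing logs before a restart can be as simple as copying them aside. The sketch below assumes the logs are regular files in a single directory; `preserve_logs` and both directory arguments are hypothetical, and the real log locations are not stated in this runbook.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Copy every regular file in `log_dir` into `evidence_dir` and return how
/// many files were copied. Originals are left untouched, so nothing is
/// destroyed if the copy is interrupted.
pub fn preserve_logs(log_dir: &Path, evidence_dir: &Path) -> io::Result<usize> {
    fs::create_dir_all(evidence_dir)?;
    let mut copied = 0;
    for entry in fs::read_dir(log_dir)? {
        let entry = entry?;
        // Subdirectories are skipped in this sketch; only plain files are copied.
        if entry.file_type()?.is_file() {
            fs::copy(entry.path(), evidence_dir.join(entry.file_name()))?;
            copied += 1;
        }
    }
    Ok(copied)
}
```

Choose an evidence directory outside the service's working directory so a later cleanup of the working directory cannot take the copies with it.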
relay status, check .relay/queue.md, check the production server's process status if accessible.
ps aux | grep paxiom or equivalent.
df -h, free -m, ulimit -n. Resource exhaustion is the most common cause of "service stopped responding."
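For the disk part of the resource check, the output of df -h can be scanned mechanically. A minimal sketch follows, assuming the usual six-column layout with the use percentage in the fifth column and the mount point in the sixth; `nearly_full` and the threshold argument are hypothetical, and mount points containing spaces are not handled.

```rust
/// Return (mount point, use %) for every filesystem at or above
/// `threshold_pct` in captured `df -h` output. The header row is skipped
/// and rows that do not match the expected layout are ignored.
pub fn nearly_full(df_output: &str, threshold_pct: u8) -> Vec<(String, u8)> {
    df_output
        .lines()
        .skip(1)
        .filter_map(|line| {
            let cols: Vec<&str> = line.split_whitespace().collect();
            // Fifth column looks like "97%"; strip the sign and parse the number.
            let pct: u8 = cols.get(4)?.strip_suffix('%')?.parse().ok()?;
            let mount = cols.get(5)?;
            (pct >= threshold_pct).then(|| ((*mount).to_string(), pct))
        })
        .collect()
}
```

What counts as "nearly full" is a judgment call; 90 is a common starting point, not a value taken from this runbook.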
Write the incident summary per O-900. If the cause was a class of problem not covered by an existing procedure, write a new procedure. If the cause was covered but the procedure was incomplete or misleading, revise it.
Customers who paid for failed proofs get refunded. Customers who relied on wrong outputs get notified. Communication is direct and honest about what happened. Hiding incidents that affected customers damages trust more than the incidents themselves.
aos CLI to send a no-op message. Confirm whether the issue is process-specific or network-wide.
Customers who received signatures during the compromise window need to know which signatures may be untrustworthy. The platform's audit trail on AO/Arweave shows every signature; customers can verify independently which were legitimate and which weren't.
Communication is direct, factual, and timely. Customers tolerate incidents handled transparently; they do not tolerate incidents that were hidden and only disclosed later.
Once new keys are operational, the recovery is not complete until the entry-point vulnerability is fixed. Otherwise the new keys are compromised by the same vector that compromised the old ones. Address the vulnerability before bringing the platform fully back online.
Disputes are public information once they enter discovery layers and reputation systems. Handle them carefully. A correctly resolved dispute (proof was right, explained patiently, customer satisfied) is reputation-positive. A poorly handled dispute (defensive, slow, hidden) is reputation-negative regardless of who was technically correct.
The hot-tier settlement wallet (K-002) is exposed to internet attack surface and is the most likely compromise target. Keeping its balance bounded means a successful compromise loses operational balance, not platform reserves. The sweep maintains this property by moving funds to the cold-tier treasury wallet (K-006) on a defined cadence.
This is not a treasury disbursement. Sweeping funds into K-006 is routine and uses the hot-tier signing of K-002. Disbursing funds out of K-006 (paying expenses, transferring to personal accounts, paying down obligations) requires cold-tier ceremony per E-301. Different procedure, different security posture.
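The bounded-balance property can be illustrated with a small planning function. This is a sketch under stated assumptions: amounts are in the currency's smallest unit, and `operating_floor` (the balance left in K-002 for day-to-day settlement) is a hypothetical parameter; the runbook does not state an actual floor or cap.

```rust
/// Result of planning one sweep from K-002 (hot) to K-006 (treasury).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SweepPlan {
    pub to_treasury: u64,
    pub remaining_hot: u64,
}

/// Move everything above `operating_floor` out of the hot wallet.
/// If the balance is already at or below the floor, nothing moves.
pub fn plan_sweep(hot_balance: u64, operating_floor: u64) -> SweepPlan {
    let to_treasury = hot_balance.saturating_sub(operating_floor);
    SweepPlan {
        to_treasury,
        remaining_hot: hot_balance - to_treasury,
    }
}
```

After a planned sweep the hot wallet never holds more than the floor, which is the amount a successful compromise could reach between sweeps, plus whatever accrues before the next one.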
Daily cadence catches issues while they're small. A wallet balance that's off by 10% is a bookkeeping mistake; the same gap discovered after a month is a compliance problem. A service that's slow today is a routine investigation; the same slowness that's been happening for three weeks is reputation damage.
The five-minute commitment keeps the ritual sustainable. If it grows to 30 minutes regularly, something is off — either too many fires, or the check is being padded with non-essential work.
Monthly review produces a written summary preserved in the build journal. Decisions made are noted. Adjustments to subsequent months are noted. Concerns surfaced are noted with planned response.
config/services/A-2NN.yaml.
The cryptographic primitive at the bottom of Service 02 — Ethereum sync committee signature verification. One C-FFI function exported from a small Rust crate, compiled to either a native cdylib or a wasm32 module for the HyperBEAM device. Verifies one thing well; the layers above it do the rest.
Canonical source lives in the private repository
k-luecke/bls-verifier on GitHub. Sources only — the
libbls_verifier.so cdylib and the bls-test
integration binary are reproducible from those sources via
cargo build --release and are deliberately not committed.
Add target/, Cargo.lock, *.so,
and *.dylib to .gitignore.
The repository is structured as a Cargo workspace with three members:
bls-verifier (the cdylib primitive — this sheet's subject),
bls-test (a tokio binary that fetches a current sync
committee from Lodestar mainnet and verifies it end-to-end), and
bls-verify-cli (a stdin-JSON wrapper for ad-hoc verification;
its JSON schema is throwaway and will be replaced by the HyperBEAM
device interface).
Until k-luecke/bls-verifier exists, the canonical
source is the tarball bls-verifier-scaffold.tar.gz
produced by the 2026-05-01 scaffold session. The tarball was
created in an ephemeral sandbox and must be pulled to durable
storage before the sandbox is reclaimed. Pull destination:
operator's laptop, then push to a freshly-created
k-luecke/bls-verifier private repo. Update this sheet
to remove the interim notice once that repo exists.
A single C-ABI function, verify_sync_committee, takes a flat
buffer of 48-byte participating BLS pubkeys, a 96-byte aggregate
signature, and a 32-byte signing root. It parses each pubkey, aggregates
them with subgroup checking, and verifies the signature against the
aggregate using the BLS POP DST. Return codes are documented inline in
bls-verifier/src/lib.rs and reproduced below as the
human-readable reference.
Inputs.
pubkeys_ptr / pubkeys_len: concatenated 48-byte participating pubkeys
sig_ptr: 96-byte aggregate signature
signing_root_ptr: 32-byte signing root (caller computes domain + root)
Caller is responsible for.
fork_version is fork-dependent; fetch dynamically)
sha256(parent_root || domain)
Return codes.
 1  signature verified
 0  signature invalid
-1  signature parse failed (not a valid 96-byte G2 point)
-2  no pubkeys provided (pubkeys_len == 0)
-3  aggregation failed (subgroup check or internal blst error)
-4  malformed pubkey chunk (any 48-byte slice that is not a valid G1 point)
Single source of truth is the doc-comment block in
lib.rs. If the codes ever change, both the source
and this sheet must be updated together.
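On the caller side, the return codes listed above can be decoded into a typed value so that only one outcome is ever treated as success. The sketch assumes the C function returns a 32-bit signed integer; `VerifyOutcome` and `decode` are hypothetical names and are not claimed to exist in lib.rs.

```rust
/// Typed view of the documented return codes of verify_sync_committee.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyOutcome {
    Verified,             //  1
    Invalid,              //  0
    SignatureParseFailed, // -1
    NoPubkeys,            // -2
    AggregationFailed,    // -3
    MalformedPubkey,      // -4
    /// Any code this sheet does not document. Treated as a failure.
    Unknown(i32),
}

pub fn decode(code: i32) -> VerifyOutcome {
    match code {
        1 => VerifyOutcome::Verified,
        0 => VerifyOutcome::Invalid,
        -1 => VerifyOutcome::SignatureParseFailed,
        -2 => VerifyOutcome::NoPubkeys,
        -3 => VerifyOutcome::AggregationFailed,
        -4 => VerifyOutcome::MalformedPubkey,
        other => VerifyOutcome::Unknown(other),
    }
}

impl VerifyOutcome {
    /// Only an explicit 1 counts as success; every other code fails closed.
    pub fn is_verified(self) -> bool {
        self == VerifyOutcome::Verified
    }
}
```

Failing closed on unknown codes means that a future change to the codes in lib.rs cannot silently turn a new error into an accepted signature.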
verify_sync_committee verifies one thing — that an
aggregated BLS signature over a given signing root is valid against
an aggregated pubkey. It does not, by itself, constitute a sync
committee verifier. A future engineer or future-Claude looking at
lib.rs in isolation may mistake it for the whole story
and wire it into production with assumptions the primitive does
not satisfy. The primitive specifically does not:
fork_version and genesis_validators_root
parent_root and the domain
The HyperBEAM device that wraps this primitive does all of the above. Calling the primitive directly without a wrapper that supplies these preconditions can report a successful verification against the wrong inputs, which is exactly the failure mode this runbook entry exists to prevent.
Three categories of failure surface in different ways and require different remediations. Distinguishing them at first observation saves time during incident response.
Network outbound to lodestar-mainnet.chainsafe.io denied at
the host policy layer. Surfaces as a JSON decode error at the very first
beacon API call — typically
reqwest::Error { kind: Decode, "expected value", line 1 column 1 }
because the response body is plain-text "Host not in allowlist" rather
than JSON. Direct curl -i against the endpoint confirms
the deny by returning HTTP/2 403 with header
x-deny-reason: host_not_allowed.
Remediation is environmental, not code. Run the verifier in an environment with mainnet beacon access (operator laptop, RunPod, or HyperBEAM node). Not a code defect; do not file an incident.
Lodestar (or whichever beacon endpoint is in use) goes down, returns 5xx, gets rate-limited, returns malformed JSON, or times out under load. Symptomatically similar to sandbox-class failures from the caller's perspective — JSON decode failures, connection resets — but the remediation is different: failover to an alternate beacon endpoint rather than relocating the runtime.
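Because the two classes look alike from the caller's side, a direct probe of the endpoint is the quickest way to tell them apart. The sketch below encodes that check using the 403 status and x-deny-reason header described for the sandbox class; `FailureClass` and `classify_failure` are hypothetical names, and the function is meant to be called only after a beacon call has already failed.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FailureClass {
    /// Host policy denied the outbound call. Relocate the runtime.
    Sandbox,
    /// The beacon endpoint itself is unhealthy. Fail over to another provider.
    Infrastructure,
}

/// `status` is the HTTP status from a direct probe (None if the connection
/// failed or timed out). `deny_reason` is the value of the x-deny-reason
/// response header, if one was present.
pub fn classify_failure(status: Option<u16>, deny_reason: Option<&str>) -> FailureClass {
    match (status, deny_reason) {
        (Some(403), Some("host_not_allowed")) => FailureClass::Sandbox,
        _ => FailureClass::Infrastructure,
    }
}
```

A bare 403 without the deny header is deliberately left in the infrastructure class here, since a provider can also return 403 when rate-limiting or blocking a client.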
The integration test bls-test currently hardcodes a single
Chainsafe endpoint, which is acceptable for a scaffold. Production
HyperBEAM-device implementations must support beacon
endpoint failover with at least two and ideally three independent
providers. A sync committee verifier that depends on a single beacon
endpoint inherits that endpoint's availability as a hard dependency.
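Endpoint failover can be kept independent of the HTTP client by passing the fetch step in as a closure. This is a minimal sketch of that shape, not the device's implementation; `fetch_with_failover` is a hypothetical name, and retry timing, backoff, and response validation are left out.

```rust
/// Try each endpoint in order and return the first successful result.
/// If every endpoint fails, return all errors paired with the endpoint
/// that produced them, so the scratch file can record what was tried.
pub fn fetch_with_failover<T, E, F>(
    endpoints: &[&str],
    mut fetch: F,
) -> Result<T, Vec<(String, E)>>
where
    F: FnMut(&str) -> Result<T, E>,
{
    let mut errors = Vec::new();
    for &endpoint in endpoints {
        match fetch(endpoint) {
            Ok(value) => return Ok(value),
            Err(e) => errors.push((endpoint.to_string(), e)),
        }
    }
    Err(errors)
}
```

In use, `endpoints` would hold the two or three independent providers and `fetch` would wrap the real beacon API call. Keeping the per-endpoint errors matters during an incident: "all three providers timed out" and "one returned malformed JSON" point to different causes.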
At the next mainnet fork (Glamsterdam),
the current_version value returned by the beacon API
/eth/v1/beacon/states/{slot}/fork endpoint will change from
the present 0x06000000 (Fulu, active since 2025-12-03) to
the next allocated value. As long as the verifier fetches fork version
dynamically — which the post-fix bls-test does and which
the production HyperBEAM device must — this is not a
regression. The signing root will recompute correctly with the new
domain and verification will continue to succeed.
A future operator observing the fork version value change in production logs may mistake this for a bug. It is not. The primitive itself is fork-version-agnostic; the wrapper supplies the value. Verify that the wrapper is fetching the value dynamically (rather than hardcoding) before treating any fork-boundary value change as an incident.
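A wrapper that fetches the value dynamically has to parse the hex string the beacon API returns. The sketch below shows strict parsing of a value such as 0x06000000 into four bytes, plus a comparison that reports a change without treating it as an error; `parse_fork_version` and `ForkCheck` are hypothetical names.

```rust
/// Parse a fork version string such as "0x06000000" into its four bytes.
/// Returns None unless the input is "0x" followed by exactly 8 hex digits.
pub fn parse_fork_version(hex: &str) -> Option<[u8; 4]> {
    let digits = hex.strip_prefix("0x")?;
    if digits.len() != 8 || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut version = [0u8; 4];
    for (i, byte) in version.iter_mut().enumerate() {
        *byte = u8::from_str_radix(&digits[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(version)
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ForkCheck {
    Unchanged,
    /// Expected at a fork boundary. Worth a log line, not an incident.
    Changed { from: [u8; 4], to: [u8; 4] },
}

pub fn check_fork(last_seen: [u8; 4], fetched: [u8; 4]) -> ForkCheck {
    if last_seen == fetched {
        ForkCheck::Unchanged
    } else {
        ForkCheck::Changed { from: last_seen, to: fetched }
    }
}
```

A parse failure is the case that deserves attention: it means the endpoint returned something that is not a fork version at all, which belongs to the infrastructure class rather than the fork-boundary class.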
The HyperBEAM device that wraps this primitive is the production
verifier and is the proper subject of Service 02 in the A-series
blueprint. The device handles the participation-bit filtering, domain
computation, signing-root computation, fork-version fetching, beacon
endpoint failover, AO compliance hook, and x402 facilitator integration.
It is the layer that becomes a public service; the primitive in
k-luecke/bls-verifier is one component of it.
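Of the device's responsibilities, participation-bit filtering is the one that produces the flat pubkey buffer the primitive expects. A minimal sketch follows, assuming the participation bits arrive as raw SSZ bitvector bytes (bit i is stored in byte i / 8 at bit position i % 8, least significant bit first); `participating_pubkeys` is a hypothetical name, not the device's actual function.

```rust
pub const PUBKEY_LEN: usize = 48;

/// Concatenate the pubkeys whose participation bit is set, in committee
/// order, into one flat buffer. Returns None if `bits` is too short to
/// cover every committee member.
pub fn participating_pubkeys(
    committee: &[[u8; PUBKEY_LEN]],
    bits: &[u8],
) -> Option<Vec<u8>> {
    if bits.len() * 8 < committee.len() {
        return None;
    }
    let mut flat = Vec::new();
    for (i, pubkey) in committee.iter().enumerate() {
        // SSZ bit order: member i maps to byte i / 8, bit i % 8 (LSB first).
        if (bits[i / 8] >> (i % 8)) & 1 == 1 {
            flat.extend_from_slice(pubkey);
        }
    }
    Some(flat)
}
```

An empty result is legal output here; it corresponds to the primitive's -2 return code (no pubkeys provided), so the wrapper can reject it before crossing the FFI boundary.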
When the HyperBEAM device exists, it should have its own runbook entry
in this series — likely O-701 — documenting the
device-level operations (build, deploy, monitor) and pointing back to
this sheet for the primitive contract. Subsequent verification
primitives (audit relay signatures, identity signing keys) extend the
same series: O-720, O-730, and so on.
Status update (2026-05-02). O-701 exists as
a sketch and a scaffolded implementation. The harness lives in
k-luecke/bls-verifier's bls-device/ crate;
the production runbook is
O-702.
Operator commands for the local bring-up of the device, plus the matching
HTTP front-end, are documented in
paxiom/docs/hyperbeam-bringup.md
and
paxiom/docs/sync-committee-service.md.
The Phase 0 substrate gate audit trail —
PXM-G-100 — names what counts
as "closed" for each gate.
The runbook grows by use. Every incident not covered should add a procedure. Every procedure that worked badly should be revised. The work after the incident is as important as the work during it.
While the incident is fresh, capture what happened. Format:
Update the runbook. New procedures for incident classes that weren't covered. Revisions to procedures that worked badly. Add to the risk register if the incident exposed a previously unidentified failure mode.
Build journal entry that links to the post-incident summary. The journal's narrative gives context; the summary provides detail.
Close the action items. If the incident exposed a class of problem that needs systemic response (architectural change, monitoring upgrade, key rotation, dependency replacement), schedule the work and execute it. Open action items beyond 30 days are a sign that the lesson hasn't been absorbed.
Post-incident work is unglamorous and easily skipped. The fire is out; moving on feels good. But incidents that aren't documented recur unexplained, and procedures that aren't updated stay subtly broken.
The platform's compounding advantage in operational reliability comes from this discipline specifically. Every incident handled well is a reduction in the probability of similar incidents in the future. Every procedure written calmly is a 3am decision that doesn't have to be made under pressure. The runbook itself is infrastructure that earns its keep across the operational lifetime of the platform.
A-series Phase 1 Blueprint — service specifications and build context that incident procedures reference.
M-series Audit Relay Blueprint — tooling that several incident procedures invoke.
E-series Key Custody & Identity — referenced by O-105 (key compromise) and O-202 (treasury sweep).
Build Journal — narrative record of platform history. Post-incident summaries link from journal entries.
Rev. A — 2026.04.29. Initial issue. Eight incident procedures (O-101 through O-108). Five routine procedures (O-201 through O-205). First response discipline (O-020) and post-incident discipline (O-900). Severity framework (SEV-1, SEV-2, SEV-3) established.
Subsequent revisions will add procedures as incidents reveal gaps. The runbook is expected to grow by 1-3 procedures per quarter during active operation. Rev. B expected after first SEV-1 incident.
This document exists to be opened in moments when calm thinking is hard. Every word in it was written assuming the reader is tired, stressed, and possibly facing something unfamiliar. The procedures don't make the operator brilliant; they let an ordinarily competent operator respond well even when conditions are bad.
Use it. Update it. Keep it close.
Build the engine first. The moats come later. The runbook is what keeps both running.