Paxiom · Operations · Drawing Set

Operations Runbook

Procedures for the day things go wrong, and the days they go right.

Project No.: PXM-001-O
Issue: Operations / Rev. A
Date: 2026.04.29
Drawn By: K. Luecke

Sheet O-000 · Cover & Index

The runbook. For the bad days, and the good ones.

A book of procedures. Incidents above the line — what to do when something breaks. Routines below — what to do on the days nothing's broken. Written calmly so it can be followed under pressure.

Audience

The operator at 3am, or whoever is conscious

Tone

Calm imperatives: no jargon, no assumptions

Coverage

Phase 1 services; expand as the platform grows

Companion Docs

A · M · E series, cross-referenced where relevant

Severity Triage

Three levels: SEV-1, SEV-2, SEV-3

Update Cadence

After every incident, even small ones

SEV-1 · Critical

Customer-facing service is down or producing wrong outputs. Money or identity at risk. Drop everything else. Response in minutes.

SEV-2 · Significant

Service is degraded but functional. Revenue impact is bounded. Customer experience is compromised. Response in hours.

SEV-3 · Routine

Operational issue without immediate customer impact. Investigation and fix can wait for normal working hours.

Front Matter

O-000 · Cover, Index & Severity Key: this sheet

O-010 · Project Narrative: what this is, how to use it

O-020 · First Response Discipline: what to do before opening any procedure

Incident Procedures

O-101 · Service Endpoint Down: x402 endpoint not responding

O-102 · Proof Generation Failing: verification produces wrong or missing outputs

O-103 · Facilitator Misbehaving: x402 settlement fails or stalls

O-104 · AO Process Unresponsive: compliance log not accepting writes

O-105 · Hot Key Compromise Suspected: K-001 or K-002 may be exposed

O-106 · Customer Dispute About a Proof: customer claims output is wrong

O-107 · Anomalous Traffic Pattern: unexpected volume or pattern

O-108 · Production Server Inaccessible: can't SSH, can't deploy, can't reach

Routine Procedures

O-201 · Service Deployment: shipping a code change

O-202 · Treasury Sweep: moving operational funds to cold storage

O-203 · Daily Health Check: five-minute morning ritual

O-204 · Monthly Review: reputation, revenue, capacity

O-205 · Adding a New Service: ramp from prototype to public

Verification Primitives

O-700 · BLS Sync Committee Primitive: cryptographic core of Service 02

Back Matter

O-900 · Post-Incident Discipline: writing the runbook entry that wasn't there

O-010 · Project Narrative

What this is.

A book to grab when something goes wrong, and a checklist to follow on the days it doesn't. Procedural, not narrative. Written so that a tired person under pressure can follow it without having to work out afresh what the right thing to do is.

The runbook is procedural infrastructure. Its value comes from existing before it's needed. A procedure written calmly during a 40-minute window on a Wednesday afternoon is much better than the same procedure invented at 3am while a customer is complaining and the platform is down. This document captures the thinking now so the operator doesn't have to do it under pressure later.

Two categories of procedure live here. Incident procedures are bordered red and address things that go wrong. They follow a consistent shape — symptoms, severity, immediate actions, diagnosis, resolution, post-incident steps. Routine procedures are bordered green and address the operations that recur. Daily health checks, deployments, treasury sweeps, monthly reviews. Both are equally important; they just get used at different moments.

How to use this document

When something goes wrong, do not start by reading the relevant procedure carefully. Start by reading O-020 First Response Discipline, which captures the actions that come before any specific procedure. Stabilize, communicate, observe — then open the relevant runbook.

For routine operations, the procedure can be followed top to bottom without preamble. The numbered steps are designed to be executable in order without requiring inference about what comes next.

What this document is not

Not a list of every possible failure. Failures that haven't happened yet won't be in here. After any incident not covered, write a new procedure based on what was actually done. The runbook grows by accretion, not by upfront completeness.

Not a substitute for understanding the systems. The procedures assume the operator knows what x402, AO, HyperBEAM, and the audit relay are. They do not re-explain architectural concepts. Reference the A-series, M-series, or E-series blueprints when context is needed.

Not a deployment automation tool. The procedures describe what the operator does; many steps invoke tools (the audit relay, deployment scripts, AO message senders) that handle the actual mechanics. The runbook coordinates the operator's actions; it does not replace the tools.

O-020 · First Response Discipline

Before the procedure, the discipline.

What every incident response begins with, regardless of which procedure applies afterward. Three minutes of discipline buys the rest of the response.

Step Zero — Breathe

Whatever just happened, taking 30 seconds before doing anything makes the next decision better. Bad responses come from panic. Most incidents are not as time-critical as they feel in the first minute. Even a SEV-1 generally tolerates 30 seconds of operator composure before the response begins.

Step One — Determine severity

Use the SEV-1 / SEV-2 / SEV-3 framework on the cover sheet. The severity determines the response cadence. Wrong-severity assessments are common and costly in both directions: SEV-1 treated as SEV-3 leaves customers exposed; SEV-3 treated as SEV-1 burns operator time on something that could have waited until morning.

Test for severity by asking: is a customer being harmed right now? Is money or identity at risk? Are downstream systems making decisions based on bad output from Paxiom? Yes to any → SEV-1. Service degraded but no immediate harm → SEV-2. Internal-only issue → SEV-3.
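The three questions above can be written down as a small decision function. This is a minimal sketch, not part of the platform's tooling; the function and parameter names are hypothetical and only mirror the wording of the test.

```python
from enum import Enum


class Severity(Enum):
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


def triage(customer_harmed: bool,
           money_or_identity_at_risk: bool,
           downstream_using_bad_output: bool,
           service_degraded: bool) -> Severity:
    """Map the first-response questions to a severity level (hypothetical helper)."""
    # Yes to any of the three harm questions means SEV-1.
    if customer_harmed or money_or_identity_at_risk or downstream_using_bad_output:
        return Severity.SEV1
    # Degraded but no immediate harm means SEV-2.
    if service_degraded:
        return Severity.SEV2
    # Everything else is treated as an internal-only issue.
    return Severity.SEV3
```

The ordering matters: the harm questions are checked first, so a degraded service that is also producing bad output is never downgraded to SEV-2.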

Step Two — Stabilize before diagnosing

For SEV-1 specifically: the priority is stopping the harm, not understanding the cause. If a service is producing wrong outputs, take it offline before investigating. If the settlement wallet is leaking, freeze it before forensics. If a key may be compromised, rotate before tracing. Diagnosis happens after the bleeding stops.

This feels backwards to engineers who want to understand before acting. The instinct is wrong for incident response. Customers can wait an hour for a service that's offline; they cannot recover from an hour of bad outputs that they relied on.

Step Three — Open the relevant procedure

Open the specific runbook entry. Read it through once before starting. Note any preconditions or assumptions. Then execute step by step. Don't skip steps even if they seem unnecessary. The procedure is written to work for the tired operator at 3am; what seems unnecessary at 2pm may be load-bearing in the actual incident.

Step Four — Document while you work

Keep a scratch file open during incident response. Note timestamps, observations, actions taken, and anything surprising. This file becomes the source for the post-incident write-up (O-900). Without it, the write-up reconstructs from memory and misses the details that mattered.

Format does not matter. Plain text in a terminal, notes in vim, scrawled on paper if necessary. The discipline of writing while acting matters more than the format.
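If a terminal is already open, a helper along these lines keeps the scratch file timestamped without extra effort. It is an illustrative sketch only; the function name and file layout are assumptions, and any plain-text method works equally well.

```python
from datetime import datetime, timezone
from pathlib import Path


def note(scratch_file: Path, text: str) -> str:
    """Append one UTC-timestamped line to the incident scratch file."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    line = f"{stamp}  {text}"
    # Append mode: earlier notes are never overwritten.
    with scratch_file.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

For example, `note(Path("incident.txt"), "endpoint returning 502")` adds one line and returns it, so it can also be pasted into a status update.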

CRITICAL · Do not delete evidence · O-020 / W.01

During incident response, do not clean up logs, restart processes, delete files, or make any change that destroys evidence of what happened. Stabilizing the service does not require destroying the evidence trail.

Take the service offline by setting it to refuse new requests. Do not rm -rf the working directory. Capture log files before any restart that would rotate them. The post-incident review depends on having the evidence; an incident that fixes itself without leaving evidence is an incident that will recur unexplained.
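Capturing logs before a restart can be as simple as copying them into a timestamped directory. The sketch below assumes the operator already knows which log paths matter; the function name, directory naming, and return shape are hypothetical.

```python
import shutil
from datetime import datetime, timezone
from pathlib import Path


def capture_evidence(log_paths, dest_root: Path):
    """Copy log files into a new timestamped directory; originals are untouched.

    Returns (destination directory, list of paths that were not found).
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    dest = dest_root / f"incident-{stamp}"
    dest.mkdir(parents=True, exist_ok=True)
    missing = []
    for src in map(Path, log_paths):
        if src.is_file():
            # copy2 preserves modification times, which helps timeline work.
            # Files with the same name from different directories would collide.
            shutil.copy2(src, dest / src.name)
        else:
            missing.append(src)
    return dest, missing
```

The helper only copies; it never moves or deletes, which keeps it consistent with the rule above.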

O-101 · Incident Procedure

Service Endpoint Down.

O-101

x402 endpoint not responding

SEV-1 if customer-facing

Symptoms

HTTP 5xx, timeouts, customer reports of failed calls

Severity

SEV-1 (in production); SEV-3 (in dev)

Response Time

Within minutes

Immediate actions

Confirm scope. Is one service down, all services down, or only some requests failing? One service: probably service-specific issue. All services: infrastructure issue. Selective: probably routing or rate-limit related.

Check status indicators. Run relay status, check .relay/queue.md, check the production server's process status if accessible.

Confirm not a network blip. Try the endpoint from a different network. If reachable from elsewhere, the issue may be local-only.

Check facilitator status. If the issue correlates with facilitator outages, jump to O-103.
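The scope question in the first action can be answered mechanically from a handful of probe results. A minimal sketch, assuming each service has been probed a few times and the outcomes recorded as booleans; the function name and result strings are hypothetical.

```python
def classify_scope(results: dict) -> str:
    """results maps service name -> list of recent probe outcomes (True = success)."""
    if not results:
        return "no-data"
    # A service counts as down only if it was probed and every probe failed.
    fully_down = [name for name, probes in results.items() if probes and not any(probes)]
    # A service is flaky if some probes succeeded and some failed.
    flaky = [name for name, probes in results.items() if any(probes) and not all(probes)]
    if len(fully_down) == len(results):
        return "all-down: suspect infrastructure"
    if fully_down:
        return "some-down: suspect a service-specific issue"
    if flaky:
        return "selective: suspect routing or rate limiting"
    return "healthy"
```

The result is a starting hypothesis for diagnosis, not a conclusion.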

Diagnosis

SSH to production server. Confirm the service process is running. ps aux | grep paxiom or equivalent.

Check service logs. Last 5 minutes of output. Look for panics, exception traces, repeated errors.

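Check resource exhaustion. Memory, disk, file descriptors. df -h, free -m, ulimit -n. Note that ulimit -n reports the descriptor limit for the current shell, not the service's usage; compare it with the service's open descriptor count (on Linux, the number of entries under /proc/<pid>/fd). Resource exhaustion is the most common cause of "service stopped responding."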

Check upstream dependencies. Beacon endpoint, archive node, Load Network/Arweave evidence path, AO process, RunPod GPU availability. Service can appear "down" when an upstream dependency is the actual problem.

Resolution paths

Process crashed: capture core dump if available, restart process, monitor for repeat. If restart succeeds, schedule SEV-3 follow-up to investigate cause.

Resource exhausted: free resources (kill leaked processes, expand disk, raise ulimits), restart service, schedule SEV-2 to address resource leak.

Upstream dependency down: set service to return 503 with explanatory message rather than timing out. Wait for upstream recovery. Update status page if one exists.

Cause unknown after 15 minutes: escalate by setting all customer-facing endpoints to maintenance mode. Then continue investigation without time pressure.
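Returning 503 with an explanation, rather than timing out, can be done with a very small switch in front of the service. The following is a sketch under the assumption that the service can be fronted by a WSGI callable; the flag, message, and Retry-After value are illustrative, not the platform's actual configuration.

```python
import json

# Hypothetical in-process switch; a real deployment might read a file or env var.
MAINTENANCE = {"enabled": False, "message": "Upstream dependency unavailable."}


def app(environ, start_response):
    """WSGI app that answers 503 with an explanatory body while in maintenance."""
    if MAINTENANCE["enabled"]:
        body = json.dumps({"error": "maintenance",
                           "detail": MAINTENANCE["message"]}).encode("utf-8")
        start_response("503 Service Unavailable", [
            ("Content-Type", "application/json"),
            ("Content-Length", str(len(body))),
            ("Retry-After", "300"),  # hint to clients: retry in five minutes
        ])
        return [body]
    body = b'{"status": "ok"}'
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

A fast, explicit 503 lets well-behaved clients back off and retry, which a hung connection does not.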

Post-incident

Write the incident summary per O-900. If the cause was a class of problem not covered by an existing procedure, write a new procedure. If the cause was covered but the procedure was incomplete or misleading, revise it.

O-102 · Incident Procedure

Proof Generation Failing.

O-102

Verification produces wrong or missing outputs

SEV-1 always

Symptoms

Proofs fail to verify; wrong values returned; customer dispute

Severity

SEV-1 always — wrong outputs erode the platform's reason to exist

Response Time

Within minutes

Immediate actions

Take the affected service offline immediately. Set the endpoint to return 503 with maintenance message. Wrong outputs harm customers and the platform's identity; service downtime does not.

Snapshot the current state. Capture the inputs and outputs of the most recent failing requests. Save logs from the past hour. Do not restart anything yet.

Check whether other services are affected. If the failure pattern correlates across services, infrastructure issue. If isolated, service-specific bug.

Diagnosis

Reproduce the failure. Take a failing input from the captured logs and run it through the service in isolation. Confirm the failure is reproducible before continuing.

Check upstream data integrity. If the service depends on archive-node witnesses, Load Network/Arweave evidence, or beacon chain data, verify the source data and proof packet are consistent. The service can fail because the data feeding it is wrong.

Check the verifier itself. Run the verifier with a known-good test vector that has previously verified successfully. If known-good vectors fail, the verifier is broken. If they succeed, the issue is input-specific.

Check for recent changes. Was anything deployed recently? Did a dependency update? Did upstream chain state change in a relevant way (fork, hard fork, sync committee transition)?
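The known-good vector check separates "the verifier is broken" from "this input is the problem". A minimal sketch of that decision, assuming the verifier can be called as a function from bytes to a boolean; the names and result strings are hypothetical.

```python
from typing import Callable, Iterable


def diagnose_verifier(verify: Callable[[bytes], bool],
                      known_good: Iterable[bytes],
                      failing_input: bytes) -> str:
    """Classify a proof failure using vectors that previously verified."""
    good_results = [verify(vector) for vector in known_good]
    if not good_results:
        # Without known-good vectors nothing can be concluded about the verifier.
        return "no-vectors"
    if not all(good_results):
        return "verifier-broken"
    if not verify(failing_input):
        return "input-specific"
    # Known-good vectors pass and the captured input passes too.
    return "not-reproduced"
```

A "not-reproduced" result points back at the environment: upstream data, recent deployments, or chain state at the time of the original request.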

Resolution paths

Verifier broken: revert to last known-good version. Service stays offline until verifier is confirmed correct.

Upstream data wrong: contact upstream provider, switch to alternate source if available, surface the issue to affected customers.

Customer-specific failure: isolate the failing input pattern, fix the handling, deploy. Monitor for recurrence before bringing service back online.

Cause unknown after 30 minutes: service stays offline. This is the right answer. Wrong outputs are worse than no outputs.

Customer communication

Customers who paid for failed proofs get refunded. Customers who relied on wrong outputs get notified. Communication is direct and honest about what happened. Hiding incidents that affected customers damages trust more than the incidents themselves.

O-103 · Incident Procedure

Facilitator Misbehaving.

O-103

x402 settlement fails or stalls

SEV-1 or SEV-2 depending on impact

Symptoms

Payments not settling; verify or settle calls failing

Severity

SEV-1 if revenue blocked; SEV-2 if delays only

Response Time

15-30 minutes

Diagnosis

Check facilitator's status page. Coinbase publishes x402 facilitator status. If the issue is on their side, response is communication, not technical fix.

Test with known-good signature. Send a test x402 request with a signature that should succeed. If verify fails on a signature known to be good, the issue is most likely on the facilitator's side.

Check Base mainnet status. Block production delays affect settlement timing. If Base is degraded, settlements will stall regardless of facilitator behavior.

Check own settlement wallet (K-002). Has the wallet been blocked, blacklisted, or had its funds frozen? Rare but possible.

Resolution paths

Facilitator outage: notify customers, set service to deferred-settlement mode if supported, wait for recovery. Do not switch facilitators mid-outage — adds operational complexity without fixing the immediate problem.

Base mainnet delays: communicate expected settlement times. Service can continue accepting requests; settlement catches up when Base recovers.

Wallet issue (K-002 affected): escalate to E-302 compromise response. New wallet generated and brought online.

Sustained facilitator failures (> 24 hours): evaluate switching to alternate facilitator (x402-rs self-hosted, or another commercial provider). This is a planned migration, not an emergency response.

O-104 · Incident Procedure

AO Process Unresponsive.

O-104

Compliance log not accepting writes

SEV-2 typically

Symptoms

Audit log writes failing; AO messages not confirming

Severity

SEV-2 — service can continue with degraded audit; SEV-1 if extended

Response Time

Hours, not minutes

Diagnosis

Check AO network status. The wider AO network's health affects all processes. Check ao.link or community status channels.

Try messaging the process directly. Use aos CLI to send a no-op message. Confirm whether the issue is process-specific or network-wide.

Check Arweave gateway availability. AO depends on Arweave for persistence; gateway issues cascade.

Check process state. Has the process accumulated too much state? Memory exhaustion in AO processes is real, and the only recovery is to spawn a successor process.

Resolution paths

Service can continue with deferred logging. Audit records queued locally, drained to AO when network recovers. Do not stop serving customers because audit logging is delayed.

AO network outage: wait for recovery. Communicate with affected customers about audit log delay. Audit eventually catches up because the queue is preserved.

Process-specific failure: spawn successor process, migrate references, transfer ownership through cold-tier ceremony (E-301).

Sustained outage (> 6 hours): the audit-trail property is core to the platform's identity. Surface the issue to customers, indicate what records will eventually be persisted, and offer transparency about the delay.
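Deferred logging amounts to an append-only local queue that is drained in order once AO accepts writes again. The sketch below uses a JSON-lines file; the file format, function names, and the `send` callback are assumptions, and it presumes the audit sink tolerates a record being sent twice if the process dies mid-drain.

```python
import json
from pathlib import Path
from typing import Callable


def enqueue(queue_file: Path, record: dict) -> None:
    """Append one audit record to the local queue."""
    with queue_file.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(record, sort_keys=True) + "\n")


def drain(queue_file: Path, send: Callable[[dict], bool]) -> int:
    """Send queued records in order; stop at the first failure and keep the rest."""
    if not queue_file.exists():
        return 0
    lines = queue_file.read_text(encoding="utf-8").splitlines()
    sent = 0
    for line in lines:
        if not send(json.loads(line)):
            break
        sent += 1
    # Rewrite the queue with only the records that were not delivered.
    remaining = lines[sent:]
    queue_file.write_text("".join(item + "\n" for item in remaining), encoding="utf-8")
    return sent
```

Stopping at the first failure preserves ordering, so the log that eventually lands on AO matches the order in which requests were served.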

O-105 · Incident Procedure

Hot Key Compromise Suspected.

O-105

K-001 or K-002 may be exposed

SEV-1 always

Symptoms

Unauthorized signatures; unexpected wallet activity; system intrusion evidence

Severity

SEV-1 — drop everything

Response Time

Within minutes

Immediate actions (in this order)

1. Stop all signing operations immediately. Set the production server to refuse new requests. Every signature produced during the response window is potential damage.

2. If K-002 (settlement wallet) is compromised: immediately sweep all funds to recovery destination per E-302. Time matters: the attacker may be racing to drain.

3. If K-001 (KYA platform identity) is compromised: do not yet rotate. The cold-tier ceremony for KYA rotation is deliberate, not panic. Continue to step 4 first.

4. Capture forensic evidence. Production server process state, recent logs, network connection records, any evidence of how the compromise occurred. Do this before any cleanup.

5. Initiate cold-tier rotation ceremony per E-302. K-008 signs the K-001 succession; new K-001 brought online; old K-001 revoked through KYA registry update.

Diagnosis (after immediate actions)

Reconstruct the attack timeline. When did access occur? What was signed during the compromise window? Which signatures were legitimate and which weren't?

Identify the entry point. Compromised credential? Code execution vulnerability? Supply chain? Each implies different downstream concerns.

Estimate blast radius. What other systems share any credentials, network paths, or trust relationships with the compromised key?
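Once the compromise window has been estimated, the audit records can be split into those signed inside it and those signed outside it. A minimal sketch, assuming each record is a dict with a timezone-aware `signed_at` timestamp; that record shape is hypothetical and not the AO log's actual schema.

```python
from datetime import datetime


def split_by_window(records, start: datetime, end: datetime):
    """Partition records into (suspect, outside) by the compromise window."""
    suspect, outside = [], []
    for record in records:
        # Bounds are inclusive: when in doubt, a signature is treated as suspect.
        if start <= record["signed_at"] <= end:
            suspect.append(record)
        else:
            outside.append(record)
    return suspect, outside
```

The suspect list is the starting point for customer notification; a record falling outside the window is only as trustworthy as the window estimate itself.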

Customer communication

Customers who received signatures during the compromise window need to know which signatures may be untrustworthy. The platform's audit trail on AO/Arweave shows every signature; customers can verify independently which were legitimate and which weren't.

Communication is direct, factual, and timely. Customers tolerate incidents handled transparently; they do not tolerate incidents that were hidden and only disclosed later.

Recovery completion

Once new keys are operational, the recovery is not complete until the entry-point vulnerability is fixed. Otherwise the new keys are compromised by the same vector that compromised the old ones. Address the vulnerability before bringing the platform fully back online.

O-106 · Incident Procedure

Customer Dispute About a Proof.

O-106

Customer claims output is wrong

SEV-2 typically

Symptoms

Customer claims a paid proof is incorrect or didn't verify

Severity

SEV-2 if isolated; escalates to SEV-1 if pattern emerges

Response Time

Hours, not days

Investigation

Locate the disputed transaction. Use the customer's transaction ID or payment hash to find the corresponding audit record in the AO compliance log. Every transaction is preserved.

Reproduce the verification. Take the same inputs and run them through the verifier independently. Confirm whether the original output was correct or incorrect.

Check the customer's verification process. Often the dispute is the customer's verification failing rather than the proof being wrong. Walk through their verification steps; most disputes resolve at this step.

Resolution paths

Proof was correct, customer's verification was wrong: explain the verification process. Provide reference implementation. Refund as goodwill if appropriate; the customer's confusion is their problem but goodwill costs little.

Proof was correct, customer relied on it for the wrong purpose: clarify the proof's actual scope. Service descriptions should be clear enough to prevent this; if they weren't, this is documentation work to do.

Proof was wrong: escalate to O-102. Refund the customer. Investigate whether the failure pattern is isolated or systematic. If systematic, take service offline and investigate as SEV-1.

Reputation discipline

Disputes are public information once they enter discovery layers and reputation systems. Handle them carefully. A correctly resolved dispute (proof was right, explained patiently, customer satisfied) is reputation-positive. A poorly handled dispute (defensive, slow, hidden) is reputation-negative regardless of who was technically correct.

O-107 · Incident Procedure

Anomalous Traffic Pattern.

O-107

Unexpected volume or pattern

SEV-2 if not actively harmful

Symptoms

Traffic spike, unusual request patterns, suspicious source distribution

Severity

SEV-3 if just unusual; SEV-2 if suspected attack; SEV-1 if causing harm

Response Time

Minutes for assessment, hours for action

Assessment

Characterize the pattern. What's anomalous specifically? Volume spike? New source? Unusual request shape? Each implies different responses.

Check for legitimate explanation. New customer ramping up? Marketing-driven traffic? Adjacent ecosystem event (lookalike attack on a competitor driving traffic to alternatives)? Most anomalies are legitimate.

Check for harm. Is the platform serving the traffic successfully, or is it degrading service for other customers? A spike in which everyone is still served is good news; a spike that causes failures is the actual problem.

Response options

Legitimate spike: celebrate. Make sure infrastructure scales. Schedule capacity review.

Suspected attack, no harm yet: apply rate limiting per source. Watch for evolution. Document the pattern for future detection.

Active attack causing harm: aggressive rate limits, block suspected attacker patterns, escalate to SEV-1 if customer-facing impact continues.

Pattern unclear after 1 hour: conservative rate limits in place, continue monitoring, schedule SEV-2 follow-up for fuller analysis.
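Per-source rate limiting is commonly implemented as a token bucket keyed by source. The sketch below is one such implementation under stated assumptions: the class name, the rate and burst parameters, and the idea of keying on a source string are illustrative, and a production limiter would also need to expire idle sources.

```python
import time


class PerSourceLimiter:
    """Token bucket per source: `burst` requests at once, refilled at `rate_per_sec`."""

    def __init__(self, rate_per_sec: float, burst: int, clock=time.monotonic):
        self.rate = rate_per_sec
        self.burst = burst
        self.clock = clock       # injectable so the limiter can be tested
        self.buckets = {}        # source -> (tokens, time of last update)

    def allow(self, source: str) -> bool:
        now = self.clock()
        tokens, last = self.buckets.get(source, (float(self.burst), now))
        # Refill for the time elapsed, capped at the burst size.
        tokens = min(float(self.burst), tokens + (now - last) * self.rate)
        if tokens >= 1.0:
            self.buckets[source] = (tokens - 1.0, now)
            return True
        self.buckets[source] = (tokens, now)
        return False
```

Conservative limits correspond to a low rate with a small burst; aggressive limits for an active attack lower both further for the suspected sources only.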

O-108 · Incident Procedure

Production Server Inaccessible.

O-108

Can't SSH, can't deploy, can't reach

SEV-1 if customer-facing affected

Symptoms

SSH timeouts, ping failures, host unreachable

Severity

SEV-1 if services affected; SEV-2 if services running but ops blocked

Response Time

15 minutes to assess

Diagnosis

Check whether services are responding to customers. Ops access blocked but customer-facing services running is SEV-2. Both blocked is SEV-1.

Check from a different network. The issue may be operator's network or VPN, not the server.

Check cloud provider status. Region outage affects many things; specific instance issue is different.

Use cloud provider's console access. Most providers offer serial console or VNC even when SSH is unreachable. Use this to inspect the running state.

Resolution paths

Network issue, services still serving: wait for network recovery. Avoid panic decisions such as rebooting the instance, which can make things worse.

Server unreachable, services not serving: this is SEV-1. Attempt to diagnose through the cloud console. If the server is unreachable through the console too, reboot the instance via the cloud provider's API.

Suspect compromise (not a network issue but host compromised): escalate to O-105. Do not reboot through cloud console — it destroys forensic evidence. Capture instance snapshot for forensics, provision new instance, migrate.

O-201 · Routine Procedure

Service Deployment.

O-201

Shipping a code change

Per-change

When

Whenever a service change is ready for production

Frequency

Per change; daily during active development

Duration

15-30 minutes typical

Pre-deployment

Confirm the change passed audit relay. The relay's plan was reviewed and the revision was accepted. Tests passed. See M-series.

Check current production state. Service responding normally? No active incidents? Traffic at normal levels? Deploy into calm seas, not storms.

Identify the rollback path. What's the previous version? How long does rollback take? Confirm before deploying.

Deployment

Tag the new version. Git tag, version number, or whatever the project uses for release identification.

Push to production. Through the deployment script or process for the specific service.

Verify the service is running new code. Health check, version endpoint, log inspection — confirm the new version is actually serving requests, not just installed.

Run smoke tests. A handful of test requests against the new version. Confirm responses look correct.

Post-deployment

Watch for 15 minutes. Monitor logs, error rates, and customer-facing health. Most deployment failures show up within minutes.

If issues appear: rollback to the previous version immediately. Investigate the issue without time pressure once production is back to a known-good state.

If smooth: record the deployment in the build journal. Note version, time, what changed.
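The verification and smoke-test steps reduce to a small go/no-go decision. A sketch under the assumption that the running version can be read from somewhere (a version endpoint, for instance) and that each smoke test is a callable returning a boolean; all names and result strings are hypothetical.

```python
from typing import Callable, Iterable


def post_deploy_decision(running_version: str,
                         expected_version: str,
                         smoke_tests: Iterable[Callable[[], bool]]) -> str:
    """Decide what to do right after pushing a new version."""
    if running_version != expected_version:
        # Installed is not the same as serving: the old code is still answering.
        return "investigate: new version is not serving requests"
    failures = sum(1 for test in smoke_tests if not test())
    if failures:
        return f"rollback: {failures} smoke test(s) failed"
    return "proceed: watch for 15 minutes"
```

A "proceed" result only starts the 15-minute watch; it does not replace it.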

O-202 · Routine Procedure

Treasury Sweep.

O-202

Moving operational funds to cold storage

Periodic

When

Operational wallet (K-002) exceeds defined threshold

Frequency

Weekly to monthly depending on revenue volume

Duration

10-15 minutes plus cold-tier ceremony

Why this exists

The hot-tier settlement wallet (K-002) is exposed to internet attack surface and is the most likely compromise target. Keeping its balance bounded means a successful compromise loses operational balance, not platform reserves. The sweep maintains this property by moving funds to the cold-tier treasury wallet (K-006) on a defined cadence.

Procedure

Check K-002 balance. Confirm it exceeds the sweep threshold (defined in operational document, not here).

Calculate sweep amount. Total balance minus the operational reserve to keep in K-002 for ongoing settlement.

Construct the sweep transaction. Send from K-002 to K-006. Transaction prepared on networked workstation; not yet signed.

Hot-tier signs the sweep. K-002's signing happens on the production server through the standard signing mechanism. This is the simple part.

Submit, confirm. Wait for chain confirmation. Verify K-006 received the expected amount.

Record in financial log. Date, amount, source balance, destination balance, transaction hash. This trail matters for tax accounting and for the platform's own reporting.
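The first two steps are simple arithmetic, shown here as a sketch. Amounts are integers in the token's smallest unit to avoid floating-point rounding; the function name is hypothetical, and the threshold and reserve values live in the operational document, not here.

```python
def sweep_amount(balance: int, threshold: int, reserve: int) -> int:
    """Amount to move from K-002 to K-006, in the token's smallest unit."""
    if min(balance, threshold, reserve) < 0:
        raise ValueError("amounts must be non-negative")
    # No sweep unless the balance exceeds the threshold.
    if balance <= threshold:
        return 0
    # Sweep everything above the operational reserve.
    return max(0, balance - reserve)
```

For example, with a balance of 1,500 units, a threshold of 1,000, and a reserve of 400, the sweep moves 1,100 units and leaves 400 in K-002.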

What this procedure is not

This is not a treasury disbursement. Sweeping funds into K-006 is routine and uses the hot-tier signing of K-002. Disbursing funds out of K-006 (paying expenses, transferring to personal accounts, paying down obligations) requires cold-tier ceremony per E-301. Different procedure, different security posture.

O-203 · Routine Procedure

Daily Health Check.

O-203

Five minute morning ritual

Daily

When

Morning, before any other platform work

Frequency

Daily

Duration

5 minutes if smooth; longer if issues found

The check

Service uptime. All five services responding to health checks. If any are not, jump to O-101 immediately.

Yesterday's transaction count and revenue. Did the platform serve traffic? At what volume? Note any anomalies.

K-002 wallet balance. Within expected range? Above sweep threshold? Schedule O-202 if needed.

AO compliance log writing successfully. Recent entries present? No gaps?

Audit relay queue. Anything awaiting review? Any cycles abandoned overnight?

Customer support inbox. Any disputes or questions overnight? Any repeating themes?

Why daily

Daily cadence catches issues while they're small. A wallet balance that's off by 10% is a bookkeeping mistake; the same gap discovered after a month is a compliance problem. A service that's slow today is a routine investigation; the same slowness that's been happening for three weeks is reputation damage.

The five-minute commitment keeps the ritual sustainable. If it grows to 30 minutes regularly, something is off — either too many fires, or the check is being padded with non-essential work.

O-204 · Routine Procedure

Monthly Review.

O-204

Reputation, revenue, capacity

Monthly

When

First weekend of the month

Frequency

Monthly

Duration

2-4 hours

Revenue and economics

Per-service transaction volume and revenue. Compare to A-300 pro forma projections. Note services performing above and below projection.

Per-service costs. GPU compute, gas, Arweave, infrastructure. Confirm gross margins are healthy.

Trajectory. Is the platform tracking toward the $30k/year floor and the all-five-earning trigger? Or diverging?

Pricing review. Customer behavior since launch. Adjust prices if data justifies — see A-300 / N.01 sensitivity notes.

Reputation and ecosystem

Discovery layer presence. Are registrations current? Any new directories worth registering on?

Customer feedback themes. Recurring praise, recurring complaints, recurring confusion points.

Competitive landscape. Have existing providers shipped x402-native versions? New entrants? Pricing changes? See A-320 / R.05.

Ecosystem catalysts. Any input-trust failures, lookalike incidents, or KYA standards developments worth marketing against?

Capacity and pace

Operator hours sustained. Honest assessment of the past month's working hours. Sustainable? Compromising health, relationships, or judgment? See A-320 / R.04.

Build progress. Where are the Phase 1 services in their build sequence? Tracking against blueprint estimates?

Day-job-quit decision check. If all five services are operational, run the decision matrix in A-310 / W.01. Otherwise, confirm continuing the side-pace plan.

Output

Monthly review produces a written summary preserved in the build journal. Decisions made are noted. Adjustments to subsequent months are noted. Concerns surfaced are noted with planned response.

O-205 · Routine Procedure

Adding a New Service.

O-205

Ramp from prototype to public

Per-launch

When

Each Phase 1 service launch; each Phase 2 expansion

Frequency

5 times in Phase 1; ongoing afterward

Duration

Spans days; a checkpoint procedure, not a single session

Pre-launch (services already feature-complete)

Service spec finalized. Phase 1 blueprint elevation sheet matches the implementation. Discrepancies resolved.

Test vectors documented. Known-good inputs and expected outputs preserved. Used for ongoing health checks.

Pricing decided. Per A-300 schedule or revised based on competitive analysis.

AO process for the service deployed. K-005 generated per E-300. Process operational on AO.

Audit relay configured for the service. Per-service config in config/services/A-2NN.yaml.

Soft launch

Service live but not yet announced. Endpoint responding, payment flow tested, audit logging confirmed, but not yet registered in any discovery layer.

Run synthetic transactions. Operator-driven test traffic to confirm everything works under realistic conditions.

Monitor for 24-48 hours. Catch any issues that only show up under sustained operation.

Public launch

Register on discovery layers. MCP registries, x402 service indexes, agent framework plugin lists. Per A-330 schedule when that document is complete.

Update landing page. Add to the services list, update any metrics, link to documentation.

Notify outbound prospects. Specific organizations that expressed interest. Per A-340 schedule when complete.

Monitor traffic ramp. First 30 days produce most of the data. Adjust pricing, capacity, documentation based on what shows up.

O-700 · Verification Primitive

BLS Sync Committee Primitive.

The cryptographic primitive at the bottom of Service 02 — Ethereum sync committee signature verification. One C-FFI function exported from a small Rust crate, compiled to either a native cdylib or a wasm32 module for the HyperBEAM device. Verifies one thing well; the layers above it do the rest.

Artifact location

Canonical source lives in the private repository k-luecke/bls-verifier on GitHub. Sources only: the libbls_verifier.so cdylib and the bls-test integration binary are reproducible from those sources via cargo build --release and are deliberately not committed. Add target/, Cargo.lock, *.so, and *.dylib to .gitignore. Note that ignoring Cargo.lock means a fresh build may resolve different dependency versions; commit it instead if reproducible builds matter.

The repository is structured as a Cargo workspace with three members: bls-verifier (the cdylib primitive — this sheet's subject), bls-test (a tokio binary that fetches a current sync committee from Lodestar mainnet and verifies it end-to-end), and bls-verify-cli (a stdin-JSON wrapper for ad-hoc verification; its JSON schema is throwaway and will be replaced by the HyperBEAM device interface).

INTERIM STATE Repository not yet created (as of 2026-05-01) O-700 / N.01

Until k-luecke/bls-verifier exists, the canonical source is the tarball bls-verifier-scaffold.tar.gz produced by the 2026-05-01 scaffold session. The tarball was created in an ephemeral sandbox and must be pulled to durable storage before the sandbox is reclaimed. Pull destination: operator's laptop, then push to a freshly-created k-luecke/bls-verifier private repo. Update this sheet to remove the interim notice once that repo exists.

What the primitive does

A single C-ABI function, verify_sync_committee, takes a flat buffer of 48-byte participating BLS pubkeys, a 96-byte aggregate signature, and a 32-byte signing root. It parses each pubkey, aggregates them with subgroup checking, and verifies the signature against the aggregate using the BLS POP DST. Return codes are documented inline in bls-verifier/src/lib.rs and reproduced below as the human-readable reference.

REFERENCE verify_sync_committee — return codes O-700 / R.01

Inputs.

pubkeys_ptr / pubkeys_len — concatenated 48-byte participating pubkeys
sig_ptr — 96-byte aggregate signature
signing_root_ptr — 32-byte signing root (caller computes domain + root)

Caller is responsible for.

Filtering pubkeys by sync committee participation bits
Computing the fork domain (fork_version is fork-dependent — fetch dynamically)
Computing the signing root: sha256(parent_root || domain)
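The last two caller responsibilities above can be sketched as follows. This is a minimal illustration, not the wrapper's actual code. It assumes the standard Ethereum consensus layout: the domain is a 4-byte domain type followed by the first 28 bytes of the fork data root, and the DOMAIN_SYNC_COMMITTEE value is taken from the consensus specification rather than from this repository. The function names are hypothetical. SHA-256 is passed in as a parameter so the sketch stays dependency-free; a real wrapper would use whichever SHA-256 implementation it already links.

```rust
/// Domain type for sync committee signatures (Ethereum consensus spec value).
/// Assumption: not taken from the bls-verifier sources.
pub const DOMAIN_SYNC_COMMITTEE: [u8; 4] = [0x07, 0x00, 0x00, 0x00];

/// Builds the 32-byte fork domain from fork_version and genesis_validators_root.
/// `sha256` is any function that returns the SHA-256 digest of its input.
pub fn compute_domain<H>(
    fork_version: [u8; 4],
    genesis_validators_root: [u8; 32],
    sha256: H,
) -> [u8; 32]
where
    H: Fn(&[u8]) -> [u8; 32],
{
    // Fork data root: the version zero-padded to a 32-byte chunk,
    // followed by the genesis validators root, hashed once.
    let mut preimage = [0u8; 64];
    preimage[..4].copy_from_slice(&fork_version);
    preimage[32..].copy_from_slice(&genesis_validators_root);
    let fork_data_root = sha256(&preimage);

    // Domain: 4-byte domain type, then the first 28 bytes of the fork data root.
    let mut domain = [0u8; 32];
    domain[..4].copy_from_slice(&DOMAIN_SYNC_COMMITTEE);
    domain[4..].copy_from_slice(&fork_data_root[..28]);
    domain
}

/// Computes the signing root as sha256(parent_root || domain).
pub fn compute_signing_root<H>(parent_root: [u8; 32], domain: [u8; 32], sha256: H) -> [u8; 32]
where
    H: Fn(&[u8]) -> [u8; 32],
{
    let mut preimage = [0u8; 64];
    preimage[..32].copy_from_slice(&parent_root);
    preimage[32..].copy_from_slice(&domain);
    sha256(&preimage)
}
```

The 32-byte result of compute_signing_root is what the caller would hand to the primitive as signing_root_ptr.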

Return codes.

1 signature verified
0 signature invalid
-1 signature parse failed (not a valid 96-byte G2 point)
-2 no pubkeys provided (pubkeys_len == 0)
-3 aggregation failed (subgroup check or internal blst error)
-4 malformed pubkey chunk (any 48-byte slice that is not a valid G1 point)

Single source of truth is the doc-comment block in lib.rs. If the codes ever change, both the source and this sheet must be updated together.
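For callers on the Rust side of the FFI boundary, the table above might be mirrored as an enum so that raw integers do not leak into logs and incident notes. This is a sketch under assumptions: the enum and method names are hypothetical, and the return type is assumed to be a 32-bit signed integer. The doc-comment block in lib.rs remains the authority.

```rust
/// Human-readable view of the verify_sync_committee return codes.
/// Names are illustrative; the numeric values follow the table in this sheet.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum VerifyOutcome {
    Verified,             //  1
    Invalid,              //  0
    SignatureParseFailed, // -1
    NoPubkeys,            // -2
    AggregationFailed,    // -3
    MalformedPubkey,      // -4
    /// Any code not listed in the table. Treat as a failure.
    Unknown(i32),
}

impl VerifyOutcome {
    /// Maps a raw return code to its documented meaning.
    pub fn from_code(code: i32) -> Self {
        match code {
            1 => VerifyOutcome::Verified,
            0 => VerifyOutcome::Invalid,
            -1 => VerifyOutcome::SignatureParseFailed,
            -2 => VerifyOutcome::NoPubkeys,
            -3 => VerifyOutcome::AggregationFailed,
            -4 => VerifyOutcome::MalformedPubkey,
            other => VerifyOutcome::Unknown(other),
        }
    }

    /// True only for code 1. Every other outcome, including unknown
    /// codes, must be handled as "not verified".
    pub fn is_verified(self) -> bool {
        self == VerifyOutcome::Verified
    }
}
```

Routing unlisted codes to Unknown rather than panicking keeps a future code change from crashing the wrapper before this sheet has been updated.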

What this primitive does not do

CRITICAL The primitive is not a complete sync committee verifier O-700 / W.01

verify_sync_committee verifies one thing — that an aggregated BLS signature over a given signing root is valid against an aggregated pubkey. It does not, by itself, constitute a sync committee verifier. A future engineer or future-Claude looking at lib.rs in isolation may mistake it for the whole story and wire it into production with assumptions the primitive does not satisfy. The primitive specifically does not:

Filter participating pubkeys from the 512-validator sync committee using the participation bitfield
Compute the fork domain from fork_version and genesis_validators_root
Compute the signing root from parent_root and the domain
Validate that the supplied pubkey set has length 512 (or any other count)
Track fork epoch transitions or fetch fork versions from the beacon API
Handle network I/O of any kind — it is a pure function over byte buffers

The HyperBEAM device that wraps this primitive does all of the above. Calling the primitive directly, without a wrapper that supplies these preconditions, can return 1 (signature verified) for inputs that do not correspond to the real sync committee state: a valid signature over the wrong signing root or the wrong pubkey set still verifies. That is exactly the failure mode this runbook entry exists to prevent.
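To make the first item in the list above concrete, here is a minimal sketch of participation-bit filtering as a wrapper might perform it. The function name is hypothetical. It assumes the bitfield uses SSZ Bitvector ordering (bit i lives in byte i / 8 at position i % 8, least-significant bit first) and that the committee size is a multiple of 8, which holds for the 512-validator committee.

```rust
/// Size of one BLS pubkey in the primitive's input buffer.
const PUBKEY_LEN: usize = 48;

/// Copies the pubkeys whose participation bit is set into one flat buffer,
/// in committee order, ready to pass as pubkeys_ptr / pubkeys_len.
///
/// `committee` is the full committee as concatenated 48-byte pubkeys.
/// `bits` is the participation bitfield, one bit per committee member.
/// Returns None when the two inputs disagree about the committee size.
pub fn filter_participants(committee: &[u8], bits: &[u8]) -> Option<Vec<u8>> {
    if committee.len() % PUBKEY_LEN != 0 {
        return None;
    }
    let member_count = committee.len() / PUBKEY_LEN;
    if bits.len() * 8 != member_count {
        return None;
    }

    let mut participants = Vec::with_capacity(committee.len());
    for (i, pubkey) in committee.chunks_exact(PUBKEY_LEN).enumerate() {
        // Read bit i, least-significant bit first within each byte.
        if (bits[i / 8] >> (i % 8)) & 1 == 1 {
            participants.extend_from_slice(pubkey);
        }
    }
    Some(participants)
}
```

An empty result means nobody participated; the primitive would answer that with -2, so a wrapper may prefer to reject it before crossing the FFI boundary.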

Failure modes

Three categories of failure surface in different ways and require different remediations. Distinguishing them at first observation saves time during incident response.

Sandbox-class failures

Network outbound to lodestar-mainnet.chainsafe.io denied at the host policy layer. Surfaces as a JSON decode error at the very first beacon API call — typically reqwest::Error { kind: Decode, "expected value", line 1 column 1 } because the response body is plain-text "Host not in allowlist" rather than JSON. Direct curl -i against the endpoint confirms the deny by returning HTTP/2 403 with header x-deny-reason: host_not_allowed.

Remediation is environmental, not code. Run the verifier in an environment with mainnet beacon access (operator laptop, RunPod, or HyperBEAM node). Not a code defect; do not file an incident.

Beacon endpoint outage

Lodestar (or whichever beacon endpoint is in use) goes down, returns 5xx, gets rate-limited, returns malformed JSON, or times out under load. Symptomatically similar to sandbox-class failures from the caller's perspective — JSON decode failures, connection resets — but the remediation is different: failover to an alternate beacon endpoint rather than relocating the runtime.
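Because the two classes look alike from the caller's side, it can help to classify a raw probe of the endpoint before choosing a remediation. The sketch below is illustrative only: the type and function names are hypothetical, and the rules simply encode the symptoms described in this sheet (the x-deny-reason header value, the plain-text deny body, 5xx and rate-limit statuses, non-JSON bodies, and no response at all).

```rust
/// First-pass classification of a beacon endpoint probe.
#[derive(Debug, PartialEq, Eq)]
pub enum ProbeClass {
    /// Host policy denied the request. Relocate the runtime.
    SandboxDeny,
    /// The endpoint itself is failing. Fail over to another provider.
    EndpointOutage,
    /// Status 200 and a body that at least starts like a JSON object.
    LooksHealthy,
    /// Anything else. Needs a human look.
    Unclassified,
}

/// `status` is None when no HTTP response arrived (timeout, reset).
/// `deny_reason` is the value of the x-deny-reason header, if present.
pub fn classify_probe(status: Option<u16>, deny_reason: Option<&str>, body: &str) -> ProbeClass {
    // The deny markers take priority: they identify the sandbox class
    // regardless of status code.
    if deny_reason == Some("host_not_allowed") || body.contains("Host not in allowlist") {
        return ProbeClass::SandboxDeny;
    }
    match status {
        None => ProbeClass::EndpointOutage,
        Some(429) | Some(500..=599) => ProbeClass::EndpointOutage,
        Some(200) if body.trim_start().starts_with('{') => ProbeClass::LooksHealthy,
        // 200 with a non-JSON body is the malformed-JSON outage case.
        Some(200) => ProbeClass::EndpointOutage,
        Some(_) => ProbeClass::Unclassified,
    }
}
```

The inputs map directly onto what curl -i shows: the status line, the x-deny-reason header, and the body.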

The integration test bls-test currently hardcodes a single Chainsafe endpoint, which is acceptable for a scaffold. Production HyperBEAM-device implementations must support beacon endpoint failover with at least two and ideally three independent providers. A sync committee verifier that depends on a single beacon endpoint inherits that endpoint's availability as a hard dependency.
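The failover requirement can be sketched as a small ordered-retry helper. This is an assumption about shape, not the device's implementation: the function name is hypothetical, and the actual HTTP call is supplied by the caller as a closure so the sketch carries no network dependency.

```rust
/// Tries each beacon endpoint in order and returns the first success,
/// together with the index of the endpoint that served it.
/// If every endpoint fails, returns all errors in attempt order so the
/// incident notes can show what each provider said.
pub fn first_success<T, E, F>(endpoints: &[&str], mut fetch: F) -> Result<(usize, T), Vec<E>>
where
    F: FnMut(&str) -> Result<T, E>,
{
    let mut errors = Vec::with_capacity(endpoints.len());
    for (index, &url) in endpoints.iter().enumerate() {
        match fetch(url) {
            Ok(value) => return Ok((index, value)),
            Err(err) => errors.push(err),
        }
    }
    Err(errors)
}
```

Returning the index makes it visible in logs when traffic has moved off the primary provider, which is worth a SEV-3 note even though verification kept working.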

Fork-boundary regression

At the next mainnet fork (Glamsterdam), the current_version value returned by the beacon API /eth/v1/beacon/states/{slot}/fork endpoint will change from the present 0x06000000 (Fulu, active since 2025-12-03) to the next allocated value. As long as the verifier fetches fork version dynamically — which the post-fix bls-test does and which the production HyperBEAM device must — this is not a regression. The signing root will recompute correctly with the new domain and verification will continue to succeed.

A future operator observing the fork version value change in production logs may mistake this for a bug. It is not. The primitive itself is fork-version-agnostic; the wrapper supplies the value. Verify that the wrapper is fetching the value dynamically (rather than hardcoding) before treating any fork-boundary value change as an incident.
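Fetching the value dynamically means parsing whatever hex string the beacon API returns rather than comparing it against a constant. A minimal sketch of that parsing step follows; the function name is hypothetical, and the input format (a 0x prefix followed by eight hex digits, as in 0x06000000) is taken from the value quoted above.

```rust
/// Parses a fork version string such as "0x06000000" into its 4 bytes.
/// Returns None for anything that is not "0x" plus exactly 8 hex digits.
pub fn parse_fork_version(text: &str) -> Option<[u8; 4]> {
    let digits = text.strip_prefix("0x")?;
    if digits.len() != 8 || !digits.bytes().all(|b| b.is_ascii_hexdigit()) {
        return None;
    }
    let mut version = [0u8; 4];
    for (i, byte) in version.iter_mut().enumerate() {
        // Each byte is two hex digits; the slice is safe because the
        // string was checked to be 8 ASCII characters.
        *byte = u8::from_str_radix(&digits[2 * i..2 * i + 2], 16).ok()?;
    }
    Some(version)
}
```

Nothing in the parser knows which fork is current, which is the point: a new value at the next fork flows through unchanged.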

Next layer up

The HyperBEAM device that wraps this primitive is the production verifier and is the proper subject of Service 02 in the A-series blueprint. The device handles the participation-bit filtering, domain computation, signing-root computation, fork-version fetching, beacon endpoint failover, AO compliance hook, and x402 facilitator integration. It is the layer that becomes a public service; the primitive in k-luecke/bls-verifier is one component of it.

When the HyperBEAM device exists, it should have its own runbook entry in this series — likely O-701 — documenting the device-level operations (build, deploy, monitor) and pointing back to this sheet for the primitive contract. Subsequent verification primitives (audit relay signatures, identity signing keys) extend the same series: O-720, O-730, and so on.

Status update (2026-05-02). O-701 exists as a sketch and a scaffolded implementation. The harness lives in k-luecke/bls-verifier's bls-device/ crate; the production runbook is O-702. Operator commands for the local bring-up of the device, plus the matching HTTP front-end, are documented in paxiom/docs/hyperbeam-bringup.md and paxiom/docs/sync-committee-service.md. The Phase 0 substrate gate audit trail — PXM-G-100 — names what counts as "closed" for each gate.

O-900 · Post-Incident Discipline

Writing the runbook entry that wasn't there.

The runbook grows by use. Every incident not covered should add a procedure. Every procedure that worked badly should be revised. The work after the incident is as important as the work during it.

Within 24 hours

While the incident is fresh, capture what happened. Format:

Summary. One paragraph: what broke, what the impact was, how it was resolved.
Timeline. Timestamped sequence of events. When detected, when escalated, what was tried, what worked.
Root cause. The underlying reason, not just the proximate symptom. Five whys if necessary.
What worked. Procedures that helped, instincts that were right, tools that performed well.
What didn't. Confusions, dead ends, missing procedures, tools that failed.
Customer impact. Who was affected, how, what communication occurred.
Action items. Specific changes to procedures, code, monitoring, or operations to prevent recurrence.

Within a week

Update the runbook. New procedures for incident classes that weren't covered. Revisions to procedures that worked badly. Add to the risk register if the incident exposed a previously unidentified failure mode.

Write a build journal entry that links to the post-incident summary. The journal's narrative gives context; the summary provides detail.

Within a month

Close the action items. If the incident exposed a class of problem that needs systemic response (architectural change, monitoring upgrade, key rotation, dependency replacement), schedule the work and execute it. Open action items beyond 30 days are a sign that the lesson hasn't been absorbed.

The discipline this requires

Post-incident work is unglamorous and easily skipped. The fire is out; moving on feels good. But incidents that aren't documented recur unexplained, and procedures that aren't updated stay subtly broken.

The platform's compounding advantage in operational reliability comes from this discipline specifically. Every incident handled well is a reduction in the probability of similar incidents in the future. Every procedure written calmly is a 3am decision that doesn't have to be made under pressure. The runbook itself is infrastructure that earns its keep across the operational lifetime of the platform.

Revision log

Rev. A — 2026.04.29. Initial issue. Eight incident procedures (O-101 through O-108). Five routine procedures (O-201 through O-205). First response discipline (O-020) and post-incident discipline (O-900). Severity framework (SEV-1, SEV-2, SEV-3) established.

Subsequent revisions will add procedures as incidents reveal gaps. The runbook is expected to grow by 1-3 procedures per quarter during active operation. Rev. B expected after first SEV-1 incident.

One last thing

This document exists to be opened in moments when calm thinking is hard. Every word in it was written assuming the reader is tired, stressed, and possibly facing something unfamiliar. The procedures don't make the operator brilliant; they let an ordinarily competent operator respond well even when conditions are bad.

Use it. Update it. Keep it close.

Build the engine first. The moats come later. The runbook is what keeps both running.
