
Handle node downtime and recovery

Use this guide when an operator-backed environment is unhealthy and you need to decide whether to re-verify dependencies, restore the deployment baseline, or escalate to a private operator process.

Prerequisites

  • You have already classified the issue as node-operations work.
  • You know which localnet mode the environment is supposed to be using.
  • You can inspect operator, KYC, chain, and deployment-service readiness.

1. Classify the outage before changing anything

Put the failure into one of these buckets first:
Failure class and typical signal:
  • mode mismatch: chain and deployment-service URLs do not match the same localnet mode
  • dependency failure: Postgres, Redis, PCCS, AESM, oracle, or KYC-side prerequisites are unavailable
  • readiness failure: operator or KYC status endpoints do not report a healthy state
  • baseline drift: the environment has accumulated state and no longer reflects the deployment baseline
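The buckets above can be sketched as a crude triage helper. The signal strings matched here are illustrative assumptions, not the output of any real tool; the value of the sketch is only that every incident lands in exactly one bucket before anyone restarts anything.

```shell
# Hypothetical triage helper: map a free-form failure signal to one of the
# four failure classes from this guide. The patterns are examples only.
classify_outage() {
  case "$1" in
    *mode*|*url-mismatch*)                       echo "mode mismatch" ;;
    *postgres*|*redis*|*pccs*|*aesm*|*oracle*)   echo "dependency failure" ;;
    *status*|*readiness*)                        echo "readiness failure" ;;
    *drift*|*accumulated*)                       echo "baseline drift" ;;
    *)                                           echo "unclassified" ;;
  esac
}
```

An "unclassified" result is itself a signal: gather more evidence before acting.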

2. Re-verify the mode and ownership contracts

Before restarting services, confirm:
  • ETH_RPC_URL and DEPLOYMENT_SERVICE_URL match the same localnet mode
  • APP_CONFIG=/opt/dexlabs
  • deployment-service is still the only owner of addresses.json
If those assumptions are wrong, correct them before you continue.
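A minimal sketch of these checks, assuming that in a given localnet mode the chain and the deployment-service run on the same host; if your modes are distinguished differently (by port, by hostname suffix), adapt the comparison accordingly. The addresses.json ownership check stays a process-level check and is not automated here.

```shell
# Hedged sketch: verify the mode and config assumptions before restarting
# anything. The same-host heuristic is an assumption, not part of this guide.
check_mode_contract() {
  local eth="${ETH_RPC_URL:?ETH_RPC_URL is unset}"
  local dep="${DEPLOYMENT_SERVICE_URL:?DEPLOYMENT_SERVICE_URL is unset}"
  local eth_host dep_host
  # Extract the host component of each URL.
  eth_host=$(printf '%s' "$eth" | sed -E 's#^[a-z]+://([^:/]+).*#\1#')
  dep_host=$(printf '%s' "$dep" | sed -E 's#^[a-z]+://([^:/]+).*#\1#')
  if [ "$eth_host" != "$dep_host" ]; then
    echo "mode mismatch: $eth_host vs $dep_host" >&2
    return 1
  fi
  if [ "${APP_CONFIG:-}" != "/opt/dexlabs" ]; then
    echo "APP_CONFIG is '${APP_CONFIG:-unset}', expected /opt/dexlabs" >&2
    return 1
  fi
  echo "mode contract OK"
}
```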

3. Recheck the shared dependencies

Confirm the required dependencies are available again:
  • chain and deployment-service
  • Postgres
  • Redis
  • PCCS and AESM when enclave-capable runtime is in use
  • oracle and KYC-side upstream dependencies
If a shared dependency is unavailable, recover that dependency first instead of trying ad-hoc node restarts.
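The dependency check can be reduced to reachability probes before any deeper diagnosis. This sketch uses a generic TCP probe; the hosts and ports are common defaults (5432 for Postgres, 6379 for Redis), not values taken from this guide, so substitute your environment's actual endpoints.

```shell
# Hedged sketch: generic TCP reachability probes for the shared dependencies.
# Hosts and ports below are illustrative defaults; adjust to your deployment.
probe_tcp() {                     # usage: probe_tcp <name> <host> <port>
  local name=$1 host=$2 port=$3
  if timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null; then
    echo "$name: reachable"
  else
    echo "$name: UNREACHABLE"
    return 1
  fi
}

check_shared_deps() {
  local failed=0
  probe_tcp postgres 127.0.0.1 5432 || failed=1
  probe_tcp redis    127.0.0.1 6379 || failed=1
  # PCCS/AESM probes apply only when an enclave-capable runtime is in use.
  return $failed
}
```

If any probe fails, recover that dependency first, exactly as the step above says; node restarts cannot fix an unreachable upstream.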

4. Recheck operator and KYC readiness

Treat the status endpoints as the first readiness contract:
Surface and healthy posture:
  • operator: /v2/status reports a healthy running state and the localnet readiness checks succeed
  • KYC: /v1/status reports isHealthy=true
If those readiness signals are still failing, do not assume that request sequencing or trading flows are the root problem.
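The two readiness probes above can be scripted as follows. The status paths (/v2/status, /v1/status) and the isHealthy field come from this guide; the base URLs are assumed placeholders you must set for your environment.

```shell
# Hedged sketch: check both readiness surfaces before looking anywhere else.
# OPERATOR_URL and KYC_URL are assumptions; export your real endpoints.
OPERATOR_URL=${OPERATOR_URL:-http://localhost:8080}
KYC_URL=${KYC_URL:-http://localhost:8090}

kyc_is_healthy() {   # reads a /v1/status JSON body on stdin
  grep -q '"isHealthy"[[:space:]]*:[[:space:]]*true'
}

check_readiness() {
  curl -fsS "$OPERATOR_URL/v2/status" >/dev/null \
    || { echo "operator: not ready" >&2; return 1; }
  curl -fsS "$KYC_URL/v1/status" | kyc_is_healthy \
    || { echo "kyc: not healthy" >&2; return 1; }
  echo "readiness OK"
}
```

A jq-based check (`jq -e .isHealthy`) is sturdier than grep if jq is available; the grep form is used here only to stay dependency-free.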

5. Restore the canonical baseline when the environment has drifted

If the environment is unhealthy because it has drifted from the known-good localnet state:
  1. restore the deployment baseline through the owning deployment-service flow
  2. keep release and registration effects outside the baseline snapshot
  3. rerun the registration and readiness checks after the baseline is clean again
Use the deploy-baseline model, not an ad-hoc checkpoint model.
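The key property of step 1 in this flow is that the owning deployment-service performs the restore, never an ad-hoc copy of state files. The sketch below illustrates that shape only: the restore endpoint path is hypothetical, and the CURL variable exists purely as a test seam.

```shell
# Hedged sketch: drive baseline restore through the owning service.
# The /v1/baseline/restore path is hypothetical; use whatever restore flow
# your deployment-service actually exposes.
restore_baseline_flow() {
  local dep="${DEPLOYMENT_SERVICE_URL:?DEPLOYMENT_SERVICE_URL is unset}"
  # CURL override lets dry runs substitute the transport command.
  "${CURL:-curl}" -fsS -X POST "$dep/v1/baseline/restore" || return 1
  echo "baseline restored"
}
```

After a successful restore, rerun the registration and readiness checks, since release and registration effects are intentionally outside the baseline snapshot.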

6. Revalidate registration-dependent health

After baseline recovery, recheck:
  • release measurements and registration-report availability
  • registration sequencing for operator and KYC paths
  • operator and KYC readiness after registration completes
If readiness still fails after those checks, escalate with the exact failing stage rather than a generic downtime summary.
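Registration-dependent health rarely flips to green on the first poll after recovery, so bound the rechecks with an explicit retry budget rather than looping indefinitely. A generic retry wrapper, with attempt counts chosen arbitrarily for illustration:

```shell
# Hedged sketch: retry any readiness check a bounded number of times.
wait_until() {   # usage: wait_until <attempts> <delay_seconds> <command...>
  local attempts=$1 delay=$2 i; shift 2
  for i in $(seq 1 "$attempts"); do
    if "$@"; then return 0; fi
    sleep "$delay"
  done
  echo "still failing after $attempts attempts: $*" >&2
  return 1
}
```

When the budget is exhausted, the wrapper's final message names the exact failing check, which is precisely what the escalation above asks for instead of a generic downtime summary.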

7. Decide the next route

If you need any of the following, use the matching route:
  • exact readiness, mode, and invariant lookup: Node Operations Reference
  • SGX-specific attestation diagnosis: How to Troubleshoot SGX Attestation Issues
  • monitoring signals before the next restart: How to Set Up Node Monitoring and Alerting
  • a private runbook or incident path: Support Channels
Last modified on April 12, 2026