Set up node monitoring and alerting

Use this guide when you already operate an operator-backed environment and need a practical monitoring baseline for health, request flow, and storage pressure.

Prerequisites

  • You already know the environment is on a node-operations path.
  • You know which operator and KYC endpoints you are monitoring.
  • You can observe the deployment-service, chain, operator, and KYC dependencies that belong to that environment.

1. Start with the readiness endpoints

Treat these as the first health signals:
  • operator: /v2/status plus the running-state and health fields used by the localnet readiness check
  • KYC: /v1/status with isHealthy=true
  • deployment-service: health and deployment responsiveness for the owned baseline/reset path
If readiness is red, alert on that first before digging into secondary performance signals.
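As a minimal sketch of that first gate, the two readiness payloads above can be evaluated like this. Only the field names isHealthy, runningState, and health come from this guide; the accepted values ("RUNNING", "HEALTHY") and the payload shapes are assumptions for illustration.

```python
def operator_ready(status: dict) -> bool:
    """Readiness gate over the operator /v2/status payload.

    Checks the running-state and health fields; the accepted
    string values here are illustrative assumptions.
    """
    return (
        status.get("runningState") == "RUNNING"
        and status.get("health") == "HEALTHY"
    )


def kyc_ready(status: dict) -> bool:
    """Readiness gate over the KYC /v1/status payload (isHealthy=true)."""
    return status.get("isHealthy") is True


# Alert on readiness first, before any secondary performance signal.
if not operator_ready({"runningState": "STARTING", "health": "HEALTHY"}):
    print("ALERT: operator readiness red")
```

A real probe would fetch and parse the JSON from each endpoint before calling these checks; keeping the evaluation separate from the HTTP fetch makes the gate easy to test.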

2. Add operator progress signals

Track the fields that show whether the operator is actually moving:
  • runningState: confirms whether the node is in a serving role
  • health: distinguishes serving state from degraded or blocked startup
  • lastRequest.requestIndex: shows whether requests are advancing
  • lastTx: shows whether transaction progression is stalled
  • raftMetrics.nodeId and raftMetrics.voterIds: help explain cluster-role and voter-set issues in status output
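A progression-stall check usually compares two consecutive status samples rather than a single snapshot. This sketch uses the lastRequest.requestIndex and lastTx fields named above; the sampling cadence and the exact payload nesting are assumptions.

```python
def progression_stalled(prev: dict, curr: dict) -> bool:
    """True when neither the request index nor the transaction marker
    advanced between two consecutive status samples.

    Field names (lastRequest.requestIndex, lastTx) follow the
    operator status output; the payload nesting is assumed.
    """
    same_request = (
        curr.get("lastRequest", {}).get("requestIndex")
        == prev.get("lastRequest", {}).get("requestIndex")
    )
    same_tx = curr.get("lastTx") == prev.get("lastTx")
    return same_request and same_tx
```

Pair this with runningState: a stall only matters while the node is supposed to be serving, which is exactly the "progression stall" alert class defined later in this guide.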

3. Monitor the shared dependencies

Add explicit checks for:
  • chain reachability
  • deployment-service reachability
  • Postgres availability
  • Redis availability
  • PCCS and AESM when enclave-capable runtime is in use
  • oracle-source availability
If those dependencies fail, alert there instead of only alerting on downstream operator symptoms.
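For dependencies that expose a plain TCP endpoint (Postgres, Redis, and similar), a basic reachability sweep can look like the following. The host/port map is entirely hypothetical; PCCS, AESM, and the chain may need protocol-specific probes rather than a raw TCP connect.

```python
import socket

# Hypothetical endpoints; substitute the environment's real ones.
DEPENDENCIES = {
    "postgres": ("localhost", 5432),
    "redis": ("localhost", 6379),
    # chain, deployment-service, PCCS/AESM, and oracle-source would be
    # added here with checks appropriate to each protocol.
}


def unreachable(deps: dict, timeout: float = 2.0) -> list:
    """Return the names of dependencies that refuse a TCP connection."""
    down = []
    for name, (host, port) in deps.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                pass  # connected: dependency is reachable
        except OSError:
            down.append(name)
    return down
```

Alerting on the names this returns keeps the dependency failure visible directly, instead of surfacing only as downstream operator symptoms.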

4. Monitor request, transaction, and system pressure

The repo's monitoring scripts point to these practical categories:
  • request KPIs: busiest request paths and aggregate request volume
  • transaction logs: whether transaction handling is moving or backing up
  • system KPIs: CPU and memory pressure
  • disk and database storage: filesystem pressure and database growth
These signals help distinguish “node is down” from “node is running but falling behind.”
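The storage-pressure category in particular is simple to check locally. This sketch flags a filesystem once usage crosses a threshold; the 85% default is an illustrative choice, not a value from this guide, and database growth would need a separate query against Postgres.

```python
import shutil


def disk_pressure(path: str = "/", threshold: float = 0.85) -> bool:
    """True when filesystem usage at `path` crosses the alert threshold.

    The 0.85 default is an assumed example value; tune it per environment.
    """
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold
```

Sampling this alongside request and transaction KPIs is what lets the alerting distinguish "node is down" from "node is running but falling behind".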

5. Turn health signals into alert classes

Use simple alert classes:
  • readiness: operator or KYC readiness endpoint stays unhealthy
  • dependency: PCCS, AESM, Postgres, Redis, chain, or deployment-service becomes unavailable
  • progression stall: request index or transaction progression stops moving while the node is supposed to serve
  • storage pressure: disk or database growth threatens the environment's stability
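These classes can be encoded as a simple priority-ordered mapping. Only the four class names come from this guide; the boolean signal keys are hypothetical inputs that the earlier checks would populate. Readiness is evaluated first, matching the advice to alert on readiness before secondary signals.

```python
def classify_alert(signal: dict) -> str:
    """Map a set of boolean monitoring signals onto one alert class.

    Class names follow this guide; the signal keys are assumed
    placeholders for the outputs of the individual checks.
    Order matters: readiness outranks the other classes.
    """
    if signal.get("readiness_red"):
        return "readiness"
    if signal.get("dependency_down"):
        return "dependency"
    if signal.get("progression_stalled"):
        return "progression stall"
    if signal.get("storage_pressure"):
        return "storage pressure"
    return "ok"
```

Keeping the classification in one place makes it easy to route each class to a different follow-up path.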

6. Route follow-up work correctly

After an alert fires:
Last modified on April 12, 2026