# Set up node monitoring and alerting
Use this guide when you already operate an operator-backed environment and need a practical monitoring baseline for health, request flow, and storage pressure.

## Prerequisites

- You already know the environment is on a node-operations path.
- You know which operator and KYC endpoints you are monitoring.
- You can observe the deployment-service, chain, operator, and KYC dependencies that belong to that environment.
## 1. Start with the readiness endpoints

Treat these as the first health signals:

| Surface | Signal to watch |
|---|---|
| operator | /v2/status plus the running-state and health fields used by the localnet readiness check |
| KYC | /v1/status with isHealthy=true |
| deployment-service | health and deployment responsiveness for the owned baseline/reset path |
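The readiness checks above can be sketched as small predicates over the status payloads. This is a hedged sketch: the KYC `isHealthy` field comes from the table, but the exact operator field names and values (`runningState`, `health`, and the strings they carry) are assumptions you should adjust to your actual status responses.

```python
import json
from urllib.request import urlopen


def kyc_ready(status: dict) -> bool:
    """KYC surface is ready when /v1/status reports isHealthy=true."""
    return status.get("isHealthy") is True


def operator_ready(status: dict) -> bool:
    """Operator is ready when the running-state and health fields look serving.
    The field names and values here are assumptions; match them to your payload."""
    return status.get("runningState") == "RUNNING" and status.get("health") == "healthy"


def poll(url: str) -> dict:
    """Fetch one readiness payload; callers decide retry and alert policy."""
    with urlopen(url, timeout=5) as resp:
        return json.load(resp)
```

A poller would call `poll("https://<operator-host>/v2/status")` on an interval and feed the result into `operator_ready`.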
## 2. Add operator progress signals

Track the fields that show whether the operator is actually moving:

| Signal | Why to watch it |
|---|---|
| runningState | confirms whether the node is in a serving role |
| health | distinguishes serving state from degraded or blocked startup |
| lastRequest.requestIndex | shows whether requests are advancing |
| lastTx | shows whether transaction progression is stalled |
| raftMetrics.nodeId and raftMetrics.voterIds | help explain cluster-role and voter-set issues in status output |
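A progression-stall check compares lastRequest.requestIndex between successive polls and only flags a stall while the node claims a serving role. A minimal sketch, assuming the same hypothetical field names and values as your status payload uses:

```python
class ProgressionWatch:
    """Tracks lastRequest.requestIndex between polls and flags a stall
    when the index stops advancing while the node is supposed to serve."""

    def __init__(self):
        self.last_index = None

    def observe(self, status: dict) -> bool:
        """Return True when this sample indicates a progression stall."""
        index = status.get("lastRequest", {}).get("requestIndex")
        serving = status.get("runningState") == "RUNNING"  # assumed value
        stalled = serving and index is not None and index == self.last_index
        self.last_index = index
        return stalled
```

In practice you would require the index to stay flat across several polls before alerting, to avoid firing on a single quiet interval.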
## 3. Monitor the shared dependencies

Add explicit checks for:

- chain reachability
- deployment-service reachability
- Postgres availability
- Redis availability
- PCCS and AESM when enclave-capable runtime is in use
- oracle-source availability
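For dependencies that do not expose a status endpoint, a cheap TCP connect probe is often enough to detect unavailability. A sketch using the standard library; the hostnames and ports in `DEPENDENCIES` are placeholders for your environment:

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Liveness probe: can we open a TCP connection to the dependency?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Placeholder endpoints; substitute the real hosts and ports for your environment.
DEPENDENCIES = {
    "postgres": ("localhost", 5432),
    "redis": ("localhost", 6379),
}


def check_dependencies() -> dict:
    """Map each dependency name to a reachability boolean."""
    return {name: tcp_reachable(h, p) for name, (h, p) in DEPENDENCIES.items()}
```

A connect probe only proves the port answers; pair it with the service's own readiness signal where one exists.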
## 4. Monitor request, transaction, and system pressure

The monitoring scripts in the current repo point to these practical categories:

| Category | Example focus |
|---|---|
| request KPIs | busiest request paths and aggregate request volume |
| transaction logs | whether transaction handling is moving or backing up |
| system KPIs | CPU and memory pressure |
| disk and database storage | filesystem pressure and database growth |
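For the storage category, filesystem pressure is straightforward to measure with the standard library. A minimal sketch; the 85% threshold is an illustrative default, not a value from this guide:

```python
import shutil


def disk_used_pct(path: str = "/") -> float:
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100.0


def storage_alert(path: str = "/", threshold_pct: float = 85.0) -> bool:
    """True when filesystem pressure crosses the alert threshold."""
    return disk_used_pct(path) >= threshold_pct
```

Database growth needs a separate check against the database itself (for example, querying table or database sizes), since it can exhaust its own quota before the filesystem looks full.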
## 5. Turn health signals into alert classes

Use simple alert classes:

| Alert class | Trigger shape |
|---|---|
| readiness | operator or KYC readiness endpoint stays unhealthy |
| dependency | PCCS, AESM, Postgres, Redis, chain, or deployment-service becomes unavailable |
| progression stall | request index or transaction progression stops moving while the node is supposed to serve |
| storage pressure | disk or database growth threatens the environment’s stability |
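The four alert classes above can be expressed as a small classifier over the collected signals. This is a sketch; the input keys (`readiness_ok`, `deps_down`, and so on) are hypothetical names for whatever your collector produces:

```python
def classify(signal: dict) -> list[str]:
    """Map observed conditions onto the four alert classes.
    Input keys are assumptions for this sketch, not a fixed schema."""
    alerts = []
    if not signal.get("readiness_ok", True):
        alerts.append("readiness")
    if signal.get("deps_down"):           # list of unavailable dependencies
        alerts.append("dependency")
    if signal.get("progression_stalled"):
        alerts.append("progression stall")
    if signal.get("disk_used_pct", 0) >= signal.get("disk_threshold_pct", 85):
        alerts.append("storage pressure")
    return alerts
```

Keeping the classes this coarse makes routing simple: each class maps directly to one of the follow-up documents in the next step.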
## 6. Route follow-up work correctly

After an alert fires:

- use How to Handle Node Downtime and Recovery for the recovery flow
- use How to Troubleshoot SGX Attestation Issues when the underlying failure is attestation-related
- use Node Operations Reference for the fixed readiness and invariant facts behind the alert