Handle failures by class
- Parse error class from response and status code.
- For RateLimit, apply bounded retry with backoff and jitter.
- For ServiceUnavailable, retry with backoff and health checks.
- For SafetyFailure, do not retry blindly; fix payload first.
- For malformed request classes, fail fast and alert.
- On WebSocket disruptions, rebuild state from REST then resubscribe.
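The class-based handling above can be sketched as a small decision function plus a bounded retry loop. This is a minimal illustration, not the client's actual implementation: the action names, the `send` callable, the attempt limit, and the backoff constants are all assumptions, and "full jitter" is one common jitter strategy chosen here for simplicity.

```python
import random
import time
from typing import Callable, Optional

# Class names come from the list above; grouping them as "retryable" is an assumption.
RETRYABLE = {"RateLimit", "ServiceUnavailable"}


def backoff_delay(attempt: int, base: float = 0.25, cap: float = 8.0) -> float:
    """Full-jitter exponential backoff: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def decide(error_class: str, attempt: int, max_attempts: int = 5) -> str:
    """Return 'retry', 'give_up', 'fix_payload', or 'fail_fast' for one failed attempt."""
    if error_class in RETRYABLE:
        # Bounded: stop once the attempt budget is spent.
        return "retry" if attempt + 1 < max_attempts else "give_up"
    if error_class == "SafetyFailure":
        # Never resend unchanged; the payload has to be fixed first.
        return "fix_payload"
    # Malformed requests and unknown classes: fail fast and alert.
    return "fail_fast"


def call_with_recovery(
    send: Callable[[], Optional[str]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """`send` is a hypothetical callable: None on success, else an error class name."""
    for attempt in range(max_attempts):
        error_class = send()
        if error_class is None:
            return "ok"
        action = decide(error_class, attempt, max_attempts)
        if action != "retry":
            return action
        sleep(backoff_delay(attempt))
    return "give_up"
```

Keeping `decide` separate from the loop makes the retry policy testable without sleeping or touching the network.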
Recovery map
| Error family | Recovery move |
|---|---|
| malformed payload or schema failure | stop, repair serialization or field shape, then rebuild |
| signer, session, or encryption failure | repair identity or cryptographic material before retry |
| nonce, timestamp, or replay-window failure | regenerate replay fields instead of resending unchanged |
| safety failure | fix product, price, amount, collateral, or strategy state first |
| rate-limit or temporary service fault | bounded retry with backoff, jitter, and correlation logging |
| WebSocket disruption | reconnect, re-bootstrap state from REST, then resubscribe |
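The WebSocket row describes an ordering more than an algorithm, so a sketch mainly needs to show that order. In the example below, `connect`, `fetch_snapshot`, and `subscribe` are hypothetical callables supplied by the caller; none of them is a documented API.

```python
from typing import Any, Callable, Iterable, Tuple


def recover_stream(
    connect: Callable[[], Any],
    fetch_snapshot: Callable[[], dict],
    subscribe: Callable[[Any, str], None],
    channels: Iterable[str],
) -> Tuple[Any, dict]:
    """Reconnect, re-bootstrap state from REST, then resubscribe, in that order."""
    conn = connect()            # 1. reconnect the WebSocket
    state = fetch_snapshot()    # 2. rebuild local state from REST
    for channel in channels:    # 3. resubscribe to realtime updates
        subscribe(conn, channel)
    return conn, state
```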
Protocol-specific recovery decisions
| If you see… | Do this |
|---|---|
| InvalidRecvWindow | rebuild the request with a valid receive window; the public upper bound is 60000 ms |
| FutureTimestamp | correct client clock skew; requests beyond the server’s future cutoff are rejected rather than queued |
| ExpiredTimestamp | rebuild with a fresh timestamp instead of resending stale payloads |
| NotAcceptingRequests or ServiceUnavailable | retry with bounded backoff and health awareness; these can reflect leader unavailability, pending durable commit, or missing mark-price readiness |
| WebSocket disconnect or replay gap | reconnect, restore the latest state from REST, then resubscribe to realtime updates |
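For the timestamp and receive-window rows, "rebuild" means producing a new request rather than resending the old one. The sketch below shows one way that might look. The field names `timestamp` and `recv_window`, the default window, and the lower clamp of 1 ms are assumptions; only the 60000 ms upper bound comes from the table above.

```python
import time
from typing import Optional

MAX_RECV_WINDOW_MS = 60_000  # public upper bound stated in the table above


def rebuild_replay_fields(
    request: dict,
    recv_window_ms: int = 5_000,
    now_ms: Optional[int] = None,
) -> dict:
    """Return a copy of `request` with a fresh timestamp and a bounded receive window.

    `timestamp` and `recv_window` are hypothetical field names.
    """
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    # Clamp into [1, MAX_RECV_WINDOW_MS]; the lower bound is an assumption.
    window = max(1, min(recv_window_ms, MAX_RECV_WINDOW_MS))
    rebuilt = dict(request)  # shallow copy; the stale request is left untouched
    rebuilt["timestamp"] = now_ms
    rebuilt["recv_window"] = window
    return rebuilt
```

If these fields are covered by the request signature, the rebuilt request would presumably need to be signed again before sending. Note that a fresh timestamp does not help with FutureTimestamp unless the client clock itself is corrected.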
Minimum observability fields
- request hash or client nonce
- endpoint and transport
- error class and reason enum
- retry decision and attempt count
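One way to keep these fields together is a single structured record emitted per failed attempt. The following is a sketch only: the `FailureRecord` type, its field names, and the JSON encoding are assumptions, not a published log schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FailureRecord:
    request_id: str      # request hash or client nonce
    endpoint: str
    transport: str       # e.g. "rest" or "ws"
    error_class: str
    reason: str          # reason enum value as returned by the server
    retry_decision: str  # e.g. "retry", "give_up", "fail_fast"
    attempt: int


def to_log_line(record: FailureRecord) -> str:
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)
```

Logging the same `request_id` across attempts lets retries of one request be correlated later.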
Escalate after recovery fails
- Recheck Error Reference to confirm the class and recovery posture.
- Recheck Rate Limits and Access Tiers if retries are colliding with quota behavior.
- Use Troubleshooting when the failure remains ambiguous after classification.
- Use Support Channels when the published public routes do not explain the remaining behavior.