Runbook: SaaS Disconnect
Purpose / When to use
Use this runbook when the edge reports SaaS as disconnected (fairvisor_saas_reachable == 0) and config updates are no longer flowing.
Blast radius & risk level
- Risk level: medium
- Primary risk: stale policy state and delayed event delivery while traffic still passes through enforcement
Signals / symptoms
- fairvisor_saas_reachable == 0
- fairvisor status shows disconnected state
- config pull and heartbeat errors increase
Detection queries
fairvisor_saas_reachable
sum by (operation, status) (rate(fairvisor_saas_calls_total[5m]))
rate(fairvisor_events_sent_total{status="error"}[5m])
CLI checks:
fairvisor status --edge-url=http://localhost:8080
curl -sS http://localhost:8080/readyz
curl -sS http://localhost:8080/metrics | rg 'fairvisor_saas_reachable|fairvisor_saas_calls_total|fairvisor_events_sent_total'
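For scripted checks it can help to reduce the metrics output to a single number. The following is a minimal sketch, not part of the fairvisor CLI: metric_value is a hypothetical helper name, and it assumes the /metrics endpoint serves the standard Prometheus text format without trailing timestamps.

```shell
# Hypothetical helper: read Prometheus text exposition on stdin and print
# the value of the first sample whose metric name equals $1.
# Assumes samples look like "name value" or "name{labels} value" (no timestamp).
metric_value() {
  awk -v name="$1" '
    /^#/ { next }                      # skip HELP/TYPE comment lines
    {
      split($1, parts, "{")            # strip any label set from the name
      if (parts[1] == name) { print $NF; exit }
    }
  '
}
```

Example use against the local edge: curl -sS http://localhost:8080/metrics | metric_value fairvisor_saas_reachable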
Triage checklist
- Verify env values: FAIRVISOR_SAAS_URL, FAIRVISOR_EDGE_ID, FAIRVISOR_EDGE_TOKEN.
- Validate outbound DNS/TLS/connectivity to SaaS endpoint.
- Check token validity and recent rotation events.
- Confirm edge still has active bundle loaded.
- Assess event backlog risk window.
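The first two checklist items can be sketched as a small script. This is an illustration under assumptions: url_host, check_env, and check_saas_path are hypothetical helper names, the SaaS endpoint is assumed to listen on port 443, and getent, openssl, and curl are assumed to be installed on the edge host. The token value is never printed.

```shell
# Hypothetical helper: print the host part of a URL like https://host:port/path.
# Does not handle IPv6 literals.
url_host() {
  rest="${1#*://}"              # drop scheme
  rest="${rest%%/*}"            # drop path
  rest="${rest##*@}"            # drop userinfo, if any
  printf '%s\n' "${rest%%:*}"   # drop port
}

# Report which required variables are set, without echoing their values.
check_env() {
  for v in FAIRVISOR_SAAS_URL FAIRVISOR_EDGE_ID FAIRVISOR_EDGE_TOKEN; do
    eval "val=\${$v:-}"
    if [ -n "$val" ]; then echo "$v is set"; else echo "$v is MISSING"; fi
  done
}

# Probe DNS, TLS, and HTTP reachability of the SaaS endpoint (port 443 assumed).
check_saas_path() {
  host="$(url_host "$FAIRVISOR_SAAS_URL")"
  echo "DNS:"
  getent hosts "$host" || echo "DNS lookup failed for $host"
  echo "TLS:"
  openssl s_client -connect "$host:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -dates || echo "TLS handshake failed for $host"
  echo "HTTP:"
  curl -sS -o /dev/null -w '%{http_code}\n' --max-time 10 "$FAIRVISOR_SAAS_URL" \
    || echo "HTTP request failed"
}
```

Run check_env first, then check_saas_path. An HTTP status of 401 or 403 with healthy DNS and TLS points toward the token rather than the network.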
Mitigation playbook
Safe-first path:
- Keep edge serving with last known good bundle.
- Fix network/auth root cause.
- Confirm heartbeat and config pull resume.
If urgent policy change is needed during outage:
- Switch to standalone known-good local bundle process.
- Apply explicit rollback-safe file workflow.
- Resume SaaS mode only after connectivity stabilizes.
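One way to make the file workflow rollback-safe is to keep the previous bundle next to the live one and replace the live file by rename. The sketch below is an assumption-laden illustration: swap_bundle and rollback_bundle are hypothetical helpers, the bundle paths are placeholders, and how the edge picks up a local bundle file is not covered here.

```shell
# Hypothetical helper: replace the live bundle with a candidate, keeping a
# backup for rollback. $1 = candidate file, $2 = live bundle path.
swap_bundle() {
  candidate="$1"; live="$2"
  [ -s "$candidate" ] || { echo "candidate missing or empty" >&2; return 1; }
  if [ -f "$live" ]; then
    cp -p "$live" "$live.prev" || return 1   # keep last known good copy
  fi
  tmp="$live.tmp.$$"
  cp "$candidate" "$tmp" || return 1
  mv -f "$tmp" "$live"   # rename within one directory, so readers never see a partial file
}

# Hypothetical helper: restore the backup written by swap_bundle. $1 = live bundle path.
rollback_bundle() {
  live="$1"
  [ -f "$live.prev" ] || { echo "no previous bundle to restore" >&2; return 1; }
  cp -p "$live.prev" "$live.tmp.$$" && mv -f "$live.tmp.$$" "$live"
}
```

Validate the candidate bundle before calling swap_bundle, and record each swap so the change can be reconciled into the control plane later.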
Verification checklist
- fairvisor_saas_reachable stable at 1.
- SaaS call errors return to baseline.
- Expected bundle version/hash observed.
- Event send success recovers.
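"Stable at 1" can be checked by requiring several consecutive healthy polls instead of a single reading. The following is a minimal sketch: wait_stable and read_reachable are hypothetical helpers, and the poll count and interval are placeholders to align with the actual heartbeat period.

```shell
# Hypothetical helper: succeed once a command has printed "1" for $1
# consecutive polls. $2 = seconds between polls, $3 = maximum polls,
# remaining arguments = command that prints the current value.
wait_stable() {
  need="$1"; interval="$2"; max_polls="$3"; shift 3
  streak=0; polls=0
  while [ "$polls" -lt "$max_polls" ]; do
    val="$("$@" 2>/dev/null)" || val=""
    if [ "$val" = "1" ]; then streak=$((streak + 1)); else streak=0; fi
    if [ "$streak" -ge "$need" ]; then return 0; fi
    polls=$((polls + 1))
    sleep "$interval"
  done
  return 1
}

# Reads the gauge from the local edge; "+ 0" normalizes values such as 1.0 to 1.
read_reachable() {
  curl -sS http://localhost:8080/metrics \
    | awk '$1 == "fairvisor_saas_reachable" { print $2 + 0; exit }'
}
```

For example, wait_stable 5 15 40 read_reachable waits for five consecutive healthy readings, 15 seconds apart, and gives up after 40 polls. Any flap resets the streak.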
Exit criteria
- Continuous connectivity for at least one full polling/heartbeat cycle
- No unresolved auth/network errors
- Pending config drift reconciled
Rollback / recovery path
- If reconnect fails repeatedly, maintain standalone control with explicit change freeze.
- Keep the incident bridge open until the SaaS path is stable.
- Reconcile any temporary local changes back into control plane source of truth.
Post-incident notes
Record:
- outage duration
- root cause class (network/auth/control-plane)
- config drift count while disconnected
- event backlog loss or recovery details
Do not
- Do not redeploy frequent policy changes blindly while disconnected.
- Do not assume disconnected means enforcement is off.
- Do not rotate tokens without coordinated validation.