Runbook: SaaS Disconnect

Purpose / When to use

Use this runbook when edge reports SaaS disconnected (fairvisor_saas_reachable == 0) and config updates are no longer flowing.

Blast radius & risk level

Risk level: medium
Primary risk: stale policy state and delayed event delivery while traffic still passes through enforcement

Signals / symptoms

fairvisor_saas_reachable == 0
fairvisor status shows disconnected state
config pull and heartbeat errors increase

Detection queries

fairvisor_saas_reachable
sum by (operation, status) (rate(fairvisor_saas_calls_total[5m]))
rate(fairvisor_events_sent_total{status="error"}[5m])

CLI checks:

fairvisor status --edge-url=http://localhost:8080
curl -sS http://localhost:8080/readyz
curl -sS http://localhost:8080/metrics | rg 'fairvisor_saas_reachable|fairvisor_saas_calls_total|fairvisor_events_sent_total'

Triage checklist

Verify env values: FAIRVISOR_SAAS_URL, FAIRVISOR_EDGE_ID, FAIRVISOR_EDGE_TOKEN.
Validate outbound DNS/TLS/connectivity to SaaS endpoint.
Check token validity and recent rotation events.
Confirm edge still has active bundle loaded.
Assess event backlog risk window.

Mitigation playbook

Safe-first path:

Keep edge serving with last known good bundle.
Fix network/auth root cause.
Confirm heartbeat and config pull resume.

If urgent policy change is needed during outage:

Switch to standalone known-good local bundle process.
Apply explicit rollback-safe file workflow.
Resume SaaS mode only after connectivity stabilizes.

Verification checklist

fairvisor_saas_reachable stable at 1.
SaaS call errors return to baseline.
Expected bundle version/hash observed.
Event send success recovers.

Exit criteria

Continuous connectivity for at least one polling/heartbeat cycle set
No unresolved auth/network errors
Pending config drift reconciled

Rollback / recovery path

If reconnect fails repeatedly, maintain standalone control with explicit change freeze.
Keep incident bridge open until SaaS path stable.
Reconcile any temporary local changes back into control plane source of truth.

Post-incident notes

Record:

outage duration
root cause class (network/auth/control-plane)
config drift count while disconnected
event backlog loss or recovery details

Do not

Do not redeploy frequent policy changes blindly while disconnected.
Do not assume disconnected means enforcement is off.
Do not rotate tokens without coordinated validation.