Runbook: SaaS Disconnect

Purpose / When to use

Use this runbook when the edge reports that SaaS is disconnected (fairvisor_saas_reachable == 0) and config updates are no longer flowing.

Blast radius & risk level

  • Risk level: medium
  • Primary risk: stale policy state and delayed event delivery while traffic still passes through enforcement

Signals / symptoms

  • fairvisor_saas_reachable == 0
  • fairvisor status shows a disconnected state
  • config pull and heartbeat errors increase

Detection queries

fairvisor_saas_reachable
sum by (operation, status) (rate(fairvisor_saas_calls_total[5m]))
rate(fairvisor_events_sent_total{status="error"}[5m])

CLI checks:

fairvisor status --edge-url=http://localhost:8080
curl -sS http://localhost:8080/readyz
curl -sS http://localhost:8080/metrics | rg 'fairvisor_saas_reachable|fairvisor_saas_calls_total|fairvisor_events_sent_total'
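
The metrics check above can be turned into a single value for scripts or alert hooks. The following is a minimal sketch, not part of the fairvisor CLI: the function name saas_reachable_value is hypothetical, and it assumes the gauge appears as one sample in Prometheus text format, with or without labels, and that label values contain no spaces.

```shell
# Sketch: read Prometheus text-format metrics on stdin and print the value of
# fairvisor_saas_reachable. Returns non-zero if the metric is absent.
# Assumption: label values (if any) contain no whitespace.
saas_reachable_value() {
  awk '
    /^#/ { next }                                      # skip HELP/TYPE comment lines
    $1 ~ /^fairvisor_saas_reachable([{].*[}])?$/ {     # bare or labelled sample
      print $2; found = 1; exit
    }
    END { if (!found) exit 1 }                         # metric missing: report failure
  '
}
```

Example use against the local edge: curl -sS http://localhost:8080/metrics | saas_reachable_value. A printed 0 matches the disconnected signal; a non-zero exit status means the metric was not found in the output at all.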

Triage checklist

  1. Verify env values: FAIRVISOR_SAAS_URL, FAIRVISOR_EDGE_ID, FAIRVISOR_EDGE_TOKEN.
  2. Validate outbound DNS/TLS/connectivity to SaaS endpoint.
  3. Check token validity and recent rotation events.
  4. Confirm edge still has active bundle loaded.
  5. Assess event backlog risk window.
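
Steps 1 and 2 of the checklist above can be scripted. This is a hedged sketch: the function names check_saas_env and probe_saas_endpoint are hypothetical, and the probe only tests transport reachability of FAIRVISOR_SAAS_URL. It does not send the edge token, because the authentication scheme is not described in this runbook.

```shell
# Sketch: report which required variables are unset or empty.
# Prints one line per missing name and returns non-zero if any is missing.
check_saas_env() {
  cse_missing=0
  for cse_name in FAIRVISOR_SAAS_URL FAIRVISOR_EDGE_ID FAIRVISOR_EDGE_TOKEN; do
    eval "cse_value=\${$cse_name:-}"     # portable indirect lookup of a fixed name
    if [ -z "$cse_value" ]; then
      echo "missing: $cse_name"
      cse_missing=1
    fi
  done
  return "$cse_missing"
}

# Sketch: probe DNS, TLS, and HTTP reachability of the SaaS endpoint.
# Any HTTP response counts as reachable; only transport failures
# (DNS, connect, TLS, timeout) make curl exit non-zero.
probe_saas_endpoint() {
  [ -n "${FAIRVISOR_SAAS_URL:-}" ] || { echo "FAIRVISOR_SAAS_URL is not set" >&2; return 1; }
  curl -sS -o /dev/null --max-time 10 \
    -w 'http=%{http_code} dns=%{time_namelookup}s tls_done=%{time_appconnect}s\n' \
    "$FAIRVISOR_SAAS_URL"
}
```

Run check_saas_env && probe_saas_endpoint in the same environment the edge process uses. A slow or failing dns value points at resolution, while a curl TLS error points at certificates or interception on the outbound path.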

Mitigation playbook

Safe-first path:

  1. Keep edge serving with last known good bundle.
  2. Fix network/auth root cause.
  3. Confirm heartbeat and config pull resume.

If urgent policy change is needed during outage:

  1. Switch to standalone known-good local bundle process.
  2. Apply explicit rollback-safe file workflow.
  3. Resume SaaS mode only after connectivity stabilizes.
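
One way to make the file workflow in step 2 rollback-safe is to back up the active bundle and swap in the candidate with an atomic rename. The sketch below covers only the file handling. The function names and all paths are hypothetical, and how the edge is pointed at or reloads a local bundle is not covered here.

```shell
# Sketch: replace the active bundle file with a candidate, keeping a
# timestamped backup so the change can be reverted.
# Usage: swap_bundle <active_path> <candidate_path>
# Prints the backup path on success (nothing if there was no active file).
swap_bundle() {
  sb_active="$1"; sb_candidate="$2"; sb_backup=""
  [ -s "$sb_candidate" ] || { echo "candidate is missing or empty" >&2; return 1; }
  if [ -e "$sb_active" ]; then
    sb_backup="$sb_active.bak.$(date -u +%Y%m%dT%H%M%SZ)"
    cp -p "$sb_active" "$sb_backup" || return 1
  fi
  # Copy next to the target, then rename. A rename within one filesystem is
  # atomic, so a reader never sees a partially written bundle.
  cp "$sb_candidate" "$sb_active.tmp.$$" || return 1
  mv -f "$sb_active.tmp.$$" "$sb_active" || return 1
  [ -z "$sb_backup" ] || echo "$sb_backup"
}

# Sketch: restore a backup produced by swap_bundle.
# Usage: restore_bundle <active_path> <backup_path>
restore_bundle() {
  cp -p "$2" "$1.tmp.$$" && mv -f "$1.tmp.$$" "$1"
}
```

Record the printed backup path in the incident log so that the rollback target is unambiguous, and so that the temporary local change can later be reconciled with the control plane.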

Verification checklist

  1. fairvisor_saas_reachable stable at 1.
  2. SaaS call errors return to baseline.
  3. Expected bundle version/hash observed.
  4. Event send success recovers.
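
"Stable at 1" in step 1 is easier to confirm from several samples than from a single reading. The following sketch checks a series of sampled values. The function name all_samples_reachable, the default of 3 samples, and the 10-second interval in the usage comment are illustrative assumptions, not product defaults.

```shell
# Sketch: succeed only if stdin holds at least <min_samples> values (default 3)
# and every one of them is numerically 1, so "1" and "1.0" both count.
# Usage: all_samples_reachable [min_samples]
all_samples_reachable() {
  awk -v min="${1:-3}" '
    NF == 0 { next }                        # ignore blank lines
    { n++; if ($1 + 0 != 1) bad = 1 }       # any value other than 1 fails the run
    END { exit ((bad || n < min) ? 1 : 0) } # too few samples also fails
  '
}

# Example (assumes the gauge is exposed without labels):
#   for i in 1 2 3 4 5 6; do
#     curl -sS http://localhost:8080/metrics |
#       awk '$1 == "fairvisor_saas_reachable" { print $2 }'
#     sleep 10
#   done | all_samples_reachable 6
```

Choose the sample count and interval so that the window spans at least one polling and heartbeat cycle. Otherwise a brief reconnect could pass the check before the next scheduled call fails.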

Exit criteria

  • Continuous connectivity across at least one full set of polling and heartbeat cycles
  • No unresolved auth/network errors
  • Pending config drift reconciled

Rollback / recovery path

  1. If reconnect fails repeatedly, maintain standalone control with an explicit change freeze.
  2. Keep the incident bridge open until the SaaS path is stable.
  3. Reconcile any temporary local changes back into control plane source of truth.

Post-incident notes

Record:

  • outage duration
  • root cause class (network/auth/control-plane)
  • config drift count while disconnected
  • event backlog loss or recovery details

Do not

  • Do not blindly redeploy frequent policy changes while disconnected.
  • Do not assume disconnected means enforcement is off.
  • Do not rotate tokens without coordinated validation.