Runbook: Bad Bundle Rollback

Purpose / When to use

Use this runbook when a recent policy bundle introduces severe regressions: widespread rejects, incorrect matching, or unstable behavior.

Blast radius & risk level

  • Risk level: high
  • Primary risk: prolonged customer impact if rollback is delayed or breaks monotonic versioning

Signals / symptoms

  • Reject spike immediately after bundle deployment
  • Unexpected reject reasons or RateLimit* behavior on unaffected routes
  • Operators report that the active policy version does not match the expected version

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject",reason="no_bundle_loaded"}[5m])
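
During an incident the queries above can also be run from a shell through the Prometheus HTTP API. The following is a minimal sketch, not part of the Fairvisor tooling: the PROM_URL address and the prom_query helper name are assumptions for illustration, while /api/v1/query is the standard Prometheus instant-query endpoint.

```shell
#!/bin/sh
# Assumption: PROM_URL points at your Prometheus server; adjust as needed.
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query and print the raw JSON response.
# --get sends the URL-encoded data as a query string instead of a POST body.
prom_query() {
  curl -sS --get "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
}

# Queries taken verbatim from this runbook.
REJECTS_BY_REASON='sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))'
NO_BUNDLE_REJECTS='rate(fairvisor_decisions_total{action="reject",reason="no_bundle_loaded"}[5m])'

# Usage (requires network access to Prometheus):
#   prom_query "$REJECTS_BY_REASON"
#   prom_query "$NO_BUNDLE_REJECTS"
```

Keeping the queries in variables avoids retyping the label matchers under pressure and keeps them identical to what the dashboards use.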

Operational checks:

fairvisor status --edge-url=http://localhost:8080
curl -sS http://localhost:8080/readyz
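
While a rollback is propagating, the readiness check above usually has to be repeated. A small polling helper can do that; this is a sketch under the assumption that /readyz answers with a 2xx status once the edge is ready (the runbook does not state the exact response). The wait_ready name and its default attempt and delay values are hypothetical.

```shell
#!/bin/sh
# Poll a readiness URL until it returns a successful HTTP status or the
# attempts run out. Arguments: URL, max attempts, delay in seconds.
wait_ready() {
  url=$1; attempts=${2:-30}; delay=${3:-2}
  i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl exit non-zero on HTTP error statuses (4xx/5xx).
    if curl -fsS -o /dev/null --max-time 5 "$url"; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    # Sleep only between attempts, not after the final one.
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "not ready after $attempts attempts" >&2
  return 1
}

# Usage:
#   wait_ready http://localhost:8080/readyz 30 2
```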

Triage checklist

  1. Confirm incident correlates with latest bundle version.
  2. Identify last known-good version/hash.
  3. Freeze further policy changes until rollback completes.
  4. Assign rollback owner and verifier.

Mitigation playbook

SaaS mode rollback:

  1. Revert control plane bundle to last known-good content.
  2. Increment version as required by monotonic flow.
  3. Confirm all edges pull and apply rollback bundle.
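
Steps 1 and 2 together mean the known-good content is republished under a new, higher version instead of reusing its old version number. The sketch below illustrates the arithmetic only; it assumes bundle versions are plain integers, which this runbook does not specify, and next_rollback_version is a hypothetical helper name.

```shell
#!/bin/sh
# Assumption: bundle versions are plain non-negative integers.
# Print the version a rollback bundle should carry: one above the version of
# the bad bundle, so the monotonic flow accepts it as newer.
next_rollback_version() {
  current=$1   # version of the bad bundle currently deployed
  case "$current" in
    ''|*[!0-9]*)
      echo "version must be a non-negative integer: $current" >&2
      return 1 ;;
  esac
  # expr treats operands as decimal, so a leading zero is not read as octal.
  expr "$current" + 1
}

# Example: known-good content was v41 and the bad bundle is v42.
# The rollback republishes v41's content as v43, not as v41.
next_rollback_version 42
```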

Standalone rollback:

# Restore the last known-good policy, then signal nginx to reload it.
cp /etc/fairvisor/policy.backup.json /etc/fairvisor/policy.json
kill -HUP $(pidof nginx)
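
The copy above overwrites the active policy unconditionally. A slightly more defensive variant is sketched here: it refuses to restore from a missing or empty backup and keeps the faulty file for the post-incident diff. The restore_policy name and the ".faulty.<timestamp>" suffix are assumptions for illustration, not an existing Fairvisor convention.

```shell
#!/bin/sh
# Restore a backup policy over the active one, keeping the faulty file.
# Arguments: backup path, active path.
restore_policy() {
  backup=$1; active=$2
  # Refuse a missing or empty backup: restoring it would leave the edge
  # without a usable policy.
  if [ ! -s "$backup" ]; then
    echo "backup missing or empty: $backup" >&2
    return 1
  fi
  # Preserve the faulty bundle (hypothetical naming scheme) before overwriting.
  if [ -f "$active" ]; then
    cp -p "$active" "$active.faulty.$(date +%Y%m%dT%H%M%S)" || return 1
  fi
  cp "$backup" "$active"
}

# Usage, with the paths and reload command from this runbook:
#   restore_policy /etc/fairvisor/policy.backup.json /etc/fairvisor/policy.json \
#     && kill -HUP $(pidof nginx)
```

Chaining the reload with && means nginx is only signalled when the restore succeeded.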

Verification checklist

  1. Active bundle version/hash matches the expected rollback target.
  2. Reject distribution returns to near baseline.
  3. Critical endpoints validate with synthetic checks.
  4. No no_bundle_loaded rejects during recovery.
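
"Near baseline" in item 2 is easier to verify with an explicit tolerance. The sketch below compares a current reject rate with a pre-incident baseline; the near_baseline name and the default factor of 1.2 are illustrative assumptions, so pick a tolerance that suits your traffic.

```shell
#!/bin/sh
# Succeed if the current reject rate is at most `factor` times the baseline.
# Arguments: current rate, baseline rate, factor (default 1.2, illustrative).
near_baseline() {
  # awk handles the floating-point comparison; exit status 0 means "near".
  awk -v cur="$1" -v base="$2" -v f="${3:-1.2}" \
    'BEGIN { exit !(cur + 0 <= base * f) }'
}

# Usage: rates are requests per second read from the detection queries.
#   near_baseline 0.011 0.010 && echo "reject rate back near baseline"
```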

Exit criteria

  • Traffic stabilized
  • Rollback bundle confirmed across fleet
  • Incident bridge agrees policy state is safe

Rollback / recovery path

  1. Keep rollout freeze until root cause is documented.
  2. Add regression test reproducing failure pattern.
  3. Reattempt the rollout only via the shadow-first path.

Post-incident notes

Record:

  • faulty bundle diff summary
  • exact rollback version/hash
  • validation evidence and timestamps
  • test gap that allowed regression

Do not

  • Do not deploy a “quick fix” bundle before restoring known-good state.
  • Do not skip monotonic version discipline.
  • Do not close the incident before a regression test is added.