Runbook: Bad Bundle Rollback
Purpose / When to use
Use this runbook when a recent policy bundle introduces severe regressions: widespread rejects, incorrect matching, or unstable behavior.
Blast radius & risk level
- Risk level: high
- Primary risk: prolonged customer impact if rollback is delayed or non-monotonic
Signals / symptoms
- Reject spike immediately after bundle deployment
- Unexpected reject reasons or RateLimit* behavior on unaffected routes
- Operator reports mismatched expected vs active policy version
Detection queries
sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject",reason="no_bundle_loaded"}[5m])
Operational checks:
fairvisor status --edge-url=http://localhost:8080
curl -sS http://localhost:8080/readyz
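The detection query and the readiness check above can be wrapped in a small script for use during an incident. This is a minimal sketch, not part of the fairvisor tooling: the Prometheus address `http://localhost:9090` and the helper names `reject_query`, `prom_query`, and `edge_ready` are assumptions. Only the metric name and the `/readyz` endpoint come from this runbook. The query is sent to the standard Prometheus HTTP API endpoint `/api/v1/query`.

```shell
#!/usr/bin/env bash
# Sketch only: both default URLs are assumptions, override them for your environment.
PROM_URL="${PROM_URL:-http://localhost:9090}"
EDGE_URL="${EDGE_URL:-http://localhost:8080}"

# Print the reject-rate-by-reason query from this runbook for a given window (default 5m).
reject_query() {
  local window="${1:-5m}"
  printf 'sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[%s]))\n' "$window"
}

# Run an instant query against the Prometheus HTTP API and print the raw JSON response.
prom_query() {
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}

# Succeed only if the edge readiness endpoint answers without an HTTP error status.
edge_ready() {
  curl -sS -f -o /dev/null "${EDGE_URL}/readyz"
}

# Example usage during triage:
#   prom_query "$(reject_query 5m)"
#   edge_ready && echo "edge ready" || echo "edge NOT ready"
```

Keeping the query text in one function makes it easy to re-run the same check with a shorter window (for example `reject_query 1m`) while watching the rollback take effect.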
Triage checklist
- Confirm incident correlates with latest bundle version.
- Identify last known-good version/hash.
- Freeze further policy changes until rollback completes.
- Assign rollback owner and verifier.
Mitigation playbook
SaaS mode rollback:
- Revert control plane bundle to last known-good content.
- Increment version as required by monotonic flow.
- Confirm all edges pull and apply rollback bundle.
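The monotonic rule in the steps above means the rollback is published as a new, higher version that carries the old content, rather than by re-publishing the old version number. A minimal sketch of that arithmetic follows. It assumes bundle versions are plain non-negative integers, which this runbook does not state, and `rollback_version` is a hypothetical helper name.

```shell
# Hypothetical helper: compute the version to assign to the rollback bundle.
# Assumption: versions are non-negative integers without leading zeros;
# adapt this if your control plane uses a different scheme.
rollback_version() {
  local faulty="$1"
  case "$faulty" in
    ''|*[!0-9]*|0?*)
      echo "version must be a non-negative integer: '$faulty'" >&2
      return 1
      ;;
  esac
  # The rollback bundle must be newer than the faulty one,
  # even though its content is the older known-good content.
  echo $((faulty + 1))
}

# Example: if the faulty bundle was version 42, publish known-good content as:
#   rollback_version 42
```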
Standalone rollback:
cp /etc/fairvisor/policy.backup.json /etc/fairvisor/policy.json
kill -HUP $(pidof nginx)
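A more defensive variant of the two commands above checks the backup before overwriting the active policy and keeps a copy of the faulty file for the post-incident diff. This is a sketch under assumptions: the function name `rollback_policy`, the `.faulty.<timestamp>` suffix, and the use of `jq` or `python3` for a JSON syntax check are illustrative and not part of fairvisor.

```shell
# Hypothetical helper: restore a known-good policy file over the active one.
# Usage: rollback_policy <backup> <active>
rollback_policy() {
  local backup="$1" active="$2"

  # Refuse to proceed if the backup is missing or empty.
  if [ ! -s "$backup" ]; then
    echo "backup missing or empty: $backup" >&2
    return 1
  fi

  # Syntax-check the backup when a validator is installed.
  # Assumption: the policy file is JSON, as its .json name suggests.
  if command -v jq >/dev/null 2>&1; then
    jq empty "$backup" >/dev/null 2>&1 || { echo "backup is not valid JSON" >&2; return 1; }
  elif command -v python3 >/dev/null 2>&1; then
    python3 -m json.tool "$backup" >/dev/null 2>&1 || { echo "backup is not valid JSON" >&2; return 1; }
  fi

  # Preserve the faulty policy for the post-incident diff, if one exists.
  if [ -f "$active" ]; then
    cp -p "$active" "$active.faulty.$(date +%Y%m%dT%H%M%S)" || return 1
  fi

  cp "$backup" "$active"
}

# Example with the paths from this runbook; reload only if the copy succeeded:
#   rollback_policy /etc/fairvisor/policy.backup.json /etc/fairvisor/policy.json \
#     && kill -HUP $(pidof nginx)
```

Chaining the reload with `&&` avoids signalling nginx when the restore step failed, which would otherwise reload the faulty policy again.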
Verification checklist
- Active bundle version/hash matches the expected rollback target.
- Reject distribution returns near baseline.
- Critical endpoints validate with synthetic checks.
- No no_bundle_loaded rejects during recovery.
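In standalone mode, one way to check the first item of this checklist is to compare checksums of the active policy file and the rollback target. This is a file-level proxy only: it assumes the bundle hash reported by fairvisor is not readily available to script against. `same_content` is a hypothetical helper, and `sha256sum` comes from GNU coreutils.

```shell
# Hypothetical helper: succeed only if both files are readable
# and have identical SHA-256 digests.
same_content() {
  local a b
  # Reading via stdin keeps the file name out of the digest line.
  a="$(sha256sum < "$1")" || return 1
  b="$(sha256sum < "$2")" || return 1
  [ "$a" = "$b" ]
}

# Example with the paths from this runbook:
#   if same_content /etc/fairvisor/policy.backup.json /etc/fairvisor/policy.json; then
#     echo "active policy matches rollback target"
#   else
#     echo "MISMATCH: active policy differs from rollback target" >&2
#   fi
```

A matching checksum shows only that the file on disk was restored. Whether the running edge has applied it still needs to be confirmed with `fairvisor status` and the reject metrics.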
Exit criteria
- Traffic stabilized
- Rollback bundle confirmed across fleet
- Incident bridge agrees policy state is safe
Rollback / recovery path
- Keep rollout freeze until root cause is documented.
- Add regression test reproducing failure pattern.
- Reattempt rollout only with shadow-first path.
Post-incident notes
Record:
- faulty bundle diff summary
- exact rollback version/hash
- validation evidence and timestamps
- test gap that allowed regression
Do not
- Do not deploy a “quick fix” bundle before restoring known-good state.
- Do not skip monotonic version discipline.
- Do not close incident without regression test addition.