Runbook: Global Shadow + Kill-Switch Override
Purpose / When to use
Use this runbook only during major incident windows when normal mitigation cannot restore service quickly enough.
Blast radius & risk level
- Risk level: critical
- Primary risk: reducing active enforcement guardrails while incident mode is enabled
Signals / symptoms
- Sustained customer-impacting rejects across broad traffic
- Rollback path unavailable or too slow for immediate recovery
- Collateral impact from current controls exceeds the acceptable level
Detection queries
rate(fairvisor_decisions_total{action="reject"}[5m])
max_over_time(fairvisor_global_shadow_active[1m])
max_over_time(fairvisor_kill_switch_override_active[1m])
Triage checklist
- Confirm incident commander approval for break-glass mode.
- Set a short TTL (15-30 minutes) and assign a named owner.
- Confirm rollback bundle is prepared before activation.
- Decide scenario:
  - Scenario A: global shadow only
  - Scenario B: global shadow + kill-switch override
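The TTL and scenario checks above could be automated before a bundle is activated. The following is a minimal sketch, not part of any Fairvisor tooling: the function name `check_override_bundle` is hypothetical, and it assumes `expires_at` is always a UTC timestamp with a `Z` suffix, as in the example bundles in this runbook. The named owner is not a field in those bundles, so it is not checked here.

```python
from datetime import datetime, timedelta, timezone

OVERRIDE_BLOCKS = ("global_shadow", "kill_switch_override")
MAX_TTL = timedelta(minutes=30)  # upper bound of the 15-30 minute window


def check_override_bundle(bundle, now):
    """Return a list of problems found in the enabled override blocks.

    Hypothetical helper: `bundle` is the parsed JSON bundle, `now` is a
    timezone-aware datetime.
    """
    problems = []
    for name in OVERRIDE_BLOCKS:
        block = bundle.get(name)
        if not block or not block.get("enabled"):
            continue  # block absent or disabled, nothing to check
        if not block.get("reason"):
            problems.append(f"{name}: missing reason")
        raw = block.get("expires_at")
        if not raw:
            problems.append(f"{name}: missing expires_at")
            continue
        # Assumption: timestamps look like 2026-02-20T19:00:00Z (UTC).
        expires = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if expires <= now:
            problems.append(f"{name}: expires_at is not in the future")
        elif expires - now > MAX_TTL:
            problems.append(f"{name}: TTL exceeds 30 minutes")
    shadow = bundle.get("global_shadow") or {}
    override = bundle.get("kill_switch_override") or {}
    # The runbook defines only Scenario A (shadow) and B (shadow + override).
    if override.get("enabled") and not shadow.get("enabled"):
        problems.append("kill_switch_override enabled without global_shadow")
    return problems
```

An empty list means the bundle passes these checks; it does not replace incident commander approval or confirmation that the rollback bundle is ready.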
Mitigation playbook
Scenario A (preferred first):
{
  "bundle_version": 1101,
  "global_shadow": {
    "enabled": true,
    "reason": "inc-2026-02-20-reject-spike",
    "expires_at": "2026-02-20T19:00:00Z"
  }
}
Expected effect:
- client traffic opens (allow path)
- rule evaluation continues in shadow semantics
- kill-switch remains active
Scenario B (true break-glass):
{
  "bundle_version": 1102,
  "global_shadow": {
    "enabled": true,
    "reason": "inc-2026-02-20-global-open",
    "expires_at": "2026-02-20T19:15:00Z"
  },
  "kill_switch_override": {
    "enabled": true,
    "reason": "inc-2026-02-20-global-open",
    "expires_at": "2026-02-20T19:15:00Z"
  }
}
Expected effect:
- client traffic opens
- kill-switch pre-check is skipped
- telemetry/metrics still available for evaluation paths
Verification checklist
- Active bundle version confirmed.
- Override metrics show expected activation state.
- Reject rate drops to acceptable range.
- Core endpoints and latency recover.
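The override-metric check can be scripted against the results of the two `max_over_time(...)` detection queries. The sketch below assumes each result has already been fetched as a parsed Prometheus instant-query response (the standard `status` / `data.result` JSON shape), and that the gauges read 1 when active and 0 or absent when inactive. The function names are hypothetical.

```python
def gauge_is_active(response):
    """True if any series in a Prometheus instant-query response is non-zero."""
    if response.get("status") != "success":
        raise ValueError("query did not succeed")
    # Each vector sample is {"metric": {...}, "value": [timestamp, "value"]}.
    # An empty result (no series) is treated as inactive.
    return any(float(s["value"][1]) != 0 for s in response["data"]["result"])


def override_state_matches(scenario, shadow_resp, override_resp):
    """Compare both override gauges with the chosen scenario ("A" or "B")."""
    # Scenario A: shadow on, override off. Scenario B: both on.
    expected = {"A": (True, False), "B": (True, True)}[scenario]
    actual = (gauge_is_active(shadow_resp), gauge_is_active(override_resp))
    return actual == expected
```

A mismatch in Scenario A (override gauge active) is worth treating as an incident in its own right, since it means the kill-switch pre-check is being skipped without that having been approved.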
Exit criteria
- root cause fixed
- safe normal policy prepared
- override TTL no longer needed
Rollback / recovery path
Deploy a bundle without the override blocks (or with enabled: false in each) and an incremented bundle_version:
{
  "bundle_version": 1103,
  "policies": [ ... ],
  "kill_switches": [ ... ]
}
Then confirm:
- fairvisor_global_shadow_active == 0
- fairvisor_kill_switch_override_active == 0
- reject behavior returns to normal policy semantics
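A rollback bundle of the kind described in this section could be derived mechanically from the active one. This is a minimal sketch under stated assumptions: `make_rollback_bundle` is a hypothetical helper, `bundle_version` is assumed to be an integer that only needs to increase by one, and the remaining fields (such as `policies` and `kill_switches`) are assumed to already hold the safe normal policy.

```python
import copy

OVERRIDE_BLOCKS = ("global_shadow", "kill_switch_override")


def make_rollback_bundle(active_bundle):
    """Return a copy of the bundle without override blocks, version bumped.

    Hypothetical helper: the input dict is left unmodified.
    """
    bundle = copy.deepcopy(active_bundle)
    for name in OVERRIDE_BLOCKS:
        bundle.pop(name, None)  # drop the block if present
    # Assumption: versions are integers and the next one is current + 1.
    bundle["bundle_version"] = active_bundle["bundle_version"] + 1
    return bundle
```

Because the triage checklist requires the rollback bundle to exist before activation, a helper like this would be run when the override bundle is prepared, not at exit time.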
Post-incident notes
Record:
- reason for break-glass activation
- scenario used (A or B)
- TTL duration and owner
- residual risk while active
- follow-up controls to avoid repeat
Do not
- Do not enable Scenario B before validating Scenario A is insufficient.
- Do not run overrides without explicit expiry and owner.
- Do not leave overrides active after incident closure.