Runbook: Global Shadow + Kill-Switch Override

Purpose / When to use

Use this runbook only during major incident windows when normal mitigation cannot restore service quickly enough.

Blast radius & risk level

Risk level: critical
Primary risk: reducing active enforcement guardrails while incident mode is enabled

Signals / symptoms

sustained customer-impacting rejects across broad traffic
rollback path unavailable or too slow for immediate recovery
collateral from current controls exceeds acceptable impact

Detection queries

rate(fairvisor_decisions_total{action="reject"}[5m])
max_over_time(fairvisor_global_shadow_active[1m])
max_over_time(fairvisor_kill_switch_override_active[1m])
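During an incident it can help to have the three queries above ready as direct links. The sketch below builds instant-query URLs for the standard Prometheus HTTP API endpoint (/api/v1/query). The base address and the function name are assumptions for illustration, not part of this runbook's tooling.

```python
from urllib.parse import urlencode

# The three detection expressions from this runbook, keyed by purpose.
DETECTION_QUERIES = {
    "reject_rate": 'rate(fairvisor_decisions_total{action="reject"}[5m])',
    "global_shadow_active": "max_over_time(fairvisor_global_shadow_active[1m])",
    "kill_switch_override_active": "max_over_time(fairvisor_kill_switch_override_active[1m])",
}


def instant_query_url(base_url: str, expr: str) -> str:
    """Build a Prometheus instant-query URL for one PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': expr})}"


if __name__ == "__main__":
    # Hypothetical Prometheus address; replace with the real one.
    for name, expr in DETECTION_QUERIES.items():
        print(name, instant_query_url("http://prometheus.example:9090", expr))
```

The URLs only read state; they do not change any override.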

Triage checklist

Confirm incident commander approval for break-glass mode.
Set a short TTL (15-30 minutes) and assign a named owner.
Confirm the rollback bundle is prepared before activation.
Decide which scenario to apply:
- Scenario A: global shadow only
- Scenario B: global shadow + kill-switch override
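The checklist above can be expressed as a small pre-activation gate. This is a minimal sketch, assuming the answers are collected by hand; the BreakGlassRequest type and its field names are hypothetical and not part of any existing tool.

```python
from dataclasses import dataclass


@dataclass
class BreakGlassRequest:
    ic_approved: bool      # incident commander approval recorded
    owner: str             # named owner of the override
    ttl_minutes: int       # requested override lifetime
    rollback_ready: bool   # rollback bundle prepared
    scenario: str          # "A" or "B"


def preflight_problems(req: BreakGlassRequest) -> list[str]:
    """Return the unmet checklist items; an empty list means proceed."""
    problems = []
    if not req.ic_approved:
        problems.append("missing incident commander approval")
    if not req.owner.strip():
        problems.append("no named owner")
    if not 15 <= req.ttl_minutes <= 30:
        problems.append("TTL outside 15-30 minutes")
    if not req.rollback_ready:
        problems.append("rollback bundle not prepared")
    if req.scenario not in ("A", "B"):
        problems.append("scenario must be A or B")
    return problems
```

Returning the full list of problems, rather than failing on the first one, lets the operator fix everything in one pass.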

Mitigation playbook

Scenario A (preferred first):

{
  "bundle_version": 1101,
  "global_shadow": {
    "enabled": true,
    "reason": "inc-2026-02-20-reject-spike",
    "expires_at": "2026-02-20T19:00:00Z"
  }
}

Expected effect:

client traffic is allowed through (allow path)
rule evaluation continues in shadow semantics
kill-switch remains active

Scenario B (true break-glass):

{
  "bundle_version": 1102,
  "global_shadow": {
    "enabled": true,
    "reason": "inc-2026-02-20-global-open",
    "expires_at": "2026-02-20T19:15:00Z"
  },
  "kill_switch_override": {
    "enabled": true,
    "reason": "inc-2026-02-20-global-open",
    "expires_at": "2026-02-20T19:15:00Z"
  }
}

Expected effect:

client traffic is allowed through (allow path)
kill-switch pre-check is skipped
telemetry and metrics remain available for evaluation paths
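Hand-editing expires_at under pressure is error-prone, so both scenario bundles could be generated from an activation time and a TTL. The sketch below reproduces the bundle shapes shown above. The function name and parameters are hypothetical, and it assumes both override blocks in Scenario B share one reason and one expiry, as in the example.

```python
from datetime import datetime, timedelta, timezone


def build_override_bundle(version, reason, activated_at, ttl_minutes, scenario):
    """Return the override bundle dict for Scenario "A" or "B".

    activated_at must be a timezone-aware datetime.
    """
    if scenario not in ("A", "B"):
        raise ValueError("scenario must be 'A' or 'B'")
    if activated_at.tzinfo is None:
        raise ValueError("activated_at must be timezone-aware")
    expires = (activated_at + timedelta(minutes=ttl_minutes)).astimezone(timezone.utc)
    block = {
        "enabled": True,
        "reason": reason,
        "expires_at": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
    }
    bundle = {"bundle_version": version, "global_shadow": dict(block)}
    if scenario == "B":
        # Scenario B adds the kill-switch override with the same reason and expiry.
        bundle["kill_switch_override"] = dict(block)
    return bundle
```

For example, with an assumed activation time of 2026-02-20T18:45:00Z and a 30-minute TTL, build_override_bundle(1102, "inc-2026-02-20-global-open", ..., 30, "B") yields the Scenario B bundle above; json.dumps(bundle, indent=2) renders it for deployment.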

Verification checklist

Active bundle version confirmed.
Override metrics show expected activation state.
Reject rate drops to acceptable range.
Core endpoints and latency recover.

Exit criteria

root cause fixed
safe normal policy prepared
override TTL no longer needed

Rollback / recovery path

Deploy a bundle without the override blocks (or with enabled: false in each block) and an incremented bundle_version:

{
  "bundle_version": 1103,
  "policies": [ ... ],
  "kill_switches": [ ... ]
}

Then confirm:

fairvisor_global_shadow_active == 0
fairvisor_kill_switch_override_active == 0
reject behavior returns to normal policy semantics
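The rollback step and the two gauge checks can be sketched as follows. This is an illustration only: the helper names are hypothetical, and it assumes the normal policy content is available as a dict with the same top-level keys as the bundle above.

```python
OVERRIDE_KEYS = ("global_shadow", "kill_switch_override")


def rollback_bundle(active: dict, normal_policy: dict) -> dict:
    """Build the rollback bundle: next version, normal policy, no override blocks."""
    bundle = {"bundle_version": active["bundle_version"] + 1}
    for key, value in normal_policy.items():
        # Drop override blocks and any stale version carried by the policy dict.
        if key not in OVERRIDE_KEYS and key != "bundle_version":
            bundle[key] = value
    return bundle


def overrides_cleared(global_shadow_active: float,
                      kill_switch_override_active: float) -> bool:
    """True when both override gauges read 0."""
    return global_shadow_active == 0 and kill_switch_override_active == 0
```

The third check, that reject behavior matches normal policy semantics, still needs a human to compare against the pre-incident baseline.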

Post-incident notes

Record:

reason for break-glass activation
scenario used (A or B)
TTL duration and owner
residual risk while active
follow-up controls to avoid repeat

Do not

Do not enable Scenario B before validating that Scenario A is insufficient.
Do not run overrides without explicit expiry and owner.
Do not leave overrides active after incident closure.