Runbook: Global Shadow + Kill-Switch Override

Purpose / When to use

Use this runbook only during major incident windows when normal mitigation cannot restore service quickly enough.

Blast radius & risk level

  • Risk level: critical
  • Primary risk: reducing active enforcement guardrails while incident mode is enabled

Signals / symptoms

  • Sustained customer-impacting rejects across a broad share of traffic
  • Rollback path unavailable or too slow for immediate recovery
  • Collateral impact from current controls exceeds what is acceptable

Detection queries

rate(fairvisor_decisions_total{action="reject"}[5m])
max_over_time(fairvisor_global_shadow_active[1m])
max_over_time(fairvisor_kill_switch_override_active[1m])
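
As a minimal sketch, the three queries above could be turned into Prometheus instant-query URLs as shown here. The /api/v1/query endpoint is part of the Prometheus HTTP API; the base URL http://prometheus.example:9090 and the helper name instant_query_url are placeholders for illustration and are not part of this runbook's tooling.

```python
from urllib.parse import urlencode

# Queries copied verbatim from the "Detection queries" section.
DETECTION_QUERIES = {
    "reject_rate": 'rate(fairvisor_decisions_total{action="reject"}[5m])',
    "global_shadow": "max_over_time(fairvisor_global_shadow_active[1m])",
    "kill_switch_override": "max_over_time(fairvisor_kill_switch_override_active[1m])",
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL (GET /api/v1/query?query=...)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Hypothetical base URL; substitute the real Prometheus address.
    for name, promql in DETECTION_QUERIES.items():
        print(name, instant_query_url("http://prometheus.example:9090", promql))
```

The URLs can be fetched with any HTTP client; URL-encoding the query matters because PromQL selectors contain braces, quotes, and brackets.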

Triage checklist

  1. Confirm incident commander approval for break-glass mode.
  2. Set a short TTL (15-30 minutes) and assign a named owner.
  3. Confirm rollback bundle is prepared before activation.
  4. Decide scenario:
    • Scenario A: global shadow only
    • Scenario B: global shadow + kill-switch override
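
The checklist above could be enforced as a pre-flight check before any override bundle is built. The sketch below is one possible shape; BreakGlassRequest and triage_problems are hypothetical names introduced for illustration, and only the 15-30 minute TTL bounds come from this runbook.

```python
from dataclasses import dataclass

# TTL bounds taken from triage step 2 (15-30 minutes).
MIN_TTL_MINUTES = 15
MAX_TTL_MINUTES = 30


@dataclass
class BreakGlassRequest:
    """Hypothetical record of the triage answers; not a Fairvisor type."""
    ic_approved: bool
    owner: str
    ttl_minutes: int
    rollback_bundle_ready: bool
    scenario: str  # "A" or "B"


def triage_problems(req: BreakGlassRequest) -> list[str]:
    """Return the unmet triage items; an empty list means ready to proceed."""
    problems = []
    if not req.ic_approved:
        problems.append("incident commander approval missing")
    if not MIN_TTL_MINUTES <= req.ttl_minutes <= MAX_TTL_MINUTES:
        problems.append("TTL must be 15-30 minutes")
    if not req.owner.strip():
        problems.append("named owner missing")
    if not req.rollback_bundle_ready:
        problems.append("rollback bundle not prepared")
    if req.scenario not in ("A", "B"):
        problems.append("scenario must be A or B")
    return problems
```

Returning a list of problems rather than raising on the first failure lets the operator see every missing item at once, which is useful under incident pressure.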

Mitigation playbook

Scenario A (preferred first):

{
  "bundle_version": 1101,
  "global_shadow": {
    "enabled": true,
    "reason": "inc-2026-02-20-reject-spike",
    "expires_at": "2026-02-20T19:00:00Z"
  }
}

Expected effect:

  • client traffic opens (allow path)
  • rule evaluation continues in shadow semantics
  • kill-switch remains active
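
To avoid hand-typing the expiry timestamp, the Scenario A override block could be generated from the current time and the agreed TTL. This is a sketch only: scenario_a_bundle is a hypothetical helper, and it emits just the fields shown in the example above. A real bundle presumably also carries the normal policy content, which is omitted here.

```python
import json
from datetime import datetime, timedelta, timezone


def scenario_a_bundle(version: int, reason: str, now: datetime, ttl_minutes: int) -> dict:
    """Build a Scenario A bundle whose expiry is `now` plus the TTL, in UTC.

    `now` must be timezone-aware; a naive datetime would be read as local time.
    """
    expires = now.astimezone(timezone.utc) + timedelta(minutes=ttl_minutes)
    return {
        "bundle_version": version,
        "global_shadow": {
            "enabled": True,
            "reason": reason,
            # Same timestamp format as the examples in this runbook.
            "expires_at": expires.strftime("%Y-%m-%dT%H:%M:%SZ"),
        },
    }


if __name__ == "__main__":
    bundle = scenario_a_bundle(
        1101, "inc-2026-02-20-reject-spike", datetime.now(timezone.utc), 30
    )
    print(json.dumps(bundle, indent=2))
```

With now set to 2026-02-20 18:30 UTC and a 30-minute TTL, the generated expires_at is 2026-02-20T19:00:00Z, matching the example bundle.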

Scenario B (true break-glass):

{
  "bundle_version": 1102,
  "global_shadow": {
    "enabled": true,
    "reason": "inc-2026-02-20-global-open",
    "expires_at": "2026-02-20T19:15:00Z"
  },
  "kill_switch_override": {
    "enabled": true,
    "reason": "inc-2026-02-20-global-open",
    "expires_at": "2026-02-20T19:15:00Z"
  }
}

Expected effect:

  • client traffic opens
  • kill-switch pre-check is skipped
  • telemetry and metrics remain available for evaluation paths
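
Before a break-glass bundle is deployed, its override blocks could be linted for a reason and an unexpired expires_at. The sketch below assumes the timestamp format used in the examples above; override_problems is a hypothetical helper. Treating kill_switch_override without global_shadow as a problem is also an assumption, made because this runbook defines only Scenarios A and B.

```python
from datetime import datetime, timezone

OVERRIDE_KEYS = ("global_shadow", "kill_switch_override")
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def override_problems(bundle: dict, now: datetime) -> list[str]:
    """Lint enabled override blocks; `now` must be timezone-aware."""
    problems = []
    for key in OVERRIDE_KEYS:
        block = bundle.get(key) or {}
        if not block.get("enabled"):
            continue  # absent or disabled blocks need no reason or expiry
        if not block.get("reason"):
            problems.append(f"{key}: reason missing")
        try:
            expires = datetime.strptime(str(block.get("expires_at") or ""), TIMESTAMP_FORMAT)
        except ValueError:
            problems.append(f"{key}: expires_at missing or malformed")
            continue
        if expires.replace(tzinfo=timezone.utc) <= now:
            problems.append(f"{key}: expires_at is not in the future")
    shadow_on = bool((bundle.get("global_shadow") or {}).get("enabled"))
    override_on = bool((bundle.get("kill_switch_override") or {}).get("enabled"))
    if override_on and not shadow_on:
        # Assumption: the runbook only defines A (shadow) and B (shadow + override).
        problems.append("kill_switch_override enabled without global_shadow")
    return problems
```

An empty result means every enabled override carries a reason and a future expiry, which lines up with the rule against running overrides without an explicit expiry.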

Verification checklist

  1. Active bundle version confirmed.
  2. Override metrics show expected activation state.
  3. Reject rate drops to acceptable range.
  4. Core endpoints and latency recover.
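
Items 2 and 3 of the checklist above could be checked mechanically once the metric values have been read. The sketch below assumes the two override gauges report 1 when active and 0 otherwise (the rollback section only confirms the 0 case), and it takes the acceptable reject rate as a caller-supplied threshold because this runbook does not define one. Both function names are hypothetical.

```python
def active_scenario(global_shadow_active: float, kill_switch_override_active: float) -> str:
    """Map the two override gauges to the runbook's scenario labels."""
    shadow = global_shadow_active >= 1
    override = kill_switch_override_active >= 1
    if shadow and override:
        return "B"
    if shadow:
        return "A"
    # Override without global shadow is not a scenario this runbook defines.
    return "unexpected" if override else "none"


def verification_problems(expected: str, shadow: float, override: float,
                          reject_rate: float, max_reject_rate: float) -> list[str]:
    """Return unmet verification items; an empty list means both checks pass."""
    problems = []
    actual = active_scenario(shadow, override)
    if actual != expected:
        problems.append(f"expected scenario {expected}, metrics indicate {actual}")
    if reject_rate > max_reject_rate:
        problems.append("reject rate still above acceptable range")
    return problems
```

Bundle version and endpoint latency (items 1 and 4) are not covered by this sketch and still need to be confirmed separately.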

Exit criteria

  • root cause fixed
  • safe normal policy prepared
  • override TTL no longer needed

Rollback / recovery path

Deploy a bundle without the override blocks (or with enabled: false in each) and increment the bundle version:

{
  "bundle_version": 1103,
  "policies": [ ... ],
  "kill_switches": [ ... ]
}

Then confirm:

  • fairvisor_global_shadow_active == 0
  • fairvisor_kill_switch_override_active == 0
  • reject behavior returns to normal policy semantics
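
One way to prepare the rollback bundle ahead of activation is to derive it from the bundle that will be active: drop both override blocks and bump the version. The sketch below assumes the top-level keys shown in this runbook; rollback_bundle is a hypothetical helper, and a real rollback may instead deploy a separately prepared normal policy.

```python
import copy

OVERRIDE_KEYS = ("global_shadow", "kill_switch_override")


def rollback_bundle(active: dict) -> dict:
    """Return a copy of `active` without override blocks and with version + 1."""
    rolled = copy.deepcopy(active)  # leave the caller's bundle untouched
    for key in OVERRIDE_KEYS:
        rolled.pop(key, None)
    rolled["bundle_version"] = active["bundle_version"] + 1
    return rolled
```

Applied to the Scenario B bundle (version 1102), this yields version 1103 with no override blocks, matching the shape of the rollback example above.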

Post-incident notes

Record:

  • reason for break-glass activation
  • scenario used (A or B)
  • TTL duration and owner
  • residual risk while active
  • follow-up controls to avoid repeat

Do not

  • Do not enable Scenario B before confirming that Scenario A is insufficient.
  • Do not run overrides without explicit expiry and owner.
  • Do not leave overrides active after incident closure.