Runbook: Kill-Switch Incident Response

Purpose / When to use

Use this runbook to rapidly block a specific abusive actor, token, tenant, or route when immediate containment is required.

Blast radius & risk level

  • Risk level: high
  • Primary risk: over-broad scope causing collateral 429s

Signals / symptoms

  • Active abuse pattern tied to an identifiable descriptor value
  • Containment requires a sharp increase in rejects within minutes, not hours
  • Existing throttles are too slow or too permissive

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject",reason="kill_switch"}[5m])
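
As a rough sketch, the first query can be run through the Prometheus HTTP API and reduced to a per-reason reject rate. The /api/v1/query endpoint and the instant-vector response shape are standard Prometheus; the base URL and the helper names are placeholders chosen for illustration, not part of Fairvisor.

```python
import json
from urllib.parse import urlencode

# First detection query from this runbook, unchanged.
REJECTS_BY_REASON = (
    'sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))'
)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL (base_url is a placeholder)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def rates_by_reason(response_body: str) -> dict[str, float]:
    """Map each `reason` label to its reject rate from an instant-vector response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        sample["metric"].get("reason", ""): float(sample["value"][1])
        for sample in payload["data"]["result"]
    }
```

Fetch the URL with any HTTP client and pass the response body to rates_by_reason; a kill_switch rate near zero before deploy gives the baseline to compare against later.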

Header verification:

curl -i -X POST http://localhost:8080/v1/decision \
  -H 'X-Original-Method: POST' \
  -H 'X-Original-URI: /v1/critical' \
  -H 'x-api-key: <suspect-key>'
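
When the check has to be repeated for several keys, the output of the curl -i call above can be inspected programmatically. The sketch below only parses the status line and headers that curl prints; the function names are illustrative, and the only Fairvisor-specific detail it relies on is the X-Fairvisor-Reason header named in this runbook.

```python
def parse_response_head(raw_response: str) -> tuple[str, dict[str, str]]:
    """Split `curl -i` output into (status code, lowercase-keyed headers)."""
    head = raw_response.replace("\r\n", "\n").split("\n\n", 1)[0]
    lines = head.split("\n")
    status_parts = lines[0].split()
    status = status_parts[1] if len(status_parts) > 1 else ""
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return status, headers


def rejected_by_kill_switch(raw_response: str) -> bool:
    """True when the response carries X-Fairvisor-Reason: kill_switch."""
    _, headers = parse_response_head(raw_response)
    return headers.get("x-fairvisor-reason") == "kill_switch"
```

Header names are compared case-insensitively because HTTP/2 responses print them in lowercase.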

Triage checklist

  1. Identify the smallest possible scope key/value pair.
  2. Decide whether a route-scoped switch is sufficient.
  3. Set an explicit incident ID in reason.
  4. Set a short TTL first (expires_at).
  5. Assign a rollback owner before deploying.

Mitigation playbook

Minimal scoped entry:

{
  "scope_key": "header:x-api-key",
  "scope_value": "key_compromised_123",
  "route": "/v1/critical",
  "reason": "inc-2026-02-20-abuse",
  "expires_at": "2026-02-20T19:00:00Z"
}
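
A pre-deploy check along these lines can catch an entry that is missing its expiry or incident ID. This is a minimal sketch, not a Fairvisor validator: the field names come from the entry above, while the 4-hour first-pass limit and the function name are assumptions to be replaced with local policy.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("scope_key", "scope_value", "reason", "expires_at")
MAX_FIRST_PASS_TTL = timedelta(hours=4)  # assumed team policy, not a Fairvisor limit


def validate_entry(entry: dict, now: datetime) -> list[str]:
    """Return a list of problems; an empty list means the entry passes."""
    problems = [f"missing {field}" for field in REQUIRED_FIELDS if not entry.get(field)]
    if not entry.get("route"):
        problems.append("no route set: confirm the broader scope is intended")
    expires_at = entry.get("expires_at")
    if expires_at:
        try:
            # fromisoformat on older Pythons does not accept a trailing "Z".
            expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
        except ValueError:
            problems.append("expires_at is not an ISO 8601 timestamp")
        else:
            if expiry.tzinfo is None:
                problems.append("expires_at has no timezone")
            elif expiry <= now:
                problems.append("expires_at is in the past")
            elif expiry - now > MAX_FIRST_PASS_TTL:
                problems.append("TTL exceeds the first-pass limit")
    return problems
```

Call it with a timezone-aware clock, for example validate_entry(entry, datetime.now(timezone.utc)), and block the deploy if the returned list is not empty.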

Execution order:

  1. Add kill-switch entry to bundle.
  2. Increment bundle_version.
  3. Deploy bundle.
  4. Verify containment and collateral.
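
Steps 1 and 2 can be scripted so the version bump is never forgotten. The sketch below assumes the bundle is a JSON object with an integer bundle_version and a kill_switches list; the kill_switches key and the function name are hypothetical, so adjust them to the real bundle layout.

```python
import copy


def add_kill_switch(bundle: dict, entry: dict) -> dict:
    """Return a copy of the bundle with the entry appended and the version bumped."""
    updated = copy.deepcopy(bundle)
    # "kill_switches" is an assumed key name for the list of entries.
    switches = updated.setdefault("kill_switches", [])
    for existing in switches:
        same_scope = (
            existing.get("scope_key") == entry["scope_key"]
            and existing.get("scope_value") == entry["scope_value"]
            and existing.get("route") == entry.get("route")
        )
        if same_scope:
            raise ValueError("a kill-switch for this scope and route already exists")
    switches.append(dict(entry))
    updated["bundle_version"] = int(updated.get("bundle_version", 0)) + 1
    return updated
```

Returning a copy leaves the previous bundle untouched, which keeps it available as the rollback artifact.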

Verification checklist

  1. X-Fairvisor-Reason: kill_switch appears for expected traffic.
  2. Reject volume rises only for targeted scope.
  3. Critical non-target flows remain healthy.
  4. No new descriptor-missing errors appear for the switch key.
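
Checks 2 and 3 amount to comparing reject rates per scope before and after the deploy. The sketch below assumes those rates have already been collected into dictionaries keyed by scope value; the thresholds and the function name are illustrative choices, not Fairvisor defaults.

```python
def find_collateral(
    before: dict[str, float],
    after: dict[str, float],
    target: str,
    tolerance: float = 0.2,
    min_delta: float = 0.5,
) -> dict[str, float]:
    """Return non-target scopes whose reject rate rose beyond the allowed margin.

    A scope is flagged when its increase exceeds both `min_delta` (absolute,
    rejects/s) and `tolerance` times its baseline rate.
    """
    flagged = {}
    for scope, rate in after.items():
        if scope == target:
            continue  # the targeted scope is expected to rise
        baseline = before.get(scope, 0.0)
        increase = rate - baseline
        if increase > max(min_delta, baseline * tolerance):
            flagged[scope] = increase
    return flagged
```

An empty result supports check 2; any flagged scope is a candidate for collateral and a reason to narrow the switch or roll it back.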

Exit criteria

  • Abuse contained
  • Downstream systems stable
  • Safe permanent policy fix prepared

Rollback / recovery path

  1. Remove the switch or let its TTL expire.
  2. Deploy a new bundle version.
  3. Confirm kill_switch reject rate returns to expected baseline.
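
For explicit removal, the incident ID stored in reason is a convenient handle. This sketch assumes the same hypothetical bundle layout as elsewhere in this runbook's examples (a kill_switches list and an integer bundle_version); the function name is illustrative.

```python
import copy


def remove_kill_switches(bundle: dict, reason: str) -> dict:
    """Return a copy of the bundle without entries for `reason`.

    The version is bumped only when something was actually removed.
    """
    updated = copy.deepcopy(bundle)
    switches = updated.get("kill_switches", [])  # assumed key name
    kept = [entry for entry in switches if entry.get("reason") != reason]
    if len(kept) != len(switches):
        updated["kill_switches"] = kept
        updated["bundle_version"] = int(updated.get("bundle_version", 0)) + 1
    return updated
```

If the returned bundle_version is unchanged, no entry matched the incident ID and there is nothing new to deploy.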

Post-incident notes

Record:

  • exact scope and TTL used
  • collateral findings
  • time-to-containment
  • permanent control added after incident

Do not

  • Do not start with global scope unless incident severity justifies it.
  • Do not deploy kill-switches without expiry on first pass.
  • Do not skip verification of non-target traffic.