Runbook: Kill-Switch Incident Response

Purpose / When to use

Use this runbook to rapidly block a specific abusive actor, token, tenant, or route when immediate containment is required.

Blast radius & risk level

Risk level: high
Primary risk: over-broad scope causing collateral 429s

Signals / symptoms

Active abuse pattern tied to an identifiable descriptor value
Containment requires a sharp increase in rejects within minutes, not hours
Existing throttles are too slow or too permissive

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject",reason="kill_switch"}[5m])

Header verification:

curl -i -X POST http://localhost:8080/v1/decision \
  -H 'X-Original-Method: POST' \
  -H 'X-Original-URI: /v1/critical' \
  -H 'x-api-key: <suspect-key>'

Triage checklist

Identify the smallest possible scope key/value pair.
Decide whether a route-scoped switch is sufficient.
Set an explicit incident ID in reason.
Set a short TTL first (expires_at).
Prepare rollback owner before deploy.
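
The checklist above can be partly automated before the entry reaches a bundle. The following is a minimal sketch, not part of Fairvisor: the function name, the four-hour first-pass TTL limit, and the treatment of a missing route as a warning are all assumptions made for illustration. The field names match the entry format used in this runbook.

```python
from datetime import datetime, timedelta, timezone

REQUIRED_FIELDS = ("scope_key", "scope_value", "reason", "expires_at")
# Assumed first-pass limit for illustration; not a Fairvisor default.
MAX_FIRST_PASS_TTL = timedelta(hours=4)


def triage_problems(entry, now=None):
    """Return a list of checklist violations for a proposed kill-switch entry."""
    now = now or datetime.now(timezone.utc)
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not entry.get(name)]
    if not entry.get("route"):
        problems.append("no route set: confirm a wider switch is intended")
    expires_at = entry.get("expires_at")
    if expires_at:
        # fromisoformat() rejects a trailing "Z" before Python 3.11, so normalize it.
        expiry = datetime.fromisoformat(expires_at.replace("Z", "+00:00"))
        if expiry.tzinfo is None:
            problems.append("expires_at has no timezone")
        elif expiry <= now:
            problems.append("expires_at is in the past")
        elif expiry - now > MAX_FIRST_PASS_TTL:
            problems.append("TTL longer than first-pass limit")
    return problems
```

An empty list means the entry passes these checks; it does not replace the human decisions about scope size and rollback ownership.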

Mitigation playbook

Minimal scoped entry:

{
  "scope_key": "header:x-api-key",
  "scope_value": "key_compromised_123",
  "route": "/v1/critical",
  "reason": "inc-2026-02-20-abuse",
  "expires_at": "2026-02-20T19:00:00Z"
}

Execution order:

Add kill-switch entry to bundle.
Increment bundle_version.
Deploy bundle.
Verify containment and collateral.
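
The first two steps of the execution order can be sketched as a small helper. The bundle layout here is hypothetical: this runbook does not show the bundle schema, so the top-level `kill_switches` list and the integer `bundle_version` field are assumptions, as is the function name. Adapt the keys to the real bundle format before use.

```python
import copy


def add_kill_switch(bundle, entry):
    """Return a new bundle with the entry appended and bundle_version incremented."""
    # Copy first so the currently deployed bundle stays untouched for rollback.
    updated = copy.deepcopy(bundle)
    # "kill_switches" is an assumed key name, not a confirmed schema field.
    updated.setdefault("kill_switches", []).append(entry)
    updated["bundle_version"] = int(updated["bundle_version"]) + 1
    return updated
```

Returning a copy instead of mutating in place keeps the previous bundle available as the rollback artifact.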

Verification checklist

X-Fairvisor-Reason: kill_switch appears for expected traffic.
Reject volume rises only for targeted scope.
Critical non-target flows remain healthy.
No descriptor-missing regression on switch key.
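
The first check can be scripted against the output of `curl -i`, once with the suspect key and once with a known-good key. This is a minimal sketch: the helper name is hypothetical, and it only parses text that has already been captured, it does not make the request itself.

```python
def fairvisor_reason(raw_response):
    """Extract the X-Fairvisor-Reason header value from `curl -i` output, or None."""
    # Headers end at the first blank line; normalize CRLF line endings first.
    head = raw_response.replace("\r\n", "\n").split("\n\n", 1)[0]
    for line in head.split("\n")[1:]:  # skip the HTTP status line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "x-fairvisor-reason":
            return value.strip()
    return None
```

Targeted traffic should yield `kill_switch`; a non-target request that yields `kill_switch` is evidence of collateral impact.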

Exit criteria

Abuse contained
Downstream systems stable
Safe permanent policy fix prepared

Rollback / recovery path

Remove switch or let TTL expire.
Deploy new bundle version.
Confirm kill_switch reject rate returns to expected baseline.
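
Explicit removal can be sketched the same way as the mitigation step, keyed on the incident ID stored in reason. As before, the `kill_switches` list and integer `bundle_version` are an assumed bundle layout, and the function name is hypothetical.

```python
import copy


def remove_kill_switch(bundle, reason):
    """Return a new bundle without entries for the given reason, version incremented."""
    updated = copy.deepcopy(bundle)
    existing = updated.get("kill_switches", [])  # assumed key name
    kept = [entry for entry in existing if entry.get("reason") != reason]
    if len(kept) == len(existing):
        # Fail loudly so a typo in the incident ID does not ship a no-op bundle.
        raise ValueError(f"no kill-switch entry with reason {reason!r}")
    updated["kill_switches"] = kept
    updated["bundle_version"] = int(updated["bundle_version"]) + 1
    return updated
```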

Post-incident notes

Record:

exact scope and TTL used
collateral findings
time-to-containment
permanent control added after incident

Do not

Do not start with global scope unless incident severity justifies it.
Do not deploy kill-switches without expiry on first pass.
Do not skip verification of non-target traffic.
