Runbook: Kill-Switch Incident Response
Purpose / When to use
Use this runbook to rapidly block a specific abusive actor, token, tenant, or route when immediate containment is required.
Blast radius & risk level
- Risk level: high
- Primary risk: over-broad scope causing collateral 429s
Signals / symptoms
- Active abuse pattern tied to an identifiable descriptor value
- A sharp increase in rejects is needed to contain the abuse in minutes, not hours
- Existing throttles are too slow or too permissive
Detection queries
sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject",reason="kill_switch"}[5m])
Header verification:
curl -i -X POST http://localhost:8080/v1/decision \
-H 'X-Original-Method: POST' \
-H 'X-Original-URI: /v1/critical' \
-H 'x-api-key: <suspect-key>'
Triage checklist
- Identify smallest possible scope key/value pair.
- Decide whether route-scoped switch is sufficient.
- Set explicit incident id in reason.
- Set short TTL first (expires_at).
- Prepare rollback owner before deploy.
Mitigation playbook
Minimal scoped entry:
{
"scope_key": "header:x-api-key",
"scope_value": "key_compromised_123",
"route": "/v1/critical",
"reason": "inc-2026-02-20-abuse",
"expires_at": "2026-02-20T19:00:00Z"
}
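Before the entry goes into a bundle, the triage checklist can be applied to it mechanically. The following is a minimal sketch, not part of any Fairvisor tooling: the function name check_entry, the set of fields treated as required, and the 4-hour max_ttl default are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: these four fields are treated as mandatory for a first-pass switch.
REQUIRED_FIELDS = ("scope_key", "scope_value", "reason", "expires_at")


def check_entry(entry, now, max_ttl=timedelta(hours=4)):
    """Return a list of problems; an empty list means the entry passes.

    max_ttl is a hypothetical policy limit, not a documented default.
    """
    problems = [f"missing {name}" for name in REQUIRED_FIELDS if not entry.get(name)]
    if not entry.get("route"):
        problems.append("no route set: confirm a wider scope is intended")
    raw = entry.get("expires_at")
    if raw:
        # Swap the trailing "Z" so fromisoformat also parses it before Python 3.11.
        expires = datetime.fromisoformat(raw.replace("Z", "+00:00"))
        if expires.tzinfo is None:
            problems.append("expires_at has no timezone")
        elif expires <= now:
            problems.append("expires_at is in the past")
        elif expires - now > max_ttl:
            problems.append("TTL longer than max_ttl")
    return problems


if __name__ == "__main__":
    entry = {
        "scope_key": "header:x-api-key",
        "scope_value": "key_compromised_123",
        "route": "/v1/critical",
        "reason": "inc-2026-02-20-abuse",
        "expires_at": "2026-02-20T19:00:00Z",
    }
    print(check_entry(entry, datetime.now(timezone.utc)))
```

Returning a list of problems rather than raising lets the operator see every issue at once, which matters when time is short.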
Execution order:
- Add kill-switch entry to bundle.
- Increment bundle_version.
- Deploy bundle.
- Verify containment and collateral.
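The first two steps of the execution order can be sketched as a single pure function. This runbook does not describe the bundle layout, so the sketch assumes the bundle is a JSON object with an integer bundle_version and a list under a hypothetical kill_switches key; adjust both to the real schema. The function name add_kill_switch is also illustrative.

```python
import copy


def add_kill_switch(bundle, entry):
    """Return a new bundle with the entry appended and the version bumped.

    Assumes a hypothetical layout: {"bundle_version": int, "kill_switches": [...]}.
    """
    updated = copy.deepcopy(bundle)  # keep the original intact as the rollback copy
    updated.setdefault("kill_switches", []).append(entry)
    updated["bundle_version"] = bundle["bundle_version"] + 1
    return updated


if __name__ == "__main__":
    current = {"bundle_version": 41, "kill_switches": []}
    switch = {
        "scope_key": "header:x-api-key",
        "scope_value": "key_compromised_123",
        "route": "/v1/critical",
        "reason": "inc-2026-02-20-abuse",
        "expires_at": "2026-02-20T19:00:00Z",
    }
    print(add_kill_switch(current, switch)["bundle_version"])
```

Working on a copy leaves the previous bundle unchanged, so the rollback owner still has the exact pre-incident state to redeploy.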
Verification checklist
- X-Fairvisor-Reason: kill_switch appears for expected traffic.
- Reject volume rises only for targeted scope.
- Critical non-target flows remain healthy.
- No descriptor-missing regression on switch key.
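The header check in this list can be automated by comparing one targeted request against one control request that should not match the switch. A minimal sketch, assuming the response headers have already been collected (for example from the curl probe in the detection section) into plain dictionaries; the helper names reason_header and containment_ok are hypothetical.

```python
def reason_header(headers):
    """Case-insensitive lookup of X-Fairvisor-Reason; None when absent."""
    for name, value in headers.items():
        if name.lower() == "x-fairvisor-reason":
            return value.strip()
    return None


def containment_ok(target_headers, control_headers):
    """True when the target is rejected by the kill switch and the control is not."""
    return (
        reason_header(target_headers) == "kill_switch"
        and reason_header(control_headers) != "kill_switch"
    )


if __name__ == "__main__":
    target = {"X-Fairvisor-Reason": "kill_switch"}
    control = {"Content-Type": "application/json"}
    print(containment_ok(target, control))
```

A single pair of probes only spot-checks the scope; the reject-rate queries remain the authoritative view of collateral impact.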
Exit criteria
- Abuse contained
- Downstream systems stable
- Safe permanent policy fix prepared
Rollback / recovery path
- Remove switch or let TTL expire.
- Deploy new bundle version.
- Confirm kill_switch reject rate returns to expected baseline.
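Removing the switch can be keyed on the incident id stored in reason, which is why the triage checklist asks for an explicit one. The sketch below assumes a hypothetical bundle layout with an integer bundle_version and a list under a kill_switches key (this runbook does not define the schema), and the function name remove_kill_switch is illustrative.

```python
import copy


def remove_kill_switch(bundle, incident_id):
    """Return a new bundle without entries for incident_id, version bumped.

    Assumes a hypothetical layout: {"bundle_version": int, "kill_switches": [...]}.
    """
    updated = copy.deepcopy(bundle)
    updated["kill_switches"] = [
        entry
        for entry in updated.get("kill_switches", [])
        if entry.get("reason") != incident_id
    ]
    updated["bundle_version"] = bundle["bundle_version"] + 1
    return updated


if __name__ == "__main__":
    live = {
        "bundle_version": 42,
        "kill_switches": [
            {"reason": "inc-2026-02-20-abuse", "scope_value": "key_compromised_123"},
        ],
    }
    print(remove_kill_switch(live, "inc-2026-02-20-abuse"))
```

Matching on the incident id removes every entry added for that incident while leaving switches from other incidents in place.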
Post-incident notes
Record:
- exact scope and TTL used
- collateral findings
- time-to-containment
- permanent control added after incident
Do not
- Do not start with global scope unless incident severity justifies it.
- Do not deploy kill-switches without expiry on first pass.
- Do not skip verification of non-target traffic.