Runbook: Reject Spike

Purpose / When to use

Use this runbook when reject volume rises sharply and customers report increased 429 responses.

Blast radius & risk level

Risk level: high
Primary risk: customer-impacting traffic denial and cascading retries

Signals / symptoms

Sharp increase in fairvisor_decisions_total{action="reject"}
Elevated Retry-After values
API clients report sustained 429s

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject"}[5m])
rate(fairvisor_retry_after_bucket_total[5m])
rate(fairvisor_descriptor_missing_total[5m])
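
The by-reason query above can also be run programmatically. The following is a minimal sketch, assuming the metrics are served by a Prometheus server reachable over its standard HTTP API (`/api/v1/query`). The base URL `PROM_URL` and the helper names are hypothetical and not part of Fairvisor.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical Prometheus base URL (assumption); replace with your own endpoint.
PROM_URL = "http://prometheus.example.internal:9090"

# First detection query from this runbook.
REJECTS_BY_REASON = (
    'sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))'
)


def build_query_url(base_url: str, promql: str) -> str:
    """Return the Prometheus instant-query URL for a PromQL expression."""
    params = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def rank_reasons(payload: dict) -> list[tuple[str, float]]:
    """Sort (reason, rejects per second) pairs from an instant-vector response, highest first."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    rows = []
    for series in payload["data"]["result"]:
        reason = series["metric"].get("reason", "<none>")
        # Each sample is [unix_timestamp, "value as string"].
        rows.append((reason, float(series["value"][1])))
    return sorted(rows, key=lambda row: row[1], reverse=True)


def fetch_ranked_reasons(base_url: str = PROM_URL) -> list[tuple[str, float]]:
    """Run the by-reason query against Prometheus and rank the result."""
    url = build_query_url(base_url, REJECTS_BY_REASON)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return rank_reasons(json.load(resp))
```

The first entry returned by `rank_reasons` is the top reject reason bucket to carry into triage.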

Triage checklist

Identify top reject reason bucket.
Map reason to policy/rule using logs/metrics; for per-request attribution use debug session headers.
Confirm whether a bundle rollout preceded the spike.
Check for descriptor-missing regressions on identity keys (a rise in fairvisor_descriptor_missing_total).
Check whether the kill-switch changed recently.

Reason map:

token_bucket_exceeded -> burst/rate pressure
budget_exceeded -> quota period exhaustion
kill_switch -> active block entry
circuit_breaker_open -> spend velocity trip
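
For scripted triage, the reason map can be kept as a small lookup table. This is an illustrative sketch only: the dictionary mirrors the four entries listed here, and the `interpret` helper and its fallback message are hypothetical.

```python
# Reason -> interpretation, copied from the reason map in this runbook.
REASON_MAP = {
    "token_bucket_exceeded": "burst/rate pressure",
    "budget_exceeded": "quota period exhaustion",
    "kill_switch": "active block entry",
    "circuit_breaker_open": "spend velocity trip",
}


def interpret(reason: str) -> str:
    """Return the runbook interpretation for a reject reason."""
    # Reasons not listed in the runbook are flagged rather than guessed at.
    return REASON_MAP.get(
        reason, "unmapped reason: investigate via logs or a debug session"
    )
```

Returning an explicit "unmapped" message keeps an unfamiliar reason visible instead of silently folding it into a known category.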

Mitigation playbook

Safe-first options:

Move impacted policy to shadow temporarily.
Raise thresholds conservatively on the hot path.
Narrow selectors/scope for over-broad rules.

Aggressive options:

Roll back to known-good bundle.
Enable global shadow incident mode with a short TTL.
Add a targeted kill-switch entry only if abuse is confirmed.

Verification checklist

Reject rate trends back to baseline.
Top reject reason aligns with intended control.
No new latency or allow-path regressions.
Retry-After distribution returns to the expected range.
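
"Back to baseline" is easier to verify with an explicit rule. The sketch below assumes one possible rule: the reject rate counts as settled once the most recent samples all sit within a tolerance of the pre-incident baseline. The 20% tolerance and the three-sample window are illustrative assumptions, not values prescribed by this runbook.

```python
def settled(
    samples: list[float],
    baseline: float,
    tolerance: float = 0.2,  # assumed: allow up to 20% above baseline
    window: int = 3,         # assumed: require the last 3 samples to agree
) -> bool:
    """Return True when the last `window` samples are all <= baseline * (1 + tolerance)."""
    if baseline < 0 or tolerance < 0 or window < 1:
        raise ValueError("baseline and tolerance must be >= 0 and window >= 1")
    if len(samples) < window:
        # Not enough data yet to call the incident settled.
        return False
    limit = baseline * (1.0 + tolerance)
    return all(value <= limit for value in samples[-window:])
```

Requiring several consecutive samples avoids declaring recovery on a single low reading in the middle of a still-noisy spike.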

Exit criteria

Reject spike resolved
Policy delta documented and peer-reviewed
Follow-up hardening tasks created

Rollback / recovery path

Reapply last known-good bundle version.
Confirm that readyz is healthy and the policy version is consistent.
Re-run triage queries for 15-30 minutes.
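
The 15-30 minute observation window can be automated with a simple polling loop. This is a sketch under assumptions: `check` is a hypothetical callable that re-runs the triage queries and returns True while the results look healthy, and the one-minute interval is illustrative. The clock and sleep functions are injectable so the loop can be exercised without waiting.

```python
import time
from typing import Callable


def observe(
    check: Callable[[], bool],
    duration_s: float = 15 * 60,  # lower bound of the 15-30 minute window
    interval_s: float = 60,       # assumed polling interval
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() every interval_s seconds until duration_s has elapsed.

    Returns False as soon as a check fails, True if every check passed.
    """
    deadline = clock() + duration_s
    while True:
        if not check():
            return False
        if clock() >= deadline:
            return True
        sleep(interval_s)
```

A False result means the rollback did not hold for the full window and the incident should stay open.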

Post-incident notes

Record:

top reason and impacted policy/rule (from logs/debug session)
mitigation chosen and timing
customer impact window
permanent prevention action

Do not

Do not tune multiple unrelated policies at once.
Do not keep emergency overrides without TTL/owner.
Do not ignore descriptor-missing spikes while tuning limits.
