Runbook: Reject Spike
Purpose / When to use
Use this runbook when reject volume rises sharply and customers report increased 429 responses.
Blast radius & risk level
- Risk level: high
- Primary risk: customer-impacting traffic denial and cascading retries
Signals / symptoms
- Sharp increase in fairvisor_decisions_total{action="reject"}
- Elevated Retry-After values
- API clients report sustained 429s
Detection queries
sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject"}[5m])
rate(fairvisor_retry_after_bucket_total[5m])
rate(fairvisor_descriptor_missing_total[5m])
Triage checklist
- Identify top reject reason bucket.
- Map reason to policy/rule using logs/metrics; for per-request attribution use debug session headers.
- Confirm whether a bundle rollout preceded the spike.
- Check descriptor-missing regressions for identity keys.
- Check whether the kill-switch configuration changed recently.
Reason map:
- token_bucket_exceeded -> burst/rate pressure
- budget_exceeded -> quota period exhaustion
- kill_switch -> active block entry
- circuit_breaker_open -> spend velocity trip
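The first triage step and the reason map can be combined into a small helper. The following is a minimal sketch, not part of any Fairvisor tooling: the function name top_reject_reason and the input dict are hypothetical, and the per-reason rates are assumed to have been read off the `sum by (reason)` detection query by hand or by a script of your own.

```python
# Reason descriptions copied from the runbook's reason map.
REASON_MAP = {
    "token_bucket_exceeded": "burst/rate pressure",
    "budget_exceeded": "quota period exhaustion",
    "kill_switch": "active block entry",
    "circuit_breaker_open": "spend velocity trip",
}


def top_reject_reason(rates_by_reason):
    """Return (reason, rate, meaning) for the highest reject rate.

    rates_by_reason is a hypothetical dict of reason -> rejects/sec,
    for example {"budget_exceeded": 12.5, "kill_switch": 0.3}.
    """
    if not rates_by_reason:
        raise ValueError("no reject reasons supplied")
    # Pick the entry with the largest rate value.
    reason, rate = max(rates_by_reason.items(), key=lambda kv: kv[1])
    # Reasons missing from the map are flagged rather than guessed at.
    meaning = REASON_MAP.get(reason, "unmapped reason: check logs")
    return reason, rate, meaning
```

A reason that is not in the map is reported as unmapped, which is a prompt to go back to logs or a debug session instead of tuning a policy blindly.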
Mitigation playbook
Safe-first options:
- Move impacted policy to shadow temporarily.
- Raise thresholds conservatively for the hot path.
- Narrow selectors/scope for over-broad rules.
Aggressive options:
- Roll back to known-good bundle.
- Use global shadow incident mode with short TTL.
- Add targeted kill-switch only if abuse is confirmed.
Verification checklist
- Reject rate trends back to baseline.
- Top reject reason aligns with intended control.
- No new latency or allow-path regressions.
- Retry-After distribution returns to expected range.
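"Back to baseline" is easier to agree on during an incident if it is written down as a number. The sketch below is one possible definition, assuming you already know the pre-incident reject rate; the function name back_to_baseline and the 20% default tolerance are illustrative assumptions, not values taken from this runbook.

```python
def back_to_baseline(current_rate, baseline_rate, tolerance=0.2):
    """Return True when current_rate is within tolerance of baseline_rate.

    Both rates are rejects/sec. The check passes when
    current_rate <= baseline_rate * (1 + tolerance).
    The 0.2 default is an assumed value; pick one that fits your traffic.
    """
    if current_rate < 0 or baseline_rate < 0:
        raise ValueError("rates must be non-negative")
    return current_rate <= baseline_rate * (1.0 + tolerance)
```

For example, with a baseline of 10 rejects/sec and the default tolerance, a current rate of 11 passes and a current rate of 15 does not.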
Exit criteria
- Reject spike resolved
- Policy delta documented and peer-reviewed
- Follow-up hardening tasks created
Rollback / recovery path
- Reapply last known-good bundle version.
- Confirm readyz and policy version consistency.
- Re-run triage queries for 15-30 minutes.
Post-incident notes
Record:
- top reason and impacted policy/rule (from logs/debug session)
- mitigation chosen and timing
- customer impact window
- permanent prevention action
Do not
- Do not tune multiple unrelated policies at once.
- Do not keep emergency overrides without TTL/owner.
- Do not ignore descriptor-missing spikes while tuning limits.