Runbook: Reject Spike

Purpose / When to use

Use this runbook when reject volume rises sharply and customers report increased 429 responses.

Blast radius & risk level

  • Risk level: high
  • Primary risk: customer-impacting traffic denial and cascading retries

Signals / symptoms

  • Sharp increase in fairvisor_decisions_total{action="reject"}
  • Elevated Retry-After values
  • API clients report sustained 429s

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_decisions_total{action="reject"}[5m])
rate(fairvisor_retry_after_bucket_total[5m])
rate(fairvisor_descriptor_missing_total[5m])

Triage checklist

  1. Identify top reject reason bucket.
  2. Map reason to policy/rule using logs/metrics; for per-request attribution use debug session headers.
  3. Confirm whether a bundle rollout preceded the spike.
  4. Check descriptor-missing regressions for identity keys.
  5. Check if kill-switch recently changed.

Reason map:

  • token_bucket_exceeded -> burst/rate pressure
  • budget_exceeded -> quota period exhaustion
  • kill_switch -> active block entry
  • circuit_breaker_open -> spend velocity trip

Mitigation playbook

Safe-first options:

  1. Move impacted policy to shadow temporarily.
  2. Increase conservative thresholds for hot path.
  3. Narrow selectors/scope for over-broad rules.

Aggressive options:

  1. Roll back to known-good bundle.
  2. Use global shadow incident mode with short TTL.
  3. Add targeted kill-switch only if abuse is confirmed.

Verification checklist

  1. Reject rate trends back to baseline.
  2. Top reject reason aligns with intended control.
  3. No new latency or allow-path regressions.
  4. Retry-after distribution returns to expected range.

Exit criteria

  • Reject spike resolved
  • Policy delta documented and peer-reviewed
  • Follow-up hardening tasks created

Rollback / recovery path

  1. Reapply last known-good bundle version.
  2. Confirm readyz and policy version consistency.
  3. Re-run triage queries for 15-30 minutes.

Post-incident notes

Record:

  • top reason and impacted policy/rule (from logs/debug session)
  • mitigation chosen and timing
  • customer impact window
  • permanent prevention action

Do not

  • Do not tune multiple unrelated policies at once.
  • Do not keep emergency overrides without TTL/owner.
  • Do not ignore descriptor-missing spikes while tuning limits.