Runbook: Budget Exhaustion

Purpose / When to use

Use this runbook when budget_exceeded becomes a top reject reason or when budget thresholds cause customer-visible throttling/rejections.

Blast radius & risk level

  • Risk level: high
  • Primary risk: hard quota cutoffs on critical paths and retry storms near the period boundary

Signals / symptoms

  • Increased budget_exceeded rejects
  • Traffic clustering near end-of-period boundaries
  • Excessive throttle delays from staged actions
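
To make the boundary-clustering signal concrete, here is a minimal sketch that measures what fraction of requests land in the last seconds of a budget window. It assumes fixed windows aligned to multiples of the period since the Unix epoch, which may not match how the limiter actually aligns its periods. The function name and parameters are hypothetical.

```python
def tail_fraction(timestamps, period_s, tail_s):
    """Fraction of request timestamps that fall in the last `tail_s`
    seconds of a window of length `period_s`.

    Assumption: windows are fixed and aligned to multiples of
    `period_s` since the Unix epoch.
    """
    if not timestamps:
        return 0.0
    cutoff = period_s - tail_s  # offset within the window where the tail starts
    in_tail = sum(1 for t in timestamps if (t % period_s) >= cutoff)
    return in_tail / len(timestamps)


# Illustrative timestamps (seconds), not real traffic.
sample_ts = [0, 100, 275, 290, 299, 880]
fraction = tail_fraction(sample_ts, period_s=300, tail_s=30)
```

With evenly spread traffic the result should sit near tail_s / period_s (0.1 in this example). A value well above that suggests clients are piling up near the end of the period.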

Detection queries

rate(fairvisor_decisions_total{action="reject",reason="budget_exceeded"}[5m])
sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_limiter_result_total{algorithm="cost_based",allowed="false"}[5m])
rate(fairvisor_retry_after_bucket_total[5m])
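
The second query returns one series per reject reason. As a rough sketch, assuming those per-reason rates have already been fetched into a plain dict, the following shows how to compute the budget_exceeded share and check whether it is the top reason. The function name, the other reason labels, and the sample numbers are hypothetical.

```python
def reject_share(rates_by_reason, reason="budget_exceeded"):
    """Return (share, is_top) for `reason` among per-reason reject rates."""
    total = sum(rates_by_reason.values())
    if total <= 0:
        return 0.0, False
    share = rates_by_reason.get(reason, 0.0) / total
    top_reason = max(rates_by_reason, key=rates_by_reason.get)
    return share, top_reason == reason


# Illustrative rates (rejects per second); the reasons other than
# budget_exceeded are made-up labels, not confirmed metric values.
sample = {"budget_exceeded": 6.0, "rate_limited": 3.0, "blocked": 1.0}
share, is_top = reject_share(sample)
```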

Triage checklist

  1. Confirm affected policies/rules and their period (5m, 1h, 1d, 7d).
  2. Check whether the spike is expected business seasonality or a genuine anomaly.
  3. Check recent bundle changes to budget, cost_key, or staged thresholds.
  4. Validate cost source quality (header/query/fixed) and fallback behavior.
  5. Confirm whether impacted routes are critical read vs high-cost write paths.

Mitigation playbook

Safe-first options:

  1. Raise budget conservatively for impacted policy window.
  2. Relax throttle stage before changing hard reject threshold.
  3. Move policy to shadow briefly if customer impact is severe.

Example staged config tuning:

{
  "algorithm": "cost_based",
  "algorithm_config": {
    "budget": 1200,
    "period": "5m",
    "cost_key": "fixed",
    "fixed_cost": 1,
    "staged_actions": [
      { "threshold_percent": 80, "action": "warn" },
      { "threshold_percent": 97, "action": "throttle", "delay_ms": 150 },
      { "threshold_percent": 100, "action": "reject" }
    ]
  }
}
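
To see what the staged thresholds above mean in absolute cost units, here is a small sketch. It assumes a stage takes effect once consumed cost reaches threshold_percent of the budget and that the highest reached stage wins. That semantics is an assumption for illustration, not a confirmed description of the limiter, and resolve_action is a hypothetical helper.

```python
def resolve_action(consumed, budget, staged_actions):
    """Return the action of the highest stage whose threshold is reached.

    Assumption: a stage applies once consumed / budget >= threshold_percent / 100.
    Returns "allow" while consumption is below every threshold.
    """
    action = "allow"
    for stage in sorted(staged_actions, key=lambda s: s["threshold_percent"]):
        # Integer comparison avoids floating-point edge cases at the boundary.
        if consumed * 100 >= stage["threshold_percent"] * budget:
            action = stage["action"]
    return action


stages = [
    {"threshold_percent": 80, "action": "warn"},
    {"threshold_percent": 97, "action": "throttle", "delay_ms": 150},
    {"threshold_percent": 100, "action": "reject"},
]
budget = 1200

# Cost units at which each stage starts under the assumption above.
stage_starts = {s["action"]: budget * s["threshold_percent"] / 100 for s in stages}
```

Under these assumptions, with a budget of 1200 and a fixed cost of 1, warn starts at 960 requests in the window, throttle at 1164, and reject at 1200. The throttle band is therefore only 36 requests wide, which is worth keeping in mind when deciding whether to relax the throttle stage or raise the budget.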

Aggressive options:

  1. Temporary global shadow incident mode with short TTL.
  2. Bundle rollback to last known-good budget profile.

Verification checklist

  1. budget_exceeded reject rate returns to expected range.
  2. Customer-facing 429 complaints decline.
  3. Retry-after distribution normalizes.
  4. No new regressions in non-budget rules.
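
One way to make "returns to expected range" checkable is to compare recent reject-rate samples against a pre-incident baseline. The sketch below is illustrative only: the function name, the 20% tolerance, and the idea of sampling the rate at fixed intervals are assumptions, not part of the product.

```python
def within_expected_range(recent_rates, baseline, tolerance=0.2):
    """True if every recent sample is at or below baseline * (1 + tolerance)."""
    if not recent_rates:
        return False  # no data is not evidence of recovery
    ceiling = baseline * (1 + tolerance)
    return all(rate <= ceiling for rate in recent_rates)


# Illustrative samples of the budget_exceeded reject rate (per second).
recovered = within_expected_range([0.9, 1.0, 1.1], baseline=1.0)
still_elevated = within_expected_range([0.9, 2.5], baseline=1.0)
```

Requiring every sample in the window to pass, rather than only the latest one, avoids declaring recovery during a brief dip between period boundaries.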

Exit criteria

  • Budget controls match real traffic/cost profile
  • No sustained over-throttling on critical endpoints
  • Updated budget policy documented and reviewed

Rollback / recovery path

  1. Reapply prior known-good budget policy bundle.
  2. Confirm active version/hash and reject baseline.
  3. Re-enter shadow-first tuning cycle for future changes.

Post-incident notes

Record:

  • affected periods and routes
  • rejected traffic share vs total
  • threshold changes and rationale
  • business context (launch, billing cycle, seasonality)

Do not

  • Do not raise every budget globally as first response.
  • Do not remove reject stage entirely.
  • Do not ignore cost-source data quality.