Runbook: Budget Exhaustion
Purpose / When to use
Use this runbook when budget_exceeded becomes a top reject reason or when budget thresholds cause customer-visible throttling/rejections.
Blast radius & risk level
- Risk level: high
- Primary risk: hard quota cutoffs on critical paths and retry storms near the period boundary
Signals / symptoms
- Increased budget_exceeded rejects
- Traffic clusters near end-of-period boundaries
- Excessive throttle delays from staged actions
Detection queries
rate(fairvisor_decisions_total{action="reject",reason="budget_exceeded"}[5m])
sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_limiter_result_total{algorithm="cost_based",allowed="false"}[5m])
rate(fairvisor_retry_after_bucket_total[5m])
Triage checklist
- Confirm affected policies/rules and their period (5m, 1h, 1d, 7d).
- Check if spike is expected business seasonality or genuine anomaly.
- Check recent bundle changes to budget, cost_key, or staged thresholds.
- Validate cost source quality (header/query/fixed) and fallback behavior.
- Confirm whether impacted routes are critical read vs high-cost write paths.
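When checking a policy's period during triage, it can help to know how close the current time is to the next budget reset, since rejects and retries tend to cluster there. The sketch below is a rough aid only. It assumes budget windows are fixed and aligned to the Unix epoch, which this runbook does not confirm, and the helper names period_seconds and seconds_until_reset are hypothetical.

```python
# Assumption: windows are fixed-length and aligned to the Unix epoch.
# Verify against the actual limiter behavior before relying on this.
PERIOD_UNITS = {"m": 60, "h": 3600, "d": 86400}


def period_seconds(period: str) -> int:
    """Convert a period string such as "5m", "1h", "1d" or "7d" to seconds."""
    value, unit = period[:-1], period[-1:]
    if unit not in PERIOD_UNITS or not value.isdigit() or int(value) <= 0:
        raise ValueError(f"unsupported period: {period!r}")
    return int(value) * PERIOD_UNITS[unit]


def seconds_until_reset(now_epoch: int, period: str) -> int:
    """Seconds remaining until the next epoch-aligned window boundary."""
    length = period_seconds(period)
    return length - (now_epoch % length)
```

For example, with a 5m period and a timestamp 100 seconds into the window, seconds_until_reset returns 200.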
Mitigation playbook
Safe-first options:
- Raise budget conservatively for impacted policy window.
- Relax throttle stage before changing hard reject threshold.
- Move policy to shadow briefly if customer impact is severe.
Example staged config tuning:
{
"algorithm": "cost_based",
"algorithm_config": {
"budget": 1200,
"period": "5m",
"cost_key": "fixed",
"fixed_cost": 1,
"staged_actions": [
{ "threshold_percent": 80, "action": "warn" },
{ "threshold_percent": 97, "action": "throttle", "delay_ms": 150 },
{ "threshold_percent": 100, "action": "reject" }
]
}
}
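To reason about where each stage in the example config above begins, the thresholds can be converted to absolute cost. The sketch below is a minimal illustration, not the limiter's actual logic. It assumes the stage that applies is the one with the highest threshold_percent that usage has reached, and the function names are hypothetical.

```python
from typing import Optional

# Stages copied from the example config above.
STAGED_ACTIONS = [
    {"threshold_percent": 80, "action": "warn"},
    {"threshold_percent": 97, "action": "throttle", "delay_ms": 150},
    {"threshold_percent": 100, "action": "reject"},
]


def stage_start_cost(budget: float, stages: list) -> dict:
    """Map each action to the consumed cost at which it starts to apply."""
    return {s["action"]: budget * s["threshold_percent"] / 100 for s in stages}


def stage_for_usage(used: float, budget: float, stages: list) -> Optional[dict]:
    """Return the highest stage reached by current usage, or None if below all.

    Assumption: the highest reached threshold wins.
    """
    if budget <= 0:
        raise ValueError("budget must be positive")
    percent_used = 100.0 * used / budget
    reached = [s for s in stages if percent_used >= s["threshold_percent"]]
    if not reached:
        return None
    return max(reached, key=lambda s: s["threshold_percent"])
```

With a budget of 1200, this puts warn at 960, throttle at 1164, and reject at 1200 cost units per 5m window. That leaves only 36 units of throttled headroom before hard rejects, which is worth keeping in mind when relaxing the throttle stage.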
Aggressive options:
- Temporary global shadow incident mode with short TTL.
- Bundle rollback to last known-good budget profile.
Verification checklist
- budget_exceeded reject rate returns to expected range.
- Customer-facing 429 complaints decline.
- Retry-after distribution normalizes.
- No new regressions in non-budget rules.
Exit criteria
- Budget controls match real traffic/cost profile
- No sustained over-throttling on critical endpoints
- Updated budget policy documented and reviewed
Rollback / recovery path
- Reapply prior known-good budget policy bundle.
- Confirm active version/hash and reject baseline.
- Re-enter shadow-first tuning cycle for future changes.
Post-incident notes
Record:
- affected periods and routes
- rejected traffic share vs total
- threshold changes and rationale
- business context (launch, billing cycle, seasonality)
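For the rejected traffic share, a small helper keeps the calculation consistent across incidents. This is a sketch under assumptions: the input counts are taken to come from whatever export of decision counts is available for the incident window, and the function name and dictionary layout are hypothetical.

```python
def rejected_share(rejects_by_reason: dict, total_requests: float) -> dict:
    """Return each reject reason's share of total traffic, plus the overall share.

    rejects_by_reason maps a reason label (for example "budget_exceeded")
    to the number of rejected requests in the incident window.
    """
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    shares = {
        reason: count / total_requests
        for reason, count in rejects_by_reason.items()
    }
    # Compute the overall share from raw counts to avoid summing rounded floats.
    shares["_all_rejects"] = sum(rejects_by_reason.values()) / total_requests
    return shares
```

For instance, 150 budget_exceeded rejects out of 1000 total requests gives a share of 0.15.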
Do not
- Do not raise every budget globally as first response.
- Do not remove reject stage entirely.
- Do not ignore cost-source data quality.