Runbook: Shadow Mode Rollout
Purpose / When to use
Use this runbook whenever introducing a new limiter rule, changing descriptor keys, or tightening thresholds.
Blast radius & risk level
- Risk level: medium
- Primary risk: false-positive rejects if promoted too early
Signals / symptoms
- Planned policy change on production paths
- Unknown reject behavior for real traffic mix
- Need to validate descriptors and thresholds safely
Detection queries
sum by (reason) (rate(fairvisor_decisions_total{action="allow"}[5m]))
rate(fairvisor_descriptor_missing_total[5m])
Log focus (would-reject analysis):
docker logs -f fairvisor | fairvisor logs --action=allow
Triage checklist
- Confirm selector scope (path/method) is intentionally narrow.
- Confirm descriptor keys exist in traffic.
- Confirm fallback behavior if rule
matchdoes not trigger. - Confirm runbook owner for promotion/rollback.
Mitigation playbook
Phase 1: shadow deploy
{
"id": "candidate-policy",
"spec": {
"mode": "shadow",
"selector": { "pathPrefix": "/api/" },
"rules": [
{
"name": "candidate-limit",
"limit_keys": ["header:x-api-key"],
"algorithm": "token_bucket",
"algorithm_config": { "tokens_per_second": 20, "burst": 40 }
}
]
}
}
Phase 2: observe
- Run in shadow for at least one representative traffic cycle.
- Measure would-reject concentration and descriptor misses.
- Tune limits if unexpected patterns appear.
Phase 3: promote
- Flip
modetoenforce. - Increment
bundle_version. - Deploy with active monitoring.
Verification checklist
- No sustained reject spike after promotion.
Retry-Afterdistribution remains acceptable.- Core endpoint latency remains within SLO.
- Descriptor missing metrics stay near baseline.
Exit criteria
- Policy rejects align with intended abuse/surge controls
- No meaningful customer-facing regression
- Final config documented with rationale
Rollback / recovery path
- Revert policy to
shadow, or restore previous bundle. - Validate reject-rate normalization.
- Re-open tuning cycle before next enforce attempt.
Post-incident notes
Record:
- shadow observation window duration
- top would-reject reasons
- threshold deltas between shadow and enforce
Do not
- Do not skip shadow for new descriptor keys.
- Do not promote during unrelated production incident windows.
- Do not treat one quiet hour as sufficient baseline.