Runbook: Shadow Mode Rollout

Purpose / When to use

Use this runbook whenever introducing a new limiter rule, changing descriptor keys, or tightening thresholds.

Blast radius & risk level

Risk level: medium
Primary risk: false-positive rejects if promoted too early

Signals / symptoms

Planned policy change on production paths
Unknown reject behavior for real traffic mix
Need to validate descriptors and thresholds safely

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="allow"}[5m]))
rate(fairvisor_descriptor_missing_total[5m])

Log focus (would-reject analysis):

docker logs -f fairvisor | fairvisor logs --action=allow

Triage checklist

Confirm selector scope (path/method) is intentionally narrow.
Confirm descriptor keys exist in traffic.
Confirm fallback behavior if rule match does not trigger.
Confirm runbook owner for promotion/rollback.

Mitigation playbook

Phase 1: shadow deploy

{
  "id": "candidate-policy",
  "spec": {
    "mode": "shadow",
    "selector": { "pathPrefix": "/api/" },
    "rules": [
      {
        "name": "candidate-limit",
        "limit_keys": ["header:x-api-key"],
        "algorithm": "token_bucket",
        "algorithm_config": { "tokens_per_second": 20, "burst": 40 }
      }
    ]
  }
}

Phase 2: observe

Run in shadow for at least one representative traffic cycle.
Measure would-reject concentration and descriptor misses.
Tune limits if unexpected patterns appear.

Phase 3: promote

Flip mode to enforce.
Increment bundle_version.
Deploy with active monitoring.

Verification checklist

No sustained reject spike after promotion.
Retry-After distribution remains acceptable.
Core endpoint latency remains within SLO.
Descriptor missing metrics stay near baseline.

Exit criteria

Policy rejects align with intended abuse/surge controls
No meaningful customer-facing regression
Final config documented with rationale

Rollback / recovery path

Revert policy to shadow, or restore previous bundle.
Validate reject-rate normalization.
Re-open tuning cycle before next enforce attempt.

Post-incident notes

Record:

shadow observation window duration
top would-reject reasons
threshold deltas between shadow and enforce

Do not

Do not skip shadow for new descriptor keys.
Do not promote during unrelated production incident windows.
Do not treat one quiet hour as sufficient baseline.