Runbook: Shadow Mode Rollout

Purpose / When to use

Use this runbook whenever introducing a new limiter rule, changing descriptor keys, or tightening thresholds.

Blast radius & risk level

  • Risk level: medium
  • Primary risk: false-positive rejects if promoted too early

Signals / symptoms

  • Planned policy change on production paths
  • Unknown reject behavior for real traffic mix
  • Need to validate descriptors and thresholds safely

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="allow"}[5m]))
rate(fairvisor_descriptor_missing_total[5m])

Log focus (would-reject analysis):

docker logs -f fairvisor | fairvisor logs --action=allow

Triage checklist

  1. Confirm selector scope (path/method) is intentionally narrow.
  2. Confirm descriptor keys exist in traffic.
  3. Confirm fallback behavior if rule match does not trigger.
  4. Confirm runbook owner for promotion/rollback.

Mitigation playbook

Phase 1: shadow deploy

{
  "id": "candidate-policy",
  "spec": {
    "mode": "shadow",
    "selector": { "pathPrefix": "/api/" },
    "rules": [
      {
        "name": "candidate-limit",
        "limit_keys": ["header:x-api-key"],
        "algorithm": "token_bucket",
        "algorithm_config": { "tokens_per_second": 20, "burst": 40 }
      }
    ]
  }
}

Phase 2: observe

  1. Run in shadow for at least one representative traffic cycle.
  2. Measure would-reject concentration and descriptor misses.
  3. Tune limits if unexpected patterns appear.

Phase 3: promote

  1. Flip mode to enforce.
  2. Increment bundle_version.
  3. Deploy with active monitoring.

Verification checklist

  1. No sustained reject spike after promotion.
  2. Retry-After distribution remains acceptable.
  3. Core endpoint latency remains within SLO.
  4. Descriptor missing metrics stay near baseline.

Exit criteria

  • Policy rejects align with intended abuse/surge controls
  • No meaningful customer-facing regression
  • Final config documented with rationale

Rollback / recovery path

  1. Revert policy to shadow, or restore previous bundle.
  2. Validate reject-rate normalization.
  3. Re-open tuning cycle before next enforce attempt.

Post-incident notes

Record:

  • shadow observation window duration
  • top would-reject reasons
  • threshold deltas between shadow and enforce

Do not

  • Do not skip shadow for new descriptor keys.
  • Do not promote during unrelated production incident windows.
  • Do not treat one quiet hour as sufficient baseline.