Runbook: Rate Limit by User

Purpose / When to use

Use this runbook when you need stable per-user fairness and protection against noisy clients without throttling whole tenants.

Blast radius & risk level

  • Risk level: medium
  • Typical impact if misconfigured: high 429 rate for legitimate traffic, either because many clients share one identity key or because the descriptor is missing

Signals / symptoms

  • One user can starve shared backend capacity
  • Per-IP limits look healthy, but user-level abuse still passes
  • fairvisor_descriptor_missing_total grows for your identity key

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_descriptor_missing_total{key="jwt:sub"}[5m])
rate(fairvisor_descriptor_missing_total{key="header:x-user-id"}[5m])
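To run the queries above from a script rather than a dashboard, a minimal sketch follows. It assumes the metrics are scraped by a Prometheus server reachable over its standard HTTP API (GET /api/v1/query); the base URL and the helper names are hypothetical and not part of Fairvisor.

```python
import json
from urllib.parse import urlencode

# The three detection queries from this runbook, keyed by a short label.
QUERIES = {
    "rejects_by_reason": 'sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))',
    "missing_jwt_sub": 'rate(fairvisor_descriptor_missing_total{key="jwt:sub"}[5m])',
    "missing_x_user_id": 'rate(fairvisor_descriptor_missing_total{key="header:x-user-id"}[5m])',
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL with the PromQL safely URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def sample_values(response_body: str) -> dict:
    """Map each series' label set to its value from an instant-query JSON response."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    values = {}
    for series in payload["data"]["result"]:
        labels = tuple(sorted(series["metric"].items()))
        # An instant-vector sample is [unix_timestamp, "value-as-string"].
        values[labels] = float(series["value"][1])
    return values
```

Fetch the URL with any HTTP client (for example urllib.request.urlopen) and pass the response body to sample_values. A non-zero value for the key you plan to limit on is the signal to resolve before enforcement.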

Optional header trace:

curl -i -X POST http://localhost:8080/v1/decision \
  -H 'X-Original-Method: GET' \
  -H 'X-Original-URI: /api/v1/items' \
  -H 'Authorization: Bearer <jwt>'

Triage checklist

  1. Pick the identity source (jwt:sub preferred, header:x-user-id fallback).
  2. Confirm identity field exists on production traffic paths.
  3. Confirm gateway forwards required headers consistently.
  4. Validate that fairvisor_descriptor_missing_total shows no spikes for the chosen key before enforcement.
  5. Confirm route selector scope is narrow enough (avoid accidental global coverage).
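Steps 1 to 3 can be spot-checked offline against a sample of captured request headers. The sketch below is an audit aid only, not how Fairvisor extracts descriptors: it decodes the JWT payload without verifying the signature and reports which identity key a request could satisfy. The function names are hypothetical.

```python
import base64
import json
from typing import Optional


def jwt_claims(token: str) -> dict:
    """Decode a JWT payload segment. The signature is NOT verified (audit use only)."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped base64url padding
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if not isinstance(claims, dict):
        raise ValueError("JWT payload is not a JSON object")
    return claims


def identity_source(headers: dict) -> Optional[str]:
    """Return the identity key a request can supply, preferring jwt:sub, else None."""
    normalized = {name.lower(): value for name, value in headers.items()}
    auth = normalized.get("authorization", "")
    if auth.lower().startswith("bearer "):
        try:
            if jwt_claims(auth[7:].strip()).get("sub"):
                return "jwt:sub"
        except ValueError:  # malformed token, bad base64, or invalid JSON
            pass
    if normalized.get("x-user-id"):
        return "header:x-user-id"
    return None
```

Requests for which identity_source returns None are the ones that would show up in fairvisor_descriptor_missing_total; trace them back to the gateway route that drops or never sets the header.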

Mitigation playbook

Safe-first path:

  1. Create policy with mode: shadow and per-user token bucket.
  2. Observe would-reject volume for at least one traffic cycle.
  3. Tune tokens_per_second and burst until false positives are acceptable.
  4. Promote to mode: enforce, bumping bundle_version monotonically.
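Step 4 can be guarded with a small pre-deploy check. The sketch below assumes a bundle is a JSON object with a top-level integer bundle_version and a policies list whose entries have the id / spec.mode shape used in this runbook. That bundle layout is an assumption for illustration; adjust it to the real bundle format.

```python
import copy


def promote_to_enforce(bundle: dict, policy_id: str, new_version: int) -> dict:
    """Return a copy of the bundle with one shadow policy switched to enforce.

    Assumed (hypothetical) layout: {"bundle_version": int, "policies": [{"id", "spec": {"mode"}}]}.
    """
    if new_version <= bundle["bundle_version"]:
        raise ValueError("bundle_version must strictly increase")
    promoted = copy.deepcopy(bundle)  # never mutate the known-good bundle
    for policy in promoted["policies"]:
        if policy["id"] == policy_id:
            if policy["spec"]["mode"] != "shadow":
                raise ValueError(f"{policy_id} is not in shadow mode")
            policy["spec"]["mode"] = "enforce"
            break
    else:
        raise KeyError(policy_id)
    promoted["bundle_version"] = new_version
    return promoted
```

Keeping the original bundle untouched means the known-good version is still at hand if a rollback is needed.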

Reference policy snippet:

{
  "id": "api-per-user-limit",
  "spec": {
    "mode": "shadow",
    "selector": { "pathPrefix": "/api/" },
    "rules": [
      {
        "name": "per-user-rps",
        "limit_keys": ["jwt:sub"],
        "algorithm": "token_bucket",
        "algorithm_config": {
          "tokens_per_second": 10,
          "burst": 20
        }
      }
    ]
  }
}
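To reason about tokens_per_second and burst before looking at shadow-mode data, the policy above can be replayed against a request trace. This is a textbook token bucket, assuming one token per request and a bucket that starts full; the engine's exact refill and accounting behavior may differ, so treat the numbers as a rough estimate.

```python
class TokenBucket:
    """Textbook token bucket: refills continuously, capped at `burst` tokens."""

    def __init__(self, tokens_per_second: float, burst: float, now: float = 0.0):
        self.rate = tokens_per_second
        self.capacity = burst
        self.tokens = burst  # assumption: a new bucket starts full
        self.last = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0  # assumption: each request costs one token
            return True
        return False


def would_reject(events, tokens_per_second: float, burst: float) -> dict:
    """Count rejections per user for (timestamp_seconds, user_id) events sorted by time."""
    buckets = {}
    rejects = {}
    for timestamp, user in events:
        if user not in buckets:
            buckets[user] = TokenBucket(tokens_per_second, burst, now=timestamp)
        if not buckets[user].allow(timestamp):
            rejects[user] = rejects.get(user, 0) + 1
    return rejects
```

With tokens_per_second 10 and burst 20, a user who sends 30 requests in the same instant gets 20 through and 10 rejected, while a second user sending 5 requests at that moment is unaffected because each user has a separate bucket.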

Fallback identity variant:

"limit_keys": ["header:x-user-id"]

Verification checklist

  1. Reject reason distribution is stable and expected.
  2. fairvisor_descriptor_missing_total is near zero for the chosen key.
  3. No unexpected 429 surge on core endpoints.
  4. RateLimit-* headers align with expected user-level buckets.
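For check 4, the headers can be pulled out of saved curl -i output and compared with the configured bucket. The sketch below collects every header whose name starts with RateLimit-. The specific names used in the example (RateLimit-Limit, RateLimit-Remaining) are assumptions, since this runbook only refers to them as RateLimit-*.

```python
def ratelimit_headers(raw_response: str) -> dict:
    """Collect RateLimit-* headers from `curl -i` style output.

    Only the header block is scanned; it ends at the first blank line.
    """
    found = {}
    lines = raw_response.splitlines()
    for line in lines[1:]:  # lines[0] is the HTTP status line
        if not line.strip():
            break  # the blank line separates headers from the body
        name, sep, value = line.partition(":")
        if sep and name.strip().lower().startswith("ratelimit-"):
            found[name.strip().lower()] = value.strip()
    return found
```

Run it for two different users in quick succession: if the buckets are keyed per user, one user's remaining count should not drop when the other sends traffic.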

Exit criteria

  • No sustained user-facing error regression
  • Per-user fairness objective achieved
  • Alert noise stays within team threshold

Rollback / recovery path

  1. Switch policy back to mode: shadow.
  2. If still noisy, remove the policy and redeploy the known-good bundle.
  3. Verify reject-rate baseline recovery.

Post-incident notes

Record:

  • chosen identity key and reason
  • final threshold values
  • false-positive examples
  • gateway forwarding fixes applied

Do not

  • Do not enforce per-user limits before validating descriptor presence.
  • Do not combine multiple new identity keys in one rollout.
  • Do not rely on IP as the primary identity for authenticated APIs.