Runbook: Rate Limit by User

Purpose / When to use

Use this runbook when you need stable per-user fairness and protection against noisy clients without throttling whole tenants.

Blast radius & risk level

Risk level: medium
Typical impact if misconfigured: a high 429 rate for legitimate traffic, either because many users share the same identity key or because the identity descriptor is missing

Signals / symptoms

One user can starve shared backend capacity
Per-IP limits look healthy, but user-level abuse still passes
fairvisor_descriptor_missing_total grows for your identity key

Detection queries

sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))
rate(fairvisor_descriptor_missing_total{key="jwt:sub"}[5m])
rate(fairvisor_descriptor_missing_total{key="header:x-user-id"}[5m])
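If these queries need to run from a script rather than a dashboard, they can be sent to the Prometheus HTTP API instant-query endpoint (/api/v1/query). The sketch below only builds the request URLs. The base URL and the names DETECTION_QUERIES and instant_query_url are assumptions for illustration, not part of Fairvisor.

```python
from urllib.parse import urlencode

# The three detection queries from this runbook, keyed by a short label.
DETECTION_QUERIES = {
    "rejects_by_reason": 'sum by (reason) (rate(fairvisor_decisions_total{action="reject"}[5m]))',
    "missing_jwt_sub": 'rate(fairvisor_descriptor_missing_total{key="jwt:sub"}[5m])',
    "missing_x_user_id": 'rate(fairvisor_descriptor_missing_total{key="header:x-user-id"}[5m])',
}


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL with the PromQL safely URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # Assumed Prometheus address; replace with your own.
    for label, promql in DETECTION_QUERIES.items():
        print(label, instant_query_url("http://localhost:9090", promql))
```

The URLs can be fetched with any HTTP client. URL-encoding matters here because the queries contain braces, quotes, and brackets.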

Optional header trace:

curl -i -X POST http://localhost:8080/v1/decision \
  -H 'X-Original-Method: GET' \
  -H 'X-Original-URI: /api/v1/items' \
  -H 'Authorization: Bearer <jwt>'

Triage checklist

Pick the identity source (jwt:sub preferred, header:x-user-id fallback).
Confirm the identity field exists on all production traffic paths.
Confirm the gateway forwards the required headers consistently.
Validate that fairvisor_descriptor_missing_total shows no spikes for the chosen key before enforcement.
Confirm the route selector scope is narrow enough (avoid accidental global coverage).
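The identity-field item in the checklist above can be spot-checked offline against sampled tokens. The sketch below assumes a standard three-segment JWT and decodes the payload without verifying the signature, so it is only suitable for confirming that a sub claim is present, not for authentication. The names jwt_claims and has_identity are illustrative and not part of Fairvisor.

```python
import base64
import json


def jwt_claims(token: str) -> dict:
    """Decode the payload segment of a JWT WITHOUT verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected header.payload.signature")
    payload = parts[1]
    # JWT segments are unpadded base64url; restore padding before decoding.
    payload += "=" * (-len(payload) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload))
    if not isinstance(claims, dict):
        raise ValueError("JWT payload is not a JSON object")
    return claims


def has_identity(token: str, claim: str = "sub") -> bool:
    """Return True when the claim is present as a non-empty string."""
    try:
        value = jwt_claims(token).get(claim)
    except ValueError:
        # Covers malformed tokens, bad base64, and invalid JSON.
        return False
    return isinstance(value, str) and value != ""
```

Running has_identity over a sample of production tokens gives a rough estimate of how many requests would lack a jwt:sub descriptor before any policy is deployed.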

Mitigation playbook

Safe-first path:

Create a policy with mode: shadow and a per-user token bucket.
Observe the would-reject volume for at least one traffic cycle.
Tune tokens_per_second and burst until the false-positive rate is acceptable.
Promote to mode: enforce, bumping bundle_version so that it increases monotonically.

Reference policy snippet:

{
  "id": "api-per-user-limit",
  "spec": {
    "mode": "shadow",
    "selector": { "pathPrefix": "/api/" },
    "rules": [
      {
        "name": "per-user-rps",
        "limit_keys": ["jwt:sub"],
        "algorithm": "token_bucket",
        "algorithm_config": {
          "tokens_per_second": 10,
          "burst": 20
        }
      }
    ]
  }
}

Fallback identity variant:

"limit_keys": ["header:x-user-id"]
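To reason about what tokens_per_second: 10 and burst: 20 mean for a single user, the following sketch simulates a generic token bucket keyed by user. It assumes each bucket starts full and refills continuously; Fairvisor's actual implementation may differ, and the class names TokenBucket and PerUserLimiter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    tokens_per_second: float
    burst: float
    tokens: float = 0.0
    last: float = 0.0

    def __post_init__(self):
        # Assumption: a new bucket starts full.
        self.tokens = self.burst

    def allow(self, now: float, cost: float = 1.0) -> bool:
        """Refill for the elapsed time, then try to spend `cost` tokens."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.tokens_per_second)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class PerUserLimiter:
    """One independent bucket per identity value (for example, the jwt:sub claim)."""

    def __init__(self, tokens_per_second: float, burst: float):
        self.rate = tokens_per_second
        self.burst = burst
        self.buckets = {}

    def allow(self, user: str, now: float) -> bool:
        bucket = self.buckets.get(user)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, last=now)
            self.buckets[user] = bucket
        return bucket.allow(now)
```

Under these assumptions, a user who sends 25 requests at the same instant gets 20 accepted and 5 rejected, regains 5 tokens after half a second, and never affects another user's bucket. This is also why a shared identity key is risky: every client behind that key draws from the same 20-token bucket.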

Verification checklist

Reject reason distribution is stable and expected.
fairvisor_descriptor_missing_total is near zero for the chosen key.
No unexpected 429 surge on core endpoints.
RateLimit-* headers align with expected user-level buckets.
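The header check above can be scripted against captured responses. The sketch below collects any RateLimit-* headers and checks that the remaining count fits inside the configured burst. It assumes a header named RateLimit-Remaining that carries an integer, which this runbook does not confirm; adjust the name and parsing to whatever your deployment actually emits.

```python
def ratelimit_headers(headers: dict) -> dict:
    """Collect RateLimit-* headers case-insensitively, keyed by lowercase name."""
    return {
        name.lower(): value.strip()
        for name, value in headers.items()
        if name.lower().startswith("ratelimit-")
    }


def remaining_is_consistent(headers: dict, burst: int) -> bool:
    """True when the (assumed) RateLimit-Remaining value lies within 0..burst."""
    raw = ratelimit_headers(headers).get("ratelimit-remaining")
    if raw is None:
        return False
    try:
        remaining = int(raw)
    except ValueError:
        return False
    return 0 <= remaining <= burst
```

A remaining value above the per-user burst suggests the response was counted against a different bucket than the one the policy was meant to create.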

Exit criteria

No sustained user-facing error regression
Per-user fairness objective achieved
Alert noise stays within team threshold

Rollback / recovery path

Switch the policy back to mode: shadow.
If traffic is still noisy, remove the policy and redeploy the known-good bundle.
Verify that the reject rate returns to its baseline.

Post-incident notes

Record:

chosen identity key and reason
final threshold values
false-positive examples
gateway forwarding fixes applied

Do not

Do not enforce per-user limits before validating descriptor presence.
Do not combine multiple new identity keys in one rollout.
Do not rely on IP as the primary identity for authenticated APIs.