Token Bucket

The token bucket is the primary algorithm for request-rate limiting (requests per second). It allows short bursts while enforcing a steady-state rate over time.

How it works

The implementation uses a continuous-time token bucket (also called a leaky bucket as a meter):

  1. Each unique limit key has a bucket with capacity burst tokens
  2. Tokens refill at tokens_per_second per second, continuously
  3. Each request consumes cost tokens (default 1)
  4. If the bucket has enough tokens, the request is allowed; otherwise it is rejected

Refill is calculated lazily on each request — there is no background timer:

elapsed     = max(0, now - last_refill)
new_tokens  = min(tokens + elapsed × tokens_per_second, burst)

if new_tokens >= cost:
    allow, set tokens = new_tokens - cost
else:
    reject, retry_after = ceil((cost - new_tokens) / tokens_per_second)

The max(0, ...) guards against backward clock jumps.
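The refill-and-consume steps above can be sketched in Python. This is a simplified single-process model, not the actual implementation (which runs in Lua against shared memory); the function name and the dict-based state are illustrative, and persisting the refilled balance on rejection is an assumption of this sketch:

```python
import math

def token_bucket_allow(state, now, tokens_per_second, burst, cost=1.0):
    # Lazily refill based on elapsed time; no background timer is needed.
    elapsed = max(0.0, now - state["last_refill"])  # guard against backward clock jumps
    new_tokens = min(state["tokens"] + elapsed * tokens_per_second, burst)
    state["last_refill"] = now
    if new_tokens >= cost:
        state["tokens"] = new_tokens - cost
        return True, 0
    # Not enough tokens: keep the refilled balance and report when to retry.
    state["tokens"] = new_tokens
    return False, math.ceil((cost - new_tokens) / tokens_per_second)
```

For example, with tokens_per_second = 1 and burst = 2, two back-to-back requests drain the bucket, the third is rejected with retry_after = 1, and after a few seconds of idle time requests are allowed again.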

State

State is stored in ngx.shared.fairvisor_counters as a single string entry per limit key:

Key:    tb:{rule_name}:{composite_limit_key}
Value:  "{tokens}:{last_refill}"   (both as 6-decimal floats)

Example:

tb:per-org-rps:org-abc  →  "87.432100:1736940000.000000"
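A minimal Python round-trip of this value format (the function names are illustrative, not part of the implementation):

```python
def encode_state(tokens, last_refill):
    # Serialize to the "{tokens}:{last_refill}" format, both as 6-decimal floats.
    return "%.6f:%.6f" % (tokens, last_refill)

def decode_state(value):
    # Split the stored string back into its two float components.
    tokens, last_refill = value.split(":")
    return float(tokens), float(last_refill)
```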

Configuration

{
  "algorithm": "token_bucket",
  "algorithm_config": {
    "tokens_per_second": 100,
    "burst": 200,
    "cost_source": "fixed",
    "fixed_cost": 1,
    "default_cost": 1
  }
}
Field              Type    Required  Default            Constraint  Description
tokens_per_second  number  yes*      -                  > 0         Steady-state refill rate
rps                number  yes*      -                  > 0         Alias for tokens_per_second
burst              number  no        tokens_per_second  > 0         Maximum bucket capacity (maximum burst size)
cost_source        string  no        "fixed"            -           "fixed", "header:<name>", or "query:<name>"
fixed_cost         number  no        1                  > 0         Token cost per request when cost_source = "fixed"
default_cost       number  no        1                  > 0         Fallback cost when the source header/query param is absent or non-numeric

*One of tokens_per_second or rps must be provided.
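The alias resolution and defaults can be sketched as a small normalizer (an illustrative sketch; the real validator may differ in error handling):

```python
def normalize_config(cfg):
    # Accept either tokens_per_second or its alias rps.
    rate = cfg.get("tokens_per_second", cfg.get("rps"))
    if rate is None or rate <= 0:
        raise ValueError("one of tokens_per_second or rps must be provided and > 0")
    return {
        "tokens_per_second": rate,
        "burst": cfg.get("burst", rate),          # default: same as the refill rate
        "cost_source": cfg.get("cost_source", "fixed"),
        "fixed_cost": cfg.get("fixed_cost", 1),
        "default_cost": cfg.get("default_cost", 1),
    }
```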

Variable cost

You can charge a variable number of tokens per request by reading the cost from a header or query parameter:

{
  "algorithm_config": {
    "tokens_per_second": 100,
    "burst": 1000,
    "cost_source": "header:x-request-weight",
    "default_cost": 1
  }
}

If X-Request-Weight: 5 is sent, 5 tokens are consumed. If the header is absent or contains a non-numeric value, default_cost (1) is used.
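The cost-resolution rule can be sketched as follows (header/query lookups are simplified to dict lookups here, and the function name is made up for illustration):

```python
def resolve_cost(cost_source, fixed_cost, default_cost, headers, query):
    # "fixed" charges a constant cost; otherwise read from the named source.
    if cost_source == "fixed":
        return fixed_cost
    kind, _, name = cost_source.partition(":")
    raw = (headers if kind == "header" else query).get(name)
    try:
        cost = float(raw)
        # Non-positive values are treated like invalid input in this sketch.
        return cost if cost > 0 else default_cost
    except (TypeError, ValueError):
        return default_cost
```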

Response headers

On every allowed request the following headers are set (in enforce mode), per draft-ietf-httpapi-ratelimit-headers and Retry-After per RFC 9110:

RateLimit-Limit: 200
RateLimit-Remaining: 87
RateLimit-Reset: 1
RateLimit: "per-org-rps";r=87;t=1

On rejection:

HTTP 429 Too Many Requests
Retry-After: 2
RateLimit-Limit: 200
RateLimit-Remaining: 0
RateLimit-Reset: 2
X-Fairvisor-Reason: token_bucket_exceeded

Retry-After jitter

The Retry-After value is the base ceil((cost - tokens) / tokens_per_second), increased by a deterministic per-identity jitter of up to 50% of that base. The jitter is seeded from the identity (JWT subject, IP, path), so the same client always gets the same offset while different clients get different offsets. This prevents a thundering herd when many clients hit the limit at the same time and all retry at once.
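One way to derive such a deterministic jitter is to hash the identity into a stable fraction (a sketch of the idea; the actual hash function and scaling in the implementation may differ):

```python
import hashlib
import math

def retry_after_with_jitter(cost, tokens, tokens_per_second, identity):
    # Base wait time until enough tokens have refilled.
    base = math.ceil((cost - tokens) / tokens_per_second)
    # Hash the identity to a stable fraction in [0, 1]; the same identity
    # always maps to the same fraction, so its offset never changes.
    digest = hashlib.sha256(identity.encode()).digest()
    fraction = digest[0] / 255.0
    # Spread retries across [base, 1.5 * base].
    return math.ceil(base * (1 + 0.5 * fraction))
```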

Failure behavior

If the shared dict write fails (for example, under dict memory pressure), the algorithm fails open:

  • the request is allowed
  • token state is left unchanged (no partial decrement is applied)
  • the failure is logged for metrics

Traffic is never blocked due to storage failure.

Tuning

Scenario                                           Recommendation
Interactive API (users expect burst, then steady)  burst = 5 × tokens_per_second
Strict steady-rate enforcement                     burst = tokens_per_second
LLM tool calls (each call is expensive)            burst = tokens_per_second; set cost_source to a per-call cost header
Background batch job (burst allowed at start)      burst = 10 × tokens_per_second

Examples

Global rate limit with IP partitioning

{
  "name": "global-rps",
  "limit_keys": ["ip:address"],
  "algorithm": "token_bucket",
  "algorithm_config": {
    "rps": 5,
    "burst": 10
  }
}

Each unique IP address gets its own 10-token bucket that refills at 5 tokens per second.

Per-API-key tiered limits

{
  "rules": [
    {
      "name": "enterprise",
      "limit_keys": ["header:x-api-key"],
      "algorithm": "token_bucket",
      "algorithm_config": { "rps": 1000, "burst": 2000 },
      "match": { "jwt:plan": "enterprise" }
    },
    {
      "name": "free",
      "limit_keys": ["header:x-api-key"],
      "algorithm": "token_bucket",
      "algorithm_config": { "rps": 10, "burst": 20 }
    }
  ]
}