# Token Bucket
The token bucket is the primary algorithm for request-rate limiting (requests per second). It allows short bursts while enforcing a steady-state rate over time.
## How it works

The implementation uses a continuous-time token bucket (also called a leaky bucket as a meter):

- Each unique limit key has a bucket with capacity `burst` tokens
- Tokens refill at `tokens_per_second` per second, continuously
- Each request consumes `cost` tokens (default 1)
- If the bucket has enough tokens, the request is allowed; otherwise it is rejected
Refill is calculated lazily on each request — there is no background timer:

```
elapsed = max(0, now - last_refill)
new_tokens = min(tokens + elapsed × tokens_per_second, burst)
if new_tokens >= cost:
    allow, set tokens = new_tokens - cost
else:
    reject, retry_after = ceil((cost - new_tokens) / tokens_per_second)
```

The `max(0, ...)` guards against backward clock jumps.
## State

State is stored in `ngx.shared.fairvisor_counters` as a single string entry per limit key:

- Key: `tb:{rule_name}:{composite_limit_key}`
- Value: `"{tokens}:{last_refill}"` (both as 6-decimal floats)

Example:

```
tb:per-org-rps:org-abc → "87.432100:1736940000.000000"
```
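The value encoding described above (two 6-decimal floats joined by a colon) can be sketched in Python; the function names `encode_state` and `decode_state` are invented for illustration:

```python
def encode_state(tokens: float, last_refill: float) -> str:
    """Serialize bucket state in the shared-dict value format: two 6-decimal floats."""
    return f"{tokens:.6f}:{last_refill:.6f}"


def decode_state(value: str) -> tuple[float, float]:
    """Parse a stored value back into (tokens, last_refill)."""
    tokens_s, last_refill_s = value.split(":", 1)
    return float(tokens_s), float(last_refill_s)
```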
## Configuration

```json
{
  "algorithm": "token_bucket",
  "algorithm_config": {
    "tokens_per_second": 100,
    "burst": 200,
    "cost_source": "fixed",
    "fixed_cost": 1,
    "default_cost": 1
  }
}
```
| Field | Type | Required | Default | Constraint | Description |
|---|---|---|---|---|---|
| `tokens_per_second` | number | yes* | — | > 0 | Steady-state refill rate |
| `rps` | number | yes* | — | > 0 | Alias for `tokens_per_second` |
| `burst` | number | yes | — | ≥ `tokens_per_second` | Maximum bucket capacity (max burst size) |
| `cost_source` | string | no | `"fixed"` | — | `"fixed"`, `"header:<name>"`, or `"query:<name>"` |
| `fixed_cost` | number | no | 1 | > 0 | Token cost per request when `cost_source` = `"fixed"` |
| `default_cost` | number | no | 1 | > 0 | Fallback cost when the source header/query param is absent or non-numeric |

\*One of `tokens_per_second` or `rps` must be provided.
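The constraints in the table can be expressed as a small validation sketch (the function `validate_config` is hypothetical, not part of the gateway's API; it just encodes the rules above):

```python
def validate_config(config: dict) -> tuple[float, float]:
    """Check the token-bucket constraints: a positive rate (tokens_per_second or
    its alias rps) and a burst at least as large as the rate. Returns (rate, burst)."""
    rate = config.get("tokens_per_second", config.get("rps"))
    if rate is None or rate <= 0:
        raise ValueError("one of tokens_per_second or rps (> 0) is required")
    burst = config.get("burst")
    if burst is None or burst < rate:
        raise ValueError("burst is required and must be >= the refill rate")
    return float(rate), float(burst)
```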
## Variable cost

You can charge a variable number of tokens per request by reading the cost from a header or query parameter:

```json
{
  "algorithm_config": {
    "tokens_per_second": 100,
    "burst": 1000,
    "cost_source": "header:x-request-weight",
    "default_cost": 1
  }
}
```
If `X-Request-Weight: 5` is sent, 5 tokens are consumed. If the header is absent or contains a non-numeric value, `default_cost` (1) is used.
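The cost-resolution rule can be sketched as follows. This is a simplified illustration (the helper name `resolve_cost` and the dict-based header/query lookup are assumptions, not the gateway's actual API):

```python
def resolve_cost(cost_source: str, headers: dict, query: dict,
                 fixed_cost: float = 1, default_cost: float = 1) -> float:
    """Resolve the per-request token cost from the configured source.
    Falls back to default_cost when the value is absent or non-numeric."""
    if cost_source == "fixed":
        return fixed_cost
    kind, _, name = cost_source.partition(":")
    source = headers if kind == "header" else query
    raw = source.get(name.lower())
    try:
        cost = float(raw)
    except (TypeError, ValueError):
        return default_cost
    return cost if cost > 0 else default_cost
```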
## Response headers

On every allowed request the following headers are set (in enforce mode), following draft-ietf-httpapi-ratelimit-headers; `Retry-After` on rejections follows RFC 9110:

```
RateLimit-Limit: 200
RateLimit-Remaining: 87
RateLimit-Reset: 1
RateLimit: "per-org-rps";r=87;t=1
```
On rejection:

```
HTTP 429 Too Many Requests
Retry-After: 2
RateLimit-Limit: 200
RateLimit-Remaining: 0
RateLimit-Reset: 2
X-Fairvisor-Reason: token_bucket_exceeded
```
## Retry-After jitter

The `Retry-After` value is the raw `ceil((cost - tokens) / tokens_per_second)` plus a deterministic per-identity jitter of up to 50%. The jitter is seeded from the identity (JWT subject, IP, path), so the same client always gets the same offset while different clients get different offsets. This prevents the thundering herd problem when many clients hit the limit at the same time.
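A jitter scheme of this shape can be sketched in Python. Note the assumptions: the SHA-256 hash of the identity string as the deterministic seed, and jitter applied multiplicatively before rounding, are illustrative choices; the gateway's actual seeding may differ.

```python
import hashlib
import math


def retry_after_with_jitter(cost: float, tokens: float,
                            tokens_per_second: float, identity: str) -> int:
    """Base retry delay plus a deterministic per-identity jitter of up to 50%."""
    base = (cost - tokens) / tokens_per_second
    # Hash the identity so the same client always lands on the same offset
    # while different clients spread out across the jitter window.
    digest = hashlib.sha256(identity.encode()).digest()
    fraction = int.from_bytes(digest[:4], "big") / 0xFFFFFFFF  # 0.0 .. 1.0
    return math.ceil(base * (1 + 0.5 * fraction))
```

Because the jitter is a pure function of the identity, retries from one client stay consistent while the fleet as a whole de-synchronizes.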
## Failure behavior

If the shared dict increment fails (for example, under dict memory pressure), the algorithm fails open:

- the request is allowed
- token state is left unchanged (no partial decrement is applied)
- the failure is logged for metrics

Traffic is never blocked due to storage failure.
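The fail-open policy can be sketched as a wrapper around any bucket check. This is a hypothetical helper (`check_limit_fail_open` is not part of the gateway); exceptions stand in for shared-dict failures here:

```python
def check_limit_fail_open(bucket_check, logger=print) -> bool:
    """Run a rate-limit check so that storage errors never block traffic.
    bucket_check is any callable returning True (allow) or False (reject)."""
    try:
        return bucket_check()
    except Exception as exc:
        # Storage failure: allow the request, leave state untouched, log for metrics.
        logger(f"rate-limit storage error, allowing request: {exc}")
        return True
```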
## Tuning

| Scenario | Recommendation |
|---|---|
| Interactive API (users expect burst then steady) | `burst = 5 × tokens_per_second` |
| Strict steady-rate enforcement | `burst = tokens_per_second` |
| LLM tool calls (each call is expensive) | `burst = tokens_per_second`; set `cost_source` to a per-call cost header |
| Background batch job (burst allowed at start) | `burst = 10 × tokens_per_second` |
## Examples

### Global rate limit with IP partitioning

```json
{
  "name": "global-rps",
  "limit_keys": ["ip:address"],
  "algorithm": "token_bucket",
  "algorithm_config": {
    "rps": 5,
    "burst": 10
  }
}
```
Each unique IP address gets its own 10-token bucket that refills at 5 tokens per second.
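Per-key partitioning can be sketched as a map from limit key to bucket state. The class name `KeyedBuckets` is invented for illustration; it models one independent bucket per key, as the example above describes:

```python
class KeyedBuckets:
    """One independent lazy-refill bucket per limit key (e.g. per client IP)."""

    def __init__(self, tokens_per_second: float, burst: float):
        self.rate = tokens_per_second
        self.burst = burst
        self.buckets: dict[str, tuple[float, float]] = {}  # key -> (tokens, last_refill)

    def allow(self, key: str, now: float, cost: float = 1) -> bool:
        # An unseen key starts with a full bucket.
        tokens, last = self.buckets.get(key, (self.burst, now))
        tokens = min(tokens + max(0.0, now - last) * self.rate, self.burst)
        if tokens >= cost:
            self.buckets[key] = (tokens - cost, now)
            return True
        self.buckets[key] = (tokens, now)
        return False
```

Exhausting one key's bucket leaves every other key's bucket untouched, which is exactly the partitioning behavior `limit_keys` provides.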
### Per-API-key tiered limits

```json
{
  "rules": [
    {
      "name": "enterprise",
      "limit_keys": ["header:x-api-key"],
      "algorithm": "token_bucket",
      "algorithm_config": { "rps": 1000, "burst": 2000 },
      "match": { "jwt:plan": "enterprise" }
    },
    {
      "name": "free",
      "limit_keys": ["header:x-api-key"],
      "algorithm": "token_bucket",
      "algorithm_config": { "rps": 10, "burst": 20 }
    }
  ]
}
```