# LLM Token Limiter

`token_bucket_llm` is the algorithm for token-based LLM governance: it enforces per-minute (TPM) and optional per-day (TPD) token budgets per limiter key, reserves tokens pessimistically at request time, and refunds unused tokens after the response. Use it on OpenAI-compatible endpoints to control cost by tokens rather than by request count.
## How it works

At decision time, the limiter runs in this order:

- Estimate prompt tokens
- Determine completion reservation
- Compute `estimated_total = prompt_tokens + reserved_completion`
- Enforce prompt/request caps
- Consume TPM budget
- Consume TPD budget (if configured)
- Allow or reject
- Reconcile later (refund unused tokens)
Ordering detail: TPM is checked first. If TPM passes but TPD fails, the TPM reservation is refunded immediately — you do not leak minute budget on a TPD rejection.
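The TPM-before-TPD ordering and the immediate refund can be sketched as follows. This is an illustrative model (plain dicts standing in for shared-dict counters; names are hypothetical, not the actual implementation):

```python
# Illustrative sketch: TPM is consumed first; if TPD then fails, the TPM
# reservation is refunded so minute budget is not leaked on a TPD rejection.
def check_budgets(estimated_total, tpm_bucket, tpd_counter, tpd_limit):
    """Return (allowed, reason). Dicts stand in for shared-dict state."""
    if tpm_bucket["tokens"] < estimated_total:
        return False, "tpm_exceeded"
    tpm_bucket["tokens"] -= estimated_total          # consume TPM first
    if tpd_limit is not None:
        if tpd_counter["used"] + estimated_total > tpd_limit:
            tpm_bucket["tokens"] += estimated_total  # refund TPM on TPD reject
            return False, "tpd_exceeded"
        tpd_counter["used"] += estimated_total
    return True, None
```

Note that a TPD rejection leaves the TPM bucket exactly as it was before the check.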
## Reservation model

The limiter is intentionally pessimistic before upstream generation starts:

```
reserved_completion = request.max_tokens (if present and > 0)
                      else default_max_completion

if max_completion_tokens is set:
    reserved_completion = min(reserved_completion, max_completion_tokens)

estimated_total = prompt_tokens + reserved_completion
```
Why this matters:
- You block runaway requests before they hit the model
- You keep budgets safe even when actual completion size is unknown
- Reconciliation later corrects over-reservation
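The reservation pseudocode above translates directly; this is a minimal sketch (function name and signature are illustrative):

```python
# Sketch of the pessimistic reservation: take the request's max_tokens if
# usable, else the configured default, then clamp to max_completion_tokens.
def reserve(prompt_tokens, request_max_tokens, default_max_completion,
            max_completion_tokens=None):
    if request_max_tokens and request_max_tokens > 0:
        reserved = request_max_tokens
    else:
        reserved = default_max_completion
    if max_completion_tokens is not None:
        reserved = min(reserved, max_completion_tokens)
    return prompt_tokens + reserved  # estimated_total
```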
## Prompt token estimation

Supported estimators:

- `simple_word` (default)
- `header_hint`

### simple_word

- Fast heuristic: approximately `ceil(chars / 4)`
- If the request body has a `"messages"` array, it prefers `content` fields
- Body scan is capped at 1 MiB for hot-path performance

### header_hint

- Reads `X-Token-Estimate` from request headers (case-insensitive)
- If the header is absent or invalid, falls back to `simple_word`
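The two estimators can be sketched roughly like this (a simplified model: the real `simple_word` prefers `content` fields of a `messages` array, which is omitted here):

```python
import math

MAX_SCAN = 1024 * 1024  # body scan capped at 1 MiB

def simple_word(body: str) -> int:
    """~4 characters per token heuristic over a bounded body prefix."""
    return math.ceil(len(body[:MAX_SCAN]) / 4)

def header_hint(headers: dict, body: str) -> int:
    """Use X-Token-Estimate (case-insensitive) if valid, else fall back."""
    for name, value in headers.items():
        if name.lower() == "x-token-estimate":
            try:
                return int(value)
            except ValueError:
                break  # invalid header value: fall back
    return simple_word(body)
```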
## State

State lives in `ngx.shared.fairvisor_counters`.

Key format:

- TPM: `tpm:{limit_key}`
- TPD: `tpd:{limit_key}:{YYYYMMDD}` (UTC date key)

Reset behavior:

- TPM refills continuously (token-bucket semantics)
- TPD resets at midnight UTC
- On TPD rejection, `Retry-After` is the number of seconds until the next UTC midnight
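The TPD key layout and `Retry-After` computation follow directly from the rules above; a sketch (helper names are illustrative):

```python
from datetime import datetime, timedelta, timezone

def tpd_key(limit_key: str, now: datetime) -> str:
    """Build the per-day counter key with a UTC date suffix."""
    return f"tpd:{limit_key}:{now.astimezone(timezone.utc):%Y%m%d}"

def seconds_until_utc_midnight(now: datetime) -> int:
    """Retry-After value for a TPD rejection."""
    now = now.astimezone(timezone.utc)
    midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0)
    return int((midnight - now).total_seconds())
```

Because the date is baked into the key, "reset at midnight UTC" is just the old key expiring from use; no explicit reset pass is needed.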
## Configuration

```json
{
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 120000,
    "tokens_per_day": 2000000,
    "burst_tokens": 120000,
    "max_tokens_per_request": 8192,
    "max_prompt_tokens": 4096,
    "max_completion_tokens": 4096,
    "default_max_completion": 1000,
    "token_source": {
      "estimator": "header_hint"
    },
    "streaming": {
      "enabled": true,
      "enforce_mid_stream": true,
      "buffer_tokens": 100,
      "on_limit_exceeded": "graceful_close",
      "include_partial_usage": true
    }
  }
}
```
| Field | Required | Default | Rules / notes |
|---|---|---|---|
| `tokens_per_minute` | yes | - | Positive number |
| `tokens_per_day` | no | unset | Positive number |
| `burst_tokens` | no | `tokens_per_minute` | Must be >= `tokens_per_minute` |
| `max_tokens_per_request` | no | unset | Positive number |
| `max_prompt_tokens` | no | unset | Positive number |
| `max_completion_tokens` | no | unset | Positive number; clamps requested `max_tokens` |
| `default_max_completion` | no | `1000` | Used when request has no usable `max_tokens` |
| `token_source.estimator` | no | `simple_word` | `simple_word`, `header_hint` |
| `streaming.*` | no | enabled defaults | Mid-stream SSE enforcement controls |
## Reconciliation (refund unused tokens)

After response completion, Fairvisor computes:

```
refund = estimated_total - actual_total
```

If `refund > 0`, it credits tokens back to TPM and TPD.
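The refund rule, sketched (nonnegative credit only, since an under-reservation cannot claw tokens back):

```python
# Refund is the positive gap between reservation and actual usage.
def compute_refund(estimated_total: int, actual_total: int) -> int:
    return max(0, estimated_total - actual_total)
```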
### Non-streaming responses

Actual usage is extracted from the response JSON (typically `usage.*`).

If usage cannot be extracted (parse errors, oversized body, missing usage path), the runtime falls back safely and marks the fallback in internal accounting instead of failing the request.
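A hedged sketch of that fallback behavior, assuming an OpenAI-style `usage.total_tokens` field (names and the exact fallback policy are illustrative):

```python
import json

def extract_actual_total(body: bytes, estimated_total: int):
    """Return (actual_total, used_fallback); never raise on bad bodies."""
    try:
        usage = json.loads(body).get("usage") or {}
        total = usage.get("total_tokens")
        if isinstance(total, int) and total >= 0:
            return total, False
    except (ValueError, AttributeError):
        pass  # parse error or non-object JSON
    # Safe fallback: charge the full reservation (no refund) and flag it.
    return estimated_total, True
```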
### Streaming responses

For SSE flows, reconciliation occurs in the streaming body-filter completion logic. See Streaming Enforcement.
## Rejection reasons

- `prompt_tokens_exceeded` — prompt estimate is above `max_prompt_tokens`
- `max_tokens_per_request_exceeded` — prompt + reserved completion exceeds `max_tokens_per_request`
- `tpm_exceeded` — per-minute budget exhausted
- `tpd_exceeded` — per-day budget exhausted
## Response headers

On allowed requests (in enforce mode):

```
RateLimit-Limit: 120000
RateLimit-Remaining: 87432
RateLimit-Reset: 23
```

On rejection:

```
HTTP 429 Too Many Requests
Retry-After: 23
X-Fairvisor-Reason: tpm_exceeded
```

For a TPD rejection, `Retry-After` is the number of seconds until the next UTC midnight:

```
HTTP 429 Too Many Requests
Retry-After: 86400
X-Fairvisor-Reason: tpd_exceeded
```
## Performance notes
Hot-path design points:
- No external dependency required (shared dict only)
- O(1) budget checks per request
- No background refill timers (lazy refill on check)
- Prompt estimation avoids full JSON decode in default path
- Body scan is bounded to 1 MiB to cap CPU cost
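"Lazy refill on check" means the bucket is topped up from elapsed wall-clock time at decision time rather than by a timer. A minimal sketch (the state layout is illustrative, not the actual shared-dict encoding):

```python
# Lazy token-bucket refill: no background timer; each check credits tokens
# proportional to the time elapsed since the last refill, capped at burst.
def lazy_refill(bucket: dict, now: float, tokens_per_minute: float,
                burst_tokens: float) -> None:
    elapsed = now - bucket["last_refill"]          # seconds since last check
    bucket["tokens"] = min(
        burst_tokens,
        bucket["tokens"] + elapsed * tokens_per_minute / 60.0,
    )
    bucket["last_refill"] = now
```

This keeps each budget check O(1) with no per-key timers, at the cost of storing one timestamp alongside the token count.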
## Failure behavior
If the shared dict increment fails for TPM or TPD counters (for example, under dict memory pressure), the algorithm fails open:
- the request is allowed
- the failed budget check is skipped
- the failure is logged for metrics
Traffic is never blocked due to storage failure.
## Tuning

- Start with a realistic TPM from your provider plan and expected concurrency
- Keep `burst_tokens` near TPM unless you need short spikes
- Set `max_tokens_per_request` as a hard safety rail against prompt-injection/runaway tools
- Set `max_completion_tokens` to cap tail latency and cost
- Use `header_hint` only if a trusted upstream provides a reliable token estimate
- Monitor the distribution of rejection reasons (`tpm_exceeded` vs `tpd_exceeded`) and adjust budgets
- Validate estimator quality by comparing reserved vs actual usage over production traffic
## Example

```json
{
  "name": "chat-llm-budget",
  "limit_keys": ["jwt:org_id"],
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 60000,
    "tokens_per_day": 1200000,
    "burst_tokens": 60000,
    "max_prompt_tokens": 12000,
    "max_completion_tokens": 1500,
    "max_tokens_per_request": 13000,
    "default_max_completion": 800,
    "token_source": { "estimator": "simple_word" }
  }
}
```
This gives each org:
- steady 60k TPM
- 1.2M TPD ceiling
- hard per-request cap to protect from extreme prompts/completions