LLM Token Limiter

token_bucket_llm is the algorithm for token-based LLM governance: it enforces per-minute (TPM) and optional per-day (TPD) token budgets per limiter key, reserves tokens pessimistically at request time, and refunds unused tokens after the response. Use it on OpenAI-compatible endpoints to control cost by tokens rather than request count.

How it works

At decision time, the limiter runs in this order:

  1. Estimate prompt tokens
  2. Determine completion reservation
  3. Compute estimated_total = prompt_tokens + reserved_completion
  4. Enforce prompt/request caps
  5. Consume TPM budget
  6. Consume TPD budget (if configured)
  7. Allow or reject
  8. Reconcile later (refund unused tokens)
ℹ️ Ordering detail: TPM is checked first. If TPM passes but TPD fails, the TPM reservation is refunded immediately — you do not leak minute budget on a TPD rejection.
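The check-and-refund ordering can be sketched like this (a minimal illustration over plain in-memory counters; check_budgets and the dict-based counters are hypothetical stand-ins, not Fairvisor's actual shared-dict runtime):

```python
def check_budgets(tpm, tpd, estimated_total):
    """tpm/tpd are dicts with a 'remaining' field; tpd may be None.

    Returns (allowed, rejection_reason)."""
    # Step 5: consume TPM first.
    if tpm["remaining"] < estimated_total:
        return False, "tpm_exceeded"
    tpm["remaining"] -= estimated_total

    # Step 6: consume TPD only if configured.
    if tpd is not None:
        if tpd["remaining"] < estimated_total:
            # TPD failed after TPM succeeded: refund TPM immediately,
            # so a TPD rejection never leaks minute budget.
            tpm["remaining"] += estimated_total
            return False, "tpd_exceeded"
        tpd["remaining"] -= estimated_total

    return True, None

tpm = {"remaining": 1000}
tpd = {"remaining": 500}
print(check_budgets(tpm, tpd, 800))  # rejected on TPD
print(tpm["remaining"])              # 1000: minute budget was refunded
```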

Reservation model

The limiter is intentionally pessimistic before upstream generation starts.

reserved_completion = request.max_tokens (if present and > 0)
                      else default_max_completion

if max_completion_tokens is set:
  reserved_completion = min(reserved_completion, max_completion_tokens)

estimated_total = prompt_tokens + reserved_completion

Why this matters:

  • You block runaway requests before they hit the model
  • You keep budgets safe even when actual completion size is unknown
  • Reconciliation later corrects over-reservation
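The reservation pseudocode above, written out as a small sketch (function and parameter names are illustrative and mirror the config fields):

```python
def reserve(prompt_tokens, request_max_tokens=None,
            default_max_completion=1000, max_completion_tokens=None):
    """Return estimated_total for a request, reserving pessimistically."""
    # Prefer the request's own max_tokens when it is present and positive.
    if request_max_tokens and request_max_tokens > 0:
        reserved_completion = request_max_tokens
    else:
        reserved_completion = default_max_completion

    # An operator-configured ceiling clamps the reservation.
    if max_completion_tokens is not None:
        reserved_completion = min(reserved_completion, max_completion_tokens)

    return prompt_tokens + reserved_completion

print(reserve(500, request_max_tokens=8000, max_completion_tokens=4096))  # 4596
print(reserve(500))                                                       # 1500
```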

Prompt token estimation

Supported estimators:

  • simple_word (default)
  • header_hint

simple_word

  • Fast heuristic: approximately ceil(chars / 4)
  • If request body has a "messages" array, it prefers content fields
  • Body scan is capped at 1 MiB for hot-path performance

header_hint

  • Reads X-Token-Estimate from request headers (case-insensitive)
  • If header is absent/invalid, falls back to simple_word
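Both estimators can be approximated as follows (a sketch of the stated rules; for clarity, simple_word here does a full JSON decode, whereas the real hot path avoids one, and the helper names are illustrative):

```python
import json
import math

MAX_SCAN_BYTES = 1024 * 1024  # body scan capped at 1 MiB

def simple_word(body: bytes) -> int:
    """Approximate token count as ceil(chars / 4)."""
    body = body[:MAX_SCAN_BYTES]
    try:
        data = json.loads(body)
        # Prefer the content fields of a chat "messages" array.
        if isinstance(data, dict) and isinstance(data.get("messages"), list):
            text = "".join(str(m.get("content", "")) for m in data["messages"])
            return math.ceil(len(text) / 4)
    except (json.JSONDecodeError, UnicodeDecodeError):
        pass
    return math.ceil(len(body) / 4)

def header_hint(headers: dict, body: bytes) -> int:
    """Use X-Token-Estimate when present and valid, else fall back."""
    for name, value in headers.items():
        if name.lower() == "x-token-estimate":
            try:
                n = int(value)
                if n > 0:
                    return n
            except (TypeError, ValueError):
                pass
    return simple_word(body)
```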

State

State lives in ngx.shared.fairvisor_counters.

Key format:

  • TPM: tpm:{limit_key}
  • TPD: tpd:{limit_key}:{YYYYMMDD} (UTC date key)

Reset behavior:

  • TPM refills continuously (token-bucket semantics)
  • TPD resets at midnight UTC
  • On TPD rejection, Retry-After is seconds until next UTC midnight
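The day key and the TPD Retry-After value can be sketched as (function names are illustrative):

```python
import datetime

def tpd_key(limit_key: str, now: datetime.datetime) -> str:
    # The day key uses the UTC calendar date.
    return f"tpd:{limit_key}:{now:%Y%m%d}"

def seconds_until_utc_midnight(now: datetime.datetime) -> int:
    tomorrow = (now + datetime.timedelta(days=1)).date()
    midnight = datetime.datetime.combine(
        tomorrow, datetime.time.min, tzinfo=datetime.timezone.utc)
    return int((midnight - now).total_seconds())

now = datetime.datetime(2025, 6, 1, 23, 59, 37, tzinfo=datetime.timezone.utc)
print(tpd_key("org_42", now))           # tpd:org_42:20250601
print(seconds_until_utc_midnight(now))  # 23
```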

Configuration

{
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 120000,
    "tokens_per_day": 2000000,
    "burst_tokens": 120000,
    "max_tokens_per_request": 8192,
    "max_prompt_tokens": 4096,
    "max_completion_tokens": 4096,
    "default_max_completion": 1000,
    "token_source": {
      "estimator": "header_hint"
    },
    "streaming": {
      "enabled": true,
      "enforce_mid_stream": true,
      "buffer_tokens": 100,
      "on_limit_exceeded": "graceful_close",
      "include_partial_usage": true
    }
  }
}
Field                   Required  Default            Rules / notes
tokens_per_minute       yes       -                  Positive number
tokens_per_day          no        unset              Positive number
burst_tokens            no        tokens_per_minute  Must be >= tokens_per_minute
max_tokens_per_request  no        unset              Positive number
max_prompt_tokens       no        unset              Positive number
max_completion_tokens   no        unset              Positive number; clamps requested max_tokens
default_max_completion  no        1000               Used when request has no usable max_tokens
token_source.estimator  no        simple_word        simple_word, header_hint
streaming.*             no        enabled defaults   Mid-stream SSE enforcement controls
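The rules in the table can be sketched as a small validator (an illustration of the table's constraints only; validate is a hypothetical helper, not the runtime's actual config loader):

```python
def validate(cfg: dict) -> list:
    """Return a list of validation errors for algorithm_config."""
    errors = []
    tpm = cfg.get("tokens_per_minute")
    if not isinstance(tpm, (int, float)) or tpm <= 0:
        errors.append("tokens_per_minute must be a positive number")
        return errors

    # burst_tokens defaults to tokens_per_minute and may not be below it.
    if cfg.get("burst_tokens", tpm) < tpm:
        errors.append("burst_tokens must be >= tokens_per_minute")

    # Optional fields must be positive when set.
    for field in ("tokens_per_day", "max_tokens_per_request",
                  "max_prompt_tokens", "max_completion_tokens"):
        value = cfg.get(field)
        if value is not None and value <= 0:
            errors.append(f"{field} must be a positive number")
    return errors

print(validate({"tokens_per_minute": 120000, "burst_tokens": 60000}))
```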

Reconciliation (refund unused tokens)

After the response completes, Fairvisor computes:

refund = estimated_total - actual_total

If refund > 0, it credits tokens back to TPM and TPD.

Non-streaming responses

Actual usage is extracted from response JSON (typically usage.*).

If usage cannot be extracted (parse errors, oversized body, missing usage path), the runtime falls back safely and records the fallback in internal accounting instead of failing the request.
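A sketch of safe-fallback reconciliation (usage.total_tokens is assumed here as the usage path, and credit is a hypothetical callback that returns tokens to the TPM/TPD counters):

```python
import json

def reconcile(estimated_total, response_body: bytes, credit):
    """Compute refund = estimated_total - actual_total and credit it back.

    Returns (refund, used_fallback)."""
    try:
        usage = json.loads(response_body).get("usage", {})
        actual_total = int(usage["total_tokens"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Parse error or missing usage path: fall back safely; the request
        # is never failed, the skipped reconciliation is just recorded.
        return 0, True

    refund = estimated_total - actual_total
    if refund > 0:
        # Only positive refunds are credited back to TPM/TPD.
        credit(refund)
    return refund, False
```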

Streaming responses

For SSE flows, reconciliation occurs in streaming body-filter completion logic. See Streaming Enforcement.

Rejection reasons

  • prompt_tokens_exceeded — prompt estimate is above max_prompt_tokens
  • max_tokens_per_request_exceeded — prompt + reserved completion exceeds max_tokens_per_request
  • tpm_exceeded — per-minute budget exhausted
  • tpd_exceeded — per-day budget exhausted

Response headers

On allowed requests (in enforce mode):

RateLimit-Limit: 120000
RateLimit-Remaining: 87432
RateLimit-Reset: 23

On rejection:

HTTP 429 Too Many Requests
Retry-After: 23
X-Fairvisor-Reason: tpm_exceeded

For TPD rejection, Retry-After is seconds until next UTC midnight:

HTTP 429 Too Many Requests
Retry-After: 86400
X-Fairvisor-Reason: tpd_exceeded

Performance notes

Hot-path design points:

  • No external dependency required (shared dict only)
  • O(1) budget checks per request
  • No background refill timers (lazy refill on check)
  • Prompt estimation avoids full JSON decode in default path
  • Body scan is bounded to 1 MiB to cap CPU cost
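The lazy-refill design point can be sketched as a classic token bucket that refills on each check, with no timers (an in-memory stand-in for a shared-dict entry; names are illustrative):

```python
import time

def tpm_allow(state, cost, rate_per_minute, burst, now=None):
    """O(1) lazy-refill token-bucket check for one limiter key."""
    now = time.monotonic() if now is None else now
    last = state.setdefault("last", now)
    tokens = state.setdefault("tokens", burst)

    # Refill lazily from the elapsed time since the previous check.
    tokens = min(burst, tokens + (now - last) * rate_per_minute / 60.0)
    state["last"] = now

    if tokens < cost:
        state["tokens"] = tokens
        return False
    state["tokens"] = tokens - cost
    return True

state = {}
print(tpm_allow(state, 60000, 60000, 60000, now=0.0))  # True: burst covers it
print(tpm_allow(state, 1000, 60000, 60000, now=0.0))   # False: bucket empty
print(tpm_allow(state, 1000, 60000, 60000, now=1.0))   # True: 1 s refills 1000
```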

Failure behavior

If the shared dict increment fails for TPM or TPD counters (for example, under dict memory pressure), the algorithm fails open:

  • the request is allowed
  • the failed budget check is skipped
  • the failure is logged for metrics

Traffic is never blocked due to storage failure.

Tuning

  1. Start with a realistic TPM from your provider plan and expected concurrency
  2. Keep burst_tokens near TPM unless you need short spikes
  3. Set max_tokens_per_request as a hard safety rail against prompt injection and runaway tools
  4. Set max_completion_tokens to cap tail latency and cost
  5. Use header_hint only if a trusted upstream provides a reliable token estimate
  6. Monitor the distribution of rejection reasons (tpm_exceeded vs tpd_exceeded) and adjust budgets
  7. Validate estimator quality by comparing reserved vs actual usage over production traffic

Example

{
  "name": "chat-llm-budget",
  "limit_keys": ["jwt:org_id"],
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 60000,
    "tokens_per_day": 1200000,
    "burst_tokens": 60000,
    "max_prompt_tokens": 12000,
    "max_completion_tokens": 1500,
    "max_tokens_per_request": 13000,
    "default_max_completion": 800,
    "token_source": { "estimator": "simple_word" }
  }
}

This gives each org:

  • steady 60k TPM
  • 1.2M TPD ceiling
  • hard per-request cap to protect from extreme prompts/completions