LLM Token Limiter

token_bucket_llm is the algorithm for token-based LLM governance: it enforces per-minute (TPM) and optional per-day (TPD) token budgets per limiter key, reserves tokens pessimistically at request time, and refunds unused tokens after the response. Use it on OpenAI-compatible endpoints to control cost by tokens rather than request count.

How it works

At decision time, the limiter runs in this order:

  1. Estimate prompt tokens
  2. Determine completion reservation
  3. Compute estimated_total = prompt_tokens + reserved_completion
  4. Enforce prompt/request caps
  5. Consume TPM budget
  6. Consume TPD budget (if configured)
  7. Allow or reject
  8. Reconcile later (refund unused tokens)
ℹ️ Ordering detail: TPM is checked first. If TPM passes but TPD fails, the TPM reservation is refunded immediately — you do not leak minute budget on a TPD rejection.
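The check-and-refund ordering can be sketched like this (a minimal illustration over plain in-memory counters; check_budgets and the dict-based counters are hypothetical stand-ins, not Fairvisor's actual shared-dict runtime):

```python
def check_budgets(tpm, tpd, estimated_total):
    """tpm/tpd are dicts with a 'remaining' field; tpd may be None.

    Returns (allowed, rejection_reason)."""
    # Step 5: consume TPM first.
    if tpm["remaining"] < estimated_total:
        return False, "tpm_exceeded"
    tpm["remaining"] -= estimated_total

    # Step 6: consume TPD only if configured.
    if tpd is not None:
        if tpd["remaining"] < estimated_total:
            # TPD failed after TPM succeeded: refund TPM immediately,
            # so a TPD rejection never leaks minute budget.
            tpm["remaining"] += estimated_total
            return False, "tpd_exceeded"
        tpd["remaining"] -= estimated_total

    return True, None

tpm = {"remaining": 1000}
tpd = {"remaining": 500}
print(check_budgets(tpm, tpd, 800))  # rejected on TPD
print(tpm["remaining"])              # 1000: minute budget was refunded
```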

Reservation model

The limiter is intentionally pessimistic before upstream generation starts.

reserved_completion = request.max_tokens (if present and > 0)
                      else default_max_completion

if max_completion_tokens is set:
  reserved_completion = min(reserved_completion, max_completion_tokens)

estimated_total = prompt_tokens + reserved_completion

Why this matters:

  • You block runaway requests before they hit the model
  • You keep budgets safe even when actual completion size is unknown
  • Reconciliation later corrects over-reservation
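The reservation pseudocode above, written out as a small sketch (function and parameter names are illustrative and mirror the config fields):

```python
def reserve(prompt_tokens, request_max_tokens=None,
            default_max_completion=1000, max_completion_tokens=None):
    """Return estimated_total for a request, reserving pessimistically."""
    # Prefer the request's own max_tokens when it is present and positive.
    if request_max_tokens and request_max_tokens > 0:
        reserved_completion = request_max_tokens
    else:
        reserved_completion = default_max_completion

    # An operator-configured ceiling clamps the reservation.
    if max_completion_tokens is not None:
        reserved_completion = min(reserved_completion, max_completion_tokens)

    return prompt_tokens + reserved_completion

print(reserve(500, request_max_tokens=8000, max_completion_tokens=4096))  # 4596
print(reserve(500))                                                       # 1500
```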

Prompt token estimation

Supported estimators:

  • simple_word (default)
  • header_hint

simple_word

  • Fast heuristic: approximately ceil(chars / 4)
  • If request body has a "messages" array, it prefers content fields
  • Body scan is capped at 1 MiB for hot-path performance

header_hint

  • Reads X-Token-Estimate from request headers (case-insensitive)
  • If header is absent/invalid, falls back to simple_word
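Both estimators can be approximated as follows (a sketch of the stated rules; for clarity, simple_word here does a full JSON decode, whereas the real hot path avoids one, and the helper names are illustrative):

```python
import json
import math

MAX_SCAN_BYTES = 1024 * 1024  # body scan capped at 1 MiB

def simple_word(body: bytes) -> int:
    """Approximate token count as ceil(chars / 4)."""
    body = body[:MAX_SCAN_BYTES]
    try:
        data = json.loads(body)
        # Prefer the content fields of a chat "messages" array.
        if isinstance(data, dict) and isinstance(data.get("messages"), list):
            text = "".join(str(m.get("content", "")) for m in data["messages"])
            return math.ceil(len(text) / 4)
    except (json.JSONDecodeError, UnicodeDecodeError):
        pass
    return math.ceil(len(body) / 4)

def header_hint(headers: dict, body: bytes) -> int:
    """Use X-Token-Estimate when present and valid, else fall back."""
    for name, value in headers.items():
        if name.lower() == "x-token-estimate":
            try:
                n = int(value)
                if n > 0:
                    return n
            except (TypeError, ValueError):
                pass
    return simple_word(body)
```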

State

State lives in ngx.shared.fairvisor_counters.

Key format:

  • TPM: tpm:{limit_key}
  • TPD: tpd:{limit_key}:{YYYYMMDD} (UTC date key)

Reset behavior:

  • TPM refills continuously (token-bucket semantics)
  • TPD resets at midnight UTC
  • On TPD rejection, Retry-After is seconds until next UTC midnight
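The day key and the TPD Retry-After value can be sketched as (function names are illustrative):

```python
import datetime

def tpd_key(limit_key: str, now: datetime.datetime) -> str:
    # The day key uses the UTC calendar date.
    return f"tpd:{limit_key}:{now:%Y%m%d}"

def seconds_until_utc_midnight(now: datetime.datetime) -> int:
    tomorrow = (now + datetime.timedelta(days=1)).date()
    midnight = datetime.datetime.combine(
        tomorrow, datetime.time.min, tzinfo=datetime.timezone.utc)
    return int((midnight - now).total_seconds())

now = datetime.datetime(2025, 6, 1, 23, 59, 37, tzinfo=datetime.timezone.utc)
print(tpd_key("org_42", now))           # tpd:org_42:20250601
print(seconds_until_utc_midnight(now))  # 23
```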

Configuration

{
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 120000,
    "tokens_per_day": 2000000,
    "burst_tokens": 120000,
    "max_tokens_per_request": 8192,
    "max_prompt_tokens": 4096,
    "max_completion_tokens": 4096,
    "default_max_completion": 1000,
    "token_source": {
      "estimator": "header_hint"
    },
    "streaming": {
      "enabled": true,
      "enforce_mid_stream": true,
      "buffer_tokens": 100,
      "on_limit_exceeded": "graceful_close",
      "include_partial_usage": true
    }
  }
}
Field                   Required  Default            Rules / notes
tokens_per_minute       yes       -                  Positive number
tokens_per_day          no        unset              Positive number
burst_tokens            no        tokens_per_minute  Must be >= tokens_per_minute
max_tokens_per_request  no        unset              Positive number
max_prompt_tokens       no        unset              Positive number
max_completion_tokens   no        unset              Positive number; clamps requested max_tokens
default_max_completion  no        1000               Used when request has no usable max_tokens
token_source.estimator  no        simple_word        simple_word, header_hint
streaming.*             no        enabled defaults   Mid-stream SSE enforcement controls
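The rules in the table can be sketched as a small validator (an illustration of the table's constraints only; validate is a hypothetical helper, not the runtime's actual config loader):

```python
def validate(cfg: dict) -> list:
    """Return a list of validation errors for algorithm_config."""
    errors = []
    tpm = cfg.get("tokens_per_minute")
    if not isinstance(tpm, (int, float)) or tpm <= 0:
        errors.append("tokens_per_minute must be a positive number")
        return errors

    # burst_tokens defaults to tokens_per_minute and may not be below it.
    if cfg.get("burst_tokens", tpm) < tpm:
        errors.append("burst_tokens must be >= tokens_per_minute")

    # Optional fields must be positive when set.
    for field in ("tokens_per_day", "max_tokens_per_request",
                  "max_prompt_tokens", "max_completion_tokens"):
        value = cfg.get(field)
        if value is not None and value <= 0:
            errors.append(f"{field} must be a positive number")
    return errors

print(validate({"tokens_per_minute": 120000, "burst_tokens": 60000}))
```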

Reconciliation (refund unused tokens)

After the response completes, Fairvisor computes:

refund = estimated_total - actual_total

If refund > 0, it credits tokens back to TPM and TPD.

Non-streaming responses

Actual usage is extracted from response JSON (typically usage.*).

If usage cannot be extracted (parse errors, oversized body, missing usage path), the runtime falls back safely and records the fallback in internal accounting instead of failing the request.
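A sketch of safe-fallback reconciliation (usage.total_tokens is assumed here as the usage path, and credit is a hypothetical callback that returns tokens to the TPM/TPD counters):

```python
import json

def reconcile(estimated_total, response_body: bytes, credit):
    """Compute refund = estimated_total - actual_total and credit it back.

    Returns (refund, used_fallback)."""
    try:
        usage = json.loads(response_body).get("usage", {})
        actual_total = int(usage["total_tokens"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Parse error or missing usage path: fall back safely; the request
        # is never failed, the skipped reconciliation is just recorded.
        return 0, True

    refund = estimated_total - actual_total
    if refund > 0:
        # Only positive refunds are credited back to TPM/TPD.
        credit(refund)
    return refund, False
```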

Streaming responses

For SSE flows, reconciliation occurs in streaming body-filter completion logic. See Streaming Enforcement.

Rejection reasons

  • prompt_tokens_exceeded — prompt estimate is above max_prompt_tokens
  • max_tokens_per_request_exceeded — prompt + reserved completion exceeds max_tokens_per_request
  • tpm_exceeded — per-minute budget exhausted
  • tpd_exceeded — per-day budget exhausted

Response headers

On allowed requests (in enforce mode):

RateLimit-Limit: 120000
RateLimit-Remaining: 87432
RateLimit-Reset: 23

On rejection:

HTTP 429 Too Many Requests
Retry-After: 23
X-Fairvisor-Reason: tpm_exceeded

For TPD rejection, Retry-After is seconds until next UTC midnight:

HTTP 429 Too Many Requests
Retry-After: 86400
X-Fairvisor-Reason: tpd_exceeded

Performance notes

Hot-path design points:

  • No external dependency required (shared dict only)
  • O(1) budget checks per request
  • No background refill timers (lazy refill on check)
  • Prompt estimation avoids full JSON decode in default path
  • Body scan is bounded to 1 MiB to cap CPU cost
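The lazy-refill design point can be sketched as a classic token bucket that refills on each check, with no timers (an in-memory stand-in for a shared-dict entry; names are illustrative):

```python
import time

def tpm_allow(state, cost, rate_per_minute, burst, now=None):
    """O(1) lazy-refill token-bucket check for one limiter key."""
    now = time.monotonic() if now is None else now
    last = state.setdefault("last", now)
    tokens = state.setdefault("tokens", burst)

    # Refill lazily from the elapsed time since the previous check.
    tokens = min(burst, tokens + (now - last) * rate_per_minute / 60.0)
    state["last"] = now

    if tokens < cost:
        state["tokens"] = tokens
        return False
    state["tokens"] = tokens - cost
    return True

state = {}
print(tpm_allow(state, 60000, 60000, 60000, now=0.0))  # True: burst covers it
print(tpm_allow(state, 1000, 60000, 60000, now=0.0))   # False: bucket empty
print(tpm_allow(state, 1000, 60000, 60000, now=1.0))   # True: 1 s refills 1000
```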

Failure behavior

If the shared dict increment fails for TPM or TPD counters (for example, under dict memory pressure), the algorithm fails open:

  • the request is allowed
  • the failed budget check is skipped
  • the failure is logged for metrics

Traffic is never blocked due to storage failure.

Tuning

  1. Start with a realistic TPM from your provider plan and expected concurrency
  2. Keep burst_tokens near TPM unless you need short spikes
  3. Set max_tokens_per_request as a hard safety rail against prompt injection and runaway tools
  4. Set max_completion_tokens to cap tail latency and cost
  5. Use header_hint only if a trusted upstream provides a reliable token estimate
  6. Monitor the distribution of rejection reasons (tpm_exceeded vs tpd_exceeded) and adjust budgets
  7. Validate estimator quality by comparing reserved vs actual usage over production traffic

Example

{
  "name": "chat-llm-budget",
  "limit_keys": ["jwt:org_id"],
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 60000,
    "tokens_per_day": 1200000,
    "burst_tokens": 60000,
    "max_prompt_tokens": 12000,
    "max_completion_tokens": 1500,
    "max_tokens_per_request": 13000,
    "default_max_completion": 800,
    "token_source": { "estimator": "simple_word" }
  }
}

This gives each org:

  • steady 60k TPM
  • 1.2M TPD ceiling
  • hard per-request cap to protect from extreme prompts/completions