Streaming Enforcement

Fairvisor Edge can enforce token budgets mid-stream on Server-Sent Events (SSE) responses — the standard format used by OpenAI-compatible LLM APIs. This page covers how streaming enforcement works and how to configure it.

Overview

Standard rate limiting happens at request time: the edge reserves tokens and either allows or rejects the request before the response begins. For streaming responses, the actual completion length is unknown until the [DONE] event arrives.

Fairvisor Edge handles this in two phases:

  1. Reservation — at request time, reserve prompt_tokens + max_completion_tokens; reject if the budget is insufficient
  2. Mid-stream enforcement — as SSE chunks arrive in the body filter phase, count completion tokens; truncate the stream if the limit is exceeded
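The two phases above can be sketched as follows. This is an illustrative Python sketch, not Fairvisor Edge's actual internals; the function shape and names are assumptions.

```python
def reserve(bucket_balance: int, prompt_tokens: int, max_completion_tokens: int):
    """Phase 1: pessimistically reserve the worst-case token cost up front.

    Returns (new_balance, reserved) on success, or None to signal a reject
    (the edge would answer 429 before the stream starts).
    """
    reserved = prompt_tokens + max_completion_tokens
    if reserved > bucket_balance:
        return None  # budget insufficient: reject before the response begins
    return bucket_balance - reserved, reserved
```

Phase 2 then counts actual completion tokens as chunks arrive and truncates if the running count exceeds max_completion_tokens.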

Detection

The streaming path activates when any of the following is true:

  • The request includes Accept: text/event-stream
  • The JSON body contains "stream": true
  • The request_context.stream flag is set by the rule engine
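The three detection signals can be expressed as a single predicate. A minimal sketch, assuming headers arrive as a lowercase-keyed dict; the function name is illustrative:

```python
import json

def is_streaming(headers: dict, body: bytes, request_context: dict) -> bool:
    """Streaming path activates if any of the three signals is present."""
    if "text/event-stream" in headers.get("accept", ""):
        return True
    try:
        if json.loads(body).get("stream") is True:
            return True
    except (ValueError, AttributeError):
        pass  # non-JSON or non-object body: fall through to the rule-engine flag
    return bool(request_context.get("stream"))
```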

Body filter pipeline

upstream SSE chunk
  └─ body_filter_by_lua_block
       └─ buffer until complete SSE event detected
       └─ parse delta.content from JSON
       └─ accumulate token count (ceil(chars / 4))
       └─ check: tokens_used > max_completion_tokens?
           ├─ No: forward chunk unchanged
           └─ Yes: send close event; suppress remaining chunks
  └─ client receives filtered stream

The body filter runs per nginx chunk (which may span multiple SSE events). Chunks are buffered until a complete data: ...\n\n event is detected.
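The buffering and parsing steps can be sketched like this. The helper names are illustrative; the actual implementation runs in Lua inside body_filter_by_lua_block:

```python
import json

def split_sse(buffer: str):
    """Split buffered data into complete SSE events (terminated by a blank
    line) and keep the trailing incomplete event for the next chunk."""
    parts = buffer.split("\n\n")
    return parts[:-1], parts[-1]  # (complete events, remainder to re-buffer)

def delta_content(event: str) -> str:
    """Pull delta.content out of a 'data: {...}' event, if present."""
    if not event.startswith("data: ") or event == "data: [DONE]":
        return ""
    payload = json.loads(event[len("data: "):])
    choices = payload.get("choices", [])
    return choices[0].get("delta", {}).get("content", "") if choices else ""
```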

Token counting in the stream

Completion tokens are estimated as ceil(content_chars / 4). This is the same rule-of-thumb used for prompt estimation (simple_word estimator). The count accumulates across all chunks until the stream ends or is truncated.

To avoid per-chunk overhead, the budget check runs once per buffer_tokens accumulated tokens (default: 100) rather than on every chunk.
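The estimator and check cadence can be sketched together. The ceil(chars / 4) rule and buffer_tokens come from this page; the class shape is an assumption for illustration:

```python
import math

class StreamCounter:
    """Accumulates estimated completion tokens; checks the budget only
    every buffer_tokens tokens to avoid per-chunk overhead."""

    def __init__(self, max_completion_tokens: int, buffer_tokens: int = 100):
        self.max = max_completion_tokens
        self.buffer_tokens = buffer_tokens
        self.tokens_used = 0
        self._next_check = buffer_tokens

    def add(self, content: str) -> bool:
        """Return True only when a check fires and the limit is exceeded."""
        self.tokens_used += math.ceil(len(content) / 4)
        if self.tokens_used < self._next_check:
            return False  # defer the budget check until the next threshold
        self._next_check += self.buffer_tokens
        return self.tokens_used > self.max
```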

Truncation

When tokens_used > max_completion_tokens, the stream is truncated. Remaining chunks are suppressed and a synthetic final event is injected:

Graceful close (on_limit_exceeded: "graceful_close")

data: {"choices":[{"delta":{},"finish_reason":"length"}],"usage":{"prompt_tokens":52,"completion_tokens":100,"total_tokens":152}}

data: [DONE]

The client receives a well-formed stream with finish_reason: "length" — identical to what a model returns when it reaches its max_tokens limit naturally.

Error chunk (on_limit_exceeded: "error_chunk")

data: {"error":{"message":"max completion tokens exceeded","type":"rate_limit_error","code":"completion_tokens_exceeded"},"usage":{...}}

data: [DONE]

Use error_chunk when you want the client to distinguish a policy-enforced truncation from a natural length limit.
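Constructing the two synthetic close shapes shown above might look like this (a sketch; the real events are emitted from the Lua body filter):

```python
import json

def close_events(mode: str, usage: dict) -> str:
    """Build the synthetic final events for either truncation mode."""
    if mode == "graceful_close":
        payload = {"choices": [{"delta": {}, "finish_reason": "length"}],
                   "usage": usage}
    else:  # "error_chunk"
        payload = {"error": {"message": "max completion tokens exceeded",
                             "type": "rate_limit_error",
                             "code": "completion_tokens_exceeded"},
                   "usage": usage}
    return "data: " + json.dumps(payload) + "\n\ndata: [DONE]\n\n"
```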

Configuration

Streaming enforcement is configured in the algorithm_config of a token_bucket_llm rule:

{
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 120000,
    "tokens_per_day": 2000000,
    "burst_tokens": 120000,
    "max_completion_tokens": 4096,
    "streaming": {
      "enabled": true,
      "enforce_mid_stream": true,
      "buffer_tokens": 100,
      "on_limit_exceeded": "graceful_close",
      "include_partial_usage": true
    }
  }
}

Field                  Type    Default           Description
enabled                bool    true              Enable the streaming body filter.
enforce_mid_stream     bool    true              Truncate when the limit is exceeded; set to false to count only (useful with shadow mode).
buffer_tokens          int     100               How often to check the token budget (every N accumulated tokens).
on_limit_exceeded      string  "graceful_close"  "graceful_close" or "error_chunk".
include_partial_usage  bool    true              Append a usage object to the close event.

Reconciliation

When the stream ends (either naturally at [DONE] or via truncation), the actual token count is reconciled against the reservation:

refund = reserved_tokens - actual_tokens_used

The TPM and TPD buckets are refunded the difference. This keeps the running total accurate over time even though reservations are pessimistic.

For streaming, reconciliation happens when [DONE] is received in the body filter. For non-streaming, it happens immediately after the full response body is received.
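The refund arithmetic is simple enough to state directly (balances and the function shape are illustrative):

```python
def reconcile(tpm_balance: int, tpd_balance: int,
              reserved_tokens: int, actual_tokens_used: int):
    """On [DONE] or truncation, refund the unused part of the pessimistic
    reservation to both the TPM and TPD buckets."""
    refund = reserved_tokens - actual_tokens_used
    return tpm_balance + refund, tpd_balance + refund
```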

Shadow mode

In shadow mode, streaming enforcement is fully simulated:

  • Token counts accumulate normally
  • Truncation is logged as would_truncate with the token count
  • The stream is not interrupted — the client receives the full response
  • Reconciliation still runs

This lets you validate your max_completion_tokens limits against real traffic before enabling enforcement.

Partial usage headers

If include_partial_usage: true, each SSE event in the stream has a usage field appended:

data: {"id":"chatcmpl-...","choices":[...],"usage":{"prompt_tokens":52,"completion_tokens":43,"total_tokens":95}}

This is useful for client-side progress tracking and debugging.
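Injecting the running usage into an event might be done like this (a sketch; field names for the usage object match the examples on this page, the helper itself is hypothetical):

```python
import json

def append_usage(event: str, prompt_tokens: int, completion_tokens: int) -> str:
    """Append a running usage object to a 'data: {...}' SSE event."""
    payload = json.loads(event[len("data: "):])
    payload["usage"] = {"prompt_tokens": prompt_tokens,
                        "completion_tokens": completion_tokens,
                        "total_tokens": prompt_tokens + completion_tokens}
    return "data: " + json.dumps(payload)
```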

Example: streaming request flow

POST /v1/chat/completions
Authorization: Bearer eyJ...
Content-Type: application/json

{"model":"gpt-4","messages":[...],"stream":true,"max_tokens":500}
  1. Edge receives request; estimates prompt = 80 tokens
  2. max_completion = min(500, 4096) = 500
  3. Reserve 580 tokens from TPM; check TPD
  4. If denied: 429 tpm_exceeded (before stream starts)
  5. If allowed: forward request to upstream; start streaming
  6. Body filter accumulates completion tokens chunk by chunk
  7. At token 500: inject graceful close event, suppress upstream chunks
  8. Reconcile: actual = 500, reserved = 580, refund = 80 tokens to TPM/TPD
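The arithmetic in steps 1–8 can be reproduced in a few lines (a sketch of the numbers only, not the enforcement machinery):

```python
def flow(prompt_est: int, request_max_tokens: int,
         rule_max_completion: int, tpm_balance: int):
    """Walk through the reservation/refund arithmetic of the example flow."""
    max_completion = min(request_max_tokens, rule_max_completion)
    reserved = prompt_est + max_completion
    if reserved > tpm_balance:
        return "429 tpm_exceeded"          # denied before the stream starts
    actual = max_completion                # stream truncated exactly at the limit
    refund = reserved - actual             # returned to TPM/TPD on reconcile
    return {"reserved": reserved, "actual": actual, "refund": refund}
```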