Streaming Enforcement

Fairvisor Edge can enforce token budgets mid-stream on Server-Sent Events (SSE) responses — the standard format used by OpenAI-compatible LLM APIs. This page covers how streaming enforcement works and how to configure it.

Overview

Standard rate limiting happens at request time: the edge reserves tokens and either allows or rejects the request before the response begins. For streaming responses, the actual completion length is unknown until the [DONE] event arrives.

Fairvisor Edge handles this in two phases:

  1. Reservation — at request time, reserve prompt_tokens + max_completion_tokens; reject if the budget is insufficient
  2. Mid-stream enforcement — as SSE chunks arrive in the body filter phase, count completion tokens; truncate the stream if the limit is exceeded
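The two phases above can be sketched as follows. This is an illustrative Python sketch, not Fairvisor Edge's actual internals; the function shape and names are assumptions.

```python
def reserve(bucket_balance: int, prompt_tokens: int, max_completion_tokens: int):
    """Phase 1: pessimistically reserve the worst-case token cost up front.

    Returns (new_balance, reserved) on success, or None to signal a reject
    (the edge would answer 429 before the stream starts).
    """
    reserved = prompt_tokens + max_completion_tokens
    if reserved > bucket_balance:
        return None  # budget insufficient: reject before the response begins
    return bucket_balance - reserved, reserved
```

Phase 2 then counts actual completion tokens as chunks arrive and truncates if the running count exceeds max_completion_tokens.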

Detection

The streaming path activates when any of the following is true:

  • The request includes Accept: text/event-stream
  • The JSON body contains "stream": true
  • The request_context.stream flag is set by the rule engine
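The three detection signals can be expressed as a single predicate. A minimal sketch, assuming headers arrive as a lowercase-keyed dict; the function name is illustrative:

```python
import json

def is_streaming(headers: dict, body: bytes, request_context: dict) -> bool:
    """Streaming path activates if any of the three signals is present."""
    if "text/event-stream" in headers.get("accept", ""):
        return True
    try:
        if json.loads(body).get("stream") is True:
            return True
    except (ValueError, AttributeError):
        pass  # non-JSON or non-object body: fall through to the rule-engine flag
    return bool(request_context.get("stream"))
```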

Body filter pipeline

upstream SSE chunk
  └─ body_filter_by_lua_block
       └─ buffer until complete SSE event detected
       └─ parse delta.content from JSON
       └─ accumulate token count (ceil(chars / 4))
       └─ check: tokens_used > max_completion_tokens?
           ├─ No: forward chunk unchanged
           └─ Yes: send close event; suppress remaining chunks
  └─ client receives filtered stream

The body filter runs per nginx chunk (which may span multiple SSE events). Chunks are buffered until a complete data: ...\n\n event is detected.
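The buffering and parsing steps can be sketched like this. The helper names are illustrative; the actual implementation runs in Lua inside body_filter_by_lua_block:

```python
import json

def split_sse(buffer: str):
    """Split buffered data into complete SSE events (terminated by a blank
    line) and keep the trailing incomplete event for the next chunk."""
    parts = buffer.split("\n\n")
    return parts[:-1], parts[-1]  # (complete events, remainder to re-buffer)

def delta_content(event: str) -> str:
    """Pull delta.content out of a 'data: {...}' event, if present."""
    if not event.startswith("data: ") or event == "data: [DONE]":
        return ""
    payload = json.loads(event[len("data: "):])
    choices = payload.get("choices", [])
    return choices[0].get("delta", {}).get("content", "") if choices else ""
```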

Token counting in the stream

Completion tokens are estimated as ceil(content_chars / 4). This is the same rule-of-thumb used for prompt estimation (simple_word estimator). The count accumulates across all chunks until the stream ends or is truncated.

To avoid per-chunk overhead, the budget check runs once per buffer_tokens accumulated tokens (default: 100) rather than on every chunk.
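The estimator and check cadence can be sketched together. The ceil(chars / 4) rule and buffer_tokens come from this page; the class shape is an assumption for illustration:

```python
import math

class StreamCounter:
    """Accumulates estimated completion tokens; checks the budget only
    every buffer_tokens tokens to avoid per-chunk overhead."""

    def __init__(self, max_completion_tokens: int, buffer_tokens: int = 100):
        self.max = max_completion_tokens
        self.buffer_tokens = buffer_tokens
        self.tokens_used = 0
        self._next_check = buffer_tokens

    def add(self, content: str) -> bool:
        """Return True only when a check fires and the limit is exceeded."""
        self.tokens_used += math.ceil(len(content) / 4)
        if self.tokens_used < self._next_check:
            return False  # defer the budget check until the next threshold
        self._next_check += self.buffer_tokens
        return self.tokens_used > self.max
```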

Truncation

When tokens_used > max_completion_tokens, the stream is truncated. Remaining chunks are suppressed and a synthetic final event is injected:

Graceful close (on_limit_exceeded: "graceful_close")

data: {"choices":[{"delta":{},"finish_reason":"length"}],"usage":{"prompt_tokens":52,"completion_tokens":100,"total_tokens":152}}

data: [DONE]

The client receives a well-formed stream with finish_reason: "length" — identical to what a model returns when it reaches its max_tokens limit naturally.

Error chunk (on_limit_exceeded: "error_chunk")

data: {"error":{"message":"max completion tokens exceeded","type":"rate_limit_error","code":"completion_tokens_exceeded"},"usage":{...}}

data: [DONE]

Use error_chunk when you want the client to distinguish a policy-enforced truncation from a natural length limit.
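Constructing the two synthetic close shapes shown above might look like this (a sketch; the real events are emitted from the Lua body filter):

```python
import json

def close_events(mode: str, usage: dict) -> str:
    """Build the synthetic final events for either truncation mode."""
    if mode == "graceful_close":
        payload = {"choices": [{"delta": {}, "finish_reason": "length"}],
                   "usage": usage}
    else:  # "error_chunk"
        payload = {"error": {"message": "max completion tokens exceeded",
                             "type": "rate_limit_error",
                             "code": "completion_tokens_exceeded"},
                   "usage": usage}
    return "data: " + json.dumps(payload) + "\n\ndata: [DONE]\n\n"
```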

Configuration

Streaming enforcement is configured in the algorithm_config of a token_bucket_llm rule:

{
  "algorithm": "token_bucket_llm",
  "algorithm_config": {
    "tokens_per_minute": 120000,
    "tokens_per_day": 2000000,
    "burst_tokens": 120000,
    "max_completion_tokens": 4096,
    "streaming": {
      "enabled": true,
      "enforce_mid_stream": true,
      "buffer_tokens": 100,
      "on_limit_exceeded": "graceful_close",
      "include_partial_usage": true
    }
  }
}

Field                  Type    Default           Description
enabled                bool    true              Enable the streaming body filter.
enforce_mid_stream     bool    true              Truncate when the limit is exceeded; set to false to count only (useful with shadow mode).
buffer_tokens          int     100               How often to check the token budget (every N accumulated tokens).
on_limit_exceeded      string  "graceful_close"  "graceful_close" or "error_chunk".
include_partial_usage  bool    true              Append a usage object to the close event.

Reconciliation

When the stream ends (either naturally at [DONE] or via truncation), the actual token count is reconciled against the reservation:

refund = reserved_tokens - actual_tokens_used

The TPM and TPD buckets are refunded the difference. This keeps the running total accurate over time even though reservations are pessimistic.

For streaming, reconciliation happens when [DONE] is received in the body filter. For non-streaming, it happens immediately after the full response body is received.
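The refund arithmetic is simple enough to state directly (balances and the function shape are illustrative):

```python
def reconcile(tpm_balance: int, tpd_balance: int,
              reserved_tokens: int, actual_tokens_used: int):
    """On [DONE] or truncation, refund the unused part of the pessimistic
    reservation to both the TPM and TPD buckets."""
    refund = reserved_tokens - actual_tokens_used
    return tpm_balance + refund, tpd_balance + refund
```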

Shadow mode

In shadow mode, streaming enforcement is fully simulated:

  • Token counts accumulate normally
  • Truncation is logged as would_truncate with the token count
  • The stream is not interrupted — the client receives the full response
  • Reconciliation still runs

This lets you validate your max_completion_tokens limits against real traffic before enabling enforcement.

Partial usage headers

If include_partial_usage: true, each SSE event in the stream has a usage field appended:

data: {"id":"chatcmpl-...","choices":[...],"usage":{"prompt_tokens":52,"completion_tokens":43,"total_tokens":95}}

This is useful for client-side progress tracking and debugging.
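Injecting the running usage into an event might be done like this (a sketch; field names for the usage object match the examples on this page, the helper itself is hypothetical):

```python
import json

def append_usage(event: str, prompt_tokens: int, completion_tokens: int) -> str:
    """Append a running usage object to a 'data: {...}' SSE event."""
    payload = json.loads(event[len("data: "):])
    payload["usage"] = {"prompt_tokens": prompt_tokens,
                        "completion_tokens": completion_tokens,
                        "total_tokens": prompt_tokens + completion_tokens}
    return "data: " + json.dumps(payload)
```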

Example: streaming request flow

POST /v1/chat/completions
Authorization: Bearer eyJ...
Content-Type: application/json

{"model":"gpt-4","messages":[...],"stream":true,"max_tokens":500}
  1. Edge receives request; estimates prompt = 80 tokens
  2. max_completion = min(500, 4096) = 500
  3. Reserve 580 tokens from TPM; check TPD
  4. If denied: 429 tpm_exceeded (before stream starts)
  5. If allowed: forward request to upstream; start streaming
  6. Body filter accumulates completion tokens chunk by chunk
  7. At token 500: inject graceful close event, suppress upstream chunks
  8. Reconcile: actual = 500, reserved = 580, refund = 80 tokens to TPM/TPD
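The arithmetic in steps 1–8 can be reproduced in a few lines (a sketch of the numbers only, not the enforcement machinery):

```python
def flow(prompt_est: int, request_max_tokens: int,
         rule_max_completion: int, tpm_balance: int):
    """Walk through the reservation/refund arithmetic of the example flow."""
    max_completion = min(request_max_tokens, rule_max_completion)
    reserved = prompt_est + max_completion
    if reserved > tpm_balance:
        return "429 tpm_exceeded"          # denied before the stream starts
    actual = max_completion                # stream truncated exactly at the limit
    refund = reserved - actual             # returned to TPM/TPD on reconcile
    return {"reserved": reserved, "actual": actual, "refund": refund}
```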