Streaming Enforcement
Fairvisor Edge can enforce token budgets mid-stream on Server-Sent Events (SSE) responses — the standard format used by OpenAI-compatible LLM APIs. This page covers how streaming enforcement works and how to configure it.
Overview
Standard rate limiting happens at request time: the edge reserves tokens and either allows or rejects the request before the response begins. For streaming responses, the actual completion length is unknown until the [DONE] event arrives.
Fairvisor Edge handles this in two phases:
- Reservation — at request time, reserve `prompt_tokens + max_completion_tokens`; reject if the budget is insufficient
- Mid-stream enforcement — as SSE chunks arrive in the body filter phase, count completion tokens; truncate the stream if the limit is exceeded
Detection
The streaming path activates when any of the following is true:
- The request includes `Accept: text/event-stream`
- The JSON body contains `"stream": true`
- The `request_context.stream` flag is set by the rule engine
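The three detection signals above can be sketched as a single predicate. This is an illustrative Python sketch, not the edge's actual Lua implementation; the function and parameter names are assumptions:

```python
import json

def is_streaming_request(headers: dict, body: bytes, request_context: dict) -> bool:
    """Return True if any of the three streaming signals is present."""
    # 1. Accept header asks for SSE
    if "text/event-stream" in headers.get("accept", ""):
        return True
    # 2. JSON body sets "stream": true
    try:
        if json.loads(body).get("stream") is True:
            return True
    except (ValueError, TypeError, AttributeError):
        pass
    # 3. Rule engine already flagged the request
    return bool(request_context.get("stream"))
```

Any one signal is sufficient; a request with a non-JSON body and no header can still take the streaming path if the rule engine sets the flag.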
Body filter pipeline
upstream SSE chunk
└─ body_filter_by_lua_block
└─ buffer until complete SSE event detected
└─ parse delta.content from JSON
└─ accumulate token count (ceil(chars / 4))
└─ check: tokens_used > max_completion_tokens?
├─ No: forward chunk unchanged
└─ Yes: send close event; suppress remaining chunks
└─ client receives filtered stream
The body filter runs per nginx chunk (which may span multiple SSE events). Chunks are buffered until a complete `data: ...\n\n` event is detected.
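The buffering step can be illustrated with a small helper that splits a byte buffer into complete events (terminated by a blank line) and a leftover tail. A sketch only; the real filter does this in Lua inside `body_filter_by_lua_block`:

```python
def extract_events(buffer: bytes):
    """Split a buffer into complete SSE events and the partial tail.

    An event is complete once its terminating blank line (\n\n) has
    arrived; anything after the last terminator is carried forward.
    """
    events = []
    while b"\n\n" in buffer:
        event, buffer = buffer.split(b"\n\n", 1)
        events.append(event)
    return events, buffer
```

Each nginx chunk is appended to a persistent per-request buffer; only the returned complete events are parsed for `delta.content`, and the tail waits for the next chunk.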
Token counting in the stream
Completion tokens are estimated as `ceil(content_chars / 4)`. This is the same rule-of-thumb used for prompt estimation (the `simple_word` estimator). The count accumulates across all chunks until the stream ends or is truncated.
Checks run every `buffer_tokens` accumulated tokens (default 100) to avoid per-chunk overhead.
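The accumulation-and-check cadence can be sketched as follows (class and method names are hypothetical, not the edge's API):

```python
import math

class StreamCounter:
    """Accumulates estimated completion tokens (ceil(chars / 4)) and
    signals when a budget check is due (every buffer_tokens tokens)."""

    def __init__(self, buffer_tokens: int = 100):
        self.buffer_tokens = buffer_tokens
        self.chars = 0
        self.last_checked = 0

    @property
    def tokens_used(self) -> int:
        return math.ceil(self.chars / 4)

    def add(self, content: str) -> bool:
        """Add a delta.content fragment; return True when the token
        budget should be checked."""
        self.chars += len(content)
        if self.tokens_used - self.last_checked >= self.buffer_tokens:
            self.last_checked = self.tokens_used
            return True
        return False
```

With the default `buffer_tokens` of 100, the budget comparison runs roughly once per 400 characters of streamed content rather than on every chunk.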
Truncation
When tokens_used > max_completion_tokens, the stream is truncated. Remaining chunks are suppressed and a synthetic final event is injected:
Graceful close (on_limit_exceeded: "graceful_close")
data: {"choices":[{"delta":{},"finish_reason":"length"}],"usage":{"prompt_tokens":52,"completion_tokens":100,"total_tokens":152}}
data: [DONE]
The client receives a well-formed stream with finish_reason: "length" — identical to what a model returns when it reaches its max_tokens limit naturally.
Error chunk (on_limit_exceeded: "error_chunk")
data: {"error":{"message":"max completion tokens exceeded","type":"rate_limit_error","code":"completion_tokens_exceeded"},"usage":{...}}
data: [DONE]
Use error_chunk when you want the client to distinguish a policy-enforced truncation from a natural length limit.
Configuration
Streaming enforcement is configured in the algorithm_config of a token_bucket_llm rule:
{
"algorithm": "token_bucket_llm",
"algorithm_config": {
"tokens_per_minute": 120000,
"tokens_per_day": 2000000,
"burst_tokens": 120000,
"max_completion_tokens": 4096,
"streaming": {
"enabled": true,
"enforce_mid_stream": true,
"buffer_tokens": 100,
"on_limit_exceeded": "graceful_close",
"include_partial_usage": true
}
}
}
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `true` | Enable the streaming body filter. |
| `enforce_mid_stream` | bool | `true` | Actually truncate when the limit is exceeded. Set `false` to count only (useful with shadow mode). |
| `buffer_tokens` | int | `100` | How often to check the token budget (every N accumulated tokens). |
| `on_limit_exceeded` | string | `"graceful_close"` | `"graceful_close"` or `"error_chunk"`. |
| `include_partial_usage` | bool | `true` | Append a `usage` object to the close event. |
Reconciliation
When the stream ends (either naturally at `[DONE]` or via truncation), the actual token count is reconciled against the reservation:
refund = reserved_tokens - actual_tokens_used
The TPM and TPD buckets are refunded the difference. This keeps the running total accurate over time even though reservations are pessimistic.
For streaming, reconciliation happens when [DONE] is received in the body filter. For non-streaming, it happens immediately after the full response body is received.
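The refund arithmetic is a one-liner; sketched here with the numbers from the example flow later on this page (the function name is illustrative):

```python
def reconcile(reserved_tokens: int, actual_tokens_used: int) -> int:
    """Refund credited back to the TPM/TPD buckets at stream end.

    Reservations are pessimistic (prompt + max completion), so actual
    usage is normally at or below the reservation.
    """
    return reserved_tokens - actual_tokens_used
```

For instance, a reservation of 580 tokens against an actual count of 500 yields a refund of 80 tokens.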
Shadow mode
In shadow mode, streaming enforcement is fully simulated:
- Token counts accumulate normally
- Truncation is logged as `would_truncate` with the token count
- The stream is not interrupted — the client receives the full response
- Reconciliation still runs
This lets you validate your max_completion_tokens limits against real traffic before enabling enforcement.
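The shadow-mode branch can be sketched as follows (function name, return values, and log shape are assumptions for illustration):

```python
def on_limit_hit(shadow_mode: bool, tokens_used: int, log) -> str:
    """Decide what to do when the completion-token limit is exceeded.

    In shadow mode the event is only logged as would_truncate and the
    stream continues; otherwise the stream is truncated.
    """
    if shadow_mode:
        log({"event": "would_truncate", "tokens_used": tokens_used})
        return "forward"   # client still receives the full response
    return "truncate"      # inject close event, suppress remaining chunks
```

Because reconciliation still runs in shadow mode, the logged `would_truncate` counts can be compared directly against real usage before enforcement is turned on.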
Partial usage headers
If `include_partial_usage: true`, each SSE event in the stream has a `usage` field appended:
data: {"id":"chatcmpl-...","choices":[...],"usage":{"prompt_tokens":52,"completion_tokens":43,"total_tokens":95}}
This is useful for client-side progress tracking and debugging.
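Appending the usage object amounts to rewriting each event's JSON payload. A minimal sketch, assuming a single-line `data: {...}` event (the real filter does this in Lua):

```python
import json

def append_usage(event: str, prompt_tokens: int, completion_tokens: int) -> str:
    """Add a usage object to one SSE data event's JSON payload."""
    payload = json.loads(event[len("data: "):])
    payload["usage"] = {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
    return "data: " + json.dumps(payload, separators=(",", ":"))
```

Note that `completion_tokens` here is the running estimate at the time the event passes through, so it grows across the stream.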
Example: streaming request flow
POST /v1/chat/completions
Authorization: Bearer eyJ...
Content-Type: application/json
{"model":"gpt-4","messages":[...],"stream":true,"max_tokens":500}
- Edge receives request; estimates prompt = 80 tokens
- max_completion = min(500, 4096) = 500
- Reserve 580 tokens from TPM; check TPD
- If denied: `429 tpm_exceeded` (before stream starts)
- If allowed: forward request to upstream; start streaming
- Body filter accumulates completion tokens chunk by chunk
- At token 500: inject graceful close event, suppress upstream chunks
- Reconcile: actual = 500, reserved = 580, refund = 80 tokens to TPM/TPD
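The arithmetic in the flow above, written out (all values taken from the example; this is a worked calculation, not edge code):

```python
# Numbers from the example streaming request flow.
prompt_tokens = 80                          # estimated from the request body
max_completion = min(500, 4096)             # request max_tokens capped by the rule -> 500
reserved = prompt_tokens + max_completion   # 580 tokens reserved from TPM at request time
actual = 500                                # completion tokens counted when the stream is truncated
refund = reserved - actual                  # 80 tokens credited back to TPM/TPD
```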