zombie requests

30 Dec, 2025

cat

recently i was looking at a cost spike in prod. our request completion rate had dropped, but our GPU utilization stayed at 100%.

the problem was zombie requests: upstream inference keeps running after the downstream client is gone, so utilization stays high while completed requests drop.

when a client times out or disconnects, we tend to assume the server stops working. that's not a safe assumption in web apps generally, and with expensive LLM inference it can burn a lot of money.

client disconnects → gateway may not notice immediately → upstream inference keeps running until you abort it

a few ways we leak compute

detached background tasks: if you use FastAPI's BackgroundTasks (tasks to be run after returning a response) for inference, that code runs to completion regardless of the HTTP connection. so your client disconnected 30 seconds ago, but the model is still generating a 4000-token essay that no one will read.
the cancellation gap: even with streaming, python doesn't always stop your logic immediately. cancellation in Python async is cooperative; you only notice it at await points or explicit checks, so you must design your generation loop and streaming path to surface disconnect quickly.
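
to make the cancellation gap concrete, here is a minimal sketch using only asyncio. the `generate` loop is a made-up stand-in for a decode loop, not real inference code. the same loop is cancelled twice: once with no await points, and once with an await every 10 steps.

```python
import asyncio

async def generate(n_tokens: int, yield_every: int, produced: list) -> None:
    """Fake generation loop; a stand-in for a real decode loop."""
    for i in range(n_tokens):
        produced.append(i)                 # pretend this is one decode step
        if yield_every and (i + 1) % yield_every == 0:
            await asyncio.sleep(0)         # await point: cancellation can land here

async def run(yield_every: int) -> int:
    produced: list = []
    task = asyncio.create_task(generate(1000, yield_every, produced))
    await asyncio.sleep(0)                 # let the task start running
    task.cancel()                          # "the client disconnected"
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(produced)

# no await points: the loop finishes before the cancel is ever seen
no_awaits = asyncio.run(run(yield_every=0))
# an await every 10 steps: the cancel lands at the first await point
with_awaits = asyncio.run(run(yield_every=10))
print(no_awaits, with_awaits)
```

the first run does all 1000 steps, because the task never hands control back to the event loop. the second stops at its first await point. real decode loops usually await something per token, but any long stretch without an await behaves like the first case.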

the fix is a layered defense

detect: observe the client disconnect as early as you can.
propagate: cancel upstream HTTP request and pass a cancellation token to the inference engine.
enforce: hard caps (e.g. max tokens) so even if cancel fails you don't run forever.
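
the three layers can be sketched with plain asyncio. everything here is an assumption for illustration: `fake_engine` stands in for the real inference call, the `disconnected` event stands in for whatever disconnect signal your framework gives you, and `MAX_TOKENS` is an arbitrary cap.

```python
import asyncio

MAX_TOKENS = 50  # enforce: hard cap, applies even if a cancel never arrives

async def fake_engine(out: list, max_tokens: int = MAX_TOKENS) -> None:
    """Stand-in for a real inference call; one iteration = one token."""
    for i in range(max_tokens):
        await asyncio.sleep(0.001)         # await point, so a cancel can land
        out.append(f"tok{i}")

async def serve(disconnected: asyncio.Event) -> list:
    out: list = []
    gen = asyncio.create_task(fake_engine(out))
    watch = asyncio.create_task(disconnected.wait())   # detect
    await asyncio.wait({gen, watch}, return_when=asyncio.FIRST_COMPLETED)
    for task in (gen, watch):
        task.cancel()                      # propagate: stop whichever side still runs
    await asyncio.gather(gen, watch, return_exceptions=True)
    return out

async def demo() -> tuple:
    # client stays connected: generation runs up to the cap
    stayed = await serve(asyncio.Event())
    # client leaves after ~10 ms: generation is cancelled early
    gone = asyncio.Event()
    asyncio.get_running_loop().call_later(0.01, gone.set)
    left = await serve(gone)
    return len(stayed), len(left)

stayed_tokens, left_tokens = asyncio.run(demo())
print(stayed_tokens, left_tokens)
```

the cap is what saves you when detect or propagate silently fails: the connected run still stops at `MAX_TOKENS`, and the disconnected run stops well before it.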

buffering anywhere between the engine and the client multiplies these requests. it delays token delivery, causes more client timeouts, and turns a slow stream into a disconnected client plus a still-running GPU.

@app.post("/chat")
async def stream(request):

    # 1. define the generator that yields data
    # 2. stream from the engine with a timeout
    # 3. check explicitly whether the client disconnected
    # 4. catch the cancellation if the server cancels the task

    return StreamingResponse(...)
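
here is one way the four steps could look as a framework-agnostic generator, so it runs with only the standard library. `engine_tokens`, `guarded_stream` and `collect` are hypothetical names, and `is_disconnected` is any async callable returning a bool. in FastAPI/Starlette you would likely pass `request.is_disconnected` and hand the generator to `StreamingResponse`, but treat that wiring as an assumption to verify against your version.

```python
import asyncio
from typing import AsyncGenerator, Awaitable, Callable

async def engine_tokens(n: int, delay: float) -> AsyncGenerator[str, None]:
    """Hypothetical engine stream: n tokens, one every `delay` seconds."""
    for i in range(n):
        await asyncio.sleep(delay)
        yield f"tok{i}"

async def guarded_stream(
    tokens: AsyncGenerator[str, None],
    is_disconnected: Callable[[], Awaitable[bool]],
    token_timeout: float,
) -> AsyncGenerator[str, None]:
    # 1. the generator that yields data
    try:
        while True:
            # 3. explicit disconnect check before asking for more work
            if await is_disconnected():
                break
            try:
                # 2. bound the wait for each token
                token = await asyncio.wait_for(tokens.__anext__(), token_timeout)
            except (StopAsyncIteration, asyncio.TimeoutError):
                break                      # engine finished or stalled
            yield token
    except asyncio.CancelledError:
        # 4. the server cancelled this task: log or count it here, then re-raise
        raise
    finally:
        await tokens.aclose()              # close the upstream stream on every exit path

async def collect(n: int, delay: float, token_timeout: float, disconnect_after: int) -> list:
    seen: list = []

    async def is_disconnected() -> bool:   # fake client: leaves after N tokens
        return len(seen) >= disconnect_after

    async for tok in guarded_stream(engine_tokens(n, delay), is_disconnected, token_timeout):
        seen.append(tok)
    return seen

full = asyncio.run(collect(20, 0.001, 1.0, disconnect_after=10**9))
cut = asyncio.run(collect(20, 0.001, 1.0, disconnect_after=3))
stalled = asyncio.run(collect(5, 0.2, 0.01, disconnect_after=10**9))
print(len(full), len(cut), len(stalled))
```

the `finally` block matters most: whichever way the generator exits (disconnect, timeout, cancel, normal end), the upstream stream gets closed instead of left running.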

reality

for most real-time streaming endpoints, the only cancellation mechanism you reliably control is aborting the connection. explicit cancel endpoints usually exist for batch/async jobs (e.g. openai batch cancel, anthropic message batches cancel).

if you self-host (vLLM/TGI/llama.cpp), you should abort the engine request to free GPU/CPU immediately.
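
a rough sketch of the abort-on-every-exit pattern follows. `FakeEngine` is a hypothetical stand-in with an abort-by-request-id method. engines such as vLLM expose something along these lines, but the exact name and signature depend on the engine and version, so check its docs before copying this shape.

```python
import asyncio

class FakeEngine:
    """Hypothetical stand-in for a self-hosted engine with abort-by-id."""

    def __init__(self) -> None:
        self.running: set = set()          # request ids currently holding a slot
        self.aborted: list = []            # ids that were aborted before finishing

    async def generate(self, request_id: str, n_tokens: int):
        self.running.add(request_id)
        for i in range(n_tokens):
            if request_id not in self.running:   # the engine saw the abort
                return
            await asyncio.sleep(0.001)           # one decode step
            yield f"tok{i}"
        self.running.discard(request_id)         # finished normally

    def abort(self, request_id: str) -> None:
        if request_id in self.running:
            self.running.discard(request_id)
            self.aborted.append(request_id)

async def handle(engine: FakeEngine, request_id: str, client_reads: int) -> list:
    """Stream tokens and abort the engine request on every exit path."""
    sent: list = []
    stream = engine.generate(request_id, n_tokens=100)
    try:
        async for tok in stream:
            sent.append(tok)
            if len(sent) >= client_reads:  # simulate the client going away
                break
    finally:
        engine.abort(request_id)           # free the slot even on cancel or error
        await stream.aclose()
    return sent

engine = FakeEngine()
sent = asyncio.run(handle(engine, "req-1", client_reads=5))
done = asyncio.run(handle(engine, "req-2", client_reads=1000))
print(len(sent), len(done), engine.aborted)
```

calling abort in `finally` is deliberate: it is a no-op for requests that finished on their own, and it catches the ones that ended because of a disconnect, a cancel, or an exception.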

my recommendations

monitor the divergence between request completion rates and GPU/CPU utilization.
never use fire-and-forget tasks for response generation.
trace your requests so you know exactly which job to kill when a client disconnects.
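
for the first recommendation, a minimal sketch of a divergence check over one metrics window. the `Window` fields and both thresholds are assumptions picked for illustration; tune them against your own baseline.

```python
from dataclasses import dataclass

@dataclass
class Window:
    started: int            # requests that began in this window
    completed: int          # requests whose response was fully delivered
    gpu_utilization: float  # mean utilization over the window, in [0, 1]

def zombie_signal(w: Window, min_util: float = 0.8, min_completion: float = 0.9) -> bool:
    """True when the GPU is busy but too few requests are finishing."""
    if w.started == 0:
        return False                       # no traffic, nothing to compare
    completion_rate = w.completed / w.started
    return w.gpu_utilization >= min_util and completion_rate < min_completion

healthy = zombie_signal(Window(started=100, completed=97, gpu_utilization=0.95))
suspect = zombie_signal(Window(started=100, completed=60, gpu_utilization=1.0))
print(healthy, suspect)
```

high utilization alone is not a signal, and neither is a low completion rate on an idle GPU. the combination is what points at work nobody is waiting for.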