aniket mishrikotkar

zombie requests

recently i was looking at a cost spike in prod. our rate of completed requests had dropped, but our GPU utilization stayed at 100%.

the problem was zombie requests: upstream inference keeps running after the downstream client is gone, so utilization stays high while completed requests drop.

when a client times out or disconnects, we assume the server stops working. that is not a safe assumption even in ordinary web apps, and in the world of expensive LLM inference it can burn a lot of money.

client disconnects → gateway may not notice immediately → upstream inference keeps running until you abort it

a few ways we leak compute

  1. detached background tasks: if you use BackgroundTasks (tasks that run after the response is returned) for inference, that code runs to completion regardless of the HTTP connection. your client disconnected 30 seconds ago, but the model is still generating a 4000-token essay that no one will read.

  2. the cancellation gap: even with streaming, python doesn't always stop your logic immediately. cancellation in Python async is cooperative; you only notice it at await points or explicit checks, so you must design your generation loop and streaming path to surface disconnect quickly.
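both leaks can be reproduced with plain asyncio, no web framework needed. the sketch below is illustrative only: generate_tokens, handler_detached, handler_awaited and demo are made-up names, and asyncio.sleep stands in for real model work. the "client disconnect" is simulated by cancelling the handler task.

```python
import asyncio

background = set()  # strong references so detached tasks are not garbage collected


async def generate_tokens(n, log, delay=0.01):
    """Stand-in for model generation: one 'token' per step."""
    for i in range(n):
        await asyncio.sleep(delay)  # cancellation can only land at an await like this
        log.append(i)


async def handler_detached(log):
    # leak 1: generation is handed to a detached task, so cancelling the
    # handler (client gone) never reaches it
    task = asyncio.create_task(generate_tokens(5, log))
    background.add(task)
    await asyncio.sleep(1)  # pretend to hold the connection open


async def handler_awaited(log):
    # generation is awaited inside the handler, so cancellation propagates
    # into generate_tokens, but only at its next await
    await generate_tokens(5, log)


async def demo():
    detached_log, awaited_log = [], []
    for handler, log in ((handler_detached, detached_log), (handler_awaited, awaited_log)):
        request = asyncio.create_task(handler(log))
        await asyncio.sleep(0.025)  # the client "disconnects" mid-generation
        request.cancel()
        await asyncio.gather(request, return_exceptions=True)
    await asyncio.gather(*background)  # let the leaked work finish so we can count it
    return detached_log, awaited_log
```

running `asyncio.run(demo())` returns both logs: the detached one contains all five tokens even though its handler was cancelled, while the awaited one stops early. if generate_tokens did a long stretch of work without an await, even the awaited version would keep going until the next await point.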

the fix is to have a defense

buffering multiplies these requests. it delays token delivery, causes more client timeouts, and turns a slow stream into a disconnected client plus a still-running GPU.

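from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def stream(request: Request):
    # 1. define the generator that yields data
    async def generate():
        # 2. stream from the engine with a timeout
        # 3. check explicitly whether the client has disconnected
        # 4. catch the cancellation if the server cancels the task
        yield b""  # placeholder: real tokens go here

    return StreamingResponse(generate())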
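one way the four steps could look in practice, as a minimal sketch and not a definitive implementation. assumptions: `request` is anything with an async `is_disconnected()` (Starlette's `Request` has one), `engine` is a hypothetical wrapper exposing an async generator `generate(request_id, prompt)` and an async `abort(request_id)`, and `TOKEN_TIMEOUT_S` is an arbitrary value. FakeEngine and FakeRequest are stand-ins so the sketch runs without a server.

```python
import asyncio

TOKEN_TIMEOUT_S = 30.0  # assumed per-token budget; tune for your model


async def guarded_stream(request, engine, request_id, prompt):
    """Yield tokens until the engine finishes, stalls, or the client leaves."""
    tokens = engine.generate(request_id, prompt).__aiter__()
    finished = False
    try:
        while True:
            try:
                # 2. stream from the engine with a timeout
                token = await asyncio.wait_for(tokens.__anext__(), TOKEN_TIMEOUT_S)
            except StopAsyncIteration:
                finished = True  # engine completed on its own
                return
            except asyncio.TimeoutError:
                return  # engine stalled; the finally block aborts it
            # 3. check explicitly whether the client disconnected
            if await request.is_disconnected():
                return
            yield token
    finally:
        # 4. also reached when the server cancels the task (CancelledError)
        #    or closes the generator early (GeneratorExit)
        if not finished:
            await engine.abort(request_id)


class FakeEngine:
    """Hypothetical stand-in for a real engine, for demonstration only."""

    def __init__(self, n_tokens):
        self.n_tokens = n_tokens
        self.aborted = []

    async def generate(self, request_id, prompt):
        for i in range(self.n_tokens):
            await asyncio.sleep(0)
            yield f"tok{i}"

    async def abort(self, request_id):
        self.aborted.append(request_id)


class FakeRequest:
    """Reports a disconnect once `alive_checks` checks have passed."""

    def __init__(self, alive_checks):
        self.alive_checks = alive_checks

    async def is_disconnected(self):
        self.alive_checks -= 1
        return self.alive_checks < 0


async def collect(request, engine):
    return [tok async for tok in guarded_stream(request, engine, "req-1", "hi")]
```

in a real endpoint you would pass `guarded_stream(...)` to `StreamingResponse` instead of collecting it. note the limitation: the disconnect check only runs when a token arrives, so a slow engine delays detection by up to one token (or the timeout).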

reality

for most real-time streaming endpoints, the only cancellation mechanism you reliably control is aborting the connection. explicit cancel endpoints usually exist for batch/async jobs (e.g. openai batch cancel, anthropic message batches cancel).

If you self-host (vllm/TGI/llama.cpp), you should abort the engine request to free GPU/CPU immediately.
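to free the engine without waiting for the next token, one option is to race generation against a disconnect watcher. this is a sketch under assumptions: `is_disconnected` and `abort` are hypothetical async callables you would wire to your framework's disconnect check and to whatever abort hook your engine exposes, and the polling interval is arbitrary.

```python
import asyncio


async def wait_for_disconnect(is_disconnected, poll_s=0.05):
    """Return once the is_disconnected() callable reports True."""
    while not await is_disconnected():
        await asyncio.sleep(poll_s)


async def run_or_abort(generation, is_disconnected, abort):
    """Race generation against the disconnect watcher.

    Returns the generation result, or None if the client left first.
    The engine is aborted whenever generation did not complete,
    including when this coroutine itself is cancelled.
    """
    gen_task = asyncio.ensure_future(generation)
    watch_task = asyncio.ensure_future(wait_for_disconnect(is_disconnected))
    completed = False
    try:
        done, _ = await asyncio.wait(
            {gen_task, watch_task}, return_when=asyncio.FIRST_COMPLETED
        )
        if gen_task in done:
            completed = True
            return gen_task.result()
        return None  # client left first
    finally:
        # stop whichever side is still running, then release the engine
        for task in (gen_task, watch_task):
            task.cancel()
        await asyncio.gather(gen_task, watch_task, return_exceptions=True)
        if not completed:
            await abort()
```

cancelling the local task only stops your python code from waiting. the explicit `abort()` call is what tells the engine to stop generating, which is the part that actually frees GPU/CPU.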

my recommendations