We process audio in 200ms chunks and transcribe after 400ms of silence. Three voice chunks (as flagged by VAD) automatically trigger an interruption, unless the speech is a backchannel like "yeah" or "right".
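The chunk counting above can be sketched roughly as follows. This is only an illustration, not our actual code: the names `ChunkTracker`, `is_backchannel`, and `should_interrupt` are made up, the backchannel word list is a guess, and it assumes the voice and silence chunks are counted consecutively and that a transcript of the speech so far is available when the interrupt check runs.

```python
from dataclasses import dataclass

CHUNK_MS = 200               # audio is processed in 200ms chunks
SILENCE_MS = 400             # transcribe after this much silence
INTERRUPT_VOICE_CHUNKS = 3   # this many voice chunks can interrupt

# Hypothetical backchannel list; the real set is not specified in the note.
BACKCHANNELS = {"yeah", "right", "ok", "okay", "mhm", "uh-huh"}


def is_backchannel(text: str) -> bool:
    """True if the text is one or two words, all of them backchannels."""
    words = [w.strip(".,!?") for w in text.lower().split()]
    return 0 < len(words) <= 2 and all(w in BACKCHANNELS for w in words)


def should_interrupt(text: str) -> bool:
    """An interrupt candidate goes through unless it is a backchannel."""
    return not is_backchannel(text)


@dataclass
class ChunkTracker:
    voice_run: int = 0        # consecutive voice chunks (assumed consecutive)
    silence_run: int = 0      # consecutive silent chunks
    heard_speech: bool = False

    def push(self, is_voice: bool) -> list[str]:
        """Feed one VAD decision per 200ms chunk; return any events fired."""
        events = []
        if is_voice:
            self.voice_run += 1
            self.silence_run = 0
            self.heard_speech = True
            if self.voice_run == INTERRUPT_VOICE_CHUNKS:
                events.append("interrupt_candidate")
        else:
            self.silence_run += 1
            self.voice_run = 0
            if self.heard_speech and self.silence_run * CHUNK_MS >= SILENCE_MS:
                events.append("transcribe")
                self.heard_speech = False
        return events
```

With three voice chunks followed by two silent ones, the tracker fires `interrupt_candidate` on the third chunk and `transcribe` on the fifth. The candidate would then be passed through `should_interrupt` with whatever text is available.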
Whisper can transcribe in <100ms.
We then wait for the turn detection model, LLM, and TTS to trigger a streamed response back to the client.
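A rough sketch of how those stages might chain into one streamed response, using async generators. Every stage here is a hypothetical stub (the turn detector just checks for terminal punctuation, the LLM yields canned tokens, and the TTS returns encoded text instead of audio), so treat it as the shape of the pipeline rather than the real thing.

```python
import asyncio
from typing import AsyncIterator


async def turn_is_complete(transcript: str) -> bool:
    # Stand-in for the turn detection model (assumption: punctuation ends a turn).
    return transcript.rstrip().endswith((".", "?", "!"))


async def llm_tokens(transcript: str) -> AsyncIterator[str]:
    # Stand-in for a streaming LLM call; yields canned tokens one at a time.
    for token in ["Sure,", "one", "moment."]:
        await asyncio.sleep(0)  # hand control back to the event loop, as real I/O would
        yield token


async def tts_audio(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    # Stand-in for streaming TTS; bytes here are placeholders, not real audio.
    async for token in tokens:
        yield token.encode("utf-8")


async def respond(transcript: str) -> AsyncIterator[bytes]:
    # Only respond once the turn detector says the user is done.
    if not await turn_is_complete(transcript):
        return
    # Each audio chunk is yielded as soon as it is ready, so the client
    # can start playback before the LLM has finished generating.
    async for audio in tts_audio(llm_tokens(transcript)):
        yield audio


async def collect(transcript: str) -> list[bytes]:
    # Helper for trying the sketch out: gather the stream into a list.
    return [chunk async for chunk in respond(transcript)]
```

Running `asyncio.run(collect("What time is it?"))` gathers the placeholder chunks in order, while an unfinished transcript like `"What time is"` produces nothing because the stub turn detector holds the response back.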