Latency versus reliability is a false dichotomy anyway. The alternative to WebRTC isn't to wait for the user to finish speaking before you send any of the audio. Open a websocket and send the coded audio packets as they're generated. Now you're still sending audio packets immediately, but if one is dropped, TCP retransmits it until it makes it through. If the connection is really slow, packets queue up, and the user has to wait, but it still works. You get the low latency in the best case and the robustness in the worst case.
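
Minimal sketch of what I mean (Python, using the websockets package; the URL, frame size, and sequence prefix are all just illustrative):

    import websockets  # pip install websockets

    async def stream_audio(frames, url="wss://example.com/audio"):
        # Send each coded frame the moment the encoder produces it.
        # TCP under the websocket retransmits lost segments, so frames
        # arrive in order or the stream stalls; none are silently dropped.
        async with websockets.connect(url) as ws:
            seq = 0
            async for frame in frames:  # frame: ~20 ms of coded audio
                # A 4-byte sequence prefix lets the receiver drop
                # app-level resends after a reconnect.
                await ws.send(seq.to_bytes(4, "big") + frame)
                seq += 1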
reply
You ultimately still need a jitter buffer large enough to absorb retransmissions. Otherwise you've got stuttering audio. And dynamically adjusting this jitter buffer is hard.
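
For a feel of what "dynamically adjusting" involves, here's a toy buffer that sizes itself from an RFC 3550-style smoothed jitter estimate. Everything past that formula is a made-up heuristic; real stacks like WebRTC's NetEq also time-stretch audio rather than just waiting:

    import heapq
    import time

    FRAME_MS = 20  # assume 20 ms coded frames

    class JitterBuffer:
        def __init__(self):
            self.frames = []          # min-heap of (seq, frame)
            self.jitter_ms = 0.0      # smoothed inter-arrival deviation
            self.last_arrival = None

        def push(self, seq, frame):
            now = time.monotonic() * 1000
            if self.last_arrival is not None:
                deviation = abs((now - self.last_arrival) - FRAME_MS)
                # RFC 3550-style EWMA: move 1/16 of the way to each sample.
                self.jitter_ms += (deviation - self.jitter_ms) / 16
            self.last_arrival = now
            heapq.heappush(self.frames, (seq, frame))

        def target_depth(self):
            # Hold roughly two jitters' worth of audio, at least one frame.
            return max(1, round(2 * self.jitter_ms / FRAME_MS))

        def pop(self):
            # Release audio only once the queue is deep enough to ride out
            # a retransmission. Returning None means "play silence or
            # stretch", which is exactly the stutter being described.
            if len(self.frames) >= self.target_depth():
                return heapq.heappop(self.frames)[1]
            return None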
reply
> And dynamically adjusting this jitter buffer is hard

An underappreciated part of this entire conversation.

reply
I'm not an expert. Can't we exploit the fact that LLMs don't need to receive audio as a continuous, uninterrupted stream? Couldn't we just send the data and pipe it into the LLM with deduplication (in case anything gets resent)?

  x...y...y[dedup]...z
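
Concretely, maybe something like this, where feed_llm is just a placeholder for whatever ASR/LLM pipeline sits behind the socket:

    def deduplicating_feeder(feed_llm):
        # feed_llm: hypothetical callback into the streaming pipeline.
        seen = set()  # a real version would expire old sequence numbers
        def on_chunk(seq, audio):
            if seq in seen:        # a resend: drop the duplicate
                return
            seen.add(seq)
            feed_llm(audio)        # the model never sees the hiccup
        return on_chunk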
reply
You’re absolutely correct. A jitter buffer is necessary for a human listener, but an LLM isn’t aware of a time lapse, just like it isn’t aware of the time since your last message in the conversation (unless the chat harness explicitly informs it).
reply
Human spoken conversation doesn’t really work like file buffering.

People can tolerate missing words surprisingly well. If a phrase is slightly clipped, masked by noise, or dropped, the listener can often infer it from context. That happens constantly in real speech.

But pauses and stalls are much more damaging. A sudden freeze in the middle of speech breaks turn-taking, timing, and attention. It feels like the speaker stopped thinking, the connection died, or the system got stuck.

For voice UX, a tiny omission is often less harmful than a perfectly complete sentence that freezes halfway.

reply