I agree with everything you've said; I must have written it poorly.

What I was saying is the same as you: the user will tolerate a total delay of about 500ms, and then happiness starts to fall off. We had some Alexa utterances at 500ms, the most basic ones, but most took longer.

However, even with HTTP/2 and the like, we could get into that range, because the client was streaming audio right away: we were mostly done with STT by the time the user finished speaking, and we were already working on the answer based on the first part of the utterance.
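
A minimal sketch of that overlap, with a fake mic and a hypothetical streaming STT stage standing in for the real pipeline (mic_chunks, stream_stt, and plan_answer are illustrative names, not any real API):

    import asyncio

    async def mic_chunks():
        # Stand-in for a microphone: yields small audio chunks as captured.
        for chunk in (b"hel", b"lo ", b"wor", b"ld"):
            await asyncio.sleep(0.1)   # simulate real-time capture
            yield chunk

    async def stream_stt(chunks):
        # Hypothetical streaming STT: consumes audio as it arrives and yields
        # partial transcripts, so recognition finishes almost when speech does.
        buf = b""
        async for chunk in chunks:
            buf += chunk
            yield buf.decode(errors="ignore")   # partial transcript so far

    async def plan_answer(prefix):
        await asyncio.sleep(0.2)   # stand-in for early LLM / retrieval work
        return f"(answer prepared from early prefix {prefix!r})"

    async def main():
        prefetch = None
        async for partial in stream_stt(mic_chunks()):
            # Start working on the answer from the first partial transcript,
            # instead of waiting for the full utterance to finish.
            if prefetch is None:
                prefetch = asyncio.create_task(plan_answer(partial))
            print("partial:", partial)
        print("answer:", await prefetch)

    asyncio.run(main())

The point is just that STT and answer generation run concurrently with the speech itself, so the visible latency is only the tail end of the work.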

But I would need to see some really strong evidence to even think about using WebRTC.

I am myself working on something similar, but I have noticed that if I try to pass early speech from the user to the LLM to reduce latency, the chance of interruptions gets even higher. For example, the user may say something like "Yes" followed by a brief pause, leading the speech model to count that as a complete turn and trigger the LLM call. But then the user may add something more, so I have to cancel the previous request to avoid any irreversible state transitions. And because of the lower latency (due to the speculative call), I get an even smaller window in which to cancel the response or stop the model from speaking.
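
One way to keep that window open, sketched below under assumptions of mine (llm_call, COMMIT_DELAY, and the event hooks are hypothetical, not any framework's API): fire the LLM request on a tentative end of turn, but hold anything irreversible, like speaking or committing state, inside a short grace window so late speech can still abort it.

    import asyncio

    COMMIT_DELAY = 0.4   # grace window (s) before anything irreversible happens

    class SpeculativeTurn:
        # Fires the LLM call on a tentative end of turn, but keeps it
        # cancelable until the commit window passes with no further speech.
        def __init__(self, llm_call):
            self.llm_call = llm_call   # async fn: transcript -> reply text
            self.pending = None        # in-flight speculative task, if any

        def on_tentative_turn_end(self, transcript):
            # Pause detected: speculatively start the LLM request right away.
            self.pending = asyncio.create_task(self._run(transcript))

        def on_more_speech(self):
            # User kept talking ("Yes" ... "actually, also..."): abort the
            # call before the reply is spoken or any state is committed.
            if self.pending and not self.pending.done():
                self.pending.cancel()
            self.pending = None

        async def _run(self, transcript):
            reply = await self.llm_call(transcript)
            await asyncio.sleep(COMMIT_DELAY)  # cancellation can still land
            print("assistant:", reply)         # stand-in for TTS / commit

    async def demo():
        async def fake_llm(text):
            await asyncio.sleep(0.2)           # stand-in for model latency
            return f"response to {text!r}"

        turn = SpeculativeTurn(fake_llm)
        turn.on_tentative_turn_end("Yes")      # pause looks like end of turn
        await asyncio.sleep(0.3)
        turn.on_more_speech()                  # user adds more; call dies
        await asyncio.sleep(0.5)               # nothing is spoken

    asyncio.run(demo())

The trade-off is explicit: the commit delay gives back some of the latency the speculative call saved, so you tune it to the shortest window that still catches most continuations.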