Not my experience, running around 6,000 voice conversations per day on a WebRTC + cascading (STT/LLM/TTS) architecture.
Maybe I misunderstood your comment, but that 500ms is basically the floor of a state-of-the-art voice implementation these days - if you are lucky, don't skimp, and do various expensive things like speculative decoding and reasoning - that's 450ms on the LLM pass alone. Every ms counts in commercial applications of voice AI. If you add 200ms or 300ms to that, it really degrades the conversation.
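To put rough numbers on that floor, here is an illustrative per-turn budget for a cascading pipeline (my ballpark figures, not a benchmark of anyone's stack):

    end-of-speech detection       ~150-200ms
    STT finalization               ~50-100ms
    LLM time-to-first-token       ~150-450ms (the 450ms case with reasoning)
    TTS time-to-first-audio        ~80-150ms
    network + playout buffering    ~30-50ms

Take the optimistic end of every line and you are already near 500ms; take the pessimistic end and you are well past it.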
We do a lot of voice work to support our business, largely with unsophisticated, non-technical users. Last year's attempts, with measured turn-to-turn latencies of around 1200-1500ms, led to a lot of user confusion, interruptions, abandoned conversations and generally very unpleasant experiences. We are at around 700ms turn-to-turn now, depending on the tool usage needed, and it's approaching an OK experience, rivalling an interaction with an actual human. We are spending quite a lot to shave another 100ms off that. We do expensive, wasteful things such as speculative LLM passes and speculative tool executions (run a few LLM inferences while the user is still speaking, but don't actually execute non-idempotent tool calls until you know that LLM pass is usable and the user did not say anything important at the tail end of their sentence), just to shave off another 100-200ms. When someone says 500ms is irrelevant, I am sure they are describing some other use case, not human-to-AI voice interactions.
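The shape of that speculation logic, as a toy asyncio sketch (llm_infer and the event format are hypothetical stubs, not our actual stack):

    import asyncio

    async def llm_infer(transcript: str) -> dict:
        # Hypothetical stub standing in for a real LLM call (~450ms).
        await asyncio.sleep(0.45)
        return {"answer": f"draft for: {transcript!r}", "tool_calls": []}

    async def speculative_turn(events: asyncio.Queue) -> dict:
        # Start an LLM pass on each partial transcript. Non-idempotent tool
        # calls and TTS only run after we "commit", i.e. once the final
        # transcript confirms the user added nothing important at the end.
        draft, basis = None, None
        while True:
            kind, text = await events.get()
            if kind == "partial":
                if draft:
                    draft.cancel()     # stale speculation is wasted work
                basis = text
                draft = asyncio.create_task(llm_infer(text))
            elif kind == "final":
                if draft and basis and text.strip() == basis.strip():
                    return await draft # speculation paid off: pass already underway
                if draft:
                    draft.cancel()     # the tail changed the meaning; start over
                return await llm_infer(text)

    async def main():
        q = asyncio.Queue()
        for ev in [("partial", "book a flight"),
                   ("partial", "book a flight to Lisbon"),
                   ("final", "book a flight to Lisbon")]:
            await q.put(ev)
        print(await speculative_turn(q))

    asyncio.run(main())

The wasteful part is exactly what it looks like: most speculative passes get cancelled and thrown away, so you pay for inference you never use, purely to have the answer warm when the endpoint fires.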
In my experience with voice AI, the problem is not the occasional dropped WebRTC packet. The real hard problems are strong background noise, echo, and of course accents. WebRTC with its polished AEC implementations helps quite a lot, at least with echo. I get that the protocol is a major PITA to implement at OpenAI scale, but for anything short of hyperscale there are plenty of good, viable solutions and commercial providers (Daily, for instance) that make it a non-issue. The real problems to solve are still elsewhere. But boy, add 500ms to my latency budget and you've killed my application.
I was saying the same thing as you -- the user will tolerate a total delay of about 500ms, and beyond that happiness starts to fall off. We had some Alexa utterances at 500ms, the most basic ones, but most took longer.
However, even over HTTP/2 and the like, we could get into that range because we were streaming data right away: we were mostly done with the STT by the time the user stopped speaking, and were already working on the answer based on the first part of the utterance.
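A toy illustration of that overlap (made-up timings, nothing vendor-specific): a streaming decoder that keeps up with the audio has only the last chunk left to process when speech ends, while a batch decoder pays for the whole transcription after the fact.

    import asyncio, time

    CHUNKS = 20    # 20 x 100ms chunks = a 2s utterance
    CAPTURE = 0.1  # seconds of real time per audio chunk
    DECODE = 0.03  # hypothetical STT decode cost per chunk

    async def capture(q: asyncio.Queue):
        for i in range(CHUNKS):
            await asyncio.sleep(CAPTURE)  # user is still speaking
            await q.put(i)
        await q.put(None)                 # end of speech

    async def streaming_decoder(q: asyncio.Queue):
        # Decodes each chunk as soon as it arrives, overlapping with capture.
        while (chunk := await q.get()) is not None:
            await asyncio.sleep(DECODE)

    async def main():
        # Streaming: decode runs concurrently with capture.
        q = asyncio.Queue()
        t0 = time.monotonic()
        await asyncio.gather(capture(q), streaming_decoder(q))
        lag = time.monotonic() - t0 - CHUNKS * CAPTURE
        print(f"streaming: transcript ready {lag:+.2f}s after end of speech")

        # Batch: wait for the whole utterance, then decode it all.
        t0 = time.monotonic()
        await asyncio.sleep(CHUNKS * CAPTURE)  # listen
        await asyncio.sleep(CHUNKS * DECODE)   # then transcribe
        lag = time.monotonic() - t0 - CHUNKS * CAPTURE
        print(f"batch: transcript ready {lag:+.2f}s after end of speech")

    asyncio.run(main())

With these numbers the streaming path lands ~30ms after end of speech versus ~600ms for batch, which is the difference between starting the LLM pass immediately and burning half your latency budget before it even begins.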
But I would need to see some really strong evidence to even think about using WebRTC.