I think these are 2 different layers of "latency". The latency in the article is referring to the transport of the audio stream itself while the latency in your scenario is about how quickly to start responding inside the audio stream.
reply
They are orthogonal.

Suppose you have 100ms audio latency and no wait time. Then a natural pause will trigger a response immediately, but you won't notice it has started until ~200ms later (the round-trip time). Twice as annoying.

reply
I think he’s saying they’ve taken on an insane level of complexity to shave ~100ms off response times in a scenario where that isn’t important and might even be a negative.
reply
When GP mentioned reducing conversational latency as a negative, that made sense (and should probably be done IMO); it just wasn't the same category of latency the article talks about reducing. I.e., increasing "network latency" just makes the conversation feel more and more out of sync; it doesn't change the rate at which the AI will interrupt ("turn latency"), because the latter is based on the duration of the pause in the audio stream, not the time it took to deliver that audio stream.

If you meant there is a case where reducing the network latency at the same delivery reliability for a given audio stream is actually a negative then I'd love to hear more about it as I'm a network guy always in search of an excuse for latency :D.

reply
But you want to be able to interject “hold on…” and have it immediately stop talking when it goes off the rails.

And GP is correctly pointing out that the only negative here (silence waiting latency maybe being too low) is tunable separately from the network latency number.

reply
I want to be able to click the "Stop" button on my earphones remote. I want to be able to interject "woah" or "stop!" or "wait!" or that it would detect that I've inhaled a breath, or that my eyes glazed over. I want the LLM to figure out that every speed setting for its voice output is in "auctioneer" territory rather than "lecturing university professor with tenure and a pension" pacing.

But we won't get any of that, because the prime directive of LLMs is to burn tokens like there's no tomorrow. Burn tokens on a naïve answer without asking clarifying questions. Burn tokens on writing, debugging, and running a Python script or accessing and parsing 10 websites without asking for consent. Burn tokens on half-baked images with misspellings and 31 fingers. Burn tokens arguing "how many 'r's in strawberry?". Burn tokens asking a followup question at the end of every single answer, begging the user to re-engage and burn more tokens.

There is a little red "Stop" control when text output is being produced, at least, but does "Stop" halt everything and throw away the context? Re-prompt from the beginning?

The "maximize tokens burnt" prime directive is not to be found in any system prompt or user documentation. It is seemingly a common feature of the training for any consumer model.

Currently, if I'm using voice for an LLM, I use the voice dictation feature in the keyboard, because then the response is in text. There is no way to prevent it from "responding in kind" if I query the thing with audio. Or in Swahili.

reply
newer models tend to use fewer thinking tokens to solve the same problems, which is a strong counterexample to your entire comment
reply
deleted
reply
deleted
reply
Agreed. It’s stressful. I think they need to add an option to adopt a suffix, so it doesn’t start babbling until there is an “over” followed by a pause, like in the old army walkie-talkie days.
reply
I’ve also experienced this and it’s really annoying. There is this pressure to keep talking if I’m not done with my thought that feels pretty unnatural at least for me. If I’m searching for the right word, I want the opportunity to find it.

I think the solution is to handle pauses more intelligently rather than having a higher latency protocol. With low latency you can interrupt and the bot can immediately stop rambling.
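A minimal sketch of the barge-in idea: with low transport latency, the first VAD frame that detects user speech can cancel the bot's audio mid-sentence. `Playback` here is a hypothetical stand-in for whatever TTS/audio pipeline is actually in use, not any particular API.

```python
# Barge-in sketch: cancel assistant audio the moment the user starts talking.
# Playback is a placeholder for a real TTS/audio output object.

class Playback:
    def __init__(self) -> None:
        self.playing = False

    def start(self) -> None:
        self.playing = True

    def stop(self) -> None:
        self.playing = False


def on_vad_frame(is_user_speech: bool, playback: Playback) -> None:
    # If the user speaks while the bot is mid-utterance, stop immediately
    # instead of letting it finish rambling.
    if is_user_speech and playback.playing:
        playback.stop()
```

The lower the end-to-end latency, the smaller the gap between the user saying "stop" and the audio actually stopping, which is exactly the case where shaving ~100ms helps.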

reply
100%. I have to hold the floor by filling the space with "ummmmmmmm.... uhhhh...." which inevitably distracts me from my point altogether. Poor user experience.
reply
Seems like there's a big risk of having that habit leak into human conversation. A lot of people try really hard to train themselves not to add those fillers.
reply
Have you tried telling it to pause to let you think?

I often use it while I’m walking and tell it to not respond until I initiate a conversation.

reply
I’ve tried this and it says it will but just keeps cutting in. I hate this feature so much.
reply
If anyone has an alternative I’m all ears.

This would be a killer feature for me and something I’ve tried to use on cross-country road trips.

reply
If you're setting this up yourself instead of using a lab's built-in speech functionality, you can run a small LLM in parallel (a local model, or a small hosted one like Haiku) that acts as a gate for whether or not to run TTS on the response. Its only job is to decide whether the transcription it receives is of someone who is done talking, or of someone likely still mid-thought or mid-sentence.
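A rough sketch of that gate, assuming the real version would ask a small model a yes/no question about the transcript. To keep the sketch self-contained, a cheap punctuation/filler heuristic stands in for the LLM call; the word list and function names are illustrative, not from any real library.

```python
# Turn-gate sketch: decide whether a transcript looks like a finished turn
# before handing the response to TTS. In a real setup this decision would
# come from a small LLM prompted "is this speaker done talking? yes/no";
# the heuristic below is a runnable stand-in.

TRAILING_FRAGMENTS = {"and", "but", "so", "because", "um", "uh", "like"}


def looks_finished(transcript: str) -> bool:
    """Return True if the speaker is probably done talking."""
    text = transcript.strip().lower()
    words = text.rstrip(".!?,").split()
    if not words:
        return False
    # Dangling conjunctions/fillers suggest the speaker is mid-thought.
    if words[-1] in TRAILING_FRAGMENTS:
        return False
    # Finished utterances usually end with terminal punctuation.
    return text.endswith((".", "!", "?"))


def should_speak(transcript: str) -> bool:
    # Gate: only run TTS on the response when the turn looks complete.
    return looks_finished(transcript)
```

Swapping the heuristic for an actual small-model call keeps the same shape: transcript in, boolean out, and the TTS step only fires on `True`.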
reply
I know it's not the perfect solution for you, but I use a voice recorder and send the LLM the transcript. And my god is it working great.

Usually I just explain the things I want it to do. The longest was 30 minutes of rambling, explaining the methods section of a paper in non-chronological order. It worked unbelievably well for me.

reply
I find this is a problem even with human conversations. Some people just aren’t very good at telegraphing when they’ve finished ‘their turn’ talking. Or worse yet, aren’t willing to take turns in the first place.
reply
This has more to do with Voice Activity Detection (VAD) than the latency described in the article
reply
That seems to be the issue: VAD is insufficient here.

Knowing when to respond requires semantic understanding, which probably only the model itself is capable of.

Maybe it’s hard for them to train it to only respond once it seems appropriate to do so?

reply
I am excited for VAD to go away. PersonaPlex totally seems like the future.

However, for things like call-center helplines, turn-based actually seems better! You don't want to be interrupted when giving information back and forth (I think?)

reply
Exactly. It's a tangent, but clearly a pain point for enough users.
reply
There's a really interesting project in Japanese natural language processing called J-Moshi that had a novel approach and in my opinion good results.

They tried to make it mimic the way Japanese is full of really quick acknowledgement sounds and it seems to allow it to handle those pauses and interruptions really well.

https://en.nagoya-u.ac.jp/news/articles/say-hello-to-j-moshi... (english)

https://nu-dialogue.github.io/j-moshi/ (japanese and english)

I must admit it's a bit weird when LLMs laugh; I don't really know how I feel about that, but it seems to laugh at the right times. Very tangential, but cockatoos have been known to mimic the right time to laugh, presumably based on tonal cues that a joke was just made (I have experienced this first-hand with rescue birds who live amongst humans).

reply
In voice conversations I tell it not to reply at all or only say “Understood” until I use some kind of code word. Not perfect, but less intrusive.
reply
Roger that, over.
reply
Reducing the network latency helps with this exactly. OpenAI can make better-timed decisions about when to begin responding, so it'll feel less like an interruption. I've also seen some research on full-duplex voice models that handle interruption more like an organic conversation, and low latency will help there as well.
reply
People are migrating to the "End Of Thought" triggers. Deepgram does that wonderfully.
reply
This is more of a VAD/turn-detection issue. It's gotten a lot better over the last few years, but it's a hard problem. Otherwise, the extra ~100ms of latency makes a huge difference, especially for use cases that require tool calling, which can easily add 500ms+ of latency.
reply
It seems like tool calling shouldn't add 500ms of latency?
reply
If you have tool calling complex enough that it necessitates a higher reasoning level, and you would otherwise have reasoning set to "none", this can easily come out to 500ms.
reply
Hard problem. I find myself adding in filler to stop the thing from jabbering.

I also think it spends most of its iq on sounding good rather than thinking about the problem. “Yeah absolutely I can see why you’d like to…” etc. This is likely because it’s on a timer and maybe voice is more expensive to process? Text responses spend more time on the task.

reply
Their voice capable model is several generations behind the state of the art text-only one, as far as I know.

I don’t think it even has reasoning tokens, so it’s no surprise that it’s at most as smart as the “instant” models (i.e., not very).

reply
Fwiw you can prompt it to respond differently to you.
reply
Strongly agree, some of us like to choose our words more carefully when interacting with an LLM.

I've tried to convey this to OpenAI through various available channels (dev forums, app feedback, etc.).

Grok solves this by having an optional push-to-talk mode, but this is not hands-free and thus more cumbersome than just having a user-configurable variable like seconds_delay_before_sending_voice_input.

reply
yeh exactly, you cannot get a strong signal that a user is done speaking without some amount of “wait for 500ms of silence”. You could kick off processing and abandon it if they continued talking, but that seems over-optimized.

1-2s replies feel natural, and like you pointed out, pausing for 2-3s mid-sentence is super normal.
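The "wait for 500ms of silence" endpointing can be sketched as a silence clock that speech frames reset. This is a generic sketch, not any vendor's implementation; the 500ms threshold comes from the comment above, and time is passed in explicitly so the logic is testable.

```python
# Endpointing sketch: the turn ends once the silence clock (time since the
# last speech frame) exceeds a threshold. Threshold per the "~500ms" figure
# discussed above; frame timing is supplied by the caller.

SILENCE_THRESHOLD_S = 0.5  # ~500 ms of silence ends the turn


class Endpointer:
    def __init__(self, threshold_s: float = SILENCE_THRESHOLD_S) -> None:
        self.threshold_s = threshold_s
        self.last_speech_t = None  # timestamp of most recent speech frame

    def feed(self, is_speech: bool, now: float) -> bool:
        """Feed one VAD decision; return True when the turn has ended."""
        if is_speech:
            self.last_speech_t = now  # speech resets the silence clock
            return False
        if self.last_speech_t is None:
            return False  # nothing spoken yet, nothing to end
        return (now - self.last_speech_t) >= self.threshold_s
```

The "kick off processing and abandon" optimization would amount to starting inference at, say, 200ms of silence and cancelling if `feed` sees another speech frame before the threshold trips.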

reply
The AI should be able to model a probability of when it's a natural moment to start talking.
reply
With higher latency this would be even more of an issue. When you pause and start talking again, the model wouldn't catch that until it has already interrupted you.

The actual implementation is at fault. I had some luck with instructing the model to only respond with "Mhm" until I've explicitly finished my thought and asked it a question. Makes this much less of an issue.

But I've decided that their voice mode is completely unusable for a different reason: the model feels incredibly dumb to interact with, keeps repeating and re-phrasing what I said, ends every single answer with a "hook" making the entire interaction idiotically robotic, completely ignores instructions when you ask it to stop that, and - most importantly - doesn't feel helpful for brainstorming. I was completely surprised how bad it is in practice; this should be their killer app but the model feels incredibly badly tuned.

reply
deleted
reply
It’s possible to change the amount of time it waits if you’re using the API
reply