There are probably use cases for this though; open to being educated on those.
Audio Tokens: "Let me check that for you..." (Sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]
The orchestrator catches the CALL_TOOL token, invokes the tool, and injects the result back into the audio model's context; the model then generates new tokens based on it.
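A minimal sketch of that orchestrator loop, assuming the token stream above. The function names, the stub `get_weather` tool, and the exact token format are illustrative, not any real model's API:

```python
import json

def run_orchestrator(model_tokens, tools):
    """Scan a token stream; when a [CALL_TOOL: name] token appears,
    buffer the following text tokens as JSON arguments until [STOP],
    then invoke the named tool and return its result (which would be
    injected back into the model's context)."""
    buffer, active_tool = [], None
    for tok in model_tokens:
        if tok.startswith("[CALL_TOOL:"):
            active_tool = tok[len("[CALL_TOOL:"):-1].strip()
        elif tok == "[STOP]" and active_tool:
            args = json.loads("".join(buffer))
            return tools[active_tool](**args)  # result re-enters the context
        elif active_tool:
            buffer.append(tok)
    return None  # no tool call in this stream

def get_weather(location):
    # stub tool for the sketch; a real one would hit a weather API
    return f"Sunny in {location}"

stream = ['[CALL_TOOL: get_weather]', '{"location": ', '"Seattle, WA"}', '[STOP]']
result = run_orchestrator(stream, {"get_weather": get_weather})
```

Audio tokens emitted before the CALL_TOOL token (the "Let me check that for you...") would simply pass through to the speaker; only the span between CALL_TOOL and STOP is captured as arguments.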
Agree, ChatGPT Advanced Voice Mode is so bad for the quality of the actual responses: old model, no reasoning, little tool use.
I just want hands free conversations with SOTA models and don’t care if I have to wait a couple of seconds for a reply.
"PersonaPlex accepts a text system prompt that steers conversational behavior. Without focused instructions, the model rambles — it’s trained on open-ended conversation and will happily discuss cooking when asked about shipping.
Several presets are available via CLI (--list-prompts) or API, including a general assistant (default), customer service agent, and teacher. Custom prompts can also be pre-tokenized and passed directly.
The difference is dramatic. Same input — “Can you guarantee that the replacement part will be shipped tomorrow?”:
No prompt: “So, what type of cooking do you like — outdoor grilling? I can’t say for sure, but if you’re ordering today…”
With prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”"
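The "pre-tokenized custom prompt" idea can be sketched generically. Everything here is a hypothetical stand-in (the toy tokenizer, the prompt text, the idea of passing IDs up front), not PersonaPlex's actual API:

```python
class SimpleTokenizer:
    """Toy whitespace tokenizer: maps each new word to the next
    integer ID. Real systems use a trained subword vocabulary."""
    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]

# A focused system prompt in the spirit of the customer-service preset.
prompt = ("You are a customer service agent. Answer only questions about "
          "orders and shipping, and never promise exact delivery dates.")

tok = SimpleTokenizer()
prompt_ids = tok.encode(prompt)
# These IDs would be fed to the model once, ahead of the audio stream,
# so the conversation starts already steered toward the persona.
```

The point of pre-tokenizing is that the steering text is paid for once at session start rather than re-encoded on every turn.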