There are probably use cases for this though; open to being educated on those.
Audio Tokens: "Let me check that for you..." (Sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]
The orchestrator catches the CALL_TOOL token, invokes the tool, and injects the result back into the audio model's context; the model then generates new tokens based on it.
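A minimal sketch of that orchestrator loop, assuming the token stream above. The function names, the stub `get_weather` tool, and the exact token format are illustrative, not any real model's API:

```python
import json

def run_orchestrator(model_tokens, tools):
    """Scan a token stream; when a [CALL_TOOL: name] token appears,
    buffer the following text tokens as JSON arguments until [STOP],
    then invoke the named tool and return its result (which would be
    injected back into the model's context)."""
    buffer, active_tool = [], None
    for tok in model_tokens:
        if tok.startswith("[CALL_TOOL:"):
            active_tool = tok[len("[CALL_TOOL:"):-1].strip()
        elif tok == "[STOP]" and active_tool:
            args = json.loads("".join(buffer))
            return tools[active_tool](**args)  # result re-enters the context
        elif active_tool:
            buffer.append(tok)
    return None  # no tool call in this stream

def get_weather(location):
    # stub tool for the sketch; a real one would hit a weather API
    return f"Sunny in {location}"

stream = ['[CALL_TOOL: get_weather]', '{"location": ', '"Seattle, WA"}', '[STOP]']
result = run_orchestrator(stream, {"get_weather": get_weather})
```

Audio tokens emitted before the CALL_TOOL token (the "Let me check that for you...") would simply pass through to the speaker; only the span between CALL_TOOL and STOP is captured as arguments.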
Agree, ChatGPT Advanced Voice Mode is so bad for the quality of the actual responses: old model, no reasoning, little tool use.
I just want hands free conversations with SOTA models and don’t care if I have to wait a couple of seconds for a reply.
"PersonaPlex accepts a text system prompt that steers conversational behavior. Without focused instructions, the model rambles — it’s trained on open-ended conversation and will happily discuss cooking when asked about shipping.
Several presets are available via CLI (--list-prompts) or API, including a general assistant (default), customer service agent, and teacher. Custom prompts can also be pre-tokenized and passed directly.
The difference is dramatic. Same input — “Can you guarantee that the replacement part will be shipped tomorrow?”:
No prompt: “So, what type of cooking do you like — outdoor grilling? I can’t say for sure, but if you’re ordering today…”
With prompt: “I can’t promise a specific time, but we’ll do our best to get it out tomorrow. It’s one of the top priorities, so yes, we’ll try to get it done as soon as possible and ship it first thing in the morning.”"
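The "pre-tokenized custom prompt" idea can be sketched generically. Everything here is a hypothetical stand-in (the toy tokenizer, the prompt text, the idea of passing IDs up front), not PersonaPlex's actual API:

```python
class SimpleTokenizer:
    """Toy whitespace tokenizer: maps each new word to the next
    integer ID. Real systems use a trained subword vocabulary."""
    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]

# A focused system prompt in the spirit of the customer-service preset.
prompt = ("You are a customer service agent. Answer only questions about "
          "orders and shipping, and never promise exact delivery dates.")

tok = SimpleTokenizer()
prompt_ids = tok.encode(prompt)
# These IDs would be fed to the model once, ahead of the audio stream,
# so the conversation starts already steered toward the persona.
```

The point of pre-tokenizing is that the steering text is paid for once at session start rather than re-encoded on every turn.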