Audio Tokens: "Let me check that for you..." (Sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]
The orchestrator around the model catches the [CALL_TOOL] token, runs the tool, and injects the result back into the audio model's context; the model then generates new tokens conditioned on that result.
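A minimal sketch of that loop, assuming a hypothetical `generate(context)` function that returns the model's raw output string and a `tools` dict mapping tool names to Python functions (none of this is a real API, just the shape of the idea):

```python
import json
import re

# Matches the special-token pattern from the example above:
# [CALL_TOOL: name] {...json args...} [STOP]
CALL_TOOL = re.compile(r"\[CALL_TOOL:\s*(\w+)\]\s*(\{.*?\})\s*\[STOP\]", re.S)

def orchestrate(generate, tools, context):
    """Run the generate -> catch tool call -> inject result loop."""
    while True:
        out = generate(context)
        m = CALL_TOOL.search(out)
        if not m:
            return out                        # no tool call: final answer
        name, args = m.group(1), json.loads(m.group(2))
        result = tools[name](**args)          # actually run the tool
        # Inject the tool result into the context; the next generate()
        # call produces new tokens conditioned on it.
        context = context + [f"[TOOL_RESULT: {name}] {json.dumps(result)}"]
```

The audio tokens ("Let me check that for you...") can be streamed to the speaker before the tool even returns, which is what hides the latency.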
Agreed, ChatGPT Advanced Voice Mode is so bad for the quality of the actual responses: an older model, no reasoning, and little tool use.
I just want hands-free conversations with SOTA models, and I don't care if I have to wait a couple of seconds for a reply.