Audio Tokens: "Let me check that for you..." (Sent to the speaker)
Special Token: [CALL_TOOL: get_weather]
Text Tokens: {"location": "Seattle, WA"}
Special Token: [STOP]
The model's orchestrator catches the CALL_TOOL token, calls the named tool with the arguments that follow it, and injects the tool's result into the audio model's context. The model then generates new tokens conditioned on that result.
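The orchestrator's loop can be sketched roughly as follows. This is only an illustration under stated assumptions: the (kind, payload) token representation, the run_turn function, the TOOLS registry, the context message format, and the stubbed get_weather are all hypothetical names invented for this example, not any particular model's API.

```python
import json
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical token representation: (kind, payload) pairs.
Token = Tuple[str, str]


def get_weather(location: str) -> str:
    # Stand-in tool; a real one would query a weather service.
    return f"Weather for {location}: (stub result)"


# Hypothetical registry mapping tool names to callables.
TOOLS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def run_turn(tokens: Iterable[Token], context: List[dict], speaker: List[str]) -> bool:
    """Consume one model output stream.

    Audio tokens go to the speaker. If a tool call is seen, its JSON
    arguments are collected until STOP, the tool is run, and the result is
    appended to the context. Returns True when the model should be run
    again on the updated context.
    """
    pending_tool = None
    arg_text: List[str] = []
    for kind, payload in tokens:
        if kind == "audio":
            speaker.append(payload)      # played to the user immediately
        elif kind == "call_tool":
            pending_tool = payload       # e.g. "get_weather"
        elif kind == "text" and pending_tool is not None:
            arg_text.append(payload)     # JSON arguments for the tool
        elif kind == "stop":
            break
    if pending_tool is None:
        return False                     # plain reply, nothing to inject
    args = json.loads("".join(arg_text))
    result = TOOLS[pending_tool](**args)
    context.append({"role": "tool", "name": pending_tool, "content": result})
    return True


# The stream from the example above, in the assumed representation.
stream = [
    ("audio", "Let me check that for you..."),
    ("call_tool", "get_weather"),
    ("text", '{"location": "Seattle, WA"}'),
    ("stop", ""),
]
context: List[dict] = []
speaker: List[str] = []
needs_another_pass = run_turn(stream, context, speaker)
```

When run_turn returns True, the orchestrator would call the model again with the updated context so it can speak the answer. Keeping audio playback separate from tool handling is what lets the filler phrase reach the speaker before the tool has finished.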