It's all tokens at the end of the day, not really text or video or audio, just like everything on a machine is just bits and it's up to the program to interpret them as a certain file format. These models are better described as speech-to-speech (+ text), in that they can read and emit text tokens too. So the flow is: you ask it something, and it generates,

Audio Tokens: "Let me check that for you..." (Sent to the speaker)

Special Token: [CALL_TOOL: get_weather]

Text Tokens: {"location": "Seattle, WA"}

Special Token: [STOP]

The orchestrator around the model catches the CALL_TOOL token, calls the tool, and then injects the tool's result back into the audio model's context. The model then generates new tokens based on that result.
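To make that loop concrete, here's a rough Python sketch. Everything in it is made up for illustration: the `Token` type, the token kinds, `fake_model` (a scripted stand-in for the audio model), and the `get_weather` tool are all assumptions, not any real model's API. It only shows the shape of the orchestrator: stream tokens, send audio to the speaker, catch the tool call, run it, append the result to the context, and let the model go again.

```python
import json
from dataclasses import dataclass


@dataclass
class Token:
    kind: str   # hypothetical kinds: "audio", "call_tool", "text", "stop", "tool_result"
    value: str


def get_weather(location: str) -> str:
    # Hypothetical tool; a real one would query a weather service.
    return f"Sunny in {location}"


TOOLS = {"get_weather": get_weather}


def fake_model(context):
    """Scripted stand-in for the audio model (an assumption, not a real API)."""
    results = [t.value for t in context if t.kind == "tool_result"]
    if not results:
        # First pass: speak, then ask for a tool call with JSON arguments.
        yield Token("audio", "Let me check that for you...")
        yield Token("call_tool", "get_weather")
        yield Token("text", '{"location": "Seattle, WA"}')
        yield Token("stop", "")
    else:
        # Second pass: the tool result is in context, so answer with it.
        yield Token("audio", f"Weather report: {results[-1]}")
        yield Token("stop", "")


def run_turn(context, model=fake_model):
    """Orchestrator loop: route audio, catch tool calls, feed results back."""
    spoken = []  # stands in for "sent to the speaker"
    while True:
        tool_name, args_text = None, ""
        for tok in model(context):
            if tok.kind == "audio":
                spoken.append(tok.value)
            elif tok.kind == "call_tool":
                tool_name = tok.value
            elif tok.kind == "text" and tool_name is not None:
                args_text += tok.value  # text tokens carry the JSON arguments
            elif tok.kind == "stop":
                break
        if tool_name is None:
            return spoken  # no tool requested, so the turn is finished
        result = TOOLS[tool_name](**json.loads(args_text))
        context.append(Token("tool_result", result))  # inject into context
```

Calling `run_turn([Token("user_audio", "what's the weather?")])` runs the model twice: once to get the tool call, and once more after the result has been appended to the context.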