An eventual goal is likely to allow interacting with the LLM directly via audio tokens for both input and output, skipping TTS and STT completely.