Agreed. Audio is strongly temporal, so there is almost certainly some positional encoding one way or another.
Audio is one-dimensional, so the usual RoPE position encoding should handle it the same way it does for text tokens. You only need extra position encoding for higher-dimensional inputs like images.
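For anyone who wants to see why RoPE only needs a 1-D index, here is a minimal sketch in plain Python. It is not taken from any particular model; `rope`, `dot`, the base of 10000, and the toy vectors are illustrative assumptions. The point it shows is that the query/key dot product depends only on the offset between two positions, which works the same for audio frames as for text tokens.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles proportional to `pos`.

    Hypothetical helper for illustration; `base` is the commonly used
    default, assumed here rather than taken from the thread.
    """
    d = len(vec)
    assert d % 2 == 0, "RoPE needs an even number of dimensions"
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)       # one frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]  # 2-D rotation of the pair
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Toy query and key vectors (made-up values).
q = [0.3, -1.2, 0.7, 0.5]
k = [1.0, 0.4, -0.6, 0.9]

# Same relative offset (3 steps) at two very different absolute positions.
s1 = dot(rope(q, 5), rope(k, 2))
s2 = dot(rope(q, 105), rope(k, 102))
```

`s1` and `s2` agree up to floating-point error, so a single sequence index is all a 1-D signal needs.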
Ah yeah, thinking about it further, it's probably just using some positional embedding based on sequence index, added in the LLM layers. For vision it needs the patch location as well.
No, there isn't. Read the paper. The input is just 40 ms of raw audio samples, multiplied by a single matrix to project it to a 3800-dimensional input vector. That's it. The next 40 ms is fed in at the next transformer input step, without any positional encoding. Repeat ad infinitum.
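A rough sketch of that pipeline as described here, in plain Python. The 40 ms frame and the 3800-wide vector come from the comment; the 16 kHz sample rate, the random stand-in for the learned matrix, and the helper names `make_projection` and `frames_to_inputs` are assumptions for illustration only.

```python
import random

SAMPLE_RATE = 16_000   # assumption: the thread does not state the sample rate
FRAME_MS = 40          # frame length quoted in the comment
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000   # 640 samples per frame here
D_MODEL = 3800         # input vector size quoted in the comment

def make_projection(d_model, frame_len, seed=0):
    """Random stand-in for the learned matrix (hypothetical helper)."""
    rng = random.Random(seed)
    scale = frame_len ** -0.5
    return [[rng.gauss(0.0, scale) for _ in range(frame_len)]
            for _ in range(d_model)]

def frames_to_inputs(samples, weight, frame_len):
    """Cut raw samples into non-overlapping frames and project each one.

    No positional term is added anywhere: each frame maps to one input vector.
    """
    inputs = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        inputs.append([sum(w * x for w, x in zip(row, frame))
                       for row in weight])
    return inputs

weight = make_projection(D_MODEL, FRAME_LEN)
rng = random.Random(1)
audio = [rng.uniform(-1.0, 1.0) for _ in range(2 * FRAME_LEN)]  # 80 ms of fake audio
vectors = frames_to_inputs(audio, weight, FRAME_LEN)
```

With 80 ms of input this yields two vectors of length 3800, one per transformer input step.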