Pardon the simplistic question, but when you say rotation, you're essentially talking about diagonalization, aren't you?

So storing the diagonal matrix and the new basis is more compact?

reply
In this context, the rotation is for spreading energy and ensuring predictable coordinate distributions rather than for diagonalization; it makes coordinate-wise quantization much more computationally efficient, though it throws away learnable structure.
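A minimal numpy sketch of the idea (dimensions, the 4-bit choice, and the outlier setup are illustrative assumptions, not from any specific paper): a random rotation spreads one outlier coordinate's energy across all coordinates, so a simple uniform quantizer wastes far less of its dynamic range.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform (Haar) rotation

def quantize(x, bits=4):
    # Uniform scalar quantizer with a single per-vector scale
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

d = 64
x = rng.normal(size=d)
x[0] = 50.0  # one outlier coordinate dominates the dynamic range

Q = random_rotation(d)

err_plain = np.linalg.norm(quantize(x) - x)
err_rot = np.linalg.norm(Q.T @ quantize(Q @ x) - x)  # rotate, quantize, rotate back

print(err_plain, err_rot)
```

Since rotations preserve the norm, the reconstruction error measured after rotating back is directly comparable, and all you need to store alongside the quantized codes is the rotation (or just its random seed) and the per-vector scale.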
reply
Ah ok, so intuitively it's like minimizing the error when replacing the values with samples from a well-known distribution. All you need to carry along is the rotation, plus the acceptance that there is some amount of loss.
reply
There are papers that try to quantize angles associated with weights because angles have a more uniform distribution. I haven't read this specific paper, but it looks like it uses a similar trick at a glance.
reply
I just today learned about Multi-Head Latent Attention, which is also sort of a way of compressing the KV cache. Can someone explain how this new development relates to MHLA?
reply
Multi-Head Latent Attention (MLA) is a redesigned attention mechanism that produces lower-dimensional KV-cache entries. Vector quantization can store KV-cache entries using a small number of bits per dimension while ensuring that the resulting attention scores don't change too much. So MLA needs to be part of the model from the beginning of training, whereas VQ can be retrofitted afterwards, and you could also combine the two.
reply
MLA makes the keys and values a function of a smaller latent vector that you cache, instead of caching a full key and value for each token. KV-cache quantization reduces the size of the cache by using fewer bits to store each value. These two approaches operate on different parts of the process, so they can be used in combination: for example, you can quantize the latents that are stored for MLA.
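To make the "quantize the MLA latents" point concrete, here is a hedged numpy sketch (the projection names `W_down`/`W_up_k`, dimensions, and int8 scheme are illustrative assumptions, not DeepSeek's actual MLA implementation): cache a per-token low-dimensional latent as int8 codes plus one scale, then dequantize and up-project to keys at attention time.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, n_tokens = 256, 32, 10

# Hypothetical MLA-style projections: down-project hidden states to a small
# latent that gets cached; up-project back to full-size keys when attending.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

def quantize_int8(x):
    # Per-token symmetric int8 quantization: int8 codes + one float scale each
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

h = rng.normal(size=(n_tokens, d_model))      # hidden states
latents = h @ W_down                          # what MLA would cache
codes, scale = quantize_int8(latents)         # quantized cache entries

k_exact = latents @ W_up_k                    # keys from exact latents
k_quant = dequantize(codes, scale) @ W_up_k   # keys from quantized cache

rel_err = np.linalg.norm(k_quant - k_exact) / np.linalg.norm(k_exact)
print(rel_err)
```

The two savings multiply: MLA shrinks the cached dimension (here 256 to 32), and quantization shrinks the bytes per dimension (here fp32 to int8), while the up-projected keys stay close to the exact ones.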
reply
If they didn't cite your paper that's bullshit.

But if they read your paper closely enough that they invited you to give a talk, that probably means they were far enough along to independently invent it that they would have done so anyway, and wanted to chat with someone who was already doing the thing they were doing. Good ideas tend to reveal themselves to anyone who is aware of the problem.

reply
To be clear, I am not claiming they stole an idea. They have done significant independent research. However, a specific part regarding the treatment of rotation with bias correction relates to prior work, and it would be appropriate to have that recognized.
reply
That's rationalizing like crazy. If they knew about it they should have cited it.
reply
Doesn't matter, you should still cite. It's basic manners in science.
reply
Exactly, that's why the section is called "Related Work".
reply
The earlier paper was from 2021!
reply
> But if they read your paper closely enough that they invited you to give a talk, that probably means they were far enough along to independently invent it

That's more than a stretch. They likely invited them because someone thought the abstract sounded interesting, or something like that.

reply
Schmidhuber'd
reply