Pardon the simplistic question, but when you say rotation, you're essentially talking about diagonalization, aren't you?

So storing the diagonal matrix and the new basis is more compact?

reply
In this context, the rotation is for spreading energy and ensuring predictable coordinate distributions rather than for diagonalization; it makes coordinate-wise quantization much more computationally efficient, though it throws away learnable structure.
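A minimal numpy sketch of the idea (dimensions, the 4-bit choice, and the outlier setup are illustrative assumptions, not from any specific paper): a random rotation spreads one outlier coordinate's energy across all coordinates, so a simple uniform quantizer wastes far less of its dynamic range.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(d):
    # Random orthogonal matrix via QR decomposition of a Gaussian matrix
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform (Haar) rotation

def quantize(x, bits=4):
    # Uniform scalar quantizer with a single per-vector scale
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    return np.round(x / scale) * scale

d = 64
x = rng.normal(size=d)
x[0] = 50.0  # one outlier coordinate dominates the dynamic range

Q = random_rotation(d)

err_plain = np.linalg.norm(quantize(x) - x)
err_rot = np.linalg.norm(Q.T @ quantize(Q @ x) - x)  # rotate, quantize, rotate back

print(err_plain, err_rot)
```

Since rotations preserve the norm, the reconstruction error measured after rotating back is directly comparable, and all you need to store alongside the quantized codes is the rotation (or just its random seed) and the per-vector scale.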
reply
Ah ok, so intuitively it's like minimizing the error when replacing the values with samples from a well-known distribution. All you need to carry along is the rotation, plus the acceptance that there is some amount of loss.
reply
There are papers that try to quantize angles associated with weights because angles have a more uniform distribution. I haven't read this specific paper, but it looks like it uses a similar trick at a glance.
reply
I just today learned about Multi-Head Latent Attention, which is also sort of a way of compressing the KV cache. Can someone explain how this new development relates to MHLA?
reply
Multi-Head Latent Attention (MLA) is a redesigned attention mechanism that produces lower-dimensional KV-cache entries. Vector quantization can store KV-cache entries using a small number of bits per dimension while ensuring that the resulting attention scores don't change too much. So MLA needs to be part of the model from the beginning of training, whereas VQ can be retrofitted afterwards, and you could also combine the two.
reply
MLA makes the keys and values a function of a smaller latent vector that you cache, instead of caching a full key and value for each token. KV-cache quantization reduces the size of the cache by using fewer bits to store each value. These two approaches operate on different parts of the process, so they can be used in combination: for example, you can quantize the latents that are stored for MLA.
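To make the "quantize the MLA latents" point concrete, here is a hedged numpy sketch (the projection names `W_down`/`W_up_k`, dimensions, and int8 scheme are illustrative assumptions, not DeepSeek's actual MLA implementation): cache a per-token low-dimensional latent as int8 codes plus one scale, then dequantize and up-project to keys at attention time.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_latent, n_tokens = 256, 32, 10

# Hypothetical MLA-style projections: down-project hidden states to a small
# latent that gets cached; up-project back to full-size keys when attending.
W_down = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.normal(size=(d_latent, d_model)) / np.sqrt(d_latent)

def quantize_int8(x):
    # Per-token symmetric int8 quantization: int8 codes + one float scale each
    scale = np.abs(x).max(axis=-1, keepdims=True) / 127.0
    codes = np.round(x / scale).astype(np.int8)
    return codes, scale

def dequantize(codes, scale):
    return codes.astype(np.float32) * scale

h = rng.normal(size=(n_tokens, d_model))      # hidden states
latents = h @ W_down                          # what MLA would cache
codes, scale = quantize_int8(latents)         # quantized cache entries

k_exact = latents @ W_up_k                    # keys from exact latents
k_quant = dequantize(codes, scale) @ W_up_k   # keys from quantized cache

rel_err = np.linalg.norm(k_quant - k_exact) / np.linalg.norm(k_exact)
print(rel_err)
```

The two savings multiply: MLA shrinks the cached dimension (here 256 to 32), and quantization shrinks the bytes per dimension (here fp32 to int8), while the up-projected keys stay close to the exact ones.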
reply
If they didn't cite your paper that's bullshit.

But if they read your paper closely enough that they invited you to give a talk, that probably means they were far enough along to independently invent it that they would have done so anyway, and wanted to chat with someone who was already doing the thing they were doing. Good ideas tend to reveal themselves to anyone who is aware of the problem.

reply
To be clear, I am not claiming they stole an idea. They have done significant independent research. However, a specific part regarding the treatment of rotation with bias correction relates to prior work, and it would be appropriate to have that recognized.
reply
That's rationalizing like crazy. If they knew about it they should have cited it.
reply
Doesn't matter, you should still cite. It's basic manners in science.
reply
Exactly, that's why the section is called "Related Work".
reply
The earlier paper was from 2021!
reply
> But if they read your paper closely enough that they invited you to give a talk, that probably means they were far enough along to independently invent it

That's more than a stretch. They likely invited them because someone thought the abstract sounded interesting, or something like that.

reply
Schmidhuber'd
reply