I think the primary reason it works is that the difference between K and Q, which is not all that obvious, is that it lets the model have an asymmetric relationship between tokens: one token can attend to another without the reverse being true. It seems to me that if you just have a single value, you’re representing a symmetric relationship, which might degrade the quality of reasoning over a set of tokens, but is also probably workable.
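To make the symmetry point concrete, here is a minimal sketch in plain Python (no libraries). It assumes a toy single-head setup with random weights; the names `X`, `W_q`, `W_k` and the helper functions are made up for illustration and are not taken from the paper. It only compares the raw pre-softmax scores with separate Q and K projections against the scores when one projection is reused for both.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns."""
    return [list(col) for col in zip(*m)]

def random_matrix(rows, cols, rng):
    """Matrix of Gaussian samples (toy stand-in for embeddings or weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def is_symmetric(m, tol=1e-9):
    """True if m[i][j] equals m[j][i] (within tol) for every pair."""
    n = len(m)
    return all(abs(m[i][j] - m[j][i]) <= tol for i in range(n) for j in range(n))

rng = random.Random(0)
n_tokens, d_model, d_head = 4, 8, 8           # hypothetical toy sizes

X = random_matrix(n_tokens, d_model, rng)     # toy token embeddings
W_q = random_matrix(d_model, d_head, rng)     # query projection
W_k = random_matrix(d_model, d_head, rng)     # separate key projection

Q = matmul(X, W_q)
K = matmul(X, W_k)

# Separate projections: score[i][j] and score[j][i] differ in general.
scores_separate = matmul(Q, transpose(K))
# Shared projection (Q=K): score[i][j] is the same dot product as score[j][i].
scores_shared = matmul(Q, transpose(Q))

print(is_symmetric(scores_separate))  # False
print(is_symmetric(scores_shared))    # True
```

Note that this is only about the raw scores. The softmax normalizes each row separately, and a causal mask removes half the matrix, so the final attention weights are not exactly symmetric even with Q=K.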
reply
It seems similar to the class of optimizations associated with linear or state-space attention. One thing models often do, once they figure out an optimization like this, is create a ratio between full-resolution blocks and blocks that have the optimization implemented.
reply
It confused me too.

An n-tuple notation would have been more readable and mathematically accurate, like (Q=K, V), (Q, K=V), and (Q=K=V).

reply
> Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.

In fact, on the second-to-last page of the paper, they discuss this very problem. There is a clear correlation between performance and increasing sequence length for the Q-K=V model. While limited to a small n=3 sample of lengths 512, 1024, and 2048, the degradation decreases from 5.4% to 2.2% as context is increased, which makes it unlikely that short sequences are the reason K=V performs acceptably.

reply
Yeah, the weird notation confused me too. Their own Limitations section also says their experiments are too small. I am quite curious how it will play out at larger scale, but unironically I cannot afford the hardware lol.
reply
Would it have killed them to use a comma instead?!
reply
Wha? Why didn't they use Q=K=V for that?
reply
The notation is supposed to mean: you have a matrix Q, and also a shared K=V matrix.

I agree with GP that it's super confusing to use the minus sign as a delimiter between formulas. The tuple notation suggested elsewhere would be way clearer.
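For anyone who wants the "Q, plus a shared K=V" reading spelled out, here is a rough sketch in plain Python. It assumes single-head scaled dot-product attention with no causal mask; `attention_shared_kv`, `W_q`, and `W_kv` are hypothetical names for illustration, not the paper's code.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention_shared_kv(X, W_q, W_kv):
    """Single-head attention where one projection serves as both K and V."""
    Q = matmul(X, W_q)                      # queries keep their own projection
    KV = matmul(X, W_kv)                    # shared matrix used as keys AND values
    KV_T = [list(col) for col in zip(*KV)]  # transpose for the score product
    scale = 1.0 / math.sqrt(len(Q[0]))      # usual 1/sqrt(d_head) scaling
    scores = matmul(Q, KV_T)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, KV), weights

rng = random.Random(0)
n_tokens, d_model, d_head = 4, 8, 8         # hypothetical toy sizes

def random_matrix(rows, cols):
    """Matrix of Gaussian samples (toy stand-in for embeddings or weights)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

X = random_matrix(n_tokens, d_model)        # toy token embeddings
W_q = random_matrix(d_model, d_head)
W_kv = random_matrix(d_model, d_head)       # one weight matrix instead of W_k and W_v

out, weights = attention_shared_kv(X, W_q, W_kv)
print(len(out), len(out[0]))  # 4 8
```

The only change from ordinary attention in this sketch is that the separate key and value weight matrices are replaced by the single `W_kv`, so the same projected vectors decide both where to look and what gets returned.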

reply
It's not a math paper
reply
Does it not being an English philology paper mean they are free to spell “fish” as “ghoti”?
reply
Definitely an applied maths paper given that it has been published under CS/ML and been accepted at ICML.
reply
It’s not typeset in math mode so you can’t expect the hyphen to correspond to minus.
reply
By this logic a lot of applied maths papers become “does not compile” :D
reply
Cannot tell whether sarcasm or not.
reply