I think the primary reason it works comes down to the difference between K and Q, which is not all that obvious: keeping them separate lets the model have an asymmetric relationship between tokens, so one token can attend to another without the reverse being true. It seems to me that if you have just a single shared projection, you're representing a symmetric relationship, which might degrade the quality of reasoning over a set of tokens, but is probably also workable.
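
To make the symmetry point concrete, here is a minimal sketch in plain Python. The toy sizes, the random weights, and every helper name (`matmul`, `attention_logits`, `is_symmetric`) are assumptions for illustration and are not taken from the paper. It only compares pre-softmax scores; a causal mask or softmax would change the picture.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def attention_logits(X, W_q, W_k):
    """Pre-softmax scores: entry [i][j] is how strongly token i scores token j."""
    Q = matmul(X, W_q)
    K = matmul(X, W_k)
    scale = math.sqrt(len(W_q[0]))
    return [[s / scale for s in row] for row in matmul(Q, transpose(K))]

def is_symmetric(S, tol=1e-9):
    n = len(S)
    return all(abs(S[i][j] - S[j][i]) <= tol for i in range(n) for j in range(n))

rng = random.Random(0)
n_tokens, d_model, d_head = 4, 8, 8          # toy sizes (assumed)
X = random_matrix(n_tokens, d_model, rng)    # toy token embeddings
W_q = random_matrix(d_model, d_head, rng)    # query projection
W_k = random_matrix(d_model, d_head, rng)    # key projection

separate = attention_logits(X, W_q, W_k)     # distinct Q and K
shared = attention_logits(X, W_q, W_q)       # Q = K: one shared projection

separate_is_symmetric = is_symmetric(separate)
shared_is_symmetric = is_symmetric(shared)
print(separate_is_symmetric, shared_is_symmetric)
```

With a shared projection the score of token i for token j always equals the score of j for i; with separate projections the two are free to differ.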

by joshuamoyers4 minutes ago|

It seems similar to the class of optimizations associated with linear or state-space attention. One thing models often do, once someone figures out an optimization like this, is use a ratio between full-resolution blocks and blocks that have the optimization implemented.
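
As a rough sketch of what such a ratio could look like, the snippet below builds a per-layer schedule in which every n-th block uses full attention and the rest use the cheaper variant. The function name `layer_schedule`, the labels, and the 3:1 ratio are all hypothetical and are not from the paper or any particular model.

```python
def layer_schedule(n_layers, full_every):
    """Label each block: every `full_every`-th block is full attention,
    all other blocks use the optimized (cheaper) variant."""
    return [
        "full" if (i + 1) % full_every == 0 else "optimized"
        for i in range(n_layers)
    ]

# Assumed example: 8 blocks, 3 optimized blocks for every full one.
schedule = layer_schedule(8, 4)
print(schedule)
```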

by kanbankaren16 hours ago|

It confused me too.

An n-tuple notation would have been more readable and mathematically accurate, like (Q=K, V), (Q, K=V), and (Q=K=V).

by amemi19 hours ago|

> Maybe it works because the sequences are short and the dimension is high and there's plenty of room for interesting results to fit in the merged key/value space.

In fact, on the second-to-last page of the paper, they discuss this very problem. There is a clear correlation between performance and increasing sequence length for the Q-K=V model. Although the sample is small (n=3: lengths 512, 1024, and 2048), the degradation decreases from 5.4% to 2.2% as context is increased, suggesting that shorter sequences are unlikely to be the reason K=V performs acceptably.

by xiaoyu200619 hours ago|

Yeah, the weird notation confused me too. Their own Limitations section also says their experiments are too small. I am quite curious how it will play out at a bigger scale, but unironically I cannot afford the hardware lol.

by ssivark13 hours ago|

Would it have killed them to use a comma instead?!

by sfink14 hours ago|

Wha? Why didn't they use Q=K=V for that?

by simsla6 hours ago|

The notation is supposed to mean: you have a matrix Q, and also a shared K=V matrix.
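
Read that way, a single attention head might look like the sketch below: one projection for Q and one shared projection that serves as both keys and values. This is an illustration under that reading only, not the paper's implementation; the names `shared_kv_attention` and `W_kv` and the toy sizes are assumptions.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)                               # subtract max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def shared_kv_attention(X, W_q, W_kv):
    """Single-head attention where one projection supplies both K and V."""
    Q = matmul(X, W_q)
    KV = matmul(X, W_kv)                       # used as keys AND as values
    scale = math.sqrt(len(W_q[0]))
    logits = matmul(Q, [list(col) for col in zip(*KV)])
    weights = [softmax([s / scale for s in row]) for row in logits]
    return matmul(weights, KV), weights

rng = random.Random(0)
n_tokens, d_model, d_head = 4, 8, 8            # toy sizes (assumed)
X = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
W_q = [[rng.gauss(0, 1) for _ in range(d_head)] for _ in range(d_model)]
W_kv = [[rng.gauss(0, 1) for _ in range(d_head)] for _ in range(d_model)]

out, weights = shared_kv_attention(X, W_q, W_kv)
print(len(out), len(out[0]))
```

The head stores two projection matrices instead of three, and Q stays separate, so the scores are not forced to be symmetric.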

I agree with GP that it's super confusing to use the minus sign as a delimiter between formulas. The tuple notation suggested elsewhere would be way clearer.

by semiinfinitely16 hours ago|

It's not a math paper.

by volemo14 hours ago|

Does it not being an English philology paper mean they are free to spell “fish” as “ghoti”?

by srean13 hours ago|
