It can't be successful at that any more than 1+1 can equal 3. Fundamentally, if every token wants to be able to look at every previous token without loss of information, it must be O(n^2); N tokens looking at N tokens is quadratic. Any sub-quadratic attention must hence necessarily lose some information and be unable to support perfect recall on longer sequences.
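Concretely, here's a rough numpy sketch of plain single-head attention (no mask); the n-by-n scores matrix is the quadratic term in question:

    import numpy as np

    def attention(Q, K, V):
        # Q, K, V have shape (n, d). scores has shape (n, n): every
        # token attends to every other token, which is where the
        # O(n^2) time and memory come from.
        scores = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(scores - scores.max(axis=-1, keepdims=True))
        w /= w.sum(axis=-1, keepdims=True)  # row-wise softmax
        return w @ V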
reply
> N tokens looking at N tokens is quadratic

Convolving two arrays can be done perfectly accurately in O(n log n), despite every element being combined with every other element.
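A rough sketch of that via the convolution theorem: transform, multiply pointwise (linear in the frequency domain), transform back, O(n log n) overall.

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([4.0, 5.0, 6.0, 7.0])
    m = len(a) + len(b) - 1  # length of the full convolution

    # Direct O(n^2): every a[i] is combined with every b[j].
    direct = np.convolve(a, b)

    # O(n log n): pointwise multiply in the frequency domain.
    fast = np.fft.irfft(np.fft.rfft(a, m) * np.fft.rfft(b, m), m)

    assert np.allclose(direct, fast)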

Or consider the even more basic sum of products a[i] * b[j] for all possible i, j:

    # Naive O(len(a) * len(b)): touch every pair (i, j).
    total = 0
    for i in range(len(a)):
        for j in range(len(b)):
            total += a[i] * b[j]
This can be computed in linear time as sum(a) * sum(b).
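The double sum factors because sum_i sum_j a[i]*b[j] = (sum_i a[i]) * (sum_j b[j]). Quick check with integer inputs:

    a, b = [1, 2, 3], [4, 5]
    assert sum(a) * sum(b) == sum(x * y for x in a for y in b)  # both 54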

Your logic that 'the result contains terms of all pairs, therefore the algorithm must be quadratic' simply doesn't hold.

reply
One of my favorite bits of my PhD dissertation was factoring an intractable 3-dimensional integral (one whose integrand separates as f(x, y, z) = g(x, y) h(y, z))

\iiint f(x, y, z)\, dx\, dy\, dz = \int \Big[ \int g(x, y)\, dx \Big] \Big[ \int h(y, z)\, dz \Big] dy

which greatly accelerated numerical integration (O(n^2) rather than O(n^3)).
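In numpy terms the trick looks roughly like this (g and h here are hypothetical stand-ins, and plain Riemann sums stand in for the real quadrature):

    import numpy as np

    # Hypothetical separable integrand: f(x, y, z) = g(x, y) * h(y, z).
    def g(x, y): return np.exp(-(x - y) ** 2)
    def h(y, z): return np.cos(y * z)

    n = 100
    xs = ys = zs = np.linspace(0.0, 1.0, n)
    dx = xs[1] - xs[0]

    # Naive O(n^3): evaluate f over the full 3-D grid.
    X, Y, Z = np.meshgrid(xs, ys, zs, indexing="ij")
    naive = (g(X, Y) * h(Y, Z)).sum() * dx**3

    # Factored O(n^2): a 1-D integral over x and one over z for each
    # y, then a final 1-D integral over y.
    Gx = g(xs[:, None], ys[None, :]).sum(axis=0) * dx  # integral of g dx, per y
    Hz = h(ys[:, None], zs[None, :]).sum(axis=1) * dx  # integral of h dz, per y
    fast = (Gx * Hz).sum() * dx

    assert np.isclose(naive, fast)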

My advisor was not particularly impressed and objectively I could have skipped it and let the simulations take a bit longer (quite a bit longer--this integration was done millions of times for different function parameters in an inner loop). But it was clever and all mine and I was proud of it.

reply
Convolution is a local operation.

Attention is a global operation.

reply
This brings me back to DSP class. Man, learning about the FFT was eye-opening.
reply
That's like saying sorting can be done in O(n) because radix sort exists. If you assume some structure, you lose generality, i.e. there'll be some problems it's no longer able to solve. It can no longer approximate any arbitrary function that needs perfect memory over the sequence.
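(Radix sort being linear only because it assumes fixed-width non-negative integer keys; a minimal LSD sketch:)

    def radix_sort(xs, key_bits=32):
        # O(n * key_bits / 8): stable bucketing on one byte at a time,
        # least-significant byte first. Linear in n only because the
        # keys are assumed to be bounded integers.
        for shift in range(0, key_bits, 8):
            buckets = [[] for _ in range(256)]
            for x in xs:
                buckets[(x >> shift) & 0xFF].append(x)
            xs = [x for bucket in buckets for x in bucket]
        return xs

    assert radix_sort([170, 45, 75, 802, 2, 24]) == [2, 24, 45, 75, 170, 802]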
reply
I'm not saying if the paper is correct or not (since I can't tell), but I don't think your argument really holds. Consider applying it to multiplication:

Fundamentally, multiplication needs to look at every pair of digits from the two input numbers. It must be O(n^2); N digits looking at N other digits is quadratic. Any sub-quadratic multiplication must hence necessarily lose some information.
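(Which it doesn't: Karatsuba multiplication, for one, is sub-quadratic. A minimal sketch, using three half-size recursive products instead of four:)

    def karatsuba(x, y):
        # Multiply non-negative integers with three recursive half-size
        # products instead of four: O(n^log2(3)) ~ O(n^1.585) digit ops.
        if x < 10 or y < 10:
            return x * y
        n = max(x.bit_length(), y.bit_length()) // 2
        hi_x, lo_x = x >> n, x & ((1 << n) - 1)
        hi_y, lo_y = y >> n, y & ((1 << n) - 1)
        a = karatsuba(hi_x, hi_y)
        b = karatsuba(lo_x, lo_y)
        c = karatsuba(hi_x + lo_x, hi_y + lo_y) - a - b
        return (a << (2 * n)) + (c << n) + b

    assert karatsuba(12345678, 87654321) == 12345678 * 87654321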

reply
deleted
reply
Multiplication has some properties like being cumulative. If we assume the sequence has any specific properties then we no longer have a general sequence model.
reply
I think you meant commutative.

Attention also has some specific properties.

And sometimes results are just unexpected. Did you know that anything a Turing machine can do in t time steps, a different Turing machine can do in O(sqrt(t log t)) memory cells? https://news.ycombinator.com/item?id=44055347

reply
Doesn't that have to do with how many bits you allow in the actual calculation in physical reality?
reply
Well, for multiplication, complexity is defined in terms of the number of digits/bits directly. For attention, complexity is defined in terms of the number of input vectors, which are all at fixed precision. I don't understand what happens to the method proposed in the paper at higher precision (since I don't understand the paper), but in practice it doesn't matter, since there is no value in anything over float16 for machine learning.
reply
Your argument just assumes there is no latent structure that can be exploited. That's a big assumption.
reply
It's a necessary assumption for the universal approximation property; if you assume some structure then your LLM can no longer solve problems that don't fit into that structure as effectively.
reply
Neural nets are structured as matrix multiplication, yet they are universal approximators.
reply
You're missing the non-linear activations.
reply
But language does have structure, as does logic and reasoning. Universal approximation is great when you don't know the structure and want to brute-force a search for an approximate solution. That's not optimal by any stretch of the imagination though.
reply
That argument could also be used to say that the FFT's time complexity of O(n log n) should be impossible.
reply
It's like claims of room-temperature superconductors or Millennium Prize solutions. Earth-shattering if true. It'd be such a black swan. Terrible for Nvidia.
reply
Well, we solved one of the Millennium Prize problems (honestly kinda quickly) so maybe there's hope :)
reply