> I can't help but think that there's got to be a better mechanism

What matters is not how good it is in isolation, but how well it scales to giant datasets and supercomputers. So far, attention scales the best. It's the most "brute force"-able mechanism.
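
To make the "brute force"-able point concrete, here's a rough sketch of scaled dot-product attention in plain Python. It's my own illustration, not taken from any library, and the helper names (`matmul`, `softmax`, `attention`) are made up for this example. The whole thing is two dense matrix products with a row-wise softmax in between, which is the kind of work that parallel hardware handles well.

```python
import math

def matmul(a, b):
    """Plain dense matrix product of two lists-of-lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = len(q[0])                                # key/query dimension
    k_t = [list(col) for col in zip(*k)]         # transpose K
    scores = matmul(q, k_t)                      # every query against every key
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)                    # weighted average of value rows

# Toy usage: two tokens, dimension 2 (values chosen arbitrarily).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = attention(q, k, v)
```

Nothing in there is clever: every query is compared against every key, so the cost grows quadratically with sequence length, but it's all dumb, regular arithmetic.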

It kinda reminds me of general relativity and gravity bending space-time. I'm sure I sound nuts right now, but the model fits in my head.
reply
> It's always the trade-off of a smart, complex operation against an absolute crapload of dumb ones.

You can't make attention more specialized without making it less general, which makes LLMs worse as universal approximators.
