What matters is not how good it is in isolation, but how well it scales to giant datasets and supercomputers. So far attention scales the best. It's the most "brute force"-able mechanism
You can't make attention more specialized without making it less general, which makes LLMs worse as a universal approximator.