The models now waste a vast number of useless neurons memorising the character counts of the entire English language, just so that people can ask how many r's are in strawberry and tick a box on a benchmark.
The architecture cannot efficiently or consistently represent counting letters in words. We should never have force-trained them to do it.
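For a concrete look at why, here's a minimal sketch (assuming the tiktoken package and its cl100k_base encoding; any BPE tokenizer shows the same thing). The model never sees individual characters, only opaque token IDs, so letter counts have to be memorised rather than read off the input:

```python
# Show that a tokenizer hands the model multi-character chunks,
# not letters -- so counting r's means memorising, not reading.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4-era BPE encoding
ids = enc.encode("strawberry")
pieces = [enc.decode([i]) for i in ids]

print(ids)     # a handful of integer token IDs
print(pieces)  # multi-letter fragments, e.g. ['str', 'aw', 'berry']
```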
This goes for other, more important "skills" that are unsuited to transformer models.
Most models can now do decent arithmetic. But if you knew how that ability is encoded in their neurons, you would never ever trust any arithmetic they output, even when they seem to "know" it (unless they called a calculator MCP to get it).
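As a sketch of what "a different tool" looks like in practice, here's a deterministic calculator of the kind you'd expose to a model as a tool. The function name and supported operations are my own illustration, not any particular MCP server's API:

```python
import ast
import operator

# Minimal safe arithmetic evaluator: the kind of deterministic
# "calculator" tool an LLM should call instead of doing math in weights.
OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.Pow: operator.pow, ast.USub: operator.neg,
}

def calc(expr: str) -> float:
    def ev(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expr, mode="eval").body)

print(calc("12345 * 6789"))  # 83810205 -- exact, every time
```

Thirty lines of stdlib gets you answers that are exact every time, which no amount of scaling the weights guarantees.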
There are fundamental limitations, but we're currently brute-forcing our way through problems we could trivially solve with a different tool.
Are you only using frontier models gated behind the OpenAI/Anthropic/Google APIs? Those use tools to help them out behind the scenes. That remains no less impressive, but I think we should be clear about it.
Some limitations haven't been rigorously demonstrated to be fundamental, but they have been continuously present since the earliest LLMs. Shouldn't the burden of proof be on those who claim it can be done?
And some limitations are fundamental, and have been rigorously demonstrated, e.g.:
People hold an <opinion> which hasn't been rigorously proven, while <not rigorously proven counter-opinion>.
As such, I am not sure what you're trying to achieve here.
You can try this out locally with any mid-sized current-gen LLM. You’ll find that it can spell out most atomic tokens from its input just fine. It simply learned to do so.
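A minimal way to run that check, assuming a local model served through Hugging Face transformers (the model name here is just an example; any small instruct model will do):

```python
# Ask a small local model to spell a word letter by letter.
from transformers import pipeline

chat = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
messages = [{"role": "user",
             "content": "Spell the word 'strawberry' one letter at a time."}]
out = chat(messages, max_new_tokens=64)
# Recent transformers versions return the whole chat; the last
# message is the model's reply.
print(out[0]["generated_text"][-1]["content"])
```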
We have reduced hallucinations significantly, and yet it seems clear that they are inherent to the technology and so will always exist to some extent.
There are also limitations due to maths and/or physics that aren't fixable under any design. Outside science fiction, there is no technology whose limitations are all fixable.
Here's one: https://arxiv.org/abs/2401.11817