As for chess, although an LLM knows the rules, it is not expected to have been trained on many optimal chess games. As such, it's not fair to gauge its skill at chess, especially without even showing it rendered images of its candidate moves. Even if the representational and training limitations were addressed, we know that LLMs are architecturally crippled in that they have no neural memory beyond their context. Imagine a next-gen LLM that, when presented with a chess puzzle, would first update its internal weights for optimal play by simulating a billion games, and only then return to the puzzle you gave it. Even with the current architecture, it could equivalently create a fork of itself for the same purpose, in effect a newly trained model, but the impatient human's desire for an immediate answer gets in the way.
If anything, I see a move toward more vertical, specialized software that uses LLMs at its core, with plenty of supporting tooling and technology around them to really make the most of them.
Why do these distinctions matter?
Is it an LLM, or symbolic, or a combo, or a dozen technologies stitched together? Who cares. It is all automation. It is all artificial.