It’s no different from someone testing a calculator with 2+2. If it gets that wrong, there’s a hardware issue. That doesn’t mean the only purpose of the calculator is to calculate 2+2 — the test exists for debugging.
You could just as uncharitably complain that “these days no one does arithmetic anymore, they use a calculator for 2+2”.
The LLM that malfunctioned was there to slap categories on things, and something was going wrong in either the hardware or the compiler.
I don't get the snark about LLMs overall in this context. The author uses an LLM to help write their code, but is also clearly competent enough to dig in and determine why things don't work when the LLM fails, and ran an LLM-out-of-the-loop debugging session once they decided it wasn't trustworthy. What else could you do in this situation?