I understand that moving the goalposts every release is unfair, but it's just as concerning that people were letting GPT 4.X vibe code and ship entire products.
Anyone trying to solve similar problems will get similar results if the "silent failure" shows up consistently in aggregate. So this is the model's actual performance.
Stockfish does use neural nets, but they are tiny, on the order of 10M params. Frontier LLMs are probably 100k to 1M times larger than that.