> you still see improvements
This is expected if they are training their models on it, right?
> objectively-bad results
Keen to learn when this has been the case, i.e. across version increments in major models.
I've been enjoying seeing how the quality of individual models differ based on the amount of reasoning effort you give them. If they were baking an a good pelican you wouldn't expect them to differ so much.
(Google Gemini are the only lab that have very clearly paid attention to the quality of SVG animals-riding-vehicles, see their announcement for Gemini 3.1: https://twitter.com/JeffDean/status/2024525132266688757 )