Collecting a bunch of "Hard questions for LLMs" in one place will invariably run into Goodhart's law (when a measure becomes a target, it ceases to be a good measure). You'll have no idea if the next round of LLMs is better because they're generally smarter, or because they were trained specifically on these questions.
How many examples does OpenAI train on now that are just variants of counting the Rs in strawberry?
I guess they have a bunch of different wine-glass images in their training set now, since the full-to-the-brim glass was a meme, but they still completely fail to draw an open book with the cover side up.
Well, that's easy: zero.
Because even a single training example would have 'solved' it, by letting the model memorize the simple, easy answer, within weeks of 'strawberry' first going viral, which was about a year and a half ago at this point, with dozens of minor and major model upgrades since. And yet the strawberry example kept working for most (all?) of that time.
So you can tell that if anything, OA probably put in extra work to filter all those variants out of the training data...
(This is, by the way, why you can't believe any LLM paper about 'forecasting' where they just do backtesting rather than actually holding out future events. There are way too many forms of leakage at this point. That logic may have worked for davinci-001 and davinci-002, or for a model whose checkpoints you downloaded yourself, but not for any of the big APIs like GPT or Claude or Gemini...)
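Concretely, the only clean protocol is a temporal holdout against the model's training cutoff. A minimal sketch of the split (the event data and cutoff date here are hypothetical, just to illustrate the idea):

```python
from datetime import date

# Hypothetical events: (resolution_date, question, outcome).
events = [
    (date(2022, 11, 1), "Will X happen by Nov 2022?", True),
    (date(2024, 6, 15), "Will Y happen by Jun 2024?", False),
]

TRAINING_CUTOFF = date(2023, 4, 1)  # assumed cutoff of the model under test

# Anything resolved before the cutoff may already be in the training data
# (leakage); only events resolved *after* it are a genuine forecasting test.
backtest = [e for e in events if e[0] <= TRAINING_CUTOFF]
holdout = [e for e in events if e[0] > TRAINING_CUTOFF]
```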
Because the word gets tokenised, the model never sees individual letters, so of course it can't just count the Rs.
But I suppose if we want these models to be capable of anything, these tokenisation quirks need to be accounted for.
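You can see the effect directly with OpenAI's tiktoken library (a quick sketch; the exact split depends on the encoding you pick):

```python
import tiktoken

# Encode "strawberry" with a GPT-style BPE encoding and show the chunks
# the model actually sees: a few multi-character tokens, not letters.
enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print([enc.decode_single_token_bytes(t) for t in tokens])
# The word arrives as opaque chunks, so "how many r's?" requires the model
# to have memorized spellings rather than read them off the input.
```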
Keeping it secret because I don't want my answers trained into a model.
Think of it this way: FizzBuzz used to be a good test to weed out candidates who couldn't program at all. It's simple enough that any first-year programmer can do it, and do it quickly. But now everybody knows to prep for FizzBuzz, so you can't be sure whether your candidate knows basic programming or just memorized a solution without understanding what it does.
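For reference, the classic version (in Python for concreteness; interviewers vary the exact spec):

```python
# FizzBuzz: print 1..100, but multiples of 3 print "Fizz",
# multiples of 5 print "Buzz", and multiples of both print "FizzBuzz".
for i in range(1, 101):
    if i % 15 == 0:
        print("FizzBuzz")
    elif i % 3 == 0:
        print("Fizz")
    elif i % 5 == 0:
        print("Buzz")
    else:
        print(i)
```

The whole point was that anyone who understands modular arithmetic writes this in two minutes, and anyone who doesn't gets visibly stuck; memorized answers destroy exactly that signal.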