Yeah I really dislike this kind of question from SimpleBench. I've suggested many improvements to some of the publicly available questions but not had a good response.

I think the correct answer to the question above should be something like, "Are you sure the question is correct? It's not clear whether John and the bald man are the same person," but of course an LLM would be marked down if it were inclined to respond this way.

At a glance, it sounds like John is the bald man? If we're treating this as a riddle, it doesn't seem incomprehensible. Whether riddles are a fair test is another question.

I think it's a horrible example, but I just got a very professional response from Gemini 2.5:

> This scenario seems like a bit of a riddle! Let's break it down:

> The Setting: John is alone in an "otherwise-empty" bathroom.

> The Action: He is looking in the mirror.

> The Observation: He sees a bulb falling towards the head of a bald man he is examining in the mirror.

> The Physics: Since he is alone and looking in the mirror, the bald man he is "closely examining" must be his own reflection.

> The Incident: Therefore, the bulb fell towards John's own head. He looked up (at the actual bulb falling towards him), failed to catch it, and it hit him.

> The Reaction: The "bald man" who cursed, yelled "what an idiot!", and left the bathroom was actually John himself, reacting to being hit on the head and possibly feeling foolish for not catching the bulb.

> Conclusion: No, John should not text an apology. The person who was hit by the bulb, got angry, and left was John himself. There is no other bald man to apologize to.

This example has a lot going on: common-sense reasoning, linguistic ambiguity (e.g. noun-phrase coreference), etc.

Just a few years ago, most folks at a computational linguistics conference would probably have said such abilities were impossible to achieve, at least within their lifetime.

I'd argue that's a pretty good test for an LLM: can it overcome the red herrings and get at the actual problem?

I think that when you've been given such a problem, the "actual problem" is with the person posing it: either they have dementia or they're taking the piss. In either case, the right response isn't to guess their intent and come up with a "solution", but to reject the problem and deal with the person.