For example, I just asked ChatGPT "The boat wash is 50 meters down the street. Should I drive, sail, or walk there to get my yacht detailed?" and it recommended walking. I'm sure with a tiny bit more effort, OpenAI could patch it to the point where it's a lot harder to confuse with this specific flavor of problem, but that wouldn't alter the overall shape of the problem.
This question is obviously ambiguous. The context here on HN supplies the intended reading: "questions LLMs are stupid about; I mention a boat wash; clearly you should take the boat to the boat wash."

But this question, posed to humans, is plenty ambiguous too: it doesn't specify whether you need to get the boat there, or whether the boat is already at the wash. ChatGPT Free Tier handles the ambiguity; note the closing remark:

"If the boat wash is 50 meters down the street…

Drive? By the time you start the engine, you’re already there.

Sail? Unless there’s a canal running down your street, that’s going to be a very short and very awkward voyage.

Walk? You’ll be there in about 40 seconds.

The obvious winner is walk — unless this is a trick question and your yacht is currently parked in your living room.

If your yacht is already in the water and the wash is dock-accessible, then you’d idle it over. But if you’re just going there to arrange detailing, definitely walk."

You can argue that the boat variant is ambiguous (a stretch, in my view), but that's not really relevant: the point was to reveal that the underlying failure mode is unchanged, just concealed now.

The original car question is not ambiguous at all, and the specific responses to it weren't concerned with ambiguity either. In some examples, the logic was borderline LLM psychosis of the kind you'd see in GPT-3.5, just papered over by the well-spoken "intelligence" of a modern SOTA model.

I don't understand what occasional hiccups prove. The models can pass college acceptance tests on advanced educational topics better than 99% of the human population, and because they occasionally have a shortcoming, they're somehow worse than humans? Those edge cases are quickly going from 1% to 0.01% too...

"any human can instantly grok the right answer."

When you ask a human about general world knowledge, they don't have the generality to give good answers for 90% of it. Even on very basic questions like this one, humans will trip up far more often than the frontier LLMs do.
