upvote
This reminds me of the old brain-teaser/joke that goes something like 'An airplane crashes on the border of X and Y; where do they bury the survivors?' The point being that this exact style of question has real examples of actual people failing to answer it correctly. We mostly learn as kids, through things like brain teasers, to avoid these linguistic traps, but that doesn't mean we don't still fall for them every once in a while.
reply
That’s less a brain teaser than a case of running into the error correction people apply to language. This is useful when you simply can’t hear someone very well or when the speaker makes a mistake, but it fails when language is intentionally misused.
reply

  > This is useful when you simply can’t hear someone very well or when the speaker makes a mistake
I have a few friends with pretty heavy accents and broken English. Even my partner, a non-native English speaker, makes frequent mistakes. It's made me much better at communicating, but it's also more work and makes miscommunication easier. I think a lot of people don't realize this also happens with variation in culture, so even among people speaking the same language. It's just that an accent serves as a flag for "pay closer attention". I suspect this is a subtle but contributing factor in miscommunication on the internet and why fights are so frequent.
reply
Yeah, but I might ask a malformed question about a domain I know nothing about and not know it was malformed. An expert would ask for clarification.
reply
I'm actually having a hard time interpreting your meaning.

Are you criticizing LLMs? Highlighting the importance of this training and why we're trained that way even as children? That it is an important part of what we call reasoning?

Or are you giving LLMs the benefit of the doubt, saying that even humans have these failure modes?[0]

Though my point is more that natural language is far more ambiguous than people give it credit for. I'm personally always surprised that a bunch of programmers don't understand why programming languages were developed in the first place. The reason they're hard to use is precisely their lack of ambiguity, at least compared to natural languages. And we can see clear trade-offs with how high-level a language is: duck typing is incredibly helpful while also being a major nuisance. It's the same reason even a technical manager often has a hard time communicating instructions. Compression of ideas isn't easy.
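
To make the duck-typing point concrete, here's a toy Python sketch of my own (not from anyone's real code) showing how the same flexibility that makes things easy also hides mistakes:

    # Duck typing is convenient: anything with the right "shape" just works...
    def total_length(items):
        # Works for strings, lists, dicts, anything with __len__.
        return sum(len(x) for x in items)

    print(total_length(["ab", "cde"]))     # 5
    print(total_length([{"a": 1}, "xy"]))  # 3 -- silently "works", probably not what you meant

    # ...and that's the ambiguity trade-off: a stricter language would have
    # forced you to say what you actually meant, at the cost of more typing.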

[0] I've never fully understood that argument. Wouldn't we call a person stupid for giving a similar answer? How does the existence of stupid people mean we can't call LLMs stupid? It's simultaneously anthropomorphising while being mechanistic.

reply
>bury the *survivors*

I did not catch that in the first pass.

I read it as the casualties, who would be buried wherever the next of kin or the will says they should be.

reply
Same thing as the old "what's heavier, a tonne of coal or a tonne of feathers?" Many, many people will say a tonne of coal...
reply
> All the people responding saying "You would never ask a human a question like this"

That's also something people seem to miss in the Turing Test thought experiment. Sure, merely deceiving someone is a thing; even the simplest chatbot can achieve that. The really interesting implications start when there's genuinely no way to tell a chatbot apart from a human.

reply
But it isn't just a brain-teaser. If the LLM is supposed to control, say, Google Maps, then Maps is the one asking "walk or drive?" via the API. So if I voice-ask the assistant to take me to the car wash, it should realize it shouldn't show me walking directions.
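
A rough sketch of what I mean, in Python. The names and parameters here are made up for illustration, not the actual Maps API, but the point is that the routing call forces a travel mode, so the assistant has to infer one from context instead of defaulting blindly:

    from typing import Literal

    # Hypothetical tool-call shape: the directions "endpoint" requires a mode.
    def get_directions(destination: str, mode: Literal["driving", "walking"]) -> str:
        return f"{mode} directions to {destination}"

    # A context-aware assistant should infer the mode from the request itself:
    # "take me to the car wash" strongly implies the user is in a car.
    request = "take me to the car wash"
    mode = "driving" if "car wash" in request else "walking"
    print(get_directions("the car wash", mode))  # driving directions to the car wash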
reply
That’s not the problem with this post.

The problem is that most LLMs answer it correctly (see the many other comments in this thread reporting this). OP cherry-picked the few that answered it incorrectly, didn't mention any that got it right, and thereby implied that 100% of them got it wrong.

reply
You can see up-thread that the same model will produce different answers for different people or even from run to run.

That seems problematic for a very basic question.

Yes, models can be harnessed with structures that run queries 100x and take the "best" answer, and we can claim that if the best answer gets it right, models therefore "can solve" the problem. But for practical end-user AI use, high error rates are a problem and greatly undermine confidence.
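
For what it's worth, here's a minimal sketch of that kind of harness, assuming a generic ask_model callable standing in for whatever client you actually use:

    from collections import Counter

    def best_of_n(ask_model, prompt: str, n: int = 100) -> str:
        # Sample the model n times and return the most common answer
        # (self-consistency style). ask_model is a stand-in, not any
        # particular provider's API.
        answers = [ask_model(prompt) for _ in range(n)]
        top_answer, _count = Counter(answers).most_common(1)[0]
        # Even when the majority answer is right, the minority of wrong runs
        # is exactly the error rate a single-query user experiences.
        return top_answer

It works for benchmarks, but a single end-user query only gets one draw from that distribution.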

reply
My understanding is that it mainly fails when you try it in speech mode, because that usually uses the fastest model. Yesterday I tried all the major providers and they were all correct when I typed my question.
reply
Naysayers will tell you that OpenAI, Google, and Anthropic all 'monkeypatched' their models (somehow!) after reading this thread and that's why they answer it correctly now.

You can even see them in this very thread. Some commenters even believe the providers add internal prompts for this specific question (as if people aren't trying to fish out ChatGPT's internal prompts 24/7, and as if there aren't open-weight models that answer this correctly).

You can never win.

reply
I recently asked an AI a chemistry question which may have an extremely obvious answer; I never studied chemistry, so I can't tell you if it did. I included as much information about the situation I found myself in as I could in the prompt. I wouldn't be surprised if the AI's response hinged on a detail that's normally important but didn't apply to my situation, just like the 50 meters.
reply
If you're curious or actually knowledgeable about chemistry, here's what happened. My apartment's dishwasher has gaps in the enamel from which rust can drip onto plates and silverware. I tried soaking what I presume to be a stainless steel knife, with a drip of rust on it, in citric acid. The rust turned black and the water turned a dark but translucent blue/purple.

I know nothing about chemistry. My smartest move was to not provide the color and ask what the color might have been. It never guessed blue or purple.

In fact, it first asked me if this was high school or graduate chemistry. That's not... and it makes me think I'll only get answers to problems that are easily graded, and therefore have only one unambiguous solution.

reply
I'm a little confused by your question myself. Stainless steel rust should be that same brown color. Though it can get very dark when dried. Blue is weird but purple isn't an uncommon description, assuming everything is still dark and there's lots of sediment.

But what's the question? Are you trying to fix it? Just determine what's rusting?

reply
Oh yeah, the question is "can I use the knife and the glass again?"

Although, now that I look closely at them, the butter knife got eaten away in spots and it's already pretty cheap, so I'll toss it.

reply
I'd answer "probably" if you've cleaned everything. But if the rust comes back then probably should just toss. Rust is an oxide layer, so outside only
reply
>People regularly ask questions that are structured poorly or have a lot of ambiguity.

The difference between someone who is really good with LLMs and someone who isn't is the same as the difference between someone who's really good at technical writing or at working with other people and someone who isn't.

Communication. Clear, concise communication.

And my parents said I would never use my English degree.

reply
This is the LLM equivalent of a riddle, e.g.: “A farmer has 17 sheep. All but 9 die. How many are left?”
reply
Exactly! The problem isn't this toy example. It's all of the more complicated cases where this same type of disconnect is happening, but the users don't have all of the context and understanding to see it.
reply
> All the people responding saying "You would never ask a human a question like this"

It would be interesting to actually ask a group of people this question. I'm pretty sure a lot of them would fail.

It feels like one of those puzzles that people often fail. E.g.: 'Ten crows are sitting on a power line. You shoot one. How many crows are left to shoot?' People often treat it as a subtraction problem and don't consider that animals flee after gunshots. (BTW, ChatGPT also answers 9.)

reply
You assumed gunshots. He could have used a bow and arrow, or a blowpipe.
reply
Other leading LLMs do answer the prompt correctly. This is just a meaningless exercise in kicking sand in OpenAI's face. (Well-deserved sand, admittedly.)
reply