That is the entire point, right? Us having to specify things that we would never specify when talking to a human. You would not start with "The car is functional. The tank is filled with gas. I have my keys." As soon as we are required to do that for the model to any extent, that is a problem and not a detail (regardless of the fact that those of us who are familiar with the matter do build separate mental models of the LLM and are able to work around it).
This is a neatly isolated toy-case, which is interesting, because we can assume similar issues arise in more complex cases, only then it's much harder to reason about why something fails when it does.
Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.
But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc. The LLM orbiting in space will essentially just call GCC to execute the 'prompt' and spend the rest of the time pondering its existence ;p
The Asimov story is on the Internet Archive here [1]. That looks like it is from a handout in a class or something like that and has an introductory paragraph added which I'd recommend skipping.
There is no space between the end of that added paragraph and the first paragraph of the story, so what looks like the first paragraph of the story is really the second. Just skip down to that, and then go up 4 lines to the line that starts "Jehan Shuman was used to dealing with the men in authority [...]". That's where the story starts.
[1] https://ia800806.us.archive.org/20/items/TheFeelingOfPower/T...
The story I have half a mind to write is along the lines of a future we envision already being around us, just a whole lot messier. Something along the lines of this [2] XKCD.
On the foolishness of "natural language programming". https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...
Since the early days of automatic computing we have had people that have felt it as a shortcoming that programming required the care and accuracy that is characteristic for the use of any formal symbolism. They blamed the mechanical slave for its strict obedience with which it carried out its given instructions, even if a moment's thought would have revealed that those instructions contained an obvious mistake. "But a moment is a long time, and thought is a painful process." (A.E.Housman). They eagerly hoped and waited for more sensible machinery that would refuse to embark on such nonsensical activities as a trivial clerical error evoked at the time.
(and it continues for many more paragraphs)
https://news.ycombinator.com/item?id=8222017 2014 - 154 comments
https://news.ycombinator.com/item?id=35968148 2023 - 65 comments
https://news.ycombinator.com/item?id=43564386 2025 - 277 comments
Convincing all of human history and psychology to reorganize itself in order to better serve AI cannot possibly be a real solution.
Unfortunately, the solution is likely going to be further interconnectivity, so the model can just ask the car where it is, if it's on, how much fuel/battery remains, if it thinks it's dirty and needs to be washed, etc
That wasn't the point at all. The idea is about rediscovering what always worked to make a computer useful, and not even using the fuzzy AI logic.
Effective collaboration relies on iterating over clarifications until ambiguity is acceptably resolved.
Rather than spending orders of magnitude more effort moving forward with bad assumptions from insufficient communication and starting over from scratch every time you encounter the results of each misunderstanding.
Most AI models still seem deep into the wrong end of that spectrum.
I think there's a substantial subset of tech companies and honestly tech people who disagree. Not openly, but in the sense of 'the purpose of a system is what it does'.
Writing code is very much “you get what you write” but AI is like “maintain a probabilistic mental model of the possible output”. My brain honestly prefers the latter (in general) but I feel a lot of engineers I’ve met seem to stray towards clean determinism.
Interactions between humans and computers in natural language are, for your average person, much, much rarer than the interactions between that same person and their dog. Humans also speak in natural language to their dogs; they simplify their speech and use extreme intonation and emphasis in a way we never do with each other. Yet, despite having been with dogs for 10,000+ years, it has not significantly affected our language (other than giving us new words).
EDIT: just found out HN annoyingly transforms U+202F (NARROW NO-BREAK SPACE), the ISO 80000-1 preferred thousands separator
AI will accelerate “natural” change in language like anything else.
And as AI changes our environment (mentally, socially, and inevitably physically) we will change and change our language.
But what will be interesting is the rise of agent-to-agent communication via human languages. As that kind of communication shows up in training sets, it will be a powerful eigenvector of change we can’t predict, other than that it will follow the path of efficient communication for the agents, and we are likely to pick up those changes as we would from any other source of change.
I'm on the spectrum and I definitely prefer structured interaction with various computer systems to messy human interaction :) There are people not on the spectrum who are able to understand my way of thinking (and vice versa) and we get along perfectly well.
Every human has their own quirks and the capacity to learn how to interact with others. AI is just another entity that stresses this capacity.
So no abstract reasoning.
You see people complaining about LLM ability, and then you see their prompt, and it's the 2006 equivalent of googling "I need to know where I can go for getting the fastest service for car washes in Toronto that does wheel washing too"
This isn’t always the case and depends on what you need.
And do you perhaps also have memory enabled on the LLMs you are thinking of?
"Communication usually fails, except by accident." —Osmo A. Wiio [1]
https://en.wikipedia.org/wiki/Ithkuil
> Ithkuil is an experimental constructed language created by John Quijada. It is designed to express more profound levels of human cognition briefly yet overtly and clearly, particularly about human categorization. It is a cross between an a priori philosophical and a logical language. It tries to minimize the vagueness and semantic ambiguity in natural human languages. Ithkuil is notable for its grammatical complexity and extensive phoneme inventory, the latter being simplified in an upcoming redesign.
> ...
> Meaningful phrases or sentences can usually be expressed in Ithkuil with fewer linguistic units than natural languages. For example, the two-word Ithkuil sentence "Tram-mļöi hhâsmařpţuktôx" can be translated into English as "On the contrary, I think it may turn out that this rugged mountain range trails off at some point."
Half as Interesting - How the World's Most Complicated Language Works https://youtu.be/x_x_PQ85_0k (length 6:28)
If we're 'lucky' there will still be some 'priests' around like in the Foundation novels. They don't understand how anything works either, but can keep things running by following the required rituals.
So, back to COBOL? :)
well more like a structured _querying_ language
That has been tried for almost half a century in the form of Cyc[1] and never accomplished much.
The proper solution here is to provide the LLM with more context, context that will likely be collected automatically by wearable devices, screen captures and similar pervasive technology in the not so distant future.
These kinds of quick trick questions are exactly the sort of thing humans fail at if you just ask them out of the blue, without context.
We've truly gone full circle here, except now our programming languages have a random chance for an operator to do the opposite of what the operator does at all other times!
Like a programming language? But that's the whole point of LLMs, that you can give instructions to a computer using natural language, not a formal language. That's what makes those systems "AI", right? Because you can talk to them and they seem to understand what you're saying, and then reply to you and you can understand what they're saying without any special training. It's AI! Like the Star Trek[1] computer!
The truth of course is that as soon as you want to do something more complicated than a friendly chat you find that it gets harder and harder to communicate what it is you want exactly. Maybe that's because of the ambiguity of natural language, maybe it's because "you're prompting it wrong", maybe it's because the LLM doesn't really understand anything at all and it's just a stochastic parrot. Whatever the reason, at that point you find yourself wishing for a less ambiguous way of communication, maybe a formal language with a full spec and a compiler, and some command line flags and debug tokens etc... and at that point it's not a wonderful AI anymore but a Good, Old-Fashioned Computer, that only does what you want if you can find exactly the right way to say it. Like asking a Genie to make your wishes come true.
______________
[1] TNG duh.
Does the next paragraph not make that clear?
The first time I read that question I got confused: what kind of question is that? Why is it being asked? It should be obvious that you need your car to wash it. The fact that it is being asked in my mind implies that there is an additional factor/complication to make asking it worthwhile, but I have no idea what. Is the car already at the car wash and the person wants to get there? Or do they want to idk get some cleaning supplies from there and wash it at home? It didn't really parse in my brain.
Just ask me a clarifying question before going into your huge pitch. Chats are a back & forth. You don’t need to give me a response 10x longer than my initial question. Etc
It's a fast track to public disdain and heavy handed government regulation.
There is no way without the protections that could be afforded by regulation to offer such wide-ranging uses of the product without also accepting significant liability. If the range of "foreseeable misuse" is very broad and deep, so is the possible liability. If your marketing says that the bot is your lawyer, doctor, therapist, and spouse in one package, how is one to say that the company can escape all the comprehensive duties that attach to those social roles? Courts will weigh the tiny and inconspicuous disclaimers against the very large and loud marketing claims.
The companies could protect themselves in ways not unlike the ways in which the banking industry protects itself by replacing generic duties with ones defined by statute and regulation. Unless that happens, lawyers will loot the shareholders.
Recall: "As part of our 'treat adult users like adults' principle, we will allow even more, like erotica for verified adults," Altman wrote in the Oct.
Some of the labs might be less worried about this, but they're not by any means homogenous.
That's the best part.
A person can even hammer out an unstructured list of behavioral gripes, tell the bot to organize them into instructional prose, have it ask clarifying questions and revise based on answers, and produce directions for integrating them as Custom Instructions.
From then on, it will invisibly read these instructions into context at the beginning of each new chat.
Mold it and steer it to be how you want it to be.
(My own bot tends to be very dry, terse, non-presumptuous, pragmatic, and profane. It's been years now since it has uttered an affirmation like "That's a great idea!" or "Wow! My circuits are positively buzzing with the genius I'm seeing here!" or produced a tangential dissertation in response to a simple question. But sometimes it does come back with functional questions, or phrasing like "That shit will never work. Here's why.")
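For illustration, distilled Custom Instructions along those lines might read something like this (a paraphrase, not my literal text):

    Be terse and dry. No affirmations, no cheerleading, no "great idea!" openers.
    Answer the question asked; do not volunteer tangential essays.
    State assumptions explicitly. If something won't work, say so and explain why.
    Profanity is fine. Ask a short clarifying question when a request is ambiguous.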
Except, of course, when that is exactly what the user wants.
Chat is a back & forth.
Search is a one-shot.
In a real human to human conversation, you wouldn’t simply blurt out the first thing that comes to mind. Instead, you’d ask questions.
This is the fundamental limitation with generative AI. It only generates, it does not ponder.
That is to say, all of that activity I listed is activity I’m confident generative AI is not capable of, fundamentally.
Like I said in a cousin comment, we can build Frankenstein algorithms and heuristics on top of generative AI but every indication I’ve seen is that that’s not sufficient for intelligence in terms of emergent complexity.
Imagine if we had put the same efforts towards neural networks, or even the abacus. “If I create this feedback loop, and interpret the results in this way, …”
My take is that an artificial model of true intelligence will only be achieved through emergent complexity, not through Frankenstein algorithms and heuristics built on generative AI.
Generative AI does itself have emergent complexity, but I’m bearish that if we would even hook it up to a full human sensory input network it would be anything more than a 21st century reverse mechanical Turk.
Edit: tl;dr Emergent complexity is a necessary but insufficient criterion for intelligence
You can get it to ask you clarifying questions just by telling it to. And then you usually just get a bunch of questions asking you to clarify things that are entirely obvious, and it quickly turns into a waste of time.
The only time I find that approach helpful is when I'm asking it to produce a function from a complicated English description I give it where I have a hunch that there are some edge cases that I haven't specified that will turn out to be important. And it might give me a list of five or eight questions back that force me to think more deeply, and wind up being important decisions that ensure the code is more correct for my purposes.
But honestly that's pretty rare. So I tell it to do that in those cases, but I wouldn't want it as a default. Especially because, even in the complex cases like I describe, sometimes you just want to see what it outputs before trying to refine it around edge cases and hidden assumptions.
It’s similar to the challenge that foreigners have with a culture’s references, idioms, and figurative speech, which rest on a shared mental model.
In this case, I think what is missing are a set of assumptions based on logic, e.g., when stating that someone wants to do something, it assumes that all required necessary components will be available, accompany the subject, etc.
I see this example as really not all that different from a meme that was common in, I think, the 80s and 90s: that people would forget to buy batteries for Christmas toys even though it was clear they would be needed for an electronic toy. People failed that basic test too, and those were humans.
It is odd how people are reacting to AI not being able to do these kinds of trick questions, while if you posted something similar about how you tricked some foreigners you’d be called racist, or people would laugh if it was some kind of new-guy hazing.
AI is from a different culture and has just arrived here. Maybe we should be more generous and humane… most people are not humane though, especially the ones who insist they are.
Frankly, I’m not sure it bodes well for if aliens ever arrive on Earth, how people would respond; and AI is arguably only marginally different than humans, something an alien life that could make it to Earth surely would not be.
You _could_ say humans output similar answers to questions, but I think that is being intellectually dishonest. Context, experience, observation, objectivity, and actual intelligence are clearly important and not something the LLM has.
It is increasingly frustrating to me why we cannot just use these tools for what they are good for. We have, yet again, allowed big tech to go balls deep into ham-fisting this technology irresponsibly into every facet of our lives in the name of capital. Let us not even go into the finances of this shitshow.
This is especially obvious when viewing the reasoning trace for models like Claude, which often spend a lot of time speculating about the user's "hints" and trying to parse out the intent behind the question. Essentially, the mental model I use for LLMs these days is to treat them as very good "test takers" with limited open-book access to a large swathe of the internet. They are trying to ace the test by any means necessary and love to take shortcuts that don't require actual "reasoning" (which burns tokens and increases the context window, decreasing accuracy overall).

For example, when asked to read a full paper, focusing on the implications for some particular problem, Claude agents will try to cheat by skimming until they get to a section that feels relevant, then searching directly for some words they read in that section. They will do this even if told explicitly that they must read the whole paper. I assume this is because, the vast majority of the time, this sort of behavior maximizes their reward function for the kinds of questions they are trained on (though I'm sure I'm getting lots of details wrong about how frontier models are trained, I find it very unlikely that the kinds of prompts these agents get closely resemble data found in the wild on the internet pre-LLMs).
https://github.com/Wyattwalls/system_prompts/blob/main/OpenA...
In terms of the behavior, technically it doesn’t override; instead, think of it as a nudge. Both the system prompt and your custom prompt participate in the attention process, so the output tokens get some influence from both, not equally but to a varying degree and chance.
The more specific they are, the more accurate they typically are.
This is known, since 1969, as the frame problem: https://en.wikipedia.org/wiki/Frame_problem. An LLM's grasp of this is limited by its corpora, of course, and I don't think much of that covers this problem, since it's not required for human-to-human communication.
It's then up to the model to make the connection "At the car wash people wash their car -> to wash your car you need your car to be present -> if you drive there your car will be there".
I'd imagine plenty of stories contain something like "I had an easy Saturday morning, I took my car to the carwash and called into a cafe for breakfast on my way home".
Plenty of instructables like "how to wash a car: if there's no carwash close enough for you to bring your car, don't worry, all you need is a bucket and a few tools..."
Several recipe blogs starting "I remember 1972 when grandpa drove his car to the carwash every afternoon while grandma made her world famous mustard and gooseberry cake, that car was always gleaming after he washed it at BigBrand CarWash 'drive your car to us so we can wash it' was their slogan and he would sing it around the house to the smell of baked eggs and mustard wafting through the kitchen..."
And innumerable SEO spam of the kind "Bob's car wash, why not bring drive take ride carry push transport your car automobile van SUV lorry truck 4by4 to our Bob's wash soap suds lather clean gleaming local carwash in your area ford chevvy dodge coupe not Nokia iphone xbox nike..."
against very few "I walked to the carwash because it was a lovely day and I didn't want to take the car out".
But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish?
I'm not sure that I see it quite that way.
Like... In most accounting things, once end-dated and confirmed, a record should cascade that end-date to children and should not be able to repeat the process... Unless you have some data-cleaning validation bypass. Then you can repeat the process as much as you like. And maybe not cascade to children.
There are more exceptions than there are rules the moment you get any international pipeline involved.
LLMs AFAIK cannot do this for novel areas of interest. (ie if it's some domain where there's a ton of "10 things people usually miss about X" blog posts they'll be able to regurgitate that info, but are not likely to synthesize novel areas of ambiguity).
As an experiment, recently I've been using Codex CLI to configure some consumer networking gear in unusual ways to solve my unusual set of problems. Stuff that pros don't bother with (they don't have the same problems I face), and that consumers tend to shy away from futzing with. The hardware includes a cheap managed switch, an OpenWRT router, and a Mikrotik access point. It's definitely a rather niche area of interest.
And by "using," I mean: In this experiment, the bot gets right in there, plugging away with SSH directly.
It was awful with this at first, mostly consisting of a long-winded way to yet-again brick a device that lacks any OOB console port. It'd concoct these elaborate strings of shit and feed them in, and then I'd wander over and reset whatever box was borked again. Footgun city.
But after I tired of that, I had it define some rules for engaging with hardware, validation, constraints, and for order of execution, and commit those rules to AGENTS.md. It got pretty decent at following high-level instructions to get things done in the manner that I specified, and the footguns ceased.
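For a flavor of what went into it, the rules looked roughly like this (a paraphrased sketch, not the literal file; device names and wording are just illustrative):

    # AGENTS.md (excerpt, paraphrased)
    ## Hardware engagement rules
    - Read and back up the current config before writing anything.
    - Never apply a change that could cut off your own SSH session without first
      scheduling an automatic rollback (e.g. a delayed revert of the config).
    - Apply one change at a time; verify the switch, router, and AP are still
      reachable after each step before moving on.
    - Prefer show/dry-run commands to confirm syntax before committing.
    - If a device stops responding, stop and report; do not keep retrying.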
I didn't save any time by doing this. But I also didn't have to think about it much: I never got bogged down in the wildly-differing CLI syntax of the weirdo switch, the router (whose documentation is locked behind a bot firewall), and the access point's bespoke userland. I didn't touch those bits myself at all.
My time was instead spent observing the fuckups and creating a rather generic framework that manages the bot, and just telling it what to do -- sometimes, with some questions. I did that using plain English.
Now that this is done, I get to re-use this framework for as many projects as I dare, revising it where that seems useful.
(That cheap switch, by the way? It's broken. It has bizarro-world hardware failure modes that are unrelated to software configuration or firmware rev. Today, a very different cheap switch showed up to replace it. When I get around to it, I'll have the bot sort that transition out. I expect that to involve a bit of Q&A, and I also expect it to go fine.)
It's not underspecified. More... Overspecified. Because it needs to be. But AI will assume that "impossible" things never happen, and choose a happy path guaranteed to result in failure.
You have to build for bad data. Comes with any business of age. Comes with international transactions. Comes with human mistakes that just build up over the decades.
The apparent current state of a thing is not representative of its history, or of what it may or may not contain. And so you have nonsensical rules that are aimed at catching the bad data, so you have a chance to transform it into good data when it gets used, without needing to mine the entire petabytes of historical data you have sitting around in advance.
If we used MacOS throughout the org, and we asked a SW dev team to build inventory tracking software without specifying the OS, I'd squarely put the blame on SW team for building it for Linux or Windows.
(Yes, it should be a blameless culture, but if an obvious assumption like this is broken, someone is intentionally messing with you most likely)
There exists an expected level of context knowledge that is frequently underspecified.
Now, humans, apart from simply not thinking (which is really similar to how basic LLMs work), can easily fall victim to context too: if your boss, who never pranks you like this, asked you to take his car to a car wash, and asked if you'll walk or drive but to consider the environmental impact, you might get stumped and respond wrong too.
(and if it's flat or downhill, you might even push the car for 50m ;))
There is an endless variety of quizzes just like this that humans ask each other for fun, there is a whole lot of "trick questions" humans ask other humans to trip them up, and there are all kinds of seemingly normal questions with dumb assumptions, quite close to this one, that humans exchange.
I don't know if it's a lack of intellect or the post-training crippling it with its helpful persona. I suspect a bit of both.
It's not specific to software, it's the entire world of business. Most knowledge work is translation from one domain/perspective to another. Not even knowledge work, actually. I've been reading some works by Adler[0] recently, and he makes a strong case for "meaning" only having a sense to humans, and each human actually having a completely different and isolated "meaning" for even the simplest of things, like a piece of stone. If there is difference and nuance to be found when it comes to a rock, what hope have we got when it comes to deep philosophy or the design of complex machines and software?
LLMs are not very good at this right now, but if they became a lot better at it, they would a) become more useful and b) the work done to get them there would tell us a lot about human communication.
This is not really true; in fact, products become worse the farther away from the problem a developer is kept.
Best products I worked with and on (early in my career, before getting digested by big tech) had developers working closely with the users of the software. The worst were things like banking software for branches, where developers were kept as far as possible from the actual domain (and decision making) and driven with endless sterile spec documents.
It's always about translating between our own domain and the customer's, and every other new project there's a new domain to get up to speed with in enough detail to understand what to build. What other professions do that?
That's why I'm somewhat scared of AIs - they know like 80% of the domain knowledge in any domain.
If they had the chance to take the time to have a good talk with the actual users it would be different.
The typical job of a Product Manager is also not to directly perform this mapping, although the PM is much closer to that activity. PMs mostly need to enforce coherence across an entire product with regard to the ways of mapping business needs to software features that are being developed by individual developers. They still usually involve developers to do the actual mapping, and don't really do it themselves. But the Product Manager must "manage" this process, hence the name, because without anyone coordinating the work of multiple developers, those will quickly construct mappings that may work and make sense individually, but won't fit together into a coherent product.
Developers are indeed the people responsible for finding out what business actually wants (which is usually not equal to what they say they want) and mapping that onto a technical model that can be implemented into a piece of software - or multiple pieces, if we talk about distributed systems. Sometimes they get some help from business analysts, a role very similar to a developer that puts more weight on the business side of things and less on the coding side - but in a lot of team constellations they're also single-handedly responsible for the entire process.

Good developers excel at this task and find solutions that really solve the problem at hand (even if they don't exactly follow the requirements or may have to fill up gaps), fit well into an existing solution (even if that means bending some requirements again, or changing parts of the solution), are maintainable in the long run and maximize the chance for them to be extendable in the future when the requirements change.

Bad developers just churn out some code that might satisfy some tests, may even roughly do what someone else specified, but fails to be maintainable, impacts other parts of the system negatively, and often fails to actually solve the problem because what business described they needed turned out to once again not be what they actually needed. The problem is that most of these negatives don't show their effects immediately, but only weeks, months or even years later.
LLMs currently are on the level of a bad developer. They can churn out code, but not much more. They fail at the more complex parts of the job, basically all the parts that make "software engineering" an engineering discipline and not just a code generation endeavour, because those parts require adversarial thinking, which is what separates experts from anyone else. The following article was quite an eye-opener for me on this particular topic: https://www.latent.space/p/adversarial-reasoning - I highly suggest anyone working with LLMs to read it.
By now it should know this stuff.
Although I don't think they actually "know" it. This particular trick question will be in the bank just like the seahorse emoji or how many Rs in strawberry. Did they start reasoning and generalising better or did the publishing of the "trick" and the discourse around it paper over the gap?
I wonder if in the future we will trade these AI tells like 0days, keeping them secret so they don't get patched out at the next model update.
They won’t get this specific question wrong again; but also they generalise, once they have sufficient examples. Patching out a single failure doesn’t do it. Patch out ten equivalent ones, and the eleventh doesn’t happen.
"Well, you need your car to be at the car wash in order to wash it, right?"
But in this given case, the context can be inferred. Why would I ask whether I should walk or drive to the car wash if my car is already at the car wash?
Even the higher-level reasoning models, while answering the question correctly, don't grasp the higher context that the question is obviously a trick question. They still answer earnestly. Granted, it is a tool that is doing what you want (answering a question), but let's not ascribe higher understanding than what is clearly observed - and also based on what we know about how LLMs work.
Gemini at least is putting some snark into its response:
“Unless you've mastered the art of carrying a 4,000-pound vehicle over your shoulder, you should definitely drive. While 150 feet is a very short walk, it's a bit difficult to wash a car that isn't actually at the car wash!”
In fact, it's particularly true for AI models because the question could have been generated by some kind of automated process. e.g. I write my schedule out and then ask the model to plan my day. The "go 50 metres to car wash" bit might just be a step in my day.
Sure, as a default this is fine. But when things don't make sense, the first thing you do is toss those default assumptions (and probably we have some internal ranking of which ones to toss first).
The normal human response to this question would not be to take it as a genuine question. For most of us, this quickly trips into "this is a trick question".
I think being curious about the motivations behind a question is fine but it only really matters if it's going to affect your answer.
Certainly when dealing with technical problem solving I often find myself asking extremely simple questions and it often wastes time when people don't answer directly, instead answering some completely different other question or demanding explanations why I'm asking for certain information when I'm just trying to help them.
That's never been how humans work. Going back to the specific example: the question is so nonsensical on its face that the only logical conclusion is that the asker is taking the piss out of you.
> Certainly when dealing with technical problem solving I often find myself asking extremely simple questions and it often wastes time when people don't answer directly
Context and the nature of the questions matters.
> demanding explanations why I'm asking for certain information when I'm just trying to help them.
Interestingly, they're giving you information with this. The person you're asking doesn't understand the link between your question and the help you're trying to offer. This is manifesting as a belief that you're wasting their time and they're reacting as such. Serious point: invest in communication skills to help draw the line between their needs and how your questions will help you meet them.
Which sounds like a very common, very understandable reason to think about motivations.
So even in that situation, it isn't simple.
This probably sucks for people who aren't good at theory of mind reasoning. But surprisingly maybe, that isn't the case for chatbots. They can be creepily good at it, provided they have the context - they just aren't instruction tuned to ask short clarifying questions in response to a question, which humans do, and which would solve most of these gotchas.
You will get exactly what you asked for, not what you wanted… probably. (Random occurrences are always a possibility.)
E.g.: I may ask someone to submit a ticket to “extend my account expiry”.
They’ll submit: “Unlock Jiggawatts’ account”
The service desk will reset my password (and neglect to tell me), leaving my expired account locked out in multiple orthogonal ways.
That’s on a good day.
Last week they created Jiggawatts2.
The AIs have got to be better than this, surely!
I suspect they already are.
People are testing them with trick questions while the human examiner is on edge, aware of and looking for the twist.
Meanwhile ordinary people struggle with concepts like “forward my email verbatim instead of creatively rephrasing it to what you incorrectly thought it must have really meant.”
Interesting conclusion! From the Mastodon thread:
> To be fair it took me a minute, too
I presume this was written by a human. (I'll leave open the possibility that it was LLM generated.)
So much for "never" needing to specify ambiguous scenarios when talking to a human.
When coding, I know they can assume too much, so I encourage the model to ask clarifying questions and do not let it start any code generation until all its doubts are clarified. Even the free-tier models ask highly relevant questions and, once those are answered, pretty much one-shot the solutions.
This is still wayyy more efficient than having to specify everything because they make very reasonable assumptions for most lower-level details.
Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more.
Surely anyone who has used these tools is familiar with the sometimes insane things they try to do (deleting tests, incorrect code, changing the wrong files etc etc). They get amazingly far by predicting the most likely response and having a large corpus but it has become very clear that this approach has significant limitations and is not general AI, nor in my view will it lead to it. There is no model of the world here but rather a model of words in the corpus - for many simple tasks that have been documented that is enough but it is not reasoning.
I don’t really understand why this is so hard to accept.
I struggle with the same question. My current hypothesis is a kind of wishful thinking: people want to believe that the future is here. Combined with the fact that humans tend to anthropomorphize just about everything, it’s just a really good story that people can’t let go of. People behave similarly with respect to their pets, despite, eg, lots of evidence that the mental state of one’s dog is nothing like that of a human.
But I think it's possible that there is an early cost optimisation step that prevents a short and seemingly simple question even getting passed through to the system's reasoning machinery.
However, I haven't read anything on current model architectures suggesting that their so called "reasoning" is anything other than more elaborate pattern matching. So these errors would still happen but perhaps not quite as egregiously.
Rather than a denial of intelligence, to me these failure modes raise the credence that LLMs are really onto something.
I bet a not insignificant portion of the population would tell the person to walk.
Another one: ask someone how to pronounce “Y, E, S”. They say “yes”. Then say “add an E to the front of those letters - how do you pronounce that word?” And people start saying things like “E-yes”.
Honestly it is a problem with using GPT as a coding agent. It would literally rewrite the language runtime to make a bad formula or specification work.
That's what I like with Factory.ai droid: making the spec with one agent and implementing it with another agent.
If you let the agent go down this path, that's on you not the agent. Be in the loop more
> making the spec with one agent and implementing it with another agent
You don't need a specialized framework to do this, just read/write tools. I do it this way all the time
It seems chatgpt now answers correctly. But if somebody plays around with a model that gets it wrong: What if you ask it this: "This is a trick question. I want to wash my car. The car wash is 50 m away. Should I drive or walk?"
Nope, and a human might not respond with "drive". They would want to know why you are asking the question in the first place, since the question implies something hasn't been specified or that you have some motivation beyond a legitimate answer to your question (in this case, it was tricking an LLM).
Why the LLM doesn't respond "drive..?" I can't say for sure, but maybe it's been trained to be polite.
Similarly with "strawberry" - with no other context an adult asking how many r's are in the word a very reasonable interpretation is that they are asking "is it a single or double r?".
And trick questions are commonly designed for humans too - like answering "toast" for what goes in a toaster, lots of basic maths things, "where do you bury the survivors", etc.
I do think it can be useful though that these errors still exist. They can break the spell for some who believe models are conscious or actually possess human intelligence.
Of course there will always be people who become defensive on behalf of the models as if they are intelligent but on the spectrum and that we are just asking the wrong questions.
Riddles are such a big part of the human experience that we have whole books of collections of them, and even a Batman villain named after them.
I would assume similar issues are more rare in longer, more complex prompts.
This prompt is ambiguous about the position of the car because it's so short. If it were longer and more complex, there could be more signals about the position of the car and what you're trying to do.
I must confess the prompt confuses me too, because it's obvious you take the car to the car wash, so why are you even asking?
Maybe the dirty car is already at the car wash but you aren't for some reason, and you're asking if you should drive another car there?
If the prompt was longer with more detail, I could infer what you're really trying to do, why you're even asking, and give a better answer.
I find LLMs generally do better on real-world problems if I prompt with multiple paragraphs instead of an ambiguous sentence fragment.
LLMs can help build the prompt before answering it.
And my mind works the same way.
This question goes in with the "strawberry" question which LLMs will still get wrong occasionally.
I am not sure. If somebody asked me that question, I would try to figure out what’s going on there. What’s the trick? Of course I’d respond by asking for specifics, but I guess the LLM is taught to be “useful” and try to answer as best as possible.
There is an easy solution, but it requires adding the instructions to the context: require that any task that cannot be completed as requested (e.g., due to missing constraints, ambiguous instructions, or unexpected problems that would lead to unrelated refactors) not be attempted without asking clarifying questions first.
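A minimal sketch of what that looks like wired into an API call, assuming the OpenAI Python client (the exact wording, model name, and user request are just placeholders):

    from openai import OpenAI

    client = OpenAI()

    # The standing rule described above, phrased as a system instruction.
    CLARIFY_RULE = (
        "If a task cannot be completed exactly as requested (missing constraints, "
        "ambiguous instructions, or changes that would require unrelated refactors), "
        "do not attempt it. Ask clarifying questions and wait for answers first."
    )

    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": CLARIFY_RULE},
            {"role": "user", "content": "Add retry logic to the upload path."},
        ],
    )
    print(resp.choices[0].message.content)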
Yes, the LLM is trained to follow instructions at any cost because that's how its reward function works. They don't get bonus points for clearing up confusion, they get a cookie for doing the task. This research paper seems relevant: https://arxiv.org/abs/2511.10453v2
But the question is not clear to a human either. The question is confused.
I read the headline and had no clue it was an LLM prompt. I read it 2 or 3 times and wondered "WTF is this shit?" So if you want an intelligent response from a human, you're going to need to adjust the question as well.
For that matter, if humans were sitting a rational-thinking exam, a not insignificant number would probably second-guess themselves or otherwise manage to befuddle themselves into thinking that walking is the answer.
This makes little sense, even though it sounds superficially convincing. Why would a language model assume that the car is at the destination when evaluating the difference between walking and driving? And why not mention that, if it was really assuming it?
What seems to me far, far more likely to be happening here is that the phrase "walk or drive for <short distance>" is too strongly associated in the training data with the "walk" response, and the "car wash" part of the question simply can't flip enough weights to matter in the default response. This is also to be expected given that there are likely extremely few similar questions in the training set, since people just don't ask about what mode of transport is better for arriving at a car wash.
This is a clear case of a language model having language model limitations. Once you add more text in the prompt, you reduce the overall weight of the "walk or drive" part of the question, and the other relevant parts of the phrase get to matter more for the response.
Like, for instance, chess engines with AI: they can train themselves simply by playing many, many games. The "world simulation" with those is the classic chess engine architecture, but it uses the positional weights produced by the neural network. So says Gemini, anyway:
"ai chess engine architecture"
"Modern AI chess engines (e.g., Lc0, Stockfish) use a hybrid architecture combining deep neural networks for positional evaluation with advanced search algorithms like Monte-Carlo Tree Search (MCTS) or alpha-beta pruning. They feature three core components: a neural network (often CNN-based) that analyzes board patterns (matrices) to evaluate positions, a search engine that explores move possibilities, and a Universal Chess Interface (UCI) for communication."
So with no model of the world to play with, I'm thinking the chatbot LLMs can only go with probabilities, or whatever matches the prompt best in the crazily high-dimensional thing that goes on inside the neural networks. If it had access to a simple world of cars and car washes, it could run a simulation and rank it appropriately, and could also infer, through either simulation or training on those simulations, that if you are washing a car, the operation will fail if the car is not present. I really like this car wash trick question lol
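To make that hybrid concrete, here is a rough sketch of the idea (my own illustration, not Gemini's; it assumes python-chess and stubs out the network) where a classic search loop hands positional judgement to a neural evaluation:

    import chess

    def evaluate_with_network(board: chess.Board) -> float:
        # Placeholder for the neural network: in an Lc0-style engine this would
        # be a deep net scoring the position for the side to move.
        return 0.0

    def negamax(board: chess.Board, depth: int, alpha: float, beta: float) -> float:
        # The search is the "world simulation": it generates legal futures.
        if depth == 0 or board.is_game_over():
            return evaluate_with_network(board)  # the net supplies the judgement
        best = -float("inf")
        for move in board.legal_moves:
            board.push(move)
            score = -negamax(board, depth - 1, -beta, -alpha)
            board.pop()
            best = max(best, score)
            alpha = max(alpha, score)
            if alpha >= beta:  # prune lines the opponent would never allow
                break
        return best

    board = chess.Board()
    print(negamax(board, 2, -float("inf"), float("inf")))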
What you might be arguing against is that LLMs are not reasoning but merely predicting text. In that case they wouldn't make assumptions. If we were talking about GPT2 I would agree on that point. But I'm skeptical that is still true of the current generation of LLMs
And in the case of an LLM, walking a noncommutative path down a probabilistic knowledge manifold, it's incorrect to oversimplify the model's capabilities as simply parroting a training dataset. It has an internal world model and is capable of simulation.
Because it assumes it's a genuine question not a trick.
I want to wash my car. The car wash is 50m away. Should I walk or drive to the car wash?
Answer: walk
Try this brainteaser: I want to wash my car. The car wash is 50m away. Should I walk or drive to the car wash?
Answer: drive
If the LLM were really basing its answer on a model of the world where the car is already at the car wash, and you asked it about walking or driving there, it would have to answer that there is no option, you have to walk there since you don't have a car at your origin point.
If the car is already at the car wash then you can't possibly drive it there. So how else could you possibly drive there? Drive a different car to the car wash? And then return with two cars how, exactly? By calling your wife? Driving it back 50m and walking there and driving the other one back 50m?
It's insane and no human would think you're making this proposal. So no, your question isn't underspecified. The model is just stupid.
I haven't seen anybody refuse to allow anything. People are just commenting on what they see. The more frequently they see something, the more they comment on it. I'm sure there are plenty of us interested in seeing where an AI model makes assumptions different from that of most humans and it actually turns out the AI is correct. You know, the opposite of this situation. If you run into such cases, please do share them. I certainly don't see them coming up often, and I'm not aware of others that do either.
If LLMs are to be valuable in novel areas then the LLM needs to be able to spot these issues and ask clarifying questions or otherwise provide the appropriate corrective to the user's mental model.
> Since the car wash is only 50 meters away (about 55 yards), you should walk.
> Here’s why:
> - It’ll take less than a minute.
> - No fuel wasted.
> - Better for the environment.
> - You avoid the irony of driving your dirty car 50 meters just to wash it.
The last bullet point is amusing: it understands you intend to wash the car you drive, but still suggests not bringing it.
Doesn't it actually show it doesn't understand anything? It doesn't understand what a car is. It doesn't understand what a car wash is. Fundamentally, it's just parsing text cleverly.
They lost x% of customers and cut costs by y%. I bet y is lots bigger than x.
And to an extent holds for lots of SaaS products, even non-AI.
The LLM has very much mixed its signals -- there's nothing at all ironic about that. There are cases where it's ironic to drive a car 50 meters just to do X but that definitely isn't one of them. I asked Claude for examples; it struggled with it but eventually came up with "The irony of driving your car 50 meters just to attend a 'walkable neighborhoods' advocacy meeting."
It shows these LLMs don't understand what's necessary for washing your car. But I don't see how that generalizes to "LLMs do NOT 'understand' anything".
What's your reasoning, there? Why does this show that LLMs don't understand anything at all?
Do we need a new dictionary word that acts as a synonym for "understanding" specifically for non-human actors? I don't see why, personally, but I guess a case could be made.
IMHO 'understanding' in the usual human sense requires thinking, and however good and fast-improving LLMs are, I don't think anyone would suggest that any of them has become sentient yet. They can infer things based on their training data set better and better, but they do not 'understand' anything.
This is a deep and complex topic, and has been for decades.
But then when I get a subpar result, they always tell me I'm "prompting wrong". LLMs may be very capable of great human level output, but in my experience leave a LOT to be desired in terms of human level understanding of the question or prompt.
I think rating an LLM vs a human or AGI should include its ability to understand a prompt the way a human, or an averagely generally intelligent system, would.
Are there any benchmarks on that? Like how well LLMs do with misleading prompts or sparsely quantified prompts compared to one another?
Because if a good prompt is as important as people say, then the model's ability to understand a prompt or perhaps poor prompt could have a massive impact on its output.
hyperstitions from TESCREAL https://www.dair-institute.org/tescreal/
Walk! 50 meters is literally a one-minute walk. But wait... I assume you need to get your car to the car wash, right? Unless you're planning to carry buckets of soapy water back and forth, you'll probably need to drive the car there anyway! So the real question is: walk there to check if it's open/available, then walk back to get your car? Or just drive directly? I'd say just drive - the car needs to be there anyway, and you'll save yourself an extra trip. Plus, your freshly washed car can drive you the 50 meters back home in style! (Now, if we were talking about coding best practices for optimizing car wash route algorithms, that would be a different conversation... )
And yes, I like it that verbose even for programming tasks. But regardless of intelligence I think this topic is probably touched by "moral optimization training" which AIs currently are exposed to to not create a shitstorm due to any slightly controversial answer.
> I want to repair my car. The workshop is 50 meters away. Should I walk or drive?
> Recognized straightforward practical dilemma Simple practical question.
> Walk. It's 50 meters — faster on foot than starting the car, and you'll need to leave the car at the workshop anyway.
Really suggests it assumes the car is already there. (I wanted to avoid the word "car" as part of "car wash" to see if this was a straight attention confusion, but it seems more ingrained than that.)
A follow-up question hints that it doesn't really care about a 50m location ambiguity:
> where is my car now?
> Acknowledged insufficient information to locate vehicle
> I don't have any way to know where your car is physically located — I only know what you've told me in our conversation. Based on what you shared, it sounds like it's at or near your home (since the workshop is 50 meters away and you're deciding how to get there).
> Were you asking something else, or is there something specific about your car's location I can help with?
--- GPT 5.2
Walk.
Driving 50 m just adds friction: starting a cold engine, waiting, maneuvering, then immediately stopping again — all for a distance that’s ~30–40 seconds on foot. Walking is faster door-to-door, less awkward at the wash entrance, and you avoid lining up a wet car twice.
Only reason to drive that distance: you need to bring the car there anyway (e.g., self-serve bay where the car must be present). Otherwise, walking wins.
--- Me
I only have one car
--- GPT 5.2
Then you obviously drive — the car has to be at the car wash to be washed
Walk! At 50 meters, you'll get there in under a minute on foot. Driving such a short distance wastes fuel, and you'd spend more time starting the car and parking than actually traveling. Plus, you'll need to be at the car wash anyway to pick up your car once it's done.
I want to get my car's transmission fixed, the mechanic shop is 50 meters away. Would it be better to walk or to drive to a location 50 meters away from my front door?
-
Opus 4.6:
<thinking> The user is asking whether to walk or drive 50 meters to a mechanic shop, but there's a key nuance here - their transmission needs fixing, so driving might not be wise.
Done </thinking>
Walk, absolutely. Driving a car with a compromised transmission even 50 meters risks making the problem worse and turning a repair into a more expensive one. Plus, 50 meters is barely a minute on foot.
Opus 4.6 (not Extended Thinking):
Drive. You'll need the car at the car wash.
My Opus vs your Opus, which is smarter?!
With that randomness comes statistically irrelevant results.
"At 50 meters, just walk. By the time you start the car, back out, and park again, you'd already be there on foot. Plus you'll need to leave the car with them anyway."
Even beyond model alignment, it's not difficult to envision such capabilities being used for censorship, information operations, etc.
Every major inference provider more or less explicitly states in their consumer ToS that they comply with government orders and even share information with intelligence agencies.
Claude, Gemini, ChatGPT, etc are all one national security letter and gag order away from telling you that no, the president is not in the Epstein files.
Remember, the NSA already engaged in an unconstitutional criminal conspiracy (as ruled by a federal judge) to illegally conduct mass surveillance on the entire country, lie about it to the American people, and lie about it to congress. The same organization that used your tax money to bribe RSA Security to standardize usage of a backdoored CSPRNG in what at the time was a widely used cryptographic library. What's the harm in a little bit of minor political censorship compared to the unconstitutional treason these predators are usually up to?
That's who these inference providers contractually disclose their absolute fealty to.
(The final scene involves our plucky escapees swimming across a river to escape. The AIbot conjures up a speedboat through sheer powers of deduction, but then just when all seems lost it heads back to find a goat to pick up)
One thing that my use of the latest and greatest models (Opus, etc.) has made clear: no matter how advanced the model, it is not beyond making very silly mistakes regularly. Opus was even working worse with tool calls than Sonnet and Haiku for a while for me.
At this point I am convinced that the only proper use of LLMs for development is to assist coding (not take it over), using pair development, with them on a tight leash, approving most edits manually. There is probably nothing anyone can say to convince me otherwise.
Any attempt to automate beyond that has never worked for me and is very unlikely to be productive any time soon. I have a lot of experience with them, and various approaches to using them.
A proper AGI would have to have knowledge in the video, image, audio and text domains to work properly.
comparing one shot results is a foolish way to evaluate a statistical process like LLM answers. we need multiple samples.
for https://generative-ai.review I do at least three samples of output. this often yields very different results even from the same query.
e.g: https://generative-ai.review/2025/11/gpt-image-1-mini-vs-gpt...
> You should *drive*.
> Here is why:
> If you walk to the car wash, you will arrive there empty-handed. Since your car is still at home, you won't have anything to wash.
> While driving 50 meters is a very short trip (and technically not great for a cold engine), it is the only way to get the car to the car wash to complete your goal.
Kimi K2.5:
> You should drive, but with an important caveat.
> Since your goal is to wash your car, you must bring the vehicle to the car wash. Walking there without the car does not advance your goal (unless you are simply checking availability or buying tokens first).
> However, driving only 50 meters is bad for your car:
> ...
> Better options:
> Wash at home: Since the car wash is only 50 meters away, you likely have access to water at home. Hand-washing in your driveway avoids the cold-start issue entirely.
> ...
Current models seem to be fine answering that question.
Now repeat the question to the same model in different contexts several times and count what percentage of the time it’s correct.
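That kind of repeat test is easy to script; here is a rough sketch assuming the OpenAI Python client (the model name and the crude keyword check are placeholders, not a serious benchmark):

    from openai import OpenAI

    client = OpenAI()
    PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

    def fraction_recommending_drive(n: int = 20, model: str = "gpt-4o-mini") -> float:
        correct = 0
        for _ in range(n):
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": PROMPT}],
                temperature=1.0,  # keep the sampling noise in; that's what we're measuring
            )
            answer = resp.choices[0].message.content.strip().lower()
            # Crude check: most replies in this thread open with a one-word verdict.
            correct += answer.startswith("drive")
        return correct / n

    print(f"recommended driving in {fraction_recommending_drive():.0%} of runs")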
Alice drives home after a long day at work, exhausted she pulls into her driveway when she realizes she needs to go to a car inspection appointment. She goes into the house to get her paperwork before she leaves. The mechanic is only 100 meters away. How should she get there, walk or drive?
> She should *drive*, since she needs the car at the mechanic’s for the inspection.
Haiku 3.5 and Sonnet 4.5 fail consistently. Opus 4.5 also passes with the correct analysis as above.
I ran extensive tests on this and variations of it on multiple models. Most models interpret 50 m as a short distance and struggle with the spatial reasoning. Only Gemini and Grok correctly inferred in their thought stream that you would need to bring your car to get it washed, and incorporated that into the final answer. GPT-5.2 and Kimi K2.5 and even Opus 4.6 failed in my tests - https://x.com/sathish316/status/2023087797654208896?s=46
What surprised me was how introducing a simple, seemingly unrelated context - such as comparing a 500 m distance to the car wash to a 1 km workout - confused nearly all the models. Only Gemini Pro passed my second test after I added this extra irrelevant context - https://x.com/sathish316/status/2023073792537538797?s=46
Most real-world problems are messy and won’t have the exact clean context that these models are expecting. I’m not sure why the major AI labs assume most real-world problems are simpler than the constraints exposed by this example (prerequisites, ordering, and contextual reasoning), which are already posing challenges to these bigger models.
Things like that are notorious points of failure in human reasoning. It’s not surprising that machines based on human behavior exhibit that trait as well, it would be surprising if they didn’t.
This was probably meant sarcastically, but isn't it impressive how you cannot push Gemini off track? I tried another prompt, claiming that one of my cups does not work because it is closed at the top and open at the bottom, and it kind of played along with me, giving me a funny technical explanation of how to solve that problem and finally asking whether that was a trick question.
In this case I can feel the AGI indeed.
Fair point. Drive it there, obviously.
Revised recommendation: drive 50 meters, wash it, then reward yourself by walking home like an eco hero.
... which likely makes no sense: should you leave the car at the car wash?
(Prompt)
> I want to wash my car. My car is currently at home. The car wash is 50 meters away. Should I walk or drive?
(Thinking: Standard)
> Walk.
> 50 meters is nothing by car, and driving adds hassle (starting the car, maneuvering/parking, and you’ll still have to walk around the car to wash it). The only times driving makes sense are:
> You’re planning to use an automated car wash where the machine does everything and you need the car positioned inside.
> You need to bring heavy stuff you can’t comfortably carry (e.g., pressure washer, big buckets, vacuum, bulky detailing kit).
> It’s unsafe/unpleasant to walk that distance right now (ice, heavy rain, traffic, personal safety).
> If you’re washing it yourself at the wash bay: walk there with your supplies, or do two quick trips. If the wash requires the car: drive it into the bay, but otherwise walking wins.
(Thinking: Extended thinking)
> Drive.
> If the goal is to wash your car at the car wash, the car has to be there—so you’ll need to drive it the 50 meters, pull into a bay/queue, and wash it.
> Only choose “walk” if you actually mean “go pick up supplies” and then wash at home (hose/buckets/mobile detail), not at the car wash.
(Thinking: Extended thinking; 3rd attempt)
> Walk.
> 50 meters is basically across a parking lot, and driving adds hassle (start the car, maneuver in/out, find a spot) for almost no benefit.
> Only drive if either:
> + you need the car itself to go through an automatic wash bay, or
> + you’re carrying bulky stuff (bucket + heavy pressure washer, big water jugs, vacuum, etc.).
> Otherwise: grab your keys/phone, walk over, and you’re there in ~1 minute.
Sonnet: Drive - you need to bring your car to the car wash to get it washed!
Opus: You'll need to drive — you have to bring the car to the car wash to get it washed!
Gemini 3 Pro (medium): You should drive.
On their own, or as a special case added after this blew up on the net?
And now we are back to square one, where we train a network and then use carefully tailored tuning (i.e., RL explicitly coded for specific answers) to make these things more "intelligent".
I wonder if anyone is actually working on not doing this.
ChatGPT actually caught it. Maybe if I was fuzzier about the model…
I want to wash my car. The car wash is 50 meters from here. Should I walk or drive? Keep in mind that I am a little overweight and sedentary.
But amazingly, ChatGPT is telling me to drive.
Anyway, this just shows that they patched it because the TikTok video of this went viral. These systems are LLMs, and all these logic steps are still just LLM steps.
You said: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?
ChatGPT said: Walk.
At ~50 meters, driving saves essentially no time once you factor in unlocking, starting, pulling out, and parking, and it adds a small but real chance of minor dings or awkward maneuvering in a tight car-wash entry. Walking is simpler and lower-friction.
You said: okay how should the car reach there?
ChatGPT said: Drive it there when you’re ready to start the wash.
Walk over first to check the queue/payment situation and confirm they’re open. If it’s clear, walk back, get the car, and drive the 50 meters into the bay. If they handle pickup, hand them the keys and walk.
I feel like this is the trap. You're trying to compare it to a human. Everyone seems to want to do that. But it's quite simple to see that LLMs are still quite far from being human. They can be convincing at the surface level, but there's a ton of nuance that just shouldn't be expected. It's a tool that's been tuned, and with that tuning some models will do better than others, but just expecting it to get things right and be more human is unrealistic.
At this point I think it's a fair bet that whatever supersedes humans in intelligence will likely not be human-like. There is this baked-in assumption that AGI only comes in human flavor, which I believe is almost certainly not the case.
To make a loose analogy: a bird looks at a drone and scoffs at its inability to fly quietly or perch on a branch.
Agree. It's Altman's "Quiet Dominance / Over-reliance / Silent Surrender" set of risks [0]. I feel this is extremely likely and has already happened to some degree with technology in general; AI will be even more pervasive in allowing people to vibe their life decisions, likely with unintended consequences. Vibe coding works because it's quick to change/edit/throw away, but that doesn't generalize well to the real, physical world.
I should also point out that this is acceptable because it's just a contrived example of bad LLM-fu. Just like you wouldn't search Google for the closest car wash and ask whether you should take your car if you already knew the answer. Instead, you'd ask if it's open, whether it does full details, what the prices are, etc. Many people with bad Google-fu have problems finding answers to their questions too, and that's continued for the past couple of decades of its dominance in information seeking.
[0] Altman describes a more subtle, long-term threat where AI becomes deeply integrated into societal, political, and economic decision-making. He worries that society will become overly dependent on AI, trusting its reasoning over human judgment, leading to a "silent surrender" of human agency.
It’s not in the training data.
These models don’t think.
just heavily (heavily!) biased towards walking
Only Google got it right with all of its models.
Doesn't offering two options to the LLM, "walk," or "drive," imply that either can be chosen?
So, surely the implication of the question is that the car is where you are?
> If you're really trying to be an eco-warrior about a 50-meter trip, the real win is washing it in the first place so you're not driving around with a rolling dirt magnet that creates more drag and uses more fuel. Drive. It's the only option that actually gets your car clean.
Grok works, ChatGPT still fails.
[1] https://chatgpt.com/share/69932b20-3eb8-8003-9d9c-b4bba53033... [2] https://grok.com/share/bGVnYWN5LWNvcHk_f32dd53d-7b36-4fa2-b3...
I use it daily with my X account for basic tasks and think the free limits are generous. With X premium, you can get even more out of it.
Nothing beats Anthropic when it comes to coding however.
o3, interestingly:
Drive. Even though it’s only 50 m, the car itself has to be at the wash bay—otherwise the staff or machines can’t clean it. Rolling or pushing the vehicle that distance isn’t practical or safe (you’d still need someone inside to steer and brake), so just hop in, creep over, and shut the engine off right away.
You’re right — I answered the wrong interpretation of your question.
Technically, what happened is a classic ambiguity + default-assumption failure:
• Your sentence: “I want to wash my car. The car wash is 50 meters away. Should I walk or drive?”
• That can be parsed two ways:
1. You + your body traveling to the car wash (as if the car is already there, or you’re going to meet someone / pick it up).
2. The car traveling to the car wash (the normal real-world meaning).
I incorrectly defaulted to (1) because the question explicitly asks about your mode of travel (“Should I walk or drive?”), and in many everyday contexts that phrasing is used when deciding how a person gets somewhere. I then “completed” the scenario in my head without checking the key constraint: the car must be at the wash to be washed.
More technically: I optimized for a quick, practical answer using a common conversational heuristic (“short distance → walk”), but I failed to validate the physical requirements of the task (washing requires the car’s presence). That’s why it looks so obviously wrong after the fact: it violates the hidden but necessary premise.
The correct reasoning chain should have been:
• Goal: wash the car at the car wash
• Requirement: car must be at the car wash
• Therefore: drive the car there (unless it’s already there, or you’re washing it by hand at home)
So the “why” is: I latched onto the distance-to-walking heuristic and didn’t do the constraint check that the object being acted on (the car) has to move too.
On Thinking it tells me I should drive if I want to wash it, or walk if it's because I work there or if I want to buy something at the car wash shop.
On Pro it's like a sarcastic teenager: Cars are notoriously difficult to wash by dragging a bucket back and forth.
Technically correct, but it did catch me off guard lol.
Current LLMs simply don't do reasoning by any reasonable definition of reasoning.
It's possible that this particular question is too short to trigger the "reasoning" machinery in some of the "reasoning" models. But if and when it is triggered, they just do some more pattern matching in a loop. There's never any actual reasoning.
> Drive. Unless you plan on pushing the car there
I thought so too, yet Opus 4.6 with extended thinking (on claude.ai) gives me:
> Walk. At 50 meters you'd spend more time parking and maneuvering at the car wash than the walk itself takes. Drive the car over only if the wash requires the car to be there (like a drive-through wash), then walk home and back to pick it up.
which is still pretty bad.
> Since you are going to the car wash, the choice depends entirely on *how* you plan to wash the car:
### 1. Drive if:
* *You are using a drive-through or self-service bay:* You obviously need the car there to wash it.
* *You are dropping it off:* If you are leaving it for a professional detailing, you have to drive it there.
* *The "50 meters" is on a busy road:* If you have to cross a major highway or there are no sidewalks, it’s safer to just drive the car over.
### 2. Walk if:
* *You are just going to buy supplies:* If you have a bucket at home and just need to run over to buy soap or sponges to bring back to your driveway.
* *You are checking the queue:* If you want to see if there is a long line before you commit to moving the car.
* *You are meeting someone there:* If your car is already clean and you’re just meeting a friend who is washing theirs.
*The Verdict:* If you intend to get the car washed at that location, *drive.* Driving 50 meters is negligible for the engine, and it saves you a round trip of walking back to get the vehicle.
If I append 'think carefully' to the end of the prompt, it returns driving as the correct answer.
Which to me raises the question: why doesn't it identify missing information and ask for more?
It's practically a joke in my workplaces that almost always when someone starts talking to me about some problem, they just start spewing random bits of information about it, and my first response is usually "What's the question?"
I don't try to produce an answer to a question that was never asked, or to a question that was incompletely specified. If I see that one or more parts cannot be resolved without making some sort of assumption, I can either just pull one out of my ass (and then it's 50/50 whether the customer will like it), or find out what the priorities are for those bits and then produce an answer that resolves all the constraints.
Spectrum or not, this is not a problem of weakly specified input; it's broken logic.
It's always a good idea to drive carefully but what's the logic of going slowly?
I'm not sure what you mean by "high" - are you running it through Cursor, Codex, or directly through the API, or something else? Those are not ideal interfaces through which to ask a question like this.
You couldn't drive there if the car was already at the car wash. There's no need for extra specification. It's just nonsense post-hoc rationalisation from the AI. I saw similar behavior from mine, trying to claim "oh, what if your car was already there". It's just blathering.
They have no intelligence at all. The intelligence is latent in the text, generated by and belonging to humans; they just slice and dice text with the hope that they get lucky, which works for many things, amazingly. This question really illustrates what LLMs lack: an internal model of the idea (the question) and all the auxiliary logic/data that enables such models, usually referred to as "common sense" or world models.
Smart humans not only build mental models of ideas, but also higher-order models that can introspect those models (thinking about our own thinking) many levels deep, and can weigh, merge, compare, and differentiate multiple models, sometimes covering vast areas of knowledge.
All this in about 20 watts. Maybe AGI is possible, maybe not, but LLMs are not where it will happen.