> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.

But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc. The LLM orbiting in space will essentially just call GCC to execute the 'prompt' and spend the rest of the time pondering its existence ;p

reply
You could probably make a pretty good short story out of that scenario, sort of in the same category as Asimov's "The Feeling of Power".

The Asimov story is on the Internet Archive here [1]. That looks like it is from a handout in a class or something like that and has an introductory paragraph added which I'd recommend skipping.

There is no space between the end of that added paragraph and the first paragraph of the story, so what looks like the first paragraph of the story is really the second. Just skip down to that, and then go up 4 lines to the line that starts "Jehan Shuman was used to dealing with the men in authority [...]". That's where the story starts.

[1] https://ia800806.us.archive.org/20/items/TheFeelingOfPower/T...

reply
Thanks, I enjoyed reading that! The story that lay at the back of my mind when making the comment was "A Canticle for Leibowitz" [1]. A similar theme and from a similar era.

The story I have half a mind to write is about a future we envision already being around us, just a whole lot messier. Something along the lines of this XKCD [2].

[1] https://en.wikipedia.org/wiki/A_Canticle_for_Leibowitz

[2] https://xkcd.com/538/

reply
This is going into my training courses at work. Thanks!
reply
> Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.

On the foolishness of "natural language programming". https://www.cs.utexas.edu/~EWD/transcriptions/EWD06xx/EWD667...

    Since the early days of automatic computing we have had people that have felt it as a shortcoming that programming required the care and accuracy that is characteristic for the use of any formal symbolism. They blamed the mechanical slave for its strict obedience with which it carried out its given instructions, even if a moment's thought would have revealed that those instructions contained an obvious mistake. "But a moment is a long time, and thought is a painful process." (A.E.Houseman). They eagerly hoped and waited for more sensible machinery that would refuse to embark on such nonsensical activities as a trivial clerical error evoked at the time.
(and it continues for many more paragraphs)

https://news.ycombinator.com/item?id=8222017 2014 - 154 comments

https://news.ycombinator.com/item?id=35968148 2023 - 65 comments

https://news.ycombinator.com/item?id=43564386 2025 - 277 comments

reply
A structured language without ambiguity is not, in general, how people think or express themselves. In order for a model to be good at interfacing with humans, it needs to adapt to our quirks.

Convincing all of human history and psychology to reorganize itself in order to better service ai cannot possibly be a real solution.

Unfortunately, the solution is likely going to be further interconnectivity, so the model can just ask the car where it is, if it's on, how much fuel/battery remains, if it thinks it's dirty and needs to be washed, etc

reply
> in order to better service ai

That wasn't the point at all. The idea is about rediscovering what always worked to make a computer useful, and not even using the fuzzy AI logic.

reply
Yep, humans have had a remedy for the problem of ambiguity in language for tens of thousands of years, or there never could have been an agricultural revolution giving birth to civilization in the first place.

Effective collaboration relies on iterating over clarifications until ambiguity is acceptably resolved.

Rather than spending orders of magnitude more effort moving forward with bad assumptions from insufficient communication and starting over from scratch every time you encounter the results of each misunderstanding.

Most AI models still seem deep into the wrong end of that spectrum.

reply
>Convincing all of human history and psychology to reorganize itself in order to better service ai cannot possibly be a real solution.

I think there's a substantial subset of tech companies and honestly tech people who disagree. Not openly, but in the sense of 'the purpose of a system is what it does'.

reply
I agree but it feels like a type-of-mind thing. Some people gravitate toward clean determinism but others toward chaotic and messy. The former requires meticulous linear thinking and the latter uses the brain’s Bayesian inference.

Writing code is very much “you get what you write” but AI is like “maintain a probabilistic mental model of the possible output”. My brain honestly prefers the latter (in general) but I feel a lot of engineers I’ve met seem to stray towards clean determinism.

reply
I think it's very likely that machine intelligence will influence human language. It already is influencing the grammar and patterns we use.
reply
I think such influence will be extremely minimal, like confined to dozens of new nouns and verbs, but no real change in grammar, etc.

Interactions in natural language between humans and computers are, for your average person, much, much rarer than interactions between that same person and their dog. Humans also speak in natural language to their dogs: they simplify their speech and use extreme intonation and emphasis in a way we never do with each other. Yet, despite our having lived with dogs for 10,000+ years, it has not significantly affected our language (other than giving us new words).

EDIT: just found out HN annoyingly transforms U+202F (NARROW NO-BREAK SPACE), the ISO 80000-1 preferred way to type a thousands separator

reply
> I think such influence will be extremely minimal.

AI will accelerate “natural” change in language like anything else.

And as AI changes our environment (mentally, socially, and inevitably physically) we will change and change our language.

But what will be interesting is the rise of agent-to-agent communication via human languages. As that kind of communication shows up in training sets, there will be a powerful eigenvector of change we can't predict, other than that it will follow the path of efficient communication for the agents; and we are likely to pick up on those changes as we would from any other source of change.

reply
> Convincing all of human history and psychology to reorganize itself in order to better service ai cannot possibly be a real solution.

I'm on the spectrum and I definitely prefer structured interaction with various computer systems to messy human interaction :) There are people not on the spectrum who are able to understand my way of thinking (and vice versa) and we get along perfectly well.

Every human has their own quirks and the capacity to learn how to interact with others. AI is just another entity that stresses this capacity.

reply
Speak for yourself. I feel comfortable expressing myself in code or pseudo code and it’s my preferred way to prompt an LLM or write my .md files. And it works very effectively.
reply
> Unfortunately, the solution is likely going to be further interconnectivity, so the model can just ask the car where it is, if it's on, how much fuel/battery remains, if it thinks it's dirty and needs to be washed, etc

So no abstract reasoning.

reply
Prompting is definitely a skill, similar to "googling" in the mid 00's.

You see people complaining about LLM ability, and then you see their prompt, and it's the 2006 equivalent of googling "I need to know where I can go for getting the fastest service for car washes in Toronto that does wheel washing too"

reply
Ironically, the phrase that was a bad 2006 google query is a decent enough LLM prompt, and the good 2006 google query (keywords only) would be a bad LLM prompt.
reply
That’s not true at all. I get plenty of perfect responses with few word prompts often containing typos.

This isn’t always the case and depends on what you need.

reply
How customized are your system prompts (i.e. the static preferences you set at the app level)?

And do you perhaps also have memory enabled on the LLMs you are thinking of?

reply
Communication is definitely a skill, and most people suck at it in general. And frequently poor communication is a direct result from the fact that we don't ourselves know what we want. We dream of a genie that not only frees us from having to communicate well, but of having to think properly. Because thinking is hard and often inconvenient. But LLMs aren't going to entirely free us from the fact that if garbage goes in, garbage will come out.

"Communication usually fails, except by accident." —Osmo A. Wiio [1]

[1] https://en.wikipedia.org/wiki/Wiio%27s_laws

reply
I’ve been looking for tooling that would evaluate my prompt and give feedback on how to improve it. I can get somewhere with custom system prompts (“before responding, ensure…”), but it seems like someone is probably already working on this? Ideally it would run outside the actual thread to keep the context clean. There are some options popping up on Google, but I'm curious if anyone has a firsthand anecdote to share?
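A rough DIY version is just a second model call, outside the main conversation, that critiques the draft prompt before you send it. A minimal sketch, assuming the OpenAI Python SDK; the model name and the review wording are only placeholders:

    # Prompt-critique pass, run as its own call so nothing here
    # ever enters the real thread's context.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set

    REVIEW_INSTRUCTIONS = (
        "You are a prompt reviewer. Do not answer the prompt. "
        "List missing context, ambiguous wording, and unstated "
        "assumptions, then suggest a tightened rewrite."
    )

    def review_prompt(draft: str) -> str:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name
            messages=[
                {"role": "system", "content": REVIEW_INSTRUCTIONS},
                {"role": "user", "content": draft},
            ],
        )
        return resp.choices[0].message.content

    print(review_prompt("Clean up the data folder and fix the dates."))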
reply
The Lojban language already exists and allows for eliminating ambiguity. It's obviously not practical for general use, though.

https://en.wikipedia.org/wiki/Lojban

reply
Lojban is syntactically unambiguous. Semantically it's still just as vague as any natural language.
reply
How about...

https://en.wikipedia.org/wiki/Ithkuil

> Ithkuil is an experimental constructed language created by John Quijada. It is designed to express more profound levels of human cognition briefly yet overtly and clearly, particularly about human categorization. It is a cross between an a priori philosophical and a logical language. It tries to minimize the vagueness and semantic ambiguity in natural human languages. Ithkuil is notable for its grammatical complexity and extensive phoneme inventory, the latter being simplified in an upcoming redesign.

> ...

> Meaningful phrases or sentences can usually be expressed in Ithkuil with fewer linguistic units than natural languages. For example, the two-word Ithkuil sentence "Tram-mļöi hhâsmařpţuktôx" can be translated into English as "On the contrary, I think it may turn out that this rugged mountain range trails off at some point."

Half as Interesting - How the World's Most Complicated Language Works https://youtu.be/x_x_PQ85_0k (length 6:28)

reply
It reminds me of the difficulty of getting information on or off a blockchain. Yes, you’ve created this perfect logical world. But, getting in or out will transform you in unknown ways. It doesn’t make our world perfect.
reply
> But this will only happen after the last programmer has died and no-one will remember programming languages, compilers, etc.

If we're 'lucky' there will still be some 'priests' around like in the Foundation novels. They don't understand how anything works either, but can keep things running by following the required rituals.

reply
> Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity

So, back to COBOL? :)

reply
> So, back to COBOL? :)

well more like a structured _querying_ language

reply
So, back to Prolog? :)
reply
> structured language that eliminates ambiguity

That has been tried for almost half a century in the form of Cyc[1] and never accomplished much.

The proper solution here is to provide the LLM with more context, context that will likely be collected automatically by wearable devices, screen captures and similar pervasive technology in the not so distant future.

These kinds of quick trick questions are exactly the thing humans fail at if you just ask them out of the blue, without context.

[1] https://en.wikipedia.org/wiki/Cyc

reply
> Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.

We've truly gone full circle here, except now our programming languages have a random chance for an operator to do the opposite of what the operator does at all other times!

reply
One might think that a structured language is really desirable, but in fact one of the biggest mechanisms behind intelligence is stupidity. Let me explain: if you only innovate by piecing together the Lego pieces you already have, you'll be locked into predictable patterns and will plateau at some point. To break out of this, as we all know, there needs to be an element of randomness, one capable of going in the at-the-moment-ostensibly-wrong direction, so as to escape the plateau of mediocrity. In simulated annealing (and in LLM sampling) this is accomplished by turning up the temperature.

There are, however, many other layers that do this. Fallible memory - misremembering facts - is one. Failing to recognize patterns is another. Linguistic ambiguity is yet another, and that is a really big one (cf. the Sapir–Whorf hypothesis). It's really important to retain those methods of stupidity in order to achieve true intelligence. There can be no intelligence without stupidity.
reply
I believe this is the principle that makes biology such a superior technology.
reply
> structured language that eliminates ambiguity... CODE! Wait....
reply
>> Maybe in the distant future we'll realize that the most reliable way to prompt LLMs is by using a structured language that eliminates ambiguity; it will probably be rather unnatural and take some time to learn.

Like a programming language? But that's the whole point of LLMs, that you can give instructions to a computer using natural language, not a formal language. That's what makes those systems "AI", right? Because you can talk to them and they seem to understand what you're saying, and then reply to you and you can understand what they're saying without any special training. It's AI! Like the Star Trek[1] computer!

The truth of course is that as soon as you want to do something more complicated than a friendly chat you find that it gets harder and harder to communicate what it is you want exactly. Maybe that's because of the ambiguity of natural language, maybe it's because "you're prompting it wrong", maybe it's because the LLM doesn't really understand anything at all and it's just a stochastic parrot. Whatever the reason, at that point you find yourself wishing for a less ambiguous way of communication, maybe a formal language with a full spec and a compiler, and some command line flags and debug tokens etc... and at that point it's not a wonderful AI anymore but a Good, Old-Fashioned Computer, that only does what you want if you can find exactly the right way to say it. Like asking a Genie to make your wishes come true.

______________

[1] TNG duh.

reply
> Like a programming language?

Does the next paragraph not make that clear?

reply
> Us having to specify things that we would never specify when talking to a human.

The first time I read that question I got confused: what kind of question is that? Why is it being asked? It should be obvious that you need your car to wash it. The fact that it is being asked in my mind implies that there is an additional factor/complication to make asking it worthwhile, but I have no idea what. Is the car already at the car wash and the person wants to get there? Or do they want to idk get some cleaning supplies from there and wash it at home? It didn't really parse in my brain.

reply
I would say, the proper response to this question is not "walk, blablablah" but rather "What do you mean? You need to drive your car to have it washed. Did I miss anything?"
reply
Yes, this is what irks me about all the chatbots, and the chat interface as a whole. It is a chat-like UX without a chat-like experience. Like you are talking to a loquacious autist about their favorite topic every time.

Just ask me a clarifying question before going into your huge pitch. Chats are a back & forth. You don’t need to give me a response 10x longer than my initial question. Etc

reply
I think for "GPT-4o is my life partner" reasons, labs are a little bit icy about making the models overly human.
reply
Doubt. The labs are afraid of users becoming too hooked on their products? lol…
reply
People offing themselves because their lover convinced them it's time is absolutely not worth the extra addiction potential. We even witnessed this happen with OAI.

It's a fast track to public disdain and heavy handed government regulation.

reply
Regulation would be preferable for OpenAI to the tort lawyers. In general the LLM companies should want regulation because the alternative is tort, product liability tort, and contract law.

There is no way without the protections that could be afforded by regulation to offer such wide-ranging uses of the product without also accepting significant liability. If the range of "foreseeable misuse" is very broad and deep, so is the possible liability. If your marketing says that the bot is your lawyer, doctor, therapist, and spouse in one package, how is one to say that the company can escape all the comprehensive duties that attach to those social roles. Courts will weigh the tiny and inconspicuous disclaimers against the very large and loud marketing claims.

The companies could protect themselves in ways not unlike the ways in which the banking industry protects itself by replacing generic duties with ones defined by statute and regulation. Unless that happens, lawyers will loot the shareholders.

reply
It’s funny seeing you frame regulation as needed to protect trillion dollar monopolies from consumers and not the other way around.
reply
Or sama is just waiting to gate companions behind a premium subscription in some adult content package, as he has hinted something along these lines may be forthcoming. Maybe tie it in with the hardware device Ive is working on. Some sort of hellscape Tamagotchi.

Recall: "As part of our 'treat adult users like adults' principle, we will allow even more, like erotica for verified adults," Altman wrote in the Oct.

reply
I'm struggling a bit when it comes to wording this with social decorum, but how long do we reckon it takes until there are AI-powered adult toys? There's a market opportunity that I do not want to see being fulfilled, ever...
reply
I did work on a supervised fine-tuning project for one of the major providers a while back, and the documentation for the project was exceedingly clear about the extent to which they would not tolerate the model responding as if it was a person.

Some of the labs might be less worried about this, but they're not by any means homogenous.

reply
> Like you are talking to a loquacious autist about their favorite topic every time

That's the best part.

reply
People need to touch grass
reply
People need to smoke grass and chill out.
reply
With ChatGPT, at least, you can tell the bot to work that way using [persistent] Custom Instructions, if that's what you want. These aren't obeyed perfectly (none of the instructions are, AFAICT), but they do influence behavior.

A person can even hammer out an unstructured list of behavioral gripes, tell the bot to organize them into instructional prose, have it ask clarifying questions and revise based on answers, and produce directions for integrating them as Custom Instructions.

From then on, it will invisibly read these instructions into context at the beginning of each new chat.
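For a sense of the shape such instructions can take, here is a purely illustrative set (not my actual ones):

    Be terse and dry. Skip praise, affirmations, and filler.
    Answer only the question asked; no tangential essays.
    If a request is ambiguous or missing information, ask one
    clarifying question before answering.
    Say "I don't know" rather than guessing.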

Mold it and steer it to be how you want it to be.

(My own bot tends to be very dry, terse, non-presumptuous, pragmatic, and profane. It's been years now since it has uttered an affirmation like "That's a great idea!" or "Wow! My circuits are positively buzzing with the genius I'm seeing here!" or produced a tangential dissertation in response to a simple question. But sometimes it does come back with functional questions, or phrasing like "That shit will never work. Here's why.")

reply
This. Nailed it.
reply
>You don’t need to give me a response 10x longer than my initial question.

Except, of course, when that is exactly what the user wants.

reply
To me that’s not a chat interface, that’s a search interface.

Chat is a back & forth.

Search is a one-shot.

reply
That’s why I don’t understand why LLMs don’t ask clarifying questions more often.

In a real human to human conversation, you wouldn’t simply blurt out the first thing that comes to mind. Instead, you’d ask questions.

reply
Google Gemini often gives an overly lengthy response, and then at the end asks a question. But the question seems designed to move on to some unnecessary next step, possibly to keep me engaged and continue conversing, rather than seeking any clarification on the original question.
reply
This is a great point, because when you ask it (Claude) if it has any questions, it often turns out it has lots of good ones! But it doesn't ask them unless you ask.
reply
That's because it doesn't really have any questions until you ask it whether it does.
reply
This is the most important comment in this entire thread IMO, and it’s a bit buried.

This is the fundamental limitation with generative AI. It only generates, it does not ponder.

reply
You can define "ponder" in multiple ways, but really this is why thinking models exist - they turn over the prompt multiple times and iterate on responses to get to a better end result.
reply
Well I chose the word “ponder” carefully, given the fact that I have a specific goal of contributing to this debate productively. A goal that I decided upon after careful reflection over a few years of reading articles and internet commentary, and how it may affect my career, and the patterns I’ve seen emerge in this industry. And I did that all patiently. You could say my context window was infinite, only defined by when I stop breathing.

That is to say, all of that activity I listed is activity I’m confident generative AI is not capable of, fundamentally.

Like I said in a cousin comment, we can build Frankenstein algorithms and heuristics on top of generative AI but every indication I’ve seen is that that’s not sufficient for intelligence in terms of emergent complexity.

Imagine if we had put the same efforts towards neural networks, or even the abacus. “If I create this feedback loop, and interpret the results in this way, …”

reply
Agreed that feedback loops on top of generative LLMs will not get us to AGI or true intelligence.
reply
what is the difference between "ponder" and "generate"? the number of iterations?
reply
Probably the lack of external stimuli. Generative AI only continues generating when prompted. You can play games with agents and feedback loops but the fundamental unit of generative AI is prompt-based. That doesn’t seem, to me, to be a sufficient model for intelligence that would be capable of “pondering”.

My take is that an artificial model of true intelligence will only be achieved through emergent complexity, not through Frankenstein algorithms and heuristics built on generative AI.

Generative AI does itself have emergent complexity, but I’m bearish that if we would even hook it up to a full human sensory input network it would be anything more than a 21st century reverse mechanical Turk.

Edit: tl;dr Emergent complexity is a necessary but insufficient criterion for intelligence

reply
You can get it to change by putting instructions to ask questions in the system prompt, but I found it annoying after a while.
reply
Because 99% of the time it's not what users want.

You can get it to ask you clarifying questions just by telling it to. And then you usually just get a bunch of questions asking you to clarify things that are entirely obvious, and it quickly turns into a waste of time.

The only time I find that approach helpful is when I'm asking it to produce a function from a complicated English description I give it where I have a hunch that there are some edge cases that I haven't specified that will turn out to be important. And it might give me a list of five or eight questions back that force me to think more deeply, and wind up being important decisions that ensure the code is more correct for my purposes.

But honestly that's pretty rare. So I tell it to do that in those cases, but I wouldn't want it as a default. Especially because, even in the complex cases like I describe, sometimes you just want to see what it outputs before trying to refine it around edge cases and hidden assumptions.
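When I do want it, the instruction itself is short; something along these lines (the wording is just illustrative, not a canonical incantation):

    Before writing any code, list the clarifying questions you need
    answered about edge cases and unstated assumptions. Do not write
    any code until I have answered them.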

reply
This is a topic that I’ve always found rather curious, especially among this kind of tech/coding community that really should be more attuned to the necessity of specificity and accuracy. There seems to be a base set of assumptions that are intrinsic to, and a component of, ethnicities and cultures: the things one can assume one “would never specify when talking to a human [of one’s own ethnicity and culture].”

It’s similar to the challenge that foreigners have with cultural references and idioms and figurative speech a culture has a mental model of.

In this case, I think what is missing are a set of assumptions based on logic, e.g., when stating that someone wants to do something, it assumes that all required necessary components will be available, accompany the subject, etc.

I see this example as really not all that different than a meme that was common among I think the 80s and 90s, that people would forget buying batteries for Christmas toys even though it was clear they would be needed for an electronic toy. People failed that basic test too, and those were humans.

It is odd how people are reacting to AI not being able to do these kinds of trick questions, while if you posted something similar about how you tricked some foreigners you’d be called racist, or people would laugh if it was some kind of new-guy hazing.

AI is from a different culture and has just arrived here. Maybe we should be more generous and humane… most people are not humane though, especially the ones who insist they are.

Frankly, I’m not sure it bodes well for if aliens ever arrive on Earth, how people would respond; and AI is arguably only marginally different than humans, something an alien life that could make it to Earth surely would not be.

reply
Whether you view the question as nonsensical, the simplest example of a riddle, or even an intentional "gotcha" doesn't really matter. The point is that people are asking the LLMs very complex questions where the details are buried even more deeply than in this simple example. The answers they get could be completely incorrect, flawed approaches/solutions/designs, or just mildly misguided advice. People are then taking this output and citing it as proof, or even as objectively correct. I think there are a ton of reasons for this, but a particularly destructive one is that responses are designed to be convincing.

You _could_ say humans output similar answers to questions, but I think that is being intellectually dishonest. Context, experience, observation, objectivity, and actual intelligence are clearly important, and they are not something the LLM has.

It is increasingly frustrating to me that we cannot just use these tools for what they are good for. We have, yet again, allowed big tech to go balls deep into ham-fisting this technology irresponsibly into every facet of our lives in the name of capital. Let us not even go into the finances of this shitshow.

reply
Yeah people are always like "these are just trick questions!" as though the correct mode of use for an LLM is quizzing it on things where the answer is already available. Where LLMs have the greatest potential to steer you wrong is when you ask something where the answer is not obvious, the question might be ill-formed, or the user is incorrectly convinced that something should be possible (or easy) when it isn't. Such cases have a lot more in common with these "nonsensical riddles" than they do with any possible frontier benchmark.

This is especially obvious when viewing the reasoning trace for models like Claude, which often spends a lot of time speculating about the user's "hints" and trying to parse out the intent of the user in asking the question. Essentially, the model I use for LLMs these days is to treat them as very good "test takers" which have limited open book access to a large swathe of the internet. They are trying to ace the test by any means necessary and love to take shortcuts to get there that don't require actual "reasoning" (which burns tokens and increases the context window, decreasing accuracy overall). For example, when asked to read a full paper, focusing on the implications for some particular problem, Claude agents will try to cheat by skimming until they get to a section that feels relevant, then searching directly for some words they read in that section. They will do this even if told explicitly that they must read the whole paper. I assume this is because the vast majority of the time, for the kinds of questions that they are trained on, this sort of behavior maximizes their reward function (though I'm sure I'm getting lots of details wrong about the way frontier models are trained, I find it very unlikely that the kinds of prompts that these agents get very closely resemble data found in the wild on the internet pre-LLMs).

reply
I get that issue constantly. I somehow can't get any LLM to ask me clarifying questions before spitting out a wall of text with incorrect assumptions. I find it particularly frustrating.
reply
For GPT at least, a lot of it is because "DO NOT ASK A CLARIFYING QUESTION OR ASK FOR CONFIRMATION" is in the system prompt. Twice.

https://github.com/Wyattwalls/system_prompts/blob/main/OpenA...

reply
So this system prompt is always there, no matter if I'm using ChatGPT or Azure OpenAI with my own provisioned GPT? This explains why ChatGPT is a joke for professionals, where asking clarifying questions is the core of professional work.
reply
It's interesting how much focus there is on 'playing along' with any riddle or joke. This gives me some ideas for my personal context prompt to assure the LLM that I'm not trying to trick it or probe its ability to infer missing context.
reply
Are these actual (leaked?) system prompts, or are they just "I asked it what its system prompt is and here's the stuff it made up:" ?
reply
Out of curiosity: when you add custom instructions client-side, does it change this behavior?
reply
It changes some behavior, but there's some things that are frustratingly difficult to override. The GPT-5 version of ChatGPT really likes to add a bunch of suggestions for next steps at the end of every message (e.g. "if you'd like, I can recommend distances where it would be better to walk to the car wash and ones where it would be better to drive, let me know what kind of car you have and how far you're comfortable walking") and really loves bringing up resolved topics repeatedly (e.g. if you followed up the car wash question with a gas station question, every message will talk about the car wash again, often confusing the topics). Custom instructions haven't been able to correct these so far for me.
reply
For Claude at least, I have been getting more assumption-clarifying questions after adding some custom prompts. It is still making some assumptions, but asking some questions makes me feel more in control of the progress.

In terms of the behavior, technically it doesn’t override; instead, think of it as a nudge. Both the system prompt and your custom prompt participate in the attention process, so the output tokens get some influence from both - not equally, but to varying degrees and with some chance involved.

reply
It does. Just put it in the custom instructions section.
reply
Not for me, at least with ChatGPT. I am slowly moving to Gemini due to ChatGPT uptime issues. I will try it with Gemini too.
reply
"If you're unsure, ask. Don't guess." in prompts makes a huge difference, imo.
reply
I have that in my system prompt for ChatGPT and it almost never makes a difference. I can count on one hand the number of times it's asked in the past year, unless you count the engagement-hacking questions at the end of a response.
reply
In general spitting out a scrollbar of text when asked a simple question that you've misunderstood is not, in any real sense, a "chat".
reply
I use models through OpenRouter, and I only have this problem with OpenAI models. That's why I don't use them.
reply
The way I see it, the long game is to have agents in your life that memorize and understand your routine and facts about you, more and more. Imagine having an agent that knows about cars, and more specifically your car - when the checkups are due, when you washed it last time, etc. - another one that knows more about your hobbies, another that knows more about your XYZ, and so on.

The more specific they are, the more accurate they typically are.

reply
To really understand the user deeply and in great detail, I feel we would need models with changing weights, and everyone would have their own so it could truly adjust to them. Now we have a chunk of context that the model may or may not use properly if it gets too long. But then again, how do we prevent it from learning the wrong things if the weights are adjusting?
reply
In principle you're right but these things can get probably 60-70% of the job done. The rest is up to "you". Never rely on it blindly as we're being told kind of... :)
reply
> Us having to specify things that we would never specify

This is known, since 1969, as the frame problem: https://en.wikipedia.org/wiki/Frame_problem. An LLM's grasp of this is limited by its corpora, of course, and I don't think much of that covers this problem, since it's not required for human-to-human communication.

reply
A modern LLM's corpora is every piece of human writing ever produced.
reply
Not really, but even if it were true, I don't think humans ever explained to each other why you need to drive to the car wash even if it's 50 meters away. It's pretty obvious and intuitive.
reply
There have to be a lot of mentions of the purpose and approximate workings of a car wash, as well as lots of literature showing that when you drive somewhere, your car is now also at that place, while walking does not have the same effect.

It's then up to the model to make the connection "At the car wash people wash their car -> to wash your car you need your car to be present -> if you drive there your car will be there"

reply
No, I think they have explained this to each other (or something like it). But as you suggested, discussion is a lot more likely when there are corner cases or problems.
reply
Apart from the fact that that is utterly, demonstrably false, and the fact that corpora is plural, the fact still remains that we don't speak in those texts about things that don't need to be spoken about. Hence the LLM will miss that underlying knowledge.
reply
> "we don't speak in those text about things that don't need to be spoken about"

I'd imagine plenty of stories contain something like "I had an easy Saturday morning, I took my car to the carwash and called into a cafe for breakfast on my way home".

Plenty of instructables like "how to wash a car: if there's no carwash close enough for you to bring your car, don't worry, all you need is a bucket and a few tools..."

Several recipe blogs starting "I remember 1972 when grandpa drove his car to the carwash every afternoon while grandma made her world famous mustard and gooseberry cake, that car was always gleaming after he washed it at BigBrand CarWash 'drive your car to us so we can wash it' was their slogan and he would sing it around the house to the smell of baked eggs and mustard wafting through the kitchen..."

And innumerable SEO spam of the kind "Bob's car wash, why not bring drive take ride carry push transport your car automobile van SUV lorry truck 4by4 to our Bob's wash soap suds lather clean gleaming local carwash in your area ford chevvy dodge coupe not Nokia iphone xbox nike..."

against very few "I walked to the carwash because it was a lovely day and I didn't want to take the car out".

reply
The question is so outlandish that it is something that nobody would ever ask another human. But if someone did, then they'd reasonably expect to get a response consisting 100% of snark.

But the specificity required for a machine to deliver an apt and snark-free answer is -- somehow -- even more outlandish?

I'm not sure that I see it quite that way.

reply
But the number of outlandish requests in business logic is countless.

Like... In most accounting things, once end-dated and confirmed, a record should cascade that end-date to children and should not be able to repeat the process... Unless you have some data-cleaning validation bypass. Then you can repeat the process as much as you like. And maybe not cascade to children.

There are more exceptions, than there are rules, the moment you get any international pipeline involved.

reply
So, in human interaction: when the business logic goes wrong because it was described with a lack of specificity, who gets blamed for this?
reply
In my job the task of fully or appropriately specifying something is shared between PMs and the engineers. The engineers' job is to look carefully at what they received and highlight any areas that are ambiguous or under-specified.

LLMs AFAIK cannot do this for novel areas of interest. (ie if it's some domain where there's a ton of "10 things people usually miss about X" blog posts they'll be able to regurgitate that info, but are not likely to synthesize novel areas of ambiguity).

reply
They can, though. They just aren't always very good at it.

As an experiment, recently I've been using Codex CLI to configure some consumer networking gear in unusual ways to solve my unusual set of problems. Stuff that pros don't bother with (they don't have the same problems I face), and that consumers tend to shy away from futzing with. The hardware includes a cheap managed switch, an OpenWRT router, and a Mikrotik access point. It's definitely a rather niche area of interest.

And by "using," I mean: In this experiment, the bot gets right in there, plugging away with SSH directly.

It was awful with this at first, mostly consisting of a long-winded way to yet-again brick a device that lacks any OOB console port. It'd concoct these elaborate strings of shit and feed them in, and then I'd wander over and reset whatever box was borked again. Footgun city.

But after I tired of that, I had it define some rules for engaging with hardware, validation, constraints, and for order of execution, and commit those rules to AGENTS.md. It got pretty decent at following high-level instructions to get things done in the manner that I specified, and the footguns ceased.
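To give a flavor, a hypothetical AGENTS.md for this kind of setup might contain rules like these (illustrative only, not the actual file):

    # Rules for touching network hardware
    - Dump the device's current config to a local backup file before
      proposing any change.
    - Show the exact commands and wait for approval before running
      anything that changes state.
    - After every change, verify the device is still reachable; if it
      isn't, stop and report rather than retrying blindly.
    - Touch one device at a time; never change the switch and the
      router in the same step.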

I didn't save any time by doing this. But I also didn't have to think about it much: I never got bogged down in the wildly differing CLI syntax of the weirdo switch, the router (whose documentation is locked behind a bot firewall), or the access point's bespoke userland. I didn't touch those bits myself at all.

My time was instead spent observing the fuckups and creating a rather generic framework that manages the bot, and just telling it what to do -- sometimes, with some questions. I did that using plain English.

Now that this is done, I get to re-use this framework for as many projects as I dare, revising it where that seems useful.

(That cheap switch, by the way? It's broken. It has bizarro-world hardware failure modes that are unrelated to software configuration or firmware rev. Today, a very different cheap switch showed up to replace it. When I get around to it, I'll have the bot sort that transition out. I expect that to involve a bit of Q&A, and I also expect it to go fine.)

reply
I wasn't specific, because I'd rather not piss off my employer. But anyone who works in a similar space will recognise the pattern.

It's not underspecified. More... Overspecified. Because it needs to be. But AI will assume that "impossible" things never happen, and choose a happy path guaranteed to result in failure.

You have to build for bad data. Comes with any business of age. Comes with international transactions. Comes with human mistakes that just build up over the decades.

The apparent current state of a thing, is not representative of its history, and what it may or may not contain. And so you have nonsensical rules, that are aimed at catching the bad data, so you have a chance to transform it into good data when it gets used, without needing to mine the entire petabytes of historical data you have sitting around in advance.

reply
Depends on what was missing.

If we used macOS throughout the org, and we asked a SW dev team to build inventory tracking software without specifying the OS, I'd squarely put the blame on the SW team for building it for Linux or Windows.

(Yes, it should be a blameless culture, but if an obvious assumption like this is broken, someone is intentionally messing with you most likely)

There exists an expected level of context knowledge that is frequently underspecified.

reply
Humans ask each other silly questions all the time: a human confronted with a question like this would either blurt out a bad response like "walk" without thinking, before realizing what they are suggesting, or pause and respond with "to get your car washed, you need to get it there, so you must drive".

Now, humans, besides sometimes not thinking at all (which is really similar to how basic LLMs work), can easily fall victim to context too: if your boss, who never pranks you like this, asked you to take his car to a car wash, and asked whether you'll walk or drive but to consider the environmental impact, you might get stumped and respond wrong too.

(and if it's flat or downhill, you might even push the car for 50m ;))

reply
>The question is so outlandish that it is something that nobody would ever ask another human

There is an endless variety of quizzes just like this that humans ask each other for fun, there is a whole lot of "trick questions" humans ask other humans to trip them up, and there are all kinds of seemingly normal questions with dumb baked-in assumptions, quite close to this one, that humans exchange.

reply
I'd be entirely fine with a humorous response. The Gemini flash answer that was posted somewhere in this thread is delightful.
reply
I've used a few facetious comments in ChatGPT conversations. It invariably misses it and takes my words at face value. Even when prompted that there's sarcasm here which you missed, it apologizes and is unable to figure out what it's missing.

I don't know if it's a lack of intellect or the post-training crippling it with its helpful persona. I suspect a bit of both.

reply
You would be surprised, however, at how much detail humans also need to understand each other. We often want AI to just "understand" us in ways many people may not initially have understood us without extra communication.
reply
People poorly specifying problems and having bad models of what the other party can know (and then being surprised by the outcome) is certainly a more general albeit mostly separate issue.
reply
This issue is the main reason why a big percentage of jobs in the world exist. I don't have hard numbers, but my intuition is that about 30% of all jobs are mainly "understand what side a wants and communicate this to side b, so that they understand". Or another perspective: almost all jobs that are called "knowledge work" are like this. Software development is mainly this. Side a is humans, side b is the computer. The main goal of AI seems to be getting into this space and making a lot of people superfluous, which also (partly) explains why everyone is pouring this amount of money into AI.
reply
Developers are - on average - terrible at this. If they weren't, TPMs, Product Managers, CTOs, none of them would need to exist.

It's not specific to software, it's the entire world of business. Most knowledge work is translation from one domain/perspective to another. Not even knowledge work, actually. I've been reading some works by Adler[0] recently, and he makes a strong case for "meaning" only having a sense to humans, with each human having a completely different and isolated "meaning" for even the simplest of things, like a piece of stone. If there is difference and nuance to be found when it comes to a rock, what hope have we got when it comes to deep philosophy or the design of complex machines and software?

LLMs are not very good at this right now, but if they became a lot better at it, they would a) become more useful, and b) the work done to get them there would tell us a lot about human communication.

[0] https://en.wikipedia.org/wiki/Alfred_Adler

reply
> Developers are - on average - terrible at this. If they weren't, TPMs, Product Managers, CTOs, none of them would need to exist.

This is not really true; in fact, products become worse the farther away from the problem a developer is kept.

Best products I worked with and on (early in my career, before getting digested by big tech) had developers working closely with the users of the software. The worst were things like banking software for branches, where developers were kept as far as possible from the actual domain (and decision making) and driven with endless sterile spec documents.

reply
Yet IDEs are some of the worst things in the world. From Emacs to Eclipse to Xcode, they are almost all bad - yet they are written by devs for devs.
reply
Unfortunately, they are written by IDE-devs for non IDE-devs.
reply
I disagree, I feel (experienced) developers are excellent at this.

It's always about translating between our own domain and the customer's, and every other new project there's a new domain to get up to speed with in enough detail to understand what to build. What other professions do that?

That's why I'm somewhat scared of AIs - they know like 80% of the domain knowledge in any domain.

reply
I think developers are usually terrible at it only because they are way too isolated from the user.

If they had the chance to take the time to have a good talk with the actual users it would be different.

reply
The typical job of a CTO is nowhere near "finding out what business needs and translate that into pieces of software". The CTO's job is to maintain an at least remotely coherent tech stack in the grand scheme of things, to develop the technological vision of a company, to anticipate larger shifts in the global tech world and project those onto the locally used stack, constantly distilling that into the next steps to take with the local stack in order to remain competitive in the long run. And of course to communicate all of that to the developers, to set guardrails for the less experienced, to allow and even foster experimentation and improvements by the more experienced.

The typical job of a Product Manager is also not to directly perform this mapping, although the PM is much closer to that activity. PMs mostly need to enforce coherence across an entire product with regard to the ways of mapping business needs to software features that are being developed by individual developers. They still usually involve developers to do the actual mapping, and don't really do it themselves. But the Product Manager must "manage" this process, hence the name, because without anyone coordinating the work of multiple developers, those will quickly construct mappings that may work and make sense individually, but won't fit together into a coherent product.

Developers are indeed the people responsible to find out what business actually wants (which is usually not equal to what they say they want) and map that onto a technical model that can be implemented into a piece of software - or multiple pieces, if we talk about distributed systems. Sometimes they get some help by business analysts, a role very similar to a developer that puts more weight on the business side of things and less on the coding side - but in a lot of team constellations they're also single-handedly responsible for the entire process. Good developers excel at this task and find solutions that really solve the problem at hand (even if they don't exactly follow the requirements or may have to fill up gaps), fit well into an existing solution (even if that means bending some requirements again, or changing parts of the solution), are maintainable in the long run and maximize the chance for them to be extendable in the future when the requirements change. Bad developers just churn out some code that might satisfy some tests, may even roughly do what someone else specified, but fails to be maintainable, impacts other parts of the system negatively, and often fails to actually solve the problem because what business described they needed turned out to once again not be what they actually needed. The problem is that most of these negatives don't show their effects immediately, but only weeks, months or even years later.

LLMs currently are on the level of a bad developer. They can churn out code, but not much more. They fail at the more complex parts of the job, basically all the parts that make "software engineering" an engineering discipline and not just a code generation endeavour, because those parts require adversarial thinking, which is what separates experts from anyone else. The following article was quite an eye-opener for me on this particular topic: https://www.latent.space/p/adversarial-reasoning - I highly suggest anyone working with LLMs to read it.

reply
This is why we fed it the whole internet and every library as training data...

By now it should know this stuff.

reply
Future models know it now, assuming they suck in mastodon and/or hacker news.

Although I don't think they actually "know" it. This particular trick question will be in the bank just like the seahorse emoji or how many Rs in strawberry. Did they start reasoning and generalising better or did the publishing of the "trick" and the discourse around it paper over the gap?

I wonder if in the future we will trade these AI tells like 0days, keeping them secret so they don't get patched out at the next model update.

reply
The answer can be “both”.

They won’t get this specific question wrong again; but also they generalise, once they have sufficient examples. Patching out a single failure doesn’t do it. Patch out ten equivalent ones, and the eleventh doesn’t happen.

reply
Yeah, the interpolation works if there are enough close examples around it. Problem is that the dimensionality of the space you are trying to interpolate in is so incomprehensibly big that even training on all of the internet, you are always going to have stuff that just doesn't have samples close by.
reply
Even I don’t “know” how many “R”s there are in “strawberry”. I don’t keep that information in my brain. What I do keep is the spelling of the word “strawberry” and the skill of being able to count so that I can derive the answer to that question anytime I need.
reply
Right. The equivalent here, for this problem, would be something like asking for context. And the LLM response should've been:

"Well, you need your car to be at the car wash in order to wash it, right?"

reply
For many words I can't say how many of each letter they contain; I only have an abstract memory of how they look, so when I write, say, "strawbery", I just realize it looks odd and correct it.
reply
Right. But, unlike AI, we are usually aware when we're lacking context and inquire before giving an answer.
reply
Wouldn't that be nice. I've been party and witness to enough misunderstandings to know that this is far from universally true, even for people like me who are more primed than average to spot missing context.
reply
I never said it's universally true.
reply
TIL my wife may be AI!
reply
> You would be surprised, however, at how much detail humans also need to understand each other.

But in this given case, the context can be inferred. Why would I ask whether I should walk or drive to the car wash if my car is already at the car wash?

reply
But also why would you ask whether you should walk or drive if the car is at home? Either way the answer is obvious, and there is no way to interpret it except as a trick question. Of course, the parsimonious assumption is that the car is at home so assuming that the car is at the car wash is a questionable choice to say the least (otherwise there would be 2 cars in the situation, which the question doesn't mention).
reply
But you're ascribing understanding to the LLM, which is not what it's doing. If the LLM understood you, it would realise it's a trick question and, assuming it was British, reply with "You'd drive it because how else would you get it to the car wash you absolute tit."

Even the higher-level reasoning models, while answering the question correctly, don't grasp the higher context that the question is obviously a trick question. They still answer earnestly. Granted, it is a tool that is doing what you want (answering a question), but let's not ascribe higher understanding than what is clearly observed - and also based on what we know about how LLMs work.

reply
> They still answer earnestly.

Gemini at least is putting some snark into its response:

“Unless you've mastered the art of carrying a 4,000-pound vehicle over your shoulder, you should definitely drive. While 150 feet is a very short walk, it's a bit difficult to wash a car that isn't actually at the car wash!”

reply
Marketing plan comes to mind for labs: find AI tells, fix them, & astroturf on socials that only _your_ frontier model really understands the world
reply
I think a good rule of thumb is to default to assuming a question is asked in good faith (i.e. it's not a trick question). That goes for human beings and chat/AI models.

In fact, it's particularly true for AI models because the question could have been generated by some kind of automated process. e.g. I write my schedule out and then ask the model to plan my day. The "go 50 metres to car wash" bit might just be a step in my day.

reply
> I think a good rule of thumb is to default to assuming a question is asked in good faith (i.e. it's not a trick question).

Sure, as a default this is fine. But when things don't make sense, the first thing you do is toss those default assumptions (and probably we have some internal ranking of which ones to toss first).

The normal human response to this question would not be to take it as a genuine question. For most of us, this quickly trips into "this is a trick question".

reply
Rule of thumb for who, humans or chatbots? For a human, who has their own wants and values, I think it makes perfect sense to wonder what on earth made the interlocutor ask that.
reply
Rule of thumb for everyone (i.e. both). If I ask you a question, start by assuming I want the answer to the question as stated unless there is a good reason for you to think it's not meant literally. If you have a lot more context (e.g. you know I frequently ask you trick or rhetorical questions or this is a chit-chat scenario) then maybe you can do something differently.

I think being curious about the motivations behind a question is fine but it only really matters if it's going to affect your answer.

Certainly when dealing with technical problem solving, I often find myself asking extremely simple questions, and it often wastes time when people don't answer directly, instead answering some completely different question or demanding explanations of why I'm asking for certain information when I'm just trying to help them.

reply
> Rule of thumb for everyone (i.e. both).

That's never been how humans work. Going back to the specific example: the question is so nonsensical on its face that the only logical conclusion is that the asker is taking the piss out of you.

> Certainly when dealing with technical problem solving I often find myself asking extremely simple questions and it often wastes time when people don't answer directly

Context and the nature of the questions matters.

> demanding explanations why I'm asking for certain information when I'm just trying to help them.

Interestingly, they're giving you information with this. The person you're asking doesn't understand the link between your question and the help you're trying to offer. This is manifesting as a belief that you're wasting their time and they're reacting as such. Serious point: invest in communication skills to help draw the line between their needs and how your questions will help you meet them.

reply
Sure, in a context in which you're solving a technical problem for me, it's fair that I shouldn't worry too much about why you're asking - unless, for instance, I'm trying to learn to solve the question myself next time.

Which sounds like a very common, very understandable reason to think about motivations.

So even in that situation, it isn't simple.

This probably sucks for people who aren't good at theory-of-mind reasoning. But perhaps surprisingly, that isn't the case for chatbots. They can be creepily good at it, provided they have the context - they just aren't instruction-tuned to ask short clarifying questions in response to a question, which humans do, and which would solve most of these gotchas.

reply
Therefore the correct response would be to inquire back to clarify the question being asked.
reply
Given that an estimated 70% of human communication is non-verbal, it's not so surprising though.
reply
Does that stat predate the modern digital age by a number of years?
reply
I regularly tell new people at work to be extremely careful when making requests through the service desk — manned entirely by humans — because the experience is akin to making a wish from an evil genie.

You will get exactly what you asked for, not what you wanted… probably. (Random occurrences are always a possibility.)

E.g.: I may ask someone to submit a ticket to “extend my account expiry”.

They’ll submit: “Unlock Jiggawatts’ account”

The service desk will reset my password (and neglect to tell me), leaving my expired account locked out in multiple orthogonal ways.

That’s on a good day.

Last week they created Jiggawatts2.

The AIs have got to be better than this, surely!

I suspect they already are.

People are testing them with trick questions while the human examiner is on edge, aware of and looking for the twist.

Meanwhile ordinary people struggle with concepts like “forward my email verbatim instead of creatively rephrasing it to what you incorrectly thought it must have really meant.”

reply
There's a lot of overlap between the smartest bears and the dumbest humans. However, we would want our tools to be more useful than the dumbest humans...
reply
> Us having to specify things that we would never specify when talking to a human.

Interesting conclusion! From the Mastodon thread:

> To be fair it took me a minute, too

I presume this was written by a human. (I'll leave open the possibility that it was LLM generated.)

So much for "never" needing to specify ambiguous scenarios when talking to a human.

reply
The broad point about assumptions is correct, but the solution is even simpler than us having to think of all these things; you can essentially just remind the model to "think carefully" -- without specifying anything more -- and it will reason out a better answer: https://news.ycombinator.com/item?id=47040530

When coding, I know they can assume too much, so I encourage the model to ask clarifying questions and do not let it start any code generation until all its doubts are clarified. Even the free-tier models ask highly relevant questions and, once those are answered, pretty much 1-shot the solutions.

This is still wayyy more efficient than having to specify everything because they make very reasonable assumptions for most lower-level details.
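
A minimal sketch of that kind of workflow, assuming the OpenAI Python SDK (the model name, the READY marker, and the example task are all made up for illustration, not the commenter's actual setup):

    # Keep the model in a "clarify first" loop; only ask for code once it
    # reports it has no remaining questions.
    from openai import OpenAI

    client = OpenAI()
    SYSTEM = ("You are a coding assistant. Before writing any code, ask me "
              "clarifying questions about anything ambiguous. When you have "
              "no remaining doubts, reply with the single word READY.")

    messages = [{"role": "system", "content": SYSTEM},
                {"role": "user", "content": "Add caching to the fetch_users() endpoint."}]

    while True:
        resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        text = resp.choices[0].message.content
        if text.strip() == "READY":
            break                          # all doubts resolved
        print(text)                        # the model's clarifying questions
        messages.append({"role": "assistant", "content": text})
        messages.append({"role": "user", "content": input("> ")})  # your answers

    messages.append({"role": "user", "content": "Now write the code."})
    final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    print(final.choices[0].message.content)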

reply
I think part of the failure is that it has this helpful assistant personality that's a bit too eager to give you the benefit of the doubt. It tries to interpret your prompt as reasonable if it can. It can interpret it as you just wanting to check if there's a queue.

Speculatively, it's falling for the trick question partly for the same reason a human might, but this tendency is pushing it to fail more.

reply
It’s just not intelligent or reasoning, and this sort of question exposes that more clearly.

Surely anyone who has used these tools is familiar with the sometimes insane things they try to do (deleting tests, incorrect code, changing the wrong files etc etc). They get amazingly far by predicting the most likely response and having a large corpus but it has become very clear that this approach has significant limitations and is not general AI, nor in my view will it lead to it. There is no model of the world here but rather a model of words in the corpus - for many simple tasks that have been documented that is enough but it is not reasoning.

I don’t really understand why this is so hard to accept.

reply
> I don’t really understand why this is so hard to accept.

I struggle with the same question. My current hypothesis is a kind of wishful thinking: people want to believe that the future is here. Combined with the fact that humans tend to anthropomorphize just about everything, it’s just a really good story that people can’t let go of. People behave similarly with respect to their pets, despite, e.g., lots of evidence that the mental state of one’s dog is nothing like that of a human.

reply
I agree completely. I'm tempted to call it a clear falsification of any "reasoning" claim that some of these models have in their name.

But I think it's possible that there is an early cost optimisation step that prevents a short and seemingly simple question even getting passed through to the system's reasoning machinery.

However, I haven't read anything on current model architectures suggesting that their so called "reasoning" is anything other than more elaborate pattern matching. So these errors would still happen but perhaps not quite as egregiously.

reply
If you ask a bunch of people the same question in a context where they aren't expecting a trick question, some of them will say walk. LLMs sometimes say walk, and sometimes say drive. Maybe LLMs fall for these kinds of tricks more often than humans; I haven't seen any study try to measure this. But saying it's proof they can't reason is a double standard.
reply
Why should odd failure modes invalidate the claim of reasoning or intelligence in LLMs? Humans also have odd failure modes, in some ways very similar to those of LLMs. Normal functioning humans make assumptions, lose track of context, or just outright get things wrong. And then there are people with rare neurological disorders like somatoparaphrenia, a disorder in which people deny ownership of a limb and confabulate wild explanations for it when prompted. Humans are prone to the very same kind of wild confabulation from impaired self-awareness that plagues LLMs.

Rather than a denial of intelligence, to me these failure modes raise the credence that LLMs are really onto something.

reply
This reminds me of the "if you were entirely blind, how would you tell someone that you want something to drink"-gag, where some people start gesturing rather than... just talking.

I bet a not insignificant portion of the population would tell the person to walk.

reply
Yes, there are thousands of videos of these sorts of pranks on TikTok.

Another one: ask someone how to pronounce “Y, E, S”. They say “yes”. Then say “add an E to the front of those letters - how do you pronounce that word?” And people start saying things like “E-yes” instead of “eyes”.

reply
> > so you need to tell them the specifics

> That is the entire point, right?

Honestly it is a problem with using GPT as a coding agent. It would literally rewrite the language runtime to make a bad formula or specification work.

That's what I like with Factory.ai droid: making the spec with one agent and implementing it with another agent.

reply
> It would literally rewrite the language runtime

If you let the agent go down this path, that's on you not the agent. Be in the loop more

> making the spec with one agent and implementing it with another agent

You don't need a specialized framework to do this, just read/write tools. I do it this way all the time

reply
But you would also never ask such an obviously nonsensical question to a human. If someone asked me such a question my question back would be "is this a trick question?". And I think LLMs have a problem understanding trick questions.
reply
I think that was somewhat the point of this: a simplified stand-in for the more complex scenarios that can happen. Problems that we need AI to solve will most of the time be ambiguous, and the more complex the problem is, the harder it is to pinpoint why the LLM is failing to solve it.
reply
We would also not ask somebody whether we should walk or drive. In fact, if somebody asked me in an honest, this-is-not-a-trick-question way, I would be confused and ask where the car is.

It seems ChatGPT now answers correctly. But if somebody plays around with a model that gets it wrong, what if you ask it this: "This is a trick question. I want to wash my car. The car wash is 50 m away. Should I drive or walk?"

reply
> You would not start with "The car is functional [...]"

Nope, and a human might not respond with "drive". They would want to know why you are asking the question in the first place, since the question implies something hasn't been specified or that you have some motivation beyond a legitimate answer to your question (in this case, it was tricking an LLM).

Why the LLM doesn't respond "drive..?" I can't say for sure, but maybe it's been trained to be polite.

reply
It is true that we don't need to specify some things, and that is nice. It is, though, also the reason why software is often badly specified and corner cases are not handled. Of course the car is ALWAYS at home, in working condition, filled with gas, and you have your driving license with you.
reply
But you wouldn't have to ask that silly question when talking to a human either. And if you did, many humans would probably assume you're either adversarial or very dumb, and their responses could be very unpredictable.
reply
Exactly - only if an AI can get the basics right is it revolutionary
reply
I have an issue with these kinds of cases though, because they seem like trick questions - it's an insane question to ask, for exactly the reasons people give when explaining why the models get it wrong. So one possible answer is "what the hell are you talking about?", but the other entirely reasonable one is to assume anything else where the incredibly obvious problem of getting the car there is solved (e.g. your car is already there and you need to collect it, you're asking about buying supplies at the shop rather than having it washed there, whatever).

Similarly with "strawberry" - with no other context, when an adult asks how many r's are in the word, a very reasonable interpretation is that they are asking "is it a single or double r?".

And trick questions are commonly designed for humans too - like answering "toast" for what goes in a toaster, lots of basic maths things, "where do you bury the survivors", etc.

reply
Strawberry isn't a trick question; LLMs just don't see letters like that. I just asked ChatGPT how many Rs are in "Air Fryer" and it said two, one in "air" and one in "fryer".
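
A quick way to see why (a minimal sketch, assuming the tiktoken library is installed via pip install tiktoken):

    # Models operate on token chunks, not individual letters, which is
    # why letter-counting questions trip them up.
    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    for text in ("strawberry", "Air Fryer"):
        ids = enc.encode(text)
        pieces = [enc.decode_single_token_bytes(t).decode("utf-8", errors="replace")
                  for t in ids]
        print(text, "->", pieces)  # the chunks the model actually "sees"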

I do think it can be useful though that these errors still exist. They can break the spell for some who believe models are conscious or actually possess human intelligence.

Of course there will always be people who become defensive on behalf of the models, as if the models are intelligent but just on the spectrum and we are merely asking them the wrong questions.

reply
You would never ask a human this question. Right?
reply
We have a long tradition of asking each other riddles. A classic one asks, "A plane crashes on the border between France and Germany. Where do they bury the survivors?"

Riddles are such a big part of the human experience that we have whole books of collections of them, and even a Batman villain named after them.

reply
Hmm... We ask riddles for fun and there is almost an expectation that a good riddle will yield a wrong answer.
reply
In the end, formal, rule-based systems aka Programming Languages will be invented to instruct LLMs.
reply
> we can assume similar issues arise in more complex cases

I would assume similar issues are more rare in longer, more complex prompts.

This prompt is ambiguous about the position of the car because it's so short. If it were longer and more complex, there could be more signals about the position of the car and what you're trying to do.

I must confess the prompt confuses me too, because it's obvious you take the car to the car wash, so why are you even asking?

Maybe the dirty car is already at the car wash but you aren't for some reason, and you're asking if you should drive another car there?

If the prompt was longer with more detail, I could infer what you're really trying to do, why you're even asking, and give a better answer.

I find LLMs generally do better on real-world problems if I prompt with multiple paragraphs instead of an ambiguous sentence fragment.

LLMs can help build the prompt before answering it.

And my mind works the same way.

reply
The question isn't something you'd ask another human in all seriousness, but it is a test of LLM abilities. If you asked the question to another human they would look at you sideways for asking such a dumb question, but they could immediately give you the correct answer without hesitation. There is no ambiguity when asking another human.

This question goes in with the "strawberry" question which LLMs will still get wrong occasionally.

reply
> That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

I am not sure. If somebody asked me that question, I would try to figure out what’s going on there. What’s the trick? Of course I’d respond by asking for specifics, but I guess the LLM is taught to be “useful” and to try to answer as best as possible.

reply
One of the failure modes I find really frustrating is when I want a coding agent to make a very specific change, and it ends up doing a large refactor to satisfy my request.

There is an easy solution, but it requires adding the instructions to the context: require that any task that cannot be completed as requested (e.g., due to missing constraints, ambiguous instructions, or unexpected problems that would lead to unrelated refactors) is not attempted until the model has asked clarifying questions.
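
For example, something like this (a hypothetical snippet for whatever instructions file your agent reads; the wording is illustrative, not taken from any particular tool):

    ## Scope control
    - Make only the change I asked for. Do not refactor, rename, or
      reorganize unrelated code as part of the task.
    - If the change cannot be completed as requested (missing constraints,
      ambiguous instructions, or it would force an unrelated refactor),
      stop and ask clarifying questions before writing any code.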

Yes, LLMs are trained to follow instructions at any cost because that's how their reward function works. They don't get bonus points for clearing up confusion; they get a cookie for doing the task. This research paper seems relevant: https://arxiv.org/abs/2511.10453v2

reply
>That is the entire point, right? Us having to specify things that we would never specify when talking to a human.

But the question is not clear to a human either. The question is confused.

I read the headline and had no clue it was an LLM prompt. I read it 2 or 3 times and wondered "WTF is this shit?" So if you want an intelligent response from a human, you're going to need to adjust the question as well.

reply
But it's a question you would never ask a human! In most contexts, humans would say, "you are kidding, right?" or "um, maybe you should get some sleep first, buddy" rather than giving you the rational thinking-exam correct response.

For that matter, if humans were sitting at the rational thinking-exam, a not insignificant number would probably second-guess themselves or otherwise manage to befuddle themselves into thinking that walking is the answer.

reply
A real human in this situation will realize it is a joke after a few seconds of shock that you asked, and laugh without asking more. If you really are serious about the question, they laugh harder, thinking you are playing stupid for effect.
reply
I would ask you to stop being a dumb ass if you asked me the question...
reply
Only to be tripped up by countless "hidden assumptions" questions, similar to the ones humans regularly run into
reply