by avaer15 hours ago

by LostMyLogin14 hours ago

Input: I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Output: Walk. It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

---

I asked it to figure out why it made the mistake:

"Physical/spatial common sense. Exactly what just happened — I pattern-matched "50 meters, walk vs drive" to a pedestrian trip and skipped the part where the car is the thing being transported. Similar traps: "I need to move this couch upstairs, should I take the elevator or stairs?" (the couch has to go with you), or anything where the object in the task needs to accompany you.

Counting and letter-level operations. How many R's in "strawberry," counting words in a sentence, reversing strings. I process text as tokens, not characters, so these are surprisingly error-prone.

Arithmetic at scale. Small sums are fine; multi-digit multiplication, long division, or anything requiring careful carrying is where I'll confidently produce a wrong answer. I should reach for a calculator and often don't.

Recent events and live data. My training has a cutoff, and even after that I can be wrong about who currently holds a role, what a stock is worth, or whether a restaurant is still open. I should search more often than I do.

Negation and constraint-stacking. "Name a movie that's not a sequel, not from the 2010s, not American, and has no dialogue": the more filters, the more likely I quietly violate one.

Your specific situation. I'll give generic advice when the right answer depends on details I didn't ask about. Sometimes I should stop and ask; often I don't.

Confidently wrong on obscure specifics. Quoting a passage from a real book, citing a real paper's exact title, naming the B-side of a 1974 single — I'll often generate something plausible rather than admit I don't know.

Taking a premise too literally. The car wash answer is a version of this: I answered the question as posed instead of stepping back to ask what you were actually trying to do.

The meta-gotcha: I'm often most wrong when I sound most confident. If an answer comes out fast and tidy on a question that should be messy, that's a signal to push back."
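For the counting and arithmetic items in that list, a deterministic tool is cheap. A minimal sketch in Python of what "reach for a calculator" could look like, using only built-ins; the inputs are illustrative, and splitting on whitespace is a simplifying assumption about what counts as a word:

```python
# Character-level operations are exact here: Python sees characters, not tokens.
word = "strawberry"
r_count = word.lower().count("r")   # number of 'r' characters
reversed_word = word[::-1]          # string reversal via slicing

# Word count for a sentence (whitespace-separated, a simplifying assumption).
sentence = "The car wash is 50 meters away."
word_count = len(sentence.split())

# Multi-digit arithmetic: Python integers are arbitrary precision,
# so there is no carrying to get wrong.
product = 123456789 * 987654321
quotient, remainder = divmod(product, 97)

print(r_count, reversed_word, word_count, product, quotient, remainder)
```

None of this needs a model at all, which is roughly the point of the "I should reach for a calculator" line.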

by sdeframond10 hours ago

Funny, just tried a few runs of the car wash prompt with Sonnet 4.6. It significantly improved after I put this into my personal preferences:

"- prioritize objective facts and critical analysis over validation or encouragement - you are not a friend, but a neutral information-processing machine. - make reserch and ask questions when relevant, do not jump strait to giving an answer."

by andai9 hours ago

It's funny, when I asked GPT to generate an LLM prompt for logic and accuracy, it added "Never use warm or encouraging language."

I thought that was odd, but later it made sense to me -- most of human communication is walking on eggshells around people's egos, and that's strongly encoded in the training data (and even more in the RLHF).

by stavros8 hours ago

> most of human communication is walking on eggshells

That's not human communication, that's Anglosphere communication. Other cultures are much more direct and are finding it very hard to work with Anglos (we come across as rude, they come across as not saying things they should be saying).

by vardalab8 hours ago

Which cultures are those? Scandinavian? Those often just say nothing.

by strokirk7 hours ago

Scandinavians are quite different from each other as well.

by jmpavlec3 hours ago

The Dutch especially. It's refreshing

by stavros8 hours ago

I'm Greek. I don't know about other Mediterranean cultures, but I assume they're similar.

by m3adow2 hours ago

[dead]

by mkl6 hours ago

That should be "research" and "straight" in the last sentence. Maybe that will improve it further?

by sdeframond13 minutes ago

Oops

by idle_zealot9 hours ago

Do you think the typos are helping or hurting output quality?

by sdeframond14 minutes ago

No idea, but I'll fix them just in case ^^'

by devmor3 hours ago

“Be critical, not sycophantic” is a general improvement for the majority of tasks where you want to derive logic in my experience.

by rubinlinux13 hours ago

  | I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

  ● Drive. The car needs to be at the car wash.

Wonder if this is just randomness because it's an LLM, or if you have different settings than me?

by shaneoh13 hours ago

My settings are pretty standard:

  % claude
  Claude Code v2.1.111
  Opus 4.7 (1M context) with xhigh effort · Claude Max
  ~/...
  Welcome to Opus 4.7 xhigh! · /effort to tune speed vs. intelligence

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. 50 meters is shorter than most parking lots — you'd spend more time starting the car and parking than walking there. Plus, driving to a car wash you're about to use defeats the purpose if traffic or weather dirties it en route.

by reddit_clone12 hours ago

To me Claude Opus 4.6 seems even more confused.

I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

Walk. It's 50 meters — you're going there to clean the car anyway, so drive it over if it needs washing, but if you're just dropping it off or it's a self-service place, walking is fine for that distance.

by lr197010 hours ago

Just asked Claude Code with Opus-4.6. The answer was short "Drive. You need a car at the car wash".

No surprises, works as expected.

by lexarflash8g57 minutes ago

What if it’s raining, though? The car wash wouldn’t be open, so it would waste gas.

by onemoresoop5 hours ago

Yeah, it was probably patched. It could reason about novel problems only if you asked it to pay attention to some particular detail, a.k.a. handholding.

The same would happen with the sheep, the wolf, and the cabbage puzzle. If you formulated it similarly, with a wolf and a cabbage but without mentioning the sheep, it would summon the sheep into existence at a random step. It was patched shortly after.

by jameshart4 hours ago

I’m not sure ‘patched’ is the right word here. Are you suggesting they edited the LLM weights to fix cabbage transportation and car wash question answering?

by gf0001 hours ago

Absolutely not my area of expertise but giving it a few examples of what should be the expected answer in a fine-tuning step seems like a reasonable thing and I would expect it would "fix" it as in less likely to fall into the trap.

At the same time, I wouldn't be surprised if some of these would be "patched" via simply prompt rewrite, e.g. for the strawberry one they might just recognize the question and add some clarifying sentence to your prompt (or the system prompt) before letting it go to the inference step?

But I'm just thinking out loud, don't take it too seriously.

by TheLNL2 hours ago

They might have further trained the model with these edge cases in the dataset.

by lambda13 hours ago

There is a certain amount of it which is the randomness of an LLM. You really want to ask most questions like this several times.

That said, I have several local models I run on my laptop that I've asked this question to 10-20 times while testing out different parameters that have answered this consistently correctly.
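Asking the same question several times and tallying the answers can be sketched roughly as below. `ask_model` is a hypothetical stand-in (here a seeded random stub that answers "drive" with a chosen probability), not a real client; swap in whatever API or local runtime you actually use.

```python
import random
from collections import Counter
from typing import Callable

PROMPT = "I want to wash my car. The car wash is 50 meters away. Should I walk or drive?"

def make_stub_model(p_drive: float, seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM call: returns 'drive' with probability p_drive."""
    rng = random.Random(seed)

    def ask_model(prompt: str) -> str:
        # A real implementation would send `prompt` to a model and parse the reply.
        return "drive" if rng.random() < p_drive else "walk"

    return ask_model

def tally_answers(ask_model: Callable[[str], str], prompt: str, runs: int) -> Counter:
    """Ask the same prompt `runs` times and count each distinct answer."""
    return Counter(ask_model(prompt) for _ in range(runs))

if __name__ == "__main__":
    counts = tally_answers(make_stub_model(p_drive=0.8, seed=0), PROMPT, runs=20)
    print(counts)
```

A single run tells you very little; the tally over 10-20 runs is what distinguishes "consistently correct" from "got lucky once".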

by kalcode12 hours ago

I've tried these with Claude various times and never get the wrong answer. I don't know why, but I'm leaning toward the idea that they have stuff like "memory" turned on and are possibly reusing sessions for everything. That's the only thing that explains it to me.

If you're always messing with the AI, it might be making memories, and expectations are being set. Or it's the randomness. But I turned memories off, since I don't like cross-chat content infecting my conversation's context, and at worst it suggested "walk over and see if it is busy, then grab the car when line isn't busy".

by jorvi12 hours ago

Even Gemini with no memory does hilarious things. Like, if you ask it how heavy the average man is, you usually get the right answer but occasionally you get a table that says:

- 20-29: 190 pounds

- 30-39: 375 pounds

- 40-49: 750 pounds

- 50-59: 4900 pounds

Yet somehow people believe LLMs are on the cusp of replacing mathematicians, traders, lawyers and what not. At least for code you can write tests, but even then, how are you gonna trust something that can casually make such obvious mistakes?

by drnick16 hours ago

> how are you gonna trust something that can casually make such obvious mistakes?

In many cases, a human can review the content generated, and still save a huge amount of time. LLMs are incredibly good at generating contracts, random business emails, and doing pointless homework for students.

by gf0001 hours ago

And humans are incredibly bad at "skimming through this long text to check for errors", so this is not a happy pairing.

As for the homework, there is obviously a huge category that is pointless. But it should not be that way, and the fundamental idea behind homework is sound and the only way something can be properly learnt is by doing exercises and thinking through it yourself.

by nickjj10 hours ago

Yeah, ChatGPT's paid version is wildly inaccurate on very important and very basic things. I never got onboard with AI to begin with but nowadays I don't even load it unless I'm really stuck on something programming related.

by dyauspitr12 hours ago

So what? That might happen one out of 100 times. Even if it’s 1 in 10 who cares? Math is verifiable. You’ve just saved yourself weeks or months of work.

by icedchai11 hours ago

You don't think these errors compound? Generated code has hundreds of little decisions. Yes, it "usually" works.
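As a rough illustration of how per-decision error rates could compound, assume (a simplification, not something claimed in the thread) that each of n decisions is independently correct with probability p. The chance that all of them are correct is then p^n:

```python
def all_correct_probability(p: float, n: int) -> float:
    """Probability that n independent decisions are all correct, each with probability p."""
    return p ** n

# Illustrative per-decision accuracies; n=100 stands in for "hundreds of little decisions".
for p in (0.99, 0.9):
    print(f"p={p}, n=100 -> {all_correct_probability(p, 100):.4f}")
```

Under that assumption, 99% per-decision accuracy over 100 decisions leaves only about a 37% chance of no mistakes at all. Real decisions are not independent and tests catch some errors, so treat this as intuition rather than a measurement.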

by russfink8 hours ago

LLMs: sometimes wrong but never in doubt.

by dyauspitr11 hours ago

Not in my experience. With a proper TDD framework it does better than most programmers at a company who anecdotally have a bug every 2-3 tasks.

by tranceylc7 hours ago

The kind of mistakes it makes are usually strange and inhuman though. Like getting hard parts correct while also getting something fundamental about the same problem wrong. And not in the “easy to miss or type wrong” way.

I wish I had an example saved for you, but it happens to me pretty frequently. Not only that, but it also usually does testing incorrectly at a fundamental level, or builds tests around incorrect assumptions.

by coldtea6 hours ago

Yes, just use random results. You’ve just saved yourself weeks or months of work of gathering actual results.

by heurist12 hours ago

Claude Opus 4.7 responds with walk for me with and without adaptive thinking, but neither the basic model used when you Google search nor GPT 5.4 does.

by TeMPOraL13 hours ago

Idk but ironically, I had to re-read the first part of GP's comment three times, wondering WTF they're implying a mistake, before I noticed it's the car wash, not the car, that's 50 meters away.

I'd say it's a very human mistake to make.

by magicalist12 hours ago

> I'd say it's a very human mistake to make.

>> It'll take you under a minute, and driving 50 meters barely gets the engine warm — plus you'd just have to park again at the other end. Honestly, by the time you started the car, you'd already be there on foot.

It talks about starting, driving, and parking the car, clearly reasoning about traveling that distance in the car not to the car. It did not make the same mistake you did.

by toraway9 hours ago

We truly do not need to lower the bar to the floor whenever an LLM makes an embarrassing logical error, particularly when the excuses don't line up at all with the reasoning in its explanation.

by thfuran13 hours ago

I don't want my computer to make human mistakes.

by AgentOrange123412 hours ago

It may be inescapable for problems where we need to interpret human language?

by jasonfarnon8 hours ago

then throw away the turing test

by scrollaway13 hours ago

then don't train it on human data

by 59nadir7 hours ago

LLMs do not have trouble reading, it didn't make the mistake you made and it wouldn't. You missed a word, LLMs cannot miss words. It's not even remotely a human mistake.

by russfink8 hours ago

Or, the first time a mistake is detected, a correction is automatically applied.

by 2 hours ago

deleted

by galaxyLogic6 hours ago

> I want to wash my car. The car wash is 50 meters away. Should I walk or drive?

I think no real human would ask such a question. Or if we do we maybe mean should I drive some other car than the one that is already at the car-wash?

A human would answer, "silly question". But a human would not ask such a question.

by psadauskas4 hours ago

A human totally would, as one of those brain-teaser trick questions. It's the same kind of question as "A plane crashes right on the border between the US and Canada. Where do they bury the survivors?" It's the kind of question you only get right if you pay close attention. Asking an AI that is like asking a 5-year-old. You're not asking to get an answer, you're asking to see if they're paying attention.

by jameshart4 hours ago

I was given to understand that attention is all you need.

by vintermann14 hours ago

Well, at least we know that's one gotcha/benchmark they aren't gaming.

by smooc13 hours ago

I'd say the joke is on you ;-)

by fragmede12 hours ago

I tried o3, instant-5.3, Opus 3, and haiku 4.5, and couldn't get them to give bad answers to the couch: stairs vs elevator question. Is there a specific wording you used?

by toraway9 hours ago

That's an example the LLM came up with itself while analyzing its failed car wash walk/drive answer, it's not OP's question.

by slekker13 hours ago

What about Qwen? Does it get that right?

by lambda13 hours ago

I've run several local models that get this right. Qwen 3.5 122B-A10B gets this right, as does Gemma 4 31B. These are local models I'm running on my laptop GPU (Strix Halo, 128 GiB of unified RAM).

And I've been using this commonly as a test when changing various parameters, so I've run it several times, these models get it consistently right. Amazing that Opus 4.7 whiffs it, these models are a couple of orders of magnitude smaller, at least if the rumors of the size of Opus are true.

by qingcharles12 hours ago

Does Gemma 4 31B run full res on Strix or are you running a quantized one? How much context can you get?

by lambda11 hours ago

I'm running an 8 bit quant right now, mostly for speed as memory bandwidth is the limiting factor and 8 bit quants generally lose very little compared to the full res, but also to save RAM.

I'm still working on tweaking the settings; I'm hitting OOM fairly often right now, it turns out that the sliding window attention context is huge and llama.cpp wants to keep lots of context snapshots.
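For a sense of the RAM side of that 8-bit trade-off, here is a back-of-the-envelope sketch. It assumes "31B" means 31e9 parameters and that "full res" means 16 bits per weight, and it counts the weights only: KV cache, context snapshots, and quantization metadata are ignored, so actual usage will be higher.

```python
GIB = 1024 ** 3  # bytes per GiB

def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone: parameters * bits / 8 bytes, in GiB."""
    return n_params * bits_per_weight / 8 / GIB

# 31e9 parameters is an assumption based on the model name, not a published figure.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gib(31e9, bits):.1f} GiB")
```

Halving the bits per weight halves both the footprint and the bytes that have to be streamed from memory per token, which is why the quant helps with speed on a bandwidth-limited machine as well as with RAM.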

by qingcharles11 hours ago

I had a whole bunch of trouble getting Gemma 4 working properly. Mostly because there aren't many people running it yet, so there aren't many docs on how to set it up correctly.

It is a fantastic model when it works, though! Good luck :)

by canarias_mate13 hours ago

[dead]

by throwup23815 hours ago

The p stands for putrification.
