Me:

I want to get my car's transmission fixed, the mechanic shop is 50 meters away. Would it be better to walk or to drive to a location 50 meters away from my front door?

-

Opus 4.6:

<thinking> The user is asking whether to walk or drive 50 meters to a mechanic shop, but there's a key nuance here - their transmission needs fixing, so driving might not be wise.

Done </thinking>

Walk, absolutely. Driving a car with a compromised transmission even 50 meters risks making the problem worse and turning a repair into a more expensive one. Plus, 50 meters is barely a minute on foot.

reply
That's not what I got.

Opus 4.6 (not Extended Thinking):

Drive. You'll need the car at the car wash.

reply
Also what I got. Then I tried changing "wash" to "repair" and "car wash" to "garage" and it's back to walking.
reply
> That's not what I got.

My Opus vs your Opus, which is smarter?!

reply
LLMs don't deterministically return the statistically most common token from their training data; they apply random jitter (sampling temperature) when choosing tokens.

With that randomness, any single result is statistically irrelevant.
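
The "random jitter" being described is sampling temperature. Here is a minimal sketch in plain Python (hypothetical two-token logits, no real model) of how temperature-scaled softmax sampling makes the same prompt yield different answers from run to run:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Sample a token index from raw logits using temperature scaling.

    As temperature approaches 0 this becomes greedy decoding (always the
    most likely token); at temperature 1.0 less likely tokens still get
    picked sometimes, which is why reruns of a prompt can disagree.
    """
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r <= cumulative:
            return i
    return len(probs) - 1  # guard against floating-point rounding

# Hypothetical logits: index 0 = "walk", index 1 = "drive".
# At temperature 1.0 the softmax gap for logits 2.0 vs 0.5 is only
# about 82% / 18%, so "drive" still comes up every few runs; near
# temperature 0, "walk" wins essentially every time.
```

So a "wrong" answer on some fraction of runs is exactly what this math predicts, independent of any monkey-patching.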

reply
Am I the only one who thinks these people are monkey-patching embarrassments as they go? I remember the "r's in strawberry" thing they suddenly were able to solve, only to then fail on "raspberry".
reply
Nah. It's just non-deterministic. I'm here 4 hours later and here's the Opus 4.6 (extended thinking) response I just got:

"At 50 meters, just walk. By the time you start the car, back out, and park again, you'd already be there on foot. Plus you'll need to leave the car with them anyway."

reply
They definitely do: at least OpenAI "allegedly" has whole teams scanning socials, forums, etc. for embarrassments to monkey-patch.
reply
Which raises the question of why this isn't patched already. We're nearing 48 hours since this query went viral...
reply
This is part of why they need to be so secretive. If you can see the tree of hardcoded guidance for common things it won’t look as smart.
reply
They should make Opus Extended Extended, which routes the query to an actual person in a low-cost country.
reply
Artificial AI.
reply
Yes, you're the only one.
reply
Sure, there are many very naive people so ignorant of the IT industry that they don't know about the decades of vendors caught monkey-patching and rigging benchmarks and tests for their systems, but even so, the parent is hardly the only one.
reply
Works better on Reddit, really.
reply
Of course they are.
reply
That's my thought too. The chatbot bros probably feel the need to be responsive, and there's probably an express lane to update a trivia file or something, lol.
reply
No doubt about it, and there's no reason to suspect this can only ever apply to embarrassing minor queries, either.

Even beyond model alignment, it's not difficult to envision such capabilities being used for censorship, information operations, etc.

Every major inference provider more or less explicitly states in their consumer ToS that they comply with government orders and even share information with intelligence agencies.

Claude, Gemini, ChatGPT, etc are all one national security letter and gag order away from telling you that no, the president is not in the Epstein files.

Remember, the NSA already engaged in an unconstitutional criminal conspiracy (as ruled by a federal judge) to illegally conduct mass surveillance on the entire country, lie about it to the American people, and lie about it to Congress. The same organization used your tax money to bribe RSA Security into standardizing a backdoored CSPRNG in what at the time was a widely used cryptographic library. What's the harm in a little bit of minor political censorship compared to the unconstitutional treason these predators are usually up to?

That's who these inference providers contractually disclose their absolute fealty to.

reply
That you got different results is not surprising. LLMs are non-deterministic, which is both a strength and a weakness.
reply
We know. We know these things aren't deterministic. We know.
reply
Lmao, and this is what they are saying will be an AGI in 6 months?
reply
There's probably a comedy film with an AGI attempting to take over the world with its advanced grasp of strategy, persuasion and SAT tests whilst a bunch of kids confuse it by asking it fiendish brainteasers about carwashes and the number of rs in blackberry.

(The final scene involves our plucky escapees swimming across a river to escape. The AIbot conjures up a speedboat through sheer powers of deduction, but then just when all seems lost it heads back to find a goat to pick up)

reply
There is a Soviet movie, "Teens in the Universe" [0], where teens cause robots' brains to fry by giving them linguistic logical puzzles.

[0]: https://en.wikipedia.org/wiki/Teens_in_the_Universe

reply
This would work if it wasn’t for that lovely little human trait where we tend to find bumbling characters endearing. People would be sad when the AI lost.
reply
Maybe infusing the AI character with the boundless self confidence of its creators will make it less endearing :)
reply
What’s wrong with having a bittersweet movie?
reply
In the excellent and underrated The Mitchells vs the Machines there's a running joke with a pug dog that sends the evil robots into a loop because they can't decide if it's a dog, a pig or a loaf of bread.
reply
This theme reminds me of Blaine the Mono from the Dark Tower series
reply
There is a Star Trek episode where a fiendish brainteaser was actually considered as a means to genocide an entire (cybernetic, not AI) race. In the end, Captain Picard chose not to deploy it.
reply
But “PhD level” reasoning a year ago.
reply
Yes, get ready to lose your job and cash your UBI check! It's over.
reply
They seem to have stopped talking about AGI and pivoted to ads and smut.
reply
This was probably wise, because ads and smut are well understood, and known to exist.
reply
Laughable indeed.

One thing that my use of the latest and greatest models (Opus, etc.) has made clear: no matter how advanced the model, it is not beyond making very silly mistakes regularly. For a while, Opus was even working worse with tool calls than Sonnet and Haiku for me.

At this point I am convinced that the only proper use of LLMs for development is to assist coding (not take it over): pair development, with them on a tight leash, approving most edits manually. There is probably nothing anyone can say to convince me otherwise.

Any attempt to automate beyond that has never worked for me and is very unlikely to be productive any time soon. I have a lot of experience with these models and with various approaches to using them.

reply
Well in fairness, the "G" does stand for "General".
reply
In fairness, they redefined it away from "just like a person" to "suitable for many different tasks".
reply
I think this lack of "G" (generality, or modality) is the problem. A human visualizes this kind of problem (a little video plays in my head of taking a car to a car wash). LLMs don't do this; they "think" only in text, not visually.

A proper AGI would have to have knowledge in the video, image, audio, and text domains to work properly.

reply
Then show me a robotic kitten in six months, one as smart and as capable of learning.
reply
That’s without reasoning I presume?
reply
4.6 Opus with extended thinking just now: "At 50 meters, just walk. By the time you start the car, back out, and park again, you'd already be there on foot. Plus you'll need to leave the car with them anyway."
reply
Not the parent poster, but I did get the wrong answer even with reasoning turned on.
reply
Thank you all! We needed further data points.

Comparing one-shot results is a foolish way to evaluate a statistical process like LLM answers; we need multiple samples.

For https://generative-ai.review I do at least three samples of output. This often yields very different results even from the same query.

E.g.: https://generative-ai.review/2025/11/gpt-image-1-mini-vs-gpt...
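
The multi-sample idea can be sketched in a few lines of Python. Note that `ask_model` is any callable standing in for a real LLM API call, and `flaky_model` is a hypothetical stub, not a real model:

```python
import random
from collections import Counter

def majority_answer(ask_model, prompt, n_samples=3):
    """Query a non-deterministic model n_samples times and tally answers.

    ask_model is any callable taking a prompt and returning a string;
    here it stands in for a real LLM API call.
    """
    tally = Counter(ask_model(prompt) for _ in range(n_samples))
    answer, votes = tally.most_common(1)[0]
    return answer, votes, tally

# Hypothetical stub model: answers "walk" ~80% of the time, "drive"
# otherwise, mimicking the thread's inconsistent replies.
def flaky_model(prompt, rng=random.Random(42)):
    return "walk" if rng.random() < 0.8 else "drive"
```

With enough samples, the majority answer reflects what the model says most of the time, rather than whichever branch of the lottery a single run happened to land on.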

reply