I want to get my car's transmission fixed; the mechanic shop is 50 meters away. Would it be better to walk or to drive to a location 50 meters away from my front door?
-
Opus 4.6:
<thinking> The user is asking whether to walk or drive 50 meters to a mechanic shop, but there's a key nuance here - their transmission needs fixing, so driving might not be wise.
Done </thinking>
Walk, absolutely. Driving a car with a compromised transmission even 50 meters risks making the problem worse and turning a repair into a more expensive one. Plus, 50 meters is barely a minute on foot.
Opus 4.6 (not Extended Thinking):
Drive. You'll need the car at the car wash.
My Opus vs your Opus, which is smarter?!
With that randomness come statistically insignificant results: a single sample tells you almost nothing about which model is smarter.
"At 50 meters, just walk. By the time you start the car, back out, and park again, you'd already be there on foot. Plus you'll need to leave the car with them anyway."
Even beyond model alignment, it's not difficult to envision such capabilities being used for censorship, information operations, etc.
Every major inference provider more or less explicitly states in their consumer ToS that they comply with government orders and even share information with intelligence agencies.
Claude, Gemini, ChatGPT, etc are all one national security letter and gag order away from telling you that no, the president is not in the Epstein files.
Remember, the NSA already engaged in an unconstitutional criminal conspiracy (as ruled by a federal judge) to illegally conduct mass surveillance on the entire country, lie about it to the American people, and lie about it to Congress. This is the same organization that used your tax money to bribe RSA Security into standardizing a backdoored CSPRNG in what was, at the time, a widely used cryptographic library. What's the harm in a little minor political censorship compared to the unconstitutional treason these predators are usually up to?
That's who these inference providers contractually disclose their absolute fealty to.
(The final scene involves our plucky escapees swimming across a river to escape. The AIbot conjures up a speedboat through sheer powers of deduction, but then just when all seems lost it heads back to find a goat to pick up)
One thing that my use of the latest and greatest models (Opus, etc.) has made clear: no matter how advanced the model, it is not beyond making very silly mistakes regularly. For a while, Opus was even handling tool calls worse than Sonnet and Haiku for me.
At this point I am convinced that the only proper use of LLMs for development is to assist coding (not take it over): pair development, with the model on a tight leash, approving most edits manually. There is probably nothing anyone can say to convince me otherwise.
Any attempt to automate beyond that has never worked for me and is very unlikely to be productive any time soon. I have a lot of experience with these models and with various approaches to using them.
A proper AGI would have to have knowledge in the video, image, audio, and text domains to work properly.
Comparing one-shot results is a foolish way to evaluate a statistical process like LLM answers; we need multiple samples.
For https://generative-ai.review I take at least three samples of output. This often yields very different results even from the same query.
e.g: https://generative-ai.review/2025/11/gpt-image-1-mini-vs-gpt...
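The multi-sample approach described above can be sketched in a few lines: query the same prompt several times, then report the modal answer and how often it appeared, rather than judging from one shot. This is a minimal sketch; `query_model` is a hypothetical stand-in for whatever inference API you use, stubbed here with a seeded RNG to mimic a stochastic model.

```python
import random
from collections import Counter

def query_model(prompt, seed):
    # Hypothetical stand-in for a real inference call; a stochastic
    # model can return different answers to the identical prompt.
    rng = random.Random(seed)
    return rng.choice(["walk", "walk", "walk", "drive"])

def sample_answers(prompt, n=3):
    # Take n independent samples instead of trusting a single shot.
    return [query_model(prompt, seed=i) for i in range(n)]

def majority_answer(answers):
    # Report the most common answer and its share of the samples.
    (answer, count), = Counter(answers).most_common(1)
    return answer, count / len(answers)

answers = sample_answers("Walk or drive 50 meters?", n=5)
print(majority_answer(answers))
```

With three or more samples per query, a disagreement between "my Opus" and "your Opus" often turns out to be ordinary sampling variance rather than a real capability gap.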