This is the problem.
I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.
Think of it less like a static tool, and more like a human helper, where the same holds.
That said, I can't wait for LLMs to stop being AI and start being just another tool. Anything cursed with the "AI" label seems to go through this mess. In the earlier AI cycles, rules engines were considered "human-ish" and got hyped up, but today we see them as just another tool available to us, and we're better off for it.
From a horse's perspective, the internal combustion engine is just another tool for making scary noises and powering horse trailers to take me on fun horse adventures. So ... perhaps.
ChatGPT was obedient with the grill-me technique, just wrote a plan. Yesterday it started jumping to implementation. Why?
It's a very very bizarre way to use a computer.
Personally, I just don't. I'll use and prompt the LLMs the way that feels natural to me and move on with my life. Maybe I don't always get completely optimal results from them, but I'm also not spending half my day pleading with the computer to do a task.
The most important thing to be aware of in my opinion would be that Claude is better at UI design, and leaves a lot more comments in the code.
Other than that the results seem similar, at least functionally. I do not usually review the code style.
With humans it's actually good and worthwhile to create and strengthen connections. With an LLM, that's psychosis.
I don't think LLMs are people in any sense, at least as they're constructed now -- but they very much have what we would call "culture" and "personality" in suitably alien forms.
This is not the same as, e.g., feelings, experience, or humanity, or actual opinions or ideas (versus essentially "distilled vibes") and I feel that AI will more and more force us to confront that (including if new AIs are ever developed that may have the latter, as well!)
And if humans are anything, they are tool users.
Can be both. Use of some tools like LLMs might be more inducing psychosis than others like plain compilers or hammers.
>And if humans are anything, they are tool users.
To the point of self-destruction sometimes.
I really don't get it. Why is the fact that it outputs words so goddamn important to everybody? How does it suddenly make you so emotionally vulnerable? Does my brain work in a different way than the rest of humanity? Can't you disregard what's irrelevant? Is every programmer suddenly a Trump supporter that has no ability to recognize empty words? To recognize lies about emotions and facts?
Words are just input. Mostly garbage. Emotion-inducing words are garbage 10 times more often than any other. I could expect a romance reader to be affected, or somebody with an IQ of 70. But how is the caste of some of the most technical people ever afraid of catching psychosis just because they might read some words?
When we built the idea that anthropomorphising is wrong, we meant when doing it for rocks or trees or thunders or deer or some such.
We communicate with other humans using voice and three dimensional hand gestures. To use computers and early phones we had to learn to operate new input devices: keyboards and mice. Later with touchscreens we moved to two dimensional hand (finger) gestures. We're barely making voice commands work with our devices just recently.
Then, a large number of humans are figuratively tethered to their desks because the devices need power and a stable internet connection. Mobile devices break this relationship a bit but you still need to charge them and be close to some sort of access point. In any case, the devices encourage sitting in one place for hours at a time.
And this is just computers and smartphones. Humans adapted their entire lifestyles and transformed the landscape to cater to cars.
Was it? Think first about what it replaced. Lots of manual computation in bookkeeping and financial sectors. Telegrams and snail mail moved to email. Typesetting in books and magazines became easier and widely available,…
If there’s one thing you can’t say about computers, it’s that they’re limited.
The context was that technology should evolve to fit the humans [not the other way around]. And if contemporary technology didn't have limitations, it would be correct.
But it did and humans had to adapt to the computers. Humans had to develop and learn special languages so they could communicate with computers to do all those useful things you mentioned. Why? They were limited in understanding (or parsing) human languages. It took us decades before we could talk to computers in human languages. We're getting pretty close - especially in the past few years - but there's still some friction.
You may need to revisit your computation theory courses. Computers are the embodiment of a mathematical model and thus the inputs and outputs are formalized.
Do you just hold a pen and words are written automatically? Do you just hover your hands over a piano and have the moonlight sonata played? No, you have to do precise mechanical movements because that’s how the output is realized.
There’s no such things as words, sentences, keywords, statements at the computer level. What it does is symbol manipulation. You provide it a string of symbols, the rules for the manipulation, and it will provide a string of symbols as the output.
What symbols, what rules, are completely arbitrary. We just found that {1,0} are all that we needed as the set of symbols and that Context-Free Grammar is perfect for specifying the rules.
We still need to encode everything down to binary (ascii, unicode, bcd, floating points, pixel formats, PCM,…) and use a programming language (as defined by a grammar) to get the computer to do anything. Inference is made possible by those two mechanisms. It’s not a new computation model.
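To make the "encode everything down to binary" point concrete, here is a small Python sketch that turns text into a string of {1,0} symbols and back. The helper names `to_bits` and `from_bits` are made up for this example, and UTF-8 is just one of the encodings mentioned above.

```python
def to_bits(text: str) -> str:
    """Encode text as UTF-8, then render each byte as 8 binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def from_bits(bits: str) -> str:
    """Reverse of to_bits: parse each 8-digit group back into a byte."""
    return bytes(int(group, 2) for group in bits.split()).decode("utf-8")


encoded = to_bits("Hi")
print(encoded)             # 01001000 01101001
print(from_bits(encoded))  # Hi
```

Nothing in the bit string "knows" it is a word; the meaning lives entirely in the agreed-upon encoding rules.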
Realising this made me respect the "I" in "AI" a bit more seriously.
Maybe we need better reviewers then?
This presumes that the labs themselves know how well their models perform. But all they have are overtuned benchmarks and hype vibes.
Admittedly, yes, there's some overlap there.
They would have to admit 'seen it in the training data' as a factor, and that opens a can of worms.
They do not test how models perform when used interactively, like most of us do.
I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model version or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
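A minimal sketch of what "do this" could look like in practice, assuming you can wrap "send the prompt and check the answer" in a function that returns pass or fail. The `pass_rate` helper and the fake 70% check are illustrative stand-ins, not any provider's API.

```python
import math
import random
from typing import Callable


def pass_rate(run_once: Callable[[], bool], trials: int = 50) -> tuple[float, float]:
    """Run the same check `trials` times; return (pass rate, 95% margin of error)."""
    passes = sum(1 for _ in range(trials) if run_once())
    rate = passes / trials
    # Normal-approximation margin; crude for small samples or rates near 0 or 1.
    margin = 1.96 * math.sqrt(rate * (1 - rate) / trials)
    return rate, margin


# Stand-in for a real model call: a coin that passes about 70% of the time.
rng = random.Random(0)


def fake_model_check() -> bool:
    return rng.random() < 0.7


rate, margin = pass_rate(fake_model_check, trials=50)
print(f"pass rate {rate:.2f} +/- {margin:.2f}")
```

At 50 trials and a true rate near 70%, the margin is roughly 13 points either way, so a prompt tweak that "improves" results by 10 points on one batch may well be noise.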
There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).
The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…
It can be frustrating. The AI pretends to be a human, and so a part of my brain expects them to commit and have a "parti pris" like a human, so the exercise is a good reminder of the feedback loop. My mental model is that before the first three or four messages, the model has many finer points of its personality still underdetermined. I'd suggest that as the mechanism for "role-based prompting". And it explains the "savant sleeper agent" thing you describe. You want to get the state in the right attractor on the manifold.
These machines are pretty incredible, but for conversation-driven workflows you really have to be in the driver's seat. A human has a property that the AI does not have, at least under current architectures: we are regulated by the outside world. A bit of a tangent, but I can see how AI psychosis arises from these dynamics.
> We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
Any chance you could share some of these? Seems like something we could all benefit from.
First, I almost always try to seed every new project or context/domain with canonical technical specifications or examples I found elsewhere. When I set up this project recently, I linked to a bunch of the official Apple docs for sysctl, and told it to use a specific technique for calling assembly code from Go, that from experience it almost never realizes it can do or knows about (and similarly for sysctl, I knew it kinda sorta knew about it, but not in its entirety): https://github.com/accretional/sysctl/commit/da52438233e5b33...
The other thing I did was tell it to enumerate all the test cases ahead of time rather than to just directly implement them; again this is something where you have to explicitly tell it to go digging for information where it has blind spots and get it to set up properly grounded self-eval in a way that it can test against. I usually tell it to take notes as it works or commit notes to itself that will persist over sessions: https://github.com/accretional/sysctl/blob/main/FINDINGS_2.m...
Once we get back to working on this project we'll just have it implement / validate the rest of the sysctl feature support against the full inventory we had it uncover: https://github.com/accretional/sysctl/blob/main/cmd/darwin-n...
Another thing we do is have it specify an API that it can produce against; then in other projects we have them consume the API via reflection (and our special sauce we've been working on is the ability to discover and integrate against these automatically across thousands of APIs from many providers, which we've got working and can share if you're interested in using it as an early customer): https://github.com/accretional/sysctl/blob/main/proto/sysctl... This isn't the greatest example because it doesn't actually fully specify the sysctl keys yet. But I did have it create a knowledge base trying to cover the 1000+ keys as best as it could, to reference as it continued: https://github.com/accretional/sysctl/tree/main/macos-sysctl...
We have a better example in eg https://github.com/accretional/proto-sqlite/tree/main/lang where we were able to encode the entire sqlite grammar into a grpc interface so that you could eg find the exact structure (and sanitize) of a select statement: https://github.com/accretional/proto-sqlite/blob/main/lang/p... This way integration and discovery becomes a matter of telling it "use reflection against this endpoint to discover the sql interface, then implement against it" and we can model formats/input validation as formal grammars via EBNF (all magic words) vs just adhoc
We also tell it to set up and use a browser automation toolkit/testing and always run it at the end of testing workflows (often in a way that auto-opens screenshots on our local machines + commits them to git) via tools like https://github.com/accretional/chromerpc#headlessbrowser-aut... so that whenever we produce UIs it can evaluate its own output and iterate without direct human intervention. This is another case where the knowledge-discovery problem becomes a problem so we tell the models to use reflection to discover the browser automation apis. That ends up giving us things like this where it records user journeys through sites and creates visualizations without us having to debug them or do them ourselves: https://github.com/accretional/proto-css/tree/main/chrome-te...
Do note that I only use LLMs in the ChatUI, I never use agents. I don't believe having a blackbox codebase managed by entities with a half-life of 'delete conversation' or 200k tokens is a responsible idea. In ChatUI, I lay the ground rules, kill assumptions about our working relationship, give it foundational context on the problem and codebase we're working on, explain the problem and then we have a conversation about it and I gradually disclose more context as it becomes relevant. So, to directly answer your question, maybe I'm missing out on a ton of upside by not using the absolute best but I'd say familiarizing yourself with a specific model has all the benefits of having a human friend you've grown up with... except your buddy's a savant and would absolutely love to help!
Or it is more like playing a slot machine and you imagine the rest.
Maybe it works some of the time but it isn't a solution that works every time.
It reminds me of people hovering to play a slot machine when someone gets up and it hasn't paid out as if they've solved slot machines.
While I don't mind putting something in a loop until the tests pass, I'm less comfortable doing that when providers are silently rerouting to lower quality models, or in Google's case burning quota faster to ease their own server load without being transparent about what the "standard limits" are to begin with. [1]
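For reference, "putting something in a loop until the tests pass" can be sketched roughly as below. Both `generate` and `run_tests` are hypothetical callables you would supply yourself (a model call and a test runner); the attempt cap is there so the loop cannot burn quota indefinitely.

```python
from typing import Callable, Optional


def loop_until_green(
    generate: Callable[[str], str],
    run_tests: Callable[[str], tuple[bool, str]],
    task: str,
    max_attempts: int = 5,
) -> Optional[str]:
    """Ask for a candidate, test it, feed failures back; give up after max_attempts."""
    feedback = ""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(task + feedback)
        ok, log = run_tests(candidate)
        if ok:
            return candidate
        # Append the failure log so the next attempt can see what went wrong.
        feedback = f"\n\nAttempt {attempt} failed with:\n{log}"
    return None


# Fake stand-ins so the sketch runs on its own: the "model" succeeds on try 3.
calls = []


def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return f"candidate-{len(calls)}"


def fake_run_tests(candidate: str) -> tuple[bool, str]:
    return candidate == "candidate-3", "1 test failed"


print(loop_until_green(fake_generate, fake_run_tests, "fix the parser"))  # candidate-3
```

The loop only tells you that the tests eventually passed, not which model actually served each attempt, which is exactly the transparency worry above.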
I'm hopeful I'll be more comfortable with these "slot machines" when frontier models get to the point where they can be run locally on hardware I can actually afford so I know exactly what I'm getting and not jumping at shadows with providers playing tricks behind the scenes to ease their own load without admitting the customer is getting less for their money as they get more popular.
[1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...
Last I saw, engineers working at OpenAI denied this on HN.
I saw that someone set up a tracker that aims to record the performance of the models, and so far it has not shown any statistically significant deviation in performance for Codex, and not yet enough data for Claude: https://marginlab.ai/trackers/codex/
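I don't know how that tracker computes significance, but as a rough idea of what "statistically significant deviation" means for pass rates, here is a standard two-proportion z-test in plain Python. The counts in the example are invented.

```python
import math


def two_proportion_z(pass_a: int, n_a: int, pass_b: int, n_b: int) -> float:
    """z statistic for the difference between two pass rates (pooled variance)."""
    p_a, p_b = pass_a / n_a, pass_b / n_b
    pooled = (pass_a + pass_b) / (n_a + n_b)
    # Undefined (division by zero) if every run passed or every run failed.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_a - p_b) / se


def two_sided_p(z: float) -> float:
    """Two-sided p-value under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Invented numbers: 80/100 tasks passed last week, 72/100 this week.
z = two_proportion_z(80, 100, 72, 100)
p = two_sided_p(z)
print(f"z = {z:.2f}, p = {p:.2f}")
```

With these made-up counts the p-value lands well above 0.05, so an 8-point drop on 100 tasks per week would not count as evidence of degradation on its own.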
The firm [Anthropic] would deliberately degrade the model’s performance in ways that were invisible to the user.
For such a thing to be useful, it's enough that it works substantially more often than not having those instructions in.
Playing a B on a saxophone always plays a B.
But your analogy remains solid if you substitute e.g. a piano and a reasonably proficient player. A single note would be nearly indistinguishable between players... But a full piece most certainly will sound different.
The original take was "LLMs are very much like playing an instrument". I think they are very much NOT like playing an instrument.
While different musicians will produce different results, one musician won't get drastically different results on different days or when trying a different "copy" of the same instrument. If you can play the violin on your violin and I lend you my violin, you will still be able to play very consistently. You may argue that the sound will differ and you will have to adapt slightly, but that's not remotely similar to the randomness coming from LLMs.
That's only if both violins are tuned the same way, and one must continually tune them lest they get out of sync.
Similarly, an LLM can be extremely consistent if tuned properly -- indeed, if you fix the weights and settings, they can be made "essentially deterministic" for many prompts!
This is because LLMs have aspects of chaotic dynamical systems, where small changes in initial conditions can lead to vastly different outcomes. That property is independent from nondeterminism.
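The distinction between "chaotic" and "nondeterministic" can be seen in a toy system that has nothing to do with LLMs: the logistic map. The sketch below is only an analogy; the function name and starting values are arbitrary.

```python
def logistic_orbit(x0: float, steps: int, r: float = 4.0) -> list[float]:
    """Iterate x -> r * x * (1 - x), a fully deterministic rule."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs


a = logistic_orbit(0.2, 50)
b = logistic_orbit(0.2, 50)
c = logistic_orbit(0.2 + 1e-10, 50)  # almost the same starting point

print(a == b)  # True: identical input gives identical output, every time
# Largest gap between the two nearly identical starts over 50 steps:
print(max(abs(x - y) for x, y in zip(a, c)))
```

There is no randomness anywhere in this code, yet a change in the tenth decimal place of the input ends up producing a completely different trajectory.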
You know what we are talking about. Tuning, poor playing, all of that is mild variation from what we know it is supposed to do every time and we can target the notes they are supposed to hit consistently. You're comparing slight tonal variations to completely different outputs from the same inputs. If I hit a "C" on the piano, it is going to play "C." If it does not, then the piano is not functioning properly. LLMs for some reason get a pass on this and it makes them very distinct from musical instruments.
This feels like a very nitpicky steel man, not a productive attempt at discussion.
LLMs do not operate consistently and make their own errors while we argue about which incantation makes them less inconsistent, knowing they will never actually perform as expected.
I played woodwinds regularly for 15 years so I feel fine with my example.
Instruments present a clear interface to a user, have predictable outputs, etc.
The only comparison that might work for me is that LLMs are very bad instruments where you are constantly forced to negotiate their idiosyncrasies in order to massage the output you want from them, and even then there is enough randomness that trying to do so is almost a fool's errand.
I also think it's disingenuous to call LLMs "tools" in the stricter sense of the definition, but I've mostly given up trying to convince people of this. Main reason being that a terrible writer and a gifted writer can produce similar outputs, and for the terrible writer it will be above their average, and for the gifted writer it will be below what they could produce with full control.
> With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.
I agree with all of this except for one thing: I swear to god, being mean to Claude at the right time can be enormously effective. The F-bomb in particular seems to really help it snap out of ruts sometimes.
No, don't f***ing do that! What part of "[previous instruction]" don't you f***ing understand? I am extremely angry and disappointed by your inability to [whatever]. Do better please.
> maybe it could trigger more defensive responses with argumentation to explain its conclusions.
Quite the opposite, it makes the model extremely conciliatory, which in this situation is what I want. If you're hoping to make the model less sycophantic, this is the wrong tool.
I asked Claude Sonnet 4.6 for the same thing, and Claude's version was like if the game had been written in JS originally.
Also, for some reason it made it a single HTML file, removed all assets, dynamically generated graphics and dynamically generated music. It also gave me a new, better background.
This surprised me, since it was not what I asked for. I just asked it to port the game.
I was pretty pleased about the choices it made, but I'm not sure how to turn that behavior on and off. Sometimes you want it to be creative, sometimes you want it to actually do what you said.
You have to be a lot more explicit but it’s hard to know a priori what decisions it’ll make. A good idea is to run it in plan mode so you can read those decisions before it sets out on a path and have an opportunity to make corrections.
What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.
It's like taking the engine out of each car, putting it on a test bed and running it, and then deciding whether the car is good or bad based on the graphs the test bed provided.
You might have the best engine in the world, but if you put it in a shit car, the result is still bad. The seats are squeaky plastic, the infotainment is touch-only and you can't put on your seatbelt without knocking down whatever is in the cupholder.
It is a fundamentally hard problem to solve.
Take AI out of the picture for a moment. What makes someone a good coder? What makes someone intelligent? How do you evaluate those skills?
Of course we have standardized tests, and they're useful, but they're also imperfect. And they become especially imperfect when people start training for the tests specifically—which is, essentially, benchmaxxing.
We have never been able to quantitatively measure most skills to a high degree of accuracy, despite centuries of trying. That's not going to change now.
(I don't mean to anthropomorphize the LLMs, but I do think they're like humans in this way.)
When I use a calculator, I know exactly what it does and what it is supposed to do. It always gives me a verifiable, predictable result. If I input “8+8” 10,000x it will give me “16” 10,000x outside of incredibly fringe edge cases/bugs. I can’t say the same for LLMs.
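A toy illustration of where the difference comes from, assuming the usual picture of a model scoring candidate tokens and then sampling one. The scores below are invented and `sample_token` is a made-up helper, not a real library call.

```python
import math
import random


def sample_token(logits: dict[str, float], temperature: float, rng: random.Random) -> str:
    """Pick a token: argmax when temperature is 0, otherwise softmax sampling."""
    if temperature == 0:
        return max(logits, key=lambda token: logits[token])
    weights = [math.exp(score / temperature) for score in logits.values()]
    return rng.choices(list(logits), weights=weights, k=1)[0]


logits = {"16": 2.0, "sixteen": 1.5, "17": 0.5}  # invented scores for "8+8="
rng = random.Random(0)

greedy = {sample_token(logits, 0, rng) for _ in range(10_000)}
sampled = {sample_token(logits, 1.0, rng) for _ in range(10_000)}

print(greedy)        # {'16'}
print(len(sampled))  # more than one distinct answer shows up
```

The calculator behaves like the temperature-0 branch all the time; a chat model is typically run in something closer to the sampling branch.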
With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.
So almost by design that particular benchmark is tainted, and benchmarks recall rather than reasoning.
I haven’t seen details of LLM benchmarks’ data sets but I would suppose that “questions” are public so known in advance therefore you can tune a model as much as possible.
One real benchmark is drawing a pelican - https://github.com/simonw/pelican-bicycle - Simon Willison made it for his LLM tests.
If you really want to find a model that works for your specific purpose, I would recommend several rounds at arena.ai - comparing models anonymously helps you pick one without confirmation bias.
Some ppl: Claude is the best! Others to them: but Qwen is the best! Or… Codex is better! …
it all depends on the language (English, Dutch, French…), style of querying (caveman, specs, skills, goal etc.)
Even with the same model I get different answers to the same prompt that is just tweaked a little.
So benchmarks are nice but mostly useless.
Without your use case it is just a reference number indicating the approximate position of that model among the others. And for those who want to make money it is a marketing tool to sell more as every customer counts.
One good analogy is the Macbook vs generic windows laptop debate online.
The engineer mind just compares numbers, the Lingwoo laptop from Amazon has the biggest numbers for everything and the lowest price. Ergo it is the best.
But the numbers don't measure the fact that the Lingwoo creaks and squeaks when you lift it due to the cheap plastic. It also runs at 100C when both CPU and GPU are fully utilised. The keyboard feels like a membrane keyboard from a milspec device from the 90s. Numbers also don't measure the fact that Lingwoo is an alphabet soup whitelabel manufacturer that won't exist in any legal capacity in 6 months so good luck with any warranty issues.
There will be an identical laptop called Chongwin being sold though. Completely different company, definitely.
--
The same applies to LLMs. You can do benchmarks like ask them to one-shot different kinds of gotcha questions (car wash, strawberry and other idiotic ones) or get them to write different kinds of programs.
But that doesn't measure the UX of doing so at all. How many times do you actually need any of those when you're actually working?
It's like unit testing an application. Every function can have 100% test coverage and the app can still be shit because there are things you can't unit test for.
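A tiny, contrived example of that gap: the function below has 100% line coverage from its one test and still falls over on an input nobody thought to try.

```python
def average(values: list[float]) -> float:
    """Mean of a list. The single line below is fully covered by the test."""
    return sum(values) / len(values)


def test_average() -> None:
    assert average([2.0, 4.0]) == 3.0  # passes, and coverage reports 100%


test_average()

# Still broken for a case the test never exercised:
try:
    average([])
except ZeroDivisionError:
    print("crashes on empty input despite full coverage")
```

Benchmarks have the same blind spot: a green score only says the measured cases went well.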
One can always measure whatever they wonder about. That doesn't mean the measure will be trustworthy, or that anything built on it will be any better than wet-finger judgement.
Even songs that break the "rules" of music can be subjectively good, either because they broke the rules or despite it.
Or with cars, a car that's beautiful to one person is the ugliest piece of trash on the street. Some people want a super soft ride where their espresso martini doesn't even vibrate when gunning it through a gravel road and others want to feel every grain of sand on the asphalt in their buttocks. Neither is "correct" and there is no objective measurement for ride comfort.
I have my own "interview questions" for models where I give them a premade Git repo and a problem to solve. Then, I rate them like a teacher. I believe others do that as well, so we only need a reliable system to aggregate these results.
The only way to make it fair is to have the model provider give some benchmarking org the weights + inference engine, so that the model can be run in complete isolation and no information about the benchmark is leaked.
Though I guess for a 'random' person's benchmark that hides between all other requests it's probably ok.
It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.
This has been my experience with most models. If you say "How do I do X? I was thinking maybe Y or Z" then the model will probably try to make Y or Z work. They will very likely not say some third option that is wildly different is better, even if it may be. And actually maybe less so with Claude because sometimes it pushes back.
Actually this seems like it would be an interesting test. Maybe I will come up with some contrived question and ask several models.
there was something on HN a few weeks ago about how most/all models perform better the more rude you are to them.
(i still say "please", i can't help it)
IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.
BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.
My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]
As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.
But I won't give any creative open-ended tasks to any other model than Claude.
[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...
On one hand I’m glad Anthropic is only just now starting to get into infrastructure because it means there’s opportunity there, but it’d be great for their models to be more knowledgeable or able to seek out that knowledge on their own, or for the UX of Claude code to be more amenable to launching 5 in parallel and picking the best one, so I don’t have to spend time arguing with a robot. I think there’s a much better balance to strike between just charging ahead towards the goal at all costs vs being lazy and pushing everything back up to the user. Basically they write too much code that’s too contingent/brittle outside its exact current context and don’t do a good job distilling out the essence of the problem “cleanly”. Almost all of them are like this right now, it’s partially a problem with long-range planning but I think a real bias from over optimization for certain RLVR outcomes vs others.
Gemini CLI at work has the same issue: it'll prefer hacking your workstation over just asking you how to proceed.
I think the harnesses are set up to have a bias to action, otherwise the LLM would just stop all the time when doing trivial tasks, but it also means they'll keep going when the "obvious" path is to just prompt the user.
I often tell it to stop asking me and just keep going until it accomplishes X task; unfortunately it tends to assume I want something that only just barely works, in the sense that it means it's time to stop once it's there, which I don't think a harness by itself could easily address (ultimately the model itself needs to determine the stopping points unless I literally specify by hand hidden evaluation criteria).
That's why I think it's at least partially a training issue where the model gets rewarded for "solving" the problem within a certain amount of context/time without access to grounded knowledge (eg looking up the actual spec for a format) nor adversarially/rigorously evaluated against a reviewer capable of finding all the edge cases/shortcuts preventing something from being a properly generalized solution. I don't want it to ask me for guidance when it's working on a well-specified problem, I want it to either find the right parser and use it, or to completely implement one against the spec, rather than write some half-assed string inserter that eg only works on the specific select statements my examples use right now. My understanding is that the Mythos/Fable models were better trained for this but from my brief foray into using Fable for work I wasn't that impressed. For me they need to get better at agentic search and self-eval still.
Having a reliable shared memory for hundreds of agentic AI users is something that's 95% snake oil at the moment. There are a few successes on an individual level (I really like Hermes[0]) but nothing scales to a company level easily.
It should be possible to (pre)configure all agentic harnesses used in a company to use a single source for information so that it'd automatically pick up internal libraries, conventions, licensing decisions etc and remember them across sessions.
I've had limited success with this on a personal level, but it's still not ingrained in the model because it would really need a custom harness. Hooks, skills, prompts get you like 80% of the way. I still need to do a "please check that the project matches the conventions defined in ..." regularly to catch any drift - especially on more vague stuff that can't be locked down with unit testing.
Fable seemed less apt to do so but I didn't get enough time with it before it was yanked away to know for sure. It may have had mixed results on the benchmarks but it was finding bugs opus never found.
If you can't show ROI there's literally no reason to ever switch anything.
this is what 'tokens are commodities' and 'there is no moat' people miss. the models are in general not easily swapped out. you always have to run evals before you can swap them around, tune prompts etc. even minor versions of models from same providers need this process.
> this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>
can go a long way.
of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.
... That does sound like something that Anthropic would deliberately aim for, yeah.
> With GPT, you have to be precise and reduce ambiguity.
I have found that it occasionally makes a wild misinterpretation, that makes a bit of sense in retrospect given how I worded something but is still surprising.
It also sometimes tries to loop in and tie together ideas from earlier in the conversation that really shouldn't still appear relevant. But that might be a general LLM thing.
Classify under non-reproducible artifacts of LLM generation.