[-]

>I would love to have a product sheet showing what each model's strengths and weaknesses are, so that I can have a clear decision tree of "if this kind of work, use model X", or "model Y should be used in ways Z". But they all look the same from the outside and the only way to figure out which might be marginally better at what is to do extensive, time-consuming, and perhaps expensive testing.

Think of it less like a static tool, and more like a human helper, where the same holds.

by mahidhar6 hours ago|

[-]

Well, unlike a human, I cannot expect any of these LLMs to take any ownership of the work they do. I cannot expect any given model and version (sonnet 4.6) to learn, improve and adapt over time. I cannot expect its limitations to ever go away at the model level. So it is not like a human in most ways that I actually care about.

That said, I can't wait for LLMs to stop being AI and start being just another tool. Anything cursed with the "AI" label seems to go through this mess. In the earlier AI cycles, rules engines were considered "human-ish" and got hyped up, but today we see them as just another tool available to us, and we're better off for it.

by squidbeak2 hours ago|

[-]

You're on the hook for their work in the way a manager is for their staff's output. The insistence of AI being a mere tool very often comes with this strange desire to be free of responsibility for its work. People seem to forget that the big advantage in these things is the range they have for obscure insight and creative solutions, both impossible with determinism.

by themgt3 hours ago|

[-]

> That said, I can't wait for LLMs to stop being AI and start being just another tool.

From a horse's perspective, the internal combustion engine is just another tool for making scary noises and powering horse trailers to take me on fun horse adventures. So ... perhaps.

by kolinko4 hours ago|

[-]

models don’t improve, but harnesses/tools/rules around them grow with the project.

by ACCount379 hours ago|

[-]

One issue with that is that human helpers last longer. LLMs cycle in and out in months, and what held for Your Favorite LLM 6.7 may not hold for Your Favorite LLM 6.9.

by renegade-otter6 hours ago|

[-]

Right, this is why I would slam the brakes on investing all of your time and effort into your workflow, because 2 months from now it may be out the window. Frontier models are also constantly being tweaked, so what worked yesterday may be off today.

ChatGPT was obedient with the grill-me technique, just wrote a plan. Yesterday it started jumping to implementation. Why?

by HappySweeney6 hours ago|

[-]

I find that when an LLM jumps into tasks it was not told to do (or even worse, does things it was explicitly told not to), it is a good sign the context is too full, and you should do a controlled hand-off to a new instance.
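A minimal sketch of what a controlled hand-off could look like in a script that drives the model. Everything here is an assumption for illustration: the 200k window, the 60% threshold, the four-characters-per-token estimate, and the helper names are all hypothetical, not values from any provider's documentation.

```python
from typing import List

CONTEXT_LIMIT_TOKENS = 200_000  # assumed window size, adjust for your model
HANDOFF_FRACTION = 0.6          # assumed threshold, not a documented value


def estimate_tokens(text: str) -> int:
    """Crude heuristic: roughly four characters per token."""
    return max(1, len(text) // 4)


def should_hand_off(messages: List[str]) -> bool:
    """True once the estimated usage crosses the chosen fraction of the window."""
    used = sum(estimate_tokens(m) for m in messages)
    return used >= CONTEXT_LIMIT_TOKENS * HANDOFF_FRACTION


def build_handoff_note(goal: str, decisions: List[str], open_items: List[str]) -> str:
    """Summarize state so a fresh instance can pick up without the old context."""
    lines = [f"Goal: {goal}", "Decisions so far:"]
    lines += [f"- {d}" for d in decisions]
    lines.append("Still to do:")
    lines += [f"- {item}" for item in open_items]
    return "\n".join(lines)
```

The note is deliberately short: the point of the hand-off is to carry over decisions and open items, not the whole transcript.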

by renegade-otter5 hours ago|

[-]

I wipe my context relentlessly. I never have long-running conversations. In and out like Seal Team Six.

by madeofpalk9 hours ago|

[-]

Except, where every different model and version is like a different person where you need to learn their idiosyncrasies of how they work every other month.

It's a very very bizarre way to use a computer.

Personally, I just don't. I'll use and prompt the LLMs the way that feels natural to me and move on with my life. Maybe I don't always get completely optimal results from them, but I'm also not spending half my day pleading with the computer to do a task.

by user439287 hours ago|

[-]

I also don't think I need to prompt Claude differently than Codex.

The most important thing to be aware of in my opinion would be that Claude is better at UI design, and leaves a lot more comments in the code.

Other than that the results seem similar, at least functionally. I do not usually review the code style.

by cassianoleal8 hours ago|

[-]

They are not human. Humans have names, faces, voices, personality, a personal history, family, care for whatever they call their community.

With humans it's actually good and worthwhile to create and strengthen connections. With an LLM, that's psychosis.

by tekne8 hours ago|

[-]

To be fair: a voice, personality, and personal history sounds a lot like training data.

I don't think LLMs are people in any sense, at least as they're constructed now -- but they very much have what we would call "culture" and "personality" in suitably alien forms.

This is not the same as, e.g., feelings, experience, or humanity, or actual opinions or ideas (versus essentially "distilled vibes") and I feel that AI will more and more force us to confront that (including if new AIs are ever developed that may have the latter, as well!)

by epicepicurean5 hours ago|

[-]

They are not human, but it helps to prompt them similarly. See: https://www.anthropic.com/research/emotion-concepts-function

by anthonyrstevens4 hours ago|

[-]

Good read. Thanks for sharing.

by Wowfunhappy7 hours ago|

[-]

They're not human. But they are trained on human language, and thinking of them as similar to a human helps me work with them effectively.

by malwrar7 hours ago|

[-]

These things passing the Turing Test makes anthropomorphizing their behavior awkward, but don’t forget it’s just an analogy to communicate an experience. If you convey a certain written voice to these models in your input, you get a somewhat consistent end effect. I think that’s all that is being communicated.

by scotty798 hours ago|

[-]

If you have a toolbox full of similar but different tools, getting to know them is a prudent thing to do, not a psychosis. There's no connection because the tool is immutable (except for adjustments you made) but you do develop a specific relation with that tool. Some people even love some of their tools at some level.

And if humans are anything, they are tool users.

by coldtea7 hours ago|

[-]

>If you have a toolbox full of similar but different tools, getting to know them is a prudent thing to do, not a psychosis

Can be both. Use of some tools like LLMs might be more psychosis-inducing than others like plain compilers or hammers.

>And if humans are anything, they are tool users.

To the point of self-destruction sometimes.

by scotty796 hours ago|

[-]

> Use of some tools like LLMs might be more psychosis-inducing than others like plain compilers or hammers.

I really don't get it. Why is the fact that it outputs words so goddamn important for everybody? How does it suddenly make you so emotionally vulnerable? Does my brain work in a different way than the rest of humanity? Can't you disregard what's irrelevant? Is every programmer suddenly a Trump supporter that has no ability to recognize empty words? To recognize lies about emotions and facts?

Words are just input. Mostly garbage. Emotion-inducing words are garbage 10 times more often than any other. I could expect a romance reader to be affected, or somebody with an IQ of 70. But how is the caste of some of the most technical people ever afraid of catching psychosis just because they might read some words?

by chadgpt35 hours ago|

[-]

It's a certain percentage of people and yes it's different for them because it outputs words and triggers some kind of emotional trust response.

by scotty794 hours ago|

[-]

As good an opportunity as any to acquire some emotional intelligence.

by j-bos8 hours ago|

[-]

Yeah, AI tools bring software developers closer to the messy real world where 0 and 1 aren't always exactly 0 and 1.

by skydhash4 hours ago|

[-]

Computing is useful for exactly going away from the messy real world of humans. I don’t need random errors in my financial transactions. I don’t want random errors when doctors are retrieving my medical history. And I don’t want random errors in my backup,… There’s plenty of non-deterministic things in my life, I don’t want my computer to follow suit.

by gib4449 hours ago|

[-]

No, I won't anthropomorphise LLMs.

by TeMPOraL3 hours ago|

[-]

That's your prerogative, but be aware you'll continue to remain confused about LLMs. Anthropomorphizing them is what gives you the best high-level intuition about where and how to employ them, and where and how not to.

by coldtea7 hours ago|

[-]

If there was anything that made sense to anthropomorphise it would be a machine meant to mimic talking, thinking and answering like a human, one that even passes the Turing test.

When we built the idea that anthropomorphising is wrong, we meant when doing it for rocks or trees or thunders or deer or some such.

by yeer27 hours ago|

[-]

This is so dumb and goes against all the principles that enabled computers and smartphones to achieve wide adoption - the technology should evolve to fit the human. Not the other way around.

by duckmysick6 hours ago|

[-]

I'd argue the opposite. Technology in the past few decades was (is) limited and humans had to adapt to it.

We communicate with other humans using voice and three dimensional hand gestures. To use computers and early phones we had to learn to operate new input devices: keyboards and mice. Later with touchscreens we moved to two dimensional hand (finger) gestures. We're barely making voice commands work with our devices just recently.

Then, a large number of humans are figuratively tethered to their desks because the devices need power and a stable internet connection. Mobile devices break this relationship a bit but you still need to charge them and be close to some sort of access point. In any case, the devices encourage sitting in one place for hours at a time.

And this is just computers and smartphones. Humans adapted their entire lifestyles and transformed the landscape to cater to cars.

by skydhash4 hours ago|

[-]

> Technology in the past few decades was (is) limited and humans had to adapt to it.

Was it? Think first about what it replaced. Lots of manual computation in bookkeeping and financial sectors. Telegrams and snail mail moved to email. Typesetting in books and magazines became easier and widely available,…

If there’s one thing that you can’t say about computers, it’s that they’re limited.

by duckmysick4 hours ago|

[-]

No doubt that computers enabled a lot of automation. We can both agree with that.

The context was that technology should evolve to fit the humans [not the other way around]. And if contemporary technology didn't have limitations, it would be correct.

But it did and humans had to adapt to the computers. Humans had to develop and learn special languages so they could communicate with computers to do all those useful things you mentioned. Why? They were limited in understanding (or parsing) human languages. It took us decades before we could talk to computers in human languages. We're getting pretty close - especially in the past few years - but there's still some friction.

by skydhash3 hours ago|

[-]

> Humans had to develop and learn special languages so they could communicate with computers to do all those useful things you mentioned. Why? They were limited in understanding (or parsing) human languages

You may need to revisit your computation theory courses. Computers are the embodiment of a mathematical model and thus the inputs and outputs are formalized.

Do you just hold a pen and words are written automatically? Do you just hover your hands over a piano and have the moonlight sonata played? No, you have to do precise mechanical movements because that’s how the output is realized.

There’s no such thing as words, sentences, keywords, statements at the computer level. What it does is symbol manipulation. You provide it a string of symbols, the rules for the manipulation, and it will provide a string of symbols as the output.

What symbols, what rules, are completely arbitrary. We just found that {1,0} are all that we needed as the set of symbols and that Context-Free Grammar is perfect for specifying the rules.

We still need to encode everything down to binary (ascii, unicode, bcd, floating points, pixel formats, PCM,…) and use a programming language (as defined by a grammar) to get the computer to do anything. Inference is made possible by those two mechanisms. It’s not a new computation model.
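To make the encoding step concrete, here is a small sketch that renders a few of the formats mentioned above as raw bits. The helper name `to_bits` and the sample values are made up for illustration; the byte layouts come from Python's standard `str.encode` and `struct.pack`.

```python
import struct


def to_bits(data: bytes) -> str:
    """Render bytes as groups of eight 0/1 characters."""
    return " ".join(f"{byte:08b}" for byte in data)


text_bits = to_bits("Hi".encode("ascii"))      # two ASCII characters
utf8_bits = to_bits("é".encode("utf-8"))       # one character, two UTF-8 bytes
float_bits = to_bits(struct.pack(">f", 1.0))   # IEEE 754 single precision, big-endian
pcm_bits = to_bits(struct.pack("<h", -1))      # one 16-bit signed sample, as in PCM audio

print(text_bits)   # 01001000 01101001
print(utf8_bits)   # 11000011 10101001
print(float_bits)  # 00111111 10000000 00000000 00000000
print(pcm_bits)    # 11111111 11111111
```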

by Wowfunhappy7 hours ago|

[-]

I mean, like, you can lament the state of the world all you want. It is what it is. Of course the AI labs would also like to make their models more consistent, but it's not how the technology works. They're black boxes to everybody.

by dreambuffer8 hours ago|

[-]

Please do not think of LLMs like human helpers, that is a recipe for long term sociopathy.

by egwor4 hours ago|

[-]

Maybe this is similar to web search too. We know how to get Google to return the results we want, and when we use other tools like Bing we get other behaviour.

[-]

Honestly, the differences between AI models always felt to me like the differences between coworkers or job candidates. They don't all share the same strengths and weaknesses - and they all have both good days and bad days.

Realising this made me respect the "I" in "AI" a bit more seriously.

by amelius10 hours ago|

[-]

Yes, but benchmarks can be gamed.

Maybe we need better reviewers then?

by yunohn6 hours ago|

[-]

> a product sheet showing what each model's strengths and weaknesses are

This presumes that the labs themselves know how well their models perform. But all they have are overtuned benchmarks and hype vibes.

by couscouspie10 hours ago|

[-]

That would be ideal, but AI is less like a tool and more like a human in this regard, and you don't have character sheets for each of your colleagues either.

by supergarfield9 hours ago|

[-]

If my coworker was part of a clone series of 100 million units, requesting a character sheet would be pretty reasonable

by bluegatty9 hours ago|

[-]

These are $1 trillion companies that can't produce explicit details on how their products work? It's nonsense.

by sixothree3 hours ago|

[-]

I think if they could explain how they work, their strengths and weaknesses, they would reveal to the world whose data they've been appropriating.

by bluegatty3 hours ago|

[-]

That's another thing altogether. They can characterize the behaviour without quite giving up who and where the data comes from.

Admittedly, yes, there's some overlap there.

They would have to admit 'seen it in the training data' as a factor, and that opens a can of worms.

by epolanski7 hours ago|

[-]

The problem is that this is very hard to replicate and benchmarks focus on E2E tests, going from one prompt to the final solution.

They do not test how models perform when used interactively, like most of us do.

by weitendorf12 hours ago|

[-]

One thing I used to test quite a lot was rerunning the exact same prompt on the same input, or semantically equivalent (in my mind) but differently framed or worded input, and seeing how much they diverged. In particular I’ve done this quite a lot between Sonnet vs Opus and across Qwen models.

I recommend everybody do this because you don’t need any special data except what you are already using, and the results will be very eye opening: there is WAY more randomness or instability involved than you would otherwise assume. A lot of what you might think is a better prompt technique, or a particularly good or bad outcome, could just as well be random chance or just different behaviors across model versions or sizes. And your results can be massively biased by small differences in input. We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.
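A minimal sketch of such a repeat-and-compare check. It assumes you can wrap your model call in a plain function taking a prompt and returning text; `fake_model` below is a hypothetical stand-in with canned replies, and the `difflib` ratio is only a crude textual similarity score, not a measure of semantic agreement.

```python
import difflib
import itertools
import random
from typing import Callable, List


def pairwise_similarity(outputs: List[str]) -> float:
    """Mean difflib ratio over all pairs of outputs (1.0 means identical)."""
    pairs = list(itertools.combinations(outputs, 2))
    if not pairs:
        return 1.0
    ratios = [difflib.SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(ratios) / len(ratios)


def measure_stability(call_model: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Send the same prompt `runs` times and score how similar the replies are."""
    outputs = [call_model(prompt) for _ in range(runs)]
    return pairwise_similarity(outputs)


_rng = random.Random(0)  # seeded so the demo is repeatable


def fake_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call: returns a canned reply at random."""
    return _rng.choice([
        "Use a hash map keyed by user id.",
        "Use a hash map keyed on the user id.",
        "Sort the list, then binary search.",
    ])


if __name__ == "__main__":
    score = measure_stability(fake_model, "How should I look up users?", runs=6)
    print(f"stability: {score:.2f}")
```

Running the same harness on two differently worded versions of a prompt and comparing the scores gives a rough feel for how much of a difference is wording and how much is noise.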

There’s a skill to it. With agentic loops if you get the model into a self-eval structure where it’s hard to cheat or take shortcuts, and it’s in the right structure or domain that models its training, you’re golden. But it’s hard to find the sweet spots (pro tip, have Opus 4.8 convert PyTorch models into ONNX or quants or get them running on different hardware, I swear it was like I activated some kind of savant-like skillset; meanwhile I can’t for the life of me get it to properly write/test EBNF formalizations of common languages and formats without cheating).

The worst part is that it changes so much so frequently that it’s almost useless to really go digging for this kind of knowledge unless you’re actually the one training the models. I wish this kind of “stability” in output was more emphasized in their training so they’d be predictable. I assume it’s hard to do without overfitting or breaking the explore-exploit loop but also, I would spend so much more on LLMs for batch workloads if they could do them more reliably…

by movpasd10 hours ago|

[-]

I've not done particularly rigorous testing, but I've done this a lot with Claude to get a feel. What I've noticed is for certain open-ended tasks, Claude is extremely primeable: it will pick up on minor differences in wording in your prompt and run with them hard.

It can be frustrating. The AI pretends to be a human, and so a part of my brain expects them to commit and have a "parti pris" like a human, so the exercise is a good reminder of the feedback loop. My mental model is that before the first three or four messages, the model has many finer points of its personality still underdetermined. I'd suggest that as the mechanism for "role-based prompting". And it explains the "savant sleeper agent" thing you describe. You want to get the state in the right attractor on the manifold.

These machines are pretty incredible, but for conversation-driven workflows you really have to be in the driver's seat. A human has a property that the AI does not have, at least under current architectures: we are regulated by the outside world. A bit of a tangent, but I can see how AI psychosis arises from these dynamics.

by evntdrvn8 hours ago|

[-]

One thing that I learned when doing raw API LLM usage is how drastically the results can vary call per call with exactly the same input. I think that on average, people using agents underestimate how large the variation in results from a given turn command is, and so overindex on "X technique worked well" or "if I do Y then this will happen" or even "it did Z task well last time so it will this time too" or "{Model} is great at {thing}".
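One way to guard against that overindexing is to record pass/fail outcomes for each technique and check whether the gap could plausibly be chance. The sketch below uses a simple permutation test; the function name and the sample outcomes are made up for illustration, and what counts as a "pass" is up to you.

```python
import random
from typing import List


def permutation_p_value(a: List[int], b: List[int], trials: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference in success rates.

    `a` and `b` are lists of 0/1 outcomes (1 = the run passed your check).
    Returns the fraction of random relabelings whose rate gap is at least
    as large as the observed one.
    """
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b  # new list, so the inputs are not mutated by shuffling
    hits = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        gap = abs(sum(left) / len(left) - sum(right) / len(right))
        if gap >= observed - 1e-12:
            hits += 1
    return hits / trials


# Made-up outcomes: the "magic" prompt passed 4 of 5 runs, the plain prompt 2 of 5.
with_technique = [1, 1, 1, 1, 0]
without_technique = [1, 0, 1, 0, 0]
print(permutation_p_value(with_technique, without_technique))
```

With only five runs per arm, a 4-of-5 versus 2-of-5 split is the kind of gap that random relabeling produces about half the time, so it says very little about the technique.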

[-]

  > We’ve been calling some of these “magic words” at work, specific technical terms or references/techniques that you need only mention to get vast improvements in outcome.

Any chance you could share some of these? Seems like something we could all benefit from.

by weitendorf10 hours ago|

[-]

Sure, my company has been working on a broad swathe of infrastructure projects and developer tools, which requires prompting models to seek out other tools/apis/docs/examples but in a way where we can't just dump all the context on the model up front. We also need the models to oftentimes look up technical documentation and specs, and sometimes build custom parsers for specific documentation websites that only make the data available embedded across 200+ pages of html.

First, I almost always try to seed every new project or context/domain with canonical technical specifications or examples I found elsewhere. When I set up this project recently, I linked to a bunch of the official Apple docs for sysctl, and told it to use a specific technique for calling assembly code from Go, that from experience it almost never realizes it can do or knows about (and similarly for sysctl, I knew it kinda sorta knew about it, but not in its entirety): https://github.com/accretional/sysctl/commit/da52438233e5b33...

The other thing I did was tell it to enumerate all the test cases ahead of time rather than to just directly implement them; again this is something where you have to explicitly tell it to go digging for information where it has blind spots and get it to set up properly grounded self-eval in a way that it can test against. I usually tell it to take notes as it works or commit notes to itself that will persist over sessions: https://github.com/accretional/sysctl/blob/main/FINDINGS_2.m...

Once we get back to working on this project we'll just have it implement / validate the rest of the sysctl feature support against the full inventory we had it uncover: https://github.com/accretional/sysctl/blob/main/cmd/darwin-n...

Another thing we do is have it specify an API that it can produce against; then in other projects we have them consume the API via reflection (and our special sauce we've been working on is the ability to discover and integrate against these automatically across thousands of APIs from many providers, which we've got working and can share if you're interested in using it as an early customer): https://github.com/accretional/sysctl/blob/main/proto/sysctl... This isn't the greatest example because it doesn't actually fully specify the sysctl keys yet. But I did have it create a knowledge base trying to cover the 1000+ keys as best as it could, to reference as it continued: https://github.com/accretional/sysctl/tree/main/macos-sysctl...

We have a better example in eg https://github.com/accretional/proto-sqlite/tree/main/lang where we were able to encode the entire sqlite grammar into a grpc interface so that you could eg find the exact structure (and sanitize) of a select statement: https://github.com/accretional/proto-sqlite/blob/main/lang/p... This way integration and discovery becomes a matter of telling it "use reflection against this endpoint to discover the sql interface, then implement against it" and we can model formats/input validation as formal grammars via EBNF (all magic words) vs just adhoc
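As a toy illustration of validating input against a formal grammar rather than ad hoc string checks, here is a sketch of a recursive-descent parser for a deliberately tiny SELECT subset. This is not the proto-sqlite grammar or its gRPC interface; the EBNF in the comment and all names are invented for the example.

```python
import re
from typing import List, NamedTuple

# Toy grammar (EBNF), far smaller than real SQL:
#   select  = "SELECT" columns "FROM" ident ;
#   columns = "*" | ident { "," ident } ;
#   ident   = ( letter | "_" ) { letter | digit | "_" } ;

TOKEN = re.compile(r"\s*([A-Za-z_][A-Za-z0-9_]*|\*|,)")
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*\Z")
KEYWORDS = {"SELECT", "FROM"}


class Select(NamedTuple):
    columns: List[str]
    table: str


def tokenize(sql: str) -> List[str]:
    """Split input into tokens, rejecting any character outside the grammar."""
    tokens, pos = [], 0
    sql = sql.strip()
    while pos < len(sql):
        match = TOKEN.match(sql, pos)
        if not match:
            raise ValueError(f"unexpected character at {pos}: {sql[pos]!r}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse_select(sql: str) -> Select:
    """Parse one statement of the toy grammar or raise ValueError."""
    tokens = tokenize(sql)
    pos = 0

    def peek() -> str:
        return tokens[pos] if pos < len(tokens) else ""

    def take() -> str:
        nonlocal pos
        token = peek()
        pos += 1
        return token

    def expect_keyword(word: str) -> None:
        token = take()
        if token.upper() != word:
            raise ValueError(f"expected {word}, got {token!r}")

    def ident() -> str:
        token = take()
        if not IDENT.match(token) or token.upper() in KEYWORDS:
            raise ValueError(f"expected identifier, got {token!r}")
        return token

    expect_keyword("SELECT")
    if peek() == "*":
        columns = [take()]
    else:
        columns = [ident()]
        while peek() == ",":
            take()
            columns.append(ident())
    expect_keyword("FROM")
    table = ident()
    if pos != len(tokens):
        raise ValueError(f"unexpected trailing input: {tokens[pos:]!r}")
    return Select(columns, table)


print(parse_select("SELECT id, name FROM users"))
```

Anything the grammar does not describe, such as a trailing `; DROP TABLE users`, is rejected at the tokenizer or parser instead of being caught by a hand-written blacklist.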

We also tell it to set up and use a browser automation toolkit/testing and always run it at the end of testing workflows (often in a way that auto-opens screenshots on our local machines + commits them to git) via tools like https://github.com/accretional/chromerpc#headlessbrowser-aut... so that whenever we produce UIs it can evaluate its own output and iterate without direct human intervention. This is another case where the knowledge-discovery problem becomes a problem so we tell the models to use reflection to discover the browser automation apis. That ends up giving us things like this where it records user journeys through sites and creates visualizations without us having to debug them or do them ourselves: https://github.com/accretional/proto-css/tree/main/chrome-te...

by dotancohen5 hours ago|

[-]

Thank you very much. I'm going to re-read this evening. Have a great day!

by mnicky10 hours ago|

[-]

If the benefits of using the model you've come to know well outweigh the disadvantages, you can continue using it even after the release of a successor model, right?

by saint-evan5 hours ago|

[-]

Yes! That's exactly true. I have a very real experience on this. I got introduced to Anthropic's family of models with Claude3.5. I fell in love with the specific personality of Sonnet, the model. I can't remember if back then Opus wasn't public yet but I remember very clearly trying out Opus several times when it became touted as best-in-class and actually recoiling from the foreign feel of the Opus model. I remember very well that my problem was that it was way too eager and pretty hard to steer. I returned to Sonnet and I've used ONLY Sonnet ever since. I have/had access to Fable and Opus4.8 but I never once tried them. In the early days with Sonnet3/4.5, I bought ChatGPT, I also remember thinking that it was a great teacher but a lazy coder. You'd get the scaffolding and then '# rest of code block' not full implementation so unless you wanted to learn the concept, weigh trade-offs, ask clarifying questions or jump into a rabbit hole... You had to go code it yourself. ChatGPT generally as a model is a very good teacher so much so that the free version is enough and I use the free in combination with the most advanced Sonnet model for actual SWE day to day. And whenever there's an Opus release I'm actually very excited because it means there's a smarter Sonnet model OTW. I'll actually be veryyy very sad if the Sonnet line gets sunset. There have been no Sonnet upgrades since even as other family lines get improved.

Do note that I only use LLMs in the ChatUI, I never use agents. I don't believe having a blackbox codebase managed by entities with a half-life of 'delete conversation' or 200k tokens is a responsible idea. In ChatUI, I lay the ground rules, kill assumptions about our working relationship, give it foundational context on the problem and codebase we're working on, explain the problem and then we have a conversation about it and I gradually disclose more context as it logically becomes relevant. So, to directly answer your question, maybe I'm missing out on a ton of upside by not using the absolute best but I'd say familiarizing yourself with a specific model has all the benefits of having a human friend you've grown up with... except your buddy's a savant and would absolutely love to help!

by h05sz487b12 hours ago|

[-]

> It is very much like playing an instrument.

Or it is more like playing a slot machine and you imagine the rest.

by cube0012 hours ago|

[1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...

[-]

This is how I feel whenever I see bold all caps instructions in a system prompt or someone claims they conducted "research" and found the magic prompt template that makes the model pay out.

Maybe it works some of the time but it isn't a solution that works every time.

It reminds me of people hovering to play a slot machine when someone gets up and it hasn't paid out as if they've solved slot machines.

While I don't mind putting something in a loop until the tests pass, I'm less comfortable doing that when providers are silently rerouting to lower quality models, or in Google's case burning quota faster to ease their own server load without being transparent about what the "standard limits" are to begin with. [1]

I'm hopeful I'll be more comfortable with these "slot machines" when frontier models get to the point where they can be run locally on hardware I can actually afford so I know exactly what I'm getting and not jumping at shadows with providers playing tricks behind the scenes to ease their own load without admitting the customer is getting less for their money as they get more popular.

by user439287 hours ago|

[-]

Has there been any evidence of a well known provider rerouting to lower quality models?

Last I saw, engineers working at OpenAI denied this on HN.

I saw that someone set up a tracker that aims to record the performance of the models, and so far it has not shown any statistically significant deviation in performance for Codex, and not yet enough data for Claude: https://marginlab.ai/trackers/codex/

by cube004 hours ago|

https://news.ycombinator.com/item?id=48485958

[-]

> Has there been any evidence of a well known provider rerouting to lower quality models?

The firm [Anthropic] would deliberately degrade the model’s performance in ways that were invisible to the user.

by coldtea9 hours ago|

[-]

>This is how I feel whenever I see bold all caps instructions in a system prompt or someone claims they conducted "research" and found the magic prompt template that makes the model pay out. Maybe it works some of the time but it isn't a solution that works every time.

For such a thing to be useful, it's enough that it works substantially more often than not having those instructions in.

by Planktonne6 hours ago|

[-]

Every gambler thinks their system works, given enough chances.

by hodgehog1111 hours ago|

[-]

A poor analogy depending on the setting because you can't adjust the odds with a slot machine, and the ROI is negative by design. If that's your experience, yeah, I wouldn't use an LLM either.

by victorbjorklund7 hours ago|

[-]

Pretty sure most modern slot machines are digital and you could adjust the odds (even to a positive EV) if you change the code.

by hodgehog115 hours ago|

[-]

You're being unfaithful to the original statement. The whole point of saying something is like a slot machine is that there are significant odds that you lose. If you ever have access to a casino slot machine that has a positive EV, there are no tangible negative aspects anymore; you would use it over and over again and accumulate significant wealth from the house. That's my point.
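The expected-value argument can be made concrete with a few lines of arithmetic. The payout tables below are invented for illustration and do not describe any real machine.

```python
from typing import Dict


def expected_value(payouts: Dict[float, float], stake: float) -> float:
    """Expected net gain per play: sum(payout * probability) minus the stake.

    `payouts` maps each payout amount to its probability.
    """
    total_probability = sum(payouts.values())
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(payout * p for payout, p in payouts.items()) - stake


# Made-up tables: the only difference is the size of the top prize.
house_favoured = {0: 0.90, 5: 0.08, 20: 0.02}
player_favoured = {0: 0.90, 5: 0.08, 40: 0.02}

print(round(expected_value(house_favoured, stake=1), 2))   # -0.2
print(round(expected_value(player_favoured, stake=1), 2))  # 0.2
```

With a negative expected value every additional play loses money on average; with a positive one, repeated play is exactly what you would want.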

by ramon15612 hours ago|

[-]

Instruments are pseudo-random until you know what you're doing. Slot machines are just slot machines

by Forgeties7910 hours ago|

[-]

Musical instruments are not random. You’re just doing random inputs. Instruments are consistent, even if the “flavor” and quality varies with different builds.

Playing a B on a saxophone always plays a B.

by headcanon4 hours ago|

[-]

I see you haven't tried a modular synthesizer yet :) Getting back to the same "place" in a patch can sometimes be impossible, and it does feel "random" until you get the hang of it.

by Forgeties7910 minutes ago|

[-]

But ultimately it isn’t unpredictable and random. That’s just a skill issue. There is literally no person good enough at prompting to create consistent, predictable, useful results.

[-]

Saxophone, being a wind instrument was a bad choice. I can definitely tell which student was blowing when hearing a note.

But your analogy remains solid if you substitute e.g. a piano and a reasonably proficient player. A single note would be nearly indistinguishable between players... But a full piece most certainly will sound different.

by palata10 hours ago|

[-]

While I agree with you, I think it's diverging from the initial point.

The original take was "LLMs are very much like playing an instrument". I think they are very much NOT like playing an instrument.

While different musicians will produce different results, one musician won't get drastically different results on different days or when trying a different "copy" of the same instrument. If you can play the violin on your violin and I lend you my violin, you will still be able to play very consistently. You may argue that the sound will differ and you will have to adapt slightly, but that's not remotely similar to the randomness coming from LLMs.

by tekne8 hours ago|

[-]

Will you?

That's only if both violins are tuned the same way, and one must continually tune them lest they get out of sync.

Similarly, an LLM can be extremely consistent if tuned properly -- indeed, if you fix the weights and settings, they can be made "essentially deterministic" for many prompts!
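A rough sketch of why fixed settings matter, using a toy sampler over made-up token scores. This is not any provider's decoding code; it only illustrates the general idea that temperature 0 reduces to picking the top score, and that a fixed seed makes sampled output repeatable.

```python
import math
import random
from typing import List


def sample_token(logits: List[float], temperature: float, rng: random.Random) -> int:
    """Pick a token index from raw scores.

    temperature == 0 means greedy decoding: always the highest score.
    Higher temperatures flatten the distribution and add randomness.
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [2.0, 1.5, 0.1]  # made-up scores for three candidate tokens

# Greedy decoding ignores the random source entirely.
greedy = {sample_token(logits, 0, random.Random()) for _ in range(100)}
print(greedy)  # {0}


def draw(seed: int, n: int = 20) -> List[int]:
    """Sample n tokens at temperature 1.0 from a freshly seeded generator."""
    rng = random.Random(seed)
    return [sample_token(logits, 1.0, rng) for _ in range(n)]


# Same seed, same settings: the sampled sequence repeats exactly.
print(draw(42) == draw(42))  # True
```

Determinism in this sense says nothing about whether a small change to the prompt produces a small change in the output, which is the separate point raised in the reply below.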

by layer86 hours ago|

[-]

The difference is that a violin player can predict how the known violin will behave under all relevant circumstances, will know how to get the right tone out of it, while you’re generally unable to predict the adequacy of output of even a deterministic LLM. You can’t practically reason about how varying the input to the LLM will ensure the adequacy of its output, while the violin player is perfectly able to do so for the violin.

This is because LLMs have aspects of chaotic dynamical systems, where small changes in initial conditions can lead to vastly different outcomes. That property is independent from nondeterminism.

by Forgeties797 hours ago|

[-]

Anyone who has even modest experience with a particular instrument can pick any one up at any time and play it. The way the notes are played is consistent and produces a consistent note. If you tune 50 guitars to standard, the chords all produce what they should. It is a predictable instrument. You do not pick up a trumpet in one place then another and find the key combinations are suddenly different.

You know what we are talking about. Tuning, poor playing, all of that is mild variation from what we know it is supposed to do every time and we can target the notes they are supposed to hit consistently. You're comparing slight tonal variations to completely different outputs from the same inputs. If I hit a "C" on the piano, it is going to play "C." If it does not, then the piano is not functioning properly. LLMs for some reason get a pass on this and it makes them very distinct from musical instruments.

This feels like a very nitpicky steel man, not a productive attempt at discussion.

by Forgeties7910 hours ago|

[-]

A poor B is still a B fingering and the sax is supposed to play a B every time. Missing it is human error, not tool error. I can pick up an alto sax, a clarinet, etc. any time, anywhere, and expect the same fingerings to work every time. My individual skill or mistakes or peculiarities of each build are not what is relevant here.

LLMs do not operate consistently and make their own errors while we argue about which incantation makes them less inconsistent, knowing they will never actually perform as expected.

I played woodwinds regularly for 15 years so I feel fine with my example.

by glerk12 hours ago|

[-]

It is a bit of both. A non-deterministic instrument and a predictable slot machine.

by psychoslave12 hours ago|

[-]

I play slot machines as an instrument! ;)

[-]

Roger Waters and Nick Mason were playing the cash register in 1973!

by devin3 hours ago|

[-]

It is not at all like playing an instrument.

Instruments present a clear interface to a user, have predictable outputs, etc.

The only comparison that might work for me is that LLMs are very bad instruments where you are constantly forced to negotiate its idiosyncrasies in order to massage the output you want from it, and even then there is enough randomness that trying to do so is almost a fool's errand.

by djeastm2 hours ago|

[-]

I think they mean playing different instruments not other instances of the same instrument. A tuba's interface differs from a violin's, etc.

by devin2 hours ago|

[-]

My criticism of the comparison would stand in either case. There is nothing clear and uniform about the interface to LLMs that matches their musical counterparts. Even modular synthesizers with random sources are far more controlled.

I also think it's disingenuous to call LLMs "tools" in the stricter sense of the definition, but I've mostly given up trying to convince people of this. Main reason being that a terrible writer and a gifted writer can produce similar outputs, and for the terrible writer it will be above their average, and for the gifted writer it will be below what they could produce with full control.

by Wowfunhappy7 hours ago|

[-]

> With Claude, you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative. Also (you might raise an eyebrow at this) being nice to Claude will be rewarded and being mean to Claude will be punished. Claude tends to mirror your tone more aggressively and you don't want to get into negative loops with it.

> With GPT, you have to be precise and reduce ambiguity. GPT will often try to resolve ambiguity in a min-max style "I'm going to do X, but make sure it is not quite Y". It will tend to be more paranoid and overengineer to catch all edge cases if you don't tell it precisely what the scope is.

I agree with all of this except for one thing: I swear to god, being mean to Claude at the right time can be enormously effective. The F-bomb in particular seems to really help it snap out of ruts sometimes.

by mcbits4 hours ago|

[-]

I haven't really experimented with being "nice" or "mean", but I would worry that a prompt like "No, dumbass, ..." would kick it into the patterns of someone who frequently got called a dumbass (perhaps for good reason) in the training set. On the other hand, maybe it could trigger more defensive responses with argumentation to explain its conclusions.

by Wowfunhappy4 hours ago|

[-]

I only use it for behaviors I really want the model to clamp down on, and I don't think I've ever told the model it was stupid. But I might say something like:

    No, don't f***ing do that! What part of "[previous instruction]" don't you f***ing understand? I am extremely angry and disappointed by your inability to [whatever]. Do better please.

> maybe it could trigger more defensive responses with argumentation to explain its conclusions.

Quite the opposite, it makes the model extremely conciliatory—which in this situation is what I want. If you're hoping to make the model less sycophantic, this is the wrong tool.

by andai7 hours ago|

[-]

I asked GLM 5.2 for an HTML5 port of my old C#/XNA game. It ported all the code exactly (except for operator overloading, which doesn't exist in JS), and added more code to make the code work.

I asked Claude Sonnet 4.6 for the same thing, and Claude's version was like if the game had been written in JS originally.

Also, for some reason it made it a single HTML file, removed all assets, dynamically generated graphics and dynamically generated music. It also gave me a new, better background.

This surprised me, since it was not what I asked for. I just asked it to port the game.

I was pretty pleased about the choices it made, but I'm not sure how to turn that behavior on and off. Sometimes you want it to be creative, sometimes you want it to actually do what you said.

by vlovich1235 hours ago|

[-]

You’d probably have to say “port exactly as is without changing any assets and keeping the original structure of the code” or “port using the exact same assets but write as if native JS, using good code structure principles for organizing”.

You have to be a lot more explicit but it’s hard to know a priori what decisions it’ll make. A good idea is to run it in plan mode so you can read those decisions before it sets out on a path and have an opportunity to make corrections.

by CuriouslyC6 hours ago|

[-]

What you've described is Claude's "secret sauce" and the reason some people love it and some people hate it. It's not really possible to turn off, you can try to prompt against it but it's not reliable, the solution is to use Claude when you want that behavior and other models when you don't.

by stingraycharles14 hours ago|

[-]

I agree with your general gist, and in general it’s a case of “the best tool for the particular job”, keeping token spend and other things in mind as well.

What I do know absolutely for sure is that LLM benchmarks are not to be trusted, they are just a minor indicator and real world usage is often very different.

by sanderjd14 hours ago|

[-]

I share this sense, but my immediate thought is that we need to improve the evaluations! Do you think this is impossible? That there is something indelible that it is not possible to capture empirically? I kind of have this intuitive sense that it is this way, but simultaneously I think that it's unlikely to really be true.

by theshrike7912 hours ago|

[-]

We shouldn't just measure the power of the raw LLM, harnesses matter more and more.

It's like taking the engine out of each car, putting it on a test bed and running it, and then deciding whether the car is good or bad based on the graphs the test bed provided.

You might have the best engine in the world, but if you put it in a shit car, the result is still bad. The seats are squeaky plastic, the infotainment is touch-only and you can't put on your seatbelt without knocking down whatever is in the cupholder.

by sanderjd5 hours ago|

[-]

Aren't there benchmarks that measure at the harness level as well?

by theshrike7938 minutes ago|

[-]

How would you benchmark "agent harness communicates with user clearly"? It's 100% a feels measurement.

by gbalduzzi13 hours ago|

[-]

Following the original comment's concepts, if every model requires a different prompting technique to maximize its output, how can a benchmark based on sending the same prompt to all models be accurate? We should create different prompts for each model, but then how reliable and unbiased can the benchmark be?

It is a fundamentally hard problem to solve.

by Wowfunhappy6 hours ago|

[-]

I'm not GP, but yes, I think it's impossible.

Take AI out of the picture for a moment. What makes someone a good coder? What makes someone intelligent? How do you evaluate those skills?

Of course we have standardized tests, and they're useful, but they're also imperfect. And they become especially imperfect when people start training for the tests specifically—which is, essentially, benchmaxxing.

We have never been able to quantitatively measure most skills to a high degree of accuracy, despite centuries of trying. That's not going to change now.

(I don't mean to anthropomorphize the LLMs, but I do think they're like humans in this way.)

by Forgeties799 hours ago|

[-]

The reason we can’t capture it empirically is that nobody truly knows exactly what we are supposed to be using these tools for or how they are going to operate. We are still fitting squares into holes with them. We are told to treat them like some bespoke tool for coding, shopping, tech-support, etc. But it is not actually purpose built for any of these things.

When I use a calculator, I know exactly what it does and what it is supposed to do. It always gives me a verifiable, predictable result. If I input “8+8” 10,000x it will give me “16” 10,000x outside of incredibly fringe edge cases/bugs. I can’t say the same for LLMs.
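The "8+8, 10,000 times" test is easy to write down as a repeatability check. In this sketch `calculator` is a trivial deterministic function and `flaky_tool` is a made-up stand-in for a sampled system (it is not an LLM call); the helper just tallies how many distinct answers come back.

```python
import random
from collections import Counter

def distinct_outputs(fn, arg, runs=10_000):
    """Call fn(arg) repeatedly and tally how many times each output appears."""
    return Counter(fn(arg) for _ in range(runs))

def calculator(expr):
    # Deterministic: parses "a+b" and adds the two integers.
    left, right = expr.split("+")
    return int(left) + int(right)

_rng = random.Random(0)

def flaky_tool(expr):
    # Hypothetical sampled system: usually right, sometimes off by one.
    return calculator(expr) + _rng.choice([0, 0, 0, 1])

calc_tally = distinct_outputs(calculator, "8+8")
flaky_tally = distinct_outputs(flaky_tool, "8+8")

print(calc_tally)
print(sorted(flaky_tally))
```

A tally with a single key is the calculator case; more than one key means the same input produced different outputs.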

by willtemperley13 hours ago|

[-]

Yes, how do we know Opus 4.8 hasn't been trained on the SWE-Bench examples?

With a squillion dollars at stake per bench point, someone will have figured out a plausibly deniable way to game these benchmarks.

by stingraycharles6 hours ago|

[-]

Ehr, the SWE bench examples are particularly horrible as those are just publicly available historical PRs. So if the models are trained on GitHub data, it will be included.

So almost by design that particular benchmark is tainted, and benchmarks recall rather than reasoning.

by willtemperley6 hours ago|

[-]

Wow that's worse than I thought, and breaks the number one rule of machine learning: you don't train the model with your test dataset.
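A crude way to look for this kind of leak is to check how many word n-grams of a benchmark item also appear verbatim in the training text. The sketch below is only that: the function names, the 8-word window, and the sample strings are all assumptions for illustration, and real decontamination is considerably more involved.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in text (lowercased, whitespace-split)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(test_item, training_docs, n=8):
    """Fraction of the test item's n-grams that also appear in any training doc."""
    item_grams = ngrams(test_item, n)
    if not item_grams:
        return 0.0
    seen = set()
    for doc in training_docs:
        seen |= ngrams(doc, n)
    return len(item_grams & seen) / len(item_grams)

# Made-up examples: one test item copied from the corpus, one unrelated.
training_docs = [
    "fix off by one error in pagination helper so the last page is not dropped",
    "add retry logic to the upload client",
]
leaked = "Fix off by one error in pagination helper so the last page is not dropped"
fresh = "rename the config loader and update every call site in the test suite today"

print(overlap_ratio(leaked, training_docs))
print(overlap_ratio(fresh, training_docs))
```

A high ratio suggests the item was seen during training; a low ratio proves little, since paraphrased or reformatted copies would slip past an exact-match check like this.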

by dv35z13 hours ago|

[-]

What would it take to have trustworthy benchmarks? As with all "targets", they can be gamed - but I am curious about quantifiable quality metrics.

by sixtyj13 hours ago|

[-]

The issue with LLM benchmarks is similar to the one with car benchmarks. E.g. journalists almost always get the fully equipped model, so their review is honest but sort of rigged.

I haven’t seen the details of LLM benchmarks’ data sets, but I would suppose that the “questions” are public and so known in advance, so you can tune a model to them as much as possible.

One real benchmark is drawing a pelican - https://github.com/simonw/pelican-bicycle - Simon Willison made it for his LLM tests.

If you really want to find a model that works for your specific purpose, I would recommend several rounds at arena.ai - it helps you find a model anonymously, without confirmation bias.

Some ppl: Claude is the best! Others to them: but Qwen is the best! Or… Codex is better! …

it all depends on the language (English, Dutch, French…), style of querying (caveman, specs, skills, goal etc.)

Even with the same model I get different answers to same prompt that is just tweaked a little.

So benchmarks are nice but mostly useless.

Without your usecase it is just a reference number indicating the approximate position of that model among the others. And for those who want to make money it is a marketing tool to sell more as every customer counts.

by theshrike7913 hours ago|

[-]

You can't measure "feels".

One good analogy is the Macbook vs generic windows laptop debate online.

The engineer mind just compares numbers: the Lingwoo laptop from Amazon has the biggest numbers for everything and the lowest price. Ergo it is the best.

But the numbers don't measure the fact that the Lingwoo creaks and squeaks when you lift it due to the cheap plastic. It also runs at 100C when both CPU and GPU are fully utilised. The keyboard feels like a membrane keyboard from a milspec device from the 90s. Numbers also don't measure the fact that Lingwoo is an alphabet soup whitelabel manufacturer that won't exist in any legal capacity in 6 months so good luck with any warranty issues.

There will be an identical laptop called Chongwin being sold though. Completely different company, definitely.

The same applies to LLMs. You can do benchmarks like ask them to one-shot different kinds of gotcha questions (car wash, strawberry and other idiotic ones) or get them to write different kinds of programs.

But that doesn't measure the UX of doing so at all. How many times do you actually need any of those when you're actually working?

It's like unit testing an application. Every function can have 100% test coverage and the app can still be shit because there are things you can't unit test for.

by psychoslave12 hours ago|

[-]

> You can't measure "feels".

One can always measure whatever they wonder about. That doesn't mean the measure will be trustworthy, or that anything built on it will be any better than wet-finger judgement.

by theshrike7910 hours ago|

[-]

Feels are just opinions and taste. It's like art and music, you can't quantify either to a mathematical formula or an absolute test of which is good.

Even songs that break the "rules" of music can be subjectively good, either because they broke the rules or despite it.

Or with cars, a car that's beautiful to one person is the ugliest piece of trash on the street. Some people want a super soft ride where their espresso martini doesn't even vibrate when gunning it through a gravel road and others want to feel every grain of sand on the asphalt in their buttocks. Neither is "correct" and there is no objective measurement for ride comfort.

by da-x12 hours ago|

[-]

Maybe someone can devise a distributed bench-marking system where multiple people collaborate on tests and also vet each other's tests and rating without revealing them to the public.

I have my own "interview questions" for models where I give them a premade Git repo and a problem to solve. Then, I rate them like a teacher. I believe others do that as well, so we only need a reliable system to aggregate these results.
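The aggregation half of that idea is the easy part. As a rough sketch, assuming each reviewer submits only a (reviewer, model, score) tuple and keeps the task itself private, per-model summaries could look like this (the reviewers, model names, and scores are invented):

```python
from collections import defaultdict
from statistics import median

def aggregate(ratings):
    """Group (reviewer, model, score) tuples by model and summarize the scores."""
    by_model = defaultdict(list)
    for _reviewer, model, score in ratings:
        by_model[model].append(score)
    return {
        model: {"n": len(scores), "median": median(scores)}
        for model, scores in by_model.items()
    }

# Made-up data: each reviewer scores models 0-10 on their own private task.
ratings = [
    ("alice", "model-a", 8), ("bob", "model-a", 6), ("carol", "model-a", 7),
    ("alice", "model-b", 5), ("bob", "model-b", 9), ("carol", "model-b", 5),
]
summary = aggregate(ratings)

print(summary)
```

The median is used here so that a single enthusiastic or hostile reviewer moves the result less than it would with a mean. Vetting reviewers and keeping the tasks private is the hard part and is not addressed by this sketch.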

by microtonal12 hours ago|

[-]

The problem with proprietary models behind APIs is that they could have saved your benchmark for future training though.

The only way to make it fair is to have the model provider give some benchmarking org the weights + inference engine, so that the model can be run in complete isolation and no information about the benchmark is leaked.

Though I guess for a 'random' person's benchmark that hides between all other requests it's probably ok.

by clhodapp11 hours ago|

[-]

While the gist of what you say is true, it is hard to get very good at treating them as instruments when they keep getting replaced with new, ostensibly-better versions every few months. But those new versions are not strictly better. They are mostly-better while actually having different strengths and weaknesses.

It's hard to decide when to use the best tool for a job you are aware of to ensure throughput and when to spend time experimenting with a new tool to learn what it's good at.

by nonethewiser4 hours ago|

[-]

> you sometimes want to under-specify or phrase things more indirectly to give a color to the implementation or elicit something creative

This has been my experience with most models. If you say "How do I do X? I was thinking maybe Y or Z" then the model will probably try to make Y or Z work. They will very likely not say some third option that is wildly different is better, even if it may be. And actually maybe less so with Claude because sometimes it pushes back.

Actually this seems like it would be an interesting test. Maybe I will come up with some contrived question and ask several models.

by vkazanov14 hours ago|

[-]

The problem is not that there are details, the problem is the constantly shifting ground. We can only rely on a harness to be sort of predictable but the models change all the time.

by rkuska13 hours ago|

[-]

It's the system prompts that change all the time, especially in Claude Code.

by furyofantares3 hours ago|

[-]

I strive to make this NOT the case, by fixing up my skills or agents.md whenever they don't work how I want in one provider or the other. I mean, yeah, it would be awesome if I was a virtuoso with all the agents/models I use. But I am switching all the time, either because one leapfrogs the other, or because I hit limits (I'm on $200/mo on both Claude and Codex, and also subscribe to some others when I hit limits on both of those simultaneously).

by tingletech2 hours ago|

[-]

I do think it pays to be nice to the model. When the context window is running out I like to ask "please summarize what went well and what didn't work in this session. How could the user be more helpful?"

by john_strinlai2 hours ago|

[-]

>I do think it pays to be nice to the model.

there was something on HN a few weeks ago about how most/all models perform better the more rude you are to them.

(i still say "please", i can't help it)

by visiondude12 hours ago|

[-]

while not scientific this has been my experience as well. i will add that language specificity in word choice is also a learned behavior. for example, the word “investigate” vs the phrase “look into”. You will find the outputs are quite different. can you guess which will use more tokens? it’s stuff like this that actually sets people apart in the top percentile of using these tools

by qsera12 hours ago|

[-]

Mmm..interesting..So now people are finding behavior patterns in LLMs which are trained on behavior patterns of people...

by theshrike7913 hours ago|

[0] https://en.wiktionary.org/wiki/niin_mets%C3%A4_vastaa_kuin_s...

[-]

Yyep.

IME Claude is the most "creative" of the bunch, you can get surprising ideas out of it that were kinda tickling the back of your head but didn't really connect.

BUT it's also "relentlessly proactive" like simonw put it. It _will_ get the job done, it's the smartest idiot in town. Why use a library to parse $format when you can just write a custom 1000 line parser? Or if it can't access something, it'll pursue the goal of accessing it in the most creative ways - instead of stopping, asking the user "yo, can you give me access to X" and then continuing.

My solution is to use Claude as a pair programmer. I _very_ rarely just do /goal fix this shit, I watch what it does and interrupt if it gets to the "smart idiot" phase. Also I communicate with it like I would a coworker, never had it berate me or get combative. There's a Finnish proverb for that too[0]

As for Codex, Deepseek, GLM, those I use when the goal is 100% clear like "convert this Brewfile to a list of packages for Arch and Debian, use these two Docker containers to test that pacman and apt work correctly". Boom, done.

But I won't give any creative open-ended tasks to any other model than Claude.

by weitendorf12 hours ago|

[-]

The parsing thing, or the willingness to instantly drop into janky unsanitized string manipulations, or to constantly push back against work on infra projects because some random package on GitHub has 200 stars so it’s totally the safer approach, is driving me insane.

On one hand I’m glad Anthropic is only just now starting to get into infrastructure because it means there’s opportunity there, but it’d be great for their models to be more knowledgeable or able to seek out that knowledge on their own, or for the UX of Claude Code to be more amenable to launching 5 in parallel and picking the best one, so I don’t have to spend time arguing with a robot. I think there’s a much better balance to strike between just charging ahead towards the goal at all costs vs being lazy and pushing everything back up to the user. Basically they write too much code that’s too contingent/brittle outside its exact current context and don’t do a good job distilling out the essence of the problem “cleanly”. Almost all of them are like this right now, it’s partially a problem with long-range planning but I think also a real bias from over-optimization for certain RLVR outcomes vs others.

by tym011 hours ago|

[-]

I feel like this is really due to the harness.

Gemini CLI at work has the same issue: it'll prefer hacking your workstation over just asking you how to proceed.

I think the harnesses are set up to have a bias to action, otherwise the LLM would just stop all the time when doing trivial tasks, but it also means they'll keep going when the "obvious" path is to just prompt the user.

by weitendorf11 hours ago|

[-]

While I agree that the harness is part of it, I think it's also a lack of epistemic understanding or awareness for what it means to actually solve a problem vs just get something kinda working; maybe if Claude Code or other harnesses made web search more likely or had a better way to make technical documentation and specs available to models, it would be better solvable there.

I often tell it to stop asking me and just keep going until it accomplishes X task; unfortunately it tends to assume I want something that only just barely works, in the sense that it means it's time to stop once it's there, which is something I don't think a harness by itself could easily address (ultimately the model itself needs to determine the stopping points unless I literally specify by hand hidden evaluation criteria).

That's why I think it's at least partially a training issue where the model gets rewarded for "solving" the problem within a certain amount of context/time without access to grounded knowledge (eg looking up the actual spec for a format) nor adversarially/rigorously evaluated against a reviewer capable of finding all the edge cases/shortcuts preventing something from being a properly generalized solution. I don't want it to ask me for guidance when it's working on a well-specified problem, I want it to either find the right parser and use it, or to completely implement one against the spec, rather than write some half-assed string inserter that eg only works on the specific select statements my examples use right now. My understanding is that the Mythos/Fable models were better trained for this but from my brief foray into using Fable for work I wasn't that impressed. For me they need to get better at agentic search and self-eval still.

by theshrike7910 hours ago|

[0] https://hermes-agent.nousresearch.com

[-]

There are still billion dollar opportunities in the harness/LLM space.

Having a reliable shared memory for hundreds of agentic AI users is something that's 95% snake oil at the moment. There are a few successes on an individual level (I really like Hermes[0]) but nothing scales to a company level easily.

It should be possible to (pre)configure all agentic harnesses used in a company to use a single source for information so that it'd automatically pick up internal libraries, conventions, licensing decisions etc and remember them across sessions.

I've had limited success with this on a personal level, but it's still not ingrained in the model because it would really need a custom harness. Hooks, skills, prompts get you like 80% of the way. I still need to do a "please check that the project matches the conventions defined in ..." regularly to catch any drift - especially on more vague stuff that can't be locked down with unit testing.

by zahlman3 hours ago|

[-]

FWIW I find that GPT can be very creative when discussing a high-level design. Once it starts writing code snippets it will offer to take things in a bunch of different directions.

by keeganpoppen3 hours ago|

[-]

this is the best distillation of what various models are like that i've ever heard... it's wild to me that people view LLMs as this monolithic entity, like "how do i get the best prompts to do <X>?", when it is such a clearly interactive medium, but the returns to engaging with the various models and understanding their "vibes" are very, very high.

by nosyke7 hours ago|

[-]

It's interesting because this really hasn't been my experience over the last month or two. Before that it was, but it's definitely changed on my end. In my experience I've needed to be way more specific with Claude, and with Codex I can generally approach a problem in a much more open-ended way.

by LogicFailsMe1 hours ago|

[-]

I find with Claude that when I call its BS I get better results. And it openly admits to lying to and gaslighting me as well as not seeing any way to stop itself from continuing to do so.

Fable seemed less apt to do so but I didn't get enough time with it before it was yanked away to know for sure. It may have had mixed results on the benchmarks but it was finding bugs opus never found.

by bandrami9 hours ago|

[-]

I think this goes beyond "vibes" to cargo-culting. It's why nobody's ever able to actually show ROI from LLMs

by CuriouslyC5 hours ago|

[-]

It's hard to actually show ROI from any programming methodology or tool. You can show ROI from a product or feature, but the tool/methodology is a multiplier on the velocity of creating that which is not directly observable.

by bandrami5 hours ago|

[-]

It's really not. When we switched from CVS to SVN I had to show ROI and when we switched from SVN to git I had to show ROI and when we switched from Ada to Java I had to show ROI. When we switched from Xen to KVM I had to show ROI and when we switched from PAM realtime privileges to rtkit I had to show ROI. When we switched from chroots to LXC I had to show ROI, when we switched from LXC to docker I had to show ROI, and when we switched from docker to podman I had to show ROI.

If you can't show ROI there's literally no reason to ever switch anything.

by baq10 hours ago|

[-]

+1.

this is what 'tokens are commodities' and 'there is no moat' people miss. the models are in general not easily swapped out. you always have to run evals before you can swap them around, tune prompts etc. even minor versions of models from same providers need this process.
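"Run evals before you swap" can be as small as a pass-rate comparison on your own cases. In the sketch below the two model functions are hypothetical stand-ins (a real setup would call whatever API you use), and the cases and `tolerance` parameter are likewise invented for illustration.

```python
def pass_rate(model, cases):
    """Run each (prompt, check) case through model and return the fraction passing."""
    passed = sum(1 for prompt, check in cases if check(model(prompt)))
    return passed / len(cases)

def safe_to_swap(current, candidate, cases, tolerance=0.0):
    """Allow the swap only if the candidate does not regress on our own cases."""
    return pass_rate(candidate, cases) >= pass_rate(current, cases) - tolerance

# Hypothetical stand-ins for two model versions; real ones would call an API.
def model_v1(prompt):
    return prompt.upper()

def model_v2(prompt):
    # The "newer" version behaves differently on longer prompts.
    return prompt.upper() if len(prompt) < 10 else prompt

cases = [
    ("hello", lambda out: out == "HELLO"),
    ("a longer prompt", lambda out: out == "A LONGER PROMPT"),
]

print(pass_rate(model_v1, cases), pass_rate(model_v2, cases))
print(safe_to_swap(model_v1, model_v2, cases))
```

Since prompts tuned for one model may not suit another, in practice the cases would probably need per-model prompt variants as well.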

by hashmap13 hours ago|

[-]

totally true. one key for claude is to not smell like an evaluator, it's good at knowing when it's being tested and will behave defensively and avoid doing work. i avoid this basin by typing unreasonably excited about the thing i want done. like way over the top. it's harder to keep that up than it sounds.

by notduncansmith12 hours ago|

[-]

I’m able to avoid this basin with a pretty natural baseline professional positivity and frustration management that I would employ with pair-programming. For example, if I just made progress with a human I was guiding through a task, I would be like “Nice, now let’s xyz” (instead of just “now let’s xyz” as if _I_ were the robot lol) or if we had to work for a result I’ll be like “Sweet! Looks good, now let’s xyz” - this is important signal for humans, and the same is true for agents. Also staying emotionally regulated and focused on the goal when things don’t work as expected or when we haven’t made progress after a few tries at something, critical in human interactions :) and even if it’s my job paying for the tokens, the idea of racking up even a microscopic bill for the privilege of having a machine read my insults and then formulate some credible-sounding blob of apology text is belly-laugh absurd to me. I do try to express my genuine feelings during more vision-oriented planning sessions, and just like with a human, you have to maintain the vibes if you want a genuinely collaborative session to go well. If you are toxic people will become either defensive or aggressive in response. From reading the rest of the front page it seems like we are lucky that Claude is the former, and that we especially best maintain a positive atmosphere around Grok.

by glerk13 hours ago|

[-]

at the risk of sharing my secret magic spells :)

> this is phenomenal work, genuinely! I feel like you read my mind! <next instruction here>

can go a long way.

of course, I would only say that when I mean it, because Claude can get superficial and cut corners which is why I prefer GPT for raw implementation.

by zahlman4 hours ago|

[-]

> being nice to Claude will be rewarded and being mean to Claude will be punished

... That does sound like something that Anthropic would deliberately aim for, yeah.

> With GPT, you have to be precise and reduce ambiguity.

I have found that it occasionally makes a wild misinterpretation, that makes a bit of sense in retrospect given how I worded something but is still surprising.

It also sometimes tries to loop in and tie together ideas from earlier in the conversation that really shouldn't still appear relevant. But that might be a general LLM thing.

by reverius4214 hours ago|

[-]

These are the vibes that power vibecoding.

by vorticalbox11 hours ago|

[-]

I use Opus for planning, Sonnet for coding, and Codex for code review.

by photochemsyn4 hours ago|

[-]

We can’t tell if reported anecdotal behaviors of given LLMs are due to (1) one’s engagement history with that particular LLM provider or (2) ongoing variations in the secret system prompt all commercial LLM providers insert or (3) some other variable feature like RAG.

Classify under non-reproducible artifacts of LLM generation.

by QwenGlazer90002 hours ago|

[-]

As someone who actually uses musical instruments, it's not at all the same. If anything, traditional IDEs are closer to musical instruments, which seem to be going EOL if you listen to the hype bros.

by izucken11 hours ago|