Maybe it works some of the time, but it isn't a solution that works every time.
It reminds me of people hovering near a slot machine that hasn't paid out, waiting for the current player to get up, as if they've solved slot machines.
While I don't mind putting something in a loop until the tests pass, I'm less comfortable doing that when providers are silently rerouting to lower-quality models or, in Google's case, burning quota faster to ease their own server load without being transparent about what the "standard limits" are to begin with. [1]
I'm hopeful I'll be more comfortable with these "slot machines" once frontier models can be run locally on hardware I can actually afford. Then I'd know exactly what I'm getting, instead of jumping at shadows while providers play tricks behind the scenes to ease their own load, without admitting that customers get less for their money as the service gets more popular.
[1]: https://support.google.com/gemini/answer/16275805?hl=en&sjid...
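For what it's worth, the "loop until the tests pass" idea can be sketched in a few lines. This is only a rough sketch under assumptions: `generate` and `passes` are hypothetical stand-ins for "ask the model" and "run the test suite", and the stubs below are not any real provider API. The one design choice worth noting is the hard cap on attempts, so the loop can't quietly burn quota forever.

```python
from typing import Callable, Optional, Tuple


def retry_until_pass(
    generate: Callable[[int], str],   # hypothetical: produces a candidate for attempt n
    passes: Callable[[str], bool],    # hypothetical: runs the tests on a candidate
    max_attempts: int = 5,
) -> Tuple[Optional[str], int]:
    """Return (first passing candidate, attempts used), or (None, max_attempts)."""
    for attempt in range(1, max_attempts + 1):
        candidate = generate(attempt)
        if passes(candidate):
            return candidate, attempt
    return None, max_attempts


# Stub "model" and "test suite" for illustration only: the third attempt passes.
def fake_generate(attempt: int) -> str:
    return f"candidate-{attempt}"


def fake_passes(candidate: str) -> bool:
    return candidate == "candidate-3"


result, used = retry_until_pass(fake_generate, fake_passes, max_attempts=5)
gave_up, spent = retry_until_pass(fake_generate, fake_passes, max_attempts=2)
print(result, used)     # passing candidate and how many attempts it took
print(gave_up, spent)   # None when the cap is hit first
```

Note that the loop has no way of telling whether attempt 3 was served by the same model as attempt 1, which is the part that makes me uneasy.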
Last I saw, engineers working at OpenAI denied this on HN.
I saw that someone set up a tracker that aims to record the performance of the models, and so far it has not shown any statistically significant deviation in performance for Codex, and not yet enough data for Claude: https://marginlab.ai/trackers/codex/
The firm [Anthropic] would deliberately degrade the model’s performance in ways that were invisible to the user.
For such a thing to be useful, it's enough that it works substantially more often than not having those instructions in.
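One way to check "works substantially more often" rather than eyeballing it is a plain two-proportion z-test over repeated runs with and without the instructions. A minimal sketch, assuming independent runs and a pass/fail outcome; the counts below are invented purely for illustration and are not measurements of any model.

```python
import math


def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int):
    """One-sided test of whether A's success rate exceeds B's.

    Returns (z, p_value), where p_value is P(Z >= z) under the null
    hypothesis that both rates are equal.
    """
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        # All runs passed or all failed in both groups: no evidence either way.
        return 0.0, 0.5
    z = (p_a - p_b) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 70/100 passes with the instructions, 50/100 without.
z, p = two_proportion_z(70, 100, 50, 100)
print(f"z = {z:.2f}, one-sided p = {p:.4f}")
```

With only a handful of runs per arm the test has little power, so "it worked the three times I tried it" tells you almost nothing either way.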
Playing a B on a saxophone always plays a B.
But your analogy remains solid if you substitute e.g. a piano and a reasonably proficient player. A single note would be nearly indistinguishable between players... But a full piece most certainly will sound different.
The original take was "LLMs are very much like playing an instrument". I think they are very much NOT like playing an instrument.
While different musicians will produce different results, one musician won't get drastically different results on different days or when trying a different "copy" of the same instrument. If you can play the violin on your violin and I lend you my violin, you will still be able to play very consistently. You may argue that the sound will differ and you will have to adapt slightly, but that's not remotely similar to the randomness coming from LLMs.
That's only if both violins are tuned the same way, and one must continually tune them lest they get out of sync.
Similarly, an LLM can be extremely consistent if tuned properly -- indeed, if you fix the weights and settings, they can be made "essentially deterministic" for many prompts!
This is because LLMs have aspects of chaotic dynamical systems, where small changes in initial conditions can lead to vastly different outcomes. That property is independent from nondeterminism.
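A toy illustration of that distinction, not a model of an LLM: the logistic map is fully deterministic, so the same input always gives the same trajectory, yet a tiny nudge to the starting value ends up somewhere completely different. The function name and the chosen constants are just assumptions for the demo.

```python
def logistic_trajectory(x0: float, steps: int = 60, r: float = 4.0) -> list:
    """Iterate x -> r * x * (1 - x), a standard chaotic map when r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


base = logistic_trajectory(0.2)
repeat = logistic_trajectory(0.2)            # same input, run again
nudged = logistic_trajectory(0.2 + 1e-10)    # almost the same input

identical = base == repeat                   # determinism: exact same output
early_gap = abs(base[1] - nudged[1])         # still tiny after one step
max_gap = max(abs(a - b) for a, b in zip(base, nudged))  # large later on

print("same input, same output:", identical)
print("gap after 1 step:", early_gap)
print("largest gap over the run:", max_gap)
```

So "same prompt, same settings, same output" and "slightly different prompt, wildly different output" can both be true of the same system.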
You know what we are talking about. Tuning, poor playing, all of that is mild variation from what we know the instrument is supposed to do every time, and we can consistently target the notes it is supposed to hit. You're comparing slight tonal variations to completely different outputs from the same inputs. If I hit a "C" on the piano, it is going to play "C." If it does not, then the piano is not functioning properly. LLMs for some reason get a pass on this, and it makes them very distinct from musical instruments.
This feels like a very nitpicky steel man, not a productive attempt at discussion.
LLMs do not operate consistently, and they make their own errors while we argue about which incantation makes them less inconsistent, knowing they will never actually perform as expected.
I played woodwinds regularly for 15 years so I feel fine with my example.