> One of us is misleading people here, and I don't think it's me.

Firstly, I am not the one with an LLM-influencer side-gig.

Secondly - no, sorry, please don't move the goalposts. You did not answer my main argument, which is: how does a "tool" which constantly changes its behaviour deserve to be called a tool at all? If a tailor had scissors which sometimes cut the fabric just a bit, and sometimes cut it completely differently every time he used them, would you tell the tailor he is not using them right too?

Thirdly, you are now contradicting yourself. First you said we need to live with the fact that they are unpredictable. Now you are sugarcoating it into being "a bit unpredictable", or "not as nearly unpredictable". I am not sure whether you are doing this intentionally or you really want to believe in the "magic", but either way you are ignoring the basic tenets of how this technology works. I'd be fine if they used it to generate cheap holiday novels or erotica - but clearly four years of experimenting with the crap machines to write code have created a huge pushback in the community. We don't need the proverbial scissors which cut our fabric differently each time!

reply
> how does a "tool" which constantly changes its behaviour deserve being called a tool at all?

Let's go with blast furnaces. They're definitely tools. They change over time - a team might constantly run one for twenty years but still need to monitor and adjust how they use it as the furnace itself changes behavior due to wear and tear (I think they call this "drift".)

The same is true of plenty of other tools - pottery kilns, cast iron pans, knife sharpening stones. Expert tool users frequently use tools that change over time and need to be monitored and adjusted.

I do think dogs, horses, and other animal tools remain an excellent example here as well. They're unpredictable and you have to constantly adapt to their latest behaviors.

I agree that LLMs are unpredictable in that they are non-deterministic by nature. I also think that this is something you can learn to account for as you build experience.

I just fed this prompt to Claude Code:

  Add to_text() and to_markdown() features to justhtml.html - for the whole document or for CSS selectors against it
  
  Consult a fresh clone of the justhtml Python library (in /tmp) if you need to

It did exactly what I expected it would do, based on my hundreds of previous similar interventions with that tool: https://github.com/simonw/tools/pull/162
reply
Whether it's blast furnaces or carbon fiber, the wear and tear (macroscopic changes) as well as material fatigue (molecular changes) are something the manufacturer will specify, within some margin of error, and you pretty much know what to expect - unless you are a smartass billionaire building an improvised sub out of carbon fiber that was long past its expiry date. However, the carbon fiber or your blast furnace won't break just on their own. So it's a weak analogy and a stretch at that.

Now for your experiment: it has no value because a) you and I both know that if you told your LLM its output was shit, it would immediately "agree" with you and go off to produce some other crap, and b) for this to be a scientifically valid experiment at all, I'd expect on the order of 10,000 repetitions, each providing exactly the same output. But here too, you and I both know that already the 2nd iteration will introduce some changes.

So stop fighting the obvious and repeat after me: LLMs are shit for any serious work.
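
For what it's worth, the repetition experiment described in (b) is easy to sketch. The snippet below is only an illustration under stated assumptions: `agreement_rate` and `fake_model` are hypothetical names, and `fake_model` is a random stand-in rather than a call to any real LLM API. It counts how often repeated runs of the same prompt produce the most common output.

```python
import random
from collections import Counter

def agreement_rate(generate, prompt, runs=100):
    """Call generate(prompt) `runs` times.

    Returns (share of runs matching the most common output,
             number of distinct outputs seen).
    """
    outputs = Counter(generate(prompt) for _ in range(runs))
    top_count = outputs.most_common(1)[0][1]
    return top_count / runs, len(outputs)

# Hypothetical stand-in for a non-deterministic model call (assumption:
# a real experiment would replace this with an actual model invocation).
_rng = random.Random(0)

def fake_model(prompt):
    return _rng.choice(["variant A", "variant A", "variant B"])

if __name__ == "__main__":
    rate, distinct = agreement_rate(fake_model, "same prompt", runs=1000)
    print(f"most common output share: {rate:.2f}, distinct outputs: {distinct}")
```

A rate of 1.0 with a single distinct output would be the "exactly the same output" bar; anything lower quantifies how much the runs diverge.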
reply
Why would I agree that "LLMs are shit for any serious work" when I've been using them for serious work for two+ years, as have many other people whose skills I respected from before LLMs came along?

I wrote about another solid case study this morning: https://simonwillison.net/2025/Dec/14/justhtml/

I genuinely don't understand how you can look at all of this evidence and still conclude that they aren't useful for people who learn how to use them.

reply
Well, you don't have to agree with that statement. But I haven't seen a serious refutation of my arguments either.
reply
> Let's go with blast furnaces. They're definitely tools. They change over time - a team might constantly run one for twenty years but still need to monitor and adjust how they use it as the furnace itself changes behavior due to wear and tear (I think they call this "drift".)

Now let's make the analogy more accurate: let's imagine the blast furnace often ignores the operator controls, and just does what it "wants" instead. Additionally, there are no gauges and there is no telemetry you can trust (it might have some that the furnace will occasionally falsify, but you won't know when it's doing that).

Let's also imagine that the blast furnace changes behavior minute-to-minute (usually in the middle of the process) between useful output, useless output (requires scrapping), and counterproductive output (requires rework which exceeds the productivity gains of using the blast furnace to begin with).

Furthermore, the only way to tell which one of those 3 options you got is to manually inspect every detail of every piece of every output. If you don't do this, the output might leak secrets (or worse) and bankrupt your company.

Finally, the operator would be charged for usage regardless of how often the furnace actually worked. At least this part of the analogy already fits.

What a weird blast furnace! Would anyone try to use this tool in such a scenario? Not most experienced metalworkers. Maybe a few people with money to burn. In particular, those who sing the highest praises of such a tool would likely be ignorant of all these pitfalls, or have a vested interest in the tool selling.

reply
> What a weird blast furnace! Would anyone try to use this tool in such a scenario? Not most experienced metalworkers.

Absolutely wrong. If this blast furnace would cost a fraction of other blast furnaces, and would allow you to produce certain metals that were too expensive to produce previously (even with a high error rate), almost everyone would use it.

Which is exactly what we're seeing right now.

Yes, you have to distinguish the marketing message from the real value. But in terms of bang for buck, Claude Code is an absolute blast (pun intended)!

reply
> this blast furnace would cost a fraction of other blast furnaces

Totally incorrect: as we already mentioned, this blast furnace actually costs just as much as every other blast furnace to run all the time (which they do). The difference is only in the outputs, which I described in my post and now repeat below, with emphasis this time.

Let's also imagine that the blast furnace changes behavior minute-to-minute (usually in the middle of the process) between useful output, useless output (requires scrapping), and counterproductive output ——>(requires rework which exceeds the productivity gains of using the blast furnace to begin with)<——

Does this describe any currently-operating blast furnaces you are aware of? Like I said, probably not, for good reason.

reply
You appear to be arguing that powerful, unpredictable tools like LLMs need to be run carefully with plenty of attention paid to catching their mistakes and designing systems around them (like sandboxed coding agent harnesses) that allow them to be operated productively and safely.

I couldn't agree more.

reply
> You appear to be arguing that powerful, unpredictable tools like LLMs need to be run carefully with plenty of attention

I did not say that. I said that most metalworkers familiar with all the downsides (only 1 of which you are referring to here) would avoid using such an unpredictable, uncontrollable, uneconomical blast furnace entirely.

A regular blast furnace requires the user to be careful. A blast furnace which randomly does whatever it wants from minute to minute, producing bad output more often than good, including bad output that costs more to fix than the furnace cost to run, more than any cost savings, with no way to tell or meaningfully control it, is pretty useless.

Saying "be careful" about a machine with no effective observability, predictability, or controls is a silly misnomer when no amount of care will bestow them on the machine.

What other tools work this way, and are in widespread use? You mentioned horses, for example: What do you think usually happens to a deranged, rabid, syphilitic working horse which cannot effectively perform any job with any degree of reliability, and which often unpredictably acts out in dangerous and damaging ways? Is it usually kept on the job and 'run carefully'? Of course not.

reply
> I know you absolutely hate being told you're holding them wrong... but you're holding them wrong.

Wow, was that a shark just then?

reply