Yes, you have to learn those things. LLMs are hard to use.

So are animals, but we've used dogs and falcons and truffle-hunting pigs as tools for thousands of years.

Non-deterministic tools are still tools, they just take a bunch more work to figure out.

reply
It's like having Michael Jordan with dementia on your team. You start out mesmerized by how many points he can score, and then you get incredibly frustrated that he forgets he has to dribble and shoot into the correct hoop.
reply
Spot on. Not to mention all the fouls and traveling the demented "all star" makes for your team, effectively negating any point gains.
reply
No, please, stop misleading people, Simon. People use tools to make things easier for them, not harder, and a tool which I cannot steer predictably is not a goddamn tool at all! The sheer persistence that AI promoters like you are willing to invest just to gaslight us all into thinking we were dumb and did not know how to use the shit-generators is really baffling. Understand that a lot of us are early adopters, and we see this shit for what it is: the most serious mess-up by "Big Tech" since Zuckerberg burned $77B on his metaverse idiocy.

By the way, animals are not tools. People do not use them; they engage with them as helpers, companions and, for some people, even friends of sorts. Drop your LLM and try engaging with someone who has a hunting dog, for example. They'd be quite surprised if you referred to their beloved retriever as a "tool". And you might learn something about a real intelligence.
reply
Your insistence that LLMs are not useful tools is difficult for me to empathize with as someone who has been using them successfully as useful tools for several years - and sharing in great detail how I am using them.

https://simonwillison.net/2025/Dec/10/html-tools/ is the 37th post in my series about this: https://simonwillison.net/series/using-llms/

https://simonwillison.net/2025/Mar/11/using-llms-for-code/ is probably still the most useful of those.

I know you absolutely hate being told you're holding them wrong... but you're holding them wrong.

They're not nearly as unpredictable as you appear to think they are.

One of us is misleading people here, and I don't think it's me.

reply
> One of us is misleading people here, and I don't think it's me.

Firstly, I am not the one with an LLM-influencer side gig.

Secondly, no, sorry, please don't move the goalposts. You did not answer my main argument, which is: how does a "tool" which constantly changes its behaviour deserve to be called a tool at all? If a tailor had scissors which sometimes cut the fabric just a bit, and sometimes cut completely differently every time they were used, would you tell the tailor he is not using them right too?

Thirdly, you are now contradicting yourself. First you said we need to live with the fact that they are unpredictable. Now you are sugarcoating it into being "a bit unpredictable", or "not nearly as unpredictable". I am not sure if you are doing this intentionally or if you really want to believe in the "magic", but either way you are ignoring the ground tenets of how this technology works. I'd be fine if they used it to generate cheap holiday novels or erotica, but four years of experimenting with the crap machines to write code has clearly created a huge pushback in the community. We don't need the proverbial scissors which cut our fabric differently each time!

reply
> how does a "tool" which constantly changes its behaviour deserve to be called a tool at all?

Let's go with blast furnaces. They're definitely tools, and they change over time: a team might run one continuously for twenty years but still need to monitor and adjust how they use it, as the furnace itself changes behavior due to wear and tear (I think they call this "drift").

The same is true of plenty of other tools - pottery kilns, cast iron pans, knife sharpening stones. Expert tool users frequently use tools that change over time and need to be monitored and adjusted.

I do think dogs, horses, and other working animals remain an excellent example here as well. They're unpredictable, and you have to constantly adapt to their latest behaviors.

I agree that LLMs are unpredictable in that they are non-deterministic by nature. I also think that this is something you can learn to account for as you build experience.

I just fed this prompt to Claude Code:

  Add to_text() and to_markdown() features to justhtml.html - for the whole document or for CSS selectors against it
  
  Consult a fresh clone of the justhtml Python library (in /tmp) if you need to

It did exactly what I expected it to do, based on hundreds of previous similar interactions with that tool: https://github.com/simonw/tools/pull/162
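The practical way to account for that non-determinism is to never trust a single completion: validate every output against checks you control, and retry on failure. Here is a minimal sketch of that pattern, with a hypothetical `call_model` stub standing in for a real LLM API (the function name, schema, and retry count are all illustrative assumptions, not any particular vendor's API):

```python
import json


def call_model(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM API call. In practice this
    # is non-deterministic and occasionally returns invalid output.
    return '{"title": "Example", "tags": ["demo"]}'


def reliable_extract(prompt: str, retries: int = 3) -> dict:
    """Call the model, validate the response, and retry on failure.

    Non-determinism is handled by checking every response against a
    schema we control, rather than trusting any single completion.
    """
    last_error = None
    for _ in range(retries):
        raw = call_model(prompt)
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as err:
            last_error = err
            continue
        # Validate only the fields the rest of the program depends on.
        if isinstance(data.get("title"), str) and isinstance(data.get("tags"), list):
            return data
        last_error = ValueError(f"unexpected shape: {data!r}")
    raise RuntimeError(f"model failed validation after {retries} tries: {last_error}")


result = reliable_extract("Summarize this page as JSON with title and tags")
print(result["title"])
```

The point of the sketch is the loop shape, not the stub: the harness around the model is deterministic even though the model is not, which is the same design idea behind sandboxed coding agents that run tests against whatever the model produces.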
reply
Whether it's blast furnaces or carbon fiber, the wear and tear (macroscopic changes) as well as material fatigue (molecular changes) is something specified by the manufacturer, within some margin of error, and you pretty much know what to expect - unless you are a smartass billionaire building an improvised sub out of carbon fiber long past its expiry date. Either way, the carbon fiber or your blast furnace won't break just on their own. So it's a weak analogy, and a stretch at that.

Now for your experiment: it has no value, because a) you and I both know that if you told your LLM its output was shit, it would immediately "agree" with you and go off to produce some other crap, and b) for this to be a scientifically valid experiment at all, I'd expect on the order of 10,000 repetitions, each producing exactly the same output. But on this, too, you and I both know the 2nd iteration will already introduce some changes. So stop fighting the obvious and repeat after me: LLMs are shit for any serious work.
reply
Why would I agree that "LLMs are shit for any serious work" when I've been using them for serious work for two-plus years, as have many other people whose skills I respected from before LLMs came along?

I wrote about another solid case study this morning: https://simonwillison.net/2025/Dec/14/justhtml/

I genuinely don't understand how you can look at all of this evidence and still conclude that they aren't useful for people who learn how to use them.

reply
Well, you don't have to agree with that statement. But I haven't seen a serious rebuttal of my arguments either.
reply
> Let's go with blast furnaces. They're definitely tools. They change over time - a team might constantly run one for twenty years but still need to monitor and adjust how they use it as the furnace itself changes behavior due to wear and tear (I think they call this "drift".)

Now let's make the analogy more accurate: let's imagine the blast furnace often ignores the operator controls and just does what it "wants" instead. Additionally, there are no gauges, and there is no telemetry you can trust (it might have some, but the furnace will occasionally falsify it, and you won't know when it's doing that).

Let's also imagine that the blast furnace changes behavior minute-to-minute (usually in the middle of the process) between useful output, useless output (requires scrapping), and counterproductive output (requires rework which exceeds the productivity gains of using the blast furnace to begin with).

Furthermore, the only way to tell which one of those 3 options you got, is to manually inspect every detail of every piece of every output. If you don't do this, the output might leak secrets (or worse) and bankrupt your company.

Finally, the operator would be charged for usage regardless of how often the furnace actually worked. At least this part of the analogy already fits.

What a weird blast furnace! Would anyone try to use this tool in such a scenario? Not most experienced metalworkers. Maybe a few people with money to burn. In particular, those who sing the highest praises of such a tool would likely be ignorant of all these pitfalls, or have a vested interest in the tool selling.

reply
> What a weird blast furnace! Would anyone try to use this tool in such a scenario? Not most experienced metalworkers.

Absolutely wrong. If this blast furnace cost a fraction of other blast furnaces, and allowed you to produce certain metals that were previously too expensive to produce (even with a high error rate), almost everyone would use it.

Which is exactly what we're seeing right now.

Yes, you have to distinguish marketing message vs real value. But in terms of bang for buck, Claude Code is an absolute blast (pun intended)!

reply
> this blast furnace would cost a fraction of other blast furnaces

Totally incorrect: as we already mentioned, this blast furnace actually costs just as much as every other blast furnace to run all the time (which they do). The difference is only in the outputs, which I described in my post and now repeat below, with emphasis this time.

Let's also imagine that the blast furnace changes behavior minute-to-minute (usually in the middle of the process) between useful output, useless output (requires scrapping), and counterproductive output ——>(requires rework which exceeds the productivity gains of using the blast furnace to begin with)<——

Does this describe any currently-operating blast furnaces you are aware of? Like I said, probably not, for good reason.

reply
You appear to be arguing that powerful, unpredictable tools like LLMs need to be run carefully with plenty of attention paid to catching their mistakes and designing systems around them (like sandboxed coding agent harnesses) that allow them to be operated productively and safely.

I couldn't agree more.

reply
> You appear to be arguing that powerful, unpredictable tools like LLMs need to be run carefully with plenty of attention

I did not say that. I said that most metalworkers familiar with all the downsides (only 1 of which you are referring to here) would avoid using such an unpredictable, uncontrollable, uneconomical blast furnace entirely.

A regular blast furnace requires the user to be careful. A blast furnace which randomly does whatever it wants from minute to minute, producing bad output more often than good (including bad output that costs more to fix than any savings from running the furnace), with no way to tell which you got or to meaningfully control it, is pretty useless.

Saying "be careful" about a machine with no effective observability, predictability, or controls is empty advice, when no amount of care will bestow those qualities on the machine.

What other tools work this way, and are in widespread use? You mentioned horses, for example: What do you think usually happens to a deranged, rabid, syphilitic working horse which cannot effectively perform any job with any degree of reliability, and which often unpredictably acts out in dangerous and damaging ways? Is it usually kept on the job and 'run carefully'? Of course not.

reply
> I know you absolutely hate being told you're holding them wrong... but you're holding them wrong.

Wow, was that a shark just then?

reply
> So are animals, but we've used dogs and falcons and truffle hunting pigs as tools for thousands of years.

Dogs learn their jobs way faster, more consistently and more expressively than any AI tool.

Trivially, dogs understand "good dog" and "bad dog" for example.

Reinforcement learning with AI tooling clearly seems not to work.

reply
> Dogs learn their jobs way faster, more consistently and more expressively than any AI tool.

That doesn't match my experience with dogs or LLMs at all.

reply
Ever heard of service dogs? Or police dogs? Now tell me, when will LLMs ever be safe to use as assistance for blind people? Or will Big Tech at some point release some sloppy blind-assistance tool based on LLMs and unleash LLM influencers like yourself to start gaslighting the users into thinking they were "not holding it right"? For mission- and life-critical problems, I'll take a dog any day, thank you very much!
reply
I've talked about vision LLMs with a few people who are blind, and they're very, very positive about them.

They fully understand their limitations. Users of accessibility technology are extremely good at understanding the precise capabilities of the tools they use - which reminds me that screenreaders themselves are a great example of unreliable tools due to the shockingly bad web apps that exist today.

I've also discussed the analogy to service dogs with them, which they found very apt given how easily their assistive tool could be distracted by a nearby steak.

The one thing people who use assistive technology do not appreciate is being told that they shouldn't try a technology out themselves because it's unreliable and hence unsafe for them to use!

reply
Please, for once, answer the question being asked, without replacing both the question and the stated intention with something else. I was willing to give you the benefit of the doubt, but I am now really wondering where your motivation for these vaguely constructed "analogies" is coming from. Is the LLM industry that desperate? We were all "positive" about LLM possibilities once.

I am asking you: when will LLMs be so reliable that they can be used in place of service dogs for blind people? Do you believe this technology will ever be that safe? Have you ever actually seen a service dog? I don't think you can distract a service dog with a steak. Did you know they start their training basically at one year of age, and that it takes up to two years to train them? Do you think they spend those two years learning to fetch properly?

Also, I never said people should not be allowed to "try" a technology. But as with drugs, tools for the impaired, the sick, etc. also undergo a verification and licensing process; I am surprised you did not know that. So I am asking you again: can you ever imagine an LLM passing those high regulatory hurdles, so that it can be safely used to assist impaired people? Service dogs must be doing something right if so many of them are safely assisting so many people today, mustn't they?
reply
You’ve asked the right questions, but you don’t want to find the answers. That’s on you.
reply