>Frequent LLM users already know not to do that.
And I think that’s the biggest problem. Amid the current push to use LLMs across orgs and teams, there is a large group of people (maybe even a majority) who use them every day but who have never approached anything as technical as a “harness” before, let alone an entire setup.
For them the behavior mentioned here is a major issue.
So the reasonable response to being told you're holding your scissors wrong is to realize that yes, you most likely are holding your scissors wrong[0], and ask the other person for advice (or just to do the thing), or look up a YouTube video and learn, or sign up for a class, or the like.
Expecting mastery in 30 seconds is not a reasonable attitude, but it's unfortunately the lie the software industry has tried to sell people for the past 15 years or so.
--
[0] - There's much more to it than one would think.
I don’t have an example off hand, but I know it’s easy to dismiss something an LLM does as trivial when your own work is unusually specialized. Most devs aren’t creating their own programming languages. I can’t help but think people who hold this opinion also consider the work most software professionals do “trivial” (“you’re just moving strings around, that’s not impressive”).
A lathe operator isn’t any good if they don’t frequently operate lathes.
An articulated-robot implementer needs frequent experience implementing robots to be any good.
That doesn’t mean lathes or robots are useless. Nor does it mean they have failed as products because they require expertise.
I do think it raises questions as to whether vast swathes of the population will be effective at using LLMs. Are they scissors, or a lathe?
To me, learning to use LLMs is like learning anything else: you have to practice and put in the hours to get good. Maybe some harnesses will eventually let LLMs function more as scissors than lathes. This seems to be what Microsoft is trying to do by embedding Copilot in all their products and saying “choose the UI that works best for you”. If that doesn’t end up working, we’ll need another paradigm for “non-technical” users to effectively operate computer assistants.
What does one do when a full editor consumes too much bandwidth^H tokens? Use ed, the standard editor!
Also, as a person who has been developing agentic code tools since before Claude Code, I'm skeptical that str_replace provides an accuracy improvement over a full rewrite.
Back in the day, when SOTA models would do lazy coding like `// ... rest of the code ...`, a full rewrite wasn't easy. Search/replace was fast, efficient, and free of the lazy coding. However, it came with a slight accuracy drop.
Today that accuracy drop might be minimal or absent, but I'm not sure whether it leads to improvements like preventing document corruption.
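To make concrete what I mean by search/replace: a minimal sketch in Python, with a made-up function loosely modeled on the SEARCH/REPLACE blocks that tools like Aider use. The model emits only the anchor and replacement snippets; the harness does the splicing, so the untouched bulk of the file never round-trips through the model's output:

    def apply_search_replace(source: str, search: str, replace: str) -> str:
        # The model emits only `search` and `replace`; the harness splices.
        count = source.count(search)
        if count == 0:
            raise ValueError("search text not found; model must re-anchor")
        if count > 1:
            raise ValueError("search text is ambiguous; model needs more context")
        return source.replace(search, replace, 1)

That accuracy drop mostly lived in the two error branches: the model mis-remembers the anchor text, so the edit bounces or lands on the wrong match.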
They've been decent at full rewrite for 2 years. I don't think they were good at search/replace until a year ago, but I'm not so sure.
It's true that the models of 2 years ago would sometimes make errors in a whole rewrite, e.g. removing comments was fairly common. But I've never seen one randomly remove a single character or anything like that. These days they're really good.
The main reason agentic harnesses use search/replace is speed and cost, surely! Whole-file output is expensive for small changes.
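Back-of-envelope, with made-up but plausible numbers:

    # Hypothetical: a 2,000-line file at ~10 tokens/line, versus a
    # 10-line change plus ~6 lines of search-anchor context.
    full_rewrite_tokens = 2_000 * 10        # ~20,000 output tokens
    search_replace_tokens = (10 + 6) * 10   # ~160 output tokens
    print(full_rewrite_tokens // search_replace_tokens)  # 125x fewer

Output tokens are also the slow, expensive ones, so the gap shows up directly in both latency and the bill.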
I think this is closely related to other sources saying that even with a huge context window, the attention mechanism doesn't reliably back-reference earlier content, so any task over a bigger context is prone to errors.
Because I have some preconception of this, maybe I'm assuming that's what they were saying. Am I missing something?
This team is inexperienced and it shows.
The noise to signal ratio will get worse, even in "academia". Brace yourselves. The kids are growing up in this new world.
On editing tasks, one should only allow programmatic editing commands, the text shouldn't flow through the LLM at all. The LLM should analyze the text and emit commands to achieve a feedback directed goal.
The fact of the matter is, if you want to edit a document by reading it and then regurgitating the entire document with said edits... a human will DO worse than a 25% degradation. It's possible for a human to achieve 0% degradation, but the human would have to ingest the document hundreds of times to reach a state called "memorization". The equivalent in an LLM is called training. If you train a document into an LLM, you can get parity with the memorized human edit in this case.
But the above is irrelevant. The point is LLMs have certain similarities with humans. You need to design a harness such that an LLM edits a document the same way a human would: Search and surgical edits. All coding agents edit this way, so this paper isn't relevant.
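A minimal sketch of the kind of harness I mean, in Python, assuming a made-up JSON command schema and a hypothetical `llm` callable. The document flows into the model, but only commands flow out, so there is nothing to regurgitate:

    import json

    def edit_loop(doc: str, goal: str, llm, max_rounds: int = 5) -> str:
        # The model analyzes the document and emits edit commands;
        # the harness applies them and feeds failures back as the
        # goal-directed feedback signal.
        feedback = ""
        for _ in range(max_rounds):
            reply = llm(f"Goal: {goal}\n\nDocument:\n{doc}\n\n{feedback}"
                        'Reply with JSON: [{"op": "replace", '
                        '"find": "...", "replace": "..."}]')
            try:
                for cmd in json.loads(reply):
                    assert doc.count(cmd["find"]) == 1, f"bad anchor {cmd['find']!r}"
                    doc = doc.replace(cmd["find"], cmd["replace"], 1)
                return doc
            except (AssertionError, KeyError, json.JSONDecodeError) as exc:
                feedback = f"Last attempt failed ({exc}); try again.\n\n"
        return doc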
OR it could be that their concerns are genuine but are ignored in favour of a good-sounding story.
So that is definitely a biased interpretation. This is independent of how accurate my POV or yours is on whether LLMs degrade documents. I am simply saying the experiment conducted is COMPLETELY DIFFERENT from how LLMs AND humans edit papers.
As I was reading this article, a similar thought occurred to me: "I wonder if that's better or worse than a human?" Unfortunately, there was no human baseline in this study. That said, there are studies that compare LLM to human performance. Usually, humans perform much better (like 5-7x better) at long-running tasks.
In other words, a human would probably do better than an LLM on this task.
Humans lose to LLMs in narrow, well-specified text/symbolic reasoning tasks where the model can exploit breadth, speed, and search. Usually, the LLM performed ~15% better than humans, but I saw studies that were as high as 80%. To my surprise, these studies were usually about "soft skills" like creativity and persuasion.
Show your edit by regurgitating this entire thread by hand on a paper. Don't use any additional tools like Find and replace.
Boom there's your baseline. I can simulate the result in my head.
Guys, I'm basically saying the experiment is inaccurate to the practical reality of how LLMs are actually used.
Most LLM users who are not touching code are certainly not going to be using a harness. They're going to take all the documents, slam all those tokens into the context window, see they have only used 500k out of their 1M tokens and say "summarize".