I've definitely experienced this while coding with LLMs. Often, after a flurry of feature work in which I thought I was being reasonably careful but moving very fast, I take a closer look at some small piece of code and go "holy shit". Then I have to spend a few hours going over everything and carefully reworking parts where things didn't quite go how I'd like, where I was unclear, or where the LLM's brainworms kicked in.

Quality is really important to me in its own right, but I also worry about this exact "repeated compression" problem: when my codebase is clean and I have an up-to-date mental model, an LLM can quickly help me churn out some feature work and still leave the codebase in a reasonable state. But as the LLM dirties up the codebase, its past mistakes or misunderstandings compound, and it's likely to flub more and more things. So I have to go back and "restore" things to a correct state before I feel comfortable using the LLM again.

reply
This seems closely related to the problem of model collapse [1][2][3], where LLMs lose the tails of the distribution, and so when you recursively train on the output of an LLM, or otherwise feed the output back into the input in subsequent stages, you lose the precision and diversity that human authors bring to the work. Eventually everything regresses to the mean and anything that would've made the content unique, useful, and differentiated gets lost.
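
You can sketch a toy version of this with nothing but Python's standard library (my own illustration, not from the papers below): fit a Gaussian to a small sample, resample from the fit, and repeat. The spread, i.e. the tails, tends to drift away over generations.

    import random, statistics

    # Toy model collapse: each "generation" is fitted only to a small
    # sample of the previous generation's output. Small n exaggerates it.
    random.seed(0)
    mu, sigma, n = 0.0, 1.0, 10
    for gen in range(1, 201):
        sample = [random.gauss(mu, sigma) for _ in range(n)]
        mu, sigma = statistics.fmean(sample), statistics.stdev(sample)
        if gen % 50 == 0:
            print(f"gen {gen:3d}: sigma = {sigma:.4f}")  # decays toward 0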

My takeaway from this is that AI is a temporary phenomenon, the end stage of the Internet age. It's going to destroy the Internet as we know it, along with much of the technological knowledge of the developed world, and then we're going to have to start fresh and rebuild everything we know. So I'm trying to use AI to identify and download the remaining sources of facts on the Internet: the human-authored stuff that isn't generated for engagement, but comes from the era when people were just putting useful things online to share information.

[1] https://en.wikipedia.org/wiki/Model_collapse

[2] https://www.nature.com/articles/s41586-024-07566-y

[3] https://cacm.acm.org/blogcacm/model-collapse-is-already-happ...

reply
Yep, humans and civilization are subject to the same model-collapse phenomenon as they interact more with LLMs, but engineering knowledge has always been held by a small minority with certain personality characteristics. Maybe that minority will get smaller, but I'm not sure it will completely disappear. There are always people like yourself building archives.
reply
See A Canticle for Leibowitz
reply
There are plenty of AIs that are immune to this because they're trained on something that won't be flooded with slop: e.g. robotics and self-driving cars (both trained on real camera/sensor inputs), or programming/proof-assistant stuff (trained on things that are verifiable).
reply
My experience mostly matches this: I think of a piece of development work as having three phases:

1. Prototype
2. Initial production implementation
3. Hardening

My experience with LLMs is that they solve “writer’s block” problems in the prototyping phase at the expense of making phases 2+3 slower because the system is less in your head. They also have a mixed effect on ongoing maintenance: small tasks are easier but you lose some of the feel of the system.

reply
I completely agree with all of these observations.

And indeed for me, the biggest productivity boost has nothing to do with my "typing speed" or any such nonsense, it's that it can help with writer's block and other kinds of unhelpful inertia.

It kind of reminds me of ADHD medication: it alleviates the "inability to direct attention at one thing" problem, but actually exacerbates the "time blindness" and "hyperfocus" problems.

I think probably a lot of complex tools have these characteristics: useful in some ways, liable to backfire in others, and ultimately context-sensitive (and maybe somewhat unpredictable) in their helpfulness.

Hopefully as LLMs are more widely experimented with by developers, the conversation can continue to move away from thinking about the effects of LLM use in terms of some uniform/fungible "productivity" and towards understanding where it hurts and where it helps, how to tell when it's time to put it away, what kinds of codebases are really hurt by that kind of detached engagement, and what kinds of projects leverage that sort of rapid prototyping the most effectively.

Plausible text generation is an almost magical trick, whether it's generating human language or computer code. But it turns out it's not a silver bullet, no matter how impressive the trick is. It's more interesting than a silver bullet, in fact: it's a system of surprising tradeoffs, even for different phases of the same overall task.

reply
Usually you'll iterate several times on #1, which is where LLMs are really helpful. They let you get working code from stage #1 quite quickly, so you can check the output and behavior, and then oftentimes you'll find that you framed the problem incorrectly in the first place. Then you can fix your problem definition, have the LLM rewrite the code, try it again, and so on, until you get the results you want.

#1 -> #2 is a gap, but it also helps if you ask the LLM to explain its thinking and generate a human-readable design-doc of the approach it took and code organization it used. Then you read the design doc to gain the context, and pick up with #2.

reply
Yeah, a lot of "it doesn't matter how the code looks" convos seem to be ignoring that we know what happens over time when you just make tactical the-tests-still-pass changes over and over and over again. Slowly, some of those tests get corrupted without anyone noticing. And you never had the ENTIRE spec (including all the edge-case but user-relied-on behavior) covered anyway. And then new development gets way harder.
reply
This is definitely most annoying when dealing with software or standards with slightly illogical or hard-to-grasp cases. Recently I worked in one of the software community's favourite spaces, timezones, and kept getting myself and my LLM context polluted by the confusion that arises when mixing POSIX-standard timezone notation with common human-readable formats.

This blog post probably covers my exact headache [0]. In summary, "Etc/GMT+6" actually means UTC-6. I was developing a one-off helper script to bulk-create calendars in a web app via its API, and when trying to validate my CSV+Python script's results, I kept getting confused about when the CSV rows had correct data and when the web app UI did. The LLM probably wrote the Python script to translate this on the fly, but my human-readable "Calendar name" column containing "Etc/GMT+6" would come out as -6 in the web app. This probably would not have been a problem with explicit locations specified, but my use case would not allow for that.
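
For anyone who hasn't hit this before, the inversion is easy to demonstrate with Python's standard-library zoneinfo (3.9+, assuming tzdata is available on your platform):

    from datetime import datetime
    from zoneinfo import ZoneInfo

    # POSIX-style zone names have inverted signs: "Etc/GMT+6" is UTC-6.
    dt = datetime(2026, 1, 1, 12, 0, tzinfo=ZoneInfo("Etc/GMT+6"))
    print(dt.utcoffset())  # -1 day, 18:00:00, i.e. minus six hours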

When trying to debug whether something was wrong, the thinking trace kept going in loops trying to figure out if the "problem" was coming from my directions, the code's bugs, or the CSV having incorrect data.

Learning: when facing problems like this, try the well-known "notepad file" method to track them, so that if the over-eager LLM starts applying quick code fixes – even though YOU were the source of the "problem" – it is easier to undo or clean up code that was added to the repository during a confusing debug session. For me, it has been difficult to separate "code generated as a genuine resilience improvement" from "code generated during debugging that quietly changed some specific step of the script".

(Do note that I am not an advanced software engineer; my practices are probably obvious to others. My repos mainly consist of sysadmin-style shell/Python helper code! :-) )

[0] https://blacksheepcode.com/posts/til_etc_timezone_is_backwar...

reply
> when facing problems like this, try the well-known "notepad file" method to track them, so that if the over-eager LLM starts applying quick code fixes – even though YOU were the source of the "problem" – it is easier to undo or clean up code that was added to the repository during a confusing debug session. For me, it has been difficult to separate "code generated as a genuine resilience improvement" from "code generated during debugging that quietly changed some specific step of the script".

Yeah, I have definitely hit this as well. Sometimes I've named a function or variable in a way that misuses a term or concept, or I've changed what something does without fully thinking it through. The LLM sees that code, notices an inconsistency, and makes a guess about what I meant. But because I screwed up, only I know what I really meant (or what I "should have meant"). So the LLM ends up writing a fix that breaks assumptions made in other parts of the code: assumptions that fit with my original overall mental picture, but not the misnomer the LLM got snagged on. Or it writes a small-scoped fix, but the mistake of mine it stumbled upon actually merits rethinking and redesigning how some parts interact; so even if its fix is better than what I had before, I want to unwind that change so I can redefine my interfaces or whatever.

That's definitely worth calling out: it's not only the LLM's mistakes that make it more likely to commit future mistakes. Any mistakes in the codebase can compound like that. If you want an LLM to do useful work for you, it's more relevant than ever to "tidy first".

reply
Where this result is actually interesting and relevant is when a coding agent splits a large source file into multiple smaller files. Opus + Claude Code will try to recite long sections of source code from memory into each of the new files, instead of using some sort of copy/paste operation like a human would.

Moving a file is a bit easier. LLMs may sometimes try to recite the file from memory. But if you tell them to use "git mv" and fix the compiler errors, they mostly will.

Ordinary editing on the other hand, generally works fine with any reasonable model and tool setup. Even Qwen3.6 27B is fine at this. And for in-place edits, you can review "git diff" for surprises.

reply
> And for in-place edits, you can review "git diff" for surprises.

I don't let AI touch git anyway, and I always review the diff after it generates stuff. If it modifies my documentation, I always want to check whether it messed with the text instead of just adding formatting.

reply
This. I know the LLM agents often have their own little diff viewers and edit approval workflows, but for a high volume of code, I cannot imagine actually reviewing everything without leaning on much more capable Git tooling.

I use Magit, and up until I started using LLM agents it was mostly a nice-to-have that I relied on casually. (I was definitely under-utilizing its power.) But for reviewing, selectively staging, and selectively rejecting the changes of an LLM agent? I feel like I'd die without it. Idk how others manage.

reply
If you’re using LLMs for agentic work it is absolutely essential that you have a robust set of tools for them to use and the correct instructions to prompt their use.

The LLM will come up with stupid ways to do things; common sense doesn't exist for AI.

reply
Isn't this the whole reason they became viable in the last 6 months? The system prompt and harness is improving. It's less and less essential every day to roll your own.
reply
I don't think there is a single reason. Models are improving, and so are the harnesses and prompts; we who use them a lot also get more proficient and learn where they can be used effectively vs not. Lots of improvements all over the ecosystem, brought together.

Latest big change is probably how feasible local models are becoming, like Qwen 3.6 and Gemma 4, they're no longer easily getting stuck in loops and repetition, although on lower quantizations they still pretty much suck for agentic usage.

reply
> we who use them a lot also get more proficient and learn where they can be used effectively vs not

I think it’s always been obvious where an LLM could be used effectively and where it cannot, if you understand how they work and don’t see them as magical.

The “increase in proficiency” is mostly people coming back to reality and being more intentional about LLM usage. There are no surprise discoveries here. One does not need to use an LLM a lot to get effective with them. A total noob could become effective on day 1 with proper guidance.

reply
I think you hit the nail on the head. I had been in this space for a little bit before it really became popular. I haven’t seen incredible gains in model competency. What I have seen though is people figuring out what works and what doesn’t.
reply
It’s pretty telling that ignoring LLMs entirely for a few years and then jumping in last minute after everyone has struggled through figuring out how to use them still puts you on the same level very quickly.
reply
> then jumping in last minute after everyone has struggled through figuring out how to use them still puts you on the same level very quickly

Does it actually though?

I've used agents for quite some time now; if someone who has never used agents before wants to put this to the test somehow, I'm open to trying to measure it. Reach out via email :)

reply
The models also have far more intelligence built in. For example, the pi.dev agent harness has a system prompt that fits on a single page and includes only 4 or 5 tools. Running with a small coding model like Qwen3.6 27B, this setup is completely capable of agentic coding.
reply
They still aren't viable. Nothing changed within the last 6 months.
reply
My favorite is when Claude will build a completely new application to load and inspect a .dll file using reflection instead of just googling the library's interfaces.
reply
It did this to me during one of the recent outage periods: it was unjarring deps left and right instead of googling for them. "What an easy way for me to own the tokenmaxxing leaderboard," I remember thinking.
reply
“Use all of the tools at your disposal, including searching the internet” is my claude-specific common instruction.
reply
There's a kids' game that illustrates this too: https://en.wikipedia.org/wiki/Telephone_game
reply
Maybe more relatable to the typical HN reader: you know when the top boss tells the lower bosses something, who then tell the bosses below them something slightly different, and by the time it reaches you as an IC it's all different and corrupted compared to what it initially was? LLMs have the same effect, unsurprisingly.
reply
A coworker talks about LLMs as "bullshit" layers. Not exactly dismissing them or being derogatory about them, but emphasising that each time you feed something through an LLM, what comes out the other side may not be what you expect or want. Like that guy at the pub sharing what he'd seen online somewhere, after a few pints: it might be accurate, but there's a notable risk it's not.

So e.g., don't use an LLM to call an API to gather data and produce a report on it, as that's feeding deterministic data through a "bullshit" layer, meaning you can't trust what comes out the other side. Instead use the LLM to help you write the code that will produce a deterministic output from deterministic data.
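
A minimal sketch of that division of labour (the field names are made up): the LLM helps you write this function once, and from then on the function, not the LLM, touches the data.

    def summarize(rows: list[dict]) -> dict:
        # Deterministic data in, deterministic report out; no "bullshit"
        # layer between the numbers and the summary.
        total = sum(r["amount"] for r in rows)
        return {
            "count": len(rows),
            "total": total,
            "mean": total / len(rows) if rows else 0.0,
        }

    print(summarize([{"amount": 120.0}, {"amount": 80.0}]))
    # {'count': 2, 'total': 200.0, 'mean': 100.0}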

I've seen co-workers use LLMs to summarise deterministic data coming from APIs and have the reports be wildly off the mark as often as they are accurate. Depending on what they're looking at, that can carry catastrophic risk.

reply
Similar experience. I wouldn't say it even needs to be some random person in the local pub: this behaviour is what you'd get from any game of telephone. Book authors will tell you that you need to be blunt and direct about a book's points because readers will miss subtlety; anyone who has been quoted in a newspaper has a story about the paper getting it wrong; and so on.

However, there's a reason pre-computing bureaucracy came with paper trails and written-up meeting minutes, and a reason court cases are increasingly cautious about the reliability of eyewitnesses.

It is ironic: the more AI becomes like us, and the less it acts like a traditional computer program, the worse it is at many of the things we want to use it for. But because collectively we're oblivious to our cognitive limitations, we race into completely avoidable failures like this.

reply
> However, there's a reason pre-computing bureaucracy came with paper trails and written-up meeting minutes, and a reason court cases are increasingly cautious about the reliability of eyewitnesses.

This was the comment I was coming in to make: I worked in a pre-computing bureaucracy (the U.S. Navy's) and "staff you delegated work to have consistent trouble following the directions you provide for the delegated work" is just a fact of life.

A lot of it is the telephone game, a lot of it is lack of real familiarity with office software, and a lot of it is the inherent integration challenge of sending the same document out for coordination to dozens of stakeholders.

All those mistakes you fixed based on comments in the draft that went out for O-6 review? At least 2 will pop up again at 1-star review, because staffers will copy the same text back out from the local copy they stashed during O-6 review rather than re-reviewing from scratch.

Style guidance to meet the Admiral's preferred format? You can provide it, but there's not a chance they'll follow it; formatting is for humanities majors, so you'll need to catch and fix all of that yourself.

That's not to say the LLMs are foolproof or magically always correct, but a lot of these style of criticisms apply just as much, if not more, to the current status quo. I don't need LLMs to be perfect, I just need them to be better than the current alternatives.

reply
Before Claude Code, my strategy in JetBrains AI was to start a new chat convo per task; it yielded better output.
reply
I like this framing. At least as "nondeterministic" vs "deterministic" layers for the folks who flinch at "bullshit." Also "broadly capable but lossy" versus "limited capability but reliable."

Building structures of dependencies, the interface between each pair seems to collapse to the lesser of the two. So there's a ton of work right now going into TLA+, structured IO, etc., to force even a bit of reliability back into the LLM/program boundaries, in the hope of chaining multiple LLM dependencies in a stack without the whole thing toppling chaotically.
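
As a rough sketch of what forcing structure at that boundary can look like (ask_model is a hypothetical stand-in for whatever LLM call is being wrapped):

    import json

    def parse_report(raw: str) -> dict:
        # Hard-fail anything that doesn't parse into the expected shape.
        data = json.loads(raw)
        if not isinstance(data.get("total"), (int, float)):
            raise ValueError("missing numeric 'total'")
        return data

    def call_with_retries(ask_model, prompt: str, attempts: int = 3) -> dict:
        # The lossy layer gets a few chances; the rest of the program only
        # ever sees output that passed the structural check.
        for _ in range(attempts):
            try:
                return parse_report(ask_model(prompt))
            except ValueError:  # json.JSONDecodeError is a subclass
                continue
        raise RuntimeError("no structurally valid output after retries")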

reply
> the more they will tend to gradually pull that into some homogenous abstract equilibrium

I experienced this with resume editing. The LLM removes everything that differentiates my resume from a pile of junior engineers with “average” experience. Anything that was special or unique or different was eventually replaced with generic stuff

Of course I didn’t use what it produced, but it was maddening because the LLM kept insisting this was better than what I had.

I found LLMs to be much more useful in suggesting edits to very small chunks of my resume (a sentence or three) rather than the overall vision of the document.

reply
My half-baked solution is requiring colocation of the "why" for every decision and doc the LLM writes, ideally in my exact words. And similarly, every so often asking the LLM why it's doing something reveals a mismatch between your intent and its PoV.
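
For what it's worth, here's one shape this takes in code (the constant and the quoted reason are invented for illustration):

    # WHY (my exact words, colocated with the decision they justify):
    # "Retry at most twice; the upstream API double-bills on a third attempt."
    MAX_RETRIES = 2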
reply
Further, could we think of intent as an ordered state into which the LLM introduces entropy over time, eventually resulting in something akin to free association?
reply
I was talking about this in a thread yesterday. It’s why I don’t like blogs that are just LLM generated. I don’t care how good you think it is, I don’t care that you consider a facsimile of you good enough. If I want a rote, boring LLM response, I will prompt it myself. I do not appreciate reading blogs and other assumed to be human-generated content and having somebody attempt to trick me into reading their prompt results like some annoying middleman.

I came to your blog to read what you had to say. Why are you writing a blog if you aren’t even going to write it?

reply
A human doing the same task the LLM did in the paper would degrade the document further than the LLM does. If the LLM degrades it by 25%, a human using the same technique would probably degrade it by 80%. I'm talking about a single pass.

The fact of the matter is, humans don't edit things the way it was done in the paper, and neither do coding agents like Claude. Think about it: you do not ingest an entire paper and then regurgitate it with a single targeted edit... and neither do coding agents.

Also, think carefully: a 25% degradation rate is unacceptable in the industry. The AI shift that's taking over all of SWE development would not actually exist if there were 25% degradation... that's way too much.

reply
Are we comparing humans to LLMs or human written software to LLMs?

The whole point of creating software to do things used to be getting things done more accurately and consistently.

reply
No. The whole point of creating software is getting things done.

"More accurately and consistently" was merely downstream from what capabilities were natural for machine logic and hard algorithms.

Now, we're just spoiled for choice. We have hard-algorithm software for things that benefit from accurate, consistent, highly deterministic behavior, and we have soft-algorithm AI for things that simply aren't amenable to hard logic.

Machine translation used to be a horrid mess when we were trying to do it with symbolic systems, because symbolic systems are "consistent, highly deterministic" but not at all "accurate" on translation tasks. Being able to leverage LLMs for that is a generational leap.

reply
All of software is hard-coded algorithms.

If you're distinguishing between AI source code and engineer source code, say so. "Getting things done" is a business need. Which things get translated into a deterministic language executable by a computer is code.

There are entire languages dedicated to letting lesser engineers/domain experts formulate business requirements.

Anyhow: what's your point? That we received a framework for "soft algorithms" where the output does not need to be correct and deducible? What's even the point of putting it into software? Just forward your input to the reader and let them judge on their own.

reply
AI is more "grown" than it is "hard-coded". It's sideways to normal software - the way DSP is sideways to normal software but somehow even worse.

It all comes down to hard logic eventually, but that "eventually" has teeth. None of the interesting behaviors of AI systems live in "engine.py".

My point is: there are tasks where the choices are to use AI, use a meatbag, or suck forever. The "use AI" option is going to be flawed, and often in the same ways "use meatbag" is. But it's going to be cheaper, much more scalable, and a lot better than "suck forever". Humanlike flaws are the price you pay for accessing humanlike capabilities.

reply
Except that coding agents will do this at times. That's half the problem. A human will forget details and exaggerate others, but LLMs fail in spectacular ways that humans rarely would, like trying to copy a document from memory rather than one word at a time, side by side, or rewriting the whole thing just to make some simple changes. Coding agents will delete tests or return True to get them to pass - something you would never expect of even a junior professional.

And I know this because I see it all the time. I use composer-2 and sonnet 4.6 on a regular basis. It's not much better for my colleagues who use Opus or GPT or any of the other frontier models. Most of the time it's fine, but other times it does things simply unforgivable for a human. I have to watch the agent closely so that it doesn't decide to nuke my database; I don't have to do that with any of my juniors, even those with little experience and poor discipline.

reply
> nuke

> I don’t have to do that with any of my juniors…

For some values of "nuke," I absolutely have had to do that with juniors in the past. Perhaps you're referring to a single rm -r or hilarious force push or something, but undertrained and unsupervised juniors regularly introduce things like SQL injection, XSS, etc., simply because they don't know any better yet. This isn't saying "AI is better across the board"; I just don't think they're comparable, and I also think AI shouldn't be used to chop the bottom 5 rungs off our career ladder. But let's not pretend juniors can be left alone with a codebase without any worries.

reply
LLM’s are the most elaborate guessing machine man-kind has made. That’s makes it both useless and useful depending on what it is used for.

That’s it. Once you look at everything through this lense everything makes sense - especially the fact there is no underlying understanding of reasoning and creativity. I don’t care what boosters say.

reply
I don't know what a "booster" is, but if a model can solve original math problems, then it's reasoning.

If you can come up with a way to do math without reasoning, that would be, in a sense, even more interesting than AI.

reply
A model solving original math problems may look like human reasoning, but internally the model is choosing the next token based on what it has learned about the probabilities of various patterns and structures. The model knows about correlations between problems, proof techniques, and answer structures, and when it "reasons" it's selecting a high-probability trajectory through that learned knowledge.

A calculator is different because it is not probabilistic; it executes a fixed procedure. One of these models, when doing math, is more like a learned probabilistic system that has absorbed enough structure around mathematics that some of its high-probability trajectories look like genuine reasoning.

The difference is that when a human reasoner goes to solve a problem, they'll think "this kind of proof usually goes this way" and explicitly apply that rule. The model may produce the same output, and may even appear to approach it the same way, but the mechanism is probabilistic pattern selection rather than explicit rule application.
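
A toy illustration of that distinction (nothing like a real transformer; the table and probabilities are invented):

    import random

    # "Reasoning" as trajectory selection: the "model" is a table of
    # learned continuation probabilities, and it samples from that table.
    learned = {
        "this kind of proof usually goes": {
            "by induction": 0.6, "by contradiction": 0.3, "sideways": 0.1,
        },
    }

    def next_token(context: str) -> str:
        probs = learned[context]
        return random.choices(list(probs), weights=list(probs.values()))[0]

    print(next_token("this kind of proof usually goes"))  # mostly "by induction"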

reply
You talk as if problem solving were a supervised (imitation) learning problem. No, it is a reinforcement learning problem: models learn by solving problems and getting rated, generating their own training data. Optimal budget allocation is 1/3 of cost on pre-training, 1/3 on RL, and 1/3 on inference.
reply
> The difference is that when a human reasoner goes to solve a problem, they'll think "this kind of proof usually goes this way" and explicitly apply that rule.

How is this different from "probabilistic pattern selection"?

reply
Because... it's just different, that's all! OK?
reply
I don’t think there’s any evidence that “human reasoning” isn’t also based on probabilistic pattern selection.
reply
It's amazing that simple things have to be reiterated.

Perhaps it's best if most people admitted they don't have the fundamental ways of thinking needed to even participate in the conversation.

When all nuance is lost, the discussion must end.

reply
You should leave this site. Comments like this are not good for this site. You should go somewhere else.
reply
> If you can come up with a way to do math without reasoning, that would be, in a sense, even more interesting than AI.

Logic is just syntactic manipulation of formulas. By the early 90s logical reasoning was pretty much solved with classical AI (the last building block being constraint logic programming).

reply
So you'll be able to show me the early-90s era program that can solve original IMO-level problems when supplied with the plaintext questions. Right?
reply
If I presented math problems to the best English mathematicians in Chinese, does that mean they aren't able to reason? The plain text is an arbitrary constraint.
reply
The actual question is, if you presented an undergraduate-level calculus problem to a human who is considered intelligent but who was never given an "understanding" of math in school, would the human be able to solve it? Why or why not?

If so, what exactly would you call the process by which the intelligent human solves the math problem that he or she does not initially understand?

Whatever you call that process is what a reasoning model does. You don't have to call it "reasoning," of course... unless you want other people to understand what you're talking about.

reply
My dear sir, the entire universe is made of things that "do math without reasoning!"

It's the default, and if we're lucky we harness pieces of it to discern something we're interested in.

reply