1. The models are deliberately nerfed before the release of the next model, similar to how Apple allegedly nerfed its older phones once the next model was out.
2. You are relying more and more on the models and using your own talent less and less. What you are observing is the ratio of your work to the model's tilting further and further toward the model's. When a new model is released, it produces better-quality code than before, so the work improves with it, while your talent keeps deteriorating at a constant rate.
As you note, the developer's input is still driving the model quite a bit, so if the developer is contributing less and less as they trust more, the results would get worse.
One other failure mode that I've seen in my own work while I've been learning: the things you put into AGENTS.md/CLAUDE.md/local "memories" can improve or degrade performance, depending on the instructions. And unless you're actively and quantitatively reviewing whether performance is improving or degrading, you probably won't pick up that the two sentences you added to CLAUDE.md two weeks ago are why things seem to have suddenly gotten worse.
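The kind of quantitative review I have in mind can be pretty lightweight: keep a small, fixed set of benchmark tasks, re-score them after every memory-file edit, and log each score next to a hash of the CLAUDE.md that was in effect. A minimal sketch; `run_agent` and `score_output` are hypothetical stand-ins for however you invoke and grade your agent, and the task names are made up:

```python
import datetime
import hashlib
import json
import pathlib

# A fixed set of representative tasks, re-run after every memory-file edit.
TASKS = ["fix_flaky_test", "add_pagination", "refactor_auth"]

def run_agent(task: str) -> str:
    # Hypothetical stand-in: invoke your coding agent on the task here.
    raise NotImplementedError

def score_output(output: str) -> float:
    # Hypothetical stand-in: grade the result, e.g. the fraction of
    # acceptance tests that pass.
    raise NotImplementedError

def memory_file_hash() -> str:
    # Fingerprint CLAUDE.md so every score is tied to the exact
    # instructions that produced it.
    return hashlib.sha1(pathlib.Path("CLAUDE.md").read_bytes()).hexdigest()[:8]

def record_run() -> None:
    entry = {
        "date": datetime.date.today().isoformat(),
        "claude_md": memory_file_hash(),
        "scores": {task: score_output(run_agent(task)) for task in TASKS},
    }
    with open("agent_scores.jsonl", "a") as log:
        log.write(json.dumps(entry) + "\n")
```

With a log like that, "things got worse two weeks ago" becomes a diff between two CLAUDE.md hashes instead of a vague feeling.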
> similar to how you can expect more over time from a junior you are delegating to and training
That's the really interesting bit. Both Claude and Codex have learned some of my preferences because I explicitly said things like "Do not use emojis to indicate task completion in our plan files, stick to ASCII text only". But when you accidentally "teach" them something that has a negative impact on performance, they're not very likely to push back, unlike a junior engineer, who will either ignore your dumb instruction or, hopefully, bring it up.
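For concreteness, a taught preference like that ends up as a persistent instruction in the memory file, something along these lines (exact wording and heading are illustrative):

```markdown
## Plan file conventions
- Do not use emojis to indicate task completion in plan files.
- Stick to ASCII text only.
```

A bad instruction sits in that file just as persistently as a good one, which is why it can quietly drag on performance for weeks.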
> As you note, the developer's input is still driving the model quite a bit, so if the developer is contributing less and less as they trust more, the results would get worse.
That is definitely a thing too. There have been a few times when I have "let my guard down", so to speak, and haven't deeply considered the implications of every commit. Usually this hasn't been a big deal, but a few really ugly architectural decisions have made it through the gate and had to be cleaned up later. It's largely complacency, like you point out, as well as burnout from trying to keep up with reviewing and really contemplating/grokking the large volume of code output that's possible with these tools.
Personally, I can see the case for both interpretations being true at the same time, and maybe that is precisely why I conflated them so eagerly in my initial post.
I’d prefer providers to simply deprecate stuff faster, but then that would break other people’s existing workflows.
Newer models (from the past two years?) have improved "in detail", or as pragmatic tools, but they still don't deserve the anthropomorphism we subject them to just because they appear to communicate like us (and therefore appear to think and reason like us).
But the "holes" are painted over in contemporary models via training, system prompts, and various clever (and useful!) techniques.
But I think this makes it very difficult to spot the weak spots in a new or slightly different model; as we get to know each particular tool - each model - we get better at spotting the holes in that one.
Maybe it's poorly chosen variable names. A tendency to write plausible-looking, plausibly named e2e tests that turn out not to quite test what they appear to test at first glance. Maybe it's missing locking of resources, or missing transactions, in sequential code that appears sound but ends up storing invalid data when one or several steps fail...
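To make the transactions case concrete, here's a minimal sketch using Python's sqlite3 (the table, names, and amounts are made up, and the `raise` stands in for a crash between the two steps): the per-step-commit version reads as sound sequential code, but a failure halfway through persists half the change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER);
    INSERT INTO accounts VALUES ('alice', 100), ('bob', 0);
""")

def transfer_unsafe(conn, amount):
    # Looks like a sound sequence, but each step commits on its own:
    # a failure between them persists the debit and loses the credit.
    conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
    conn.commit()
    raise RuntimeError("simulated failure between the two steps")
    conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
    conn.commit()

def transfer_safe(conn, amount):
    # One transaction around both steps: the connection context manager
    # commits on success and rolls back on any exception.
    with conn:
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
        raise RuntimeError("simulated failure between the two steps")
        conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))

try:
    transfer_unsafe(conn, 50)
except RuntimeError:
    pass
print(dict(conn.execute("SELECT name, balance FROM accounts")))
# {'alice': 50, 'bob': 0} -- the money vanished; the invariant is broken.
```

Run `transfer_safe` instead and the rollback leaves both balances untouched, which is exactly the property the plausible-looking generated code can silently lack.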
In happy cases, current LLMs function like well-intentioned junior coders, enthusiastically delivering features and fixing bugs.
But in the other cases, they are like pathologically lying sociopaths, telling you anything you want to hear just so you keep paying them money.
When you catch them lying, it feels a bit like a betrayal. But the parrot is just tapping the bell, so you'll keep feeding it peanuts.