So, I think the claims of improved productivity and of reduced productivity can both be true at the same time (and it's not simply that people who don't find LLMs productive are prompting them wrong).
I think most can be gained by learning in which areas LLMs can give large productivity boosts and where it's better to avoid using them. Of course, this is a continuous process, given that LLMs are still getting better.
Personally, I am quite happy with LLMs. They cannot replace me, but they can do a chunk of the boring/repetitive work (e.g. boilerplate), so as a result I can focus on the interesting problems. As long as we don't have human-like performance (and I don't feel like we are close yet), LLMs make programming more interesting.
They are also a great learning aid. E.g., this morning I wanted to make a 3D model for something I needed, but I don't know OpenSCAD. I iteratively made the design with Claude. At some point the problem becomes too difficult for Claude, but with the code generated at that point, I have learned enough about OpenSCAD that I can fix the more difficult parts of the project. The project would have taken me a few hours (to learn the language, etc.), but now I was done in 30 minutes and learned some OpenSCAD in a pleasant way.
There is also the frontend, and those code bases don't need to be very old at all before AI falls down. With NPM packages and clashing styles in a codebase, AI has not been very helpful to me at all.
Generally speaking, while AI is a fine enhancement to autocomplete, I haven't seen it be able to do anything more serious in a mature codebase. The moment business rules and tech debt sneak in, in any capacity, AI becomes so unreliable that it's faster to just write it yourself. If I can't trust the AI to automatically generate a list of exports in an index.ts file, what can I trust it for?
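For what it's worth, the index.ts case is mechanical enough that it can be scripted deterministically. A minimal sketch, assuming a flat directory of modules that should all be re-exported; `buildBarrel` and its filtering rules are hypothetical, not anything from the codebase being discussed:

```typescript
// Hypothetical helper: builds the text of an index.ts barrel file from the
// file names found in one directory. The filtering rules are assumptions.
function buildBarrel(fileNames: string[]): string {
  const modules = fileNames
    .filter((name) => /\.tsx?$/.test(name))                // keep .ts / .tsx sources
    .filter((name) => !name.endsWith(".d.ts"))             // skip declaration files
    .filter((name) => !/\.(test|spec)\.tsx?$/.test(name))  // skip test files
    .map((name) => name.replace(/\.tsx?$/, ""))            // drop the extension
    .filter((name) => name !== "index")                    // never re-export the barrel itself
    .sort();                                               // stable, diff-friendly order

  // One re-export line per module, with a trailing newline.
  return modules.map((name) => `export * from "./${name}";`).join("\n") + "\n";
}

const barrel = buildBarrel(["b.ts", "index.ts", "a.tsx", "a.test.ts", "types.d.ts", "readme.md"]);
console.log(barrel);
```

The file names could come from something like Node's `fs.readdirSync`, with the result written back to index.ts. Whether `export *` is the right form depends on the project's conventions.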
Things have changed a lot in the past six weeks.
Gemini 2.5 Pro accepts a million tokens and can "reason" with them, which means you can feed it hundreds of thousands of lines of code and it has a surprisingly good chance of figuring things out.
OpenAI released their first million token models with the GPT 4.1 series.
OpenAI o3 and o4-mini are both very strong reasoning code models with 200,000 token input limits.
These models are all new within the last six weeks. They're very, very good at working with large amounts of crufty undocumented code.
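To put rough numbers on the limits quoted above, here is a back-of-the-envelope sketch. The context sizes are the ones given in this comment; the 4 characters per token ratio is only a common rule of thumb and an assumption here (real tokenizers vary, especially on code), and the function names are made up for illustration:

```typescript
// Context limits in tokens, as quoted in the comment above.
const contextLimits: { model: string; tokens: number }[] = [
  { model: "Gemini 2.5 Pro", tokens: 1000000 },
  { model: "GPT 4.1", tokens: 1000000 },
  { model: "o3 / o4-mini", tokens: 200000 },
];

// Assumption: roughly 4 characters per token. Treat this as a crude
// heuristic, not a property of any specific tokenizer.
function estimateTokens(totalChars: number, charsPerToken: number = 4): number {
  return Math.ceil(totalChars / charsPerToken);
}

// Returns the models whose quoted limit is at least the estimated token count.
function modelsThatFit(totalChars: number, charsPerToken: number = 4): string[] {
  const tokens = estimateTokens(totalChars, charsPerToken);
  return contextLimits
    .filter((entry) => tokens <= entry.tokens)
    .map((entry) => entry.model);
}

// Example: a codebase totalling 4,000,000 characters of source.
console.log(estimateTokens(4000000), modelsThatFit(4000000));
```

Under these assumptions, a codebase of 4,000,000 characters lands right at the million-token limit. A real check would count tokens with the provider's own tokenizer rather than guessing a ratio.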
Maybe in a generation or two codebases will become more uniform and predictable if fewer humans write them by hand. Same with self-driving cars: if there were no human drivers out there, the problem would become trivial to conquer.
They still make mistakes, and yeah they're still (mostly) next token predicting machines under the hood, but if your mental model is "they can't actually predict through how some code will execute" you may need to update that.