I'm not trying to discredit your experience and maybe it really is something wrong with the model.
But in my experience those first few prompts / features always feel insanely magical, like you're working with a 10x genius engineer.
Then you start trying to build on the project, refactor things, deploy, productize, etc. and the effectiveness drops off a cliff.
But I'm optimistic that this will gradually improve in time.
Today it’s my turn to be that person. Large scientific code base with a bunch of nontrivial, handwritten modules accomplishing distinct tasks that are structurally similar in terms of the underlying computation. Pointed GPT Pro at it, told it what new functionality I wanted, and it churned away for 40 minutes and completely knocked it out of the park. Estimated time savings of about 3-4 weeks. I’ve done this half a dozen times over the past two months and haven’t noticed any drop-off or degradation. If anything it got even better with 5.4.
The codebase itself is architected and documented to be LLM-friendly, and claude.md gives very strong guardrails for how to do things.
As an architect, Claude is abysmal, but when you give it an existing software pattern it merely needs to extend, it’s so good it still gives me something like a 5x feature-velocity boost.
Plus, when doing large refactorings, it forgets far fewer things than I do.
Inventing new architecture is as hard as ever, and it’s not much help there - unless you can point it to some well-documented pattern and tell it ”do it like that please”.
Even after deleting everything from the first feature and going back to the checkpoint just before initial development, I can no longer get it to accomplish anything meaningful without my direct guidance.
Yeah, that's a different problem from the one in this story; LLMs have always been good at greenfield projects, because the scope is so fluid.
Brownfield? Not so much.