There's a fair amount of talk right now about the value being in the verification layer -- once there's a hard verification loop, the agents can do amazing things without getting (permanently) sidetracked. I think what you're working on is half way there -- in essence, you're probably relying on the LLMs notion of what a spec is and should be to the codebase.
What's not currently solved, and what I think is very interesting is how much automation can be added to the creation of verification. We all would unlock a lot more speed and productivity for even moderate gains on that side.
Disagree on the bit about it "never going to work" though.
Failure-prone stochastic ML systems produce testable, auditable code... just like failure-prone human brains can produce testable, auditable code. And in fact, in both cases, changes to our process can reduce the amount of failures that slip past testing and audit. Or can reap other rewards. Finding the a better process is what I'm interested in right now.
You're missing the bigger picture here. Yeah, they produce code. But "producing" code was never the bottleneck. Yes you can pop out a webapp within a couple of hours, but now you have no clue how it works, even if its a language and framework you are competent it in, because you skipped the part where you understand how the parts fit in together architecturally. So you wrote an elaborate spec, but the LLM "decides" to do something else. Maybe they don't make that PK autoincrement or they throw you in those nice empty "catch" blocks they ingested from various beginner tutorials, which will be very "helpful" when you application silently deviates from the happy path execution that you spec'ed the hell out of in your virulent spec-driven-workflow.. So it "kinda" works, it generates the code. It works the way your kid's toy car works - it "drives" but it cannot be driven to work, can it? So it does not work in the big picture. It's not a reliable enterprise ready system. It's a toy, and should be treated like one.