Specifications are smaller than the full code, just as high-level code is smaller than the functionally equivalent assembly. As we ascend the abstraction ladder, the amount of reading a human needs to do decreases. I don't think this should really count as "back to square one".
binary => hexadecimal instructions
hexadecimal => assembly language
assembly => portable, "high-level" languages (C, FORTRAN, COBOL, etc.)
HLLs => 3GLs (BASIC, C++, Pascal, Java, C#, JavaScript, etc.)
3GLs => 4GLs/DSLs/RADs and "low-code/no-code"[0]
Among the RADs is Microsoft Visual Basic, which along with WinForms and SQL was supposed to make business programmers nearly obsolete, but instead became a new onramp into programming.

In particular, I'd like to highlight UML, which was supposed to mostly obsolete programming through auto-generated code from object-oriented class diagrams.[1] The promise was that "business domain experts" could model their domain via visual UML tooling, and the codegen would handle it from there. In practice, UML-built applications became maintenance nightmares.
In every one of these examples, the artifact that people made "instead of programming" became the de facto programming language, needing to be maintained over time, abstracted, updated, consumed behind APIs, etc. -- and programmers had to be called in to manage the mess.
It's interesting that Spec4 can be auto-generated, then used to generate code. My question is: what do you do when you have (a) consumers depending on a stable API, and (b) requests for new features? Maybe hand the job to Claude Code or a human developer with a suite of unit tests to guarantee API compatibility, but at that point we're back to an agent (LLM or human) doing the work of programming, with the Spec4 code as the programming language being updated and maintained.
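To make the "unit tests as a compatibility guarantee" idea concrete, here's a minimal sketch in Python (all names are invented for illustration): the contract tests pin the shape of the public API, so any regenerated implementation -- whether produced by an agent or a human -- must keep them green.

```python
# Hypothetical sketch: `parse_order` stands in for generated code behind
# a stable API. The contract tests below are what survives regeneration.

def parse_order(raw: str) -> dict:
    """Toy implementation standing in for generated code."""
    item, qty = raw.split(":")
    return {"item": item, "qty": int(qty)}

# Contract tests: these must keep passing across every regeneration,
# regardless of how the implementation changes internally.
def test_parse_order_shape():
    result = parse_order("widget:3")
    assert set(result) == {"item", "qty"}

def test_parse_order_types():
    result = parse_order("widget:3")
    assert isinstance(result["item"], str)
    assert isinstance(result["qty"], int)

test_parse_order_shape()
test_parse_order_types()
```

Of course, the tests themselves then become part of the artifact that has to be maintained, which is the point: something has to encode the contract, and that something is effectively source code.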
[0] https://en.wikipedia.org/wiki/Fourth-generation_programming_...
A formal spec in Lean is typically 10-50x shorter than the code it proves correct. More importantly, Lean's type checker is itself a small, trusted kernel (~10k lines) that has been scrutinized by the PL community for years. So you're not trusting the agent — you're trusting the kernel.
The practical workflow isn't "agent writes spec + code." It's: human writes spec (the hard creative part), agent generates proof that code satisfies spec, Lean kernel mechanically checks the proof. The agent can hallucinate all it wants in step 2 — if the proof doesn't typecheck, it gets rejected deterministically.
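That division of labor can be shown with a toy Lean 4 example (names invented for illustration; `double` stands in for generated code): the theorem statement is the human-written spec, the tactic proof is the part an agent could generate, and the kernel checks the resulting proof term either way.

```lean
-- Generated code (step 2's subject matter).
def double (n : Nat) : Nat := n + n

-- The spec: a human-written statement about the code (step 1).
-- The tactic proof could come from an agent; if it's wrong,
-- the kernel rejects it deterministically (step 3).
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

If the agent emits a bogus proof script, elaboration simply fails; there is no way to "hallucinate" past the kernel.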
The real bottleneck is step 1: writing good specs requires domain expertise. But that's exactly where humans should stay in the loop. It's a much better division of labor than reviewing thousands of lines of generated code.