The scientific approach is theory driven, not test driven. Understanding (and the power that gives us) is the goal.
At the risk of stretching the analogy, the LLM's internal representation is that theory: gradient-descent has tried to "explain" its input corpus (+ RL fine-tuning), which will likely contain relevant source code, documentation, papers, etc. to our problem.
I'd also say that a piece of software is a theory too (quite literally, if we follow Curry-Howard). A piece of software generated by an LLM is a more-specific, more-explicit subset of its internal NN model.
Tests, and other real CLI interactions, allow the model to find out that it's wrong (~empiricism); compared to going round and round in chain-of-thought (~philosophy).
Of course, test failures don't tell us how to make it actually pass; the same way that unexpected experimental/observational results don't tell us what an appropriate explanation/theory should be (see: Dark matter, dark energy, etc.!)
Vibing gives you something like the geocentric model of the solar system. It kind of works but but it's much more complicated and hard to work with.
I guess the current wave is going to give us Sofware Development Epicycles (SDEC?)
* All analogies are "wrong", some analogies are useful
Obviously the author has to do much work in selecting the correct bits from this baggage to get a structure that makes useful predictions, that is to say predictions that reproduces observable facts. But ultimately the theory comes from the author, not from the facts, it would be hard to imagine how one can come up with a theory that doesn't fit all the facts known to an author if the theory truly "emanated" from the facts in any sense strict enough to matter.
I disagree. Having tests (even if the LLM wrote them itself!) gives the model some grounding, and exposes some of its inconsistencies. LLMs are not logically-omniscient; they can "change their minds" (next-token probabilities) when confronted with evidence (e.g. test failure messages). Chain-of-thought allows more computation to happen; but it doesn't give the model any extra evidence (i.e. Shannon information; outcomes that are surprising, given its prior probabilities).
Don’t like the layout? Let’s reroll! Back to the generative kitchen agent for a new one! ($$$)
The big labs will gladly let you reroll until you’re happy. But software - and kitchens - should not be generated in a casino.
A finished software product - like a working kitchen - is a fractal collection of tiny details. Keeping your finished software from falling apart under its own weight means upholding as many of those details as possible.
Like a good kitchen a few differences are all that stands between software that works and software that’s hell. In software the probability that an agent will get 100% of the details right is very very small.
Details matter.
People metaphorically do that all the time when designing rooms, in the form of endless browsing of magazines or Tik Tok or similar to find something they like instead of starting from first principles and designing exactly what they want, because usually they don't know exactly what they want.
A lot of the time we'd be happier with a spec at the end of the process than at the beginning. A spec that ensures the current understanding of what is intentional vs. what is an accident we haven't addressed yet is nailed down would be valuable. Locking it all down at the start, on the other hand, is often impossible and/or inadvisable.
Spec is an overloaded term in software :) because there are design specs (the plan, alternatives considered etc) and engineering style specs (imagine creating a document with enough detail that someone overseas could write your documentation from it while you’re building it)
Those need distinct names or we are all at risk of talking past each other :)