Be careful here - make sure you encode the right details. I've seen many cases where the tests are encoding the details of how it was implemented and not what it is intended to do. This means that you can't refactor anything because your tests are enforcing a design. (refactor is changing code without deleting tests, the trick is how can you make design changes without deleting tests - which means you have to test as much as possible at a point where changing that part of the design isn't possible anyway)
As part of the proper testing strategy, you will have tests that cover individual behavior of a small block/function (real "unit" tests), tests that cover integration points only up to the integration itself, and a small number of end-to-end or multi-component integration tests.
Only the last category should stay mostly idempotent under refactoring, depending on the type of refactor you are doing.
Integration tests will obviously be affected when you are refactoring the interfaces between components, and unit tests will be affected when you are refactoring the components themselves. Yes, you should apply the strategy that keeps it under incremental reverse TDD approach (do the refactor and keep the old interface, potentially by calling into new API from the old; then in second step replace use of old API as well, including in tests).
Tests generally define behavior and implementation in a TDD approach: it'd be weird if they do not need changing at all when you are changing the implementation.
I'm not against TDD or verification-first development, but I don't think writing that as code is the end-goal. I'll concede that there's millions of lines of tests that already exist, so we should be using those as a foundation while everything else catches up.
The scientific approach is theory driven, not test driven. Understanding (and the power that gives us) is the goal.
At the risk of stretching the analogy, the LLM's internal representation is that theory: gradient-descent has tried to "explain" its input corpus (+ RL fine-tuning), which will likely contain relevant source code, documentation, papers, etc. to our problem.
I'd also say that a piece of software is a theory too (quite literally, if we follow Curry-Howard). A piece of software generated by an LLM is a more-specific, more-explicit subset of its internal NN model.
Tests, and other real CLI interactions, allow the model to find out that it's wrong (~empiricism); compared to going round and round in chain-of-thought (~philosophy).
Of course, test failures don't tell us how to make it actually pass; the same way that unexpected experimental/observational results don't tell us what an appropriate explanation/theory should be (see: Dark matter, dark energy, etc.!)
Vibing gives you something like the geocentric model of the solar system. It kind of works but but it's much more complicated and hard to work with.
I guess the current wave is going to give us Sofware Development Epicycles (SDEC?)
* All analogies are "wrong", some analogies are useful
Obviously the author has to do much work in selecting the correct bits from this baggage to get a structure that makes useful predictions, that is to say predictions that reproduces observable facts. But ultimately the theory comes from the author, not from the facts, it would be hard to imagine how one can come up with a theory that doesn't fit all the facts known to an author if the theory truly "emanated" from the facts in any sense strict enough to matter.
I disagree. Having tests (even if the LLM wrote them itself!) gives the model some grounding, and exposes some of its inconsistencies. LLMs are not logically-omniscient; they can "change their minds" (next-token probabilities) when confronted with evidence (e.g. test failure messages). Chain-of-thought allows more computation to happen; but it doesn't give the model any extra evidence (i.e. Shannon information; outcomes that are surprising, given its prior probabilities).
Don’t like the layout? Let’s reroll! Back to the generative kitchen agent for a new one! ($$$)
The big labs will gladly let you reroll until you’re happy. But software - and kitchens - should not be generated in a casino.
A finished software product - like a working kitchen - is a fractal collection of tiny details. Keeping your finished software from falling apart under its own weight means upholding as many of those details as possible.
Like a good kitchen a few differences are all that stands between software that works and software that’s hell. In software the probability that an agent will get 100% of the details right is very very small.
Details matter.
People metaphorically do that all the time when designing rooms, in the form of endless browsing of magazines or Tik Tok or similar to find something they like instead of starting from first principles and designing exactly what they want, because usually they don't know exactly what they want.
A lot of the time we'd be happier with a spec at the end of the process than at the beginning. A spec that ensures the current understanding of what is intentional vs. what is an accident we haven't addressed yet is nailed down would be valuable. Locking it all down at the start, on the other hand, is often impossible and/or inadvisable.
Spec is an overloaded term in software :) because there are design specs (the plan, alternatives considered etc) and engineering style specs (imagine creating a document with enough detail that someone overseas could write your documentation from it while you’re building it)
Those need distinct names or we are all at risk of talking past each other :)
I’ve been experimenting with a small sparse-regression system that infers governing equations from raw data, and it can produce a lot of plausible candidates quickly. The hard part is filtering out the ones that look right but violate underlying constraints.
For example, it recovered the Sun’s rotation (~25.1 days vs 27 actual) from solar wind data, but most candidate equations were subtly wrong until you enforced consistency checks.
Feels like systems that treat verification as the source of truth (not just an afterthought) are the ones that will actually scale.
They are embracing property-based specifications and testing à la Haskell's QuickCheck: https://kiro.dev
Then, already in formal methods territory, refinement types (e.g. Dafny, Liquid Haskell) are great and less complex than dependent types (e.g. Lean, Agda).
I can think of some strawmen: for example, prove a state machine in Lean, then port the proven version to Dart? But I'm not familiar enough with Lean to know if that's like saying "prove moon made of cheese with JavaScript, then deploy to the US mainframe"
if you can get a model to quickly translate a relevant subset of your code to lean to find tricky bugs and map lean fixes back to your codebase space, you've got yourself a huge unlock. (spoiler alert: you basically can, today)
Before you commented, I started poking at what you described for 15 minutes, then forget about it and fell asleep. Now I remembered, and I know it's viable and IIUC it's almost certainly going to make a big difference in my work practice moving forward. Cheers.
(One way Lean or Rocq could help you directly, though, would be if you coded your program in it and then compiled it to C via their built-in support for it. Such is very difficult at the moment, however, and in the industry is mostly reserved for low-level, high-consequence systems.)
What do you mean? It's a nice and simple language. Way easier to get started than OCaml or Haskell for example. And LLMs write programs in Lean4 with ease as well. Only issue is that there are not as many libraries (for software, for math proofs there is plenty).
But for example I worked with Claude Code and implemented a shell + most of unix coreutils in like a couple of hours. Claude did some simple proofs as well, but that part is obvs harder. But when the program is already in Lean4, you can start moving up the verification ladder up piece by piece.
Require Import String.
Definition hello: string := "Hello world!".
Print hello.
hello = String (Ascii.Ascii false false false true false false true false) (String (Ascii.Ascii true false true false false true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii true true true true false true true false) (String (Ascii.Ascii false false false false false true false false) (String (Ascii.Ascii true true true false true true true false) (String (Ascii.Ascii true true true true false true true false) (String (Ascii.Ascii false true false false true true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii false false true false false true true false) (String (Ascii.Ascii true false false false false true false false) EmptyString))))))))))) : string
I used to think that the only way we would be able to trust AI output would be by leaning heavily into proof-carrying code, but I've come to appreciate the other approaches as well.
If someone posted a breakthrough in cryptographic verification and the top comment was "yeah, unit tests are great," we'd all recognize that as missing the point. I don't think it's unrelated, I think it's almost related, which is worse, because it pattern-matches onto agreement while losing the actual insight.