upvote
> It took humans years to write the tests by hand, and the agents still failed to converge.

I think there is some hazard in assuming that what agents fail at today they will continue to fail on in the future.

What I mean is: if we take the optimistic view that agents keep improving on their current trajectory for another year or two, then it is worthwhile to consider what tools and infrastructure we will need for them. Companies that start building that now, for the future they assume is coming, will be better positioned than people who wake up to a new reality in two years.

reply
> I think there is some hazard in assuming that what agents fail at today they will continue to fail on in the future.

I think there is some hazard in assuming a seemingly exponential curve has no asymptotes, otherwise known as faith.

reply
That is what the market is for!

I'm just pointing out that "we don't need this right now" isn't necessarily an argument for "we'll never need this".

There is a saying that isn't perfect but may apply: better to have it and not need it than to need it and not have it.

Here is another way of looking at it. Say agents don't meet the hyped-up expectations and we build all of this robust tooling for nothing: we end up with all this work toward autonomous testing systems, but without the autonomous agents. That still seems like a decent outcome.

When we plan around optimistic views of the future, we tend to build generally useful things.

reply
The market stopped being a remotely useful measure of… anything quite a while ago.
reply
> There's no doubt, I think, testing will remain important and possibly become more important with more AI use, and so better testing is helpful, PBT included.

Given the Curry-Howard isomorphism, couldn't we ask AI to directly prove the property of the binary executable under the assumptions of the HW model, instead of running PBTs?

By no means do I want to dismiss PBTs, but it seems that this could be both faster and more reliable.

reply
Proofs are a form of static analysis. Static analysis can find interesting bugs, but how a system behaves isn't purely a property of source code. It won't tell you whether the code will run acceptably in a given environment.

For example, if memory use isn't modelled, it won't tell you how big the input can be before the system runs out of memory. Similarly, if your database isn't modelled then you need to test with a real database. Web apps need to test with a real web browser sometimes, rather than a simplified model of one. Databases and web browsers are too complicated to build a full-fidelity mathematical model for.

When testing with real systems, there's often the issue that the user's system differs from the one you test on. You can test with recent versions of Chrome, Firefox, etc., which helps a lot, but what about extensions?

Nothing covers everything, but property tests and fuzzers actually run the code in some test environment. That's going to find different issues than proofs will.
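To make that concrete, here is a minimal sketch of why running the code finds things a proof over idealized semantics won't. The function names are made up for illustration; a real PBT library would generate the depths rather than a hand-picked loop:

```python
def nested_depth(node):
    """Depth of a nested [[[...]]] structure, written recursively."""
    return 0 if not node else 1 + nested_depth(node[0])

def depth_property(depth: int) -> bool:
    """Build a structure of the given depth and check the function recovers it."""
    value = []
    for _ in range(depth):
        value = [value]
    try:
        return nested_depth(value) == depth
    except RecursionError:
        # Correct on paper, but the real interpreter has a stack limit.
        return False
```

On small inputs the property holds; on a depth of 100,000 the test environment itself pushes back with a RecursionError, which a proof over an idealized machine would never mention.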

reply
> Databases and web browsers are too complicated to build a full-fidelity mathematical model for.

I disagree - thanks to the Curry-Howard isomorphism, the full-fidelity mathematical model of a database or web browser is the binary itself.

We could have compilers provide theorems (with proofs) of the correctness of the translation from source to machine code, and library functions could come with useful theorems about their resource use.

Then, if the AI can reason about the behavior of the source code, it can also build the required proof of correctness along with it.

reply
> thanks to the Curry-Howard isomorphism, the full-fidelity mathematical model of a database or web browser is the binary itself.

Maybe I'm misunderstanding you, but Curry-Howard is a mapping between mathematical jargon and programming jargon, where e.g. "this is a proof of that proposition using foo logic" maps to "this program has that type in programming language foo".
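To pin down the jargon mapping, a tiny Lean 4 sketch (a term inhabiting a type is a proof of the corresponding proposition):

```lean
-- The identity function is a proof of A → A:
def impliesSelf (A : Prop) : A → A := fun a => a

-- Function application is modus ponens:
def modusPonens (A B : Prop) : (A → B) → A → B := fun f a => f a
```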

I don't see how that makes "binaries" a "full-fidelity mathematical model". Compilation is, under Curry-Howard, translating a proof from one system of logic to another. For a binary, the target logic is machine code, which is an absolutely terrible logic: it has essentially one type (the machine word), which makes every proposition trivial. According to Curry-Howard, your database binary is a proof of the proposition corresponding to its type; since the type of every binary is just "some machine words", the proposition your database binary is a "full-fidelity mathematical model" of is essentially just "there exists a machine word". Not very useful; we could optimise it down to "0", which is also a proof that a machine word exists.

If we assume that you want to prove something non-trivial, then the first thing you would need to do is abstract away from the trivial logic of machine code semantics, by inferring some specific structures and patterns from that binary, then developing some useful semantics which captures those patterns and structures. Then you can start to develop non-trivial logic on those semantics, which will let you state worthwhile propositions. If we apply the Curry-Howard lens to that process, it corresponds to... decompilation into a higher-level language!

tl;dr Curry-Howard tells us that binaries are literally the worst possible representation we could hope for.

reply
And how do you know if it has proven the property you want, instead of something that's just complicated looking but evaluates to true?
reply
The AI would build a proof of correctness, which would be then verified in a proof checker (not AI).
reply
> Given the Curry-Howard isomorphism, couldn't we ask AI to directly prove the property of the binary executable under the assumptions of the HW model, instead of running PBTs?

Yes, in principle. Given unlimited time and a plentiful supply of unicorns.

Otherwise, no. It is well beyond the state of the art in formal proofs for the general case, and it doesn't become possible just because we "ask AI".

And unless you provide a formal specification of the entire set of behavior, it's still not much better than PBT -- the program is still free to do whatever the heck it wants that doesn't violate the properties formally specified.

reply
> But the problem remains verifying that the tests actually test what they're supposed to.

Definitely. It's a lot harder to fake this with PBT than with example-based testing, but you can still write bad property-based tests and agents are pretty good at doing so.

I have generally found that agents with property-based tests are much better at not lying to themselves about it than agents with just example-based testing, but I still spend a lot of time yelling at Claude.

> So "a huge part" - possibly, but there are other huge parts still missing.

No argument here. We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.

reply
A fun recent experience I had with Claude was I asked it to write a model for PBTs against a complex SUT, and it duplicated the SUT algorithm in the model — not helpful! I had to explicitly prompt it to write the model algorithm in a completely different style.
reply
Ugh, yeah. Duplicating the code under test is a bad habit that Claude has had when writing property-based tests from very early on and has never completely gone away.

Hmm, now that you mention it, we should add some instructions not to do that in the hegel-skill, though oddly I've not seen it doing that so far.

reply
> We're not claiming to solve agentic coding. We're just testing people doing testing things, and we think that good testing tools are extra important in an agentic world.

Yeah, I know. Just an opportunity to talk about some of the delusions we're hearing from the "CEO class". Keep up the good work!

reply
> I have generally found that agents with property-based tests are much better at not lying to themselves

I've also observed the cheating increase. I recently tried a specific optimization on a big, complex function: I wrote a PBT that checks that the original function returns the same values as the optimized function on all inputs, and I tracked the runtime to confirm that performance improved. Then I let Claude loose. The PBT was great at spotting edge cases, but eventually Claude always started cheating: it modified the test, it modified the original function, it implemented other (easier) optimizations, ...
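The shape of that equivalence property, sketched with a hand-rolled seeded loop standing in for a real PBT library like hypothesis (`popcount_slow` and `popcount_fast` are hypothetical stand-ins for the original and optimized functions):

```python
import random

def popcount_slow(n: int) -> int:
    # original, obviously-correct version
    return bin(n).count("1")

def popcount_fast(n: int) -> int:
    # "optimized" version: Kernighan's bit trick
    count = 0
    while n:
        n &= n - 1  # clear the lowest set bit
        count += 1
    return count

def equivalent_on_random_inputs(trials: int = 10_000) -> bool:
    rng = random.Random(0)  # seeded, so the check is reproducible
    return all(
        popcount_slow(n) == popcount_fast(n)
        for n in (rng.getrandbits(rng.randint(1, 64)) for _ in range(trials))
    )
```

The catch described above is that the agent can edit any of these three pieces; the property only constrains the agent as long as the test and the reference implementation stay out of its reach.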

reply
Ouch. Classic Claude. It does tend to cheat when it gets stuck, and I've had some success with stricter harnesses, reflection prompts and getting it to redo work when it notices it's cheated, but it's definitely not a solved problem.

My guess is that you wouldn't have had a better time without PBT here, and it would still have either cheated or claimed victory incorrectly. But agreed, PBT can't fully fix the problem, especially if the agent is allowed to modify the tests. I've still anecdotally found that the results are better than without it: even if agents will often cheat when problems are pointed out, they'll definitely cheat when problems aren't pointed out.

reply
I actually think there's another angle here where PBT helps, which wasn't explored in the blog post.

That angle is legibility. How do you know your AI-written slop software is doing the right thing? One would normally read all the code. Bad news: that's not much less labor-intensive than not using AI at all.

But, if one has comprehensive property-based tests, they can instead read only the property-based tests to convince themselves the software is doing the right thing.

By analogy: one doesn't need to see the machine-checked proof to know the claim is correct. One only needs to check the theorem statement is saying the right thing.
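For instance, a roundtrip property reads like a specification: a reviewer who trusts that it holds doesn't need to read the codec internals. Here base64 stands in for whatever encoder/decoder the agent wrote:

```python
import base64
import random

def roundtrip_holds(data: bytes) -> bool:
    # The entire "theorem statement": decode(encode(x)) == x
    return base64.b64decode(base64.b64encode(data)) == data

def check_roundtrip(trials: int = 1_000) -> bool:
    rng = random.Random(1)  # seeded for reproducibility
    return all(
        roundtrip_holds(rng.randbytes(rng.randint(0, 64)))
        for _ in range(trials)
    )
```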

reply
Right, I said that property-based tests are easier to read, and that's good. But people still have to actually read them. Also, because they still work best at the "unit" level, the people reading them need to know how all the units are connected in order to understand them (e.g. a single person cannot review even the PBTs required for 10 KLOC per day [1]).

My point isn't so much about PBT, but about how we don't yet know just how much agents help write real software (and how to get the most help from them).

[1]: I'm only using that number because Garry Tan, CEO of YC, claimed to generate 10K lines of text per day that he believes to be working code, and developers who work with AI agents know it can't be.

reply