Do you have an example of the tautological tests you're referring to? What comes to mind for me is a genuinely, logically tautological test, like "assert(true || expectedResult == actualResult)", which is a mistake I don't even expect modern AI coding tools to make. But I suspect you're talking about a subtler type of test that at first glance appears useful but actually isn't.
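One subtler flavor, sketched in hypothetical Ruby (none of these names come from the thread): the "expected" value is derived with the same logic as the implementation, so the assertion can never fail no matter what the code actually does.

```ruby
# Code under test (illustrative).
def apply_discount(price_cents, percent)
  price_cents - (price_cents * percent / 100)
end

price, percent = 1000, 20

# Tautological: the expectation mirrors the implementation's formula,
# so this passes even if the formula itself is wrong.
expected = price - (price * percent / 100)
raise "can never fail" unless apply_discount(price, percent) == expected

# Meaningful: pins an independently known value instead.
raise "real check" unless apply_discount(1000, 20) == 800
puts "ok"
```

The second assertion is the one that would catch a broken formula; the first is busywork that merely re-executes it.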
reply
I've definitely seen Opus go to town when asked to test a fairly simple builder. Possibly it inferred something about testing the "contract", and went on to test such properties as

  - none of the "final" fields have changed after calling each method
  - these two immutable objects we just confirmed differ on a property are not the same object
That was in addition to multiple tests with essentially identical code, multiple test classes with largely duplicated tests, etc.
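A hypothetical Ruby builder makes the pattern concrete (all names here are invented for illustration): the first "contract" test just restates value-equality semantics, while the second actually exercises the builder.

```ruby
Pizza = Struct.new(:size, :toppings)  # Struct provides value-based equality

class PizzaBuilder
  def initialize
    @toppings = []
  end

  def size(s)
    @size = s
    self
  end

  def topping(t)
    @toppings << t
    self
  end

  def build
    Pizza.new(@size, @toppings)
  end
end

# Near-tautological "contract" test: two objects we just built with
# different sizes are, unsurprisingly, not equal.
large = PizzaBuilder.new.size(:large).build
small = PizzaBuilder.new.size(:small).build
raise unless large != small

# A test that exercises the builder's actual behavior:
pizza = PizzaBuilder.new.size(:large).topping(:basil).build
raise unless pizza.size == :large && pizza.toppings == [:basil]
puts "ok"
```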
reply
Among many other possible examples, here are a few [0] from Ruby that I've seen in the wild before LLMs, and still see today spat out by LLMs.

0: https://www.codewithjason.com/examples-pointless-rspec-tests...

reply
I do see agents pop out tests that look like this occasionally:

  it { expect(classroom).to have_many(:students) }
If I catch them I tell them not to and they remove it again, but a few do end up slipping through.

I'm not sure that they're particularly harmful any more though. It used to be that they added extra weight to your test suite, meaning when you make changes you have to update pointless tests.

But if the agent is updating the pointless tests for me, I can afford a little bit of unnecessary testing bloat.

reply
I don’t love tests like that either, but I’ve seen a lot of them (long before the generative AI era) and heard reasonable people make arguments in favor of them.

Admittedly, in the absence of halfway competent static type checking, it does seem like a good way to prevent what would be a very bad regression. It doesn’t seem worse than tests which check that a certain property is non-null (when that’s a vital business requirement and you’re using a language without a competent type system).

reply
I don’t have examples but I have an LLM driven project with like…2500 tests and I regularly need to prune:

* no-op tests

* unit tests labeled as integration tests

* tests set to skip because they were failing and the agent didn’t want to fix them

* tests that can never fail

Probably at any given time the tests are 2-4% broken. I’d say about 10% of one-shot tests are bogus if you’re just working with spec + chat and don’t have extra testing harnesses.

reply
For example, you might write a concurrency test, and the agent will cheerfully remove the concurrency and announce that it passes. They get so hung up on making things work in a narrow sense that they lose track of the purpose.
reply
Yes. And, a bad test -- that passes because it's defined to pass -- is _much worse_ than no test at all. It makes you think an edge case is "covered" with a meaningful check.

Worse: once you have one "bad apple" in your pile of tests, it decreases trust in the _whole batch of tests_. Each time a test passes, you have to think if it's a bad test...

reply
That's where mutation testing becomes even more valuable. If the test still passes after the code has been mutated, then you may want to look deeper, because it's a sign that the test is not good.
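A toy sketch of that idea in Ruby (no real framework involved; the lambdas stand in for the code under test): a mutant that flips one operator should make a good test fail, and a test the mutant survives deserves a closer look.

```ruby
original = ->(a, b) { a + b }
mutant   = ->(a, b) { a - b }   # mutation: '+' flipped to '-'

strong_test = ->(add) { add.call(2, 3) == 5 }
weak_test   = ->(add) { add.call(0, 0) == 0 }  # blind to this mutation

puts strong_test.call(original)  # true  — passes on the original code
puts strong_test.call(mutant)    # false — mutant killed: the test has teeth
puts weak_test.call(mutant)      # true  — mutant survives: look deeper
```

Real tools (e.g. the mutant gem for Ruby) generate and run mutants like this automatically across a whole codebase.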
reply
This seems like it should be very easy to validate. Force the AI to make minimal changes to the code under test which make a single test (or as few as possible) fail as a result. If it can't make any test fail at all, the tests are probably useless.
reply
Agreed, and that's why I think adding some example prompts and ideas to the Testing section would be helpful. A vanilla-prompted LLM, in my experience, is very unreliable at adding tests that fail when the changes are reverted.

Many times I've observed that the tests added by the model simply pass as part of the changes, but still pass even when those changes are no longer applied.
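The check can be sketched mechanically, here with Ruby lambdas standing in for the before/after versions of the code (invented example): a meaningful new test passes with the change applied and fails with it reverted.

```ruby
before_fix = ->(s) { s.split(",") }               # old behavior
after_fix  = ->(s) { s.split(",").map(&:strip) }  # the shipped change

new_test = ->(parse) { parse.call("a, b") == ["a", "b"] }

puts new_test.call(after_fix)   # true  — passes with the change
puts new_test.call(before_fix)  # false — fails when reverted, as it should
```

In a real workflow you'd check out the parent commit (or stash the change) and rerun just the newly added tests, expecting them to fail.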

reply
I had an example in that section but it got picked apart by pedants (who had good points) so I removed it. I plan to add another soon. You can still see it in the changelog: https://simonwillison.net/guides/agentic-engineering-pattern...
reply
This is essentially dual to the idea behind mutation testing, and should be trivial to do with a mutation testing framework in place (track whether a given test catches mutants, or more sophisticated: whether it catches the exact same mutants as some other test).
reply
That's part of the reason I like red/green TDD - you make the agent show that the test fails before the implementation and passes afterwards.

It can still cheat, but it's less likely to cheat.

reply
> we as humans can guide the LLM toward a rigorous test suite, rather than one that has a lot of "coverage" but doesn't actually provide sound guarantees about behavior.

I have a hard enough time getting humans to write tests like this…

reply