Rerunning things: nothing built for that yet, but I do have some design ideas. Repros are notoriously shaky in testing like this (unless run against a deterministic app, or inside Antithesis), but I think Bombadil should offer best-effort repros if it can at least detect and warn when things diverge.
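To make the "detect and warn when things diverge" idea concrete, here is a minimal sketch in plain Python. It is not how Bombadil works (nothing is built for this yet); the names `fingerprint`, `record`, `replay`, and the toy step functions are all made up for illustration. The idea is to store a hash of the observed state after each action, then compare on replay and report the first step where the hashes differ.

```python
import hashlib


def fingerprint(state):
    """Short, stable hash of an observed state (anything with a stable repr)."""
    return hashlib.sha256(repr(state).encode()).hexdigest()[:12]


def record(step, actions, state):
    """Run the actions and record (action, fingerprint-after-action) pairs."""
    trace = []
    for action in actions:
        state = step(state, action)
        trace.append((action, fingerprint(state)))
    return trace


def replay(step, trace, state):
    """Re-run a recorded trace; return the index of the first divergence, or None."""
    for i, (action, expected) in enumerate(trace):
        state = step(state, action)
        if fingerprint(state) != expected:
            return i  # best-effort repro: warn here instead of failing silently
    return None


def deterministic_step(state, action):
    """Hypothetical app whose next state depends only on state and action."""
    return state + action


def make_drifting_step(bad_call):
    """Hypothetical app with hidden state: the Nth call overall behaves differently."""
    calls = {"n": 0}

    def step(state, action):
        calls["n"] += 1
        return state + action + (1 if calls["n"] == bad_call else 0)

    return step
```

With a deterministic app, `replay` returns `None`. With hidden state leaking in (timers, network, leftover data), it returns the step at which the repro stopped being trustworthy, which is the point where a tool could warn.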
Shrinking: also nothing there yet. I'm experimenting with a state machine inference model as an aid to shrinking. It connects to the prior point about shaky repros, but I'm cautiously optimistic. Because the speed of browser testing isn't great, shrinking is also hard to do within reasonable time bounds.
Thanks for the questions and feedback!
Should be pretty easy to make it deterministic if you follow that precondition.
(The way I had my review apps wired up, I dumped the staging DB nightly and containerized it. I believe Neon etc. make this kind of thing easy.)
Ages ago I wired up something much more basic than this for a Python API using Hypothesis, and made the state machine explicit as part of the action generator (with the transitions library). What do you think about modeling state machines in your tests? (I suppose one risk is that you don’t want to copy the state machine implementation from inside the app, but a nice fluent builder for simple state machines in tests could be a win.)
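For anyone wondering what "state machine as part of the action generator" can look like, here is a rough stdlib-only sketch. It uses neither Hypothesis nor the transitions library; the `TRANSITIONS` table, the login flow, and `generate_actions` are hypothetical names for illustration. The generator does a random walk over the machine, so it only ever emits actions that are valid in the current state.

```python
import random

# Hypothetical login flow: state -> {action: next_state}
TRANSITIONS = {
    "logged_out": {"login": "logged_in"},
    "logged_in": {"add_item": "logged_in", "logout": "logged_out"},
}


def generate_actions(rng, start, length):
    """Random walk over the machine; returns the actions and the final state."""
    state, actions = start, []
    for _ in range(length):
        # sorted() keeps the choice reproducible for a given seed
        action = rng.choice(sorted(TRANSITIONS[state]))
        actions.append(action)
        state = TRANSITIONS[state][action]
    return actions, state


if __name__ == "__main__":
    actions, final_state = generate_actions(random.Random(0), "logged_out", 10)
    print(actions, final_state)
```

The table here is deliberately tiny. Once it starts to look like the app's own implementation, you hit the duplication risk mentioned above.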
Regarding state machines: yeah, the model can often become a mirror just as complex as the system you're testing, if the system has a large, complicated surface. If, on the other hand, the API is simple and encapsulates a lot of complexity (like Ousterhout's "Deep Modules"), state machine specs and model-based testing make more sense. Testing a key-value store is a great example of this.
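As a small illustration of the key-value store case, here is a sketch in plain Python. The `KVStore` class is a made-up stand-in for a real system (an append-only log with last-write-wins reads), and the model is just a dict. Random operations are applied to both, and every read is checked against the model.

```python
import random


class KVStore:
    """Toy system under test: append-only log, reads scan from the newest entry."""

    def __init__(self):
        self._log = []

    def put(self, key, value):
        self._log.append((key, value))

    def delete(self, key):
        self._log.append((key, None))  # tombstone

    def get(self, key):
        for k, v in reversed(self._log):
            if k == key:
                return v
        return None


def run_model_test(seed, steps=200):
    """Apply random operations to the store and a dict model; compare on reads."""
    rng = random.Random(seed)
    store, model = KVStore(), {}
    for _ in range(steps):
        key = rng.choice("abc")  # few keys, so operations collide often
        op = rng.choice(["put", "delete", "get"])
        if op == "put":
            value = rng.randint(0, 9)
            store.put(key, value)
            model[key] = value
        elif op == "delete":
            store.delete(key)
            model.pop(key, None)
        else:
            assert store.get(key) == model.get(key), (seed, key)
    return True
```

The model is three lines of dict operations while the implementation could be arbitrarily involved, which is the "deep module" shape where this style of testing pays off.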
If you're curious about it, here's a very detailed spec for TodoMVC in Bombadil: https://github.com/owickstrom/bombadil-playground/blob/maste... It's still work-in-progress but pretty close to the original Quickstrom-flavored spec.
Microsoft had a remotely similar tool named Pex [1], but instead of randomly generating inputs, it instrumented the code so that it could also be executed symbolically, and then used their Z3 theorem prover to systematically find inputs that make each encountered condition either true or false, incrementally exploring all possible execution paths. If I remember correctly, it then generated a unit test for each discovered input with the corresponding output, and you could then judge whether the output was what you expected.
[1] https://www.microsoft.com/en-us/research/publication/pex-whi...
UI tests like:
* if there is one or more items on the page one has focus
* if there is more than one then hitting tab changes focus
* if there is at least one, focusing on element x, hitting tab n times and then shift tab n times puts me back on the original element
* if there are n elements, n>0, hitting tab n times visits n unique elements
are pretty clear and yet cover a remarkable range of issues. I had these for a UI library, all prefaced with “given a UI built with arbitrary calls to the API, those things remain true.”
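Here is a rough sketch of those focus properties in plain Python. `FocusRing` is a hypothetical toy stand-in for a real UI (n focusable elements, Tab cycles forward), not any actual library; in practice the checks would drive a browser or the UI library's API instead.

```python
import random


class FocusRing:
    """Toy UI: n focusable elements; Tab moves forward, Shift+Tab moves back."""

    def __init__(self, n):
        self.n = n
        self.focused = 0 if n > 0 else None

    def tab(self):
        if self.n:
            self.focused = (self.focused + 1) % self.n

    def shift_tab(self):
        if self.n:
            self.focused = (self.focused - 1) % self.n


def check_focus_properties(n, presses):
    """Check the four focus properties for a UI with n elements."""
    ui = FocusRing(n)
    if n >= 1:
        # One or more items: one of them has focus.
        assert ui.focused is not None
    if n > 1:
        # More than one item: Tab changes focus.
        before = ui.focused
        ui.tab()
        assert ui.focused != before
        ui.shift_tab()  # restore the starting position
    if n >= 1:
        # Tab k times, then Shift+Tab k times, returns to the original element.
        start = ui.focused
        for _ in range(presses):
            ui.tab()
        for _ in range(presses):
            ui.shift_tab()
        assert ui.focused == start
        # Tab n times visits n unique elements.
        seen = set()
        for _ in range(n):
            ui.tab()
            seen.add(ui.focused)
        assert len(seen) == n
    return True


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(100):
        check_focus_properties(rng.randint(0, 8), rng.randint(0, 20))
```

An implementation that, say, skips an element or gets stuck on the last one fails at least one of these checks, even though none of them mentions a specific layout.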
Now it’s rare it’d catch very specific edge cases, but it was hard to write something wrong accidentally and still pass the tests. They actually found a bug in the specification which was inconsistent.
I think they can often be easier to write than specific tests and clearer to read, because they say what you are actually testing (a generic property, where otherwise you'd have to write a few explicit examples).
What you could add though is code coverage. If you don’t go through your extremely specific branch that’s a sign there may be a bug hiding there.
I work at Antithesis now, so you can take that with a grain of salt, but for me, everything changed over a decade ago when I started applying PBT techniques broadly and widely. I have found so many bugs that I wouldn't have otherwise found until production.
https://github.com/papers-we-love/san-francisco/blob/master/...