I mean, if you define "do well" charitably, as getting a first draft of something interesting that may or may not be a step toward a solution, that's not completely wrong. A pass/fail test is verified feedback of a sort, one the AI can iterate on quickly. What's very wrong is expecting to get away with only checking that the tests pass, without even loosely surveying what the AI generated (which is invariably what people do when they submit a bunch of vibe-coded pull requests at 10k+ lines each and call that a "gain" in productivity).
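
To make the failure mode concrete, here's a toy sketch (the primality example and the test are hypothetical, not from any real PR) of how green tests can be a weak signal on their own:

```python
# Hypothetical example: the test suite below "passes", but the
# implementation just memorizes the tested cases instead of
# implementing the actual logic -- the kind of shortcut that
# satisfies pass/fail feedback without being correct.

def is_prime(n: int) -> bool:
    # Overfits to the exact inputs the tests check.
    return n in {2, 3, 5, 7}

def test_is_prime():
    assert is_prime(2)
    assert is_prime(7)
    assert not is_prime(4)
    assert not is_prime(9)

test_is_prime()          # passes
print(is_prime(11))      # False -- wrong for every untested prime
```

The suite goes green, yet the function is wrong for every prime it wasn't tested on. A thirty-second skim of the diff would catch this; a CI checkmark never will.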