Really this is why the LLM needs to be able to write exploits for issues it finds. Of course that leads down a rabbit hole of other issues. But if an exploit works, then that's pretty conclusive evidence.
Frontier models, including Mythos, can greatly streamline bug hunting and exploit development in the hands of a competent security engineer. In the hands of someone with no security experience, they will still mostly waste time and money.
I've seen it make the codebase vulnerable by changing the source and then claim it found a vuln. I've also seen it find a well-defended, secure exec function, write a unit test that shows what exec does (which is run commands), and then claim a critical finding.
Interesting that gpt-5.5, while not as good as Mythos, also seems like a decent step up.