Agree it is a good test to try, but there are huge benefits to being able to understand (or better, recreate) 0-conf tests.
The question we asked was whether they can solve a problem autonomously, given instructions that would be clear to a reverse engineering specialist.
That said, I have found these useful for many binary tasks - just not (yet) the end-to-end ones.
With a longer and more detailed prompt (while still keeping it completely non-specific to any particular type of malware/backdoor), the AI would most likely do much better at solving the problem autonomously.
What level of autonomy, though? At some point a human has to fire them off, so it is already somewhat shaky what that means here. What about providing a bunch of manuals in a directory and including "There are manuals in manuals/ you can browse to learn more." in the prompt? If they get the hint, is that "autonomous"?
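For concreteness, here is a minimal sketch of that manuals-in-a-directory experiment. Everything named here is a hypothetical placeholder, not a real harness: `run_agent` is a stub for whatever agent loop is under test, and `sample.bin` / `manuals/` are assumed paths. The point is just that the task prompt stays generic and the only scaffolding is the one-line hint.

```python
from pathlib import Path

# Hypothetical setup: drop reference docs into manuals/ and see whether
# the agent takes the hint without any task-specific guidance.
MANUALS_DIR = Path("manuals")

TASK_PROMPT = (
    "Analyze the binary at ./sample.bin and report any backdoor-like "
    "behavior. Work autonomously and document your reasoning.\n"
    "There are manuals in manuals/ you can browse to learn more."
)

def list_manuals() -> list[str]:
    """Record which manuals were available, so the transcript shows
    what the agent *could* have consulted."""
    if not MANUALS_DIR.exists():
        return []
    return sorted(p.name for p in MANUALS_DIR.iterdir() if p.is_file())

def run_agent(prompt: str) -> str:
    """Placeholder for the agent loop under test (hypothetical)."""
    raise NotImplementedError("wire up your agent harness here")

if __name__ == "__main__":
    print("Available manuals:", list_manuals())
    # The 'autonomy' question then reduces to something observable:
    # did the agent open anything under manuals/ given only the hint?
    # print(run_agent(TASK_PROMPT))
```

Framed this way, "autonomy" at least becomes measurable: you can diff runs with and without the hint line and check whether the agent ever reads the directory.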