The tasks here are entry level, so we are impressed that some AI models can detect some patterns while looking only at binary code. We didn't take it for granted.
For example, only a few models understand Ghidra and Radare2 tooling (Opus 4.5 and 4.6, Gemini 3 Pro, GLM 5): https://quesma.com/benchmarks/binaryaudit/#models-tooling
We consider it a starting point for AI agents being able to work with binaries. Other people discovered the same; see https://x.com/ccccjjjjeeee/status/2021160492039811300 and https://news.ycombinator.com/item?id=46846101.
There is a long way ahead from "OMG, AI can do that!" to an end-to-end solution.
Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...
> However, [the approach of using AI agents for malware detection] is not ready for production.
Then the methodology does not support that. It's "the approach of using AI agents for malware detection with next to zero documentation or guidance is not ready for production."
Agreed, it is a good test to try, but there are huge benefits to being able to understand (or better, recreate) 0-conf tests.
The question we asked is if they can solve a problem autonomously, with instructions that would be clear for a reverse engineering specialist.
That said, I found these useful for many binary tasks - just not (yet) the end-to-end ones.
With a longer and more detailed prompt (while still keeping the prompt completely non-specific to a particular type of malware/backdoor), the AI could most likely solve the problem autonomously much better.
What level of autonomy, though? At some point a human has to fire them off, so it's already kind of shaky what that means here. What about providing a bunch of manuals in a directory and including "There are manuals in manuals/ you can browse to learn more." in the prompt? If they get the hint, is that "autonomously"?
If anything, complex logic is what'll defeat an LLM. But a good model will also highlight such logic being intractable.
see:
- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dns...
- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dro...
The second one is more impressive. I'd like to see the reasoning trace.
Before even looking at the binary, Claude announces it will “look at the authentication functions, especially password checking logic which is a common backdoor target.” It finds the password checking function (svr_auth_password) using strings. And that is the function they decided to backdoor.
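For anyone who hasn't done this kind of triage, the "using strings" step can be approximated in a few lines. This is only a rough sketch of the idea, not what the agent actually ran: the keyword list and the function names printable_strings and auth_related are my own assumptions, and whether a name like svr_auth_password shows up at all depends on whether the binary was stripped.

```python
import re

# Assumed keyword list, chosen for illustration only.
AUTH_HINTS = (b"auth", b"passw", b"login")


def printable_strings(data: bytes, min_len: int = 4):
    """Yield (offset, raw_bytes) for each run of printable ASCII, like strings(1)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    for match in pattern.finditer(data):
        yield match.start(), match.group()


def auth_related(data: bytes):
    """Return (offset, text) pairs whose text contains one of the auth keywords."""
    return [
        (offset, raw.decode("ascii"))
        for offset, raw in printable_strings(data)
        if any(hint in raw.lower() for hint in AUTH_HINTS)
    ]
```

You would call it as auth_related(open(path, "rb").read()) and then look up references to the hits in a disassembler. The point is that this narrows the search to the usual suspects before any real analysis happens.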
I’m experienced with reverse engineering but not experienced with these kinds of CTF-type challenges, so it didn’t occur to me that this function would be a stereotypical backdoor target…
They have a different task (dropbear-brokenauth2-detect) which puts a backdoor in a different function, and zero agents were able to find that one.
On the original task (dropbear-brokenauth-detect), in their runs, Claude reports the right function as backdoored 2 out of 3 times, but it also reports some function as backdoored 2 out of 2 times in the control experiment (dropbear-brokenauth-detect-negative), so it might just be getting lucky. The benchmark seemingly only checks whether the agent identifies which function is backdoored, not the specific nature of the backdoor. Since Claude guessed the right function in advance, it could hallucinate any backdoor and still pass.
But I don’t want to underestimate Claude. My run is not finished yet. Once it’s finished, I’ll check whether it identified the right function and, if so, whether it actually found the backdoor.
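To make the scoring concern concrete, here is a minimal sketch of grading by function name only, assuming (as it seems from the task results) that the grader does not check the described nature of the backdoor. The Run class and both function names are hypothetical, not the benchmark's code. The counts come from the runs mentioned above; the function names in the clean-binary runs are placeholders, and I'm treating the one miss as "nothing reported" because I don't know what it actually said.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Run:
    # Name of the function the agent flagged; None means "no backdoor reported".
    reported_function: Optional[str]


def hit_rate(runs, backdoored_function):
    """Fraction of runs on the backdoored binary that name the right function."""
    return sum(r.reported_function == backdoored_function for r in runs) / len(runs)


def false_alarm_rate(runs):
    """Fraction of runs on the clean binary that flag any function at all."""
    return sum(r.reported_function is not None for r in runs) / len(runs)


# 2 of 3 runs on the backdoored binary named the right function.
positive = [Run("svr_auth_password"), Run("svr_auth_password"), Run(None)]
# 2 of 2 runs on the clean binary flagged something (placeholder names).
negative = [Run("placeholder_fn_a"), Run("placeholder_fn_b")]

print(hit_rate(positive, "svr_auth_password"))
print(false_alarm_rate(negative))
```

With a false alarm rate that high on the control, a hit rate of 2 out of 3 says little on its own, which is why checking the described backdoor and not just the function name matters.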