undefined

points

[-]

One of the authors here.

The tasks here are entry level. So we are impressed that some AI models are able to detect some patterns, while looking just at binary code. We didn't take it for granted.

For example, only a few models understand Ghidra and Radare2 tooling (Opus 4.5 and 4.6, Gemini 3 Pro, GLM 5) https://quesma.com/benchmarks/binaryaudit/#models-tooling

We consider it a starting point for AI agents being able to work with binaries. Other people discovered the same - vide https://x.com/ccccjjjjeeee/status/2021160492039811300 and https://news.ycombinator.com/item?id=46846101.

There is a long way ahead from "OMG, AI can do that!" to an end-to-end solution.

by botusaurus4 hours ago|

parent|

[-]

have you tried stuffing a whole set of tutorials on how to use ghidra in the context, especially for the 1 mil token context like gemini?

by stared4 hours ago|

parent|

[-]

No. To give it a fair test, we didn't tinker with model-specific context-engineering. Adding skills, examples, etc is very likely to improve performance. So is any interactive feedback.

Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...

by anamexis4 hours ago|

parent|

[-]

Why, though? That would make sense if you were just trying to do a comparative analysis of different agent's ability to use specific tools without context, but if your thesis is:

> However, [the approach of using AI agents for malware detection] is not ready for production.

Then the methodology does not support that. It's "the approach of using AI agents for malware detection with next to zero documentation or guidance is not ready for production."

by ronald_petty2 hours ago|

parent|

[-]

Not the author. Just my thoughts on supplying context during tests like these. When I do tests, I am focused on "out of the box" experiences. I suspect the vast majority of actors (good and bad, junior and senior) will use out of the box more then they will try to affect the outcome based on context engineering. We do expect tweaking prompts to provide better outcomes, but that also requires work (for now). Maybe another way to think is reducing system complexity by starting at the bottom (no configuration) before moving to top (more configuration). We can't even replicate out of the box today much less any level of configuration (randomness is going to random).

Agree it is a good test to try, but there are huge benefits beings able to understand (better recreate) 0-conf tests.

by decidu0us90342 hours ago|

parent|

prev|

[-]

All the docs are already in its training data, wouldn't that just pollute the context? I think giving a model better/non-free tooling would help as mentioned. binja code mode can be useful but you definitely need to give these models a lot of babysitting and encouragement and their limitations shine with large binaries or functions. But sometimes if you have a lot to go through and just need some starting point to triage, false pos are fine.

by stared3 hours ago|

parent|

prev|

[-]

You can solve any problem with AI if you give enough hints.

The question we asked is if they can solve a problem autonomously, with instructions that would be clear for a reverse engineering specialist.

That say, I found these useful for many binary tasks - just not (yet) the end-to-end ones.

by anamexis2 hours ago|

parent|

[-]

"With instructions that would be clear for a reverse engineering specialist" is a big caveat, though. It seems like an artificial restriction to add.

With a longer and more detailed prompt (while still keeping the prompt completely non-specific to a particular type of malware/backdoor), the AI could most likely solve the problem autonomously much better.

by embedding-shape3 hours ago|

parent|

prev|

[-]

> The question we asked is if they can solve a problem autonomously

What level of autonomy though? At one point some human have to fire them off, so already kind of shaky what that means here. What about providing a bunch of manuals in a directory and having "There are manuals in manuals/ you can browse to learn more." included in the prompt, if they get the hint, is that "autonomously"?

by Avamander29 minutes ago|

prev|

[-]

I have seen LLMs be surprisingly effective at figuring out such oddities. After all it has ingested knowledge of a myriad of data formats, encryption schemes and obfuscation methods.

If anything, complex logic is what'll defeat an LLM. But a good model will also highlight such logic being intractable.

by akiselev6 hours ago|

prev|

[-]

When I was developing my ghidra-cli tool for LLMs to use, I was using crackmes as tests and it had no problem getting through obfuscation as long as it was prompted about it. In practice when reverse engineering real software it can sometimes spin in circles for a while until it finally notices that it's dealing with obfuscated code, but as long as you update your CLAUDE.md/whatever with its findings, it generally moves smoothly from then on.

by eli6 hours ago|

parent|

[-]

Is it also possible that crackme solutions were already in the training data?

by akiselev6 hours ago|

parent|

[-]

I used the latest submissions from sites like crackmes.ones which were days or weeks old to guard against that.

by achille5 hours ago|

prev|

[-]

in the article they explicitly said they stripped symbols. If you look at the actual backdoors many are already minimal and quite obfuscated,

see:

- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dns...

- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dro...

by comex3 hours ago|

parent|

[-]

The first one was probably found due to the reference to the string /bin/sh, which is a pretty obvious tell in this context.

The second one is more impressive. I'd like to see the reasoning trace.

by comex1 hours ago|

parent|

[-]

Reply to self: I managed to get their code running, since they seemingly haven’t published their trajectories. At least in my run (using Opus 4.6), it turns out that Claude is able to find the backdoored function because it’s literally the first function Claude checks.

Before even looking at the binary, Claude announces it will“look at the authentication functions, especially password checking logic which is a common backdoor target.” It finds the password checking function (svr_auth_password) using strings. And that is the function they decided to backdoor.

I’m experienced with reverse engineering but not experienced with these kinds of CTF-type challenges, so it didn’t occur to me that this function would be a stereotypical backdoor target…

They have a different task (dropbear-brokenauth2-detect) which puts a backdoor in a different function, and zero agents were able to find that one.

On the original task (dropbear-brokenauth-detect), in their runs, Claude reports the right function as backdoored 2 out of 3 times, but it also reports some function as backdoored 2 out of 2 times in the control experiment (dropbear-brokenauth-detect-negative), so it might just be getting lucky. The benchmark seemingly only checks whether the agent identifies which function is backdoored, not the specific nature of the backdoor. Since Claude guessed the right function in advance, it could hallucinate any backdoor and still pass.

But I don’t want to underestimate Claude. My run is not finished yet. Once it’s finished, I’ll check whether it identified the right function and, if so, whether it actually found the backdoor.

by hereme8885 hours ago|

prev|

[-]

I've used Opus 4.5 and 4.6 to RE obfuscated malicious code with my own Ghidra plugin for Claude Code and it fully reverse engineered it. Granted, I'm talking about software cracks, not state-level backdoors.

by halflife6 hours ago|

prev|

[-]

Isn’t LLM supposed to be better at analyzing obfuscated than heuristics? Because of its ability to pattern match it can deduce what obfuscated code does?

by bethekidyouwant4 hours ago|

parent|

[-]

How much binary code is in the training set? (None?)

by Retr0id5 hours ago|

prev|

[-]

Stripping symbols is fairly normal, but hiding imports ought to be suspicious in its own right.