This post did a little bit of that but I wish it had gone into more detail.
If you're running the security audit yourself you should be in a better position to understand and then confirm the issues that the coding agents highlight. Don't treat something as a security issue until you can confirm that it is indeed a vulnerability. Coding agents can help you put that together but shouldn't be treated as infallible oracles.
It's different from HackerOne because those reports tend to come in with all sorts of flowery language added (or prompt-added) by people who don't know what they are doing.
If you're running the prompts yourself against your own coding agents you gain much more control over the process. You can knock each report down to just a couple of sentences which is much faster to review.
Surely people have found patterns that work reasonably well, and it's not "everyone is completely on their own"? I get that the scene is changing fast, but that's ridiculous.
You can do that in conjunction with trying things other people report, but you'll learn more quickly from your own experiments. It's not like prompting a coding agent is expensive or time consuming, for the most part.
But your codebase is unique. What counts as slop in one codebase can be genuinely dangerous in another.
I ran a small Python script that I made some years ago through an LLM recently and it pointed out several areas where the code would likely throw an error if certain inputs were received. Not security, but flaws nonetheless.
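As an illustration (not the commenter's actual script), this is the shape of input-handling flaw such a review tends to flag, and one way to harden it:

```python
# Hypothetical example of the kind of flaw an LLM review flags:
# parsing that raises on unexpected input.

def parse_port(value):
    # Flagged version: int() raises ValueError on input like "abc",
    # and nothing rejects out-of-range ports like 99999.
    return int(value)

def parse_port_fixed(value, default=8080):
    # Hardened version: fall back to a default on bad input
    # and reject ports outside the valid range.
    try:
        port = int(value)
    except (TypeError, ValueError):
        return default
    if not 0 < port < 65536:
        return default
    return port
```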
* Specification extraction. We have security.md and policy.md, often per module. Threat model, mechanisms, etc. This is collaborative and gets checked in for ourselves and the AI. Policy is often tricky & malleable product/business/ux decision stuff, while security is technical layers more independent of that or broader threat model.
* Bug mining. It is driven by the above. It is iterative, where we keep running it to surface findings, adversarially analyze them, and prioritize them. We keep repeating until diminishing returns wrt priority levels. Likely leads to policy & security spec refinements. We use this pattern not just for security, but general bugs and other iterative quality & performance improvement flows - it's just a simple skill file with tweaks like parallel subagents to make it fast and reliable.
This lets the AI drive itself more easily, focused on the things you explicitly care about rather than noise.
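The bug-mining loop above can be sketched roughly like this, with the agent call and the adversarial review stubbed out as hypothetical callables (`run_agent_pass` and `triage` are stand-ins, not a real API):

```python
# Sketch of the iterative bug-mining loop: keep running audit passes,
# adversarially triage the findings, and stop at diminishing returns.

def mine_bugs(run_agent_pass, triage, max_rounds=5, min_new_high=1):
    """Repeat until a round surfaces fewer than min_new_high
    new high-priority findings."""
    confirmed = []
    seen = set()
    for round_no in range(max_rounds):
        findings = run_agent_pass(round_no)   # e.g. parallel subagents
        new_high = 0
        for f in findings:
            key = f["title"]
            if key in seen:                   # drop duplicate reports
                continue
            seen.add(key)
            if triage(f) == "high":           # adversarial/human review
                new_high += 1
                confirmed.append(f)
        if new_high < min_new_high:           # diminishing returns
            break
    return confirmed
```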
“do a thing that would take me a week” cannot actually be done in seconds. It will provide results that superficially resemble reality.
If you were to pass some module in and ask for a finite set of checks on it, maybe.
Despite the claims of agents… treat it more like an intern and you won’t be disappointed.
Would you ask an intern to “do a security audit” of an entire massive program?
Basically, don't trust AI if it says "your program is secure", but if it returns results showing how you could break it, why not take a look?
This is the way I would encourage AI to be used; I prefer such approaches (e.g. general code reviews) to having it write software.
The results of such AI fuzz testing should be treated as just a science experiment and not a replacement for the entire job of a security researcher.
Like conventional fuzz testing, you get the best results if you have a harness to guide it towards interesting behaviors, a good scientific filtering process to confirm something is really going wrong, a way to reduce it to a minimal test case suitable for inclusion in a test suite, and plenty of human followup to narrow in on what's going on and figure out what correctness even means in the particular domain the software is made for.
That's exactly what they're not. Models post-trained with current methods/datasets have pretty poor diversity of outputs, and they're not that useful for fuzz testing unless you introduce input diversity (randomize the prompt), which is harder than it sounds because the variation has to be semantic. Pre-trained models have good output diversity, but they perform much worse. Poor diversity can be fixed in theory but I don't see any model devs caring much.
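One way to read "randomize the prompt, semantically": vary the meaning of the request (target area, attacker model, output form), not just its wording. A small sketch, with all fragment lists invented for illustration:

```python
import random

# Semantic prompt randomization: combine independently varied
# fragments so each prompt asks a genuinely different question.

AREAS = ["authentication flow", "file upload handler",
         "SQL query builder"]
ATTACKERS = ["an unauthenticated user",
             "a logged-in user with low privileges",
             "a malicious dependency"]
FORMATS = ["a numbered list of findings",
           "one finding with a proof-of-concept"]

def random_audit_prompt(rng):
    return (f"Audit the {rng.choice(AREAS)} assuming the attacker is "
            f"{rng.choice(ATTACKERS)}. Report {rng.choice(FORMATS)}.")

rng = random.Random(42)
prompts = {random_audit_prompt(rng) for _ in range(50)}
# A few fragments already yield many semantically distinct prompts.
```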
Put another way: I absolutely would have an intern work on a security audit. I would not have an intern replace a professional audit though.
It's otherwise a pretty low-stakes use. I'd expect false positives to be pretty obvious to someone maintaining the code.
It’s another thing to say: hey intern, security audit this entire code base.
LLMs thrive on context. You need the right context at the right time; it doesn’t matter how good your model is if you don’t have that.
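"Right context at the right time" can be as simple as gathering the module under review plus the files that depend on it, instead of dumping the whole repo into the prompt. A sketch, where the file layout and the `import`-string heuristic are illustrative only:

```python
from pathlib import Path

def gather_context(repo_root, module_name, limit=5):
    """Collect the module under audit plus up to `limit` files
    that appear to import it."""
    root = Path(repo_root)
    module_file = root / f"{module_name}.py"
    related = []
    for path in sorted(root.rglob("*.py")):
        if path == module_file:
            continue
        # Crude dependency check: look for a literal import line.
        if f"import {module_name}" in path.read_text(errors="ignore"):
            related.append(path)
        if len(related) >= limit:
            break
    return [module_file, *related]
```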
Why not?
You can't rely solely on that, but having an extra pair of eyes without prior assumptions about the code is always a good idea.