If you look closely at, say, this patch, you might get a sense of what I mean (although the real cleverness is in the testcase, which we have not made public): https://hg-edge.mozilla.org/integration/autoland/rev/c29515d...
The code before the patch does not look obviously wrong. Some more lines were added, but would you say it now looks less obviously wrong, or more obviously correct?
It seems that the invariants needed here are either in some people's heads, or in some document that is not referenced.
Reading the code for the first time, the immediate question is: "What other lines might be missing? How can I tell?"
If the "obviously correct" level of the code does not increase for a human reviewer, how do we ensure that a similar problem will not arise in the future? Or do we need more LLMs to tell us which other lines need to be added?
What is the point of keeping it private? I'd bet that feeding this patch to Opus and asking it to look for the specific TOCTOU issue fixed by the patch will make it come up with a testcase sooner or later.
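(For anyone unfamiliar with the class of bug being discussed: TOCTOU means a gap between checking a condition and acting on it, during which another actor can invalidate the check. A minimal Python sketch, purely illustrative and unrelated to the actual patch; the `interleave` hook is a hypothetical stand-in for a concurrent process winning the race:)

```python
import os

def read_if_exists_unsafe(path, interleave=lambda: None):
    # TOCTOU pattern: the check and the use are separate steps.
    if os.path.exists(path):      # time of check
        interleave()              # another process can delete the file here
        with open(path) as f:     # time of use: may raise FileNotFoundError
            return f.read()
    return None

def read_if_exists_safe(path):
    # Safer pattern: skip the pre-check, attempt the operation,
    # and handle the failure directly.
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None
```

The fix pattern is the usual one: don't check then act; act and handle the failure.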
At Mozilla, but not everywhere: exploits are a subset of vulnerabilities, which are a subset of bugs.
Much as GitHub calls everything an "issue" and GitLab a "work item".
I'm genuinely curious what "types" of implementation mistakes these were: e.g. library-usage bugs, state-management bugs, control-flow bugs, etc.
Would love to see a writeup about these findings. Maybe Mythos hinted that better fuzzing tools are needed?
In this particular sense, AI tends to find bugs that are closer to what we'd see from a human researcher reading the code. Fuzz bugs are often more "here's a seemingly innocuous sequence of statements that happens to collide three corner cases in an unexpected way".
Outside of SpiderMonkey, my understanding is that many of the best vulnerabilities were in code that is difficult to fuzz effectively for whatever reason.
That being said, I think there's a lot of potential for synergy here: if LLMs make writing code easier, that includes fuzzers, so maybe fuzzers will also end up finding a lot more bugs. I saw somebody on Twitter say they used an LLM to write a fuzzer for Chrome and found a number of security bugs that they reported.
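(To make concrete how little harness code a basic fuzzer needs, here is a toy random-byte fuzzer against a deliberately buggy parser. This is a hypothetical illustration, nothing like a real Chrome harness, and both function names are made up:)

```python
import random

def parse(data: bytes):
    """Toy length-prefixed parser with a deliberate bug:
    it trusts the length byte without a bounds check."""
    if not data:
        return None
    n = data[0]
    # BUG: raises IndexError when n exceeds the remaining bytes.
    return bytes(data[1 + i] for i in range(n))

def fuzz(target, iterations=10_000, seed=0):
    """Feed random byte strings to `target`; collect crashing inputs."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        size = rng.randrange(0, 8)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except IndexError:
            crashes.append(data)
    return crashes
```

Real fuzzers add coverage feedback, corpus mutation, and sanitizers on top, but the shape of the loop is the same, which is why it's such a natural thing to have an LLM generate.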