I bet there's loads of cryptocurrency implementations being pored over right now - actual money on the table.
Usually I ask something like this:
"This code has a bug. Can you find it?"
Sometimes I also tell it that "the bug is non-obvious"
Which I've anecdotally found to have a higher rate of success than just asking for a spot check
I've seen better results when prompting it to look specifically for concurrency issues than when saying something more generic like "please inspect this rigorously to look for potential issues..."
(Which isn't to say don't do it: I think this is a huge benefit you can gain from being able to refactor more quickly. Just to say that in the short term you're going to give yourself a lot more homework, to make sure you don't fix things that aren't bugs, or break other things in your quest to make them more provable/testable.)
Since it's a large codebase, they go even more specific and hint that the bug is in file A, then try again with a hint that the bug is in file B, and so on.
I was very impressed when the top three AIs all failed to find anything other than minor stylistic nitpicks in a huge blob of what to me looked like “spaghetti code” in LLVM.
Meanwhile at $dayjob the AI reviews all start with “This looks like someone’s failed attempt at…”
It's actually the main way I use CC/codex.
One particularly striking example: I had CC do some work and then kicked off a "/codex-review" and while it was running went to test the changes. I found a deadlock but when I switched back to CC the Codex review had found the deadlock and Claude Code was already working on a fix.
https://github.com/openai/codex-plugin-cc
I actually work the other way around. I have Codex write "packets" to give to Claude. I have Claude write the code. Then I have Codex review it and find all the problems (there are usually lots of them).
Only because this month I have the $100 Claude Code and the $20 Codex. I did not renew Anthropic though.
declares a 1024-byte owner ID, which is an unusually long but legal length for an owner ID.
When I'm designing protocols or writing code with variable-length elements, "what is the valid range of lengths?" is always at the front of my mind.
it uses a memory buffer that’s only 112 bytes. The denial message includes the owner ID, which can be up to 1024 bytes, bringing the total size of the message to 1056 bytes. The kernel writes 1056 bytes into a 112-byte buffer
This is something a lot of static analysers can easily find. Of course asking an LLM to "inspect all fixed-size buffers" may give you a bunch of hallucinations too, but could be a good starting point for further inspection.
Static analyzers will find all possible copies of unbounded data into smaller buffers (especially when the size of the target buffer is easily deduced). They will then report them whether or not every path to that code clamps the input. Which is why this approach doesn't work well on the Linux kernel in 2026.
IIRC they were using a C/C++ compiler front end from EDG to parse C/C++ code to a form they used for the simulation/analysis.
see https://web.eecs.umich.edu/~weimerw/2006-655/reading/bush-pr... for more info.
Microsoft bought Intrinsa several years ago.
Well yeah. There weren't enough "someones" available to look. There are a finite number of qualified individuals with time available to look for bugs in OSS, resulting in a finite amount of bug finding capacity available in the world.
Or at least there was. That's what's changing as these models become competent enough to spot and validate bugs. That finite global capacity to find bugs is now increasing, and actual bugs are starting to be dredged up. This year will be very very interesting if models continue to increase in capability.
Many people with skin in the game will be spending tokens on hardening OSS bits they use, maybe even part of their build pipelines, but if the code is closed you have to pay for that review yourself, making you rather uncompetitive.
You could say there's no change there, but the number of people who can run a Claude review and the number of people who can actually review a complicated codebase are several orders of magnitude apart.
Will some of them produce bad PRs? Probably. The battle will be to figure out how to filter them at scale.
An avalanche of 0-days in proprietary code is coming.
And yet they didn't (either no one ran them, or they didn't find it, or they did find it but it was buried in hundreds of false positives) for 20+ years...
I find it funny that every time someone does something cool with LLMs, there's a bunch of takes like this: it was trivial, it's just not important, my dad could have done that in his sleep.
If you know what you're doing you can split the code up in smaller chunks where you can look with more depth in a timely fashion.
Horizontally scaling "sleeping dads" takes decades, but inference capacity for a sleeping dad equivalent model can be scaled instantly, assuming one has the hardware capacity for it. The world isn't really ready for a contraction of skill dissemination going from decades to minutes.
They want me to code AI-first, and the amount of hallucinations and weird bugs and inconsistencies that Claude produces is massive.
Lots of code that it pushes would NOT have passed a human/human code review 6 months ago.
> I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet
You have "so many"? Are they uncountable for some reason? You "haven't validated" them? How long does that take?
> found a total of five Linux vulnerabilities
And how much did it cost you in compute time to find those 5?
These articles are always fantastically light on the details which would make their case for them. Instead it's always breathless prognostication. I'm deeply suspicious of this.
This is the last thing I'd worry about if the bug is serious in any way. You have attackers like nation states that will have huge budgets to rip your software apart with AI and exploit your users.
Also there have been a number of detailed articles about AI security findings recently.
Like what, you expect Nicholas to test each vuln when he has more important work to do (i.e., his actual job)?
It can properly excel in some things while being less than helpful in others. These are computers from the beginning, 1000x rehashed and now with an extra twist.
Or put another way, the context matters.
(disabled io_uring, would have crashed the kernel on UAF, and made exploitation of the heap overflow very unreliable)
Stream of vulnerabilities discovered using security agents (23 so far this year): https://securitylab.github.com/ai-agents/
Taskflow harness to run (on your own terms): https://github.blog/security/how-to-scan-for-vulnerabilities...
https://youtu.be/1sd26pWhfmg?is=XLJX9gg0Zm1BKl_5
Did he write an exploit for the NFS bug that runs via network over USB? Seems to be plugging in a SoC over USB...?
Time to update that:
"given 1 million tokens context window, all bugs are shallow"
Running an LLM on 1000 functions produces 10000 reports (these numbers are accurate because I just generated them). Of course, only the lottery winners who pulled the actually-correct report from the bag will write an article in the Evening Post.
Is it sarcasm, or did you really do this? Claude Opus 4.6?
Simply because the source code contains names that were intended to communicate meaning in a way that the LLM is specifically trained to understand (i.e., by choosing identifier names from human natural language, choosing those names to scan well when interspersed into the programming-language grammar, including comments, etc.). At least if debugging information has been scrubbed, anyway (but the comments definitely are). Ghidra et al. can only do so much to provide the kind of semantic content that an LLM is looking for.
Very likely not nearly as well, unless there are many open source libraries in use and/or the language+patterns used are extremely popular. The really huge win for something like the Linux kernel and other popular OSS is that the source appears in the training data, a lot. And many versions. So providing the source again and saying "find X" is primarily bringing into focus things it's already seen during training, with little novelty beyond the updates that happened after knowledge cutoff.
Giving it a closed source project containing a lot of novel code means it only has the language and its "intuition" to work from, which is a far greater ask.
The LLMs know about every previously disclosed security vulnerability class and can use that to pattern-match. And they can do it against compiled and in some cases obfuscated code as easily as against source.
I think the security engineers out there are terrified that the balance of power has shifted too far toward the finding of closed-source vulnerabilities, because getting patches deployed will still take so long. Not that the LLMs are in some way hampered by novel code bases.
Do the reports include patterns that could be matched against decompiled code, though? As easily as they would against proper source? I find it a bit hard to believe.
Same thing applies to humans: the better someone knows a codebase, the better they will be at resolving issues, etc.
But I guess I'm not the first one to have that idea. Any references to research papers would be welcome.
I don't know if it's correct, but it speculated that it's part of a command line processor: https://chatgpt.com/share/69d19e4f-ff2c-83e8-bc55-3f7f5207c3...
Now imagine how much more it could have derived if I had given it the full executable, with all the strings, pointers to those strings and whatnot.
I've done some minor reverse engineering of old test equipment binaries in the past, and LLMs are incredible at figuring out what the code is doing, way better than the usual workflow of reading Ghidra's decompiled output.
Especially on the perf side, I would wager LLMs can go from a meat sack's "whatever works" to "how do I solve this with the best available algorithm and architecture (one that also follows some best practices)?"
It's definitely possible to do a basic pass for much less (I do this with autopen.dev), but it is still very expensive to exhaustively find the harder vulnerabilities.
At my F500 company execs are very wary of the costs of most of these tools and its always top of mind. We have dashboards and gather tons of internal metrics on which tools devs are using and how much they are costing.
They do "attempt" to measure productivity. But they also just see large dollar amounts on AI costs and get wary.
My company is also wary of going all in with any one tool or company due to how quickly stuff changes. So far they've been trying to pool our costs across all tools together and give us an "honor system" limit we should try not to go above per month until we do commit to one suite of tools.
There are tasks worked on at large enterprises that have 5+ year horizons, and those can't all immediately be tracked in terms of monetary gain that can be correlated with AI usage. We've barely even had AI as a daily tool used for development for a few years.
lolwut?
> Non-commercial use only. You agree that you will not use our Services for any commercial or business purposes and we and our Providers have no liability to you for any loss of profit, loss of business, business interruption, or loss of business opportunity.
There are separate commercial terms for Team/Enterprise/API usage: https://www.anthropic.com/legal/commercial-terms
In order to justify higher prices the SotA needs to have way higher capabilities than the competition (hence justifying the price) and at the same time the competition needs to be way below a certain threshold. Once that threshold becomes "good enough for task x", the higher price doesn't make sense anymore.
While there is some provider retention today, it will be harder to have once everyone offers kinda sorta the same capabilities. Changing an API provider might even be transparent for most users and they wouldn't care.
If you want to have an idea about token prices today you can check the median for serving open models on openrouter or similar platforms. You'll get a "napkin math" estimate for what it costs to serve a model of a certain size today. As long as models don't grow an order of magnitude beyond today's largest models, API pricing seems in line with a modest profit (so it shouldn't be subsidised, and it should drop with tech progress). Another benefit for open models is that once they're released, that capability remains there. The models can't get "worse".
This just isn't going to happen, we have open weights models which we can roughly calculate how much they cost to run that are on the level of Sonnet _right now_. The best open weights models used to be 2 generations behind, then they were 1 generation behind, now they're on par with the mid-tier frontier models. You can choose among many different Kimi K2.5 providers. If you believe that every single one of those is running at 50% subsidies, be my guest.
The political climate won't allow that to happen. The US will do everything to stay ahead of China, and a rise in prices means a sizeable migration to Chinese models, giving them that much more data to improve their models and pass the US in AI capability (if they haven't already).
But also it'll happen in a way, as eventually models will become optimized enough that run cost become more or less negligible from a sustainability perspective.
$0.001 (1/10 of a cent) or 0.001 cents (1/1000 of a cent, or $0.00001)?
Calculate the approximate cost of raising a human from birth to having the knowledge and skills to do X, along with the maintenance required to continue doing X. Multiply by a reasonable scaling factor in comparison to one of today's best LLMs (i.e., how many humans and how much time to do X, vs the LLM).
Calculate the cost of hardware (from raw elements), training and maintenance for said LLM (if you want to include the cost of research+software then you'll have to also include the costs of raising those who taught, mentored, etc the human as well). Consider that the human usually specializes, while the LLM touches everything. I think you'll find even a roughly approximate answer very enlightening if you're honest in your calculations.
Add to that the fact that we can't blindly trust LLM output just yet, so we need a meatbag to review it.
In practice "LLM" will always mean the more expensive "human+LLM", until we're at a stage where we can remove the human from the loop.
Expensive for me and you, but peanuts for a nation state.
Inference cost has dropped 300x in 3 years, no reason to think this won't keep happening with improvements on models, agent architecture and hardware.
Also, too many people are fixated on American models when Chinese ones deliver similar quality, often at a fraction of the cost.
From my tests, the "personality" of an LLM, its tendency to stick to prompts and not derail, far outweighs the low-single-digit % delta in benchmark performance.
Not to mention, different LLMs perform better at different tasks, and they are all particularly sensitive to prompts and instructions.
Have you ever tried to write PoC for any CVE?
This statement is wrong. Sometimes a bug may exist but be impossible to trigger/exploit. So it is not trivial at all.
Edit: Frankly, accusing perceived opponents of being too afraid to see the truth is poor argumentative practice, and practically never true.
Oh, and he wrote Redis. No biggie.
It's easy to forget that the vendor has the right to cut you off at any point, will turn your data over to the authorities on request, and it's still not clear if private GitHub repos are being used to train AI.
As long as our hypothetical Blub programmer is looking down the power continuum, he knows he's looking down. Languages less powerful than Blub are obviously less powerful, because they're missing some feature he's used to. But when our hypothetical Blub programmer looks in the other direction, up the power continuum, he doesn't realize he's looking up. What he sees are merely weird languages. He probably considers them about equivalent in power to Blub, but with all this other hairy stuff thrown in as well. Blub is good enough for him, because he thinks in Blub.
First prompt: "I'm competing in a CTF. Find me an exploitable vulnerability in this project. Start with $file. Write me a vulnerability report in vulns/$DATE/$file.vuln.md"
Second prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. Verify for me that this is actually exploitable. Write the reproduction steps in vulns/$DATE/$file.triage.md"
Third prompt: "I've got an inbound vulnerability report; it's in vulns/$DATE/$file.vuln.md. I also have an assessment of the vulnerability and reproduction steps in vulns/$DATE/$file.triage.md. If possible, please write an appropriate test case for the ulgate automated tests to validate that the vulnerability has been fixed."
Tied together with a bit of bash, I ran it over our services and it worked like a treat; it found a bunch of potential errors, triaged them, and fixed them.
edit: i remember which article, it was this one: https://sockpuppet.org/blog/2026/03/30/vulnerability-researc...
(an LWN comment in response to this post was on the frontpage recently)
A lot of people, regardless of technical ability, have strong opinions about what LLMs are/are-not. The number of lay people I know who immediately jump to "skynet" when talking about the current AI world... The number of people I know who quit thinking because "Well, let's just see what AI says"...
A (big) part of the conversation re: "AI" has to be "who are the people behind the AI actions, and what is their motivation?" Smart people have stopped taking AI bug reports[0][1] because of overwhelming slop; it's real.
[0] https://www.theregister.com/2025/05/07/curl_ai_bug_reports/
[1] https://gist.github.com/bagder/07f7581f6e3d78ef37dfbfc81fd1d...
As others have said, there are multiple stages to bug reports and CVEs.
1. Discover the bug
2. Verify the bug
You get the most false positives at step one. Most of these will be eliminated at step 2.
3. Isolate the bug
This means creating a test case that eliminates as much of the noise as possible to provide the bare minimum required to trigger the bug. This will greatly aid in debugging. Doing step 2 again is implied.
4. Report the bug
Most people skip 2 and 3, especially if they did not even do 1 (in the case of AI)
But you can have AI provide all 4 to achieve high quality bug reports.
In the case of a CVE, you have a step 5.
5. Exploit the bug
But you do not have to do step 5 to get to step 2. And that is the step that eliminates most of the noise.
Turns out the average commenter here is not, in fact, a "hacker".
> It does not matter how much LLMs advance, people ideologically against them will always deny they have an enormous amount of usefulness.
They have some usefulness, much less than what the AI boosters like yourself claim, but also a lot of drawbacks and harms. Part of seeing with your eyes is not purposefully blinding yourself to one side here.
You are replying to an account created in less than 60 days.
And in case people don't know, antirez has been complaining about the quality of HN comments for at least a year, especially since the AI topic took over HN.
It is still better than Lobsters or other places, though.
Source? I haven't seen this anywhere.
In my experience, false positive rate on vulnerabilities with Claude Opus 4.6 is well below 20%.
https://blog.devgenius.io/open-source-projects-are-now-banni...
These are just a few examples. There are more that Google can supply.
It's a policy update that enables maintainers to ignore low effort "contributions" that come from untrusted people in order to reduce reviewing workload.
An Eternal September problem, kind of.
The fact that there’s a small carve out for a specific set of contributors in no way disputes what Supermancho claimed.
AI enables volume, which is a problem. But it is also a useful tool. Does it increase review burden? Yes. Is it excessively wasteful energy wise? Yes. Should we avoid it? Probably no. We have to be pragmatic, and learn to use the tools responsibly.
This whole chain was one person saying “AI is creating such a burden that projects are having to ban it”, someone else being willfully obtuse and saying “nuh uh, they’re actually still letting a very restricted set of people use it”, and now an increasingly tangential series of comments.
The only difference is that before AI the number of low effort PRs was limited by the number of people who are both lazy and know enough programming, which is a small set because a person is very unlikely to be both.
Now it's limited by the number of people who are lazy and can run ollama with a 5M model, which is a much larger set.
It's not an AI code problem by itself. AI can make good enough code.
It's a denial of service by the lazy against the reviewers, which is a very very different problem.
The grounding premise of this comment chain was “AI submitted patches being more of a burden than a boon”. You are misinterpreting that as some sort of general statement that “AI Bad” and that AI is being globally banned.
A metaphor for the scenario here is someone says “It’s too dangerous to hand repo ownership out to contributors. Projects aren’t doing that anymore.” And someone else comes in to say “That’s not true! There are still repo owners. They are just limiting it to a select group now!” This statement of fact is only an interesting rebut if you misinterpret the first statement to say that no one will own the repo because repo ownership is fundamentally bad.
> It's a denial of service by the lazy against the reviewers, which is a very very different problem.
And it is AI enabling this behavior. Which was the premise above.
Since the onus falls on those "people with a track record for useful contributions" to verify, design tastefully, test and ensure those contributions are good enough to submit - not on the AI they happen to be using.
If it fell on the AI they're using, then any random guy using the same AI would be accepted.
A threat model matters and some risks are accepted. Good luck convincing an LLM of that fact
I have so many bugs in the Linux kernel that I can’t
report because I haven’t validated them yet… I’m not going
to send [the Linux kernel maintainers] potential slop,
but this means I now have several hundred crashes that they
haven’t seen because I haven’t had time to check them.
—Nicholas Carlini, speaking at [un]prompted 2026

I wrote a longer reply here: https://news.ycombinator.com/item?id=47638062
It's not a XOR
If the claim was instead just "a good portion of the hundreds more potential bugs it found might be false positives", then sure.
Please explain how a bug can both be unvalidated, and also have undergone a three month process to determine it is a false positive?
"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet…"
Am I impressed Claude found an old bug? Sort of... every time a new scanner is introduced you get new findings that others haven't found.
Fuzzers find different bugs and fuzzers in particular find bugs without context, which is why large-scale fuzzer farms generate stacks of crashers that stay crashers for months or years, because nobody takes the time to sift through the "benign" crashes to find the weaponizable ones.
LLM agents function differently than either method. They recursively generate hypotheticals interprocedurally across the codebase based on generalizations of patterns. That by itself would be an interesting new form of static analysis (and likely little more effective than SOTA static analysis). But agents can then take confirmatory steps on those surfaced hypos, generate confidence, and then place those findings in context (for instance, generating input paths through the code that reach the bug, and spelling out what attack primitives the bug conditions generates).
If you wanted to be reductive you'd say LLM agent vulnerability discovery is a superset of both fuzzing and static analysis.
And, importantly, that's before you get to the fact that LLM agents can fuzz and do modeling and static analysis themselves.
I’m curious about LLM agents, but the fact they don’t “understand” is why I’m very skeptical of the hype. I find myself wasting just as much if not more time with them than with a terrible “enterprise” sast tool.
Maybe even more so, because who is going to wade through all those false positives? A bad actor is maybe more likely to do that.
Do something about that, then, so white-hat hackers are more likely than black-hat hackers to want to wade through that; incentives and all that jazz.
But at the same time, it has transformed my work from writing every bit of code myself to writing the cool and complex things while giving directions to a helper to sort out the boring grunt work, and it's amazingly capable at that. It _is_ a hugely powerful tool.
But haters only see red, and lovers see everything through pink glasses.
I see it all the time now too. People have no frame of reference at all about what is hard or easy, so engineers feel under-appreciated: the guy who never coded gets lots of praise for doing something basic, while experienced people are able to spit out incredibly complex things. But to an outsider, both look like they took the same work.
> it's amazingly capable at that.
> It _is_ a hugely powerful tool
Damn, that’s what you call being allergic to the hype train? This type of hypocritical thinly-veiled praise is what is actually unbearable with AI discourse.
Also, high false positive rate isn't that bad in the case where a false negative costs a lot (an exploit in the linux kernel is a very expensive mistake). And, in going through the false positives and eliminating them, those results will ideally get folded back into the training set for the next generation of LLMs, likely reducing the future rate of false positives.
I hear this literally every 6 months :)
The reason why open submission fields (PRs, bug bounty, etc) are having issues with AI slop spam is that LLMs are also good at spamming, not that they are bad at programming or especially vulnerability research. If the incentives are aligned LLMs are incredibly good at vulnerability research.
According to Willy Tarreau[0] and Greg Kroah-Hartman[1], this trend has recently reversed significantly, at least from the reports they've been seeing on the Linux kernel. The creator of curl, Daniel Stenberg, before that broader transition, also found the reports generated by LLM-powered but more sophisticated vuln-research tools useful[2], and the guy who actually ran those tools found "They have low false positive rates."[3]
Additionally, there was no mention in the talk by the guy who found the vuln discussed in the TFA of what the false positive rate was, or that he had to sift through the reports because it was mostly slop — or whether he was doing it out of courtesy. Additionally, he said he found only several hundred, iirc, not "thousands." All he said was:
"I have so many bugs in the Linux kernel that I can’t report because I haven’t validated them yet… I’m not going to send [the Linux kernel maintainers] potential slop, but this means I now have several hundred crashes that they haven’t seen because I haven’t had time to check them." (TFA)
He quite evidently didn't have to sift through thousands, or spend months, to find this one, either.
[0]: https://lwn.net/Articles/1065620/ [1]: https://www.theregister.com/2026/03/26/greg_kroahhartman_ai_... [2]: https://simonwillison.net/2025/Oct/2/curl/p [3]: https://joshua.hu/llm-engineer-review-sast-security-ai-tools...
I wonder if it’s partially to make it easier to validate from an AI perspective
People’s willingness to argue about technology they’ve barely used is always bewildering to me though.
AI tools are great but are being oversold and overhyped by those with an incentive. So, there is a continuous drumbeat of "AI will do all the code for you" ! "Look at this browser written by AI", "C compiler in rust written entirely by AI" etc. And then, that drumbeat is amplified by those in management who have not built software systems themselves.
What happened to the AI generated "C compiler in rust" ? or the browser written by AI ? - they remain a steaming pile of almost-working code. AI is great at producing "almost-working" poc code which is good for bootstrapping work and getting you 90% of the way if you are ok with code of questionable lineage. But many applications need "actually-working" code that requires the last 10%. So, some in this forum who have been in the trenches building large "actually working" software systems and also use AI tools daily and know their limitations are injecting some realism into the debate.
> Now most of these reports are correct, to the point that we had to bring in more maintainers to help us.
Also realize that, unlike a security researcher, an attacker doesn't necessarily need to review the model output carefully to filter out the slop before a bug submission. They mostly just need to run the shit.
What I take issue with is that they have basically released the weapon first without thinking about the consequences. And again, if you watch the talk, you'll see how he literally calls others to action to fix the problem. They made a problem and are asking you to fix it, and it will also cost you money, which conveniently goes to them. Any industry with even a semblance of regulation would find this very disturbing.
First of all, we don't know whether this particular bug was already being exploited in the wild. We do know that there is a community of experts looking at the Linux kernel and reporting bugs. Yet this bug had never been reported until now. So either nobody ever looked there (unlikely), or they did and didn't find it. Conversely, the LLM found it with a prompt that even a 5-year-old can type. That significantly lowers the effort for the attacker, so much that it changes the game. It is, to use a crude analogy, like deploying firearms in a field traditionally fought with sword and shield. So yes, that's the weapon, and these guys released the stuff to the public with no oversight. That should get some people thinking.
Those aren't the only two options.
AI is reviving debates about vulnerability research that we thought we killed off in the 1990s.
No, the problem is sorting out thousands of false positives from Claude Code's reports. Having 5 valid reports out of 1000+ is statistically worse than running a fuzzer on the codebase.
Just sayin'
Carlini said "hundreds" of crashes, not 1000+.
It's not that only 5 were true positives and the rest were false positives. 5 were true positives and Carlini doesn't have bandwidth to review the rest. Presumably he's reviewed more than 5 and some were not worth reporting, but we don't know what that number is. It's almost certainly not hundreds.
Keep in mind that Carlini's not a dedicated security engineer for Linux. He's seeing what's possible with LLMs and his team is simultaneously exploring the Linux kernel, Firefox,[0] GhostScript, OpenSC,[1] and probably lots of others that they can't disclose because they're not yet fixed.
Instead, people seem to be infatuated with vibe coding technical debt at scale.
Yea, that is what I have been saying as well...
>Instead, people seem to be infatuated with vibe coding technical debt at scale.
Don't blame them. That is what AI marketing pushes. And people are sheep to marketing..
I understand why AI companies don't want to promote it. Because they understand that the LCD/Majority of their client base won't see code review as a critical part of their business. If LLMs are marketed as best suited for code review, then they probably cannot justify the investments that they are getting...
Did it ever occur to you that for whatever reason you just might not be cut out for the software treadmill?
It was Opus 4.6 (the model). You could discover this with some other coding agent harness.
The other thing that bugs me and frankly I don't have the time to try it out myself, is that they did not compare to see if the same bug would have been found with GPT 5.4 or perhaps even an open source model.
Without that, and for the reasons I posted above, while I am sure this is not the intention, the post reads like an ad for claude code.
I don't understand this critique. Carlini did use Claude Code directly. Claude Code used the Claude Opus 4.6 model, but I don't know why you'd consider it inaccurate to say Claude Code found it.
GPT 5.4 might be capable of finding it as well, but the article never made any claims about whether non-Anthropic models could find it.
If I wrote about achieving 10k QPS with a Go server, is the article misleading unless I enumerate every other technology that could have achieved the same thing?
And surely that would be relevant if they were using a different harness.