by moomin 22 hours ago

by ZrArm 22 hours ago

> Maybe curl is significantly better hardened than most projects?

Meanwhile from [1]:

"Not even half-way through this #curl release cycle we are already at 11 confirmed vulnerabilities - and there are three left in the queue to assess and new reports keep arriving at a pace of more than one/day."

"The simple reason is: the (AI powered) tools are this good now. And people use these tools against curl source code. They find lots of new problems no one detected before. And none of these new ones used Mythos. Focusing on Mythos is a distraction - there are plenty of good models, and people who can figure out how to get those models and tools to find things."

Yeah, it looks like there are at least 11 security bugs missed by Mythos.

[1] https://www.linkedin.com/feed/update/urn:li:activity:7463481...

by computomatic 16 hours ago

I’m trying to reconcile this with TFA, because the article says that the majority of vulns found by Mythos are being reported by independent researchers after validation. They never said those reports disclose that Mythos was involved, and I suspect they don’t. So did any of these 11 CVEs come from that channel?

by solenoid0937 21 hours ago

I don't think anyone has claimed that Mythos finds all vulns in all projects. But it's very good if Mozilla's blog posts are anything to go by.

by _heimdall 9 hours ago

Based on the article here, and Firefox's Mythos article, they had found bugs with Opus 4.6 as well, but Mythos is finding more that it missed.

That would align with the curl feedback you linked: they aren't using Mythos but are finding bugs with other models. Presumably the expectation is that with Mythos they'd find more bugs that the models already in use had missed.

by frumiousirc 8 hours ago

> Based on the article here, and Firefox's Mythos article, they had found bugs with Opus 4.6 as well, but Mythos is finding more that it missed.

It's not quite apples-to-apples. It was Opus on Firefox 148, Mythos on 150. A better test of Mythos vs Opus would have been to apply Mythos to Firefox 148. Or also re-apply Opus to Firefox 150.

Do we know all the Opus+Firefox 148 bugs are fixed in Firefox 150? Do we know the number of new bugs introduced per Firefox release?

by _heimdall 2 hours ago

> Do we know all the Opus+Firefox 148 bugs are fixed in Firefox 150? Do we know the number of new bugs introduced per Firefox release?

That may be parsable from their bug tracker, though I don't know if all bugs raised by Mythos are public.

I'd be particularly interested in how many of the bugs found existed in 148. Assuming most or all of them weren't newly created bugs added in 149 or 150, the comparison should still hold even though Opus and Mythos looked at different releases.

by IndeanCondor 7 hours ago

The same UK security research body ran the same CTF against GPT5.5. GPT5.5 got the same result as Mythos.

Anthropic promised us that Mythos was such an existential threat that it would compromise "every OS and browser on devices across the planet". They've held conferences and meetings with banks and govts across the world, shouting how critical this issue is.

GPT5.5 has been out for a month. Every device on earth has not been breached yet. It's very fair to criticize Anthropic's maximalist posturing when it's becoming exceedingly clear their models are fairly behind OpenAI's in capability.

In my opinion, the original commenter's statement stands, and the UK govt data point only helps support that due to the equal result between Mythos and GPT.

I'd advise reading into the specifics of what happened with Firefox. The TL;DR is that Opus 4.6 (yes, Opus) scanned a reduced-safety version of its code and found a multitude of bugs and 4 high-severity vulns that did not escape the sandbox. The Mythos system card test describes running Mythos against the same issues Opus found to see if it could reliably replicate and chain together an attack.

by SecretDreams 18 hours ago

I think for every data point, we need to know how many tokens and how much money were burned to achieve the outcome. And how buggy each piece of software was to start with.