But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4, yet, where I gave it multiple opportunities, but the dense version was consistently able to find four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not quite accurate description (which gets partial credit in its own column on the big benchmark), six altogether. Leaving three of the bugs in the corpus that no model other than Mythos ever found, but also making Gemma 4 31B the best model I have results for (but it got multiple attempts, which I assume would make any of the models perform better).
So, my conclusion, not very strongly held, is: Mythos is both better than other public models and it has fewer guardrails. But, also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models when run under Antigravity refused to perform the work. Maybe Mistral silently refused due to guardrails, I'm not sure, since it failed to find any bugs. Maybe it just sucks.
This benchmark is about finding security bugs, not writing secure code. I don't believe the models have guardrails that prevent writing safe code, but they're also not intelligent and have a bunch of insecure code in their training data, so they definitely write insecure code sometimes.
Did it "disprove" it retroactively or just changed what the situation is, given that until then they were indeed weaker at small sizes?
Anyway, I kinda think among US models only Fable really tries to block security work like this, based on my experience so far.
But, I have come to consider Gemma 4 31b the best model I can self-host, even though there are bigger models that'll fit on the Strix Halo. (I could also use much bigger MoE models on my desktop which has 64GB VRAM and 112GB system RAM.)
I'm confused. Your own results show that Gemma 4 26B A4B and Qwen3.6-27B did better in these tests?
I really like Gemma 4 31B, especially with how exceptionally good its MTP drafter is, but it is absurdly weak at tool calling and instruction following in my testing, and its smaller siblings are even worse at this. If the system prompt says to do something, Gemma 4 31B will very often ignore that entirely. It will also make fewer tool calls than were needed to solve a problem, so then it fails. The Qwen3.6 series is much, much more reliable for carrying out instructions and doing agentic tasks in my testing, although they can get stuck in loops.
There is a lot of potential in the Gemma 4 series, but I think Google needs to release a Gemma 4.1 update to polish the rough edges. Unfortunately, if Gemma 3's lifecycle is any indication, Google won't release a true revision of the Gemma 4 models, even if they release a bunch of specialized research models based on Gemma 4 over the next year.
The Gemma 4 replication tests are not published, yet, but Gemma 4 31B consistently performs the best of all of them. Note Gemma 4 31b has two "partials" on the big benchmark, which means it found a bug in the right place but the judge didn't think it understood the bug, those are probably unfairly judged "wrong bug" by Opus. It consistently finds four of nine, and sometimes finds two others, making Gemma 4 31b the best model I've tested. But, I suspect the big models would do even better if giving multiple attempts, as I did for Gemma 4. You can see the report of that here, note 31b finds six(!) of nine bugs if given a couple of attempts (MoE does much worse than the dense model, it may degrade more due to quantization, I'm still experimenting): https://swelljoe.com/html/gemma-promptlab-report.html
The "partial" score thing is kinda tricky, but it's actually quite rare for a model to find the right place but describe the bug in a way that Opus considers it to be the wrong bug. So, I'm inclined to give Gemma 4 full credit for those finds. When I read its bug report, it's clear that you'd fix the problem Gemma describes the same way as you would if given Opus' description of the problem, even if the mechanism of exploit is different. That, to me, is a hit. Opus called it the wrong bug.
And, yeah, a more powerful Gemma would be great. I'd love a double-sized Gemma 4 MoE (something like 70B A8B maybe, or even 122B A12B). I think that'd make self-hosted models feasible for a lot of tasks. It'd run comfortably on a 128GB machine, and if it's some reasonable amount smarter than the 31B, it'd be a real beast.
Read the cloudflare blog about using Mythos. Mythos is important and notable because of the harness and self-direction. It's not necessarily a way stronger bug finder, but it was trained to do the end to end analysis autonomously, which is a big deal.
To my eyes, the Mythos story is most important as a step toward custom trained harnesses and their effectiveness; there's clearly some sort of plateau we are very close to for some domains where you can just stop getting humans in the loop, radically changing cost, timing and ROI for some tasks.
Opus 5 might become a distillation from Mythos.