Early on, I had a vague suspicion that the reason some of the Chinese models, including quite small ones, perform so well on this task, especially relative to their size and cost, is that they don't have the same safety guardrails baked in regarding software security that US models seem to have. Gemini 3.1 Pro doing so poorly sort of reinforced that gut feeling.

But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes. I haven't published the replication results for Gemma 4 yet, where I gave it multiple opportunities, but the dense version was consistently able to find four of the nine bugs exactly, plus two other very difficult bugs that it found occasionally, sometimes with a not-quite-accurate description (which gets partial credit in its own column on the big benchmark). That's six altogether. That leaves three bugs in the corpus that no model other than Mythos ever found, but it also makes Gemma 4 31B the best model I have results for (though it got multiple attempts, which I assume would make any of the models perform better).

So, my conclusion, not very strongly held, is that Mythos is both better than other public models and has fewer guardrails, but also that the guardrails in current models are probably not strict enough to prevent this work. Only Gemini models, when run under Antigravity, refused to perform the work. Maybe Mistral silently refused due to guardrails; I'm not sure, since it failed to find any bugs. Maybe it just sucks.

by scorpioxy12 hours ago|

Can you elaborate on the software security guardrails that US models "seem to have"? According to blog posts I read, the generated code had security problems, and naive ones at that. Perhaps it has gotten better now, or people have learned not to blindly vibe code applications that are to be used publicly, but it certainly didn't feel like there were security guardrails.

by SwellJoe11 hours ago|

I'm talking about guardrails that prevent finding exploits, which is only peripherally related to writing secure code.

This benchmark is about finding security bugs, not writing secure code. I don't believe the models have guardrails that prevent writing safe code, but they're also not intelligent and have a bunch of insecure code in their training data, so they definitely write insecure code sometimes.

by coldtea11 hours ago|

>But, then Gemma 4 proved to be extraordinarily good for its size (better than Qwen), and kinda disproved that US models are any weaker at small sizes.

Did it "disprove" it retroactively, or did it just change the situation, given that until then they were indeed weaker at small sizes?

by SwellJoe11 hours ago|

I don't know. I think it proves that if Google is baking guardrails into their models that prevent them from finding security bugs, they didn't bake those guardrails into Gemma 4, because it is very good at it. Maybe that means Google devs had a change of heart. Maybe it means something about Gemma 4 architecture is better for this task than Gemini 3.1 Pro. Gemini Flash 3.5 did OK though.

Anyway, I kinda think among US models only Fable really tries to block security work like this, based on my experience so far.

by pbgcp20269 hours ago|

I concur with "Gemma 4 31B the best model I have results for". My workflow includes a lot of Gemma 4, specifically the dense 31B non-quantised version. (BTW, I found it is most cost-effective to run on Bedrock.)

by SwellJoe5 hours ago|

I tried to prove quantization made models worse, but in my testing Qwen 3.6 27b performed statistically the same from 4 bits to 16, using the unsloth dynamic quantizations. Gemma 4 4-bit QAT seems to perform the same as the full-fat version, but quite a lot faster.

But, I have come to consider Gemma 4 31b the best model I can self-host, even though there are bigger models that'll fit on the Strix Halo. (I could also use much bigger MoE models on my desktop which has 64GB VRAM and 112GB system RAM.)

by coder5432 hours ago|

> I have come to consider Gemma 4 31b the best model I can self-host

I'm confused. Your own results show that Gemma 4 26B A4B and Qwen3.6-27B did better in these tests?

I really like Gemma 4 31B, especially with how exceptionally good its MTP drafter is, but it is absurdly weak at tool calling and instruction following in my testing, and its smaller siblings are even worse at this. If the system prompt says to do something, Gemma 4 31B will very often ignore that entirely. It will also make fewer tool calls than were needed to solve a problem, so then it fails. The Qwen3.6 series is much, much more reliable for carrying out instructions and doing agentic tasks in my testing, although they can get stuck in loops.

There is a lot of potential in the Gemma 4 series, but I think Google needs to release a Gemma 4.1 update to polish the rough edges. Unfortunately, if Gemma 3's lifecycle is any indication, Google won't release a true revision of the Gemma 4 models, even if they release a bunch of specialized research models based on Gemma 4 over the next year.

by SwellJoe20 minutes ago|

I have done replication tests of Qwen and the Gemma models. The Qwen benchmarks are published: https://swelljoe.com/post/qwen-quantization-degradation/ . (Though, I still want to add the other three cases to that one. I was mostly testing quantization effects in that test, but it also served as a replication test of Qwen in finding bugs.)

The Gemma 4 replication tests are not published yet, but Gemma 4 31B consistently performs the best of all of them. Note that Gemma 4 31b has two "partials" on the big benchmark, which means it found a bug in the right place but the judge didn't think it understood the bug; those were probably unfairly judged "wrong bug" by Opus. It consistently finds four of nine, and sometimes finds two others, making Gemma 4 31b the best model I've tested. But I suspect the big models would do even better if given multiple attempts, as I did for Gemma 4. You can see the report here; note that 31b finds six(!) of nine bugs if given a couple of attempts (the MoE does much worse than the dense model; it may degrade more due to quantization, I'm still experimenting): https://swelljoe.com/html/gemma-promptlab-report.html

The "partial" score thing is kinda tricky, but it's actually quite rare for a model to find the right place but describe the bug in a way that Opus considers it to be the wrong bug. So, I'm inclined to give Gemma 4 full credit for those finds. When I read its bug report, it's clear that you'd fix the problem Gemma describes the same way as you would if given Opus' description of the problem, even if the mechanism of exploit is different. That, to me, is a hit. Opus called it the wrong bug.
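To make the "consistently finds four, sometimes finds two others" tally concrete, here is a minimal sketch of how per-attempt results could be rolled up. It is not the benchmark's actual scoring code: the function name `summarize_attempts`, the bug IDs, and the per-run sets are all hypothetical, chosen only so the totals mirror the four/two/three split described above.

```python
from collections import Counter


def summarize_attempts(attempts, total_bugs):
    """Roll up several runs into consistent vs. occasional finds.

    attempts: list of sets, each holding the bug IDs one run found.
    total_bugs: number of bugs in the corpus.
    """
    # Count in how many runs each bug was found.
    counts = Counter(bug for found in attempts for bug in found)
    runs = len(attempts)
    # "Consistent" means found in every run; everything else is occasional.
    consistent = {bug for bug, n in counts.items() if n == runs}
    occasional = set(counts) - consistent
    return {
        "consistent": sorted(consistent),
        "occasional": sorted(occasional),
        "found_any": len(counts),
        "never_found": total_bugs - len(counts),
    }


# Hypothetical data, NOT real benchmark results: bug IDs 1..9, three runs.
attempts = [{1, 2, 3, 4}, {1, 2, 3, 4, 5}, {1, 2, 3, 4, 6}]
summary = summarize_attempts(attempts, total_bugs=9)
print(summary)
```

Whether a "partial" counts as a find is a judgment call this sketch leaves out; it would just change which IDs go into each run's set.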

And, yeah, a more powerful Gemma would be great. I'd love a double-sized Gemma 4 MoE (something like 70B A8B maybe, or even 122B A12B). I think that'd make self-hosted models feasible for a lot of tasks. It'd run comfortably on a 128GB machine, and if it's some reasonable amount smarter than the 31B, it'd be a real beast.

by vessenes7 hours ago|

It's really not the same thing.

Read the Cloudflare blog about using Mythos. Mythos is important and notable because of the harness and self-direction. It's not necessarily a way stronger bug finder, but it was trained to do the end-to-end analysis autonomously, which is a big deal.

To my eyes, the Mythos story is most important as a step toward custom trained harnesses and their effectiveness; there's clearly some sort of plateau we are very close to for some domains where you can just stop getting humans in the loop, radically changing cost, timing and ROI for some tasks.

by blenklo1 hours ago|

No, Mythos is probably a 10 trillion parameter model, Fable is Mythos with filtering (perhaps a small LLM in front, or fine-tuned), and Opus is a 1-2 trillion parameter model.

Opus 5 might become a distillation from Mythos.

by kevinh45613 hours ago|

Fable, the same model as Mythos with extra safety controls, was much faster, more accurate, and more token efficient than previous models. What I got done with it in 48 hours accelerated my personal project from concept to deployed prototype.

by pbgcp20269 hours ago|

Fable is not the same model as Mythos but with guardrails. There are many things that were never disclosed by Project Glasswind. And probably will never be.

by cheeze13 hours ago|

Why wouldn't OpenAI offer the same?

by pbgcp20269 hours ago|

My bet is actually on GLM. Z.ai does amazing work and they will overtake Western models, IMO faster than DS or Qwen. They have an amazing team and a very capable, smart leader.