OK, consider a for loop that goes through your repo, then through each file, and then through each common vulnerability...

Is Mythos somehow more powerful than just a recursive for loop, aka "agentic" review? You can run `open code run --command` with a tailored command for whatever vulnerabilities you're looking for.
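For what it's worth, here's a minimal sketch of that loop. `run_review` and `VULN_CHECKS` are hypothetical stand-ins; a real version would shell out to whatever CLI/model call you use:

```python
# Sketch of the "recursive for loop" review: walk the repo, and for each
# (file, vulnerability) pair run one targeted check.
from pathlib import Path

# Hypothetical list of checks; a real one would be tailored prompts.
VULN_CHECKS = ["sql injection", "path traversal", "hardcoded secrets"]

def run_review(source: str, vuln: str) -> list[str]:
    # Stub: a real version would prompt a model with the file contents.
    return [vuln] if vuln.split()[0] in source.lower() else []

def review_repo(root: str) -> dict[str, list[str]]:
    findings: dict[str, list[str]] = {}
    for path in Path(root).rglob("*.py"):
        source = path.read_text()
        for vuln in VULN_CHECKS:
            hits = run_review(source, vuln)
            if hits:
                findings.setdefault(str(path), []).extend(hits)
    return findings
```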

reply
Newer models have larger context windows, and more stable reasoning across those larger windows.

If you point your model directly at the thing you want it to assess, and it doesn't have to gather any additional context, you're not really testing those things at all.

Say you point Kimi and Opus at some code and give them an agentic looping harness with code review tools. They're going to start digging into the code, gathering context by mapping out references and following leads.

If the bug is really shallow, the model is going to get everything it needs to find it right away, and neither of them will have any advantage.

If the bug is deeper and requires a lot more code context, Opus is going to be able to hold onto a lot more information, and it's going to be a lot better at reasoning across all of it. That's a test that would actually compare the models directly.
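A toy version of that harness loop, with a scripted stand-in for the model (every name here is invented for illustration; a real harness would send `context` to an LLM API each turn):

```python
# Toy "agentic looping harness": the model asks for tools (read a file,
# list references) until it emits an answer.
def run_agent(model, tools, task, max_steps=10):
    context = [task]
    for _ in range(max_steps):
        action = model(context)          # model decides next step from context
        if action["tool"] == "answer":   # model is done gathering context
            return action["text"]
        result = tools[action["tool"]](action["arg"])
        context.append(result)           # every tool result grows the context
    return None                          # ran out of steps

# Scripted stub: read one file, then follow a reference, then answer.
def scripted_model(context):
    steps = [
        {"tool": "read_file", "arg": "auth.py"},
        {"tool": "find_refs", "arg": "check_token"},
        {"tool": "answer", "text": "check_token never validates expiry"},
    ]
    return steps[len(context) - 1]

tools = {
    "read_file": lambda path: f"contents of {path}",
    "find_refs": lambda name: f"{name} is called from login() and refresh()",
}
```

The point of the structure: the deeper the bug, the more turns the loop takes, and the more the model has to reason across everything it has accumulated in `context`.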

Mythos is just a bigger model with a larger context window and, presumably, better prioritization and stronger attention mechanisms.

reply
Harnesses basically do this better than just adding more context. Every time you add context, REGARDLESS OF MODEL SIZE, you increase the odds the model will get confused about any given line of thought. So context size is no longer some magic you just sprinkle on these things so they suddenly don't hallucinate.

So it's the old ML joke: it's just a bunch of if statements. As others are pointing out, it's quite probable that the model isn't the thing doing the heavy lifting; it's the harness feeding the context. Which this link shows small models are just as capable of.

Which means: given an appropriately informed senior programmer and a day or two, I posit this is nothing more spectacular than a for loop invoking a smaller, free, local LLM to find the same issues. It doesn't matter what you think about the complexity, because the "agentic" format can create a DAG that a small model can follow. All that context you're taking in makes one-shot inspections more probable, but much like how CPUs went from 0 to 5 GHz and then stalled, so too has the value of added context.
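To make the DAG claim concrete, a sketch: decompose the review into small steps whose edges say which results feed which, then run them in topological order so each step only ever sees its parents' output. Step names and `run_step` are invented for the sketch:

```python
# Sketch of an "agentic" DAG: each node is one small review step, each edge
# says "this step needs that step's output". A small model only ever sees
# its direct inputs, so per-step context stays bounded.
from graphlib import TopologicalSorter

dag = {
    "list_entry_points": set(),
    "trace_user_input": {"list_entry_points"},
    "check_sanitization": {"trace_user_input"},
    "report": {"check_sanitization"},
}

def run_step(name, inputs):
    # Stand-in for one small-model invocation on a bounded context.
    return f"{name}({', '.join(sorted(inputs)) or 'repo'})"

def run_dag(dag):
    results = {}
    for name in TopologicalSorter(dag).static_order():
        parent_outputs = {results[p] for p in dag[name]}
        results[name] = run_step(name, parent_outputs)
    return results
```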

Agent loops are going to do much the same with small models, limited mostly by the context poisoning that accumulates as the loop runs: every token you add raises the chance of false positives.

reply