undefined

points

[-]

You really need to take the benchmarks with a massive pinch of salt. I’ve been testing local LLMs since the original llama and there’s nothing I’ve tried that is in the same category as Opus.

by lambda3 hours ago|

parent|

[-]

Which Opus? They certainly outperform Claude 3 Opus.

Anyhow, feel free to try them out head to head on OpenRouter. I'd love to see someone write up their results, of a modern local sized open source model vs. frontier models from ~a year ago, on something other than the standard benchmarks.

by mapontosevenths2 hours ago|

parent|

[-]

There's a guy on Youtube named Bijan Bowen who tests all the models (open and frontier) on a series of one/few shot programming exercises and has been for a long while now. You can pretty much watch him compare the results for any two models you're likely to be interested in.

I'm not affiliated, I just like his style and have found it handy. I know it's not very rigorous, but it's good enough for me and I've found his examples to pretty closely match the results I see in real life.

by lambda1 hours ago|

parent|

[-]

OK, it looks like he did a browser OS test with both Claude 4 Opus and Qwen 3.6 35B-A3B.

Claude 4 Opus: https://youtu.be/J7omabtqnBM?t=193

Qwen 3.6 35B A3B: https://youtu.be/gVU-DQeqkI0?t=215

Qwen 3.6 produced far more working functionality than Claude 4 Opus did.

Obviously, just one test of a single one-shot prompt of a silly toy OS, but yeah, this particular test shows Qwen 3.6 running locally dramatically outperforming Claude 4 Opus, which was a frontier model a year ago.

by MrScruff2 hours ago|

parent|

prev|

[-]

I’m normally comparing frontier open/cheap models against frontier closed source. I use deepseek/glm regularly, they’re fine and you can get real work done with them but it’s super obvious when you switch back to opus or even sonnet. A 3B active param MoE model is not comparable.

by lambda1 hours ago|

parent|

[-]

Yeah. I was pointing out that local 3b active models outperform frontier models from a year ago.

Will this trend continue? Who knows. Both the frontier and local model will probably continue to get better. Which one will hit the top of the S-curve first? Hard to say, really. But what you can do right now locally is better than what you could do a year ago on the frontier, and lots of people were already using it pretty heavily a year ago.

Hoever, November is when most folks agree that the frontier models got good enough for much of their work. Local models aren't quite there yet (where by "local" I mean "can run at reasonable speed and quant on a system less that $10,000 with today's RAM and GPU prices"). The biggest open weights models are getting there, but those require something like an 8x H100 server to reasonably run.

It's likely that there will always be a gap between frontier and local if you're comparing models at the same time, you can just do a lot more with terabytes of HBM than gigabytes of DDR. But will local models get good enough to be usable for useful work? For many folks, they already are.