by amunozo1 days ago |

by originalvichy23 hours ago|

For at least a year now, it has been clear that data quality and fine-tuning are the main sources of improvement for medium-level models. Size != quality for specialized, narrow use cases such as coding.

It’s not a surprise that models are leapfrogging each other when the engineers are able to incorporate better code examples and reasoning traces, which in turn bring higher quality outputs.

by cbg022 hours ago|

If all you're looking at is benchmarks that might be true, but those are way too easy to game. Try using this model alongside Opus for some work in Rust/C++ and it'll be night and day. You really can't compare a model that's got trillions of parameters to a 27B one.

by otabdeveloper421 hours ago|

> ...and it'll be night and day.

That's just, like, your opinion, man.

> You really can't compare a model that's got trillions of parameters to a 27B one.

Parameter count doesn't matter much when coding. You don't need in-depth general knowledge or multilingual support in a coding model.

by cbg021 hours ago|

I often do need in-depth general knowledge in my coding model, so that I don't have to explain domain-specific logic to it every time and so that it can have some sense of good UX.

by rubiquity22 hours ago|

You should try it out. I'm incredibly impressed with Qwen 3.5 27B for systems programming work. I use Opus and Sonnet at work and Qwen 3.x at home for fun, and I barely notice a difference, given that systems programming work currently needs careful guidance with any model. I don't try to one-shot landing pages or whatever.

by bityard21 hours ago|

Are you using the same agent/harness/whatever for both Claude and Qwen, or something different for each one?

by rubiquity21 hours ago|

I use Pi at home and Claude Code at work (no choice). I use bone stock Pi. No extensions.

by kgeist17 hours ago|

From what I understand, ~30b is enough "intelligence" to make coding/reasoning etc. work, in general. Above ~30b, it's less about intelligence, and more about memorization. Larger models fail less and one-shot more often because they can memorize more APIs (documentation, examples, etc). Also from my experience, if a task is ambiguous, Sonnet has a better "intuition" of what my intent is. Probably also because of memorization, it has "access" to more repositories in its compressed knowledge to infer my intent more accurately.

by Aurornis23 hours ago|

You should be skeptical. Benchmark racing is the current meta game in open weight LLMs.

Every release is accompanied by claims of being as good as Sonnet or Opus, but when I try them (even hosted full weights) they’re far from it.

Impressive for the size, though!

by jjcm23 hours ago|

Opus 4.5 mind you, but I’m not too surprised given how good 3.5 was and how good the qwopus fine tune was. The model was shown to benefit heavily from further RL.

by esafak23 hours ago|

Some of these benchmarks are supposedly easy to game. Which ones should we pay attention to?

by NitpickLawyer21 hours ago|

SWE-rebench should not be gameable. They collect new issues from live repos, so if you check 1-2 months after a model was released, you can get an idea. But even that would be "benchmaxxxable". That's an overloaded term that can mean many things, but the most vanilla interpretation is that with RL you can get a model to follow a certain task pretty well, at the cost of it getting "stuck" on that task type, or "stubborn" when asked similar but sufficiently different tasks. So for SWE-rebench that would be "it fixes bugs in these types of repos, under this harness, but ask it to do something else in a repo and you might not get the same results". In a nutshell.
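
The "only look at issues newer than the model" idea can be sketched in a few lines. This is an illustrative filter, not how the benchmark itself is implemented; the task records, the `created` field and the release date are all made-up assumptions.

```python
from datetime import date

def fresh_tasks(tasks, model_release):
    """Keep only tasks created strictly after the model's release date.

    `tasks` is a list of dicts with a hypothetical `created` date field.
    """
    return [t for t in tasks if t["created"] > model_release]

# Hypothetical task records and release date, purely for illustration.
tasks = [
    {"id": "repo-a#101", "created": date(2025, 1, 10)},
    {"id": "repo-b#202", "created": date(2025, 3, 5)},
    {"id": "repo-c#303", "created": date(2025, 4, 20)},
]
model_release = date(2025, 2, 1)

for task in fresh_tasks(tasks, model_release):
    print(task["id"])
```

Scoring a model only on the filtered set rules out the issues having been in its training data, but it does nothing about the "stuck on one task type" problem.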

by underlines23 hours ago|

well, your own, unleaked ones, representing your real workloads.

if you can't afford to do that, look at a lot of them, e.g. on artificialanalysis.com, where they merge multiple benchmarks across weighted categories and build an Intelligence Score, Coding Score and Agentic Score.
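
A rough sketch of what a weighted composite like that could look like. The category names, weights and scores are invented for illustration, and this is not the site's actual methodology.

```python
def composite_score(scores, weights):
    """Weighted mean of per-category scores (each on a 0-100 scale).

    Categories missing from `scores` are skipped and the remaining
    weights are renormalized, so partially evaluated models still get a number.
    """
    used = {name: w for name, w in weights.items() if name in scores}
    total = sum(used.values())
    if total == 0:
        raise ValueError("no overlapping categories with non-zero weight")
    return sum(scores[name] * w for name, w in used.items()) / total

# Hypothetical weights and scores, purely illustrative.
weights = {"coding": 0.5, "agentic": 0.3, "knowledge": 0.2}
model_a = {"coding": 60.0, "agentic": 40.0, "knowledge": 80.0}
model_b = {"coding": 70.0, "agentic": 50.0}  # no knowledge score reported

print(composite_score(model_a, weights))
print(composite_score(model_b, weights))
```

Note that renormalizing over missing categories can flatter a model that skipped its weak benchmarks, which is one more reason to look at the individual numbers too.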

by WarmWash23 hours ago|

ARC-AGI 2

GLM 5 scores 5% on the semi-private set, compared to SOTA models which hover around 80%.

by cbg021 hours ago|

None. Try them out with your own typical tasks to see the performance.
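
For anyone who wants to make that repeatable, here is a minimal sketch of a private eval loop. Everything in it is an assumption for illustration: `Task`, `pass_rate` and `stub_model` are hypothetical names, and in practice the model callable would wrap whatever API or local runtime you use.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    name: str
    prompt: str
    check: Callable[[str], bool]  # True if the model output is acceptable

def pass_rate(model: Callable[[str], str], tasks: list[Task]) -> float:
    """Fraction of tasks whose check accepts the model's output."""
    if not tasks:
        raise ValueError("need at least one task")
    passed = 0
    for task in tasks:
        try:
            ok = bool(task.check(model(task.prompt)))
        except Exception:
            ok = False  # a crashing model call or check counts as a failure
        passed += ok
    return passed / len(tasks)

def stub_model(prompt: str) -> str:
    """Stand-in for a real model call; always returns the same snippet."""
    return "def add(a, b):\n    return a + b"

tasks = [
    Task("python-add", "Write add(a, b) in Python.", lambda out: "return a + b" in out),
    Task("rust-main", "Write a Rust hello world.", lambda out: "fn main" in out),
]

print(pass_rate(stub_model, tasks))
```

Real checks would run the generated code against tests rather than match substrings, but the shape stays the same: your own prompts, your own pass criteria, same harness for every model.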

by wesammikhail23 hours ago|

You'd be surprised how good small models have gotten. Size of the model isn't all that matters.

by freedomben23 hours ago|

Plus you can control thinking time a lot more, so when Anthropic lobotomizes Opus on you...

by verdverm23 hours ago|

My experience with qwen-3.6:35B-A3B reinforces this, gonna give this a spin when unsloth has quants available

Gemini flash was just as good as pro for most tasks with good prompts, tools, and context. Gemma 4 was nearly as good as flash and Qwen 3.6 appears to be even better.

by cassianoleal23 hours ago|

> when unsloth has quants available

https://huggingface.co/unsloth/Qwen3.6-27B-GGUF

by verdverm23 hours ago|

That was quick (compared to the 1T Kimi-2.6, not surprising)

by danielhanchen22 hours ago|

Haha :) We had some issues with Kimi-2.6 since it was int4 and we were investigating how to handle it :)

by verdverm19 hours ago|

Appreciate what y'all do! We were slacking about how many HGX-B300 it would take to run Kimi and it looks like we could actually fit 2-3 Kimis on a single HGX.

by dudefeliciano23 hours ago|

> Size of the model isn't all that matters.

What matters is the motion in the tokens

by cmrdporcupine22 hours ago|

A small model can be made to be "comparable to Opus" in some narrow domains, and that's what they've done here.

But when actually employed to write code they will fall over when they leave that specific domain.

Basically they might have skill but lack wisdom. Certainly at this size they will lack anywhere close to the same contextual knowledge.

Still these things could be useful in the context of more specialized tooling, or in a harness that heavily prompts in the right direction, or as a subagent for a "wiser" larger model that directs all the planning and reviews results.
