undefined

points

[-]

Have you tried the Gemma 4 series, out of curiosity? I haven’t run a local model in a while, but the benchmarks look good. I’d take a free local tool-use model if it was relatively consistent.

by v3ss0n52 minutes ago|

parent|

[-]

Qwen 3.6 burns it to the ground. it was not even a challenge. Gemma4 seriously fails at toolcalls and agentic works. It got all messed up after 2-3 turns of Vibecoding.

by 59nadir9 minutes ago|

parent|

[-]

Counter-point: I built an agent that can only interface with Kakoune, a much less common and more challenging situation for an LLM to find itself in, and Gemma4-A4B 8bit quantized does remarkably better in actually figuring out how to get text in buffers than Qwen3.6-35B-A3B in a similar class as Gemma4 A4B.

Now, is this the usual use case? No, it's a benchmark I created specifically in order to put LLMs in situations where they can't just blast out their bash commands without having to interface with something else and adapt.

by xrd26 minutes ago|

parent|

prev|

[-]

How do you run it? vllm? llama.cpp?

Can you share some parameters you enable tool calling and agentic usage?

Or, higher level, some philosophies on what approaches you are using for tuning to get better tool calling and/or agentic usage?

I'm having surprisingly good success with unsloth/Qwen3.6-27B-GGUF:Q4_K_M (love unsloth guys) on my RTX3090/24GB using opencode as the orchestrator.

It concocts some misleading paths, but the code often compiles, and I consider that a victory.

You have to watch it like you would watch a 14 year old boy who says he is doing his homework but you hear the sound effects of explosions.

by lambda25 minutes ago|

parent|

prev|

[-]

Gemma 4 31b was working ok for me; but it was consuming tons of memory on SWA checkpoints, I had to turn them way down, and as a 31b dense model is fairly slow on a Strix Halo. I did have a lot of tool calling issues on 26b-a4b, though.

The Qwen models are quite solid though.

by 2ndorderthought41 minutes ago|

parent|

prev|

[-]

Gemma4 is definitely not used for vibe/agentic coding. Not even worth trying. But its a different weight class.

by 2ndorderthought1 hours ago|

parent|

prev|

[-]

I tried the Gemma 4 I think 2 and 4b. The 2b was not useful for me at all. A little too weak for my use cases

The 4b was okay. It didn't get all of my small math questions right, it didn't know about some of the libraries I use, but it was able to do some basic auto complete type stuff. For microscopic models I like the llama 3.2 3b more right now for what I do, it's a little faster and seems a little stronger for what I do. But everyone is different and I don't think I'll use it anymore this past month has been crazy for local model releases.

by throwaw1238 minutes ago|

parent|

[-]

can you share your use cases for 2b and 4b models?

curious how people are leveraging these models

by 2ndorderthought20 minutes ago|

parent|

[-]

For me, I use them for quick auto complete or small questions. I am not a vibe/agentic coder. I know I am a relic and a Luddite because of this.

Instead of hitting stack overflow and Google I will ask questions like "can you give me an example of how to do x in library y?" Or "this error is appearing what might be happening if I checked a b and c". Or "please write unit tests for this function". Or code auto complete.

I am not looking for the world's best answer from a 3b model. I am looking for a super fast answer that reminds me of things I already know or maybe just maybe gives me a fast idea to stub something while I focus on something more important, I am going to refactor anyways. Think a low quality rubber duck

I mostly use 7-9b models for this now but llama 3.2 3b is pretty decent for not hogging resources while say I have other compute heavy operations happening on a weak computer.

Probably half the questions people ask chatgpt could get roughly the same quality of answer with a small model in my opinion. You can't fully trust an LLM anyways so the difference between 60% and 70% accuracy isn't as much are marketing makes it sound like. That said the quality of a good 7-9b model is worth it compared to a 3b if your machine can run it. Furthermore the quality of qwen 36 is crazy and makes me wonder if I will ever need an AI provider again if the trend continues.

by cyanydeez39 minutes ago|

prev|

[-]

Qwen3-Coder-Next seems to be perfect sized for coding. I tried the new and just found the verbosity not really useful for coding. But probably for more analytical tasks or writing docs.

by steveharing12 hours ago|

prev|

[-]

Yea, No doubt Qwen 3.6 open weights are far more strong

by rnadomvirlabe2 hours ago|

parent|

[-]

Why no doubt?

by captainbland1 hours ago|

parent|

[-]

No comparison with competitor models other than the previous granite version strongly implies that it does not compete well with other comparable models. At least this is the most reasonable assumption until data comes out to the contrary

by 2ndorderthought1 hours ago|

parent|

prev|

[-]

Qwen 36 is effectively a pocket sized frontier model. It's really surprising for me anyway

by steveharing11 hours ago|

parent|

prev|

[-]

Because Qwen 3.6 pushes way above its weight. Granite 8B is impressive, but Qwen still wins on raw capability, especially for coding.

by rnadomvirlabe1 hours ago|

parent|

[-]

You just asserted the same thing again. Why do you say this is the case?

by 2ndorderthought1 hours ago|

parent|

[-]

Qwen scores above sonnet in coding benchmarks. Runs locally. In personal use it's really good. Anecdotally others have used it to vibe code or agentic code successfully. Not toy problems. Not a toy model.

Qwen3.6 raises the bar for models of its size. There really isn't a comparison in my opinion.

by noodletheworld1 hours ago|

parent|

prev|

[-]

Having tried it.

Qwen is really good.

Also, generally, it makes sense. 8B models are generally not very good^.

That this 8B model is decent is impressive, but that it could perform on par with a good model 4 times as large is a daydream.

^ - To be polite. The small models + tool use for coding agents are almost universally ass. Proof: my personal experience. Ive tried many of them.

by irishcoffee1 hours ago|

parent|

[-]

So it’s just like, your opinion, man?

edit: It was a play on The Big Lebowski, folks.

by Terretta49 minutes ago|

parent|

[-]

College SAT scores do not tell you how the dev applying for your open back end systems engineering job is going to do once they're in your workplace harness.

Nor do class standings, nor hackerrank and the like.

What will tell you is asking them to fix a thing in your codebase. Once you ask an LLM to do that, a dozen times, I'd argue it's no longer "just your opinion man", it's a context-engineered performance x applicability assessment.

And it is very predictive.

But it's also why someone doing well at job A isn't necessarily going to be great at B, or bad at A doesn't mean will necessarily be bad at B.

I've often felt we should normalize a sort of mutual try-buy period where job-change seeker and company can spend a series of days without harming one's existing employment, to derisk the mutual learning. ESPECIALLY to derisk the career change for the applicant who only gets one timeline to manage, opposed to company that considers the applicant fungible.

But back to the LLM, yeah, the only valid opinion on whether it works for you is not benchmark, it's an informed opinion from 'using it in anger'.

by robotmaxtron35 minutes ago|

parent|

prev|

[-]

the (dead) internet is full of opinions exactly like this

by brazukadev30 minutes ago|

parent|

[-]

you tried qwen3.6 and you think it is not good?

by robotmaxtron20 minutes ago|

parent|