Very different from my experience. Gemma 4 31B just solved a physics problem Opus 4.7 gave up on. I definitely don't think they're equivalent in general; Opus is for sure way smarter and way more likely to get things right on the edge, but it still gets things wrong often enough that it isn't that useful for a lot of stuff. Conversely, there are so many things you would use an LLM for that both will reliably oneshot. Especially in agentic mode, where you have ground-truth feedback between turns, the difference gets quite small for a lot of tasks.

That all being said, I've spent hundreds (maybe thousands?) of hours on this stuff over the past few years, so I don't see a lot of the rough edges. I really believe the capability is there: Gemma 4 31B is a useful agent for all sorts of stuff, and anything you can reasonably expect an LLM to oneshot, Qwen 3.6 35B MoE will handle at like 90 tok/sec, absolutely fantastic for tasks that don't require a huge amount of precision.

reply
Sure. Sample size = 1.
reply
If it works for me it works for me. Sample size of 1 is all I need to tell that.
reply
Lots of interesting things happen in anecdotes.
reply
The models OP is using are from a year ago. The big breakthroughs happened this past April.
reply
It may surprise you but over thousands of hours I have actually gathered more than one sample.

EDIT: Here's another sample for ya. I went to the store to buy mixers, and while I was out Gemma 4 31B got pretty far along with reverse engineering the Bluetooth protocol of a desk thermometer I have. I forgot to turn on the web search tool, so it just went at it, writing more and more specific diagnostic logging/probing tools over the course of like 8 turns. It connected to the thermometer, scanned the characteristics, and made a dump of the Bluetooth notification data. When I got back it was theorizing about how the data might be encoded in the characteristics, and it had gotten into an infinite loop. (Local models aren't perfect and I never said they were.) I turned on the web search tool and told it to "pick up the project where it left off"; it read the directory, did a couple googles, and had a working script to print temperature, humidity, and battery state in like 3 turns. Reading back through its chain of thought, I'm pretty sure it would have gotten there eventually without googling.
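
For anyone curious, the probing it did is basically the standard bleak pattern in Python: enumerate characteristics, subscribe to notifications, dump the raw bytes. Rough sketch below; the MAC address and characteristic UUID are made-up placeholders, since the real ones are device-specific:

    # Sketch of a bleak notification dump; ADDRESS and NOTIFY_UUID are placeholders.
    import asyncio
    from bleak import BleakClient

    ADDRESS = "A4:C1:38:00:00:00"  # thermometer's BLE MAC (hypothetical)
    NOTIFY_UUID = "0000fff1-0000-1000-8000-00805f9b34fb"  # notify characteristic (hypothetical)

    def on_notify(char, data: bytearray):
        # Raw payload; temperature/humidity/battery get decoded from these bytes
        print(char, data.hex())

    async def main():
        async with BleakClient(ADDRESS) as client:
            # Enumerate services/characteristics, like the model's probe tools did
            for service in client.services:
                for char in service.characteristics:
                    print(service.uuid, char.uuid, char.properties)
            await client.start_notify(NOTIFY_UUID, on_notify)
            await asyncio.sleep(30)  # collect notifications for a while
            await client.stop_notify(NOTIFY_UUID)

    asyncio.run(main())

The actual work is figuring out which characteristic to watch and how the bytes are packed, which is what it spent those 8 turns on.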

idk, I thought I was a cool and smart engineer type for being able to do stuff like this; if my GPUs being able to do this more or less unsupervised isn't impressive, I guess fuck me lol.

reply
What is your opinion on Qwen 35B MoE vs Qwen 27B dense?
reply
This.

I have seen way too many people who are overly optimistic about local LLMs.

Having spent a decent amount of time playing with them on consumer Nvidia GPUs, I understand well that they're not going to be widely usable any time soon. Unfortunately, not many people share that view.

reply
So the cofounder of Hugging Face made a post about Qwen 3.6 being at Claude level of performance for the lols?

When were you trying local models? The model releases from April 2026 are a serious change in performance.

reply
Not this. Let's reframe the problem: how many years behind do you think they are? By all accounts Gemma 4 is better than a frontier model from 3 years ago. Back then we were wowed by frontier models, but when a local model reaches the same performance it's suddenly no good, because you moved the target?

Relatively speaking, local models might always be behind the curve compared to frontier ones; you can tell by the hardware needed to run each. But in absolute terms they're already past the performance threshold everyone praised in the past.

Right now, in a lab somewhere, there's a model that's probably better than anything else: a ChatGPT 5.6, an Opus 4.8. Knowing that, do you suddenly feel a wave of disappointment at the current frontier models?

reply
You are missing context.

A local model is as good as a frontier model at responding in a single thread with you that requires basic tool calling.

A local model is as good as a frontier model at writing a joke.

A local model is as good as a frontier model at responding to an email.

Not sure how often this needs to be said: nobody with a clue plays around with a local model setup while completely ignoring frontier models and their capabilities?!

reply
At least in my experience, local models are very far away from models like Opus 4.7 or ChatGPT 5.5 in coding and problem solving.

I find them useful for basic research, learning, and question-asking tasks. Although at the same time, a Wikipedia page or a few Google searches could likely accomplish the same, and have been able to for decades.

reply
I think you're doing it wrong. Use the frontier models for the research, planning, etc., and once you have a plan, hand it to a local model for implementation.
reply
The guy is running potato models!
reply
I'm like 50% convinced that these people are paid by Apple to promote their products, because the conversation is always just about being able to execute models (even larger ones) on Mac hardware with unified memory, but nobody ever mentions that inference speed is unusably slow.

You can get good local LLM performance through agents, but you need fast inference. Generally that means 2x 3090s, or at minimum 2x 3080s (you need two to speed up prefill processing when building the KV cache). You ironically also need to be good at prompt engineering, which has a real-world analogue in being able to manage low-skilled people to complete tasks.
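
For reference, something like this llama.cpp invocation is what I mean by splitting a model across two cards (the model filename, split ratio, and context size are just illustrative):

    llama-server -m qwen-35b-moe.gguf -ngl 99 --tensor-split 1,1 -c 16384

-ngl 99 offloads all layers to GPU, and --tensor-split 1,1 spreads the model evenly across both cards so neither runs out of VRAM.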

reply