by senko1 days ago |

by 0xbadcafebee21 hours ago|

It was almost certainly not trained for coding, as it's got both audio and vision input, is only 12B, and nowhere in the announcement is coding mentioned. It will likely not have good performance on coding in general, compared to other small models like Qwen 3.6 35B A3B, Gemma 4 26B A4B, Nvidia Nemotron 3 Nano 30B-A3B, gpt-oss-20b.

For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

Gemma 4 31B is the top dog at small model coding, but is dense so it needs ~48GB unified RAM for full context. If you want decent coding on a laptop you need a lot of RAM. But this shouldn't be surprising, dev machines have always needed lots of resources.

by dirkg13 hours ago|

> For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

You can run Qwen 3.6 35B A3B on a 12-16GB VRAM GPU and it works pretty well.

https://www.youtube.com/watch?v=8F_5pdcD3HY&t=1s

even the 27B in some quants can fit.

https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27b...

Qwen IMO is far better for coding, esp. agentic coding. Combined with something like Pi, it probably comes close enough to Sonnet for a lot of use cases.

Gemma family is better for almost all other tasks you'd use a local llm for.

by selicos2 hours ago|

I want to try a hybrid setup of Gemma 4 E4B with lots of context for general, then Qwen 3.5 9B or larger for coding. Strix Halo set up this weekend, which may enable even larger Qwen models with tons of context.

by ricardobayes5 hours ago|

You can run it; however, those low-bit quantized models (IQ2, IQ4, Q2) will very likely underperform the 9B versions at Q6/Q8.

by dofm4 hours ago|

The larger Gemma models are quite good at PHP. I would not be surprised if that was a training objective — it's one of the more consumer-focussed programming languages. They have very good knowledge of wordpress hooks.

by dotancohen19 hours ago|

  > For 16GB laptops, Qwen 3.5 9B is the undisputed champ.

You seem like the guy to ask. For a laptop with 12GB VRAM (RTX 5070) and 32 GB system RAM, what is a good multilingual (English, Hebrew, Greek) model for conversing with personal notes in Org mode format? I don't care how long updating the model or rag takes, and even inference can be reasonably slow, but the results of the query as they relate to my personal notes are important. I don't care about general knowledge, for those questions I can use e.g. ChatGPT.

Thanks

by akmarinov13 hours ago|

Join us over on Reddit at r/LocalLLaMA to get 10 different opinions on that

by dotancohen9 hours ago|

I read there regularly. I find little value there between the memes. I was hoping to ask a knowledgeable person here.

by alfiedotwtf7 hours ago|

/r/localllama for a while now seems to prefer Gemma 4 E4B for creative writing (especially the uncensored GGUFs).

by plagiarist33 minutes ago|

Do they prefer E4B over the larger models or is it a matter of what fits their machine? I assume 4B isn't large enough to get interesting writing but I don't know anything about it.

by nl7 hours ago|

Qwen 3.5 35B A3

Qwen models are always good. The 35B A3 model is a MoE model which means it has higher performance in RAM constrained environments compared to the 27B dense model (which is better at coding).

I don't have the experience to rate its Hebrew or Greek performance, but apparently it's not bad.

by sourcecodeplz16 hours ago|

Any Gemma 4 model. They are great at translation and multilingual tasks.

by silversmith13 hours ago|

For the biggest languages, Spanish, French, maybe.

For smaller ones like my native Latvian, the output could be confused for good translation from across the room, the words do look like Latvian words. But the quality is Google translate circa 20 years ago, tops.

It could probably do a decent enough translation to English, if all you need is to get the gist of text. But for smaller European language outputs, nothing comes close to Gemini.

by dotancohen10 hours ago|

While Gemini 4 seems fine, Gemma 4 does not do Hebrew well. I've replaced it with Aya Expanse and am getting much better results, but there is still much improvement to be had.

I'm not doing translations, rather querying Hebrew text with a Hebrew prompt.

by emmelaich15 hours ago|

You may like https://www.llmfit.org/

(Not a recommendation; I've not used it... yet.)

by hypfer10 hours ago|

Just tried it and honestly it's a terrible experience lacking any sort of intent or reason.

Which is unsurprising in the AI space.

You get a wall of text showing you various random fine-tuned models by random people, and that is basically it.

Actual sane default requirements like "just give me the normal AI labs", "please filter for dense only" and "I want this exact context size at this quant" are not part of the tool, apparently. Neither is "compare these quants for me for the same model".

Or maybe it's just hidden well enough that I did not find them before I stopped caring.

Conway's law is at it again.

____

Edit:

I have since then had qwen3.6 ponder the codebase and think about my complaints.

Seems to require a major data model overhaul to actually fix those, so they're legit. Which I didn't doubt, but nice to have some extra fabricated confirmation after it initially refused and said "nooooo the readme says otherwise nooo hypfer is just a hater noo"

___

Edit 2:

It gets worse the longer I stare at it. This could've been a web calculator.

by hypfer5 hours ago|

Done:

https://github.com/Hypfer/will-it-fit-llama-cpp

https://hypfer.github.io/will-it-fit-llama-cpp/

by hparadiz8 hours ago|

We need benchmarks by engine, cli switch sets, and device with filters by cpu, gpu, and type. And if someone could please aggregate that in a way where people can upload results and just automatically see the best of any model for their device that would be a killer app.

by alfiedotwtf7 hours ago|

I've wanted to vibe code a tuning app that pumps data through your CPU-GPU-RAM to try and determine the best parameters for each model, but I think it's just too much work compared to running a one-liner by hand and changing things here and there.

by dofm4 hours ago|

I have found these things to be fully exasperating, to be honest, even though I am seeking information about a pretty "known" machine — a 64GB M1 Max MBP.

(Honestly I think Apple's "AI push" could do worse than just focus on a curated model library, a couple of Apple-standard Gemini distillations, an OS-level model manager and some sort of tweak of their containers system to do what Docker's sbx does. They could demystify a lot of this shit.)

by tacomagick13 hours ago|

Gemma 4 26B A4B

by kajecounterhack21 hours ago|

Have you found Gemma 4 31B better than Qwen 3.6 27B Q8? I just started using Qwen + Pi agent and it's great, but "which model works best" is still totally crowdsourced and I was going off of peoples' opinions on reddit. Would love to hear more opinions if people have them.

by embedding-shape20 hours ago|

> Have you found Gemma 4 31B better than Qwen 3.6 27B Q8?

Which quant of Gemma? For coding, Qwen seems to be pretty far ahead. Gemma generally seems to have a "vaster" set of knowledge, but armed with a search tool that doesn't really matter, and Qwen 3.6 has been really great for all sorts of tool calling. I mostly do programming and related things though, fwiw.

> I was going off of peoples' opinions on reddit

It's extremely astroturfed all over the place, especially the larger subreddits, and especially the one related to a specific animal in a specific location. It's sad, as early on it was a great resource, but now it's mostly paid posts and a race to the bottom, with lots of piling, and all the knowledgeable people I used to recognize are nowhere to be found.

by xenophonf19 hours ago|

It took me way too long to realize you were referring to r/localllama.

by MoonWalk19 hours ago|

Why the obfuscation in the first place?

by embedding-shape8 hours ago|

Just a bit of flair. Also, a bunch of people have "keyword watchers" set up for various terms, so when you mention certain things on HN, reddit and elsewhere, you get commenters who enter the conversation not because of the context or the larger conversation, but because the single term/thing they care deeply about was mentioned, and it just gets very boring to read the whole attackers/defenders comments over and over again. But ultimately I just did it like that because it was more fun to write it like that.

by zozbot23418 hours ago|

I'm not sure that GP is correct, many people in that forum tend to hate Qwen for closing up many of their more recent models and leaving the whole local inference community 'stranded' on their older releases.

by julianlam14 hours ago|

Are you sure? Prior to today the sub seems to be pretty partial to Qwen.

by kajecounterhack14 hours ago|

That was definitely not the subreddit where I got my info.

by thangalin20 hours ago|

Yes. I'm using Gemma-4 31B (gemma-4-31B-it-assistant.Q4_K_M.gguf) with llama.cpp to attribute quotations throughout chapters of my sci-fi novel. I started with Qwen3, but couldn't get it to work. Qwen3 TTS Voice Design, on the other hand, is incredible (Qwen3-TTS-12Hz-1.7B-VoiceDesign). I'm using both for an audiobook generator that produces a variety of voices.

Screens:

* https://i.ibb.co/TBBV5nJk/kl-01.png (voice design)

* https://i.ibb.co/nNvvKDyV/kl-02.png (quotation attributions)

by khimaros1 hours ago|

building something similar: https://github.com/khimaros/autiobook

by qingcharles12 hours ago|

Gemma 4 31B is enormously impressive. You get 1000 requests/day for free on Google's API and another 1000/day off OpenRouter. Only problem is you get 503 like crazy.

by jmpeax16 hours ago|

> nowhere in the announcement is coding mentioned

It's right there in the middle benchmark bar "LiveCode Bench" 72%.

by ricardobayes5 hours ago|

Qwen 3.5 9B is great for coding, but somehow, based on a few hours of subjective tests, the Gemma 4 12B seems even better.

by dofm4 hours ago|

It does appear to have training for javascript and PHP, from what I can see, and pretty solid knowledge of wordpress and woocommerce. I would guess it has beginner-friendly knowledge of Python, too?

(Though it is gaslighting me about PHP anonymous functions.)

I would not use it to write code (the MoE 26B writes really good PHP), but it appears to have absolutely good enough knowledge to write implementation plans, and I think that could be useful in a sort of agentic coding tutorial environment.

I test these models with simple things. My favourite mini test is asking an AI to write a "last login" tracker facility for WordPress with a sortable admin column, which is trivial code (only a few lines) but touches on a reasonably deep bit of the WP API. If you ask it to prompt you with clarifying questions, those questions are quite revealing.

It can write the code. Not tested it but I am sure it works. It's not as elegant.

It is not as good at understanding nuanced instructions as either the 26B or the sparse Qwen 3.6. There are concise things you can say in a prompt to Qwen 3.6 that have it draw logical conclusions that fully impress me.

I am more impressed by it than I expected. I reckon this would be quite useful in a tutorial tool.

(I say this as a sort of qualified cynic; I think much of the AI circus is a farce. But if these things are to ever be useful for teaching without making people dependent on some cloud "intelligence tap", this is progress)

by senko21 hours ago|

Yeah, I agree 24B-36B sizes are better in general.

I don't have unified RAM tho and offloading to CPU is dog slow, which is why I'm interested in 7b-12b models.

by iso163119 hours ago|

I find RAM crazy. My ThinkPad has 32GB of RAM, and it's a T470 that's nearly a decade old.

Why do people with modern laptops have so little RAM?

by willy_k18 hours ago|

The ram that’s important for LLMs is gpu-accessible memory, meaning either systems with unified ram or VRAM, the latter of which is tied to the caliber of GPU one has.

by SturgeonsLaw9 hours ago|

Unified memory is soldered to the motherboard and needs to be ordered with the new laptop, for prices that are well above what the equivalent amount of SODIMM would cost.

Fine if work's paying, but for personal devices (that might have been purchased before local models got good), people have what they have.

by doubled11218 hours ago|

My job still issues 16GB laptops as standard. You need a business reason to get more. This has been going on since before the price hikes.

I’m a system administrator and I can do my job with no issues at 16GB. Most days 8GB would likely be enough, since I’m just using and abusing other systems anyway.

Java devs at my last job were still running 16GB in 2020. Admittedly that was a while ago. Still not a decade.

Close some Chrome tabs?

by alfiedotwtf7 hours ago|

8GB was the standard for a long time (before Apple went Silicon) because, from what I understood, SDRAM needs to constantly refresh its memory cells, otherwise the bits will fade, and so by having more RAM, your battery would last a little less... this was around the time when a 3-hour charge was unheard of, so every little bit helped.

Probably doesn't matter these days with all-day batteries, but now the demand-supply curve is lopsided.

by zigzag31222 hours ago|

> It roughly compares with GPT-4.1 (!!), released 14 months ago

I think the major win for coding was reasoning. That's why such a small model can match GPT-4.1 in coding, but I suspect that GPT-4.1 still wins in general world knowledge due to its bigger size.

by mdp202121 hours ago|

> I suspect ... still wins in general world knowledge due to bigger size

Encyclopedic knowledge matters relatively little in perspective, given the expectable future developments: even the more knowledgeable of us will use that knowledge for reasoning and intuition (and we will have absorbed the intellectual keys during our training), but under our professional hat we should in theory be ready to go "I stand corrected" and "more precisely" with the actual data at hand.

I.e.: for the encyclopedic knowledge needed, the /understander/ will have a RAG subsystem and a corpus of knowledge to inquire upon processing queries.

(Corroboration: we can't delirate, and neither can the machine...)

by bitexploder20 hours ago|

Don't LLMs work on attention though? The closer in their hyperdimensional space you can land your problem to their inherent understanding, the better they are at understanding your problem domain. RAG loops can be very slow and agents may simply lack the knowledge to use them correctly.

by pu_pe7 hours ago|

I agree with you in general, but depending on the task I also find that a certain level of encyclopedic knowledge can be very valuable. For example, if you use it for coding, the model will likely not resort to search or RAGs when deciding whether to use a particular package or stack.

by coldcity_again21 hours ago|

A great position to take. Strong opinions, weakly held.

by UncleOxidant3 hours ago|

I've heard the assertion that the Gemma 4 models don't do well with lower quantization. I wonder if the "bizarre/trivial" syntax errors would go away at Q8?

by superkuh21 hours ago|

>consumer-grade card with 12G of VRAM and got 5t/s

That token output speed suggests to me that it is running in hybrid mode and involving CPU + system RAM somehow. ~5 tk/s is about what DDR4 RAM bandwidth gives you for a model of that size at 4-bit. Any consumer GPU with 12 GB, like an Nvidia RTX 2080 or RTX 3060, should be doing 20+ tk/s with llama.cpp and the CUDA backend.

by senko21 hours ago|

Good catch. I haven't looked deeply into it. This is with Vulkan backend on Linux which I understand should be roughly comparable to CUDA? Gfx is rtx 3060(ti?).

I should play a bit more with llama.cpp options and see what happened there. Thanks!

by superkuh18 hours ago|

I've had it happen in the past with llama.cpp on linux that the CPU will present itself as a vulkan device GPU1 with "PHYSICAL_DEVICE_TYPE_CPU" and had a mix-up. Might want to try llama-server --list-devices and then append --device Vulkan0 or whatever.

by frikk23 hours ago|

Thank you for sharing this. Do you think the syntactical issues could be addressed with fine tuning or some other kind of parameter tweaking? That's frustrating hah.

by profunctor23 hours ago|

With a harness you could feed the code to a linter and if there are errors feed that to a model automatically. It’s amazing that the models are good enough that I haven’t bothered doing this
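A minimal sketch of such a harness in Python, for illustration only. It uses the standard library's `ast.parse` as a stand-in for a real linter, and `ask_model` is a hypothetical callable (prompt in, code out) that you would wire up to whatever local model you run; neither the names nor the prompt wording come from any particular tool.

```python
import ast
from typing import Callable, Optional


def syntax_error(source: str) -> Optional[str]:
    """Return a short error message, or None if the source parses cleanly."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return f"line {exc.lineno}: {exc.msg}"
    return None


def repair_loop(source: str, ask_model: Callable[[str], str], max_rounds: int = 3) -> str:
    """Feed syntax errors back to the model until the code parses or rounds run out."""
    for _ in range(max_rounds):
        error = syntax_error(source)
        if error is None:
            return source
        # ask_model is a hypothetical hook; the prompt text is only an example.
        prompt = (
            "Fix this error and return only the corrected code.\n"
            f"Error: {error}\n\n{source}"
        )
        source = ask_model(prompt)
    # The last model answer has not been checked yet, so check it here.
    if syntax_error(source) is not None:
        raise RuntimeError("model did not produce parseable code")
    return source


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call, used only to exercise the loop."""
    return "x = (1 + 2)\n"


fixed = repair_loop("x = (1 + 2\n", fake_model)
```

A real setup would swap `ast.parse` for the project's actual linter or compiler and cap the number of rounds, since a small model can get stuck repeating the same mistake.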

by pseudosavant20 hours ago|

Models this small and this capable bode really well for the usefulness of a PC like the RTX Spark that Nvidia/Microsoft announced this week. 128GB of unified memory will likely be more than sufficient for effective local agentic coding, even if SOTA cloud models will still be even better.

Up until this point, I've found the cost/value to unequivocally favor using a cloud subscription, but I would be lying if I didn't worry that one day OpenAI is going to increase the price for my subscription by 5-10x. I rely on these tools enough that if there is a real viable local option, I'm going to take it.

by pseudollm19 hours ago|

> usefulness of the RTX Spark

Not really. There's a reason the announcement didn't include ANY benchmark (!) and didn't mention EXACTLY what the memory bandwidth is. It's going to be dog-slow and unusable for large models, as tok/sec is basically bandwidth divided by active weights. Rumoured 300GB/s / 30GB active weights (decent model) = 10 tokens per second, which is really slow.
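The rule of thumb above can be written down as a short Python sketch. The function names are made up for illustration, and the result is only a rough upper bound: it assumes every generated token reads all active weights once and ignores KV cache reads, compute limits and other overhead.

```python
def active_weights_gb(active_params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights touched per token, in GB."""
    return active_params_billions * bits_per_weight / 8


def estimate_decode_tps(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-bound decode estimate: tokens/s = bandwidth / active weights."""
    if weights_gb <= 0:
        raise ValueError("weights_gb must be positive")
    return bandwidth_gb_s / weights_gb


# The figures from the comment: rumoured 300 GB/s and 30 GB of active weights.
dense_tps = estimate_decode_tps(300, 30)

# A MoE with 3B active parameters at 8 bits touches far less memory per token.
moe_tps = estimate_decode_tps(300, active_weights_gb(3, 8))

print(dense_tps, moe_tps)
```

This is also why the MoE models in this thread feel so much faster than dense ones of similar total size: only the active parameters enter the division.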

by SwellJoe18 hours ago|

Yep, I have a Strix Halo and while it can run models bigger than Qwen 3.6 27b, it's not usable interactively when you do. ds4 patched for ROCm works, but at such a slow speed, it's not usable for coding agents.

The Nvidia boxes have only slightly more memory bandwidth, so I wouldn't expect them to be notably faster. At least not enough to make it useful interactively at that scale.

by zozbot23418 hours ago|

Why does everyone expect interactivity from local AI? It's not the best use of the hardware, especially not miniPC hardware. Long-term batched inference with larger and more capable models is much more feasible AIUI.

by int_19h15 hours ago|

I can't speak for others but IMO the only reason to run models locally right now is privacy - i.e. you don't trust any of the cloud providers to not look at your prompts. Price-wise the market is extremely competitive and cheap model serving favors large scale so anything that can be run locally can be run cheaper in the cloud. But if privacy is important, then it's important for everything, including traditional chatbot applications, which kinda do require interactivity.

by SwellJoe18 hours ago|

Even batched it's uncomfortably slow. I started to benchmark ds4 with my security vulnerability benchmark (after Qwen 3.6 dense and MoE and a bunch of cloud models), but it was going to tie up the Strix Halo for more than a day, so I decided not to run it as it would prevent me from doing other stuff with it during that time.

Even batched usage needs to be fast enough to deliver results in a reasonable time. Overnight runs are useful, 24 hour runs are...less so.

Anyway, most of the time people are talking about interactive use, and there's currently an upper bound on how large a model can be for local hosting on a reasonable budget (i.e. not a crazy amount more expensive than what a high end developer desktop or laptop costs). The sweet spot is probably currently the big Qwen 3.6 or Gemma 4 models, which are in the ~60GB range for 8-bit quantization plus a large context.

by hedgehog17 hours ago|

The 6-bit versions + 8-bit KV cache seems to save a good bit of memory without a significant loss of quality. The Qwen 35B is pretty fast in my testing, but MiniMax M2.7 230B is in some ways faster (way fewer tokens to arrive at an answer) even though it is much larger.
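To get a feel for how much the weight and KV cache bit widths matter, here is a back-of-the-envelope Python sketch. The architecture numbers (64 layers, 8 KV heads, head dimension 128, 131072-token context) are hypothetical placeholders, not the real configuration of any model mentioned here, and real quant formats add some per-block overhead that this ignores.

```python
def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB for a given parameter count and bit width."""
    return params_billions * bits_per_weight / 8


def kv_cache_gb(layers: int, kv_heads: int, head_dim: int,
                context_tokens: int, bits_per_value: float) -> float:
    """Approximate KV cache memory in GB at full context."""
    # Factor 2: one key tensor and one value tensor per layer.
    total_bytes = 2 * layers * kv_heads * head_dim * context_tokens * bits_per_value / 8
    return total_bytes / 1e9


# Hypothetical architecture, chosen only to make the arithmetic concrete.
ARCH = dict(layers=64, kv_heads=8, head_dim=128, context_tokens=131072)

w8 = weights_gb(35, 8)   # 35B parameters at 8 bits
w6 = weights_gb(35, 6)   # the same model at 6 bits
kv16 = kv_cache_gb(**ARCH, bits_per_value=16)
kv8 = kv_cache_gb(**ARCH, bits_per_value=8)

print(f"weights: {w8:.2f} GB at 8-bit, {w6:.2f} GB at 6-bit")
print(f"KV cache: {kv16:.2f} GB at 16-bit, {kv8:.2f} GB at 8-bit")
```

Under these assumptions, halving the KV cache precision halves its footprint, and at long contexts that can save more memory than dropping the weights from 8 to 6 bits.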

by SwellJoe16 hours ago|

Qwen 3.6 35B-A3B with MTP at 8 bits is blazing fast, something like 50-60 tokens per second. That's plenty fast for interactive use, so I haven't tried lower bits. Unfortunately the MoE is notably dumber than the dense model (for the case I have data about...I've been benchmarking models for security vulnerability scanning, and 27B is notably better on hard bugs).

The dense model is almost usable, but feels really sluggish, even with MTP. I think it's about 12-15 tokens/second on the Strix Halo. Slow enough to where I'd rather pay to use a cloud model.

I might try the 6-bit version of the dense model to see how it behaves, though. Maybe it'll retain its bug hunting abilities while making it fast enough to use interactively and not take all day for benchmark runs.

by hedgehog15 hours ago|

Same chip, with a 6 bit 35B and 8 bit KV cache I see about 500 prefill and 55 decode at 30k into the context window. MiniMax seemed a bit lower token rate but much, much less prone to 40k tokens of monologue before generating an answer. A pattern I like is to use a smaller model to do most execution and then a larger model to review transcripts and output and do any fixups and tooling improvements (this is all batch jobs so all I care about is overall throughput).

by milch13 hours ago|

What hardware do you need to run MiniMax M2.7 230B locally?

by hedgehog10 hours ago|

Ryzen 395 is what I'm using, anything with 128GB+ of RAM accessible to the GPU should work fine for a 4 bit version of the model (so Spark or Mac Studio should be ok too).

by dirkg13 hours ago|

The RTX/DGX Spark and Mac Ultras with 128GB unified RAM are all ~$5k. It's still an expensive toy for rich people; it might as well be an H100 for 99.9% of the population (not devs with high-paying jobs, of course).

The value of local models is allowing normal people to access AI without needing to subscribe to cloud services. This is especially important for the rest of the world, where even a 12GB GPU is extremely expensive.

There is no real viable local option that will come even close to Sonnet/Gemini Flash or the cheaper Chinese models. Even if your PC costs <$2k you are never going to recoup the hardware costs, and the results will be far worse.

by green7ea5 hours ago|

I'm using a Strix Halo laptop (~3k, 64GiB) and with Gemma 4 and Qwen 3.6, both at 8 bits, I'm seeing very impressive results.

As a work tool, this is reasonably priced. You can save a bit of money by opting for a non-laptop form factor.

by organsnyder4 hours ago|

My Framework Desktop with 128GB was about half that. I did luck out by buying before RAM prices went crazy, though.

I'm looking forward to the fallout when the data center bubble bursts. There's a good possibility we'll see a glut of hardware, either on the used market or from manufacturers that no longer have massive orders from OpenAI and the like.

by zozbot23418 hours ago|

RTX Spark is pretty much the DGX Spark in a laptop form factor, plus some lower-performing chips in the same series to be released later according to rumors. We know quite well how the top-of-the-line chip performs: it's very interesting for some application areas, less so for others.

by borissk4 hours ago|

We are really getting close to singularity - the pace of LLM improvement is constantly accelerating.

by McGlockenshire22 hours ago|

> my consumer-grade card with 12G of VRAM and got 5t/s for output

Thank you for giving me hope!

by DeathArrow10 hours ago|

>The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually

Can you instruct it to use an LSP?
