BUT DO NOT buy this MacBook if you plan on doing serious coding using local LLMs with it. The reason is simple: your fingers will burn and your head will explode from the noise.
Running any kind of sophisticated job on the very laptop you are using is just not viable. Sure you can use it in clamshell mode, but forget touching it while working with AI coding or agents.
If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk. Connect to it over LAN or Tailscale. The MacMini will also cost you almost 1/3 of the MacBook Pro.
Thank me later.
With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop.
If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's night and day experience.
I almost always keep my laptop on low power mode.
Since you can control the low power mode setting from the command line: `sudo pmset -a lowpowermode 1`.
It should be pretty straightforward to hook this up to Hammerspoon[1] using hs.application.frontmostApplication() to apply the setting based on whatever foreground application you choose.
Thinking out loud, that being said, the necessity of sudo might make this slightly more complex. An always on background admin agent might be needed I suppose to bypass the password prompts (or add pmset to the sudoers file, if you prefer).
614 GB/s of memory bandwidth
> MacMini M4 with 64GB of RAM
273 GB/s of memory bandwidth (also only currently available with 48GB)
When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.
And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.
This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.
It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.
All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.
I do like Gemma for translation, however.
That said, the reason they're able to release Ornith branded post-trains of both Gemma and Qwen is because they're open weights under a friendly license. Someone, not just Google, could make a coding focused Gemma post-train. I don't think it's actually much weaker than Qwen 3.6 for coding; Gemma 4 31b outperforms Qwen 3.6 27b by a wide margin on security bug hunting (at least for the specific bugs in my benchmarks, which are mostly relatively difficult bugs from the Mythos-reported bugs).
I'd really love to see a bigger MoE from Google, though. A 70b or 120b MoE would likely be super fun.
So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)
I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.
There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).
There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)
But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...
Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.
There should be a lot more content on setups and best practices etc. if these macs would be used with local models only.
I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).
Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).
I'm mainly interested in coding/image creation tasks. Has anyone built out a server for a similar use-case and, if so, whats your experience been? What cards should I be looking into? Am I looking at spending ~10-15k for something that can give me near frontier quality/speed? I know about the DGX Spark/Mac Mini's, but I'd like to be able to upgrade later down the road.
Don't expect workstation loads with no fan or heatsink, true. But it's not a real problem, it's still quieter than a desktop.
That said, rather than Mac Mini, if you only work from one place, I'd recommend a Studio Ultra M3 with 512GB. Same or more tokens per second, multiple models loaded. Cool and quiet.
If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.
They pulled them a month or two ago, right after I bought it.
> Apple M4 Pro chip with 14‑core CPU, 20‑core GPU, 16-core Neural Engine 64GB unified memory 2TB SSD storage 10 Gigabit Ethernet Three Thunderbolt 5 ports, HDMI port, two USB‑C ports, headphone jack Accessory Kit $2,649.00
Using linux for actual work on my workstation.
10x rtx6000 Pro in a large workstation is probably the way to go for someone wanting to run GLM5.2.
Other than that it is cloud.
As good as these small models got we are still not "at breakeven" for me.
What is "breakeven" with LLMs? For me it is when I no longer have to read the actual code it wrote. I can trust that if I told it to implement and document a certain architecture it actually did that with no stupid mistakes.
The first model ever that did that for me was the first opus. 4.4 if I remember correctly.
The second model was Gemini 3 Pro preview. For few weeks. Then it was lobotomised. I guess it was too expensive to run and they quantized it too hell.
Only Opus remains. If this GLM model truly rivals even an old opus I'll be very happy when day comes that I'll be able to run it locally.
You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.
$6800 is a lot of API credits for GLM, for example, on any provider you want to use.
Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.
I still am going to buy a second one haha
I'm wanting to run Kimi 2.6/2.7 GGUF on it and just slap it in the server rack, but trying to decide if a spark cluster makes more sense.
But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.
Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.
I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.
As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.
Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.
After about 1 minute the entire machine basically bricked and I had to hard reset :D
if a hardware cycle takes ~3 years then fall 2026 would be the first possible device generation where apple exploits its advantage with the unified ram architecture.
more realistically, spring 2027, since they probably also needed some time to make up their minds to lean into that on the top end.
that`s also how i would interpret the recent rumors on m6 and m7.
naturally, the cooling and all that will be optimized around that.
so the first devices that are actually intended and designed for this use case will come at the earliest this fall and more likely in q1/q2 next year.
you are basically paying the price now to be on the bleeding (sweating) edge
My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.
An AMD AI Pro R9700 32GB brand new is $1350 right now.
After some tweaking, I had it running faster than the models the 3090 could run, and it could obviously run with higher context limits and bigger models due to the extra vram.
You could run a 4-bit, which is 16-17GB. But, you'd need a smallish context or you'd need to quantize your KV cache. Something like TurboQuant or RotorQuant might help.
32GB is the lower bound for comfortably running this size model. I'd maybe even say 64GB is right-sized, because a 256k context is nice to have for agentic workflows, and that won't fit on a 32GB card without heavy quantization (but I haven't tried TurboQuant or RotorQuant to know what impact it has on memory use for context).
You could also put some of the model into system RAM, but that defeats the purpose of your argument that a 3090 will outperform a Mac Mini or Mac Studio. If part of a dense model is in system RAM, it absolutely will not outperform a recent unified memory device.
But man, I have never purchased a computer which is more expensive than a decent family car.
https://www.microcenter.com/product/709071/pny-nvidia-rtx-pr...
I know you probably weren't referring to this type of memory in your post, but IMO it might be worth avoiding this term in the future unless you're referring to HBM, the standard.
Also, while memory bandwidth is important, it isn’t the only consideration. Apple’s architecture has memory bandwidth equal to a mid-range consumer GPU, but its GPU speed is much, much worse than, say, a 5080 or 5090. This translates into e.g. much slower time to first token on Mac systems compared to dedicated GPUs.
As more context will degrade a lot the t/s. On top this is 1 slot.
If you use sub agents the kv cache will be invalidated with colliding request and make it even slower.
So the in real world 256k (the max qwen offer) and using 3-4 slots the numbers are very different.
This is the major issue with so many postes over local models not benchmarking real world use. Real context and not taking this in context.
If you use 1 slot the issue, you loose the ability of using sub agents when exploring and all end up in the main agent context overloading it, triggering compactation and oh boy with 64k context that compecation will be an endless loop.
What tasks you would really be able to do with 64k context 1 agent? For sure so quick edits but not complex planning where you need to ingest a lot files and end up loosing 80% of the ingested files to compactation.
I use Windows and this has never happened to me. I have had Macbooks I cant open to fix/replace something trivial while I can replace any part easily on a Windows PC/laptop though.
needs to be noted that it's increasingly uncommon to be able to do so. for desktops you have to build everything yourself - prebuilds (either gaming or workstations) have proprietary PSU and motherboards (in case of workstations, sometimes CPU is bound to the motherboard / manufacturer, for example Threadrippers). Windows laptops now often come with soldered RAM and soon will probably be without M.2 slots like Macs.
There is Framework though I guess
Mac Studio: Ships: 16–18 weeks
Mac mini: Ships: 10–12 weeks
As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.
Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.
But you do need a fast LAN connection, otherwise working with agents will be a pain.
Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.
I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option
But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. But things move so fast in this space, I wouldn't base my decision on a generation of models that's just a few months old.
So not "implement me a shading algorithm"
But more like: make an multi user app running on a k8 cluster, design the whole thing to be indempotent, scalable, easy to deploy remotely via ipmi/pxe boot.
Then see how it makes stupid mistakes along the way.
Today's AI is pretty amazing when it comes to fixing narrow problems (or creating Web apps with no infra). Give it anything where it needs to go online, download some helm templates and look through them to figure out parameters, as well as write an app and it will make lots of mistakes in seemingly simple stuff.
Opus seems to be the model that works the best with this.
Wouldn't this damage the MBP display?
My RTX laptop has air intake underneath the keyboard and clamshell mode is surely a recipe for disaster; I've taken numerous measures to ensure that the laptop doesn't stay awake when the lid is down.
It’s just so flexible, and I even use it in agent mode (ds4) directly on the machine as well sometimes (it’s really not that bad, I’m often running inference for small side projects on my couch), if there is another machine that can do all of this and still function as one of the more ergonomic, well built, and compact laptops out there, I’d love to hear what it is cause I’d likely be interested!
Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.
- M3 Pro MacBook Pro 36GB
- M2 Pro MacBook Pro 16GB
- Mac Studio M4 Max 48GB
and I have not heard the fans on any of them with normal use. The only time I've ever heard automatic fans was when I was using a local 12B model on the M3 MacBook Pro, and when running 70B models on the Studio.
You should consider checking Activity Monitor and making sure that the usual suspects are not causing issues with sustained high CPU. And you can use an app like [Stats](https://mac-stats.com) if you want to see that info while actively using the computer.
While it is wild to have this much power in a take-it-anywhere laptop form factor, I sort of regret not just going for a Mac Studio + base M5 MBP.
llama.cpp's Metal backend does use them when they're available.
How is this config?
qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).
I'm running this model on a Framework 13 and the chassis barely heats up at all while running full tilt.
to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.
Im sorry, but its time to start calling Apple sycophants out. Stop trying to push your tech jewelry on other people. You only buy those computers because they are Apple, you don't know anything about computing or running LLMs, you don't do any real work, so you should probably not give advice on what to buy.
A single 3090 will run Qwen3.6 27b fine, and its VRAM speed is twice of what the best Mac has. And the build will be cheaper. Decent CPU/Motherboard, 32gb of DDR4 ram, an SSD and a Single 3090 should run max about $4grand. Mac m4 mini is 6grand.
Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.
Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.
I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.
I am not a hater of Nvidia and I am planning on building a workstation based on RTX cards. You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).
I am pretty sure I know a thing or two about computing, I have been in the trenches for many, many years and I have had machines of all kinds, shapes and colors. It just so happens that Macs are very capable, very convenient machines that happen to work great in the era of LLMs, too.
But you do you.
The only reason I can tell it's on, is the very quiet hum of the slow speed water pump. Large fans run at 1200rpm and are fully quiet.
I have over a meter of radiators there.
Fun fact, I bought my first rtx3090 4 years ago. A year ago I bought another one and they are still the same price used.
I may buy another one (for my servers)
If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff.
But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.
Some people will be happy to pay that premium for privacy, but at roughly 10X the cost of a MacBook Neo, that money could also buy a lot of credits on OpenRouter or frontier labs.
[0]: https://www.apple.com/shop/buy-mac/macbook-pro/14-inch-space...
I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.
I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.
The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.
Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.
Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.
When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.
And I can't say that I won't switch to openrouter (even just for the same models) at some point.
But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.
The rest of my life is ultra-frugal so I am relaxed about this.
Having spent a good weekend learning how to perform latent-steering through playing with pytorch and a local Gemma4 model, there is no way I could have groked any of that in the the way I did without hands on time.
This is on an M3 Max 36GB I've had for a couple of years. No further outlay needed.
I don't know if it has changed my mind about a career change but as I am sure you can understand, I no longer feel like I am running away defeated.
My very best wishes to you :-)
The interesting question is whether that gap will narrow, and if so, how much, and on what timescale.
The exact answer to this question is not knowable, but if you are the kind of person who comes to a site called "hacker news", and you think there is a nonzero chance that the answer is that yes, the gap will narrow and this won't always be an expensive toy, then now seems like a pretty great time to get in the game and start exploring the capabilities.
(sarcasm, btw)
Over the long term it's always been better to buy than to rent, even if the renting option is technically more efficient on the GPUs, you don't have to pay some hosting providers profit margin.
And for users that aren't running multiple agents 24/7, you should be able to fit a good user:GPU ratio.
What makes you so certain that economies of scale won't work the opposite way you imagine? E.g., if model improvement tapers off, but RAM costs decline (hard to believe atm, but historically likely), then eventually everyone will be able to run SOTA models on their personal hardware.
Heck, even if model sizes simply grow more slowly than RAM costs decrease, the same would happen.
For example (and relevant to AI) I can generate electricity on my roof at $0.20-25/kWh, batteries included. In California the electric utility can’t offer it cheaper than $0.30-0.50/kWh. Therefore at scale, electricity is actually more expensive.
There are many such examples.
Right now, there is way more scale in centralized AI than there is at the edge. But that could flip. I'd still probably put the probability that it will under 50%. But I'd also put it above zero!
I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place
That's never the point of keeping local alternatives though.
For me this dates all the way back to installing Slackware 1.0 (0.99pl12!) on an offline 486SX rather than just using the internet-connected workstations in the lab.
Here, I already had a Mac that was powerful enough to run a local LLM, so now I do, because I can.
I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight.
For people who demand an understanding of their tools, it's a lot of work. I recognize the value of "AI" in performing the tasks I'd have to do manually; for example, keeping the data structures of my front- and back-ends in sync in a project. But do I want to interrupt my development and take weeks off to digest all of these tools?
And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.
Cloud models still feel ‘magic’, like you send a request off and get something back, like it’s something ‘special’. I used to joke that ChatGPT might be some kind of mechanical turk underneath.
Watching a model run local on your own machine hits different — you realise that yes, it IS just a computer program. Which for me actually makes me appreciate the leap we’ve made MORE, not less. From an information-theoretic point of view, LLMs really are something special.
The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.
Like, just watching a computer I already owned act like ChatGPT with the wifi disconnected.
It was the first time I stopped feeling quite so helpless, somehow.
Qwen barely needs any of Opencode's prompt, in my experience; I think I cut it down to about three general lines I found by googling. Mainly you need only a pre-amble to make sure that the plan mode, plan switch and build mode prompt fragments make sense.
Gemma 4 also needs almost nothing at all, which is fascinating, considering it is not a coding-specialist model. It just seems to be who you need it to be when you ask.
Just one example, I needed a bunch of images tagged and organised, with a local vision capable model I could pretty easily set that up and leave it running overnight.
I already had the GPU and memory for gaming, so it was at no cost for me to start running local models. But I feel the long term writing is on the wall, local models will only make more and more sense as they get better and more efficient.
Seems like a GPU with 12GB+ VRAM is going to be a much more affordable way to achieve that? Even a B580 should get reasonable perf there.
I guess I would build a powerful home LLM server if I was convinced I really needed one for my purposes for some agentic application or other. At the moment I'd prefer to ride this out with a machine that is also an excellent Mac.
Agree having a powerful machine is really worth it in general for professionals, but strong disagree that running local LLMs has anything to do with it. It's hard enough as it is getting a good ROI on your time/money prompting/wrangling with frontier models. IMO leaning on the comparatively limited capabilities of local LLMs is best avoided in favor of keeping your own personal coding skills fresh and continuing to learn new ones.
I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it.
YMMV.
But if this is the case, as you say, it seems like a good opportunity to build a more welcoming set of entry points into this!
(Very reminiscent of 3D printing, where you get a lot of very trivial advice poorly applied, which is an analogy I've now made several times.)
Several of the youtubers are pretty helpful, though; I watched half a dozen things and absorbed the broad pattern and then went for it.
Also I got a lot out of reading HN comments, which is why I am here; tucked away in the corners of these discussions are people who can help. Over time I hope I am one.
To me, "how do contemporary AI systems work and interact with contemporary hardware and how can I best take advantage of their capabilities?" is the set of skills that are worth learning at this moment.
What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns?
I dunno, all of this seems really boring and "been there done that" to me at this moment in time!
I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.
So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.
I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.
For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded.
I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.
(Also who am I kidding about the existence of a printer with no problems)
I have colleagues that seem perfectly content to delegate too much to the agents, and it saddens me. It feels like there will be swaths of engineers that didn't train some of the critical thinking skills that I take for granted.
I certainly see it in slack discourse around anything more complicated than a feature implementation. Maybe I'm just cynical. Time will tell, I suppose.
That is why I'm content to delegate to agents - I have more code/features I want to write than I have time to debug (writing is the easy part).
Over the last few months, I've been digging into performance problems with a high throughput service that my team owns. I started working on the problems in my own time, put out short and medium term improvements that legitimately avoided operational issues, and started developing an alternate architecture that should meaningfully address the problems for the long term.
I've learned new things and made improvements that probably wouldn't have ever gone in otherwise.
I've spent my whole career being frustrated by the pile of low severity bugs and performance issues that "I could fix that if I could only justify putting a couple hours into it!". And now I can just fix all those. Nobody is going to question my use of time to write prompts and do code reviews of those things, when I can to my "real" work simultaneously.
What does "mainstream" refer to when we're talking about software development and LLMs? As opposed to "engineers".
But I think there is (and has always been) also a distinction between the "mainstream" of software developers vs people who are working on new tools and capabilities to be used by that "mainstream".
IMO it is certainly true that the most efficient and cost effective was to do "mainstream" software delivery at the moment is hosted frontier models. But for people thinking about "what's next?", it makes a ton of sense to be exploring different models in anticipation of a possible (but certainly not inevitable) sea change.
I mean one of the things I use a local LLM for, because I can, is to generate starter documentation. But I ask it to — I want it to give me overviews, plans, all that. It can make something bespoke for me.
I guess I could also ask it to do the work. But where do you draw the line?
The universal labour-saving device is the great provocation of the next 100 years I think, and both Star Trek and Wall-E have grappled with it.
And that's how skills die.
The reason I delegate so much of local LLM installation and administration to Claude Code is simply because there's no point learning practical things that will work completely differently in a couple of years, or in memorizing procedures that I'll forget long before I need to perform them again.
No longer having to sweat all the details is a Good Thing, not a Bad Thing.
But I think if you want to really learn to ride well, understand horses well, there might be some benefit in learning how to shoe a horse. At some level it should never only be someone else's job.
For example, you need to know it uses gasoline (or diesel), it requires oil changes every certain amount of time, break pad replacement, etc.
You also probably need to know that you can't operate cars over a certain amount of water, that you need a driver's license, stopping at red lights, etc.
Sure, you might not need to be a mechanic, but that's far from not understanding how a car works, which to me sounds similar to knowing how to shoe a horse, which is different than being a horse vet.
Maybe a more apt analogy would be a skill like making fire without a lighter.
That skill died too, so what's your point?
Maybe my biggest problem with the world of agentic AI, and the reason I am putting myself through learning it the way I am, is that the need to know the "why" of everything is so fundamental to me, that I don't know if there is any point to me without it.
So this is really the only way I know how to proceed.
And we happen to be discussing this on a forum where the type of people who will be the specialists for the kinda of systems we're discussing are likely to gather.
I'd be surprised if in my casual discussions out in the real world, I were to run into a lot of people who care exactly how all this works, to the extent that they want to invest significant money into hardware that allows them to run things themselves and dig into what's actually going on. But I'm not at all surprised to come across such people here! (Indeed, it would be very disappointed if I didn't!)
I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)
I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.
A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.
- opencode with it's webui
- deer-flow with it's research/powered front end
They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).
It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.
It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.
Have you tried Paseo?
I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.
(You can also use the Opencode GUI to frame a remote opencode web interface)
I'm gonna check out paseo, but am not looking forward to all the ram the agent needs + all the ram paseo needs
Hello, my brother, just know that you have a fellow passenger in life at the same age who thinks the same thing. I agree that the local stuff is helping my understanding a LOT.
However, my gut feel as someone who got to experience the TeleBomb after the DotBomb is that the obfuscation is INTENTIONAL--it's neither you nor your age. I remember asking people to explain to me what the OC-768 startup endgame was when roughly 10 OC-768 links could carry the world's traffic at the time--and everybody giving me blank looks. The AI Bubble has the EXACT same feel as the Telecom Bubble--just bigger.
What I really wish is that I could find a VPS-type provider where I could toss things into their NVIDIA/AMD machines for an hour or two. Alas, all of the providers seem to want massive paperwork and huge minimum purchases.
I can't wait for the bubble to pop so that we mere mortals can finally build with this stuff.
In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.
So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.
edit - after actually reading the tweets (had to use xcancel) and visiting the source git repo, switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!
https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...
The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.
Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.
Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.
The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.
It does seem to be doing useful work but it’s not API call level quality
If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.
With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.7-27B.)
I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.
Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.
Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect.
When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or for metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par.
Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for.
Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.
If you want to run unquantized, you definitely need 128GB.
Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.
As of writing this, it shows 24 offers between 700 and 950.
Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.
If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?
$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)
Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.
Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.
https://github.com/noonghunna/qwen36-27b-single-3090
Flies though (50-70tps is impressive for a model this smart)
I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.
The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.
I did find a few useful parameter settings I've already discovered using my single 3090 and ollama.
I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.
[edited to mention ollama as a nice alt]
I still use the MTP version as it _feels_ slightly better quality, and because the unsloth quantizations I can get have more variety to fit into the various systems at hand... but that's not for the MTP aspect, unfortunately.
In the article they did have ~2x performance on the 27B (which might be something to retry, though on my Framework that would bring it from 5 -> 10 token/s so still "excrutiating" speed, probably).
YMMV for sure.
I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.
Context size?
i'd expect anything on github for example to be already in their training set or is training on actual usage more useful to them?
In any cases, from the economic point of view, running models on laptops make little sense. Even at the pure cost of energy consumption, it might be hard to beat pricing at tokens generated at scale.
At the same time, it is a breaktrough, that will change the game. Previously such vibe coding on consumer device was not hard or costly - it was impossible.
I paid 2424 euros in total for this machine. And it can easily run the models discussed in the comments and in the article. It's tiny, and runs CachyOS like a champ. Over 4000 euros less than the price you listed.
We can all send a thank you letter for our friendly billionaires such as Sam Altman for the price situation we're in today: https://www.mooreslawisdead.com/post/sam-altman-s-dirty-dram...
I think you might be a little to into the stew here.
I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.
In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.
Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B
So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).
The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
I'm thinking of getting a SoC machine with 128GB RAM but the bandwidth is limited to 256 GBps. Would you even consider such a machine a decent investment, or should I wait for the newer gen of chips? Thanks!
These devices, especially the DGX line, are fantastic if you are interested in low-level CUDA programming. The DGX spark can be used to prototype CUDA code/libraries for GPUs that most of us couldn't think about affording. If you want to learn how to program for datacenter level GPUs then these are the best way to get that at home. Sure your code will run very slow compared to the real thing, but you can take that code and, theoretically, run it on the real thing. For anything else though, I feel there are better options.
If you're interested in pure inference I'm pretty partial to Apple devices. The M4 Max gets you 546 GB/s, the M5 MAX 614 GB/s, and the M3 ultra (you'd have to buy used at this point) 819 GB/s. Plus you have a very useful computer even if you realize you don't want a full time home inference server. Additionally these devices require very low power (if you're running high end consumer GPUs you do have to think about what your energy costs are per hour and how warm you like your room).
If you're interested inference and training, or already have a pretty beefy desktop PC, or simply demand the most token/s you can get, then GPUs are the way to go. The downside is they're still pretty memory restricted (but honestly the options for what you can run on any RTX N090 are pretty good). You'll get blazing inference and prefill speeds on these devices. The only down side is, if you are using them heavily, you will see it on your energy bill and feel it in your room.
The "should I wait" question is also potentially applicable. The world of consumer hardware is looking increasingly bleak (and expensive) but if Apple does release a new "Ultra" model we could be looking at inference speeds very close to GPUs (there's still limitations to these devices that makes training preferable on GPU)
What I had in mind was an AMD Strix Halo machine, but it seems to have none of the advantages you mentioned. It's neither high bandwidth, nor does it have CUDA support, nor does it have support from the big OEMs. All the boards are from relatively obscure Chinese vendors.
It seems like all the major OEMs have rallied behind Nvidia, if you look at the upcoming RTX Spark laptops.
The same can be said about operating system memory requirements. I am sure Linux and Windows kernel developers can confirm. Yet 30 years ago Solaris used to run comfortably in 16 MB of RAM, today you need 512 times that to run Linux.
What's going to happen is that the capability at any given size point is going to get better over time as new training regimes cram more into the available space. A 27b model released next year will be better than a 27b model this year (else why release it?). Hardware will get more useful, not less.
... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.
> GLM-5.2 class models already need 1TB+ of RAM.
If you quantize GLM-5.2 to 4 bit, you can do it in less than 500GB: https://huggingface.co/unsloth/GLM-5.2-GGUF (table on the right)If you find three finds that also have a 128GB MacBook, you can chain them together (the MacBooks, not your friends) and make it work.
You could also run GLM-5.2 on a single MacBook if you stream the active parameters from disk, but even with speculative decoding, you'd probably only get in the order of 1 token per second, so this is not really practical for most applications.
They’re trending to be the right size to be good.
Qwen3.6-35B is not as good as Qwen3.6-27B. The larger model is faster, but a lot dumber; it gets caught in loops, makes crazy mistakes, and is just not as good. It’s bigger, but it is nowhere near as good as the 27B variant.
at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.
Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.
i've watched friends try that route; i've been through this before. taking a downgrade is never fun: if it's a thing you're likely to care about in the future, then sometimes it's better to place yourself in the right ecosystem early.
in terms of privacy, yes that's a real application, but someone taking it all away? I don't see it happening.
it's not an OS or a device, it's just a box/thing that runs a model, it's really commodity stuff we're talking about
more realistic concern would be that the open labs wouldn't be able to compete in the future thus development ends, but that means you can't host models that don't come out so...
again maybe I misunderstood but I just don't see why this would be worth it just for that one concern
From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.
Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.
You're welcome to make your substantive points thoughtfully, just not aggressively.
Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.
But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"
The real test is whether or not it can work with your existing codebases. In my limited experiments Qwen 3.5 (maybe 3.6 is loads better) does OK on a Rust+React app, and less well on a C# monolith. Not to the point of being unusable but definitely poorly enough that I went back to Claude after 20 minutes. If I lost access to a cloud model and had to use Qwen instead I'd be visibly sad.
Not really germane to your comment but I hope I don’t sound old when I say I remember a time when spinning up a PoC was a week of work, and a statement like yours was pure science fiction.
If I start prompting away the core of a new project I lose interest in the entire thing almost straight away. I hate it. The next day I could care less about it. In fact it just makes me lazy, like a fat person who drives everywhere.
I love typing code and thinking for myself. Im going to continue to do that. I still dont know anyone who's shipped anything truly useful with this garbage tech, let alone with a local 30b param model. So much cope in these comments.
Spending 6k on hardware to run the worlds most mediocre model truly does make you an incredibly stupid person, so Im not really suprised by these comments of people saying these tiny models are helping them so much.
Its like a special needs kid all of sudden got the ability to code, of course they'd be impressed by basically all the code it produces.
I’ve used Qwen 3.6 27B for many things at work, and I’m regularly able use it for reasonably scoped tasks.
I’m not saying these models are perfect.
But you are complaining about people on the extreme, while at the same shouting from the opposite extreme.
2) Not every team will have someone with 20 years of experience in a particular domain eager to spin up a PoC.
What are you even saying? Are you aware that there is a massive range in the scope of projects? You must work on some incredibly simple CRUD apps if this is your take.
This is an underrated consideration when evaluating the small models: The further you deviate from standard example code, the more their weaknesses show.
My experience is that Qwen3.6 produced some amazing results for a small model when I tried it with simple apps that are widely reproduced everywhere. If you want a React TODO app or to set up a little boilerplate app with shadcn and other popular tools, it will produce something that looks not too bad.
Then when I started straying outside of common tasks and into some of my more niche work, it would spin for hours and go in circles before finally producing some groan-inducing output that wasn't usable.
If you're looking for a model to help with simple refactoring or small tasks where you provide very explicit instructions for exactly what you want, but you don't want to do all of the typing yourself, they can do a lot of good work, though. But you're right that once you get into long context sessions involving topics off the beaten path, the weaknesses are very apparent.
The quantizations that are popular for making these models fit on smaller hardware make the problems worse. When you read it about online there is almost a consensus that 4-bit quants are lossless and that you can use q8_0/q8_0 kv cache quantization without any real loss, but in my experience with real projects there's a substantial degradation in long context performance with any of these quants.
Never go below an fp16 kv cache unless you've already tested it in advance with your model on a verified task that you know it can successfully complete. People should also test the difference using the exact same seed value so they can see how the tokens diverge. If you have memory constraints, sometimes you can still use an fp16 kv cache and use storage for an agentic buffer to work your task with mixed abstractions rather than having everything in memory.
For 4-bit weight quants, Gemma 4 31B QAT is where people should be looking instead of Qwen 3.6.
Modifying existing code is way easier if you don't expect it to be smart about it. Don't say "add X feature" and let it explore the codebase and build its own understanding. Point it at the relevant files and say "the goal is to add X feature to this code, follow Y guidelines". Now you've done the hardest part of making the decisions and it just has to follow instructions while coloring within the lines.
Is that not how you would work with any model, local or not? I wouldn't trust it to make the right decisions unattended. I just know the moment I look away it's going to do something utterly braindead.
All small-scale stuff. For large integrated projects I am finding DeepSeek v4 Pro commercial API to be very inexpensive and helps me produce good results.
1. Maybe you should tell us what those limited experiments are.
2. Maybe you should actually try 3.6 because it's huge difference in most cases. Don't forget to tell us quants and don't forget to tell us scope.
3. Maybe actually show us data compared to frontier models instead of this... vibe comment. Pretty tired of this kind of comments on HN that doesn't require logic or evidence. Just vibes. Like the pelican riding a bicycle crap that everyone has taken for granted but has no objective way of assessing goodness.
(I'm aware the price is, in absolute terms, more expensive where I live compared to the USA. That reinforces what I think, because anyone sane that would've bought one of those in another country would sell them as soon as they landed here and save that money.)
In other words, yes, buying this kind of machine only to run an LLM locally doesn't make sense, because local LLMs generally still suck for serious programming work (they work great for spam filtering though!). But more generally this machine makes sense for a lot of people.
* yes, you can run it on an older/smaller GPU plus system RAM but performance will suffer
* if you want optimal GPU performance you need the model in VRAM plus context, so 24GB (3090, 4090) or 32GB (5090) cards, plus a system that's reasonable powerful to plug them in to. Ideally you'd have a multiple cards working together but for optimal performance this means either 2x 3090 or nvidia's workstation cards.
* you can go for a 128gb Strix Halo system, but the memory bandwidth isn't great and they're becoming increasingly more expensive (5.5k EUR for HP laptop, 3.9k EUR for GMKtec EVO-X2 mini PC)
* you can go for a 128gb DGX Spark (5k EUR+) which also has unspectacular memory bandwidth or RTX Spark (price unclear but probably not cheaper)
* or go for a Mac with a decent CPU and a good amount of RAM (bandwidth varies by model, but typically a bit better than Strix Halo/DGX Spark and worse than bespoke GPUs.
As usual with such questions, there are of course cheaper paths (if you want to accept the tradeoffs) but Macs are reasonable vs. competition for these workloads.
I wasn't really expecting much from these local open weight models neither when it comes to speed or "intelligence", but my preconceptions were quickly put ashame when I got ollama up and running and pulled my first model. I get a consistent 117-128 t/s with Gemma4:26b-a4b without any tuning (just the default settings), which was much faster than I had expected. Can't wait to dive deeper into this, especially with Qwen3.6 models.
Does anyone's have experience adding a 2nd Nvidia GPU of the same generation but different (slower) model in the same system? Will it give a major boost with larger models, or will the slower card just be a bottleneck? I have an unused RTX 5060 Ti 16GB that I'm considering to install alongside the RTX 5080, but it would necessitate removing some other hardware, so I haven't bothered yet.
You get fewer tokens per second, but at some point the balance between quality and quantity makes the large model size worth the spend.
When you're spending this kind of money, you may as well treat yourself to a pretty screen and some decent speakers. Nothing the competition doesn't offer these days, but you get them for free with the car-priced RAM upgrade so why go for less.
Personally when going on the road I like portability (14" MBP or MBA), but at home I want raw non-thermally throttled power.
$5k for DGX Spark as well.
I spent less than $4k, OEM are better boxes for cooling, no apple markup, I get a real Linux system for stuff like k3s.
No Apple markup but you get the Nvidia market up instead. Prior to the recent Apple price increase due to RAM shortage, an M5 Max 128GB was a bargain if you want to run local LLMs.
You need an expensive motherboard, cooling, PSU(s) to use multiple high end GPUs together. Then there is the noise and the fact that you can't bring it on an airplane.
I've ran comparisons against everything that's available on OpenRouter (well, as of few weeks ago), and for $0/tok, the local 27B Qwen can't be beat. Sure, it's slower, and yeah, the office is a few degrees warmer than it ought to be -- but nobody can pull the plug, nobody is watching over my shoulder, and the results are on par with SOTA.
Can't wait for a similarly sized Qwen 3.7 - from what I've seen so far, it's a leap ahead of the previous version.
Builds and local test runs are 3 times faster than the Windows laptop option. The machine will pay for itself just based on that within 3 months. I can spin up a local kubernetes cluster and do full integration tests while I am working on other things as well.
It isn’t a strictly Mac vs Windows thing though. It looks like the culprit is the MDM software on the Windows machines is just crazy slow and constantly getting in the way.
If I was paid less it would definitely make less sense for the company to pay for this machine.
Imagine its value if war broke out over Taiwan / Greater China, or really any of the dark scenarios with global connectivity or the truthiness of commercially available models. It is a very, very difficult piece of equipment to make at any other moment in history. I wish I could have purchased more. I saw the signs and price trends and out of stocks as they unfolded. No doubt others with the means are stockpiling.
There is not a period in the history of computing where this is true of consumer hardware over a decade for anything other than hardware already at the very bottom of its depreciation curve. It is surprising to me that you state that as an obvious assumption.
I suppose if your base case is Taiwan war that may be true, but there's a lot of folks who seem to be assuming the current hardware crunch will go on indefinitely when the natural state of hardware is getting cheaper over time.
Yes. Your people earn an order of magnitude less income than Americans.
Yes. Back in the my days at $faang in europe it was not uncommon to hear people getting 120-160 k€/year in compensation and we were “poor” compared to us engineers at the same faang (4-500 k$/year total compensation) with a bit of seniority…
I just got a B70 with 32GB RAM for the equivalent of $1200 (incl. sales tax and import duties to my non-US location, so presumably it could be cheaper elsewhere). The memory bandwidth is 608 GB/s. For M5 Max (32-core GPU) it's 460 GB/s and for M5 Max (40-core GPU) it's 614 GB/s. A 3090 is still faster at ~900 GB/s but you're getting 32GB VRAM for a lot less than equivalent Nvidia cards. It's about 1/3 the bandwidth of a 5090 for 1/3 the cost, but with the same 32GB VRAM. If you're interested in being able to run bigger quants with some context and stay on a lower budget then it's an appealing trade off.
I'm still exploring using these local models so don't want to spend the equivalent of $5 000 - $10 000 just to test it out. I don't mind slightly slower perf to do some experimentation more affordably.
I actually got an B50 16GB (with meager 70w TDP!) first to test an Intel card with my stack - it worked easily with Ubuntu & Vulkan. I'd read a lot about hassles and people writing them off as unusable but it seems like these are often with SYCL which doesn't even seem to outperform vulkan and so why bother? (The B50 was just $370 inclusive tax and duties). Literally `apt install` the vulkan libraries and it worked with default xe driver in 26.04 and the vulkan build of llama.cpp. The SR-IOV PF/VF also just works with qemu/kvm, no tricks required. Since I got it fwupdmgr has updated the firmware twice so Intel is presumably actually trying to support these products.
ROCm nightly was pretty easy to setup and get up running. The 9070XT has been a decent card for my use cases.
But the SYCL ecosystem versions. Absolutely horrendous and everything is hundred commits behind. Vulkan is probably the only way forward with this card.
Seriously, just put $10 into openrouter and play with models that are cheap but bigger than what you'd reasonably be able to run locally like deepseek v4 flash (unquantized). You'll be surprised by how far that $10 goes for a model better than what you'd be able to run. Even further on the model you would be able to run locally. Then think of how many long it would take to match the cost of spend + power on doing it locally...
I can run qwen 3.6 35B on my gaming PC at around 50 tok/s and other than power cost of a tiny bit extra per month, it's hardware I already owned from years ago.
I'm not really sure why qwen 3.6 35B is so expensive on openrouter, it seems abnormally high for what hardware it takes to run it.
I'm trying to go the same route, but I have a 5070Ti with only 16GB VRAM (I bought it for gaming) and I'm not sure how to run anything decent on it. I have 64 GB RAM if that matters
The main thing in LM studio (or whatever software you use, assuming it has fairly up to date stuff and exposes the toggles) is to offload MoE layers to the CPU, and use K/V cache quantization at Q8_0 or Q4_0.
Since you have more VRAM than I do, you could probably get away with MoE offload of like 15-20 so some remains on the GPU.
Just make sure GPU offload is turned all the way up. And I use 64k context size, although with 16GB VRAM you can probably do more.
You can find the best performance spot by playing with MoE offload until you find the number that gives the highest tok/s on your hardware.
https://www.reddit.com/r/LocalLLaMA/comments/1t9eo83/running...
On a 2021 M1 Pro (32GB RAM) I can get either of them as `IQ4_NL` quantized models (the first with reduced context, around 160k; the second can do the whole 264k with RAM left over), running something like 30tokens/s.
On a Framework 13 AMD AI HX370 it can use the same, but both on Q8_0 quantization, full context window, parallelism. Speed is just ~15tokens/s so slower, but definitely smarter than the lower quantized siblings.
Both of them are good developer partners for an engineer who wants more of a second pair of eyes and a rubber duck, rather than a model to just do everything for them. Pretty good for my brain dumping, some commit reviews, sanity checks, just always assume that every claim has to be checked and re-checked.
The only problem is really the context loading, that's pretty slow (starts off around 300token/s on empty context, by the time we get to something like 70-80k which is just a bit of repo discovery, it can run around 80 prompt token/s or less, so there's always a lot more waiting around. Local tools need to bump all of their timeouts, and have to be mindful that there's unlikely to be really meaningful parallelism on these machines with local models.
I'm still figuring out how to approach these things, though. Definitely better than glorified autocomplete or search tool (and too slow for the former, pretty decent for the latter). Their limited skill and performance make it more in line with other tools like my IDE or editors, that they are still in the "tools" compartment of my thinking, rather than "independent, cognitively active entities". Which feels like a good thing.
QAT, MTP, 128k context.
I liked Qwen 3.6 27b too, it just seems that Gemma4 is a bit underrated.
Though I’m currently working on QADing the smaller Qwen 3.5 models from FP16 teacher to NVFP4 student, to hopefully eventually apply it to 3.6 27B… harder to get right than I expected though!
https://huggingface.co/google/gemma-4-31B-it/discussions/118
Local development for who? How many of y'all are rocking 128GB of memory? Am I reading Apple's site correctly that it's a $10,000 laptop?
I’m not having it build whole features from scratch, though. I give it pretty explicit instructions closer to the class or function level, and it still saves me an immense amount of time, while I’m very connected to the code that’s written.
Definitely the sweet spot for me.
For 24GB VRAM cards (e.g. 4090) you can use Q6_K (22.5GB) or Q5_K_M (19.5GB) quants, possibly offloading some of the weights to RAM.
At any rate it makes a stolen backpack or spilled drink a lot less damaging.
Unsloth recommends 18GB of RAM for Qwen3.6-27B (for their version of the model).
Sent from my 8gb M2 Mac mini.
I struggle to imagine purchasing multiple 1k+ cards on my own dime.
If Qwen models are so much easier to run, why are the providers charging more than V4 Flash?
[0]: https://aibenchy.com/compare/qwen-qwen3-6-35b-a3b-medium/qwe... <-- compare how the three models draw hamsters svgs, lol
Are these unified memory Macs and giant 24GB desktop GPUs achieving dozens or hundreds of tokens per second commensurate with their 10x-20x cost?
The full 128GB is surely helpful in keeping browsers, editors and other things running since even 20-35GB models + k/v caches can eat up a lot of the core 64GB in my experience.
I love it because the watercooled 3090s are completely silent even under load. Facebook marketplace is definitely the move for a lot of the parts unfortunately, since you ideally would have higher end parts that are 2-3 years old.
This thing sounds like it should be a monster but we keep running into issues of the old GPU architecture, lack of support for AMX or AMX not being as big of a help as you'd hope when it does work, etc. Apparently we only got 5 tokens per second trying to set up Qwen 3.6 27B, and a similarly bad result trying to run GLM 5.2 which fits in memory but the custom kernels we had to try to contrive were too slow. I feel like this system should have tons of potential, especially if something was designed to let the AMX and huge system memory shine.
Does anyone have any suggestions? This thing was fun to set up and it's really cool but it's been a bit disappointing not getting any big tangible results so far.
We have a similar system on a single-cpu Tyan board with 256GB RAM that I'm hoping we might be able to use in conjunction with the first one if EXO ever gets good Linux support for GPU/RDMA over InfiniBand.
Qwen 3.6 27B will run in full offload with a 4-bit quantisation in 64GB on an M1 Max. It is quite slow.
I don't know about 48GB but 64GB should be enough.
It got rather tangled up when I tried it with one of my coding tests, which is a simple wordpress plugin, but I frustrate the model by asking it to write code for older PHP, break WP coding conventions and use a rather bespoke method for arranging code in objects. So it is sort of a hybrid of a green field and brown field task; a bit muddy.
It did not do as well as Qwen 3.6 35B, but the way it worked through its thoughts was interesting.
TBH I struggled to understand what DeepReinforce are doing that is materially different; the explanation of their training technique goes over my head at this point.
So for example I'd favour a used M1 Max over a used M2 Pro, at least based on my naïve understanding. Not quite sure where the balance changes.
There appear to be some hardware improvements with the M3 and up regarding the Apple Neural Engine which I'd hope would show up in MLX performance; I remember seeing some optimisations in image generation models that are only possible on later hardware.
The GPU cores are progressively better I believe, but the memory bandwidth is lower. Though perhaps the M4 can get closer to actually saturating said bandwidth.
(And I must reiterate that my understanding of this stuff is pretty naïve.)
A very useful resource for characteristics and comparative performance of all M variants, if anybody is interested, is https://github.com/ggml-org/llama.cpp/discussions/4167?sort=...
Its sister discussion for nvidia gpus is https://github.com/ggml-org/llama.cpp/discussions/15013
Note the drop in performance for the base (binned) m3 max version. You are better off with full m1 max than the binned m3 max, even price aside.
The issue I have with my m1 max is that with 64gb you cannot run really decent MoE models, ie the ones you can run like qwen 35B-A3B have only 3b active parameters and are much less capable than qwen 27b in my testing. So I end up running the 27b one, but it runs relatively slow (though still usable at 10-20 tok/s) and I would have been better off a used nvidia gpu setup for dense models. I assume 35B-A3B has its use cases, eg as subagents, just that I cannot find them. With a higher amount of ram I could probably run bigger MoE models which could be more comparable, though prefill would still be an issue (and prob a bigger one). The only hopeful thing is that there are performance hacks appearing (speculative decoding and prefill) that seem to start improving inference speed once getting implemented, so I am mildly hopeful.
(I must also iterate that my understanding is not very deep either)
It was super rough going to get started with them back in January, but right now the cards purrrr and I haven't even tried tuning yet. You need to use a patched vLLM image with aiter but besides that things are finally working on the ROCm front.
The results are impressive considering the amount of people trashing AMD and still trying to recommend 3090s. I hope to buy a 2nd one at some point, but I also hate the version hell of vLLM, the R9700, the ROCM version, and Qwen3.6 all not agreeing with each other. I haven't gotten vLLM to run properly for Qwen3.6, since the version that runs on a 9700 doesn't support 3.6 yet.
I'm trying to quickly hack out a optimized path for just Qwen3.6 to run against rocm natively (e.g. my own inference server for 9700s basically) and see if it can perform better than llamacpp vulkan's results.
Word of caution - the last llamacpp with good performance was b9209 from a month ago. After that, for some reason, vulkan performance dropped by 10x, which has made me lose confidence in llamacpp in the long run.
Having said all that, 3x is 96GB for 4k and peak 900 watts. A 96GB Blackwell is $12k and peak 600 watss. And they will have a similar memory throughput (minor negative to the AMD cards for split processing). It's crazy how price efficient the r9700 is compared to the Nvidia cards.
llama-server -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q6_K -c 135000 -ngl 999 -np 2 -t 16 --temp 0.0 --top-p 0.95 --top-k 20 --min-p 0.00 -b 4096 -ub 4096 --chat-template-kwargs '{"preserve_thinking": true}' -fa 1 --spec-type draft-mtp --spec-draft-n-max 2
My biggest gripe is that both pi and opencode seem to have trouble parsing the thinking blocks at times, and the model sometimes cuts-off mid-thinking or prints out weird character tokens at times. I don't know if that's because of llamacpp, pi/opencode, or qwen3.6, or some weird combination of them all, as I haven't investigated that problem fully yet.
On M5 128GB one can make use of the ram and use sparse MoE. For example, DeepSeek-V4-Flash will fit, served by DwarfStar (https://github.com/antirez/ds4). One will probably improve 2x the token/sec speed, given DS4F 13B activated params in the MoE are ~1/2 of the ~27B of the dense Qwen.
27B Of the Qwen fit even on a cheaper 24GB card, e.g. amd 7900xtx (<$1K?) or slightly dearer nvidia 3090 (with cuda). With ~900 GB/s bandwidth they will likely be ~50% faster than the M5 with 600 GB/s.
"My personal impression is that within these quantizations Qwen 3.6 27B is as good as (or maybe slightly better than) DwarfStar4. Though, I won’t be surprised if for longer context projects DS4 has an edge."
It does about 30 tok/s which is enough for me. It's about half what the online models do, but it's enough.
I've heard their 9B models are also good, but they aren't much faster if you have the ram and a nice cpu.
These qwen3.6 models are the first ones I find can do much. GPT OSS was good, and Gemma4 is better. Gemma knows more facts, but qwen3.6 is smarter.
If a model runs fast enough for your use case and does exactly what you need it to, then you don't need a much slower model that might be more accurate. If you do anything more complicated, the dense models become more necessary and they are much more computationally heavy by comparison.
On your hardware an Unsloth quant of Gemma 4 26BA4B QAT would likely give you better results, but because it has 4B active parameters instead of Qwen's 3B active parameters, it will probably run slower.
Progress marches without mercy.
https://github.com/ikawrakow/ik_llama.cpp
Edit: it's gonna be slow if you're not using any VRAM. But it's possible. Software isn't going to speed that up anytime soon, it's just a hardware bandwidth limit.
>
> --jinja for tool calling support
Pretty sure this flag hasn't done anything for a while. It's enabled by default since ~November of last year
Personally I prefer the 35B MoE model, which is fast enough to be interactively useful, and capable, but I would probably use the 27B if I wanted to generate whole applications like that.
I am unconvinced that most "local" AI applications need anything much more powerful than the Gemma 4 12B model. Local agentic coding is a small niche, but there are plenty of ways a local model can help with development tasks.
I would really like to see a 12B or 16B Qwen 3.6.
I am currently playing with Ornith 1.0 in the MoE configuration, which is based on the 35B variant of Qwen 3.5; I am not sure if it is better than the 3.6 version.
Benchmarks say it is; my own silly tests either suggest otherwise or suggest that I have to talk to it a bit differently.
I really want to have a model that i can run locally on my 24gb m4 pro mbp for when i don't have internet to connect to my 3090 running the qwen, and i love how gemma 4 models 'feel', but i can't make them be competent. I am in the middle of finetuning both qwen3.5 9B and gemma 4 12B just to try and make those bridge closer to 27B for coding/agentic tasks (and am trying to ternarize and DQT 27B so that it fits in ~9gb pre-KV).
How do you run the gemma? What do you use it for (and in what harness), maybe llama.cpp and pi-mono just aren't for this model and that's what i'm doing wrong.
I am still mostly tinkering/learning rather than spilling out code, and I feel quite slow on it. So it doesn't matter too much to me if it is really slow. More the journey than the destination if that makes sense. I'm stubborn.
I have tried the Gemma 4 12B model (Unsloth's QAT version) with search/browse tools in LM Studio and Unsloth Studio, when I am trying to understand a new thing.
Basically I get it to write introductory starter documentation for me to absorb, because my big personal problem, these days, is focussing enough to start a project and then digging in; I need the help.
I have found its limits on obscure packages (that it sometimes makes up) but before that it's a bit like stumbling on a blog post that happens to be really right for your particular need. Good enough to work through.
It's stuff I could ask Perplexity to do, or ChatGPT, to be fair, I just like LM Studio for this and have the inquisitiveness to want to run it locally.
In your case: I don't believe it's the quant. I'm sure it's the model — it has good coding knowledge but it's clearly not specialised. It might be good enough at writing Python/PHP/JavaScript at a novice level. It is also quite good on WordPress tooling and functions.
But I wouldn't bother with it for agentic coding if you've got experience elsewhere. Might be interesting to see what you can do with the 9B Ornith model?
Qwen 3.6 MoE in its Unsloth version is another matter. Impressive and I am trying to find ways to support my old brain doing what I've done before.
However, text-to-speech, speech-to-text, and non-code LLM use cases are so useful to have local, and don't require big hardware.
Having a universal reliable inference engine interface, I think, is the big unlock that needs to happen before app devs can ship these features.
Personal concrete use case: meeting recording app. This uses Parakeet + Qwen to create local transcriptions and post-cleanup, respectively.
Right now this app has to download and manage all these models, then bundle an inference engine to run them. It's a lot of code that probably should belong to the OS, or at least a standard interface.
While apps can offload some of this to llama.cpp or a similar process over http, that's another set of setup for the user to do before they can have a useful app.
Anyway, if you're getting started on a Mac, I'd suggest trying out oMLX (https://github.com/jundot/omlx) before messing with llama.cpp. In particular they have community benchmarks so you can see what kind of performance you're likely to get: https://omlx.ai/benchmarks. I wished each one had more configuration details though.
Certainly this is falsifiable easily by any of us doing it on a regular basis
> Qwen stuck in thought loops
This does happen when context is not managed effectively; creating plans, using subagents and compactions strategically resolves this
> creating plans, using subagents and compactions
Yes, these are all things that Claude Code does for you. However, for the thought loop issue, these are not the fixes. The canonical fix is to limit the number of thought tokens (llama.cpp's `--reasoning-budget`) or try to mess with the various penalty parameters. In any case, it's not a solved problem as far as I can tell.
I ran into some small problems with codex during setup and, for a few reasons, did not want to set up a cli shell with them at the time. Since I was not doing anything really serious, but just exploring a half-baked idea for an android app, I ran qwen in lms and connected it to android studio.
None of the mini projects that I have attempted ( more granular call control, silly html scrolling game, music play app ) were one shots despite carefully preparing the prompt ahead of time. Admittedly, some of it may have something to do with android studio, but I did not try it with google account yet. All took between an hour to four to generate ( prep, initial run, testing, iteration and so on ).
If it helps, miniforum AI MAX 395. I am not saying it is bad. Quite the opposite, but you want to be aware of the limitations though and plan around those.
72.06 t/s. That's the full Qwen 3.6 27B model BF16, using MTP, running on Ollama. Yes I know I should bite the bullet and get vllm running on that box.
That was, also, at a 570 watt limit: I normally run a little less, but when I first tried this I actually forgot I had set the limit to 300 (it's a hot day, I figured why fight the A/C?), and at 300 watts the same question came back at 69.38 t/s. (The extra power matters more for compute bound things, the difference in generating LTX2.3 videos is considerably higher... but still not linear.)
Jackrong has a few different ones available depending on what you're trying to do: https://huggingface.co/Jackrong
don't get me wrong, the frontier models are leaps and bounds ahead of what qwen/kimikgemma are doing - but i don't need to drive a ferrari to the grocery store everytime either.
- Memory bandwidth; BUT the requirements are currently capped because models have stopped growing at around 1-1.5 trillion parameters for quite a while now. You only need more bandwidth if you're optimizing for the highest possible concurrency (i.e. you're a cloud provider). Also, MoE exists.
- Support for native low-precision math (like FP4 and FP8); BUT once your GPU supports native FP4 (Blackwell+), there's generally no reason for GPUs to go lower because of the obvious quality degradation.
- VRAM capacity - just like memory bandwidth, it's practically capped by 1-1.5 trillion parameter models and is unlikely to need much more in the near future. Also, the current trend is toward miniaturization: modern 30B-class models (which require far less VRAM), now completely destroy 200B-class models from just two years ago on most tasks. We also have better understanding now how to compress contexts.
Most model improvements currently seem to come from RL/harness-based methods, not from scaling models or running new algorithms that require fundamentally new GPUs.
So I don't see why GPUs that exist today must become "outdated" in a few years. They'll be seen as outdated by hyperscalers because they need to serve the maximum number of users as cheaply as possible, so of course they'll replace their GPUs with newer ones that have higher memory bandwidth or more tensor cores. But you don't need that for local inference.
How does that work? They have negative GPUs now!
[0] https://deepclause.substack.com/p/how-to-make-small-models-p...
I do not have a crazy rig, a modest gaming one at that, but in trying to understand more about agents and their capabilities, I am SOL with my 16 GB of RAM and 8GB of VRAM. I can get most small, non tool calling models to perform well, but I've had major issues with anything over 9B doing anything more than reasoning (egregiously slow at higher parameter counts).
And so far, I cant get even Pi to extend itself or do any meaningful work with any of the models I currently can get to run.
I very much appreciate the frank response, as it makes me feel less defeated at knowing my understanding of how it should work is not the full issue, hahaha
You should look at gemma-4-26B-A4B. 16+8=24gb and Q4 is about 16GB. Not much context left, but might run.
For you, you could try gemma-4-26B-A4B
It seemed to get the idea of my prompt to extend the footer info (I want it to show the model abilities like tool calling or reasoning where the context percent thing is), made a plan and wrote the file, but then got hung up on implementation because it couldn’t figure out how Pi renders that part of the UI in Powershell
So possibly trying a different terminal might help on that front, haha
But certainly seems like we are a few years away from that, sadly.
Am I also screwed in being able to train my own small model or adjust another one with such a non-workhorse PC?
Was just trying to see how small I could go and get acceptable results, but yeah, larger Qwen 3.6 with MTP is going to be better. Cant wait to see how AI model (unsloth/local-llm/heretic/reaper/etc communities) are tweaking/engineering quality down into smaller models. Lots of new things coming out.
Ok that's the part I'm interested in, don't care about minesweeper clones....
> Make a landing page selling candles for women that are into wellbeing and SPA.
can't be serious...
Offloading compute to them is much easier, except its still a limited set of open models. Most companies are already running in AWS, so it's an easy add, models run in a trusted location, just another line item on the Amazon bill. You don't have to talk anyone into signing up with a new vendor. Plus you don't have to worry about local hardware at all.
Qwen 3.6 dense runs at 40tok/s
I find that for local coding, I need to spend a lot of time building concise SKILLs for specific things I work on and try to only enable one or two skills per coding session.
To the author of the linked article nice job, and if you feel like adding to it, please add details on your setup.
I feel like the amount of context bloat that OpenCode puts these small models into the dumb zone too quickly. The system prompt alone is 9k tokens, and when you add your own setup it can easily creep up to 15k.
I've seen sites here and there but they feel like quick little toys that don't get updated, so they always suggest old models.
I've been using the full GLM 5.2 model this way (through opencode) at work for the past week. It's quite impressive.
On a serious note, I run my models on desktop pc, simple api and i can use them wherever whenever.
I ran those throu opus saking if it was good advice and was not impressed:
I read the actual qr_scanner.ino. Short answer: partially, but I'd push back on most of it. That review reads like generic ESP boilerplate advice written against an imagined version of your code — several of its "fixes" are already in your file, and its headline "critical" claim misreads what the code does. Going point by point:...
The benchmark seemed fine until I saw that.
If you use sub agents, they will overwrite the cache and each request will trigger full reprocessing. Have fun with that as it will crash the t/s metrics on each prefill on top of the max 64k including input + output is a major blocker.
If you push the context higher and add parallel slots the requirements will be far higher and the numbers less shiny.
For anything else local, including writing some automation scripts and such, it works great.
Source: https://chatgpt.com/share/6a42dd8a-4e28-83e8-9ef7-6ba56d665c...
If you want to play a hyperbolic minesweeper, Hyperrogue features that https://hyperrogue.fandom.com/wiki/Minefield
It basically exploits the face that time can be traded for intelligence with local models
Even llama.cpp's bundled web UI handles it fine. Dead simple.
Which MCP server do you use?
Neither is going to return much knowledge. Basically just relevant url so you need a second tool to grab them and there bot walls get tricky
Hopefully we're looking at a future where local models become more & more realistic to use for reducing remote AOI spend.
Is there any way to use MLX and GPU at the same time? Or does memory become a big problem?
TBH, I never understood Apple hyping these neural cores because I didn't think anyone actually uses them except maybe certain photo/video editing software.
If I can generate voice at the same time as video, that would be useful.
The neural cores aren't suitable for LLMs/transformers and isn't used in LLM inference. On the M5 and later chips, it comes with neural accelerators, aka Tensor Cores, which speed up the 'prefill' (i.e. processing your context window) part, but don't do anything for inference.
The MLX vs GGUF debate is mostly irrelevant. The GGUF pathways are optimised for apple silicon to the extent of practically identical performance to MLX. MLX is just one way of using Apple GPUs, it comes with many optimisations in the box, but they're not hard and they're no longer MLX-exclusive.
I haven't seen anyone make an argument they are as good as SotA (OpenAI, Anthropic). It's just they are approaching state where they are "as good" for some _limited_ set of use cases. Which will allow us to resolve 2 primary issues with these SotA models: privacy and vendor lock-in. Plus, they're very useful for education purposes, you get to explore what things looks like under the hood, play with various models, tools, maybe put something simple together yourself.
You get Macbook - great. You got gaming rig with a decent GPU - great (set it up as a dedicated server that you connect to through simple REST).
What exactly is wrong with any of that?
Consider that there are literally trillions of dollars being wagered on this not being the future state of computing. Not even speculating that HN is being astroturfed (though I see no reason it wouldn't be by interested parties), but many of the US tech employees here have direct financial incentives in various forms to be rooting for the failure of open source and optionally local models.
For me it’s the first local model that actually makes sense as a general intelligence.
I do have access for a 64 gb ram mac mini but most people don't.
tweaking sampler might help
This part should have featured something about real work. But instead it features a paragraph about one-shot bs that creates "something".
Unless your work is to create thousands wordpress tremplates to sell - this is not a "real work".
Give it a repository (any kind of OSS project will do for an example) and a github issue requesting a knew feature or describing a confirmed bug. (you can and probably should write a prompt for LLM shough, don't just provide the issue itself)
And then whatch it go.
And then judge the result and it's quality.
Sorry, but from my experience 27B is just useless. You do get a result and some times it does work, but most of the times it is not event on junior dev level. And it takes it a lot of time to do the thing, unless you have an extremely expensive machine.
If your expectation is to treat it as a tool, then you're wrong.
I guess that's where the disconnect lies.
I already have tools for autocomplete, working with structured data and many more. Deterministic tools.
Obviously you do not expect something like that from a model with some harness. It can read some input (user's or other tools) and give you some output.
My expectation is that this tool, given some meaning full input (instructions, expectations, motivations and an optional source files to work with), will produce something that will actually be aligned with the input.
For example: consider I have a services that has some sort of events created now and then. I what those events to be available for other services. So I decide it to have a transactional outbox and an observer that will pull events from the outbox and put them into a kafka topic.
My expectation is that I can give this tool some context (source code and description), state my instructions, expectations, motivations, design decisions and have an implementation as a result.
My other expectation is that given my context etc and agent's context (skills etc) were correct and adequate - the outout will also be correct and adequate.
That will get you a near-frontier experience. DSv4 Flash launched in April with capabilities on par with GLM 5.0, which launched in February.
It's a surprising example of the recency bias to me to assume anything other than the market returning to its historic norm, even if the AI buildout doesn't slow, producers will scale factories to meet that demand.
> https://sleepingrobots.com/dreams/stop-using-ollama/
I had faced roadblocks while integrating with openclaw using ollama (Was trying to experiment with `qwen3-vl:2b`). I was tracking the issue back to openclaw at that time, I didn't even consider investigating ollama.
I attached a threads post here where I'm talking to meta ai to expand on both scenarios (not to use ollama, but llama.cpp & my take on the why this is the way it is - ie. commercial gains)
I'm running the NVFP4 alongside Gemma4 at the same quant on an OEM Spark
also i like that if i drop more sophisticated tools into my harness (e.g. any of the NLP/RAG-based search tools in place of grep/rg), the agent will actually reach for them and make progress faster; previous models have been reluctant to embrace new tools.
Lora if effective could be a great reason to run local models.
So it will be no surprise that there will be a time where everyone will be able to run a local model, say GLM 5.2 locally on their machine. Like it or not.
Qwen on the other hand got straight to work with astonishing competency on the same system.
From what I read llama3 needs beefier compute to reliably invoke tools, which I presume relates to it focussing more on simulating AGI rather than being a useful tool.
https://arena.ai/leaderboard/code/webdev/pareto?license=open...
https://arena.ai/leaderboard/text/pareto?license=open-source
200k @ K : Q5_0 V: 4_1 (which is a bit of a sweet spot)