With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop.
If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's night and day experience.
I almost always keep my laptop on low power mode.
Since you can control the low power mode setting from the command line: `sudo pmset -a lowpowermode 1`.
It should be pretty straightforward to hook this up to Hammerspoon[1] using hs.application.frontmostApplication() to apply the setting based on whatever foreground application you choose.
Thinking out loud, that being said, the necessity of sudo might make this slightly more complex. An always on background admin agent might be needed I suppose to bypass the password prompts (or add pmset to the sudoers file, if you prefer).
614 GB/s of memory bandwidth
> MacMini M4 with 64GB of RAM
273 GB/s of memory bandwidth (also only currently available with 48GB)
When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.
And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.
This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.
It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.
All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.
I do like Gemma for translation, however.
That said, the reason they're able to release Ornith branded post-trains of both Gemma and Qwen is because they're open weights under a friendly license. Someone, not just Google, could make a coding focused Gemma post-train. I don't think it's actually much weaker than Qwen 3.6 for coding; Gemma 4 31b outperforms Qwen 3.6 27b by a wide margin on security bug hunting (at least for the specific bugs in my benchmarks, which are mostly relatively difficult bugs from the Mythos-reported bugs).
I'd really love to see a bigger MoE from Google, though. A 70b or 120b MoE would likely be super fun.
So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)
I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.
There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).
There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)
But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...
Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.
There should be a lot more content on setups and best practices etc. if these macs would be used with local models only.
I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).
Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).
Don't expect workstation loads with no fan or heatsink, true. But it's not a real problem, it's still quieter than a desktop.
That said, rather than Mac Mini, if you only work from one place, I'd recommend a Studio Ultra M3 with 512GB. Same or more tokens per second, multiple models loaded. Cool and quiet.
If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.
They pulled them a month or two ago, right after I bought it.
> Apple M4 Pro chip with 14‑core CPU, 20‑core GPU, 16-core Neural Engine 64GB unified memory 2TB SSD storage 10 Gigabit Ethernet Three Thunderbolt 5 ports, HDMI port, two USB‑C ports, headphone jack Accessory Kit $2,649.00
I'm mainly interested in coding/image creation tasks. Has anyone built out a server for a similar use-case and, if so, whats your experience been? What cards should I be looking into? Am I looking at spending ~10-15k for something that can give me near frontier quality/speed? I know about the DGX Spark/Mac Mini's, but I'd like to be able to upgrade later down the road.
10x rtx6000 Pro in a large workstation is probably the way to go for someone wanting to run GLM5.2.
Other than that it is cloud.
As good as these small models got we are still not "at breakeven" for me.
What is "breakeven" with LLMs? For me it is when I no longer have to read the actual code it wrote. I can trust that if I told it to implement and document a certain architecture it actually did that with no stupid mistakes.
The first model ever that did that for me was the first opus. 4.4 if I remember correctly.
The second model was Gemini 3 Pro preview. For few weeks. Then it was lobotomised. I guess it was too expensive to run and they quantized it too hell.
Only Opus remains. If this GLM model truly rivals even an old opus I'll be very happy when day comes that I'll be able to run it locally.
You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.
$6800 is a lot of API credits for GLM, for example, on any provider you want to use.
Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.
I still am going to buy a second one haha
I'm wanting to run Kimi 2.6/2.7 GGUF on it and just slap it in the server rack, but trying to decide if a spark cluster makes more sense.
But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.
Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.
I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.
As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.
Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.
After about 1 minute the entire machine basically bricked and I had to hard reset :D
Using linux for actual work on my workstation.
if a hardware cycle takes ~3 years then fall 2026 would be the first possible device generation where apple exploits its advantage with the unified ram architecture.
more realistically, spring 2027, since they probably also needed some time to make up their minds to lean into that on the top end.
that`s also how i would interpret the recent rumors on m6 and m7.
naturally, the cooling and all that will be optimized around that.
so the first devices that are actually intended and designed for this use case will come at the earliest this fall and more likely in q1/q2 next year.
you are basically paying the price now to be on the bleeding (sweating) edge
My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.
You could run a 4-bit, which is 16-17GB. But, you'd need a smallish context or you'd need to quantize your KV cache. Something like TurboQuant or RotorQuant might help.
32GB is the lower bound for comfortably running this size model. I'd maybe even say 64GB is right-sized, because a 256k context is nice to have for agentic workflows, and that won't fit on a 32GB card without heavy quantization (but I haven't tried TurboQuant or RotorQuant to know what impact it has on memory use for context).
You could also put some of the model into system RAM, but that defeats the purpose of your argument that a 3090 will outperform a Mac Mini or Mac Studio. If part of a dense model is in system RAM, it absolutely will not outperform a recent unified memory device.
An AMD AI Pro R9700 32GB brand new is $1350 right now.
After some tweaking, I had it running faster than the models the 3090 could run, and it could obviously run with higher context limits and bigger models due to the extra vram.
But man, I have never purchased a computer which is more expensive than a decent family car.
https://www.microcenter.com/product/709071/pny-nvidia-rtx-pr...
I know you probably weren't referring to this type of memory in your post, but IMO it might be worth avoiding this term in the future unless you're referring to HBM, the standard.
Also, while memory bandwidth is important, it isn’t the only consideration. Apple’s architecture has memory bandwidth equal to a mid-range consumer GPU, but its GPU speed is much, much worse than, say, a 5080 or 5090. This translates into e.g. much slower time to first token on Mac systems compared to dedicated GPUs.
As more context will degrade a lot the t/s. On top this is 1 slot.
If you use sub agents the kv cache will be invalidated with colliding request and make it even slower.
So the in real world 256k (the max qwen offer) and using 3-4 slots the numbers are very different.
This is the major issue with so many postes over local models not benchmarking real world use. Real context and not taking this in context.
If you use 1 slot the issue, you loose the ability of using sub agents when exploring and all end up in the main agent context overloading it, triggering compactation and oh boy with 64k context that compecation will be an endless loop.
What tasks you would really be able to do with 64k context 1 agent? For sure so quick edits but not complex planning where you need to ingest a lot files and end up loosing 80% of the ingested files to compactation.
I use Windows and this has never happened to me. I have had Macbooks I cant open to fix/replace something trivial while I can replace any part easily on a Windows PC/laptop though.
needs to be noted that it's increasingly uncommon to be able to do so. for desktops you have to build everything yourself - prebuilds (either gaming or workstations) have proprietary PSU and motherboards (in case of workstations, sometimes CPU is bound to the motherboard / manufacturer, for example Threadrippers). Windows laptops now often come with soldered RAM and soon will probably be without M.2 slots like Macs.
There is Framework though I guess
Mac Studio: Ships: 16–18 weeks
Mac mini: Ships: 10–12 weeks
As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.
Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.
But you do need a fast LAN connection, otherwise working with agents will be a pain.
Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.
I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option
But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. But things move so fast in this space, I wouldn't base my decision on a generation of models that's just a few months old.
So not "implement me a shading algorithm"
But more like: make an multi user app running on a k8 cluster, design the whole thing to be indempotent, scalable, easy to deploy remotely via ipmi/pxe boot.
Then see how it makes stupid mistakes along the way.
Today's AI is pretty amazing when it comes to fixing narrow problems (or creating Web apps with no infra). Give it anything where it needs to go online, download some helm templates and look through them to figure out parameters, as well as write an app and it will make lots of mistakes in seemingly simple stuff.
Opus seems to be the model that works the best with this.
Wouldn't this damage the MBP display?
My RTX laptop has air intake underneath the keyboard and clamshell mode is surely a recipe for disaster; I've taken numerous measures to ensure that the laptop doesn't stay awake when the lid is down.
It’s just so flexible, and I even use it in agent mode (ds4) directly on the machine as well sometimes (it’s really not that bad, I’m often running inference for small side projects on my couch), if there is another machine that can do all of this and still function as one of the more ergonomic, well built, and compact laptops out there, I’d love to hear what it is cause I’d likely be interested!
Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.
- M3 Pro MacBook Pro 36GB
- M2 Pro MacBook Pro 16GB
- Mac Studio M4 Max 48GB
and I have not heard the fans on any of them with normal use. The only time I've ever heard automatic fans was when I was using a local 12B model on the M3 MacBook Pro, and when running 70B models on the Studio.
You should consider checking Activity Monitor and making sure that the usual suspects are not causing issues with sustained high CPU. And you can use an app like [Stats](https://mac-stats.com) if you want to see that info while actively using the computer.
While it is wild to have this much power in a take-it-anywhere laptop form factor, I sort of regret not just going for a Mac Studio + base M5 MBP.
llama.cpp's Metal backend does use them when they're available.
How is this config?
qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).
I'm running this model on a Framework 13 and the chassis barely heats up at all while running full tilt.
to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.
Im sorry, but its time to start calling Apple sycophants out. Stop trying to push your tech jewelry on other people. You only buy those computers because they are Apple, you don't know anything about computing or running LLMs, you don't do any real work, so you should probably not give advice on what to buy.
A single 3090 will run Qwen3.6 27b fine, and its VRAM speed is twice of what the best Mac has. And the build will be cheaper. Decent CPU/Motherboard, 32gb of DDR4 ram, an SSD and a Single 3090 should run max about $4grand. Mac m4 mini is 6grand.
Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.
Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.
I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.
I am not a hater of Nvidia and I am planning on building a workstation based on RTX cards. You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).
I am pretty sure I know a thing or two about computing, I have been in the trenches for many, many years and I have had machines of all kinds, shapes and colors. It just so happens that Macs are very capable, very convenient machines that happen to work great in the era of LLMs, too.
But you do you.
The only reason I can tell it's on, is the very quiet hum of the slow speed water pump. Large fans run at 1200rpm and are fully quiet.
I have over a meter of radiators there.
Fun fact, I bought my first rtx3090 4 years ago. A year ago I bought another one and they are still the same price used.
I may buy another one (for my servers)
If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff.
But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.