undefined

upvote

points

by iagooar1 days ago |

upvote

by jasonjmcghee15 hours ago|

[-]

I'm surprised no one has else has mentioned - low power mode.

With no speculative decoding, using high power mode, I get 80 t/s on 35B A3B - and it gets hot and spins up. On low power mode I get 38 t/s - no fans, cool to warm laptop.

If you currently don't use speculative decoding and you start using it, it can nearly offset the difference between high and low power, and it's night and day experience.

I almost always keep my laptop on low power mode.

reply

upvote

by html5cat12 hours ago|

[-]

Awesome idea! Will try it out. Wish there was a way to enable low power on a per-app basis. Scrolling and reading on low power mode is really annoying.

reply

upvote

by the_lucifer3 hours ago|

[-]

> Wish there was a way to enable low power on a per-app basis.

Since you can control the low power mode setting from the command line: `sudo pmset -a lowpowermode 1`.

It should be pretty straightforward to hook this up to Hammerspoon[1] using hs.application.frontmostApplication() to apply the setting based on whatever foreground application you choose.

Thinking out loud, that being said, the necessity of sudo might make this slightly more complex. An always on background admin agent might be needed I suppose to bypass the password prompts (or add pmset to the sudoers file, if you prefer).

[1]: https://www.hammerspoon.org/

reply

upvote

by kmacdough2 hours ago|

[-]

Unfortunately doesn't cover scrolling HN while the agent toils away.

reply

upvote

by anon37383914 hours ago|

[-]

Can you mention what inference stack you're using? I've tried MTP several times with that model and it always seems to significantly cut my token generation speed from ~60 tokens/sec to ~40 (M3 Max).

reply

upvote

by c1611 hours ago|

[-]

Will give this a try later. Enjoy working with A3B Coder, but the heat coming out my 32gb M5 is a lot. This might be the trick - Thanks!

reply

upvote

by mycall12 hours ago|

[-]

It is less efficient use of the GPU and uses more electricity overall, no?

reply

upvote

by spider-mario7 hours ago|

[-]

Oh no, 0.6 kWh a day!

reply

upvote

by bigyabai2 hours ago|

[-]

Yes, this is a tradeoff that foregoes the efficiency of race-to-idle.

reply

upvote

by astrostl20 hours ago|

[-]

> MacBook Pro M5 128GB RAM

614 GB/s of memory bandwidth

> MacMini M4 with 64GB of RAM

273 GB/s of memory bandwidth (also only currently available with 48GB)

When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible. In this case a hypothetical Mini with 1TB of memory would still be over 2x slower with 27-35B models.

And FWIW I have an M4 Max MBP 128GB that I keep on a Roost laptop stand, with a separate keyboard/mouse/video. It does fire up the cooling jets when running local LLMs, but stays within tolerance for me on noise. I haven't heat-tested it on longer runs, but I imagine the risen airflow helps a ton.

reply

upvote

by iagooar12 hours ago|

[-]

On paper the M4 should be roughly 1/3 of the M5, in practice it is only 1/2. With the right, optimized model like qwen3.6 35B MoE MLX you can get over 40 tok / sec on it. I run dozens of background jobs that are not time-critical on it.

reply

upvote

by bfjvibybd6cuvu69 hours ago|

[-]

What kind of jobs?

reply

upvote

by bigyabai19 hours ago|

[-]

> When it comes to inference speed, you want your model to fit in memory, and then to have as much memory bandwidth as possible.

This is only true when your GPU isn't bottlenecked building a KV cache, which it usually will be on Apple Silicon. The Achilles heel of the M-series chips are their weak, SOC-grade GPU that holds back the Max and Ultra models from having interactive TTFTs on larger models and contexts.

reply

upvote

by fancyfredbot5 hours ago|

[-]

Normally people refer to the compute-bound phase as "prefill". Nothing wrong with saying it's building the kv cache though, it's accurate just unusual.

reply

upvote

by SwellJoe23 hours ago|

[-]

I opted to buy a normal 32GB laptop for this very reason. I know how loud and hot the GPUs in my desktop run when running even smallish models like Qwen 27B or Gemma 4 31B (which is a better model for most than Qwen 3.6, despite the benchmarks). I also have a Strix Halo which doesn't get loud, because it has a single huge fan, but it does get hot. So, there's no way a laptop could work as hard as models make them work, and not be unbearable. Tiny fans trying to remove all that heat? They gotta be screaming. No reason to spend all that money on a laptop that I couldn't realistically make use of. I do run a lot of VMs on my desktop, but I can get to those on a VPN.

It's a nice idea to run a model on a laptop so you can work anywhere...but, that's a job for models in the cloud. Not much data has to traverse the network, so it's not a big deal. Or one could also setup a VPN so you can reach a self-hosted model on a big box at home for things that require data privacy.

All that said, there are models that work great on very small devices for some tasks and won't work it to death. Gemma 4 12B QAT 4-bit runs on a 16GB device, maybe even smaller, including a tablet. It's the best self-hostable vision model I've tested for my purposes (categorization, identification, labeling, type stuff), beating much larger models. It's also a decent conversationalist with good prose but it doesn't know much of anything (not a lot of the world fits in 7GB), so it needs search if you want to use it for research. It's a pretty good tool user. I definitely wouldn't want to use it for code, though, beyond very simple stuff.

reply

upvote

by girvo22 hours ago|

[-]

Gemma is better than Qwen at everything except coding, in all my evaluations. Which is a shame because that is what I use them for!

reply

upvote

by automajicly3 hours ago|

[-]

I have a M1 Macbook Pro...with only 16gb and I struggled with Qwens2.5-14b trying to do large projects. I loved Qwen but I had to try and do something different. So I switched to Gemma4-12b which looking at it now, seems more like a downgrade than an upgrade.Can you refer me to any Qwen coding models that wont choke my poor 16gb and also connect contextually? I need that context. I love the laser point focus, but I need context and basic understanding of that context.

reply

upvote

by lambda4 hours ago|

[-]

I haven't run a proper eval, but I've been getting better luck with Qwen models than Gemma on plant and animal identification using vision.

I do like Gemma for translation, however.

reply

upvote

by UncleOxidant20 hours ago|

[-]

It would be great if the Gemma folks would release a code-focused model. Probably won't happen, but it's fun to dream.

reply

upvote

by SwellJoe20 hours ago|

[-]

The Ornith folks say they're doing that, but haven't released the Gemma-based 31b yet (https://github.com/deepreinforce-ai/Ornith-1). But, also, the Qwen-based 35b MoE Ornith version performs worse than Qwen 3.6 and Qwen AgentWorld on my benchmarks (which are focused on finding security bugs, so not exactly the same as agentic coding, but closely related skills).

That said, the reason they're able to release Ornith branded post-trains of both Gemma and Qwen is because they're open weights under a friendly license. Someone, not just Google, could make a coding focused Gemma post-train. I don't think it's actually much weaker than Qwen 3.6 for coding; Gemma 4 31b outperforms Qwen 3.6 27b by a wide margin on security bug hunting (at least for the specific bugs in my benchmarks, which are mostly relatively difficult bugs from the Mythos-reported bugs).

I'd really love to see a bigger MoE from Google, though. A 70b or 120b MoE would likely be super fun.

reply

upvote

by urbsgpw7 hours ago|

[-]

Ya, doesn't seem to be google's focus at all, right?

reply

upvote

by ekianjo19 hours ago|

[-]

gemma is also worse for tool calling. not just coding

reply

upvote

by satvikpendem18 hours ago|

[-]

That is because they use a different tool calling format than most other models. Unsloth quants fix this in their Gemma releases.

reply

upvote

by feffe8 hours ago|

[-]

I've never been able to fix the tool calling issues. Running unsloth versions with llama.cpp, constant issues. Have tried many forum fixes, including lots of fixed chat templates, to no avail. It's mostly the edit call that breaks, which often results in "let me just rewrite the whole file from context".

reply

upvote

by stevenhubertron17 hours ago|

[-]

Can you say a bit more about this? The bad tool calling has made me give up on using Gemma for my Hermes and a personal recipe site. I have only downloaded from Ollama.

reply

upvote

by satvikpendem16 hours ago|

[-]

Ollama is not recommended [0], use llama.cpp or more specifically Unsloth Studio which wraps llama.cpp and which has an API mode you can use to hook into Hermes or another agent. Unsloth make both the Studio and the quants which fix various issues with many models [1] as well as implementing new features like MTP and QAT support much sooner than other teams. In general you should read r/LocalLLaMa as it has a lot of updates regarding local models as the field moves fast.

[0] https://sleepingrobots.com/dreams/stop-using-ollama/

[1] https://github.com/unslothai/unsloth/discussions/4921

reply

upvote

by 20 hours ago|

[-]

deleted

reply

upvote

by mycall12 hours ago|

[-]

You can limit TDP on Strix Halo so it runs between 32 and 45W which seems to be the sweet spot for heat vs speed.

reply

upvote

by andai23 hours ago|

[-]

> The reason is simple: your fingers will burn and your head will explode from the noise.

So, just buy a mac mini and put it in the other room? ( Like everyone was doing in February? :)

I've been running coding agents on my laptop in yolo mode for the past half year or so (though mostly not local ones, laptop too slow!) and the way I'm doing that without terror is that I just gave them their own Linux user "agent". They're free to nuke their homedir /agent, and they can't touch (or even read) mine.

There's some slight ergonomics issues (I need to sudo into the user to do anything, but I set up an alias for it), sometimes I get issues with permissions or ownership (gave up on "sticky bits" and just made a function I can run once a day when it breaks).

There's enough hassle that I wish I just had a dedicated machine for it, and then I'd just give them root on it. (For giggles I gave claude root on a $3 VPS and that's going just fine...)

But yeah after months of trial and error I reinvented "just buy a mac mini" from first principles...

reply

upvote

by iagooar23 hours ago|

[-]

Just buy a Mac Mini really is good advice if you want to get into real, always-on convenient agentic work.

Soon it is going to be good even for coding using local LLMs. Until then, just run API models on it for coding, local LLMs for "knowledge" work or daily driver agent like Hermes.

reply

upvote

by marcuskaz22 hours ago|

[-]

Except they're not available, 3-4 month wait time.

reply

upvote

by KiwiJohnno19 hours ago|

[-]

I ordered a mac mini m4 pro with 48 gb of ram a couple of weeks ago. Apple said 8-9 weeks.

reply

upvote

by iagooar22 hours ago|

[-]

Buy a refurished or 2nd hand one.

reply

upvote

by 1over13721 hours ago|

[-]

Also not really available.

reply

upvote

by klardotsh20 hours ago|

[-]

Especially with anything resembling a usable amount of RAM. Mac Minis and Studios >=64GB are basically permanently sold out everywhere, because everyone, including commercial entities with deeper pockets than most of us plebs, has the exact same idea at the exact same time.

reply

upvote

by chiply3147 hours ago|

[-]

I think more bought them to run their Clawed on it but still with external LLM calls.

There should be a lot more content on setups and best practices etc. if these macs would be used with local models only.

reply

upvote

by 22 hours ago|

[-]

deleted

reply

upvote

by 22 hours ago|

[-]

deleted

reply

upvote

by roadside_picnic20 hours ago|

[-]

In general if you're setting up a local LLM you should assume it's going to be primarily working as a server and talking to various clients. I use my MBP, but that's because I don't travel much anymore so it can happily work as a server at all times. With the right agent setup you can probably manage most things from your phone even if you don't have a seperate machine to use as a client.

I have an older laptop I run a hermes agent on backed by an API based open (non-local) model and Macbook Pro M4 for running another model locally (also using hermes). The agents have a Mattermost (open source version of slack) server they run and I run Mattermost on my phone so I can talk to them and task them with things. In fact, it was through the hermes WhatsApp endpoint that I got the first agent (non-local) to setup the Mattermost server and unboard the second agent (local mbp).

Then I can just chat with them through Mattermost when I need work done. Whenever I need something done I just hope on the Mattermost server and chat with them. I've had them build me multiple research reports (the fully local agent did awesome at this), learn how to use Stable Diffusion on my desktop to generate images, install and perform maintenance on various local services I run (including Open WebUI).

reply

upvote

by jtbaker19 hours ago|

[-]

Nope, have both these machines, can confirm the M5 max blows the M4 mini away. It does get hot, but I use it mostly with an external monitor and keyboard. Conceptually I like the headless model better with a workstation, but work was buying the M5 and can't get it in any other form factor at the monute.

reply

upvote

by Terretta8 hours ago|

[-]

I have that model, and do local LLMs and local image generation. DO buy this if you plan on serious local LLM use and enjoy working from anywhere.

Don't expect workstation loads with no fan or heatsink, true. But it's not a real problem, it's still quieter than a desktop.

That said, rather than Mac Mini, if you only work from one place, I'd recommend a Studio Ultra M3 with 512GB. Same or more tokens per second, multiple models loaded. Cool and quiet.

reply

upvote

by 827a19 hours ago|

[-]

Apple does not sell a 64GB variant of the M4 Mac Mini. IIRC they never have; its always capped out at 48GB.

If you were planning on getting an M5 128GB; just get a DGX Spark (~$4500) or a 5090-equipped machine (~$4500) plus a Macbook Air (~$1500). You'll come in below the M5 Max 128 pricing (~$6700+ USD) and be happier for it.

reply

upvote

by mkesper9 hours ago|

[-]

DGX Spark does not have high memory bandwith. M3 Max (Mac Studio) features more memory bandwith than that one. See https://aimultiple.com/dgx-spark-alternatives

reply

upvote

by angoragoats19 hours ago|

[-]

The Mac mini was available with 64GB of RAM literally 4 days ago; the option was discontinued on June 25th.

reply

upvote

by ozim14 hours ago|

[-]

DGX Spark everyone is saying performance for the money is not there

reply

upvote

by Foobar856813 hours ago|

[-]

I have an access to a DGX spark, and while it performs better than my MacBook Pro (M3 Max), the performance on Qwen and Gemma dense models is dog shit, and not worth it.

reply

upvote

by icedchai4 hours ago|

[-]

Performance with Strix Halo isn't there, either. At least I got mine relatively cheap in 2025, before the run up in prices...

reply

upvote

by ozim2 hours ago|

[-]

Not fun part is I didn’t get mine and I don’t think prices will go down in next 5 years.

reply

upvote

by dd8601fn16 hours ago|

[-]

I'm using a 64GB M4 Mac Mini.

They pulled them a month or two ago, right after I bought it.

reply

upvote

by dgacmu18 hours ago|

[-]

That's incorrect, I have one on my desk right now. They've stopped selling it now, but I got one a year and a half ago:

> Apple M4 Pro chip with 14‑core CPU, 20‑core GPU, 16-core Neural Engine 64GB unified memory 2TB SSD storage 10 Gigabit Ethernet Three Thunderbolt 5 ports, HDMI port, two USB‑C ports, headphone jack Accessory Kit $2,649.00

reply

upvote

by lprd2 hours ago|

[-]

Yikes! I've been needing an upgrade, and I was on the fence between a specc'd out MBP, or building out a AI server and delegating tasks to it over Netbird/Tailscale to my homelab.

I'm mainly interested in coding/image creation tasks. Has anyone built out a server for a similar use-case and, if so, whats your experience been? What cards should I be looking into? Am I looking at spending ~10-15k for something that can give me near frontier quality/speed? I know about the DGX Spark/Mac Mini's, but I'd like to be able to upgrade later down the road.

reply

upvote

by Roark665 hours ago|

[-]

I think there is no reasonably priced machine you could run locally to do serious work with LLMs...

10x rtx6000 Pro in a large workstation is probably the way to go for someone wanting to run GLM5.2.

Other than that it is cloud.

As good as these small models got we are still not "at breakeven" for me.

What is "breakeven" with LLMs? For me it is when I no longer have to read the actual code it wrote. I can trust that if I told it to implement and document a certain architecture it actually did that with no stupid mistakes.

The first model ever that did that for me was the first opus. 4.4 if I remember correctly.

The second model was Gemini 3 Pro preview. For few weeks. Then it was lobotomised. I guess it was too expensive to run and they quantized it too hell.

Only Opus remains. If this GLM model truly rivals even an old opus I'll be very happy when day comes that I'll be able to run it locally.

reply

upvote

by acters1 days ago|

[-]

Would the new upcoming AMD AI ryzen halo desktop be a better value offer? or dgx spark?

You would have to get a third party reseller/scalper or refurbished mac mini to get 64gb of ram ever since apple stopped selling it.

reply

upvote

by girvo22 hours ago|

[-]

My GB10 Spark-alike is absolutely amazingly fun… but it is not cost effective. Step 3.7 Flash is shockingly capable (IQ4_XS and used for web dev mainly), but it cost me $6800 AUD. They’re even more expensive now. The numbers just don’t make sense: with proper triple head MTP I can get it up to ~40tk/s decode and it runs at around 1000+ tk/s prefill.

$6800 is a lot of API credits for GLM, for example, on any provider you want to use.

Now being able to run models uncensored and with privacy has value! But the cost for these is rough today.

I still am going to buy a second one haha

reply

upvote

by c7b22 hours ago|

[-]

My 2c: you don't need the Strix Halo desktop, the chip comes in many rigs, most of them cheaper, the performance difference isn't worth it. It used to be half the price of a DGX Spark or a Mac with 128GB RAM. If you can still find it at that price I'd say it's the best bang for your buck. Otherwise, Macs have 2-3x the memory bandwidth of the DGX Spark, depending on the chip, so I'd prefer them. Unless you're planning on building a cluster. The DGX Spark has two 100GB/s connectors, ideal for clustering. But I haven't checked what else you could get for the price of two DGX Sparks.

reply

upvote

by brandensilva17 hours ago|

[-]

Thoughts on a M5 Ultra 768GB if it drops? What's the price to make it worth it for you over a spark cluster?

I'm wanting to run Kimi 2.6/2.7 GGUF on it and just slap it in the server rack, but trying to decide if a spark cluster makes more sense.

reply

upvote

by PeterStuer12 hours ago|

[-]

The M3 with 512GB is currently sitting at around 30K, used. You can extrapolate from there.

reply

upvote

by brandensilva7 hours ago|

[-]

[dead]

reply

upvote

by lee_ars23 hours ago|

[-]

I'm currently fiddling with a DGX Spark and Qwen3.6-35B-A3B (specifically Qwen3.6-35B-A3B-NVFP4 under vLLM, with EAGLE3 speculative decoding via eagle3-dogacel-vllm), and it's pretty okay in terms of smarts. The speed is relatively usable at about 50 tok/sec with a 256k context window, and it's definitely smart enough to one-shot some basic coding tasks. I had it doing reverse engineering/disassembly of some ancient MS-DOS assembly language games from the 80s and it handled the task well and produced good outputs.

But it's also really easy to trip up. I fed it some of my Ars pieces and asked it to analyze themes and composition, and it got into a looping argument with me over how it was unable to analyze "my" writing because "the user cannot be the article author, the user is the user, the user did not write the article, the article author wrote the article." I was utterly unable to convince it that I was in fact me.

Qwen3.6-35B-A3B hums along at about 50GB of RAM used with --gpu-memory-utilization=0.42. I haven't tried Qwen3.6-27B (I'd likely grab Qwen3.6-27B-FP8, I think), but I'm curious to see if it makes much of a difference.

reply

upvote

by coder54320 hours ago|

[-]

Compared to a dynamic quant like Unsloth's UD-Q4_K_XL, which keeps some important parameters in higher precision, a basic NVFP4 quant seems to do a lot more damage to the model unless it is carefully calibrated.

I would recommend using llama-server if you're just on a single Spark. You get access to dynamic quants like that more easily, the performance is not that different from vLLM most of the time these days, and it is much faster and easier to switch between models.

As far as intelligence goes, Qwen3.6-27B is much smarter than the 35B-A3B model, but that's also not the sort of thing to argue with an AI model about in the first place. Just open a new chat and try again.

Gemma-4-31B is not as good at agentic use cases as Qwen3.6-27B, but it is a fairly balanced model overall, and worth trying out too. Its MTP can nearly triple the performance of the model, where the benefits of MTP or Eagle seem more limited for Qwen3.6-27B in my testing, maybe doubling the speed.

reply

upvote

by cpburns200920 hours ago|

[-]

Looping is a common problem with the Qwen models. I've had good luck using --repeat-penalty=1.1 with llama.cpp and 27B. vLLM should have a similar option.

reply

upvote

by etdznots4 hours ago|

[-]

This is the default value!

reply

upvote

by cpburns20093 hours ago|

[-]

Llama.cpp defaults to 1.0 (disabled) and so does vLLM. It looks like only ollama defaults to 1.1.

reply

upvote

by 7 hours ago|

[-]

deleted

reply

upvote

by rnxrx22 hours ago|

[-]

There are also nvfp4 quants of Qwen 3.6 27/35 floating around. I've done benchmarks of both and the quality difference vs fp8/bf16 was barely notable. Honestly the nvfp4 capability is the most interesting feature of the Spark (at least for me).

reply

upvote

by anon37383921 hours ago|

[-]

I use Qwen 3.6 35B-A3B constantly, but I don’t see the type of behavior you mentioned. I’m using Unsloth’s Q8_K_XL quant.

reply

upvote

by gnerd0019 hours ago|

[-]

`llama-server` looping mitigations --repeat-penalty something greater than 1.0, set reasoning/thinking OFF explicitly, prefer a gguf with more than 4bit quant

reply

upvote

by pkroll23 hours ago|

[-]

Check the LLM benchmarks once it's out: it's such a common use case for these kinds of machines, you won't be waiting long.

reply

upvote

by swang1 days ago|

[-]

I have an M4 Max and when I was trying out local LLM work with pi it has probably felt like the hottest I've ever felt any kind of Macbook be. I could feel the radiated heat off it even a few inches away. Honestly felt hotter than any Intel Macbook I've used. Because of that I stopped as I didn't want to harm my laptop in case I need to hold it for 10 years due to all the supply issues/price increases.

reply

upvote

by dimitrios123 hours ago|

[-]

I tried to run it on a M4 Air for shits and giggles.

After about 1 minute the entire machine basically bricked and I had to hard reset :D

reply

upvote

by stiray2 hours ago|

[-]

I am using MacBook Pro M4 with 64GB of RAM and I have it on direct path of air conditioning airflow, 40ish cm from the device, while running LM Studio opened to network. No noise, not hot to the touch.

Using linux for actual work on my workstation.

reply

upvote

by HSO12 hours ago|

[-]

running potentially sota open-weight models locally only became a thing in fall 2023.

if a hardware cycle takes ~3 years then fall 2026 would be the first possible device generation where apple exploits its advantage with the unified ram architecture.

more realistically, spring 2027, since they probably also needed some time to make up their minds to lean into that on the top end.

that`s also how i would interpret the recent rumors on m6 and m7.

naturally, the cooling and all that will be optimized around that.

so the first devices that are actually intended and designed for this use case will come at the earliest this fall and more likely in q1/q2 next year.

you are basically paying the price now to be on the bleeding (sweating) edge

reply

upvote

by somewhatrandom921 hours ago|

[-]

Try using DwarfStar 4 and use the --power flag: https://github.com/antirez/ds4#reducing-heat-power-usage-and...

reply

upvote

by pantulis11 hours ago|

[-]

DwarfStar is the only thing I've run that doesn't try and make my Mac Studio 128GB take off. Yes, it gets hot while doing inference but quickly cools down when idling, something I haven't experienced with Ollama, LMStudio or OMLX.

reply

upvote

by boomskats21 hours ago|

[-]

Can you run Qwen 3.6 27B on antirez/ds4 now? I thought it was all about the DeepSeek models.

reply

upvote

by somewhatrandom921 hours ago|

[-]

No, I don't think Qwen, but I believe he may try and put some version of GLM in it.

reply

upvote

by c7b22 hours ago|

[-]

This. Do consider local LLMs, but set aside a dedicated machine for it. Connect via VPN or reverse proxy. If it's not a Mac them I'd also put a server distro on it. No need for a desktop environment, save your RAM.

reply

upvote

by tedivm22 hours ago|

[-]

I have a Linux box with two 3090s and it's been great for running Qwen3.6 27b. I lowered the power on each card down to 250w, and then built a small ducting/fan system to vent the waste heat outside. The machine is pretty much silent, and I'm still getting 110 tokens per second out of it for coding tasks.

https://github.com/tedivm/qwen36-27b-docker

reply

upvote

by drnick11 hours ago|

[-]

How useful is the second 3090 in this setup? I run the 5-bit quantized model on a single 3090. Does the second 3090 allow you to use the full precision model instead or a less aggressive quantization by splitting the layers? What about running the 35B model instead?

reply

upvote

by urbsgpw7 hours ago|

[-]

But is Qwen3.6 27B actually worth this investment? If I had to guess you still use SOTA for architectural/planning work?

reply

upvote

by geophile23 hours ago|

[-]

That's exactly what I'm doing -- Mini M4 Pro 64GB, qwen3.6.

My hearing is not great, but I think I would have noticed the fan, and I have never heard it. In fact, I had to google to find out if it even has a fan.

reply

upvote

by trollbridge17 hours ago|

[-]

I'm still kicking myself for buying a 32GB M1 Max Studio two years ago when it wouldn't have been that difficult to get a 64GB instead.

reply

upvote

by oceanplexian1 days ago|

[-]

If you want to do coding with a local LLM your best bet is a 6 year old Nvidia 3090 which is substantially more powerful than the highest end overhyped Apple product for 1/5th the price.

reply

upvote

by chorizo23 hours ago|

[-]

That’s 24GB VRAM. Not enough to run a 27B model at a useful quant+context size.

reply

upvote

by nsbk22 hours ago|

[-]

I beg to differ. Have a look at this repo with single/double 3090 optimized configs for Qwen and Gema models: https://github.com/noonghunna/club-3090

reply

upvote

by sanderjd23 hours ago|

[-]

Yeah seems to me like the mac studios with the unified memory architecture are genuinely good bang for the buck at the moment, because of this memory size consideration?

reply

upvote

by 22 hours ago|

[-]

deleted

reply

upvote

by SkitterKherpi23 hours ago|

[-]

You can run 8bit 27B models at 24GB, it's definitely enough for the model size.

reply

upvote

by SwellJoe22 hours ago|

[-]

The 8-bit quantized 27B Qwen 3.6 is 29GB. You absolutely cannot run that entirely on a 24GB GPU.

You could run a 4-bit, which is 16-17GB. But, you'd need a smallish context or you'd need to quantize your KV cache. Something like TurboQuant or RotorQuant might help.

32GB is the lower bound for comfortably running this size model. I'd maybe even say 64GB is right-sized, because a 256k context is nice to have for agentic workflows, and that won't fit on a 32GB card without heavy quantization (but I haven't tried TurboQuant or RotorQuant to know what impact it has on memory use for context).

You could also put some of the model into system RAM, but that defeats the purpose of your argument that a 3090 will outperform a Mac Mini or Mac Studio. If part of a dense model is in system RAM, it absolutely will not outperform a recent unified memory device.

reply

upvote

by cpburns200922 hours ago|

[-]

A 32gb card does run it nicely. I use unsloth's UD-Q5_K_XL at 256k context (k/v at q8_0), and get ~67 t/s on a 5090. I still need to look into MTP.

reply

upvote

by adornKey11 hours ago|

[-]

Nice. I used Q4_K_M to have some headroom. But yours seems to fit nicely.

reply

upvote

by pbgcp202620 hours ago|

[-]

[dead]

reply

upvote

by bityard23 hours ago|

[-]

Quantization is a trade-off, though. The quality, while still perhaps good enough for many tasks, is not as good as the full 16-bit weights that the model was designed for/released with.

reply

upvote

by pbgcp202620 hours ago|

[-]

[dead]

reply

upvote

by barbacoa21 hours ago|

[-]

I'm running qwen 3.6 27b at 8bit quantization and 262k context. It takes 53gb of vram on my system.

reply

upvote

by jnovek23 hours ago|

[-]

I think that’s only true for MoE models. A dense model like 3.6 27b will require more (plus a KV store).

reply

upvote

by bityard23 hours ago|

[-]

No, even MoE models need to fit into (V)RAM. MoE has faster inference because only a subset of layers are used to predict the next token, but the set of layers used changes with every token.

reply

upvote

by angoragoats19 hours ago|

[-]

So buy two.

reply

upvote

by ThunderSizzle9 hours ago|

[-]

The cheapest 3090s I could find with any sort of guarantee were pushing $1500.

An AMD AI Pro R9700 32GB brand new is $1350 right now.

After some tweaking, I had it running faster than the models the 3090 could run, and it could obviously run with higher context limits and bigger models due to the extra vram.

reply

upvote

by iagooar23 hours ago|

[-]

My problem is I won't accept anything lower than the 96GB the RTX Pro 6000 Blackwell has. My dream is a workstation with 2x Pro 6000 to run DeepSeek v4 Flash comfortably, possibly qwen 3.6 / ornith on turbo speed.

But man, I have never purchased a computer which is more expensive than a decent family car.

reply

upvote

by d0gsg0w00f17 hours ago|

[-]

I had this dream too. My 2xDGX Sparks arrive in my reality on Monday.

reply

upvote

by jnovek23 hours ago|

[-]

An M1 Ultra has 800gbps unified memory. It’s nothing to do with Apple, it’s their microarchitecture. They’re just about the only game in town with high-bandwidth memory if you want >24GB (for less than $10k, anyway).

reply

upvote

by murderfs21 hours ago|

[-]

A 5090 gets you 32GB with 1.8 TB/s of memory bandwidth for ~$4k, RTX A6000 gets you 48GB at 768 GB/s for ~$3.5k, 2x 3090 gets you 48GB for $2000 or so, and if you're willing to go into the wilderness, there are much cheaper options like the AMD MI50.

reply

upvote

by jtbaker17 hours ago|

[-]

The RTX 5000 Pro 72GB seems like kind of a sleeper to me, and sips < 300W of power, approx 1/2 that of its big bro the RTX 6000. Kind of dream about installing it in a 10" rack, it seems like it might be able to work? @jeffgeerling you out there?

https://www.microcenter.com/product/709071/pny-nvidia-rtx-pr...

reply

upvote

by angoragoats1 hours ago|

[-]

I'd also like to call out that "high bandwidth memory" (HBM) is a specifically defined thing[0], and is used in high end GPUs, and notably not used in Apple's machines.

I know you probably weren't referring to this type of memory in your post, but IMO it might be worth avoiding this term in the future unless you're referring to HBM, the standard.

[0] https://en.wikipedia.org/wiki/High_Bandwidth_Memory

reply

upvote

by angoragoats19 hours ago|

[-]

Yeah this is just not the case at all; a 5090 or any of the recent nvidia workstation cards all fit this criteria.

Also, while memory bandwidth is important, it isn’t the only consideration. Apple’s architecture has memory bandwidth equal to a mid-range consumer GPU, but its GPU speed is much, much worse than, say, a 5080 or 5090. This translates into e.g. much slower time to first token on Mac systems compared to dedicated GPUs.

reply

upvote

by dheera22 hours ago|

[-]

32GB V100

reply

upvote

by t0mpr1c315 hours ago|

[-]

Meh. I'd rather have 2x RTX 5060 Ti.

reply

upvote

by overgard22 hours ago|

[-]

I'm running an M5 Max 128GB with Qwen 3.6 and unreal engine in the background and it seems to be ok for me. Quite a power drain if it's not plugged in but I haven't seen any thermal issues.

reply

upvote

by amatecha17 hours ago|

[-]

I wonder if that's why there is such a good selection of 128gb M5 MBP's on the Apple Certified Refurbished store lol https://www.apple.com/ca/shop/refurbished/mac/macbook-pro-12...

reply

upvote

by sixothree14 hours ago|

[-]

Wait. Did they raise their prices a second time?

reply

upvote

by nirvdrum14 hours ago|

[-]

Probably USD vs CAD. The parent posted a /ca/ link, which will look really similar to /us/, but the prices will all appear to be higher.

reply

upvote

by sixothree14 hours ago|

[-]

Ah. Thank you. It seemed pretty sticky too, navigating the items via my previous orders even persisted the currency.

reply

upvote

by blagui6 hours ago|

[-]

So the sweet spot for dev in 2026 is 64k context windows? Are we back in 2024?

As more context will degrade a lot the t/s. On top this is 1 slot.

If you use sub agents the kv cache will be invalidated with colliding request and make it even slower.

So the in real world 256k (the max qwen offer) and using 3-4 slots the numbers are very different.

This is the major issue with so many postes over local models not benchmarking real world use. Real context and not taking this in context.

If you use 1 slot the issue, you loose the ability of using sub agents when exploring and all end up in the main agent context overloading it, triggering compactation and oh boy with 64k context that compecation will be an endless loop.

What tasks you would really be able to do with 64k context 1 agent? For sure so quick edits but not complex planning where you need to ingest a lot files and end up loosing 80% of the ingested files to compactation.

reply

upvote

by PeterStuer12 hours ago|

[-]

No laptop is thermally designed to handle sustained high workloads. The whole point of a laptop is to keep it thin, quiet and light, the exact opposite of what cooling needs.

reply

upvote

by Arubis1 days ago|

[-]

Don't forget that your OLED screen will start to color-shift as the heat cooks the panel!

reply

upvote

by manmal1 days ago|

[-]

There is no MacBook Pro with OLED (yet).

reply

upvote

by Arubis1 days ago|

[-]

My mistake on tech; it’s a beautiful display. Alas, I speak from experience when it comes to the thermally-caused color shift. Hopefully it’ll be AppleCare covered.

reply

upvote

by b3ing5 hours ago|

[-]

You can use a fan app to ramp up how fast the fans spin instead of the default so you can prevent any throttling

reply

upvote

by trollbridge17 hours ago|

[-]

Or just buy an R9700 and put it in the basement?

reply

upvote

by xd193623 hours ago|

[-]

Apple does not currently sell a Mac Mini with 64GB RAM.

reply

upvote

by iagooar23 hours ago|

[-]

Get a 2nd hand one. I was lucky enough to get a new one first, last week I get a 2nd hand one in order to run one of my Hermes minions at work.

reply

upvote

by stevenaenns23 hours ago|

[-]

how many tokens/s generation do you get?

reply

upvote

by iagooar23 hours ago|

[-]

Ballpark 25-30 tok / sec on the Mac Mini Pro M4 + qwen3.6 35B. The generation itself is good, prefill is known to be slow on any Apple M-chip architecture. It is really decent.

reply

upvote

by angoragoats19 hours ago|

[-]

They did until 4 days ago, so I’d forgive the OP for not knowing that the option was discontinued.

reply

upvote

by Arch-TK20 hours ago|

[-]

It's okay, completely wrong thread for this statement, but I wouldn't voluntarily use current MacOS (no idea if the older variants weren't terrible) over anything but ssh. Worse than Windows 11.

reply

upvote

by amatecha17 hours ago|

[-]

"macOS" (or however they spell it now) is pretty bad, but I'm not sure it's possible Apple could ever possibly produce an OS as bad as Windows 11 lol, it's really surprising to me to see someone suggest it's somehow actually worse?! How many times has an Apple OS wiped your hard drive or otherwise been completely borked from a forced update? I know multiple people personally who have experienced this with Windows 10/11, not once with a Mac. Just that alone is like the end of the argument for me, ignoring all the shockingly brutal UI problems.

reply

upvote

by Tenoke11 hours ago|

[-]

>How many times has an Apple OS wiped your hard drive or otherwise been completely borked from a forced update

I use Windows and this has never happened to me. I have had Macbooks I cant open to fix/replace something trivial while I can replace any part easily on a Windows PC/laptop though.

reply

upvote

by asimovDev8 hours ago|

[-]

>Windows PC/laptop though.

needs to be noted that it's increasingly uncommon to be able to do so. for desktops you have to build everything yourself - prebuilds (either gaming or workstations) have proprietary PSU and motherboards (in case of workstations, sometimes CPU is bound to the motherboard / manufacturer, for example Threadrippers). Windows laptops now often come with soldered RAM and soon will probably be without M.2 slots like Macs.

There is Framework though I guess

reply

upvote

by Tenoke6 hours ago|

[-]

Well my PC is built on my own and I change parts every year or two. My laptop is a Thinkpad from last year and Im pretty sure I can easily open it to replace something just like my last one.

reply

upvote

by braebo19 hours ago|

[-]

I could not disagree more.

reply

upvote

by toephu222 hours ago|

[-]

I just checked apple's website and configured them:

Mac Studio: Ships: 16–18 weeks

Mac mini: Ships: 10–12 weeks

reply

upvote

by icedchai4 hours ago|

[-]

Hopefully they're ramping up on the M5 variants.

reply

upvote

by jarek835 hours ago|

[-]

You can't buy Mac Mini with 64GB RAM today. Most what you can have is 48GB

reply

upvote

by stared21 hours ago|

[-]

Yes, it gets really hot really fast.

As much as I was tempted to use it on longer projects, I had some reservations about whether it would put too much strain on my MacBook.

reply

upvote

by cosmic_cheese23 hours ago|

[-]

They really need to release those updated Studios already.

reply

upvote

by DennisP21 hours ago|

[-]

Since they've reduced the max RAM on current Studios from 512GB to 96GB, I'm not holding my breath.

reply

upvote

by Aperocky10 hours ago|

[-]

Thank you - I was very close but thanks to chores and availability haven't pulled the trigger. You are very convincing.

reply

upvote

by 22 hours ago|

[-]

deleted

reply

upvote

by Matl23 hours ago|

[-]

> If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement - or at least a few meters from your desk.

Can confirm this works rather well, most things that integrate with LLMs, (agents, editors), support providing a remote (LAN) URL for Ollama, LM Studio etc.

But you do need a fast LAN connection, otherwise working with agents will be a pain.

reply

upvote

by Retr0id23 hours ago|

[-]

> you do need a fast LAN connection

Huh, how come? Low-latency I can understand, but I was under the impression that token throughputs were still barely exceeding dialup bandwidths.

reply

upvote

by iagooar22 hours ago|

[-]

I disagree LAN connection is the bottleneck. I do even work with it remotely via Tailscale on shaky hotel WIFI and it works fine (or as fine as any other API-based model).

reply

upvote

by cmgbhm23 hours ago|

[-]

A local model on my m2 made me come to that conclusion but I definitely was having “that config is $2k more” regret. Thanks for posting this!

reply

upvote

by seunosewa16 hours ago|

[-]

You can get some work done by using low power mode even when plugged in, and making your fan start running when the temps just start to rise (maybe 40 degrees. Use a third party fan app to set it up

reply

upvote

by SkitterKherpi1 days ago|

[-]

I am considering getting something like NVIDIA's RTX Spark when it comes out, though even that will be limited to 128GB.

reply

upvote

by jazzyjackson23 hours ago|

[-]

They’ll sell you a bundle, either a pair or a quartet so you can have 256 or 512GB over a 400GB/s network link

I can’t figure out when it makes sense to pay 10k up front for a quantized Llama 3.1 but it’s an interesting option

reply

upvote

by c7b22 hours ago|

[-]

You could fit a Q4 GLM5.2 in 512GB and still have some space for context (372-475GB for the model): https://unsloth.ai/docs/models/glm-5.2

But yeah, there's a bit of a dearth of models that could fully utilize memory in the 128-256GB bracket at the moment. But things move so fast in this space, I wouldn't base my decision on a generation of models that's just a few months old.

reply

upvote

by rnxrx22 hours ago|

[-]

It depends on what's meant by "fully utilized" but fp8 quants of Nemotron 3 Super, the latest Minimax, Cohere A+ and the Mistral small and (especially) medium variants all sit in that 128-256 category, especially with full context or even moderate concurrency. In fact, in a 192GB environment I work with (Hopper GPUs, fwiw) I was pushed into using 4-bit quants with a couple of those to get the model working with a reasonable context window (..but 256 would have rocked out).

reply

upvote

by girvo22 hours ago|

[-]

Not Llama 3.1, but Step 3.7 Flash is one of the few new high quality models in this size bracket. DeepSeek v4 Flash too

reply

upvote

by SkitterKherpi23 hours ago|

[-]

10k is rather a lot yes. For LLMs you can use a lot of tokens with 10k with less hassle without the machine (and also it's not like electricity is free), but for some other things like video models 10k would get burned very fast. I am looking for something more in the 5k range though.

reply

upvote

by awesomeusername1 days ago|

[-]

It's out, I'm daily driving one. It's great

reply

upvote

by SkitterKherpi23 hours ago|

[-]

I assume you have the dgx spark? At this point I am not 100% on the difference other than Linux and Windows. The RTX spark should come around Q4, unless I am mistaken.

reply

upvote

by vikingcat23 hours ago|

[-]

Are you running a local LLM on it? Did you buy a whole laptop?

reply

upvote

by zkmon7 hours ago|

[-]

The Q6_K gguf fits nicely on a 24GB GPU. That's amazing.

reply

upvote

by bilekas21 hours ago|

[-]

Can you define "serious programming"? Because I use it to implement things I COULD go and figure out like algorithms or test generation or evaluations etc, the "serious" programming I tend to do myself. That is what I'm paid for.

reply

upvote

by Roark665 hours ago|

[-]

Serious programming is dealing with a large knowledge surface area.

So not "implement me a shading algorithm"

But more like: make an multi user app running on a k8 cluster, design the whole thing to be indempotent, scalable, easy to deploy remotely via ipmi/pxe boot.

Then see how it makes stupid mistakes along the way.

Today's AI is pretty amazing when it comes to fixing narrow problems (or creating Web apps with no infra). Give it anything where it needs to go online, download some helm templates and look through them to figure out parameters, as well as write an app and it will make lots of mistakes in seemingly simple stuff.

Opus seems to be the model that works the best with this.

reply

upvote

by overgard19 hours ago|

[-]

Serious programming is using as many agents and loops as possible because anthropic needs you to spend more on tokens

reply

upvote

by seanmcdirmid23 hours ago|

[-]

What sort of M5 are you running? A max? MacMini's don't offer max CPUs.

reply

upvote

by iagooar23 hours ago|

[-]

M5 Max. But I also have a MacMini M4 Pro 64GB. Qwen3.6 runs on the M4 just fine - sure the M5 is at least 2x the speed. If Apple launches a MacMini with an M5, I will be the 1st one to get it.

reply

upvote

by kristianp23 hours ago|

[-]

You're only going to get an incremental improvement with an M5 Pro mini compared to an M4 Pro mini. Memory bandwidth goes from 273GB/s to 307GB/s, about 12.5% improvement for LLMs.

reply

upvote

by freehorse21 hours ago|

[-]

M5's have the neural accelarator that boosts prefill speed a lot. But token generation itself will not change that much, that's true.

reply

upvote

by iagooar23 hours ago|

[-]

I thought they might ship an M5 Max version, but you are probably right.

reply

upvote

by Abishek_Muthian16 hours ago|

[-]

>Sure you can use it in clamshell mode

Wouldn't this damage the MBP display?

My RTX laptop has air intake underneath the keyboard and clamshell mode is surely a recipe for disaster; I've taken numerous measures to ensure that the laptop doesn't stay awake when the lid is down.

reply

upvote

by kamranjon10 hours ago|

[-]

I completely disagree, it is probably the best platform currently for this - and the way I run it is as a server with tailscale accessible from my coding machine (same as you suggest here) - the difference is that you can stop the server, use it as a video editing rig on a whim, or use it for training instead of inference (yes PyTorch has caught up and Metal is a great platform for this now).

It’s just so flexible, and I even use it in agent mode (ds4) directly on the machine as well sometimes (it’s really not that bad, I’m often running inference for small side projects on my couch), if there is another machine that can do all of this and still function as one of the more ergonomic, well built, and compact laptops out there, I’d love to hear what it is cause I’d likely be interested!

reply

upvote

by jarjoura23 hours ago|

[-]

TBF, I just recently picked up this same model, and it's reminding me of the last gen Intel i9 MBP. Just visiting any non-basic website spins up the fans and battery life isn't great either. Yes, this thing is fast, but damn it gets hot just using it for normal tasks.

Still, I don't agree. I think this machine is meant to use local models. You just have to wear pants if you want to keep it directly on your lap. I rarely use it that way anyway. I prefer it plugged into an external display and comfortably sitting on a laptop stand.

reply

upvote

by y1n021 hours ago|

[-]

Is there something wrong with the m5s? I have an m4 pro and I’ve never heard the fan on it. I don’t do much with local llms, but I naturally use the web and play games (windows games at that with wine/crossover).

reply

upvote

by inventor777721 hours ago|

[-]

That seems very unusual for modern Apple Silicon. Our family has:

- M3 Pro MacBook Pro 36GB

- M2 Pro MacBook Pro 16GB

- Mac Studio M4 Max 48GB

and I have not heard the fans on any of them with normal use. The only time I've ever heard automatic fans was when I was using a local 12B model on the M3 MacBook Pro, and when running 70B models on the Studio.

You should consider checking Activity Monitor and making sure that the usual suspects are not causing issues with sustained high CPU. And you can use an app like [Stats](https://mac-stats.com) if you want to see that info while actively using the computer.

reply

upvote

by KingMob8 hours ago|

[-]

As someone who just upgraded a month ago from the last Intel MBP to a new base M5 MBP, I think your laptop might have a problem. I'm definitely not experiencing any of what you describe when doing normal tasks.

reply

upvote

by 23 hours ago|

[-]

deleted

reply

upvote

by lowbloodsugar19 hours ago|

[-]

This is not normal. You have a broken Mac. Make an appointment.

reply

upvote

by m3kw97 hours ago|

[-]

Your MacBook will not last running current big LLMs on these hardware. The heat will wear on it.

reply

upvote

by verdverm1 days ago|

[-]

Get an OEM Spark instead, mine are silent and can fit 2 qwen/gemma at 8bit or give you room for a bunch of other, smaller models (embed,rerank,etc)

reply

upvote

by throwaway24040318 hours ago|

[-]

No, buy a framework desktop.

reply

upvote

by pistoriusp12 hours ago|

[-]

Mac Mini in the rack and a Neo in the lap.

reply

upvote

by kelchm6 hours ago|

[-]

This -- with the M5 Max MBP is running flat out, you'll go from full battery to empty in under two hours.

While it is wild to have this much power in a take-it-anywhere laptop form factor, I sort of regret not just going for a Mac Studio + base M5 MBP.

reply

upvote

by singpolyma322 hours ago|

[-]

With 128 you can run 122b ;)

reply

upvote

by codazoda22 hours ago|

[-]

Today the Mini tops out at 48GB. Gotta go to the Studio to get 64GB.

reply

upvote

by aurareturn22 hours ago|

[-]

Don't buy the Mini or Studio. Both have the M4 which lacks the Neural Accelerators, making prompt processing ~3-4x slower.

reply

upvote

by mortenjorck22 hours ago|

[-]

I assume those don't just work automatically with an off-the-shelf gguf. What do you need in your local inference stack to take advantage of M5's neural accelerators?

reply

upvote

by wren699112 hours ago|

[-]

Apple muddied the waters by calling them "neural accelerators" but it seems like what they actually added in the M5 generation is tensor instructions for the existing GPU cores. It's not a separate accelerator like the ANE.

llama.cpp's Metal backend does use them when they're available.

reply

upvote

by aurareturn22 hours ago|

[-]

They do work with llama.cpp and MLX automatically.

reply

upvote

by 2Gkashmiri13 hours ago|

[-]

Apple Mac Studio (M3 Ultra Chip/28 CPU, 60 GPU/96 GB/1 TB

How is this config?

reply

upvote

by busymom01 days ago|

[-]

Also look into buying the Mac mini refurbished from Apple. They come almost brand new, same warranty and you save money.

reply

upvote

by ako15 hours ago|

[-]

You could use an external keyboard?

reply

upvote

by Fr0styMatt8823 hours ago|

[-]

What kind of speed in tk/s do you get with the MacBook?

reply

upvote

by iagooar23 hours ago|

[-]

qwen3.6 27B MLX 8bit -> 15 tok / sec. A bit slow but it is a delightful model to use, and smart too.

qwen3.6 35B A3B MLX 8bit -> 85-90 tok / sec! It is impressively fast and roughly 90% as good as 27B (in my opinion).

reply

upvote

by samtheprogram22 hours ago|

[-]

Are you sure you're running it with MLX?

reply

upvote

by gyanchawdhary8 hours ago|

[-]

This is a very exaggerated take. I have an Apple M5 Max with 128 GB ram running 15'ish Coasts (coasts.dev) environments, each of them running postgress, python, redis and FE stack + locally running voice models and face swap models .. and the only time the fan kicks in is when I open multiple google analytics tabs.

reply

upvote

by gigatexal20 hours ago|

[-]

Same. And your M5 has acceleration that I don’t with my M3 max. I can’t do anything local it gets hotter than an Intel Mac trying to run docker from back in the day.

reply

upvote

by julianlam17 hours ago|

[-]

Very surprised an Apple device can have some atrocious ventilation design.

I'm running this model on a Framework 13 and the chassis barely heats up at all while running full tilt.

reply

upvote

by 2Gkashmiri17 hours ago|

[-]

How is Mac studio 32gb or 96 gb ram one?

reply

upvote

by dzonga22 hours ago|

[-]

why not buy one of those "a.i" desktop kits being sold by Nvidia/AMD and just connect to them via network ?

to me that's cheaper than paying an LLM provider such as Anthropic spreading FUD around open weight models & more sustainable too.

reply

upvote

by Gigachad20 hours ago|

[-]

It's still currently way cheaper to pay open router to run qwen for you. And you have the option to use much bigger better models like DeepSeek v4 flash.

reply

upvote

by zxexz16 hours ago|

[-]

[dead]

reply

upvote

by ActorNightly22 hours ago|

[-]

>If you want to run Qwen3.6 27B / 35B at its best, get a MacMini M4 with 64GB of RAM and put it in the basement

Im sorry, but its time to start calling Apple sycophants out. Stop trying to push your tech jewelry on other people. You only buy those computers because they are Apple, you don't know anything about computing or running LLMs, you don't do any real work, so you should probably not give advice on what to buy.

A single 3090 will run Qwen3.6 27b fine, and its VRAM speed is twice of what the best Mac has. And the build will be cheaper. Decent CPU/Motherboard, 32gb of DDR4 ram, an SSD and a Single 3090 should run max about $4grand. Mac m4 mini is 6grand.

Then, when gpu prices come down (or you find one on a deal), you can upgrade the card, or stick a second one, and benefit from more speed. You can't do that with the trash Apple produces.

Flag me if you want, I don't care. Its embarrasing for the tech community to give advice this bad.

reply

upvote

by iagooar22 hours ago|

[-]

I am not going to flag you, I am much OK with having good arguments.

I just purchased a Mac Mini M4 Pro 64GB for $3k - 2nd hand of course.

I am not a hater of Nvidia and I am planning on building a workstation based on RTX cards. You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).

I am pretty sure I know a thing or two about computing, I have been in the trenches for many, many years and I have had machines of all kinds, shapes and colors. It just so happens that Macs are very capable, very convenient machines that happen to work great in the era of LLMs, too.

But you do you.

reply

upvote

by Roark665 hours ago|

[-]

I have to see I'm in the rtx camp. A dual rtx3090 workstation with 200G of ram and zen5 9950x cpu. All watercooled.

The only reason I can tell it's on, is the very quiet hum of the slow speed water pump. Large fans run at 1200rpm and are fully quiet.

I have over a meter of radiators there.

Fun fact, I bought my first rtx3090 4 years ago. A year ago I bought another one and they are still the same price used.

I may buy another one (for my servers)

reply

upvote

by lowbloodsugar19 hours ago|

[-]

If you are in Apple ecosystem, and have reasons to own one besides inference, then buying a used Mac mini pro isn’t such a bad idea. I just bought a regular Mac mini just to provide a nice front end to my Ubuntu workstation. But if all you want is inference, then a cheap PC with a 32gb 9700 (or two!) in it is far cheaper. This specific thread was about someone who already has a MacBook. A cheap PC and GPU pairs well. Or a spark: slower but more memory. Or fuck it! Get a 5090 or a 6000!

reply

upvote

by ActorNightly22 hours ago|

[-]

>You clearly do not seem to understand how convenient the MacMini actually IS - the form factor, how quiet it is, how durable it is, how well it integrates with other Macs, how well it works as a bridge to a personal agent like Hermes (integration with iMessage, Calendar, Reminders, iCloud, etc).

If you are that locked in to Apple, its pretty easy to buy a used Mac Mini older gen for all the non AI stuff.

But this is a discussion about inference. Buying a Mac anything for any sort of local inference is a COLOSSAL waste of money.

reply