I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.
I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.
Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.
The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.
Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.
Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.
When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.
And I can't say that I won't switch to openrouter (even just for the same models) at some point.
But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.
The rest of my life is ultra-frugal so I am relaxed about this.
Having spent a good weekend learning how to perform latent-steering through playing with pytorch and a local Gemma4 model, there is no way I could have groked any of that in the the way I did without hands on time.
This is on an M3 Max 36GB I've had for a couple of years. No further outlay needed.
I don't know if it has changed my mind about a career change but as I am sure you can understand, I no longer feel like I am running away defeated.
My very best wishes to you :-)
The interesting question is whether that gap will narrow, and if so, how much, and on what timescale.
The exact answer to this question is not knowable, but if you are the kind of person who comes to a site called "hacker news", and you think there is a nonzero chance that the answer is that yes, the gap will narrow and this won't always be an expensive toy, then now seems like a pretty great time to get in the game and start exploring the capabilities.
(sarcasm, btw)
Over the long term it's always been better to buy than to rent, even if the renting option is technically more efficient on the GPUs, you don't have to pay some hosting providers profit margin.
And for users that aren't running multiple agents 24/7, you should be able to fit a good user:GPU ratio.
For example (and relevant to AI) I can generate electricity on my roof at $0.20-25/kWh, batteries included. In California the electric utility can’t offer it cheaper than $0.30-0.50/kWh. Therefore at scale, electricity is actually more expensive.
There are many such examples.
Right now, there is way more scale in centralized AI than there is at the edge. But that could flip. I'd still probably put the probability that it will under 50%. But I'd also put it above zero!
What makes you so certain that economies of scale won't work the opposite way you imagine? E.g., if model improvement tapers off, but RAM costs decline (hard to believe atm, but historically likely), then eventually everyone will be able to run SOTA models on their personal hardware.
Heck, even if model sizes simply grow more slowly than RAM costs decrease, the same would happen.
I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place
That's never the point of keeping local alternatives though.
For me this dates all the way back to installing Slackware 1.0 (0.99pl12!) on an offline 486SX rather than just using the internet-connected workstations in the lab.
Here, I already had a Mac that was powerful enough to run a local LLM, so now I do, because I can.
I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight.
For people who demand an understanding of their tools, it's a lot of work. I recognize the value of "AI" in performing the tasks I'd have to do manually; for example, keeping the data structures of my front- and back-ends in sync in a project. But do I want to interrupt my development and take weeks off to digest all of these tools?
And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.
Cloud models still feel ‘magic’, like you send a request off and get something back, like it’s something ‘special’. I used to joke that ChatGPT might be some kind of mechanical turk underneath.
Watching a model run local on your own machine hits different — you realise that yes, it IS just a computer program. Which for me actually makes me appreciate the leap we’ve made MORE, not less. From an information-theoretic point of view, LLMs really are something special.
The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.
Like, just watching a computer I already owned act like ChatGPT with the wifi disconnected.
It was the first time I stopped feeling quite so helpless, somehow.
Qwen barely needs any of Opencode's prompt, in my experience; I think I cut it down to about three general lines I found by googling. Mainly you need only a pre-amble to make sure that the plan mode, plan switch and build mode prompt fragments make sense.
Gemma 4 also needs almost nothing at all, which is fascinating, considering it is not a coding-specialist model. It just seems to be who you need it to be when you ask.
Just one example, I needed a bunch of images tagged and organised, with a local vision capable model I could pretty easily set that up and leave it running overnight.
I already had the GPU and memory for gaming, so it was at no cost for me to start running local models. But I feel the long term writing is on the wall, local models will only make more and more sense as they get better and more efficient.
Seems like a GPU with 12GB+ VRAM is going to be a much more affordable way to achieve that? Even a B580 should get reasonable perf there.
I guess I would build a powerful home LLM server if I was convinced I really needed one for my purposes for some agentic application or other. At the moment I'd prefer to ride this out with a machine that is also an excellent Mac.
Agree having a powerful machine is really worth it in general for professionals, but strong disagree that running local LLMs has anything to do with it. It's hard enough as it is getting a good ROI on your time/money prompting/wrangling with frontier models. IMO leaning on the comparatively limited capabilities of local LLMs is best avoided in favor of keeping your own personal coding skills fresh and continuing to learn new ones.
I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it.
YMMV.
But if this is the case, as you say, it seems like a good opportunity to build a more welcoming set of entry points into this!
(Very reminiscent of 3D printing, where you get a lot of very trivial advice poorly applied, which is an analogy I've now made several times.)
Several of the youtubers are pretty helpful, though; I watched half a dozen things and absorbed the broad pattern and then went for it.
Also I got a lot out of reading HN comments, which is why I am here; tucked away in the corners of these discussions are people who can help. Over time I hope I am one.
To me, "how do contemporary AI systems work and interact with contemporary hardware and how can I best take advantage of their capabilities?" is the set of skills that are worth learning at this moment.
What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns?
I dunno, all of this seems really boring and "been there done that" to me at this moment in time!
I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.
So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.
I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.
For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded.
I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.
(Also who am I kidding about the existence of a printer with no problems)
I have colleagues that seem perfectly content to delegate too much to the agents, and it saddens me. It feels like there will be swaths of engineers that didn't train some of the critical thinking skills that I take for granted.
I certainly see it in slack discourse around anything more complicated than a feature implementation. Maybe I'm just cynical. Time will tell, I suppose.
That is why I'm content to delegate to agents - I have more code/features I want to write than I have time to debug (writing is the easy part).
Over the last few months, I've been digging into performance problems with a high throughput service that my team owns. I started working on the problems in my own time, put out short and medium term improvements that legitimately avoided operational issues, and started developing an alternate architecture that should meaningfully address the problems for the long term.
I've learned new things and made improvements that probably wouldn't have ever gone in otherwise.
I've spent my whole career being frustrated by the pile of low severity bugs and performance issues that "I could fix that if I could only justify putting a couple hours into it!". And now I can just fix all those. Nobody is going to question my use of time to write prompts and do code reviews of those things, when I can to my "real" work simultaneously.
What does "mainstream" refer to when we're talking about software development and LLMs? As opposed to "engineers".
But I think there is (and has always been) also a distinction between the "mainstream" of software developers vs people who are working on new tools and capabilities to be used by that "mainstream".
IMO it is certainly true that the most efficient and cost effective was to do "mainstream" software delivery at the moment is hosted frontier models. But for people thinking about "what's next?", it makes a ton of sense to be exploring different models in anticipation of a possible (but certainly not inevitable) sea change.
I mean one of the things I use a local LLM for, because I can, is to generate starter documentation. But I ask it to — I want it to give me overviews, plans, all that. It can make something bespoke for me.
I guess I could also ask it to do the work. But where do you draw the line?
The universal labour-saving device is the great provocation of the next 100 years I think, and both Star Trek and Wall-E have grappled with it.
And that's how skills die.
The reason I delegate so much of local LLM installation and administration to Claude Code is simply because there's no point learning practical things that will work completely differently in a couple of years, or in memorizing procedures that I'll forget long before I need to perform them again.
No longer having to sweat all the details is a Good Thing, not a Bad Thing.
But I think if you want to really learn to ride well, understand horses well, there might be some benefit in learning how to shoe a horse. At some level it should never only be someone else's job.
For example, you need to know it uses gasoline (or diesel), it requires oil changes every certain amount of time, break pad replacement, etc.
You also probably need to know that you can't operate cars over a certain amount of water, that you need a driver's license, stopping at red lights, etc.
Sure, you might not need to be a mechanic, but that's far from not understanding how a car works, which to me sounds similar to knowing how to shoe a horse, which is different than being a horse vet.
Maybe a more apt analogy would be a skill like making fire without a lighter.
That skill died too, so what's your point?
Maybe my biggest problem with the world of agentic AI, and the reason I am putting myself through learning it the way I am, is that the need to know the "why" of everything is so fundamental to me, that I don't know if there is any point to me without it.
So this is really the only way I know how to proceed.
And we happen to be discussing this on a forum where the type of people who will be the specialists for the kinda of systems we're discussing are likely to gather.
I'd be surprised if in my casual discussions out in the real world, I were to run into a lot of people who care exactly how all this works, to the extent that they want to invest significant money into hardware that allows them to run things themselves and dig into what's actually going on. But I'm not at all surprised to come across such people here! (Indeed, it would be very disappointed if I didn't!)
I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)
I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.
A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.
- opencode with it's webui
- deer-flow with it's research/powered front end
They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).
It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.
It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.
Have you tried Paseo?
I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.
(You can also use the Opencode GUI to frame a remote opencode web interface)
I'm gonna check out paseo, but am not looking forward to all the ram the agent needs + all the ram paseo needs
Hello, my brother, just know that you have a fellow passenger in life at the same age who thinks the same thing. I agree that the local stuff is helping my understanding a LOT.
However, my gut feel as someone who got to experience the TeleBomb after the DotBomb is that the obfuscation is INTENTIONAL--it's neither you nor your age. I remember asking people to explain to me what the OC-768 startup endgame was when roughly 10 OC-768 links could carry the world's traffic at the time--and everybody giving me blank looks. The AI Bubble has the EXACT same feel as the Telecom Bubble--just bigger.
What I really wish is that I could find a VPS-type provider where I could toss things into their NVIDIA/AMD machines for an hour or two. Alas, all of the providers seem to want massive paperwork and huge minimum purchases.
I can't wait for the bubble to pop so that we mere mortals can finally build with this stuff.
In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.
So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.
Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.
edit - after actually reading the tweets (had to use xcancel) and visiting the source git repo, switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!
https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...
The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.
Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.
Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.
The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.
It does seem to be doing useful work but it’s not API call level quality
If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.
With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.7-27B.)
I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.
Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.
Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect.
When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or for metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par.
Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for.
Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.
If you want to run unquantized, you definitely need 128GB.
Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.
As of writing this, it shows 24 offers between 700 and 950.
Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.
If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?
$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)
Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.
Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.
https://github.com/noonghunna/qwen36-27b-single-3090
Flies though (50-70tps is impressive for a model this smart)
I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.
The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.
I did find a few useful parameter settings I've already discovered using my single 3090 and ollama.
I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.
[edited to mention ollama as a nice alt]
I still use the MTP version as it _feels_ slightly better quality, and because the unsloth quantizations I can get have more variety to fit into the various systems at hand... but that's not for the MTP aspect, unfortunately.
In the article they did have ~2x performance on the 27B (which might be something to retry, though on my Framework that would bring it from 5 -> 10 token/s so still "excrutiating" speed, probably).
YMMV for sure.
I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.
Context size?
i'd expect anything on github for example to be already in their training set or is training on actual usage more useful to them?
In any cases, from the economic point of view, running models on laptops make little sense. Even at the pure cost of energy consumption, it might be hard to beat pricing at tokens generated at scale.
At the same time, it is a breaktrough, that will change the game. Previously such vibe coding on consumer device was not hard or costly - it was impossible.
I paid 2424 euros in total for this machine. And it can easily run the models discussed in the comments and in the article. It's tiny, and runs CachyOS like a champ. Over 4000 euros less than the price you listed.
We can all send a thank you letter for our friendly billionaires such as Sam Altman for the price situation we're in today: https://www.mooreslawisdead.com/post/sam-altman-s-dirty-dram...
I think you might be a little to into the stew here.
I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.
In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.
Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B
So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).
The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.
I'm thinking of getting a SoC machine with 128GB RAM but the bandwidth is limited to 256 GBps. Would you even consider such a machine a decent investment, or should I wait for the newer gen of chips? Thanks!
These devices, especially the DGX line, are fantastic if you are interested in low-level CUDA programming. The DGX spark can be used to prototype CUDA code/libraries for GPUs that most of us couldn't think about affording. If you want to learn how to program for datacenter level GPUs then these are the best way to get that at home. Sure your code will run very slow compared to the real thing, but you can take that code and, theoretically, run it on the real thing. For anything else though, I feel there are better options.
If you're interested in pure inference I'm pretty partial to Apple devices. The M4 Max gets you 546 GB/s, the M5 MAX 614 GB/s, and the M3 ultra (you'd have to buy used at this point) 819 GB/s. Plus you have a very useful computer even if you realize you don't want a full time home inference server. Additionally these devices require very low power (if you're running high end consumer GPUs you do have to think about what your energy costs are per hour and how warm you like your room).
If you're interested inference and training, or already have a pretty beefy desktop PC, or simply demand the most token/s you can get, then GPUs are the way to go. The downside is they're still pretty memory restricted (but honestly the options for what you can run on any RTX N090 are pretty good). You'll get blazing inference and prefill speeds on these devices. The only down side is, if you are using them heavily, you will see it on your energy bill and feel it in your room.
The "should I wait" question is also potentially applicable. The world of consumer hardware is looking increasingly bleak (and expensive) but if Apple does release a new "Ultra" model we could be looking at inference speeds very close to GPUs (there's still limitations to these devices that makes training preferable on GPU)
What I had in mind was an AMD Strix Halo machine, but it seems to have none of the advantages you mentioned. It's neither high bandwidth, nor does it have CUDA support, nor does it have support from the big OEMs. All the boards are from relatively obscure Chinese vendors.
It seems like all the major OEMs have rallied behind Nvidia, if you look at the upcoming RTX Spark laptops.
The same can be said about operating system memory requirements. I am sure Linux and Windows kernel developers can confirm. Yet 30 years ago Solaris used to run comfortably in 16 MB of RAM, today you need 512 times that to run Linux.
What's going to happen is that the capability at any given size point is going to get better over time as new training regimes cram more into the available space. A 27b model released next year will be better than a 27b model this year (else why release it?). Hardware will get more useful, not less.
... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.
> GLM-5.2 class models already need 1TB+ of RAM.
If you quantize GLM-5.2 to 4 bit, you can do it in less than 500GB: https://huggingface.co/unsloth/GLM-5.2-GGUF (table on the right)If you find three finds that also have a 128GB MacBook, you can chain them together (the MacBooks, not your friends) and make it work.
You could also run GLM-5.2 on a single MacBook if you stream the active parameters from disk, but even with speculative decoding, you'd probably only get in the order of 1 token per second, so this is not really practical for most applications.
They’re trending to be the right size to be good.
Qwen3.6-35B is not as good as Qwen3.6-27B. The larger model is faster, but a lot dumber; it gets caught in loops, makes crazy mistakes, and is just not as good. It’s bigger, but it is nowhere near as good as the 27B variant.
at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.
Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.
i've watched friends try that route; i've been through this before. taking a downgrade is never fun: if it's a thing you're likely to care about in the future, then sometimes it's better to place yourself in the right ecosystem early.
in terms of privacy, yes that's a real application, but someone taking it all away? I don't see it happening.
it's not an OS or a device, it's just a box/thing that runs a model, it's really commodity stuff we're talking about
more realistic concern would be that the open labs wouldn't be able to compete in the future thus development ends, but that means you can't host models that don't come out so...
again maybe I misunderstood but I just don't see why this would be worth it just for that one concern
From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.
Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.
You're welcome to make your substantive points thoughtfully, just not aggressively.
Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.
But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"