undefined

upvote

points

by bensyverson1 days ago |

upvote

by dofm1 days ago|

[-]

The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

I don't know how much serious hands-free agentic coding I will ever do on my MacBook alone, but I do know that I would not have got so far into understanding this without tinkering with local models, llama.cpp, LM Studio, and LM Studio and all that.

I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.

Until, that is, I could poke around with setting it up on my own (secondhand) machine, watching the API calls, understanding some of the terminology. I didn't even buy the machine for that; it's just adequate to the task.

The Neo is too small to really get much benefit from this opportunity to make it more visceral and knowable.

reply

upvote

by pizza2341 days ago|

[-]

> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

Cloud models are (much) faster, they don't consume so much power/generate heat, they have much bigger (LLM) context, they're much more precise and they have a much wider (engineering) context of the given problem.

Except privacy and use cases that are blocked by cloud models (e.g. reverse engineering), local LLMs are currently an expensive toy.

When I try to program with a local LLM (I'm on a 32/128 GB system), I end up wasting time compared to a cloud LLM.

reply

upvote

by dofm1 days ago|

[-]

Again, I would not argue against any of this.

And I can't say that I won't switch to openrouter (even just for the same models) at some point.

But one of the things I have found about my own process learning is that some lessons only come to you when you make yourself available to them. And if that means doing things the difficult way, that is what you should do.

reply

upvote

by wahnfrieden1 days ago|

[-]

Difficult... and wastefully expensive

reply

upvote

by sanderjd23 hours ago|

[-]

Seems like an investment into building expertise, which is likely to have high ROI in the future, rather than a wasteful cost.

reply

upvote

by dofm1 days ago|

[-]

I mean, it's a (secondhand) computer I bought for other tasks (processing very large photos, compiling large apps quickly). It's running all the time. It can also run LLMs when I want to.

The rest of my life is ultra-frugal so I am relaxed about this.

reply

upvote

by _puk23 hours ago|

[-]

Don't bite. You're right.

Having spent a good weekend learning how to perform latent-steering through playing with pytorch and a local Gemma4 model, there is no way I could have groked any of that in the the way I did without hands on time.

This is on an M3 Max 36GB I've had for a couple of years. No further outlay needed.

reply

upvote

by monkmartinez23 hours ago|

[-]

My thinking is totally aligned with yours, perhaps its because I am trying to do a second act at almost 50 from blue-collar to white collar office work. I have no formal degree, but I have been hobby programming for 20 years. I have made a habit of "letting myself be available to all lessons"... the localllama group has made this journey really fun if nothing else. I have learned an ABSOLUTE ton from this era!

reply

upvote

by dofm23 hours ago|

[-]

I have been contemplating a move in the opposite direction because I have just been exhausted and depressed, so for me, really learning this stuff this way has been about managing those feelings, about a sense of pride and ownership of my processes.

I don't know if it has changed my mind about a career change but as I am sure you can understand, I no longer feel like I am running away defeated.

My very best wishes to you :-)

reply

upvote

by moffkalast22 hours ago|

[-]

People pay thousands for model trains, everyone needs a hobby.

reply

upvote

by dofm21 hours ago|

[-]

Training models vs modelling trains

reply

upvote

by moffkalast10 hours ago|

[-]

Ah yes, the EMD0E9-30B-Union-Pacific.gguf

reply

upvote

by fragmede6 hours ago|

[-]

I'm sending Codex-gpt-5.5-cyber.stl to my printer right now!

reply

upvote

by Shorel1 hours ago|

[-]

From your post I can only perceive the instinct to pick a side, and trying to make sure it is the "winning side". But the truth is far more nuanced. I have acces to both, paid and local models, and even if slower, the local models have been far more educative about how these technologies are put together, and what is required for local computing to thrive again. Paid models will not suddenly disappear just because I play with glm-4.6 on Ollama. At the same time, my work pays the cloud subscription and I use the cloud models to perform the tasks my work requires. There's no need to choose one side.

reply

upvote

by sanderjd23 hours ago|

[-]

> currently

The interesting question is whether that gap will narrow, and if so, how much, and on what timescale.

The exact answer to this question is not knowable, but if you are the kind of person who comes to a site called "hacker news", and you think there is a nonzero chance that the answer is that yes, the gap will narrow and this won't always be an expensive toy, then now seems like a pretty great time to get in the game and start exploring the capabilities.

reply

upvote

by Abishek_Muthian16 hours ago|

[-]

I agree completely. I think local AI is best limited to purpose built SLMs; all this craze around running quantized coding LLMs has taken the attention off SLMs.

reply

upvote

by icedchai3 hours ago|

[-]

Same. Local LLMs are fun to experiment with, but when I want generated code of a sufficient quality, I use a cloud LLM.

reply

upvote

by AlpacaJones1 days ago|

[-]

The key word there is 'currently'.

reply

upvote

by smt881 days ago|

[-]

Economies of scale are a fact of nature and aren’t going to be subverted in the future by even the most advanced local models

reply

upvote

by kennywinker1 days ago|

[-]

Which is of course why, if you want to render 3d scenes to play a video game, you have to rent time on a mainframe system. I don’t see that changing ever - it’s just economies of scale!

(sarcasm, btw)

reply

upvote

by Gigachad20 hours ago|

[-]

The economies of scale gains are lost because you still have a middle man hosting provider who wants to profit too.

Over the long term it's always been better to buy than to rent, even if the renting option is technically more efficient on the GPUs, you don't have to pay some hosting providers profit margin.

reply

upvote

by Dylan1680711 hours ago|

[-]

If the hosting provider can fit 1000 users onto 100 GPUs, that's enough for quite nice margins and being far cheaper than buying your own GPU.

And for users that aren't running multiple agents 24/7, you should be able to fit a good user:GPU ratio.

reply

upvote

by Gigachad10 hours ago|

[-]

Maybe. The economics work out better than for game streaming. When I looked in to game streaming it ended up being cheaper to buy over the long term. Though games tend to use 100% of the hardware for hours, and they tend to all be used at the same hours of the day and have to be hyper local for latency reasons. Something LLMs don’t have issues with.

reply

upvote

by oceanplexian23 hours ago|

[-]

Things can get both more expensive and cheaper at scale, hence the term.

For example (and relevant to AI) I can generate electricity on my roof at $0.20-25/kWh, batteries included. In California the electric utility can’t offer it cheaper than $0.30-0.50/kWh. Therefore at scale, electricity is actually more expensive.

There are many such examples.

reply

upvote

by Dylan1680711 hours ago|

[-]

Apples and Oranges. The utility uses a weird conflated fee that combines the price of the electricity and the price of connecting your house to the grid. If they split it up your marginal price per kWh would be much less.

reply

upvote

by sanderjd23 hours ago|

[-]

Yeah, I think the fallacy here is the conflation of scale and centralization.

Right now, there is way more scale in centralized AI than there is at the edge. But that could flip. I'd still probably put the probability that it will under 50%. But I'd also put it above zero!

reply

upvote

by KingMob7 hours ago|

[-]

Setting aside that very little about economics rises to the level of "facts of nature" like physics...

What makes you so certain that economies of scale won't work the opposite way you imagine? E.g., if model improvement tapers off, but RAM costs decline (hard to believe atm, but historically likely), then eventually everyone will be able to run SOTA models on their personal hardware.

Heck, even if model sizes simply grow more slowly than RAM costs decrease, the same would happen.

reply

upvote

by sanderjd23 hours ago|

[-]

... said the IBM executive to a young Bill Gates.

reply

upvote

by bogeholm23 hours ago|

[-]

> Cloud models […] don't consume so much power/generate heat

I do realize the cloud is just someone else’s computer right? Power goes in, tokens and heat come out - just in another place

reply

upvote

by actionfromafar23 hours ago|

[-]

The cloud computers produce more tokens per watt. That said, if you have a computer at home running 24/7 for other reasons and you also can use it for some LLM work, why not.

reply

upvote

by psychoslave1 days ago|

[-]

Anything done local will likely come at higher cost and at scale with less energy efficiency and commodity, with less possibility to fine tune engineer deeply on wider horizon of issues.

That's never the point of keeping local alternatives though.

reply

upvote

by dofm1 days ago|

[-]

Right.

For me this dates all the way back to installing Slackware 1.0 (0.99pl12!) on an offline 486SX rather than just using the internet-connected workstations in the lab.

Here, I already had a Mac that was powerful enough to run a local LLM, so now I do, because I can.

reply

upvote

[-]

deleted

reply

upvote

[-]

deleted

reply

upvote

by musebox355 hours ago|

[-]

Thanks for posting this. This is the tinkerer mentality. It is not for everyone, but certain things can only be learned in that way. It is the best antidote to AI paranoia. There is much that does not transfer between frontier models and local ones. There is that. But you can not tinker as much as you can with the former.

reply

upvote

by VerifiedReports23 hours ago|

[-]

Exactly. The distinction between the various layers in "AI" systems is pretty vague to the newcomer. What is the "model" vs. the engine "running" it vs. weights?

I don't recall any previous tech stack that was barfed onto the scene with so little background or reference material, going from zero to endless undefined jargon... and no primer in sight.

For people who demand an understanding of their tools, it's a lot of work. I recognize the value of "AI" in performing the tasks I'd have to do manually; for example, keeping the data structures of my front- and back-ends in sync in a project. But do I want to interrupt my development and take weeks off to digest all of these tools?

And if I do, I want to run the show and fully understand it. And like you, I think that's best done locally.

reply

upvote

by Fr0styMatt8823 hours ago|

[-]

The most unexpected thing for me was kind of philosophical in a ‘holy shit’ way.

Cloud models still feel ‘magic’, like you send a request off and get something back, like it’s something ‘special’. I used to joke that ChatGPT might be some kind of mechanical turk underneath.

Watching a model run local on your own machine hits different — you realise that yes, it IS just a computer program. Which for me actually makes me appreciate the leap we’ve made MORE, not less. From an information-theoretic point of view, LLMs really are something special.

The fact that they are just programs, that I’ve now experienced first-hand that they’re just programs, makes all those questions around consciousness and intelligence much more interesting.

reply

upvote

by dofm22 hours ago|

[-]

Yep — it hasn't changed how I feel about what LLMs are capable of (and very much not capable of) but this visceral feeling is fascinating.

Like, just watching a computer I already owned act like ChatGPT with the wifi disconnected.

It was the first time I stopped feeling quite so helpless, somehow.

reply

upvote

by QuercusMax22 hours ago|

[-]

Yeah, it's been fun for me running models (mostly Qwen 3.6 27B) on my 48GB M4 MacBook Pro. When i'm using it to run models, it's basically unusable for anything else - I actually do the work on my Macbook Neo. Took me a while to figure out why the models couldn't figure out how to make tool calls - because LMStudio by default uses a 32K input window, which is smaller than OpenCode's prompt, so half of the instructions were being pruned from the middle!

reply

upvote

by dofm22 hours ago|

[-]

Yes — there is a setting for that isn't there. And as soon as you realise there's a setting for that, you have new knowledge.

Qwen barely needs any of Opencode's prompt, in my experience; I think I cut it down to about three general lines I found by googling. Mainly you need only a pre-amble to make sure that the plan mode, plan switch and build mode prompt fragments make sense.

Gemma 4 also needs almost nothing at all, which is fascinating, considering it is not a coding-specialist model. It just seems to be who you need it to be when you ask.

reply

upvote

by hypfer10 hours ago|

[-]

What are those 3 lines you've cut it down to?

reply

upvote

by 3 hours ago|

[-]

deleted

reply

upvote

by QuercusMax14 hours ago|

[-]

[dead]

reply

upvote

by ricardobayes23 hours ago|

[-]

For the most part you can just download LM Studio and go from there. It provides a chat interface and an easy-to-use interface to browse, load and use LLM models. The engine: it is abstracted away by LM Studio, if you want to dig deep it's llama.cpp as the runtime. Weights are the files what you download, they are the models for practical purposes.

reply

upvote

by dofm23 hours ago|

[-]

I definitely would recommend LM Studio as a learning environment, because it surfaces a bunch of things in relatively clear-minded ways. I am very grateful for it.

reply

upvote

by codazoda1 days ago|

[-]

I agree with the learning aspect, but I have another motivation. I suspect that closed models might become too expensive to run for personal hobbyist use. I’ve been planning to buy a 64GB machine just to allow the limited local models this enables.

reply

upvote

by ehnto17 hours ago|

[-]

It's also great to have capability to run local models for more brute force tasks. Because you can change the system prompt, you can get local LLMs to do all kinds of high volume tasks without burning through tokens on a hosted model.

Just one example, I needed a bunch of images tagged and organised, with a local vision capable model I could pretty easily set that up and leave it running overnight.

I already had the GPU and memory for gaming, so it was at no cost for me to start running local models. But I feel the long term writing is on the wall, local models will only make more and more sense as they get better and more efficient.

reply

upvote

by bpye13 hours ago|

[-]

> The maths there is pretty undeniable, but it is not where I'd make the split. Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

Seems like a GPU with 12GB+ VRAM is going to be a much more affordable way to achieve that? Even a B580 should get reasonable perf there.

reply

upvote

by dofm13 hours ago|

[-]

No idea. I am a Mac guy, have been for a very long time. I buy them secondhand as a rule.

I guess I would build a powerful home LLM server if I was convinced I really needed one for my purposes for some agentic application or other. At the moment I'd prefer to ride this out with a machine that is also an excellent Mac.

reply

upvote

by not_kurt_godel23 hours ago|

[-]

> Having a machine that can run some modest local LLMs, like the Gemma 4 12B, is really worth it.

Agree having a powerful machine is really worth it in general for professionals, but strong disagree that running local LLMs has anything to do with it. It's hard enough as it is getting a good ROI on your time/money prompting/wrangling with frontier models. IMO leaning on the comparatively limited capabilities of local LLMs is best avoided in favor of keeping your own personal coding skills fresh and continuing to learn new ones.

reply

upvote

by dofm23 hours ago|

[-]

I'm not that bothered about my coding skills, which are fine, and pretty up-to-date considering I'm now an old bloke. I am bothered about building an instinctive understanding that helps me deal with my anxieties and decide whether I want to carry on with this working life or quit.

I needed to do this, this way, in my own time, to put my brain back together. It has worked for me, which is why I recommend it.

YMMV.

reply

upvote

by ricardobayes23 hours ago|

[-]

Unfortunately the local llm bunch is not the most emphatetic one in my experience: you are somehow "expected" to immediately know all this stuff and god forbid you ask the wrong question. I've never seen or felt this level of bullying and weird vibes over tools and LLM models. "My setup works for you or beat it".

reply

upvote

by sanderjd23 hours ago|

[-]

Where has that been your experience? My experience interacting with people about this is almost entirely in HN threads like this one, and I haven't found what you're saying here to be the case.

But if this is the case, as you say, it seems like a good opportunity to build a more welcoming set of entry points into this!

reply

upvote

by dofm23 hours ago|

[-]

There's also a lot of cargo-cult stuff, isn't there? Especially in the Reddit groups. Just do XYZ. And people ask why and they are never around to explain. Because, perhaps, they can't.

(Very reminiscent of 3D printing, where you get a lot of very trivial advice poorly applied, which is an analogy I've now made several times.)

Several of the youtubers are pretty helpful, though; I watched half a dozen things and absorbed the broad pattern and then went for it.

Also I got a lot out of reading HN comments, which is why I am here; tucked away in the corners of these discussions are people who can help. Over time I hope I am one.

reply

upvote

by sanderjd23 hours ago|

[-]

Continuing to learn new ones, like what?

To me, "how do contemporary AI systems work and interact with contemporary hardware and how can I best take advantage of their capabilities?" is the set of skills that are worth learning at this moment.

What else is there? New / additional programming languages? New / additional database systems? frameworks? orchestrators? cloud provider / infra tooling? architectural patterns?

I dunno, all of this seems really boring and "been there done that" to me at this moment in time!

reply

upvote

by not_kurt_godel22 hours ago|

[-]

Yes, that all tracks, and all of those skills are worth maintaining and improving. Great to tinker with LLMs locally hands-on to learn, and having a powerful enough machine to enable that to a reasonable degree is just one of many reasons why it's worth it. I'm just saying that IMO "how can I best take advantage" lands firmly in the bucket of only cloud-hosted frontier models being worth my time. I would speculate that holds true for a large portion of the wider HN audience but YMMV of course.

reply

upvote

by sanderjd22 hours ago|

[-]

Maybe. I felt this way a year ago and definitely two years ago. But now my sense is that it's played out at this point, and the valuable thing to build expertise on now - precisely because I think it's coming rather than here - is local / open weights / hybrid models and harnesses.

reply

upvote

by ricardobayes23 hours ago|

[-]

I'd say give it some time for the dust to settle. This field badly needs standardized benchmarks even before the conversation around model goodness can start.

reply

upvote

by ddalex1 days ago|

[-]

I just got Claude to download and install all the models and servers and agents and prepare all the launch scripts for me... no need to learn, just ask it to do it for you

reply

upvote

by dofm1 days ago|

[-]

Right, but I am a middle-aged bloke who is experiencing existential angst about whether I can carry on in this industry.

I have a pretty deep, maybe paranoid need to be confident I have an intrinsic understanding, and I have found in my life that lessons come to you when you make yourself open to learning.

So I need to build on top of what I know, taking as much of the hard way as I can bear to take at any one time — it has to be not quite difficult enough to put me off.

I can't really explain what I have learned this way that is different, but I feel it in a way that I wouldn't if I'd simply pushed a button.

For the same reason, I have a really basic 3D printer that I've set up myself, set up Klipper, configured how I want it, learned how to calibrate, all that. And now I can say that I feel I have an understanding of 3D printing. I could hold my head above water in a discussion with a real expert, maybe find work in an adjacent field where my insights would keep me grounded.

I can afford a really good printer that has all that set up, and more, has no problems. But I'd just be someone who has a 3D printer.

(Also who am I kidding about the existence of a printer with no problems)

reply

upvote

by greyskull23 hours ago|

[-]

This really resonates with me, and I'm only a decade and change into my career. I use claude a lot day to day. I try to use it sensibly, making me more productive and produce better work. I'm also trying not to lose understanding along the way. I want to be able to actually talk to the conclusions I'm reaching.

I have colleagues that seem perfectly content to delegate too much to the agents, and it saddens me. It feels like there will be swaths of engineers that didn't train some of the critical thinking skills that I take for granted.

I certainly see it in slack discourse around anything more complicated than a feature implementation. Maybe I'm just cynical. Time will tell, I suppose.

reply

upvote

by bluGill22 hours ago|

[-]

You will not live enough to learn everything. Eventually you have to say "I could figure [something] out but I won't take that time." Most things are that way - I probably could learn brain surgery (I used this example because it has a reputation of being a very difficult course of study). I would like to make a lathe from scratch - but I don't have easy access to enough iron ore to get started - even if I start from scrap metal, I probably wouldn't spend months making my own surface plate (...) and so I own a factory made lathe instead.

That is why I'm content to delegate to agents - I have more code/features I want to write than I have time to debug (writing is the easy part).

reply

upvote

by sanderjd22 hours ago|

[-]

For me (about halfway between you and dofm in my career by your own statements in this thread), it's a dream at the moment. I can delegate all the tedious stuff that I've done "the hard way" a thousand times already and feel I have very little of value remaining to learn, so that I can spend more time on all the things that are actually new and thus much more interesting.

reply

upvote

by greyskull22 hours ago|

[-]

It's been a great multiplier for me in similar ways. The "dreamiest" thing has been that it has freed up time that I would normally have spent doing sprint work, to work on things that just don't make the cut until it's bad enough to deprioritize other work.

Over the last few months, I've been digging into performance problems with a high throughput service that my team owns. I started working on the problems in my own time, put out short and medium term improvements that legitimately avoided operational issues, and started developing an alternate architecture that should meaningfully address the problems for the long term.

I've learned new things and made improvements that probably wouldn't have ever gone in otherwise.

reply

upvote

by sanderjd22 hours ago|

[-]

Yes exactly. There is a narrative that it's driving everything toward low quality slop, but in my own work it's exactly the opposite. We're doing work on quality and performance that we never would have gotten to in the past.

I've spent my whole career being frustrated by the pile of low severity bugs and performance issues that "I could fix that if I could only justify putting a couple hours into it!". And now I can just fix all those. Nobody is going to question my use of time to write prompts and do code reviews of those things, when I can to my "real" work simultaneously.

reply

upvote

by sanderjd22 hours ago|

[-]

Yeah, this is just the engineer's mindset. It's not surprising that this is a popular view here, even if it is not (and does not need to be) the mainstream perspective.

reply

upvote

by greyskull22 hours ago|

[-]

> mainstream

What does "mainstream" refer to when we're talking about software development and LLMs? As opposed to "engineers".

reply

upvote

by sanderjd21 hours ago|

[-]

This is a very fair question! When I wrote this comment, I was definitely thinking of the "real" mainstream, i.e. users of llm chat to generate text, not software engineers.

But I think there is (and has always been) also a distinction between the "mainstream" of software developers vs people who are working on new tools and capabilities to be used by that "mainstream".

IMO it is certainly true that the most efficient and cost effective was to do "mainstream" software delivery at the moment is hosted frontier models. But for people thinking about "what's next?", it makes a ton of sense to be exploring different models in anticipation of a possible (but certainly not inevitable) sea change.

reply

upvote

by swiftcoder1 days ago|

[-]

I don't necessarily think your answer is wrong for all people, but if you work in software... how do you plan to differentiate yourself from everyone else out there, if the depth of your understanding is "Claude can do it for me"?

reply

upvote

by dofm1 days ago|

[-]

This ultimately is the discussion I am here for.

I mean one of the things I use a local LLM for, because I can, is to generate starter documentation. But I ask it to — I want it to give me overviews, plans, all that. It can make something bespoke for me.

I guess I could also ask it to do the work. But where do you draw the line?

The universal labour-saving device is the great provocation of the next 100 years I think, and both Star Trek and Wall-E have grappled with it.

reply

upvote

by coldtea1 days ago|

[-]

>no need to learn, just ask it to do it for you

And that's how skills die.

reply

upvote

by ddalex12 hours ago|

[-]

And why is this skill important, if a machine can do it ? What's the last time you ploughed your field with oxen ?

reply

upvote

by charcircuit1 days ago|

[-]

Except with AI models it's possible to make a backup of them creating a permanent artifact of a skill.

reply

upvote

by CamperBob21 days ago|

[-]

When's the last time you shoed a horse?

The reason I delegate so much of local LLM installation and administration to Claude Code is simply because there's no point learning practical things that will work completely differently in a couple of years, or in memorizing procedures that I'll forget long before I need to perform them again.

No longer having to sweat all the details is a Good Thing, not a Bad Thing.

reply

upvote

by dofm1 days ago|

[-]

I am not sure I disagree, and I certainly don't mean to disagree very fervently.

But I think if you want to really learn to ride well, understand horses well, there might be some benefit in learning how to shoe a horse. At some level it should never only be someone else's job.

reply

upvote

by verdverm1 days ago|

[-]

At the same time, most people can drive without understanding how a car works.

reply

upvote

by coldtea22 hours ago|

[-]

Yes, and they're all the worse, more at the mercy of car companies and mechanics, and less aware of the world they live and operate in, for it...

reply

upvote

by saganus23 hours ago|

[-]

You actually do need some understanding of how a car works, no?

For example, you need to know it uses gasoline (or diesel), it requires oil changes every certain amount of time, break pad replacement, etc.

You also probably need to know that you can't operate cars over a certain amount of water, that you need a driver's license, stopping at red lights, etc.

Sure, you might not need to be a mechanic, but that's far from not understanding how a car works, which to me sounds similar to knowing how to shoe a horse, which is different than being a horse vet.

reply

upvote

by WickyNilliams1 days ago|

[-]

If I worked with horses for 8 hours a day I imagine the answer would be "recently"

reply

upvote

by psychoslave1 days ago|

[-]

Having to shoe a horse never was a general skill.

Maybe a more apt analogy would be a skill like making fire without a lighter.

reply

upvote

by sanderjd23 hours ago|

[-]

Writing software never was never a general skill either though? Or am I misunderstanding your point?

reply

upvote

by psychoslave21 hours ago|

[-]

Yes, LLM are thrown through pretty much everyone digital life whether they like it or not, it's not just devs. It might even unlock exploring things that need code that average user wouldn't have dared to do before.

reply

upvote

by coldtea22 hours ago|

[-]

>When's the last time you shoed a horse?

That skill died too, so what's your point?

reply

upvote

by CamperBob221 hours ago|

[-]

Skills sometimes do that. What's your point?

reply

upvote

by coldtea19 hours ago|

[-]

Skills are good. They shouldn't do that.

reply

upvote

by sorokod1 days ago|

[-]

Then what is the point of ddalex?

reply

upvote

by dofm1 days ago|

[-]

I think if you really don't feel the need to know the "why" of everything, sometimes this might be the right approach. It is quick, pragmatic, gets you started.

Maybe my biggest problem with the world of agentic AI, and the reason I am putting myself through learning it the way I am, is that the need to know the "why" of everything is so fundamental to me, that I don't know if there is any point to me without it.

So this is really the only way I know how to proceed.

reply

upvote

by sanderjd22 hours ago|

[-]

To me, this is just a question of specialization. Not everyone needs to be a "I understand how the system actually works" person. In fact, not many people need to be that person. But every system does need some of that person to exist!

And we happen to be discussing this on a forum where the type of people who will be the specialists for the kinda of systems we're discussing are likely to gather.

I'd be surprised if in my casual discussions out in the real world, I were to run into a lot of people who care exactly how all this works, to the extent that they want to invest significant money into hardware that allows them to run things themselves and dig into what's actually going on. But I'm not at all surprised to come across such people here! (Indeed, it would be very disappointed if I didn't!)

reply

upvote

by nazgul178 hours ago|

[-]

I think the more you know of how (many) things work, the slightly better you'll be at using them. From dishwashers to CPUs, from car engines to watercolours, from guitars to kitchen knives... You get the gist. Once you internalize a model of the thing, it becomes closer to an extension of you than a tool. You drive it better and with less friction.

reply

upvote

by sanderjd5 hours ago|

[-]

Yes agreed, but there is limited time in a life, so there is a fairly high opportunity cost to internalizing a model of many things, which scales quickly with the complexity of those things, so people rationally limit the number of things they invest their time in. For the vast majority of people, I think it makes a lot of sense for AI systems to fail to make this cut. But for most of us here, on a site for computer technologists, it almost certainly makes sense for us to learn as many of the details as we can manage.

reply

upvote

by kdkdjduxnd1 days ago|

[-]

[dead]

reply

upvote

by rusk1 days ago|

[-]

> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled.

I found LM studio to be a nice starting point. Frindlier and more featureful than Ollama and not as intimidating as llama.cpp (though you will want to use that eventually)

reply

upvote

by dofm1 days ago|

[-]

LM Studio is also nice because of the way the interface explains things; parameters have explanations and hints. It has been designed by people who really care about making it understandable.

I tried Ollama but I've settled on Unsloth Studio generally; once things really settle down I'll just run the llama-server UI, which is pretty nice.

A friend is tinkering with LLMs for amusement on a 16GB Raspberry Pi 5, and when I explained that llama.cpp now had a typical web chat interface he was so happy — it's amazing what the "table stakes" are now.

reply

upvote

by m3kw97 hours ago|

[-]

What’s the use of a 4gb Gemma other than to just play with it ?

reply

upvote

by oceanplexian1 days ago|

[-]

Honestly your best bet is to buy a $20 Claude subscription, ask Claude to set it all up with Pi and llama.cpp and come back in 20 minutes after a cup of coffee. This is also a good idea because it will help set expectations of what a local model can do vs. a frontier model.

reply

upvote

by mullen23 hours ago|

[-]

This is what I did after struggling to get llama.cpp working at a decent speed on my M1 Macbook. The secret is to very specific with your needs and targeted in what you are using llama.cpp for. Mine setup is just about strictly for qwen3-coder and now, I get a fairly decent speed out of it. I also installed Cursor to check Claude and it all worked out well.

reply

upvote

by kristianp16 hours ago|

[-]

Are you talking about Qwen3 Coder 30b a3b Instruct from August 2025, which is a non-reasoning model? Or the more recent "Qwen3 Coder Next" from Feb this year with 80b params, 3b active? I found Qwen3 coder next to be quite good on openrouter [1], but couldn't run it locally.

[1] https://openrouter.ai/qwen/qwen3-coder-next

reply

upvote

by trey-jones17 hours ago|

[-]

I don't know why we're even talking about Qwen3.6 for writing code when qwen3-coder exists. My experience is there's no contest. I'm using 30b with 96k context on a dedicated server.

reply

upvote

by fouc16 hours ago|

[-]

For agentic workflows like tool use, editing codebases, multi-turn debugging?

reply

upvote

by cyanydeez1 days ago|

[-]

I've setup to local paradigms for local coding:

- opencode with it's webui

- deer-flow with it's research/powered front end

They both run websites so you don't have to baby sit them (eg, keep your mac open). I've build a pdf compressor over a few days by first having deer flow try and research the frameworks and pipeline. It stalls out because its not really a fluid programmer. Once it stalls out, I transferred it (manually for now) to opencode and it's refactoring it because it's just a collective bundle of sticks and it needs a lot of testing to tweak out the limited scop context. LLMs can't really hold large scopes (locally anyway, from what I've read from HN, it's possible with longer context).

It'll complete in a few days with maybe 3-4 hours of full attention interaction, but it's running 3x that without my attention. Obviously, if I paid more attention it'd run quicker, but since it's local, it's not pumping out large volumes of code, it's mostly looping over tests and capabilities as observed.

It's running Qwen3.6 35B MoE on a AMD 128GB strix halo. If I switched to the dense models, perhaps it'd be smarter, but the trade off seems to be much slower gen.

reply

upvote

by dofm1 days ago|

[-]

> - opencode with it's webui

Have you tried Paseo?

I have opencode in a VM, and the paseo daemon running in the VM, and then the Paseo Mac app. Really nice.

(You can also use the Opencode GUI to frame a remote opencode web interface)

reply

upvote

by c-hendricks1 days ago|

[-]

You can also just add OpenCode web as a PWA, if that's what you mean by "frame".

I'm gonna check out paseo, but am not looking forward to all the ram the agent needs + all the ram paseo needs

reply

upvote

by c-hendricks20 hours ago|

[-]

Have checked out Paseo, not sure what it offers over opencode web though. Definitely seems great if you're using other harnesses, but it seems like all it has over opencode web is split views and native apps. Neither of those really matter to me, plus you lose some opencode goodies. The preview urls are a neat idea, but our dev servers at work are mostly port independent and required to be on a certain subdomain for auth.

reply

upvote

by bsder21 hours ago|

[-]

> I totally struggled to find the right frame of mind to explore any of this stuff without feeling defeated and bamboozled. Because it's just huge, exhausting, jargon-drenched, unknowable, and I am over the hill at fifty-plus.

Hello, my brother, just know that you have a fellow passenger in life at the same age who thinks the same thing. I agree that the local stuff is helping my understanding a LOT.

However, my gut feel as someone who got to experience the TeleBomb after the DotBomb is that the obfuscation is INTENTIONAL--it's neither you nor your age. I remember asking people to explain to me what the OC-768 startup endgame was when roughly 10 OC-768 links could carry the world's traffic at the time--and everybody giving me blank looks. The AI Bubble has the EXACT same feel as the Telecom Bubble--just bigger.

What I really wish is that I could find a VPS-type provider where I could toss things into their NVIDIA/AMD machines for an hour or two. Alas, all of the providers seem to want massive paperwork and huge minimum purchases.

I can't wait for the bubble to pop so that we mere mortals can finally build with this stuff.

reply

upvote

by hughw5 hours ago|

[-]

You looked into vendors like https://www.runpod.io/?

reply

upvote

by porphyra1 days ago|

[-]

You can also run Qwen 3.6 27B dense model on DGX Spark with comparable performance [1][2] for about $4000 (Asus Ascent GX10 is $3999 at various retailers).

In theory you can also get 48GB of VRAM with, say, two 3090s, but it will take up a lot of space and generate a lot of heat compared to the Macbook Pro and GB10.

[1] https://x.com/MiaAI_lab/status/2070859135399182444

[2] https://github.com/MiaAI-Lab/Qwen3.6-27B-NVFP4-vLLM

reply

upvote

by Zetaphor16 hours ago|

[-]

Alternatively you could run it on Strix Halo for $1,000 less, and while it may be slightly slower you won't have to deal with NVIDIA's shit on Linux and worrying about having to use their custom kernels or Ubuntu.

reply

upvote

by esperent1 days ago|

[-]

> 48GB of VRAM with, say, two 3090s

So like... $2000+ just for the used GPUs? Plus I assume it's considerably more effort to get it working.

reply

upvote

by fluoridation1 days ago|

[-]

>Plus I assume it's considerably more effort to get it working.

Nah, not really. It is a little annoying in terms of space and power, though. Not every case and motherboard can support cards that big.

reply

upvote

by lee_ars22 hours ago|

[-]

The tweet you link shows "Qwen 3.6 35b NVFP4 - 256k ctx, 110 tok/s", but I'm getting only half that, around 50 tok/sec, on a DGX Spark with Qwen3.6-35B-A3B-NVFP4 (via vLLM) plus speculative decode w/EAGLE3. I'd be ecstatic to see 110 tok/sec and I wish they had some more sourcing for the exact config, because it's double what I'm getting.

edit - after actually reading the tweets (had to use xcancel) and visiting the source git repo, switching to MTP for speculative decode makes things a hell of a lot faster, and the abliterated model plus dflash makes it even faster! I'm now seeing 70-90 tok/sec for most stuff. I like!

reply

upvote

by porphyra19 hours ago|

[-]

I think Atlas might also be slightly faster than vLLM:

https://flowtivity.ai/blog/120-tok-s-1m-context-private-ai-d...

reply

upvote

by Catloafdev1 days ago|

[-]

The model they reference can be easily run with 24gb+ of VRAM, and there are other similar models capable of running easily on 16gb of VRAM. It's not like 128gb is a requirement here.

reply

upvote

by bitexploder1 days ago|

[-]

For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4, you could probably optimize it further. RAM is not a limitation but overall memory bandwidth. Q8 is slower. 35B A3B Qwen is quite speedy, but a little less accurate. With Qwen 3.6 27B dense I can squeeze a 9B parameter model and use that for fast analysis or code scanning while 27B is churning on a task in the background. It is tight, but totally reasonable.

The real sweet spot for Qwen 27B is getting it on something like a Dual 3090 system or some other config where it can blaze at 50-80 t/s and that costs well under 6K currently. It is a surprisingly capable model. Using something like GLM for orchestration, specs, task farming and then letting Qwen churn is relatively inexpensive.

Overall I recommend people try models of this class out using OpenCode and some for pay service to experiment with them and understand how they work. I find they are very useful.

Long term, I am convinced enough that if I wanted to use local models for any number of reasons I would be okay investing in a dual GPU box. The Mac is not fast enough for me and M5 Max is just too expensive relative to GPU linux box. Still, it is nice to have the models local ON the laptop and it is useful for what I care about locally.

reply

upvote

by aunty_helen23 hours ago|

[-]

I was doing some benchmarking last night on 2 3090s. The systems but old but I’m seeing 11tks 27b, 15tks 35b MoE.

The limited context is problematic. I’m not exactly sure what it’s got available but hermes was hit and miss on a prospecting job.

It does seem to be doing useful work but it’s not API call level quality

reply

upvote

by coder54320 hours ago|

[-]

> The systems but old but I’m seeing 11tks 27b, 15tks 35b MoE

If that's accurate, then you must be doing something wrong/weird. On a single RTX 3090, I'm seeing substantially higher performance. Dual GPU won't necessarily give a ton of performance improvement, but it shouldn't hurt performance.

With llama-bench, I just measured Qwen3.6-27B at 41 tok/s and Qwen3.6-35B-A3B at 153 tok/s on one RTX 3090. (Those results are without MTP. With MTP, I'm seeing about 65 to 70 tok/s for Qwen3.7-27B.)

I'm using the unsloth UD-Q4_K_XL quant. If you're using bf16 for some reason, that could explain the low performance and inability to have enough context despite having 48GB of VRAM, I guess, but... don't do that.

reply

upvote

by aunty_helen3 hours ago|

[-]

Good to know. Might be worth updating the motherboard then, it’s limited in pcie speed.

reply

upvote

by coder54320 hours ago|

[-]

> For a MBP I have 48 GB of RAM M5 Pro. It runs at about 12-14 t/s at Q4

Are you running with MTP enabled? I have seen some people on M5 hardware report 20+ t/s on Qwen3.6-27B using MTP... and I think that was a regular M5, not even M5 Pro.

reply

upvote

by bitexploder19 hours ago|

[-]

Nope. MLX in LMStudio. The simplest config with zero tuning effort.

reply

upvote

by coder54319 hours ago|

[-]

Unsloth Studio is also very low effort, and a lot better than LM Studio in my opinion. (Performance, compatibility with Gemma 4, actually open source, etc.)

reply

upvote

by CMay23 hours ago|

[-]

At 24GB, Gemma 4 31B QAT will be better and give more concise answers. This post is mostly about unquantized results, so it's less relevant and I can't say much about as I haven't tested Qwen or Gemma via cloud API or unquantized locally. All I can say is locally, quantized in a 24GB scenario, Gemma 4 31B is better in my tests which are mostly reasoning or C programming related.

Gemma 4 is the only model series at this parameter scale I've seen correctly answer some of these. One of the answers even made me re-evaluate what I thought the correct answer was, which I did not expect.

When I look at the Artificial Analysis numbers, I can see that some things about Qwen 3.6 look inflated as a result of either metrics that weren't measured yet for Gemma 4 31B, or for metrics that just aren't going to be relevant in a lot of the essential tasks. In a lot of the relevant metrics, Gemma 4 is either better or on par.

Then once it's all quantized all those benchmark results will be hurt, and Gemma 4 QAT has better quantized performance. I think it's more competitive unquantized than people give it credit for and way better quantized than people give it credit for.

Qwen 3.6 clearly isn't legitimately bad and maybe it's quite nice at fp16, but it was a disaster quantized in a 24GB scenario by comparison.

reply

upvote

by thewebguyd1 days ago|

[-]

I'd go for at least 32GB+. It'll fit in 24GB but leaves you little to no room for context, and that's at 4-bit quantization.

If you want to run unquantized, you definitely need 128GB.

reply

upvote

by Catloafdev1 days ago|

[-]

Nobody runs unquantized, there's literally no reason to. Q8 would be the largest anyone actually runs on consumer hardware for inference.

reply

upvote

by 22 hours ago|

[-]

deleted

reply

upvote

by bityard22 hours ago|

[-]

Halving the precision of the weights is not a free lunch...

reply

upvote

by Catloafdev20 hours ago|

[-]

Q8 is virtually lossless. The quantization is much more noticeable around Q4 and below. FP16->Q8 on consumer hardware is 2x the speed at ~99.99% the quality.

reply

upvote

by rvba11 hours ago|

[-]

Any source that confirms the 99.99% quality?

reply

upvote

by bitexploder1 days ago|

[-]

It also comes down to inference speed, not "can I run this". 8-bit quant is quite a bit slower on an M5 Pro.

reply

upvote

by gchamonlive1 days ago|

[-]

[dead]

reply

upvote

by Numerlor1 days ago|

[-]

And if you go for actual GPUs it'll run much faster, I'd say 24gb may be pushing it for context, but my 5090 with 32GB VRAM is usually somewhere between 60 to 100 tok/s with mtp and 2-3k tok/s for prompt processing. I'm not sure what they cost now but it's definitely still quite far from the macbook, and there's also some other 32GB GPUs that are considerably more affordable

reply

upvote

by nok22kon1 days ago|

[-]

a computer with 24 GB VRAM is at least $3000

reply

upvote

by daemonologist1 days ago|

[-]

A 7900 XTX is about $850, and the rest of the computer basically just needs to boot Linux. You could easily build such a machine for $1500.

Even that isn't strictly necessary - you can get perfectly acceptable performance by splitting a model between multiple older 12 or 16 GB cards.

reply

upvote

by sleepyeldrazi1 days ago|

[-]

I can't speak for the US, but in Germany (where hardware is usually more expensive, not less), I got my 3090 3 months ago for 750 euro and have been running the iq4_nl 27B using q4 kv (which after recent patches in llama.cpp is in my xp indistinguishably accurate from q8 of f16) at full ctx, with MTP at 2, peaking around 70 t/s on small ctx, around 50 t/s when im around 64k and ends around 40 t/s near the cap. The rest of the PC is a 50 euro ddr3 16gb i5 4th gen box, absolutely nothing special. And this setup is often more useful than dsv4pro (and sometimes kimi, but not glm) for research and ML work.

reply

upvote

by danilocesar23 hours ago|

[-]

I can't find a 3090 for less than 2k CADs (or 1200 eur). Is this the average price in Germany? It's pretty cheap.

reply

upvote

by sleepyeldrazi10 hours ago|

[-]

I got it off kleinanzeigen, its a ebay-like site (but mostly 'pick it up yourself' instead of delivery). Looking at it right now, i do see multiple sales for 850-900. I did spot the 750 one after frequenting the site for a week or two, so it may be a bit of a 'better than average' deal, and it seems most are in the 1k euro range, but there are a handful available under.

As of writing this, it shows 24 offers between 700 and 950.

reply

upvote

by akman23 hours ago|

[-]

I'm also curious, as this could pay for a trip out there, especially if buying for friends.

reply

upvote

by throw12345678911 days ago|

[-]

But the tokens or credits are gone. MacBook stays. You can run other models on the same MacBook. What I read people burn every month on saas… for that money you break even on that MacBook in 5 months.

Edit: it’s not just “data privacy”, when you are using Claude, you are shipping EVERYTHING to Anthropic. It’s crazy.

reply

upvote

by wilsonnb31 days ago|

[-]

Companies are already shipping everything to Microsoft or Google and 17 other companies, just the cost of doing business.

reply

upvote

by throw12345678911 days ago|

[-]

Sure, but no one gets everything. Just that one.

reply

upvote

by DANmode1 days ago|

[-]

That’s at today-prices.

If the cost doubles, or 4x, which is seems to need to for them to go profitable, what then?

reply

upvote

by wahnfrieden1 days ago|

[-]

It's much slower, and often quantized

reply

upvote

by throw12345678919 hours ago|

[-]

Okay, and?

reply

upvote

by acchow1 days ago|

[-]

That $6700 is a $5000 upgrade over a base model Macbook Pro.

$5000 in US Treasuries (currently at 4.89%) yields $244.5/yr. That's more than enough to cover the annual Claude Pro subscription ($200/yr) which includes Claude Code with lots of Sonnet usage (far better than Qwen 3.6)

reply

upvote

by neonstatic22 hours ago|

[-]

I think the argument isn't that local is cheaper - it's that local is doable and delivers unparalleled privacy.

reply

upvote

by iosjunkie17 hours ago|

[-]

And your government can’t take it away on a Friday afternoon.

reply

upvote

by stymaar1 days ago|

[-]

> The article is based on running Qwen 3.6 on a 128GB MacBook Pro. For reference, a 128GB MBP currently starts at $6699 USD [0]

Qwen3.6-27B would be faster on a 3090 that costs around $1000-1200 though so I don't think it's a good counter-argument.

Op just happened to have that MacBook, but it doesn't mean it's necessary to run the model.

reply

upvote

by boutell1 days ago|

[-]

That 3090 is going to burn 750W and it will still cap you at a 4 bit quant and ~48K context. Here's someone who worked through it:

https://github.com/noonghunna/qwen36-27b-single-3090

Flies though (50-70tps is impressive for a model this smart)

I went through roughly the same process to get it working on my M2 Macbook Pro... at awful speeds of course, since models like this one are mostly bound by memory bandwidth.

reply

upvote

by stymaar1 days ago|

[-]

> That 3090 is going to burn 750W

The 3090's TPD is 350W, but given that LLM's token generation isn't compute bound, people usually undervolt these cards to reduce power consumption. IIRC you can get as low as 200-250W without any degradation. Caveat these figures are without speculative decoding and at batch size =1.

reply

upvote

by 4chandaily1 days ago|

[-]

This is correct. I have (4) 3090s in my inference server, and they are each capped at 250w. I run Qwen 3.5 122B-A10 at about 45-50tok/s on this and am quite happy with it. At idle it draws around 95-105w for all four, which is a bit high, but tolerable.

reply

upvote

by hughw23 hours ago|

[-]

My eyes glaze over reading all the AI produced verbiage.

I did find a few useful parameter settings I've already discovered using my single 3090 and ollama.

I'm just remarking that the LLMs overwhelm me with minutiae, especially as I'm working on code design. I frequently ask it to restate concisely, and that helps.

[edited to mention ollama as a nice alt]

reply

upvote

by nozzlegear1 days ago|

[-]

Just putting it out there: I run Qwen 3.6 on my M1 Mac Studio with 64gb. It's quantized and all that, but I agree with TFA: it's the sweet spot for local development right now.

reply

upvote

by dmayle1 days ago|

[-]

For that price you can put together a PC with 128GB of ram ($2000) and an RTX 5090 ($3600) and get 70-100 tokens per second instead of 45

reply

upvote

by montebicyclelo1 days ago|

[-]

Isn't the directionality important. I.e. it is currently possible to run useful / great models locally, but on high end machines; and in a few years we will likely be able to run even better models on standard machines.

reply

upvote

by razster3 hours ago|

[-]

I'm running it on my 4070 12gb with 96gb mem, I'm very happy with the results even if I have to wait a couple minutes for results. To me this is far better than I expected and will continue to use it and improve with skills.md. Pi.dev is amazing by the way.

reply

upvote

by organsnyder1 days ago|

[-]

I run Qwen 3.6 on my Framework Desktop 128GB, and it's very performant. I know Framework has had to raise the price since I preordered mine, but they're still well under half the cost of that Macbook.

reply

upvote

by SomeHacker447 hours ago|

[-]

Can you please explain how you set it up? I run it on my 129G Strix Halo under Arch with Lemonade with OpenCode and it just sits there doing barely anything unless I leave it to run over night. Then it says it thought for 13.7 seconds but was really 15 minutes. Thanks! I am using the 27B dense MTP model quantized by UnSloth with the UD-Q8_K_L if memory serves.

reply

upvote

by andy991 days ago|

[-]

I get ~55 Tok/s on my framework desktop with the 35B A3B q8 model, and so far am also very happy with the coding performance.

reply

upvote

by cyanydeez1 days ago|

[-]

did you upgrade to MTP?

reply

upvote

by imrehg12 hours ago|

[-]

On the MoE versions of these models the MTP versions have only marginal benefit. In my trials the speed-up is <20% (not the ~2x that happens with some other setup/models) and usually more like 10%. Ie. something like 13 -> 15 token/s... on my device.

I still use the MTP version as it _feels_ slightly better quality, and because the unsloth quantizations I can get have more variety to fit into the various systems at hand... but that's not for the MTP aspect, unfortunately.

In the article they did have ~2x performance on the 27B (which might be something to retry, though on my Framework that would bring it from 5 -> 10 token/s so still "excrutiating" speed, probably).

YMMV for sure.

reply

upvote

by andy994 hours ago|

[-]

That was with the MTP version

reply

upvote

by bityard22 hours ago|

[-]

There are several variants of Qwen 3.6, the MoE models are performant on Strix Halo, but the 27B dense model (the one spoken about in TFA, and generally regarded as the best of the group in terms of quality) is not so performant: https://kyuz0.github.io/amd-strix-halo-toolboxes/

reply

upvote

by elorant1 days ago|

[-]

You can get an AMD Strix Halo with half that price even after hardware price adjustments. Besides you don't need 128GB of RAM to run a 27B model.

reply

upvote

by dannyw1 days ago|

[-]

I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent. You definitely don’t 128GB. That’s the scale for 70B models at q8 or something.

reply

upvote

by dom961 days ago|

[-]

I've been running it on my 48GB MBP too and it's not particularly great. Super slow and not near enough to the quality provided by even Claude Sonnet.

reply

upvote

by doodlesdev1 days ago|

[-]

How much does one of those cost in the US? Here in Brazil, your notebook is worth as much as a used Honda Fit, which seems absolutely insane. For comparison, the ThinkPad I'm currently running cost me 1/20 of how much this MBP costs here, leaving me with over $8.000 to spend with LLM inference (if I actually spent money with that).

reply

upvote

by dannyw1 days ago|

[-]

I purchased mine for approximately $4400 AUD before the price hikes. That unit is now ~$5100 AUD.

I use my MBP essentially as my workstation, it's almost always plugged in. I have a MBA (M4, 24GB RAM) that I picked up for ~A$1500 or so, and that's an amazing daily driver. I don't do local LLM inference on that unit, I can just hit my own APIs (via LM Studio) on the MBP over Tailscale.

reply

upvote

by DrammBA15 hours ago|

[-]

> I’m running the same model on a 48GB MBP with a q4 quant and it’s pretty decent.

Context size?

reply

upvote

by shockembopper18 hours ago|

[-]

I’ve got qwen3.6 27b running on my media server atm. Given that I built on top of what I already had, it didn’t cost me nearly that amount. I’ve been running 2x 5060 ti 16gbs, and when using text only and nvfp4, I can run the model with 200k context length and roughly 50-60 toks. It’s very good, and costed me about $800 after buying the gpus from microcenter.

reply

upvote

by georgeven1 days ago|

[-]

I have a 1500 dollar machine that can run it at 50 tok/s (3 V100s)

reply

upvote

by Dig1t1 days ago|

[-]

How did you buy 3 V100's for $1500??

reply

upvote

by sixdimensional14 hours ago|

[-]

Not OP and just guessing, but probably SXM2 GPU modules for the V100. Those can be acquired fairly inexpensively, but there is work to do to get them working together and the V100 has some limitations on the types of models you can run.

reply

upvote

by jeffybefffy51920 hours ago|

[-]

I still dont trust the Anthopic and OpenAI are not training on my code. I even just thinking keeping track of what code you have received in prompts and to train/not train on it seems like an impossibly difficult task.

reply

upvote

by andrekandre19 hours ago|

[-]

am i right in assuming your code is closed-source?

i'd expect anything on github for example to be already in their training set or is training on actual usage more useful to them?

reply

upvote

by redox991 days ago|

[-]

I bought 2 used 3090s some years ago for $500 each. They're probably a bit more expensive now, but I guess for something like $2000 you can build a barebones 2x3090 PC which will be way faster than a Macbook. (you're fine with very basic hardware outside the GPUs)

reply

upvote

by stared21 hours ago|

[-]

All experiments with Qwen 3.6 required no more than 48GB Apple Silicon. I believe you can go even further with more aggressive quantizations - one can go down even further.

In any cases, from the economic point of view, running models on laptops make little sense. Even at the pure cost of energy consumption, it might be hard to beat pricing at tokens generated at scale.

At the same time, it is a breaktrough, that will change the game. Previously such vibe coding on consumer device was not hard or costly - it was impossible.

reply

upvote

by pimeys10 hours ago|

[-]

Yes. It is very expensive now. I'm still so so happy I decided last summer to bite the bullet and pre-ordered the Framework Desktop 128GB model.

I paid 2424 euros in total for this machine. And it can easily run the models discussed in the comments and in the article. It's tiny, and runs CachyOS like a champ. Over 4000 euros less than the price you listed.

We can all send a thank you letter for our friendly billionaires such as Sam Altman for the price situation we're in today: https://www.mooreslawisdead.com/post/sam-altman-s-dirty-dram...

reply

upvote

by trentor1 days ago|

[-]

Runs fine on 2x4080s or on two 5060/5070s with 16GBVRAM... and faster than on the mac.

reply

upvote

by dvduval1 days ago|

[-]

Absolutely for the average developer the token speed is just going to be too slow for it to be workable. I think we’re looking at 2028 when memory becomes cheaper again and they’ll be a lot more people using local models.

reply

upvote

by cyanydeez1 days ago|

[-]

AMD started their 128GB Halo Strix at a pretty damn good point at ~2.5k; I got mine after the first memory bump at $3k.

I think you might be a little to into the stew here.

reply

upvote

by zdragnar1 days ago|

[-]

I got mine at the same price point, and I've been pretty pleased with it. Tailscale lets me use it from my ultrabook / lightweight laptop, no burning lap or crazy fan noises. Desktops with the amd ai+ 395 are still fairly affordable for what they can do.

I haven't tried it with https://lemonade-server.ai/ yet but I just might give it a shot.

reply

upvote

by organsnyder1 days ago|

[-]

I'm running Lemonade on Nixos on my Framework Desktop. I had been trying other tools out before finding Lemonade, but Lemonade really made it plug-and-play.

reply

upvote

by Insanity1 days ago|

[-]

But you have to factor in that this device will last you 5-10 years. That said, I wouldn't spend almost $7k USD on this macbook lol.

reply

upvote

by petilon1 days ago|

[-]

Memory requirements of newer models will increase, so while the hardware may last 10 years it won't be able to run the latest models for 10 years.

reply

upvote

by roadside_picnic1 days ago|

[-]

My experience working in the open model space pretty deeply (both LLMs and diffusion models) for years now is that it is not quite as simple as that.

In the open model space an insane amount of effort goes into getting more powerful models to run with the same or less RAM. For example in the diffusion world many things that could not be run on easily under 24GB of VRAM actually run much better today with much less VRAM than they did a few years ago. You can do many things today with 8-16GB of VRAM that would not have been possible. At the same time the most advanced open models, like LTX 2.3 for video gen, still seem to respect 24GB of VRAM as the upper bound.

Similarly the standard "big" but localish open model for LLMs back in the day was Llama 3 70B, this was both a much worse and much larger model than Qwen 3.6 27B

So in two different spaces I've witnessed the "RAM required to run the best" decreasing or at least remaining stable, while the performance being achieved in both areas is astounding (LTX 2.3 is faster, better and more capable than the Wan 2.2 model that held popularity before it).

The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.

reply

upvote

by prima-facie1 days ago|

[-]

> The biggest thing to watch out for is not just RAM/VRAM but memory bandwidth. You can try to "future proof" yourself with lots of RAM, but if it's 400 GB/S you're still constrained to smaller models.

I'm thinking of getting a SoC machine with 128GB RAM but the bandwidth is limited to 256 GBps. Would you even consider such a machine a decent investment, or should I wait for the newer gen of chips? Thanks!

reply

upvote

by roadside_picnic22 hours ago|

[-]

It depends on your use case. There's a lot of hype around machines like the DGX spark (I'm assuming this is the type of device you're referring to) because they look awesome, and are priced reasonably well. However all of these have notoriously low memory bandwidth despite the high ram.

These devices, especially the DGX line, are fantastic if you are interested in low-level CUDA programming. The DGX spark can be used to prototype CUDA code/libraries for GPUs that most of us couldn't think about affording. If you want to learn how to program for datacenter level GPUs then these are the best way to get that at home. Sure your code will run very slow compared to the real thing, but you can take that code and, theoretically, run it on the real thing. For anything else though, I feel there are better options.

If you're interested in pure inference I'm pretty partial to Apple devices. The M4 Max gets you 546 GB/s, the M5 MAX 614 GB/s, and the M3 ultra (you'd have to buy used at this point) 819 GB/s. Plus you have a very useful computer even if you realize you don't want a full time home inference server. Additionally these devices require very low power (if you're running high end consumer GPUs you do have to think about what your energy costs are per hour and how warm you like your room).

If you're interested inference and training, or already have a pretty beefy desktop PC, or simply demand the most token/s you can get, then GPUs are the way to go. The downside is they're still pretty memory restricted (but honestly the options for what you can run on any RTX N090 are pretty good). You'll get blazing inference and prefill speeds on these devices. The only down side is, if you are using them heavily, you will see it on your energy bill and feel it in your room.

The "should I wait" question is also potentially applicable. The world of consumer hardware is looking increasingly bleak (and expensive) but if Apple does release a new "Ultra" model we could be looking at inference speeds very close to GPUs (there's still limitations to these devices that makes training preferable on GPU)

reply

upvote

by prima-facie19 hours ago|

[-]

Thanks for the detailed response, I really appreciate it.

What I had in mind was an AMD Strix Halo machine, but it seems to have none of the advantages you mentioned. It's neither high bandwidth, nor does it have CUDA support, nor does it have support from the big OEMs. All the boards are from relatively obscure Chinese vendors.

It seems like all the major OEMs have rallied behind Nvidia, if you look at the upcoming RTX Spark laptops.

reply

upvote

by petilon1 days ago|

[-]

> insane amount of effort goes into getting more powerful models to run with the same or less RAM

The same can be said about operating system memory requirements. I am sure Linux and Windows kernel developers can confirm. Yet 30 years ago Solaris used to run comfortably in 16 MB of RAM, today you need 512 times that to run Linux.

reply

upvote

by regularfry5 hours ago|

[-]

Nah. There are already models at every size on the scale. If you want to run an open 1T model today, you can.

What's going to happen is that the capability at any given size point is going to get better over time as new training regimes cram more into the available space. A 27b model released next year will be better than a 27b model this year (else why release it?). Hardware will get more useful, not less.

reply

upvote

by Insanity1 days ago|

[-]

You raise a fair point, but I'm not convinced it'll offer a meaningful difference in performance as long as we're stuck with the current AI paradigm.

reply

upvote

by bluGill1 days ago|

[-]

Will they? Or will we find ways to optimize models and need less? Only time will tell.

reply

upvote

by simonw1 days ago|

[-]

It can't run the latest models today - GLM-5.2 class models already need 1TB+ of RAM.

... but, the models that WILL run on 128GB (or 64GB or even 32GB) models today are a huge improvement on the best models that would run in the same amount of memory six months ago.

reply

upvote

by johndough23 hours ago|

[-]

    > GLM-5.2 class models already need 1TB+ of RAM.

If you quantize GLM-5.2 to 4 bit, you can do it in less than 500GB: https://huggingface.co/unsloth/GLM-5.2-GGUF (table on the right)

If you find three finds that also have a 128GB MacBook, you can chain them together (the MacBooks, not your friends) and make it work.

You could also run GLM-5.2 on a single MacBook if you stream the active parameters from disk, but even with speculative decoding, you'd probably only get in the order of 1 token per second, so this is not really practical for most applications.

reply

upvote

by godwinsonsucks23 hours ago|

[-]

[dead]

reply

upvote

by naikrovek20 hours ago|

[-]

Available models aren’t really trending upward in size. Not like I thought they would, anyway.

They’re trending to be the right size to be good.

Qwen3.6-35B is not as good as Qwen3.6-27B. The larger model is faster, but a lot dumber; it gets caught in loops, makes crazy mistakes, and is just not as good. It’s bigger, but it is nowhere near as good as the 27B variant.

reply

upvote

by zargon17 hours ago|

[-]

Qwen3.6-35B-A3B is worse than 27B because it's an MoE and 27B is dense. 35B only passes each token through 3B of its total parameters, whereas 27B sends each token through all 27B parameters.

reply

upvote

by cyanydeez1 days ago|

[-]

I think you have too much faith in context AGI.

at 128GB, you can find almost it's entire context for Qwen3.6 35B MoE.

Again, I think you have too much faith in extrapolation. It's like you got a baby at 0 months, then measured it at 12 months and expect it to be a giant.

reply

upvote

by someperson1 days ago|

[-]

In 5-10 years, incremental cloud tokens will be far cheaper (likely but not guaranteed).

reply

upvote

by jubilanti1 days ago|

[-]

[flagged]

reply

upvote

by colinsane1 days ago|

[-]

i like that people are taking the privacy argument seriously, after however many decades. i think there are other arguments to be made for running these locally which are less settled, but IMO the Fable debacle drives it home: the surest way to embrace this technology without worry that it will be taken away from you down the road is to physically own the compute.

reply

upvote

by r_lee1 days ago|

[-]

if you need to ensure that, then just back up the model and buy hardware if the need arises

reply

upvote

by colinsane23 hours ago|

[-]

that's somewhere between saying "use Android, just switch to Graphene if/when they lock it down", and saying "just switch to postmarketOS/Ubuntu Touch/whatever flavor of Linux takes off".

i've watched friends try that route; i've been through this before. taking a downgrade is never fun: if it's a thing you're likely to care about in the future, then sometimes it's better to place yourself in the right ecosystem early.

reply

upvote

by r_lee22 hours ago|

[-]

I just don't see how with the whole open weight system this situation would happen or that it'd be likely enough to warrant this

in terms of privacy, yes that's a real application, but someone taking it all away? I don't see it happening.

it's not an OS or a device, it's just a box/thing that runs a model, it's really commodity stuff we're talking about

more realistic concern would be that the open labs wouldn't be able to compete in the future thus development ends, but that means you can't host models that don't come out so...

again maybe I misunderstood but I just don't see why this would be worth it just for that one concern

reply

upvote

by ricardobayes23 hours ago|

[-]

Oh definitely. I've seen GLM 5.2 go for around $4 per million output tokens.

reply

upvote

[-]

deleted

reply

upvote

by oldfuture1 days ago|

[-]

a lot of credits? we can’t predict any price change for them

reply

upvote

by ant6n13 hours ago|

[-]

Doesnt it run on the Macbook Neo... just slower?

reply

upvote

by AnimalMuppet1 days ago|

[-]

How many credits would it buy? How long would it take to use them up? What's the payback period?

From what I understand, for a developer, $5000/month is maybe the high end, but $5000/year is fairly standard. (Is that accurate?) So if it pays back in 15 months, that's pretty decent. If it pays back in two months, that's spectacular.

reply

upvote

by dminik1 days ago|

[-]

Using some rough napkin (well, spreadsheet) math, if you ran Qwen 27B for every minute every day at the current price of $0.195/$1.56 with a 2:1 input to output ratio (eg. agentic coding) at the advertised 22 tps it would take you just about 11 years to get to ~$5000 spent.

Disclaimer: There's a 35% sale from Alibaba right now. And I'm not accounting for input tokens going faster than output tokens.

reply

upvote

by eli1 days ago|

[-]

Are you comparing the cost of hosted Opus to running Qwen 3.6 locally? That doesn't really seem fair.

reply

upvote

[-]

deleted

reply

upvote

by h4ny1 days ago|

[-]

[flagged]

reply

upvote

by dang22 hours ago|

[-]

Yikes, you broke the site guidelines badly with this post. Could you please review https://news.ycombinator.com/newsguidelines.html and stick to them?

You're welcome to make your substantive points thoughtfully, just not aggressively.

reply

upvote

by kllrnohj1 days ago|

[-]

> maybe tell us how much a non-Apple system that you can run that (probably similarly or faster) would cost?

Ryzen AI Max 395+ with 128GB of unified memory can be found around $3-4k.

But 27B isn't that large, either, especially if you are ok with the quantized models. So this laptop choice seems to more be a "because they had it" rather than "this is what's necessary for this particular workflow"

reply

upvote

by h4ny1 days ago|

[-]

That's my point. You can run Qwen3.6 27B with MTP and whatever else you want to bolt onto it at 256k context for much less than even a Ryzen AI Max 395+ with 128GB would cost. Even unquantized you don't need 128 GB so given your comment and the downvotes maybe I didn't word my original comment properly for this?

reply