
I believe that local models are a necessary extension of the personal computer and I imagine that one could have had similar criticisms of early personal computers.

by pmontra 14 hours ago

Of course, the early MS-DOS PCs were loud and power hungry. I can't remember the specs, but according to Wikipedia the IBM PC with an 80286 had a 192-watt power supply. I don't remember whether we had internal hard disks by then, or whether we still had to buy a separate case, as large as the PC's own, with a 10 or 20 MB disk inside. It was handy for raising the monitor further up.

by theshrike79 12 hours ago

My dream would be a local model that can do, say, 80% of the day to day tasks I need; "how does X Handler connect to Y storage?", "commit that feature, but leave out the bits that relate to billing" etc.

It would have 99% reliable tool calling - and most importantly - the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere.

This way all of the simple stuff would be done on-device, gathering data, figuring out the context of the problem etc. And when that's done, the "smart" model would come in to work on the issue when all of the easy stuff is already done.

It feels super stupid that my /commit skill calls an online model when that is something a local model can 100% do. Mostly this is a harness issue though and mostly solvable.

by redrove 11 hours ago

> My dream would be a local model that can do, say, 80% of the day to day tasks I need; "how does X Handler connect to Y storage?", "commit that feature, but leave out the bits that relate to billing" etc.

Qwen 3.6 27B can do that today, provided it is set up properly and in a good quant. I run an AutoRound quant [0] with weights in int8 and attention heads in f16 on a single RTX 6000 Pro Blackwell Max-Q via vLLM, with mtp=2, full context, --max-num-seqs 3, KV cache in f16, and mamba in f32.

>It would have 99% reliable tool calling

I managed to score 93/100 in tool-eval-bench [1]. For me this is already very good; at least in the Pi coding harness, I've never had an issue that wasn't auto-fixed in the next turn(s).

>the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere

This is heavy on the harness engineering side I think, but also quite contrary to the nature of LLMs today. If you figure this out I'd love to know.

[0] https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/...

[1] https://github.com/SeraphimSerapis/tool-eval-bench

by walthamstow 10 hours ago

Claude kind of has this already in their Advisor feature. I don't think I've seen it elsewhere. Open harnesses could add this feature and call out to big boy models when required. It's a really great idea.

by girvo 9 hours ago

It’s a lot harder to get right than it sounds. I’ve been trying to build it as a Pi extension, but models are biased to think they’re better than they actually are.

So far the best results I’ve got have come from using a much smaller local model as a simple classifier that decides where to route each request, based on the system prompt and the incoming prompt. It works okay, but there's still a long way to go.
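A minimal sketch of that routing idea, for anyone who wants to play with it. Everything here is hypothetical: `route_prompt`, `Route`, and the threshold are made-up names, and `keyword_classifier` is only a stand-in for the small local model, which in practice would return a learned difficulty score rather than count keywords.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    target: str   # "local" or "remote"
    score: float  # estimated difficulty in [0, 1]


def route_prompt(
    system_prompt: str,
    user_prompt: str,
    classify: Callable[[str, str], float],
    threshold: float = 0.5,
) -> Route:
    """Send the request to the remote model only if the classifier's
    difficulty score reaches the threshold; otherwise keep it local."""
    score = max(0.0, min(1.0, classify(system_prompt, user_prompt)))
    target = "remote" if score >= threshold else "local"
    return Route(target=target, score=score)


def keyword_classifier(system_prompt: str, user_prompt: str) -> float:
    """Hypothetical stand-in for a small local classifier model:
    the fraction of 'hard task' marker words found in the user prompt."""
    hard_markers = ("refactor", "architecture", "prove", "design")
    text = user_prompt.lower()
    hits = sum(marker in text for marker in hard_markers)
    return hits / len(hard_markers)


if __name__ == "__main__":
    for prompt in (
        "commit that feature, but leave out the bits that relate to billing",
        "refactor the storage architecture and design a migration plan",
    ):
        print(route_prompt("You are a coding agent.", prompt, keyword_classifier))
```

The point of keeping the decision in a separate classifier is that the working model never has to judge its own competence, which is exactly the part models tend to get wrong.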

by i_idiot 14 hours ago

> Unlike the way they are hyped sometimes, as fantastic tools for coding and agentic local work.

They really are fantastic for a lot of use cases, and I think most people do not need SOTA. When I run that Qwen model on my measly 12 GB 4070 for the personal email agent that I build and experiment with, I need privacy more than anything else, and it does a great job. Even for coding tasks it's great, provided you know how to use it instead of dumping a grand plan on it.

by throw310822 12 hours ago

> I think most people do not need SOTA

SOTA can code, but it can also prove theorems, teach you about music theory, botany, or ancient Greece's substrate language, and speak dozens of languages. I wonder how many hundreds of billions of parameters could be saved just by removing much of the general knowledge while keeping the logical and programming abilities exactly the same.

by trey-jones 9 hours ago

Exactly. I have sort of a fetish for making things smaller by trimming out whatever isn't needed. Unfortunately this skill has been largely useless since forever, because hardware improves to the point where these optimizations become trivial: network bandwidth, storage space and speed, memory capacity.

All of these were worth optimizing for at some point in history, but that point is behind us. It's probably reasonable to expect that the same will eventually be true for VRAM.

by regularfry 11 hours ago

I've been getting 40-50 t/s out of qwen3.6:27b on a 4090 limited to 350 W with the MTP changes that went in. That comes out to 8.75 J/token at the slow end (350 W / 40 t/s), or 7 J/token at 50 t/s. No idea how that compares with anything else out there. I'd expect a 5090 to be a bit cheaper per token because it'd be faster within the same power limit.
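For anyone who wants to plug in their own numbers, the arithmetic is just watts divided by tokens per second. This is a rough sketch: it assumes the card draws its full power cap the whole time and ignores the rest of the system, and `joules_per_token` is only an illustrative helper name.

```python
def joules_per_token(power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token. A watt is a joule per second,
    so J/token = W / (tokens/s)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return power_watts / tokens_per_second


POWER_LIMIT_W = 350.0  # GPU power cap; assumes the card sits at the cap

for tps in (40.0, 50.0):
    print(f"{tps:.0f} t/s -> {joules_per_token(POWER_LIMIT_W, tps):.2f} J/token")
# 40 t/s -> 8.75 J/token
# 50 t/s -> 7.00 J/token
```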

by sanderjd 14 hours ago

But that's current hardware. What about future hardware? What about hardware optimized for inference? What about hardware optimized to run a particular model?