It would have 99% reliable tool calling - and most importantly - the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere.
This way all of the simple stuff would be done on-device, gathering data, figuring out the context of the problem etc. And when that's done, the "smart" model would come in to work on the issue when all of the easy stuff is already done.
It feels super stupid that my /commit skill calls an online model when that is something a local model can 100% do. Mostly this is a harness issue though and mostly solvable.
Qwen 3.6 27B can do that today, but setup properly and in a good quant, I run an autoround [0] with weights in int8 and attention heads in f16 on a single RTX 6000 Pro Blackwell Max-Q via vllm with mtp=2 and full context, --max-num-seqs 3, KV in f16, mamba f32.
>It would have 99% reliable tool calling
I managed to score 93/100 in tool-eval-bench [1]. For me this is very good already, at least in the pi coding harness I've never had an issue that wasn't auto-fixed in the next turn(s).
>the ability to go "this task is beyond my skills" and refer to a Big Boy Online Model in a gigantic datacenter somewhere
This is heavy on the harness engineering side I think, but also quite contrary to the nature of LLMs today. If you figure this out I'd love to know.
[0] https://huggingface.co/Minachist/Qwen3.6-27B-INT8-AutoRound/...
So far the best results I’ve got have been using a much smaller local model as a simple classifier, that makes a call based on the system prompt and incoming prompt where to route it. It works okay, still a long way to go though
They really are fantastic for a lot of use cases and I think most people do not need SOTA. When I run that qwen model in my measly 4070 12 GB for my personal email agent that I build and experiment with, I need privacy more than anything else. It does a great job. Even for coding tasks, given you know how to use them instead of dumping a grand plan, it's great.
SOTA can code but can also prove theorems and teach you about music theory or ancient Greece's substrate language or botany. Speaking in tens of different languages. I wonder how many hundreds of billions of parameters can be saved just by removing much of the general knowledge parts while keeping logical and programming abilities the exact same.
Network Bandwidth, Storage space and speed, memory capacity. While all of these were worth optimizing for at a point in history, that point is behind us. It's probably a reasonable expectation that it will eventually be true for VRAM.