https://x.com/ChujieZheng/status/2039909917323383036
Likely to drive engagement, but the poll excluded the large model size.
If you’re decoding multiple streams, it will be 17B per stream (some tokens will route to the same experts, so there is some overlap).
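A back-of-the-envelope sketch of that overlap, assuming uniform random routing with 128 experts per layer and 8 active per token (real routers are learned, so actual overlap will differ):

```python
import random

# How many distinct experts per layer do concurrent decode streams hit?
# Assumes uniform routing over 128 experts, 8 active per token.
EXPERTS, ACTIVE = 128, 8

def distinct_experts(streams: int, trials: int = 2000) -> float:
    """Average count of distinct experts hit in one layer by `streams` tokens."""
    total = 0
    for _ in range(trials):
        hit = set()
        for _ in range(streams):
            # each token picks 8 distinct experts in this layer
            hit.update(random.sample(range(EXPERTS), ACTIVE))
        total += len(hit)
    return total / trials

for n in (1, 2, 4, 8):
    print(f"{n} streams: ~{distinct_experts(n):.1f} distinct experts per layer")
```

With one stream it's exactly 8 experts per layer; with more streams the count grows sub-linearly, which is the overlap being described.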
When the model is ingesting the prompt (“prefilling”) it’s looking at many tokens at once, so the number of active parameters will be larger.
Those 17B might be split among multiple experts that are activated simultaneously.
Experts are just chunks of each layer's MLP that are only partially activated by each token; there are thousands of "experts" in such a model (for Qwen3-30B-A3B it was 48 layers x 128 "experts" per layer, with only 8 active per token).
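A minimal top-k routing sketch of that structure (toy dimensions, not any real model's code; only the expert count and top-k match the Qwen3-30B-A3B-style config above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes for illustration; only N_EXPERTS and TOP_K mirror the
# 128-experts / 8-active config described above.
D_MODEL, D_FF, N_EXPERTS, TOP_K = 16, 32, 128, 8

# Each "expert" is its own small two-matrix MLP chunk of the layer.
W_in = rng.standard_normal((N_EXPERTS, D_MODEL, D_FF)) * 0.1
W_out = rng.standard_normal((N_EXPERTS, D_FF, D_MODEL)) * 0.1
W_router = rng.standard_normal((D_MODEL, N_EXPERTS)) * 0.1

def moe_layer(x):
    """x: (d_model,), one token. The router scores every expert,
    but only the top-k expert MLPs are actually evaluated."""
    scores = x @ W_router
    topk = np.argsort(scores)[-TOP_K:]        # indices of the k winners
    w = np.exp(scores[topk])
    w /= w.sum()                              # softmax over winners only
    out = np.zeros_like(x)
    for weight, e in zip(w, topk):
        h = np.maximum(x @ W_in[e], 0.0)      # expert e's MLP (ReLU)
        out += weight * (h @ W_out[e])
    return out, topk

y, used = moe_layer(rng.standard_normal(D_MODEL))
print(f"evaluated {len(used)} of {N_EXPERTS} experts for this token")
```

The "active parameters" number counts only the k expert MLPs that actually run per token, even though all 128 sit in memory.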
Full (non-quantized, non-distilled) DeepSeek runs at 1-2 tok/sec. A model half the size would probably be a little faster. This is also only with the basic NUMA functionality that was in llama.cpp a few months ago, I know they’ve added more interesting distribution mechanisms recently that I haven’t had a chance to test yet.
So I understand why they wouldn't want to go open weight, but on the other hand, open weight wins you popularity/sentiment if the model is any good, researchers (both academic and other labs) working on your stuff, etc etc. Local-first usage is only part of the story here. My guess is Qwen 3.5 was successful enough that now they want to start reaping the profits. Unfortunately most of Qwen 3.5's success is because it's heavily (and successfully!) optimized for extremely long-context usage on heavily constrained VRAM (i.e. local) systems, as a result of its DeltaNet attention layers.
This is somewhat depressing - needing a couple of thousand bucks worth of ram just to run your chat app and code/text editor and API doco tool and forum app and notetaking app all at the same time...
Either you're in Africa, southeast Asia or south/central America.
How do you even afford internet?
Using UD-IQ4_NL quants.
Getting 13 t/s. Using it with thinking disabled.
Probably too slow for chat, but usable as a coding assistant.
AMD Threadripper PRO 9965WX, 256 GB DDR5-5600, RTX 4090.