Lately I've been wondering too just how large these proprietary "ultra powerful frontier models" really are. It wouldn't shock me if the default models aren't actually just some kind of crazy MoE thing with only a very small number of active params but a huge pool of experts to draw from for world knowledge.
I am getting 10tok/sec on a 27B of Qwen3.5 (thinking, Q4, 18GB) on an M4/32GB Mac Mini. It’s slow.
For a 9B (much smaller, non-thinking) I am getting 30tok/sec, which is fast enough for regular use if you need something from the training data (like how to use grep or Hemingways favorite cocktail).
I’m using LMStudio, which is very easy and free (beer).