A GPU with 24 GB of VRAM is mostly useful for running a very carefully squeezed Qwen3.6 27B (4-bit Unsloth quants, 8-bit K/V cache, possibly MTP, 128k context). This is a fun little model that's smart enough to do debugging, refactoring, and implementing "clean" specs that don't force it to make complicated design choices. I've seen it rip through a 9-year-old Terraform AWS config and (without using the network) correctly identify nearly everything that would need to be upgraded or migrated for modern AWS. But if I give it a poorly conceived spec with lurking design headaches, it goes on an endless thinking binge and ultimately fails.

Speed-wise, I don't have numbers, but it feels subjectively faster than Opus in Claude Code. YMMV.

Once you go above "a used 3090 at a decentish price", I strongly recommend renting cloud GPUs, or at least testing models through paid APIs. That lets you test your use case before spending piles of money.

Generation is basically just memory bandwidth math.

Each generated token has to read all the active weights. I think that's around 40B active parameters. At a 4-bit quant that's 20 GB per token. With 100 GB/s of memory bandwidth (replace with whatever yours is), you get about 5 tokens per second.
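
Here's that arithmetic as a rough sketch, in case you want to plug in your own numbers. The function name and parameters are made up for illustration. It assumes decoding is purely bandwidth-bound and ignores K/V cache reads and compute, so treat the result as an upper bound rather than a benchmark.

```python
def decode_tokens_per_second(active_params, bits_per_weight, bandwidth_bytes_per_s):
    """Upper bound on decode speed when every token must stream all active weights.

    Assumption: memory bandwidth is the only bottleneck (no K/V cache
    traffic, no compute limits, no overlap inefficiency).
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_bytes_per_s / bytes_per_token


# The numbers from the comment: ~40B active params, 4-bit quant, 100 GB/s.
print(decode_tokens_per_second(40e9, 4, 100e9))  # 5.0
```

Swap in your own bandwidth figure (for example, whatever your GPU or unified-memory machine advertises) to get a ceiling for your hardware.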

And with MTP (or other speculation techniques) you can ~double that.
MTP on a MoE is hit or miss. If you're bottlenecked on memory bandwidth, MTP can increase the number of active experts per step (like any batch processing would), which can eat into the gains.
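
To put a rough number on that, here is a toy model. Everything in it is an assumption for illustration: routing is treated as uniform and independent per token, and the 128-expert, 8-per-token configuration is hypothetical, not any particular model. Real routers are correlated across adjacent tokens, and attention and shared weights are read once regardless, so actual overlap is usually better than this.

```python
def expected_active_experts(num_experts, experts_per_token, tokens_in_batch):
    """Expected number of distinct experts touched in one layer.

    Assumption: each token independently picks `experts_per_token`
    distinct experts uniformly at random (a pessimistic toy model).
    """
    p_expert_unused = (1 - experts_per_token / num_experts) ** tokens_in_batch
    return num_experts * (1 - p_expert_unused)


# Hypothetical MoE layer: 128 experts, 8 routed per token.
one_token = expected_active_experts(128, 8, 1)   # plain decoding
two_tokens = expected_active_experts(128, 8, 2)  # verifying one extra drafted token
print(one_token, two_tokens)  # 8.0 15.5
```

Under these assumptions, verifying two tokens in one step reads almost twice the expert weights, so the bandwidth cost per step nearly doubles and most of the MTP gain on the expert portion disappears.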