upvote
Today's free models are not really bigger when you account for the use of MoE (with ever increasing sparsity, meaning a smaller fraction of active parameters), and better ways of managing KV caching. You can do useful things with very little RAM/VRAM, it just gets slower and slower the more you try to squeeze it where it doesn't quite belong. But that's not a problem if you're willing to wait for every answer.
reply
yeah, but I mean more like the old setups where you'd just load a model on a 4090 or something, even with MoE it's a lot more complex and takes more VRAM, right? like it just seems not justifiable for most hobbyists

but maybe I'm just slightly out of the loop

reply
With sparse MoE it's worth running the experts in system RAM since that allows you to transparently use mmap and inactive experts can stay on disk. Of course that's also a slowdown unless you have enough RAM for the full set, but it lets you run much larger models on smaller systems.
reply