CPU-MoE still helps with mmap. It shouldn't hurt token-generation speed much on the Mac, since the CPU has access to most (though not all) of the unified memory bandwidth, which is the bottleneck.
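For reference, a minimal sketch of what this setup looks like, assuming a recent llama.cpp build that has the `--cpu-moe` / `--n-cpu-moe` options; the model path is a placeholder:

```shell
# Keep MoE expert tensors in system RAM while offloading the rest to the GPU.
# Model path is hypothetical; adjust to your quantized GGUF file.
llama-server -m ./models/model-Q4_K_M.gguf -ngl 99 --cpu-moe

# Finer-grained alternative: keep only the experts of the first N layers on CPU.
# llama-server -m ./models/model-Q4_K_M.gguf -ngl 99 --n-cpu-moe 10
```

Since the expert tensors dominate a MoE model's size but only a few experts are active per token, this trades a modest amount of bandwidth for a large VRAM saving.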
I'll try that, but llama-server has mmap on by default and the model still takes up its full size in RAM, so I'm not sure what's going on.
Try running CPU-only inference to troubleshoot that. GPU layers will likely just ignore mmap.
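A sketch of that troubleshooting step, assuming the standard llama-server flags (`-ngl`, `--no-mmap`); the model path is a placeholder:

```shell
# Force CPU-only inference (-ngl 0) so all weights stay mmap-backed;
# resident memory should then grow only with pages actually touched.
llama-server -m ./models/model-Q4_K_M.gguf -ngl 0

# For comparison, --no-mmap reads the entire model into RAM up front:
# llama-server -m ./models/model-Q4_K_M.gguf -ngl 0 --no-mmap
```

If RSS still matches the full model size in the first case, the issue isn't GPU offload copying the weights out of the mapping.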
For sure, I was on autopilot with that reply. Though at Q4 I would expect it to fit, since the 24B-A4B Gemma model without CPU offloading got up to 18 GB of VRAM usage.