Hacker News
points | by ryandrake 22 hours ago | comments
zozbot234 22 hours ago | next [-]
That "traditional" setup is the recommended setup for running large MoE models, leaving shared routing layers on the GPU to the extent feasible. You can even go larger-than-system-RAM via mmap, though at a non-trivial cost in throughput.
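The larger-than-system-RAM trick works because the runtime maps the weight file read-only and lets the OS page in only the tensors a given token's routed experts actually touch. A minimal sketch of that access pattern, using Python's `mmap` with a dummy weight file (all names here are illustrative, not from any specific inference runtime):

```python
import mmap, os, struct, tempfile

def write_dummy_weights(path, n):
    # Pretend these n little-endian float32 values are one
    # expert's flattened weight matrix inside a larger file.
    with open(path, "wb") as f:
        f.write(struct.pack(f"<{n}f", *[float(i) for i in range(n)]))

def load_expert_slice(path, offset, count):
    # Map the whole file read-only. No read() of the full file
    # happens here; the kernel faults in only the pages that
    # unpack_from actually touches, so the mapping can exceed RAM.
    with open(path, "rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return struct.unpack_from(f"<{count}f", mm, offset * 4)
        finally:
            mm.close()

path = os.path.join(tempfile.mkdtemp(), "weights.bin")
write_dummy_weights(path, 1024)
expert = load_expert_slice(path, offset=512, count=4)
print(expert)  # -> (512.0, 513.0, 514.0, 515.0)
```

The throughput cost mentioned above shows up when the working set of active experts exceeds the page cache: each cold expert access becomes a disk read rather than a RAM hit.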
[deleted] 21 hours ago | prev | next
khimaros 21 hours ago | prev [-]
Strix Halo is another option