Pipeline parallelism. Instead of splitting each layer by row/column, you split the model at layer boundaries. So instead of hitting a huge bandwidth bottleneck, you only need to transfer about 4 KB per token when crossing devices on a model like Qwen3 30B-A3B.
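
A rough back-of-envelope check of that 4 KB figure, as a sketch only. It assumes a hidden size of 2048 and 48 layers for Qwen3 30B-A3B, fp16 activations (2 bytes per value), and, for the comparison, two all-reduces per layer under row/column splitting. Those numbers and the function names are my assumptions, not something stated above.

```python
def pipeline_bytes_per_token(hidden_size: int, bytes_per_value: int) -> int:
    """Activation payload handed to the next device at one layer boundary."""
    return hidden_size * bytes_per_value


def tensor_split_bytes_per_token(
    hidden_size: int,
    bytes_per_value: int,
    num_layers: int,
    syncs_per_layer: int = 2,  # assumed: one sync after attention, one after the MLP
) -> int:
    """Rough per-token payload that must be synchronized when every layer is split."""
    return num_layers * syncs_per_layer * hidden_size * bytes_per_value


HIDDEN_SIZE = 2048  # assumed hidden size for Qwen3 30B-A3B
NUM_LAYERS = 48     # assumed layer count
FP16_BYTES = 2      # assumed activation precision

pipeline = pipeline_bytes_per_token(HIDDEN_SIZE, FP16_BYTES)
tensor_split = tensor_split_bytes_per_token(HIDDEN_SIZE, FP16_BYTES, NUM_LAYERS)

print(f"pipeline split: {pipeline} bytes per token per boundary")
print(f"row/column split: {tensor_split} bytes per token (rough)")
```

Under those assumptions the pipeline split moves 4096 bytes per token per device boundary, while the row/column split has to synchronize roughly 96 times that, and it does so many times per token rather than once.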

by xrd 14 hours ago

This is a good place to start reading about dual-GPU setups.

https://github.com/noonghunna/club-3090/blob/master/docs/DUA...

by nextaccountic 13 hours ago

But in this case he used a CPU too.

by segmondy 15 hours ago

Check out llama.cpp. The entire point of the project is to serve us mere mortals and the GPU-poor.