upvote
Pipeline parallelism. Instead of splitting layers by row/column. You split at the layer edges. So instead of having this huge bottleneck of bandwidth you only need to transfer about 4KB per token when changing devices on a model like Qwen 3 30BA3.
reply
This is a good place to start reading about dual gpus.

https://github.com/noonghunna/club-3090/blob/master/docs/DUA...

reply
But in this case he used a cpu too
reply
checkout llama.cpp, the entire point of the project is for us mere mortals and GPU poor.
reply