Hacker News
by nextaccountic 15 hours ago
by nodja 14 hours ago
Pipeline parallelism. Instead of splitting each layer by row/column (tensor parallelism), you split the model at the layer boundaries. That way, instead of hitting a huge bandwidth bottleneck, you only need to transfer about 4 KB per token when crossing from one device to the next on a model like Qwen3 30B-A3B.
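For anyone wondering where the ~4 KB figure comes from, here is a rough back-of-the-envelope sketch. It assumes a hidden size of 2048 and 48 layers for Qwen3 30B-A3B, and 16-bit activations; those numbers are my assumptions, not something stated above, so check them against the model config. The helper names are made up for illustration.

```python
def activation_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Size of one token's hidden-state vector, which is what crosses a stage boundary."""
    return hidden_size * bytes_per_value


def split_layers(num_layers: int, num_devices: int) -> list[range]:
    """Assign contiguous blocks of layers to devices (a naive even split)."""
    base, extra = divmod(num_layers, num_devices)
    stages = []
    start = 0
    for device in range(num_devices):
        # The first `extra` devices take one additional layer each.
        size = base + (1 if device < extra else 0)
        stages.append(range(start, start + size))
        start += size
    return stages


# Assumed values, not taken from the thread.
HIDDEN_SIZE = 2048      # assumed hidden size for Qwen3 30B-A3B
NUM_LAYERS = 48         # assumed layer count
BYTES_PER_VALUE = 2     # assumed fp16/bf16 activations

stages = split_layers(NUM_LAYERS, num_devices=2)
per_token = activation_bytes_per_token(HIDDEN_SIZE, BYTES_PER_VALUE)

for device, layers in enumerate(stages):
    print(f"device {device}: layers {layers.start}..{layers.stop - 1}")
print(f"bytes sent per token at each boundary: {per_token}")
```

With two devices there is only one boundary, so each generated token costs one small hidden-state transfer rather than a collective exchange inside every layer.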
by xrd 14 hours ago
This is a good place to start reading about dual-GPU setups:
https://github.com/noonghunna/club-3090/blob/master/docs/DUA...
by nextaccountic 13 hours ago
But in this case he used a CPU too.
by segmondy 15 hours ago
Check out llama.cpp; the entire point of the project is to serve us mere mortals and the GPU-poor.