I've heard that vLLM performs much better and scales particularly well in the multi-GPU case. Given that, the 4x B70 setup may actually be decent for the money, but it's probably worth waiting to see how the situation progresses rather than buying on a promise of potential.
A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.
3090 llama.cpp (container in VM)
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL 105 t/s
unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL 103 t/s
Still slow compared to the ggml-org/gpt-oss-20b-GGUF 206 t/s
But on my 3x 1080 Ti 1x TITAN V ghetto machine I learned that multi-GPU takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part. There are a lot of variables. PCIe bus speed doesn't matter that much for inference, but the internal memory bandwidth does, and you won't ever match that with PCIe.
To be clear, multicard Vulkan and absolutely SYCL have a lot of room for optimization, but the only time two GPUs are really faster for inference is when one doesn't have enough RAM to fit the entire model.
A 3090 has 936.2 GB/s of (low-latency) internal bandwidth, while PCIe 5.0 x16 only has 504.12 Gbit/s (roughly 63 GB/s) per direction, and the data may have to be copied through the CPU, take locks, use atomic operations, etc.
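For anyone who wants to check the comparison, here is a rough sketch of the arithmetic. It assumes the published PCIe 5.0 figures (32 GT/s per lane, 128b/130b encoding), ignores protocol overhead beyond line encoding, and reads the 504.12 number as Gbit/s per direction. The function name is made up for illustration.

```python
def pcie_bandwidth_gbytes(gt_per_s, lanes, enc_num=128, enc_den=130):
    """Theoretical one-direction PCIe bandwidth in GB/s.

    Only accounts for line encoding (128b/130b for PCIe 3.0 and later);
    real-world throughput is lower because of packet overhead.
    """
    gbit_per_s = gt_per_s * lanes * enc_num / enc_den
    return gbit_per_s / 8  # bits -> bytes


# PCIe 5.0 x16: 32 GT/s per lane, 16 lanes
pcie5_x16_gbit = 32 * 16 * 128 / 130
pcie5_x16 = pcie_bandwidth_gbytes(32, 16)

# 3090 VRAM bandwidth, as quoted in the comment above
vram_3090 = 936.2

ratio = vram_3090 / pcie5_x16

print(f"PCIe 5.0 x16: {pcie5_x16_gbit:.2f} Gbit/s = {pcie5_x16:.1f} GB/s")
print(f"3090 VRAM is about {ratio:.1f}x faster than the bus")
```

So under these assumptions the gap between on-card memory and the bus is closer to an order of magnitude than a factor of two.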
For LLM inference, the bottleneck is usually going to be memory bandwidth, which is why my 3090 is so close to the 5070 Ti above.
LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.
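A back-of-the-envelope way to see the memory-bound argument: each decoded token has to read (at least) the active weights once, so bandwidth divided by bytes read per token gives a ceiling on t/s. The sketch below is only that, a ceiling. It ignores KV cache reads, activations, kernel launch overhead and so on, and the 3B active parameters and ~4.5 bits per weight are assumptions for a Q4-style quant of an "A3B" MoE model, not measured values.

```python
def decode_ceiling_tps(bandwidth_gb_s, active_params, bits_per_weight):
    """Upper bound on decode tokens/s if weight reads were the only cost.

    bandwidth_gb_s  : memory bandwidth in GB/s (1 GB = 1e9 bytes)
    active_params   : parameters touched per token (for MoE, the active subset)
    bits_per_weight : average quantized size per parameter (assumed)
    """
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Assumed: ~3e9 active params, ~4.5 bits/weight
vram_ceiling = decode_ceiling_tps(936.2, 3e9, 4.5)   # weights in 3090 VRAM
bus_ceiling = decode_ceiling_tps(63.0, 3e9, 4.5)     # same reads over PCIe 5.0 x16

print(f"VRAM-bound ceiling: {vram_ceiling:.0f} t/s")
print(f"PCIe-bound ceiling: {bus_ceiling:.0f} t/s")
```

Measured numbers will land well below the VRAM ceiling, but the point is the ratio: anything that forces per-token traffic across the bus drags the ceiling down by the same factor as the bandwidth gap.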
As I haven't used the larger Intel GPUs, I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some NVLink-style RDMA support _unless_ your process is compute bound and not memory bound.