Pull quotes:
AMD’s software experience is riddled with bugs rendering out of the box training with AMD is impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out of the box experience.
[snip]
> The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant ~60 command Dockerfile that builds dependencies from source, hand crafted by an AMD principal engineer, was specifically provided for us
[snip]
> AMD hipBLASLt/rocBLAS’s heuristic model picks the wrong algorithm for most shapes out of the box, which is why so much time-consuming tuning is required by the end user.
etc etc. The whole thing is worth reading.
I'm sure it has (and will continue to) improved since then. I hear good things about the Lemonade team (although I think that is mostly inference?)
But the NVidia stack has improved too.
if they had this management attitude, they wouldn't have been so far behind so as to need this action in the first place!
> “Are we afraid of our competitors? No, we’re completely unafraid of our competitors,” said Taylor. “For the most part, because—in the case of Nvidia—they don’t appear to care that much about VR. And in the case of the dollars spent on R&D, they seem to be very happy doing stuff in the car industry, and long may that continue—good luck to them.
https://arstechnica.com/gadgets/2016/04/amd-focusing-on-vr-m...
"car industry" is linked to the GPU-accelerated self-driving car work, ie, making neural networks run fast on GPUs: https://arstechnica.com/gadgets/2016/01/nvidia-outs-pascal-g...
Maybe Amazon is an example how this happens even to hardware divisions within software/logistics companies
ROCm works great too, the only issue i have had is that my machine froze a couple of times as it used 100% of the graphics and the OS had nothing left. Since moving to vulcan i stopped getting these errors apart from a little UI slowdown when i had 4 models loaded at the same time taking turns.
Im also on a i7 6700 with 32gb DDR4 so im sure that is causing more slowdowns then the graphics card.
Anthropic did retire an interview take-home assignment involving optimising inference on exotic hardware, because Claude could one shot a solution, but that was clearly a whiteboard hypothetical instead of a real system with warts, issues and nuance.
int8 quantization seems like it's almost supported, but not quite. speeds drop to a fraction of full precision speed and the server seems like it intermittently hangs. int4 quantization not supported. fp8 quantization not supported.
again, maybe AMD is just being lazy with what they've provided, but it's not a great look.
right now the fastest smart model i can run is full precision qwen3-32b. with 120 parallel requests (short context) i'm getting PP @ 4500 tokens/sec and TG @ 1300 tokens/sec
From the papers I've read and the labs that I have worked in personally, I would say that most scientists developing Deep learning solutions use CUDA for GPU acceleration