undefined

This is somewhat out of date (Dec 2024), but gives you some idea of how far behind AMD was then: https://newsletter.semianalysis.com/p/mi300x-vs-h100-vs-h200...

Pull quotes:

AMD’s software experience is riddled with bugs rendering out of the box training with AMD is impossible. We were hopeful that AMD could emerge as a strong competitor to NVIDIA in training workloads, but, as of today, this is unfortunately not the case. The CUDA moat has yet to be crossed by AMD due to AMD’s weaker-than-expected software Quality Assurance (QA) culture and its challenging out of the box experience.

[snip]

> The only reason we have been able to get AMD performance within 75% of H100/H200 performance is because we have been supported by multiple teams at AMD in fixing numerous AMD software bugs. To get AMD to a usable state with somewhat reasonable performance, a giant ~60 command Dockerfile that builds dependencies from source, hand crafted by an AMD principal engineer, was specifically provided for us

[snip]

> AMD hipBLASLt/rocBLAS’s heuristic model picks the wrong algorithm for most shapes out of the box, which is why so much time-consuming tuning is required by the end user.

etc etc. The whole thing is worth reading.

I'm sure it has (and will continue to) improved since then. I hear good things about the Lemonade team (although I think that is mostly inference?)

But the NVidia stack has improved too.

by _vertigo10 hours ago|

parent|

[-]

That’s insane. There should be a big team of people at AMD whose whole job is just to dogfood their stuff for training like this. Speaking of which, Amazon is in the same boat, I’m constantly surprised that Amazon is not treating improving Inferentia/Trainium software as an uber-priority. (I work at Amazon)

by chii8 hours ago|

parent|

[-]

> There should be a big team of people at AMD whose whole job is just to dogfood their stuff

if they had this management attitude, they wouldn't have been so far behind so as to need this action in the first place!

by nl7 hours ago|

parent|

[-]

I'll just leave this here from 10 years ago:

> “Are we afraid of our competitors? No, we’re completely unafraid of our competitors,” said Taylor. “For the most part, because—in the case of Nvidia—they don’t appear to care that much about VR. And in the case of the dollars spent on R&D, they seem to be very happy doing stuff in the car industry, and long may that continue—good luck to them.

https://arstechnica.com/gadgets/2016/04/amd-focusing-on-vr-m...

"car industry" is linked to the GPU-accelerated self-driving car work, ie, making neural networks run fast on GPUs: https://arstechnica.com/gadgets/2016/01/nvidia-outs-pascal-g...

by wongarsu2 hours ago|

parent|

prev|

[-]

Hardware companies being terrible at software is the norm. Nvidia is one of the rare companies that can successfully execute both.

Maybe Amazon is an example how this happens even to hardware divisions within software/logistics companies

by whywhywhywhy3 hours ago|

parent|

prev|

[-]

I mean the fact there isn’t even today may speak to why AMD isn’t the contender it should be by this point.

by moritonal5 hours ago|

parent|

prev|

[-]

Anecdotal but over several years with an AMD GPU in my desktop I've tried multiple times to do real AI work and given up every time with the AMD stack.

by calgoo4 hours ago|

parent|

[-]

Im running fine on my AMD 7800xt 16gb... Yes memory is a bit limited, but apart from the i have found that it works great using Vulcan in LM studio for example.

ROCm works great too, the only issue i have had is that my machine froze a couple of times as it used 100% of the graphics and the OS had nothing left. Since moving to vulcan i stopped getting these errors apart from a little UI slowdown when i had 4 models loaded at the same time taking turns.

Im also on a i7 6700 with 32gb DDR4 so im sure that is causing more slowdowns then the graphics card.

by djhn9 hours ago|

parent|

prev|

[-]

Yet another reason to doubt claims that ”software is solved”.

Anthropic did retire an interview take-home assignment involving optimising inference on exotic hardware, because Claude could one shot a solution, but that was clearly a whiteboard hypothetical instead of a real system with warts, issues and nuance.

by electroglyph8 hours ago|

parent|

prev|

[-]

i'm doing inference on a free mi300x instance from AMD right now. not sure if the software stack is just old or what, but here's what i've observed: stuck on an old version of vllm pre-Transformers 5 support. it lacks MoE support for qwen3 models. oss-120b is faaaar slower than it should be.

int8 quantization seems like it's almost supported, but not quite. speeds drop to a fraction of full precision speed and the server seems like it intermittently hangs. int4 quantization not supported. fp8 quantization not supported.

again, maybe AMD is just being lazy with what they've provided, but it's not a great look.

right now the fastest smart model i can run is full precision qwen3-32b. with 120 parallel requests (short context) i'm getting PP @ 4500 tokens/sec and TG @ 1300 tokens/sec

by bean4697 hours ago|

parent|

prev|

[-]

> Do labs even use CUDA?

From the papers I've read and the labs that I have worked in personally, I would say that most scientists developing Deep learning solutions use CUDA for GPU acceleration

by uberduper13 hours ago|

parent|

prev|

[-]

amd gpus compete but they lack the interconnect. NVLink performance is a huge deal for training.

by f6v3 hours ago|

parent|

prev|

[-]

I don’t know what’s a chicken and what’s an egg here. But ROCm support is often missing or experimental even in very basic foundational libraries. They need someone else to double down on using their chips and just break the software support out of the limbo.

by 0-_-015 hours ago|

parent|

prev|

[-]

What I hear is that getting your network to work on AMD is a huge pain.

by dnadler14 hours ago|

parent|

[-]

Yeah, historically it’s been software that’s limited AMD here. Not surprised to hear that may still be the issue. NVidia’s biggest edge was really CUDA.

by otabdeveloper41 hours ago|

parent|

[-]

CUDA is a complete and utter piece of shit software. It's just that it is a tiny bit less of a shitshow than the alternatives.

by oomuinio11 hours ago|

parent|

prev|

[-]

[dead]

by zombiwoof6 hours ago|

parent|

prev|

[-]

[dead]

by zobzu8 hours ago|

prev|

[-]

even google doesnt only use TPUs.

by danpalmer8 hours ago|

parent|

[-]

Google is in a different position to others in that they're the only frontier lab with a cloud infra business. It obviously makes sense to sell GPUs on cloud infra as people want to rent them. In that respect Google buys a ton of GPUs to rent out.

What's unclear to me is how much Google uses GPUs for their own stuff. Yes Gemini runs on GPUs now, so that Google can sell Gemini on-prem boxes (recent release announced last week), but is any training or inference for Gemini really happening on GPUs? This is unclear to me. I'd have guessed not given that I thought TPUs were much cheaper to operate, but maybe I'm wrong.

Caveat, I work at Google, but not on anything to do with this. I'm only going on what's in the press for this stuff.

by johndough28 minutes ago|

parent|

[-]

> Gemini on-prem boxes (recent release announced last week)

Do you have any more information on this? I only found this article about it: https://venturebeat.com/technology/googles-gemini-can-now-ru...

It mentions that Gemini can run on eight NVIDIA GPUs, but not which GPU and which Gemini model. Either way, this puts an upper bound of 288 * 8 = 2304 GB on the size of the Gemini model, which as far as I know has been a secret until now.

by bartwr1 hours ago|

parent|

prev|

[-]

I have most likely outdated info, I left Google Research 4y ago. Back then, available TPU instances were plenty and GPU scarce. Nobody wanted to mess with an immature crashing compiler and very steep performance cliffs (performance was excellent only if you stayed within the guardrails, and being outside was supported and not even resulting in a warning - as it was so common in code). But I believe most of it has changed for the better for TPUs.

by yjadsfgasdf15 hours ago|

prev|

[-]

[dead]