One of the rust-gpu maintainers here. Haven't officially heard from anyone at AMD but we've had chats with many others. Happy to talk with whomever! I would imagine AMD is focusing on ROCm over Vulkan for compute right now as their pure datacenter play, which makes sense.

We've started a company around Rust on the GPU btw (https://www.vectorware.com/), both CUDA and Vulkan (and ROCm eventually I guess?).

Note that most platform developers in the GPU space are C++ folks (lots of LLVM!) and there isn't as much demand from customers for Rust on the GPU vs something like Python or TypeScript. So Rust naturally gets less attention and is lower on the list...for now.

reply
I see, thanks. It would be good if Vulkan were pushed more as an approach for this, since the alternatives are GPU-vendor-specific.
reply
From the readme:

> Note: This project is still heavily in development and is at an early stage.

> Compiling and running simple shaders works, and a significant portion of the core library also compiles.

> However, many things aren't implemented yet. That means that while being technically usable, this project is not yet production-ready.

Also, projects like Rust GPU are built on top of projects like CUDA and ROCm; they aren't alternatives, they're abstractions on top.

reply
I think Rust GPU is built on top of Vulkan + SPIR-V as its main foundation, not on top of CUDA or ROCm.

What I meant was more the language for writing GPU programs themselves, not necessarily the machinery right below them. Vulkan is a good thing to advance for that.

I.e. CUDA and ROCm focus on a C++ dialect as the GPU language. Rust GPU does that with Rust, and it relies on Vulkan without being tied to any specific GPU vendor.
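For a sense of what that looks like, here's a minimal compute shader sketch in the rust-gpu style. This is illustrative only: it assumes the rust-gpu SPIR-V codegen backend and the spirv-std crate, so it won't build with a plain rustc.

```rust
// Sketch of a rust-gpu compute shader (assumes the spirv-std crate and
// the rust-gpu SPIR-V backend; names here are illustrative).
#![no_std]

use spirv_std::glam::UVec3;
use spirv_std::spirv;

// Doubles every element of a storage buffer; dispatched in groups of 64.
#[spirv(compute(threads(64)))]
pub fn double_cs(
    #[spirv(global_invocation_id)] id: UVec3,
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] data: &mut [f32],
) {
    let i = id.x as usize;
    if i < data.len() {
        data[i] *= 2.0;
    }
}
```

The compiled SPIR-V module is then loaded through an ordinary Vulkan (or wgpu) compute pipeline, the same way a hand-written GLSL shader would be.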

reply
The article mentions Triton for this purpose. I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.
reply
> I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.

You could argue the same about CPU architectures, no? Yet compilers solve this pretty well most of the time.

reply
Sort of, but not really. Compilers are fantastic for the typical stuff, and that includes the compilers in the CUDA/ROCm/Vulkan/etc. stacks. But on the CPU, for the rare critical bits where you care about every last cycle or other inane details for whatever reason, you're often all but forced to fall back on intrinsics and microarch-specific code paths.
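To make that concrete, here's a hedged Rust sketch of the usual CPU-side pattern: a portable loop the compiler can auto-vectorize, plus a runtime-detected AVX2 path written with intrinsics. The function names are mine, and only the portable path exists on every target.

```rust
// Sketch: falling back to microarch-specific intrinsics for a hot loop.
pub fn sum(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // SAFETY: we just checked at runtime that AVX2 is available.
            return unsafe { sum_avx2(xs) };
        }
    }
    // Portable fallback: the compiler will usually auto-vectorize this
    // anyway, which is exactly the "typical stuff" case.
    xs.iter().sum()
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx2")]
unsafe fn sum_avx2(xs: &[f32]) -> f32 {
    use std::arch::x86_64::*;
    let mut acc = _mm256_setzero_ps();
    let chunks = xs.chunks_exact(8);
    let tail = chunks.remainder();
    for c in chunks {
        // Unaligned 8-lane load plus vertical add.
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(c.as_ptr()));
    }
    // Horizontal sum of the 8 accumulator lanes, plus the scalar tail.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum::<f32>() + tail.iter().sum::<f32>()
}
```

The point is the shape, not the specific intrinsics: the fast path is gated on both compile-time architecture and runtime feature detection, and everything else takes the portable route.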
reply
Yeah, that's why I said most of the time. Sometimes even on CPUs things need assembly. But no one stops you from using GPU assembly either when needed, I suppose? It should probably not be the default approach, though.
reply
Because the people that care want C++, Fortran, Python and Julia, which already enjoy a rich ecosystem.
reply
If you don't want/need to program at the lowest level possible, then PyTorch seems the obvious option for AMD support, or maybe Mojo. The Triton compiler would be another option for kernel writing.
reply
I don't think that's something that can be pitched as a CUDA alternative. It's just a different level.
reply
Triton, while a compiler, generates code at a lower level than CUDA or ROCm.

The machine code that actually runs on NVIDIA and AMD GPUs is SASS and AMDGCN respectively, and in each case there is also an intermediate level of representation:

CUDA -> PTX -> SASS

ROCm -> LLVM-IR -> AMDGCN

The Triton compiler isn't generating CUDA or ROCm; it generates its own generic MLIR intermediate representation, which then gets converted into PTX or LLVM-IR, with vendor-specific tools doing the final step.

If you are interested in efficiency and wanted to write high-level code, then you might be using PyTorch's torch.compile, which then generates Triton kernels, etc.

If you really want to squeeze the highest performance out of an NVIDIA GPU, then you would write in PTX assembly, not CUDA, and for AMD in GCN assembly.

reply