undefined

points

[-]

Triton, while a compiler, generates code at a lower level than CUDA or ROCm.

The machine code that actually runs on NVidia and AMD GPUs respectively are SASS and AMDGCN, and in each case there is also an intermediate level of representation:

CUDA -> PTX -> SASS

ROCm -> LLVM-IR -> AMDGCN

The Triton compiler isn't generating CUDA or ROCm - it generates it's own generic MLIR intermediate representation, which then gets converted into PTX or LLVM-IR, with vendor-specific tools then doing the final step.

If you are interested in efficiency and wanted to write high level code, then you might be using Pytorch's torch.compile, which then generates Triton kernels, etc.

If you really want to squeeze the highest performance out of an NVIDA GPU then you would write in PTX assembler, not CUDA, and for AMD in GCN assembler.