upvote
The article mentions Triton for this purpose. I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.
reply
> I don’t think you will get maxed out performance on the hardware though because abstraction layers won’t let you access the fastest possible path.

You could argue about CPU architectures the same, no? Yet compilers solve this pretty well most of the time.

reply
Sort of not really. Compilers are fantastic for the typical stuff and that includes the compilers in the CUDA/ROCm/Vulkan/etc stacks. But on the CPU for the rare critical bits where you care about every last cycle or other inane details for whatever reason you're often all but forced to fall back on intrinsics and microarch specific code paths.
reply
Yeah, that's why I said most of the time. Sometimes even for CPUs things need assembly. But no one stops you using GPU assembly either when needed I suppose? It should not be the default approach probably.
reply