Mojo and Modular's MAX platform could do for heterogeneous compute what LLVM did for programming language development. People who dismiss the real value offering here know nothing. Modular have already raised $350m+ from industry giants (including Nvidia and Google) to solve this, and I believe they will.
Why should they? CUDA is a GPGPU paradigm, AMD/Apple/Intel all ship diverse raster-focused hardware, and TPUs are systolic arrays. How much can you realistically expect to abstract with unified primitives? How much performance do you perceive to be left on the table with native CUDA-based implementations?
PyTorch's abstractions answer this by ignoring raster hardware conventions entirely. The underlying ATen library is basically a CUDA wrapper, which is not much of a surprise since nobody else is willing to standardize a better alternative. We learned as much when OpenCL died, and now that Khronos is riding into the sunset it's unlikely we'll even see that level of paltry early-2010s cooperation. Mojo really should have taken Vulkan's lessons to heart: you need stakeholders to succeed, and simply "disrupting" the proprietary status quo is a recipe for coming dead last in adoption rates.
> People who dismiss the real value offering here know nothing.
So explain the value, then. This is not an "optimize this IR for MIPS and x86" problem; the Lattner Fairy can't shoehorn shaders into every CUDA Compute Capability to make raster GPUs a viable GPGPU platform. If you followed Geohot's gradual descent into (sadly, quite literal) insanity then this would have been glaringly obvious from the outset. Tinygrad has an IR, industry-scale support, multi-platform deployment, and it's still a dumpster fire. The project exacerbated all of the issues in ROCm and Metal, without contributing to any form of upstream cooperation between CUDA's competitors. If Mojo goes the same route with a more ambitious goal, they'll end up entrenching CUDA and obsoleting themselves. As much as people hate to admit it, CUDA is less of a software moat and more of a hardware one.
Ah, so it seems impossible to you. These are very different kinds of hardware... It is hard enough to maintain compatibility across different hardware from the same vendor. It is very difficult to imagine building primitives for hardware with completely different memory layouts.
> How much performance do you perceive to be left on the table with native CUDA-based implementations?
Zero is the idea. And I wasn't saying there should be a native CUDA-based implementation. I'm asking you to imagine how much easier everything would have been if CUDA were cross-platform without any performance or ergonomic penalties.
Mojo is a foundational step here. The big HOW is powerful parametric programming. So much information can be passed at compile time, which the compiler then uses to specialize the generated code.
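To make "parametric programming" concrete, here is a rough analogy in C++ templates rather than Mojo itself. It is only a sketch: `NarrowTarget`, `WideTarget`, `chunked_sum`, and `steps` are hypothetical names, and the lane widths are made up for illustration, not real hardware figures. The point is just that the same source gets instantiated once per target, with the lane count and array size known to the compiler each time.

```cpp
#include <array>
#include <cstddef>

// Hypothetical target descriptions; the widths are illustrative only.
struct NarrowTarget { static constexpr std::size_t simd_width = 4; };
struct WideTarget   { static constexpr std::size_t simd_width = 16; };

// Sum N floats, accumulating Target::simd_width partial sums per step.
// N and the lane count are compile-time constants, so each instantiation
// has fixed loop trip counts the compiler is free to unroll or vectorize.
template <typename Target, std::size_t N>
float chunked_sum(const std::array<float, N>& data) {
    constexpr std::size_t W = Target::simd_width;
    static_assert(N % W == 0, "N must be a multiple of the lane count");
    std::array<float, W> lanes{};             // one partial sum per lane
    for (std::size_t i = 0; i < N; i += W)
        for (std::size_t j = 0; j < W; ++j)   // trip count fixed at W
            lanes[j] += data[i + j];
    float total = 0.0f;
    for (float v : lanes) total += v;         // final horizontal reduction
    return total;
}

// Number of outer-loop steps, evaluated entirely at compile time.
template <typename Target, std::size_t N>
constexpr std::size_t steps() { return N / Target::simd_width; }
```

Calling `chunked_sum<WideTarget, 32>(data)` and `chunked_sum<NarrowTarget, 32>(data)` gives the same answer from two differently specialized functions. Whether that kind of specialization stretches far enough to cover GPUs, TPUs, and everything in between is exactly what this thread is arguing about.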