And so as not to be left behind, Intel and AMD are making similar efforts, and then we have the whole CPython JIT finally happening after so many attempts.

Not to mention efforts like GraalPy and PyPy.

And all of these efforts work on Windows today, which is quite relevant in companies where that is the device assigned to most employees, even if the servers run Linux distros.

I keep wondering if this isn't going to be another Swift for Tensorflow kind of outcome.

reply
The CPython JIT has barely had any impact on its performance. CPython is always going to be dog slow.
reply
Of course, it is still taking baby steps and has to be explicitly enabled, provided you have installed a build that includes it.

It only has to be good enough to keep the ecosystem going, so that the porting cost isn't worthwhile by the time Mojo finally reaches parity.

reply
My understanding from speaking with a few Tile IR devs on dates is that its primary motivation was providing better portability for programming tensor cores than PTX offers. Nobody ever told me they saw it as a response to anything other than customer feedback.
reply
People keep mistaking Mojo for merely nicer syntax for writing GPU code, and so assume Nvidia's Python frameworks already do that. But... would CuTile work on AMD GPUs and Apple Silicon? Whatever Nvidia does will still have vendor lock-in.
reply
Indeed, but Intel and AMD are also upping their Python JIT game, and in the end Mojo code isn't portable anyway.

You always need to touch the hardware/platform APIs at some level, because even if the same code produces the same results, the observed performance, and in the case of GPUs the numeric accuracy, differ in visible ways.

reply
It is portable in that you can write code to target multiple platforms in the same codebase. Mojo has powerful compile-time metaprogramming that allows you to tell the compiler how to specialise using a compile-time conditional, e.g. https://github.com/modular/modular/blob/9b9fc007378f16148cfa...
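
A minimal sketch of the pattern (illustrative only, not taken from the linked file; the function and parameter names are made up):

    # Illustrative only: the Bool is a compile-time parameter, so only the
    # selected branch is ever code-generated for a given instantiation.
    fn scale[use_fast_path: Bool](x: Float64) -> Float64:
        @parameter
        if use_fast_path:
            return x * 2.0   # stand-in for a vendor-specific fast path
        else:
            return x + x     # stand-in for a portable fallback

    fn main():
        print(scale[True](3.0))    # branch chosen during compilation
        print(scale[False](3.0))

In the real kernels the conditional would dispatch on the target accelerator rather than a toy Bool.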

Of course, this won't be necessary in most cases if you're building on top of abstractions provided by Modular.

You don't get this choice with vendor-specific libraries; you're locked into one or the other.

reply
Yes you do, you get PyTorch or whatever else, built on top of those vendor-specific libraries.

That is the thing with Mojo: by the time it arrives at 1.0, the progress on LLMs and the investment being made in GPU JITs for Python will make it largely irrelevant for large-scale adoption.

Sure, some customers might stick around and keep Modular going; the golden question is how many.

reply
PyTorch is built on an amalgamation of these different frameworks, not on one of them being used to target different vendors.
reply
The point still stands: it acts as middleware either way.
reply
Have you ever wondered how much work would have been saved by the PyTorch team if they could have used just CUDA for all the platforms they support? If they didn't have to write compatibility abstractions or layers, and could instead have just focused on the problem of training neural networks? What if all the primitives they used from CUDA and cuDNN worked just as well on AMD GPUs, Apple GPUs, and probably Google's TPUs as they did on Nvidia GPUs?

Mojo and Modular's MAX platform would do for heterogeneous compute what LLVM did for programming language development. People who dismiss the real value proposition here know nothing. Modular have already raised $350m+ from industry giants (including Nvidia and Google) to solve this, and I believe they will.

reply
Yes, because graphics programming was one of my hobbies for a long time, and I keep observing how FOSS folks misunderstand the games industry; what gets talked about at GDC and IGDA events isn't one API to rule them all.
reply
> What if all the primitives they used from CUDA and cuDNN worked just as well on AMD GPUs, Apple GPUs, and probably Google's TPUs as they did on Nvidia GPUs?

Why should they? CUDA is a GPGPU paradigm, AMD/Apple/Intel all ship diverse raster-focused hardware, and TPUs are a systolic array. How much can you realistically expect to abstract with unified primitives? How much performance do you perceive to be left on the table with native CUDA-based implementations?

PyTorch's abstractions answer this by ignoring raster hardware conventions entirely. The underlying ATen library is basically a CUDA wrapper, which is not much of a surprise since nobody else is willing to standardize a better alternative. We learned as much when OpenCL died, and now that Khronos is riding into the sunset it's unlikely we'll even see that level of paltry early-2010s cooperation. Mojo really should have taken Vulkan's lessons to heart: you need stakeholders to succeed, and simply "disrupting" the proprietary status quo is a recipe for coming dead last in adoption rates.

> People who dismiss the real value offering here know nothing.

So explain the value, then. This is not an "optimize this IR for MIPS and x86" problem; the Lattner Fairy can't shoehorn shaders into every CUDA Compute Capability to make raster GPUs a viable GPGPU platform. If you followed Geohot's gradual descent into (sadly, quite literal) insanity then this would have been glaringly obvious from the outset. Tinygrad has an IR, industry-scale support, and multi-platform deployment, and it's still a dumpster fire. The project exacerbated all of the issues in ROCm and Metal without contributing to any form of upstream cooperation between CUDA's competitors. If Mojo goes the same route with a more ambitious goal, they'll end up entrenching CUDA and obsoleting themselves. As much as people hate to admit it, CUDA is less of a software moat and more of a hardware one.

reply
> Why should they? CUDA is a GPGPU paradigm, AMD/Apple/Intel all ship diverse raster-focused hardware, and TPUs are a systolic array. How much can you realistically expect to abstract with unified primitives?

Ah, it seems impossible to you. These are very different kinds of hardware... It is hard enough to maintain compatibility across different hardware from the same vendor; it is very difficult to imagine building primitives for hardware with completely different memory layouts.

> How much performance do you perceive to be left on the table with native CUDA-based implementations?

Zero is the idea. And I wasn't saying there should be a native CUDA-based implementation; I'm asking you to imagine how much easier everything would have been if CUDA were cross-platform without any performance or ergonomic penalties.

Mojo is a foundational step here. The big HOW is powerful parametric programming: so much information can be passed at compile time, which the compiler then uses to specialize.
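
For instance (a rough sketch, assuming Mojo's sys.info.simdwidthof helper), even a basic hardware fact like SIMD width is a compile-time constant that generic code can specialize on:

    from sys.info import simdwidthof

    fn main():
        # simdwidthof is evaluated at compile time and differs per target,
        # so anything parameterized on it gets re-specialized for each build.
        alias width = simdwidthof[DType.float32]()
        print("float32 lanes per SIMD register on this target:", width)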

reply
Interesting, how big an impact is CuTile having?
reply