this is only speculation, but i think the big thing making tinygrad slow is that its inference engine hasn't really been optimized for all these open LLM models. probably most of the work has gone towards optimizing the stack for george's self-driving hardware company. and since you can't just run existing CUDA kernels on their engine, that makes things a lot tougher, engineering-wise.
i am actually curious whether my project could share a macos host driver with them. i think it would need some changes, but there seems to be a lot of overlap.