This is typical of what happens any time I try to run something written in Python. It may be easier than setting up an NVIDIA GPU, but that's a low bar.
(Sorry, I’ll excuse myself now…)
Python is awesome in many ways, and one of my favourite languages, but unless you're comfortable with venv manipulation (or live in Conda), dependency management is often a nightmare that ends up worse than DLL hell.
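For anyone newer to this: the "venv manipulation" in question is usually just a couple of commands, sketched below (the `requirements.txt` is a hypothetical example file, not something from this thread):

```shell
# Create a per-project virtual environment so the project's
# dependencies never touch the system Python.
python3 -m venv .venv

# Activate it (bash/zsh); the prompt changes to show (.venv).
source .venv/bin/activate

# From here, installs land inside .venv/ only.
pip install --upgrade pip
pip install -r requirements.txt   # hypothetical requirements file
```

The catch, of course, is remembering to do this per project and per shell session, which is exactly the encyclopedic-knowledge tax being complained about.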
Python is by no means alone in this or particularly egregious. Having been a heavy Perl developer in the 2000s, I was part of the problem. I didn't understand why other people had so much trouble doing things that seemed simple to me, because I was eating, breathing, and sleeping Perl. I knew how to prolong the intervals between breaking my installation, and how to troubleshoot and repair it, but there was no reason why anyone who wanted to deploy, or even develop on, my code base should have needed that encyclopedic knowledge.
This is why, for all their faults, I count containers as the biggest revolution in the software industry, at least for us "backend" folks.
mlx is similar to numpy/pytorch, but only for Apple Silicon.
mlx-lm is a llama.cpp equivalent, but built on top of mlx.
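To make the llama.cpp comparison concrete, the basic mlx-lm flow is roughly the following. This is a sketch: the model name is just an example from the mlx-community Hugging Face org, and it only runs on an Apple Silicon Mac.

```shell
# Install the package (Apple Silicon only).
pip install mlx-lm

# Download a quantized model from Hugging Face and generate text,
# much like you'd invoke llama.cpp's CLI.
mlx_lm.generate --model mlx-community/Mistral-7B-Instruct-v0.3-4bit \
  --prompt "Explain virtual environments in one sentence."
```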
What's your tokens/sec rate (and on which device)?
I don't have a tokens/second figure to hand, but it's fast enough that I'm not frustrated by it.
MLX looks really nice from the demo-level playing around with it I've done, but I usually stick to JAX so, you know, I can actually deploy it on a server without trying to find someone who racks Macs.