I had exactly the same question. Then I finished reading the post. The reason is pretty clear, and it's stated in the post: it is faster than ollama+mlx.
reply
How much faster?
reply
I was benchmarking different models, engines, and draft models. I posted a video on Twitter, and people started asking about the setup in the final screen recording. So the blog post isn't so much "how a beginner should set something up" as "here's the setup I posted in the video".

Original video: https://x.com/Freerunnering/status/2065275403548168398

And in the blog post there is a table showing the different speeds I got from different engines.

The slowest combo was 38.1 tk/s and the fastest was 72.2 tk/s, roughly a 1.9x difference, all from "the same" model.

reply
Ollama is a wrapper on top of llama.cpp, and it makes llama.cpp slower. Why use it?

Also, Ollama has other issues (like forgetting what it really is: a wrapper).

reply