That was exactly my question too. Then I finished reading the post. The reason is pretty clear, and it's written in the post: it is faster than ollama+mlx.

by sleepybrett, 23 hours ago

how much faster?

by freerunnering, 16 hours ago

I was benchmarking different models, engines, and draft models. I posted a video on Twitter, and people started asking about the setup in the final screen recording. So the blog post isn't so much "how a beginner should set something up" as "here's the setup I posted in the video".

Original video: https://x.com/Freerunnering/status/2065275403548168398

And in the blog post there is a table showing the different speeds I got from different engines.

Slowest combo was 38.1 tk/s, and the fastest was 72.2 tk/s. All from "the same" model.
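To put those two numbers in perspective, here is a quick sketch of the arithmetic. The two rates are the ones quoted above; the 1,000-token response length and the helper names are only illustrative assumptions, not something taken from the benchmark.

```python
# Rates quoted in the comment above, in tokens per second.
SLOWEST_TPS = 38.1
FASTEST_TPS = 72.2


def speedup(slow_tps: float, fast_tps: float) -> float:
    """Return how many times faster the fast rate is than the slow one."""
    return fast_tps / slow_tps


def seconds_for(tokens: int, tps: float) -> float:
    """Return the generation time in seconds for a given token count."""
    return tokens / tps


if __name__ == "__main__":
    ratio = speedup(SLOWEST_TPS, FASTEST_TPS)
    print(f"speedup: {ratio:.2f}x")

    # Hypothetical response length, chosen only to make the gap concrete.
    tokens = 1000
    print(f"slowest: {seconds_for(tokens, SLOWEST_TPS):.1f} s")
    print(f"fastest: {seconds_for(tokens, FASTEST_TPS):.1f} s")
```

So the fastest combo is a bit under twice as fast as the slowest one, which roughly halves the wait for a response of any given length.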

by krzyk, 10 hours ago

Ollama is a wrapper on top of llama.cpp, and it makes llama.cpp slower. Why use it?

Also, Ollama has other issues (like forgetting what it really is: a wrapper).