And didn't Ollama independently ship a vision pipeline for some multimodal models months before llama.cpp supported it?
The project is just a bit underwhelming overall, it would be way better if they just focused on polishing good UX and fine-tuning, starting from a reasonably up-to-date version of what llama.cpp provides already.
There is no reason to ever use ollama.
I just checked their docs and can't see anything like it.
Did you mistake the command to just download and load the model?
Actually that shouldn't be a question, you clearly did.
Hint: it also opens Claude code configured to use that model
Hmm, the fact that Ollama is open-source, can run in Docker, etc.?
In some places in the source code they claim sole ownership of the code, when it is highly derivative of that in llama.cpp (having started its life as a llama.cpp frontend). They keep it the same license, however, MIT.
There is no reason to use Ollama as an alternative to llama.cpp, just use the real thing instead.
I've benchmarked this on an actual Mac Mini M4 with 24 GB of RAM, and averaged 24.4 t/s on Ollama and 19.45 t/s on LM Studio for the same ~10 GB model (gemma4:e4b), a difference which was repeated across three runs and with both models warmed up beforehand. Unless there is an error in my methodology, which is easy to repeat[1], it means Ollama is a full 25% faster. That's an enormous difference. Try it for yourself before making such claims.
[1] script at: https://pastebin.com/EwcRqLUm but it warms up both and keeps them in memory, so you'll want to close almost all other applications first. Install both ollama and LM Studio and download the models, change the path to where you installed the model. Interestingly I had to go through 3 different AI's to write this script: ChatGPT (on which I'm a Pro subscriber) thought about doing so then returned nothing (shenanigans since I was benchmarking a competitor?), I had run out of my weekly session limit on Pro Max 20x credits on Claude (wonder why I need a local coding agent!) and then Google rose to the challenge and wrote the benchmark for me. I didn't try writing a benchmark like this locally, I'll try that next and report back.
llama.cpp is about 10% faster than LM studio with the same options.
LM studio is 3x faster than ollama with the same options (~13t/s vs ~38t/s), but messes up tool calls.
Ollama ended up slowest on the 9B, Queen3.5 35B and some random other 8B model.
Note that this isn't some rigorous study or performance benchmarking. I just found ollama unnaceptably slow and wanted to try out the other options.