upvote
[delayed]
reply
I started with ollama and now I'm using llama.cpp/llama-server's Router Mode that allows you to manage multiple models through a single server instance.

One thing I haven't figured out: Subjectively, it feels like ollama's model loading was nearly instant, while I feel like I'm always waiting for llama.cpp to load models, but that doesn't make sense because it's ultimately the same software. Maybe I should try ollama again to convince myself that I'm not crazy and that ollama's model loading wasn't actually instant.

reply
You don't need to compile it yourself though? Unless you want CUDA support on Linux I guess, dunno why you'd need such a silly thing though:

https://github.com/ggml-org/llama.cpp/releases

reply