Wanna chime in: I recently tried vLLM to run an NVFP4 Gemma4 safetensors model and see how batching shows up in nice t/s numbers. It's slow to start, it's Linux-only, it doesn't like WSL much, and I ended up with either old or nightly container builds, so I've more or less given up. I appreciate how llama.cpp simply works and does things fast and in an obvious way.