Have you tried running llama.cpp with Unified Memory[1] so your iGPU can seamlessly grab some of the system RAM? The environment variable is prefixed with CUDA, but this is not CUDA-specific. It made a pretty significant difference (> 40% faster token generation) on my Ryzen 7840U laptop.

1 - https://github.com/ggml-org/llama.cpp/blob/master/docs/build...
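In case it helps anyone else, here's roughly what that looks like in practice; the model path and prompt are just placeholders, but the variable name is the one from the docs:

    # enable unified memory so the GPU can spill into system RAM
    GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./llama-cli -m model.gguf -p "Hello" -n 64

    # or export it once for the session and run the server the same way
    export GGML_CUDA_ENABLE_UNIFIED_MEMORY=1
    ./llama-server -m model.gguf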

reply
Your link seems to be describing a runtime environment variable; it doesn't need a separate build from source. I'm not sure, though, (1) why this info is in build.md, which should be specific to the build process, rather than in separate documentation; and (2) if this really isn't CUDA-specific, why the canonical GGML variable name isn't GGML_ENABLE_UNIFIED_MEMORY, with the _CUDA_ variant treated as a legacy alias. AIUI, both of these should be addressed with pull requests against llama.cpp and/or the ggml library itself.
reply
You're right that it's an environment variable; that's how I have it set in my Nix config. Thanks for correcting that.

Unfortunately llama.cpp is somewhat notorious for having lackluster docs. Most of the CLI tools don't even tell you what they are for.

reply
Hmm. Perhaps there's a niche for a "Missing Guide to llama.cpp"? Getting started, I did things like wrapping llama-cli in a pty... only later noticing the --simple-io argument. I wonder if "living documents" are a thing yet, where LLMs keep an eye on a repo and its forums, and update a doc autonomously.
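For the record, here's roughly what those two approaches look like; the model path is just a placeholder, and the pty trick relies on util-linux's script(1):

    # the flag I eventually found: plain stdin/stdout, friendlier for scripting
    ./llama-cli -m model.gguf --simple-io

    # what I was doing before: forcing a pty around it with script(1)
    script -qc "./llama-cli -m model.gguf" /dev/null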
reply
I hadn't tried that, thanks! I found that simply defining GGML_CUDA_ENABLE_UNIFIED_MEMORY, whether as 1, 0, or "", was a roughly 10x hit, down to 2 t/s, perhaps because the laptop's RAM is already so over-committed there. But with the much smaller Qwen3.5-4B-Q8_0.gguf, it doubled performance from 20 to 40+ t/s! Thanks again! (This is an old Quadro RTX 3000 rather than an iGPU.)
reply