Have you tried running llama.cpp with Unified Memory Access[1] so your iGPU can seamlessly grab some of the RAM? The environment variable is prefixed with CUDA, but this is not CUDA-specific. It made a pretty significant difference (over 40% higher tg/s) on my Ryzen 7840U laptop.
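For anyone who wants to try this without exporting the variable shell-wide, here is a minimal sketch that sets it for a single child process only. The variable name is GGML_CUDA_ENABLE_UNIFIED_MEMORY, as spelled out further down the thread; the helper names, the model path, and the prompt are placeholders of mine, and I'm assuming llama-cli is on your PATH.

```python
import os
import subprocess

def unified_memory_env(base=None):
    """Return a copy of the environment with the unified-memory variable set."""
    env = dict(os.environ if base is None else base)
    env["GGML_CUDA_ENABLE_UNIFIED_MEMORY"] = "1"
    return env

def run_llama(model_path, prompt, binary="llama-cli"):
    """Launch llama-cli with unified memory enabled for this process only.

    model_path, prompt, and binary are hypothetical placeholders; adjust
    them to your own setup.
    """
    cmd = [binary, "-m", model_path, "-p", prompt, "-n", "64"]
    return subprocess.run(cmd, env=unified_memory_env(), check=True)
```

Calling run_llama("model.gguf", "Hello") would then start one generation with the variable set, leaving the parent shell's environment untouched.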
Your link seems to describe a runtime environment variable, so it shouldn't need a separate build from source. I'm not sure, though, (1) why this info is in build.md, which should be specific to the build process, rather than in some separate documentation; and (2) if this really isn't CUDA-specific, why the canonical GGML variable name isn't GGML_ENABLE_UNIFIED_MEMORY, with the _CUDA_ variant treated as a legacy alias. AIUI, both of these should be addressed with pull requests for llama.cpp and/or the ggml library itself.
Hmm. Perhaps there's a niche for a "The Missing Guide to llama.cpp"? When getting started, I did things like wrapping llama-cli in a pty... and only later noticed the --simple-io argument. I wonder if "living documents" are a thing yet, where LLMs keep an eye on the repo and fora, and update a doc autonomously.
I hadn't tried that, thanks! I found that simply defining GGML_CUDA_ENABLE_UNIFIED_MEMORY, whether as 1, 0, or "", was a 10x hit, down to 2 t/s. Perhaps that's because the laptop's RAM is already so over-committed there. But with the much smaller 4B model, Qwen3.5-4B-Q8_0.gguf, it doubled performance from 20 to 40+ t/s! Thanks! (This is on an old Quadro RTX 3000 rather than an iGPU.)
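Since 1, 0, and "" all behaved the same here, it looks as if only the presence of the variable matters, not its value. That is an inference from these runs, not something I've checked against the source. If it holds, an A/B comparison has to remove the variable for the baseline rather than set it to 0. A rough sketch, with a helper name of my own invention:

```python
import os

VAR = "GGML_CUDA_ENABLE_UNIFIED_MEMORY"

def env_for_run(enable, base=None):
    """Build an environment for one benchmark run.

    Assumption: the variable is checked for presence only, so the
    baseline run must not define it at all (not even as "0" or "").
    """
    env = dict(os.environ if base is None else base)
    if enable:
        env[VAR] = "1"
    else:
        env.pop(VAR, None)  # unset entirely instead of assigning "0"
    return env
```

Passing env_for_run(True) and env_for_run(False) as the env= argument of two otherwise identical subprocess.run calls gives a with/without pair to compare t/s on.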