So you can do:
ollama launch pi --model gemma4:26b
And it launches and points to the local model in one command. pi seems to do some settings caching too, because after doing the above once I can just run `pi` and it's already set up to use the local model.

One thing to keep in mind is that you don't need to fit the whole model in memory to run it. For example, I can get acceptable token generation speed (~55 tok/s) on a 3080 by offloading the expert layers to CPU. I don't remember the prompt processing speed, but generally speaking prompt processing is compute bound, so it benefits more from an actual GPU.
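For what it's worth, a sketch of what expert-layer offloading looks like with llama.cpp's `llama-server` (ollama doesn't expose tensor-level offload itself, and the model path here is a placeholder):

```shell
# -ngl 99 offloads all layers to the GPU, then --override-tensor (-ot)
# forces tensors whose names match the regex (the MoE expert FFN
# weights) back to CPU RAM, so only the smaller attention/shared
# layers occupy VRAM.
llama-server -m ./model.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 8192
```

The trade-off is exactly what's described above: generation stays usable because the hot per-token path fits on the GPU, while the bulk of the weights stream from system RAM.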