Does this also apply to Gemma's 26B-A4B vs, say, Qwen's 35B-A3B?

I'm not sure if I can make the 35B-A3B work on my 32 GB machine.

reply
It should be easy with a Q4 (quantization to 4 bits per weight) and a smallish context.

You won't have much RAM left over though :-/.

At Q4, the weights come to ~20 GiB:

https://huggingface.co/unsloth/Qwen3.5-35B-A3B-GGUF
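Rough arithmetic behind that figure (the ~4.5-5 effective bits per weight is an assumption; Q4 "K-quants" average somewhat more than 4 bits because some tensors are kept at higher precision):

```python
# Back-of-envelope size estimate for a 35B-parameter model at Q4.
params = 35e9
bits_per_weight = 4.5  # assumed effective average for a Q4-style quant

size_gib = params * bits_per_weight / 8 / 2**30
print(f"~{size_gib:.1f} GiB")
```

That lands in the 18-20 GiB range, before adding the KV cache for context, which is why a 32 GB machine works but doesn't leave much headroom.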

reply
For llama-server (and possibly other similar applications) you can specify the number of layers offloaded to the GPU (e.g. `--n-gpu-layers`). By default this is set to run the entire model in VRAM, but you can set it to something like 64 or 32 to use less VRAM. This trades speed, since the layers left out of VRAM run on the CPU instead, but it allows you to run a larger model, a larger context, or additional models.
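A sketch of what that invocation might look like (the model filename and the context size are illustrative, not prescriptive):

```
# Offload only 32 layers to the GPU; the rest run on the CPU.
llama-server -m Qwen3.5-35B-A3B-Q4_K_M.gguf --n-gpu-layers 32 -c 8192
```

If you still run out of VRAM, lower `--n-gpu-layers` further; if you have headroom, raise it until the model no longer fits.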
reply