undefined

points

[-]

It's on the page:

  Precision  Quantization Tag File Size
  1-bit      UD-IQ1_M         10 GB
  2-bit      UD-IQ2_XXS       10.8 GB
             UD-Q2_K_XL       12.3 GB
  3-bit      UD-IQ3_XXS       13.2 GB
             UD-Q3_K_XL       16.8 GB
  4-bit      UD-IQ4_XS        17.7 GB
             UD-Q4_K_XL       22.4 GB
  5-bit      UD-Q5_K_XL       26.6 GB
  16-bit     BF16             69.4 GB

by Aurornis15 hours ago|

parent|

[-]

Additional VRAM is needed for context.

This model is a MoE model with only 3B active parameters per expert which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU the slower it becomes though.

by Glemllksdf15 hours ago|

parent|

[-]

Isn't that some kind of gambling if you offload random experts onto the CPU?

Or is it only layers but that would affect all Experts?

by dragonwriter14 hours ago|

parent|

[-]

Pretty sure all partial offload systems I’ve seen work by layers, but there might be something else out there.

by est14 hours ago|

parent|

prev|

[-]

I really want to know what does M, K, XL XS mean in this context and how to choose.

I searched all unsloth doc and there seems no explaination at all.

by tredre310 hours ago|

parent|

[-]

Q4_K is a type of quantization. It means that all weights will be at a minimum 4bits using the K method.

But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.

The S/M/L/XL is what tells you how many tensors get to use more bits.

The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).

Here's an example of the contents of a Q4_K_:

    S
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  136 tensors
    llama_model_loader: - type q5_0:   43 tensors
    llama_model_loader: - type q5_1:   17 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   55 tensors
    M
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   83 tensors
    L
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   14 tensors
    llama_model_loader: - type q8_0:   84 tensors

by huydotnet14 hours ago|

parent|

prev|

[-]

They are different quantization types, you can read more here https://huggingface.co/docs/hub/gguf#quantization-types

by arcanemachiner8 hours ago|

parent|

prev|

[-]

Just start with q4_k_m and figure out the rest later.

by palmotea16 hours ago|

parent|

prev|

[-]

Thanks! I'd scanned the main content but I'd been blind to the sidebar on the far right.

by JKCalhoun14 hours ago|

parent|

prev|

[-]

"16-bit BF16 69.4 GB"

Is that (BF16) a 16-bit float?

by mtklein14 hours ago|

parent|

[-]

Yes, it's a "Brain float", basically an ordinary 32-bit float with the low 16 mantissa bits cut off. Exact same range as fp32, lower precision, and not the same as the other fp16, which has less exponent and more mantissa.

by adrian_b8 hours ago|

parent|

prev|

[-]

The IEEE standard FP16 is an older 16-bit format, which has balanced exponent and significand sizes.

It has been initially supported by GPUs, where it is useful especially for storing the color components of pixels. For geometry data, FP32 is preferred.

In CPUs, some support has been first added in 2012, in Intel Ivy Bridge. Better support is provided in some server CPUs, and since next year also in the desktop AMD Zen 6 and Intel Nova Lake.

BF16 is a format introduced by Google, intended only for AI/ML applications, not for graphics, so initially it was implemented in some of the Intel server CPUs and only later in GPUs. Unlike FP16, which is balanced, BF16 has great dynamic range, but very low precision. This is fine for ML but inappropriate for any other applications.

Nowadays, most LLMs are trained preponderantly using BF16, with a small number of parameters using FP32, for higher precision.

Then from the biggest model that uses BF16, smaller quantized models are derived, which use 8 bits or less per parameter, trading off accuracy for speed.

by Gracana14 hours ago|

parent|

prev|

[-]

https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

Yes, however it’s a different format from standard fp16, it trades precision for greater dynamic range.

by 14 hours ago|

parent|

prev|

[-]

deleted

by WithinReason14 hours ago|

parent|

prev|

[-]

yes, it has 8 exponent bits like float32 instead of 6 like float16

by tommy_axle15 hours ago|

prev|

[-]

Pick a decent quant (4-6KM) then use llama-fit-params and try it yourself to see if it's giving you what you need.

by gunalx12 hours ago|

parent|

[-]

I habe found llama-fit sometimes just selects a way to conservative load with VRAM to spare.

by zozbot23416 hours ago|

prev|

[-]

Should run just fine with CPU-MoE and mmap, but inference might be a bit slow if you have little RAM.

by Ladioss15 hours ago|

prev|

[-]

You can run 25-30b model easily if you use Q3 or Q4 quants and llama-server with a pretty long list of options.

by trvz16 hours ago|

prev|

[-]

If you have to ask then your GPU is too small.

With 16 GB you'll be only able to run a very compressed variant with noticable quality loss.

by coder54315 hours ago|

parent|

[-]

Not true. With a MoE, you can offload quite a bit of the model to CPU without losing a ton of performance. 16GB should be fine to run the 4-bit (or larger) model at speeds that are decent. The --n-cpu-moe parameter is the key one on llama-server, if you're not just using -fit on.

by boppo112 hours ago|

parent|

[-]

I've been way out of the local game for a while now, what's the best way to run models for a fairly technical user? I was using llama.cpp in the command line before and using bash files for prompts.

by adrian_b8 hours ago|

parent|

[-]

Running llama-server (it belongs to llama.cpp) starts a HTTP server on a specified port.

You can connect to that port with any browser, for chat.

Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.

by palmotea16 hours ago|

parent|

prev|

[-]

> If you have to ask then your GPU is too small.

What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?

by giobox16 hours ago|

parent|

[-]

It's worth noting now there are other machines than just Apple that combine a powerful SoC with a large pool of unified memory for local AI use:

> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...

> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...

> https://frame.work/products/desktop-diy-amd-aimax300/configu...

etc.

But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.

by jchw16 hours ago|

parent|

prev|

[-]

32 GiB of VRAM is possible to acquire for less than $1000 if you go for the Arc Pro B70. I have two of them. The tokens/sec is nowhere near AMD or NVIDIA high end, but its unexpectedly kind of decent to use. (I probably need to figure out vLLM though as it doesn't seem like llama.cpp is able to do them justice even seemingly with split mode = row. But still, 30t/s on Gemma 4 (on 26B MoE, not dense) is pretty usable, and you can do fit a full 256k context.)

When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)

by zozbot23416 hours ago|

parent|

[-]

New versions of llama.cpp have experimental split-tensor parallelism, but it really only helps with slow compute and a very fast interconnect, which doesn't describe many consumer-grade systems. For most users, pipeline parallelism will be their best bet for making use of multi-GPU setups.

by jchw15 hours ago|

parent|

[-]

Yeah, I was doing split tensor and it seemed like a wash. The Arc B70s are not huge on compute.

Right now I'm only able to run them in PCI-e 5.0 x8 which might not be sufficient. But, a cheap older Xeon or TR seems silly since PCI-e 4.0 x16 isn't theoretically more bandwidth than PCI-e 5.0 x8. So it seems like if that is really still bottlenecked, I'll just have to bite the bullet and set up a modern HEDT build. With RAM prices... I am not sure there is a world where it could ever be worth it. At that point, seems like you may as well go for an obscenely priced NVIDIA or AMD datacenter card instead and retrofit it with consumer friendly thermal solutions. So... I'm definitely a bit conflicted.

I do like the Arc Pro B70 so far. Its not a performance monster, but it's quiet and relatively low power, and I haven't run into any instability. (The AMDGPU drivers have made amazing strides, but... The stability is not legendary. :)

I'll have to do a bit of analysis and make sure there really is an interconnect bottleneck first, versus a PEBKAC. Could be dropping more lanes than expected for one reason or another too.

by zozbot23415 hours ago|

parent|

[-]

You could fit your HEDT with minimum RAM and a combination of Optane storage (for swapping system RAM with minimum wear) and fast NAND (for offloading large read-only data). If you have abundant physical PCIe slots it ought to be feasible.

by dist-epoch14 hours ago|

parent|

prev|

[-]

NVIDIA 5070 Ti can run Gemma 4 26B at 4-bit at 120 tk/s.

Arc Pro B70 seems unexpectedely slow? Or are you using 8-bit/16-bit quants.

by jchw13 hours ago|

parent|

[-]

Unfortunately it really is running this slow with Llama.cpp, but of course that's with Vulkan mode. The VRAM capacity is definitely where it shines, rather than compute power. I am pretty sure that this isn't really optimal use of the cards, especially since I believe we should be able to get decent, if still sublinear, scaling with multiple cards. I am not really a machine learning expert, I'm curious to see if I can manage to trace down some performance issues. (I've already seen a couple issues get squashed since I first started testing this.)

I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.

A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.

by nyrikki7 hours ago|

parent|

[-]

Parallelism can be tricky and always has a cost, but don't discount the 3090 which is more expensive these days in that price bracket.

3090 llama.cpp (container in VM)

    unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL  105 t/s
    unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL  103 t/s

Still slow compaired to the

    ggml-org/gpt-oss-20b-GGUF 206 t/s

But on my 3x 1080 Ti 1x TITAN V getto machine I learned that multi gpu takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part.

There are a lot of variables, but PCIe bus speed doesn't matter that much for inference, but the internal memory bandwidth does, and you won't match that with PCIe ever.

To be clear, multicard Vulkan and absolutely SYCL have a lot of optimizations that could happen, but the only time two GPUs are really faster for inference is when one doesn't have enough ram to fit the entire model.

A 3090 has 936.2 GB/s of (low latency) internal bandwidth, while 16xPCIe5 only has 504.12, may have to be copied through the CPU, have locks, atomic operations etc...

For LLM inference, the bottleneck just usually going to be memory bandwidth which is why my 3090 is so close to the 5070ti above.

LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.

As I haven't used the larger intel GPUs I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some nvlink style RDMA support _unless_ your process is compute and not memory bound.

by TechSquidTV16 hours ago|

parent|

prev|

[-]

My Mac Studio with 96GB of RAM is maybe just at the low end of passable. It's actually extremely good for local image generation. I could somewhat replace something like Nano Banana comfortably on my machine.

But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.

For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.

by slopinthebag14 hours ago|

parent|

[-]

You just need to adjust your workflow to use the smaller models for coding. It's primarily just a case of holding them wrong if you end up with worse outputs.

by layer816 hours ago|

parent|

prev|

[-]

It’s also doable with AMD Strix Halo.

by bfivyvysj16 hours ago|

parent|

prev|

[-]

A bit like asking how long is a piece of string.

by latentsea16 hours ago|

parent|

[-]

It's twice as long as from one end to the middle.

by palmotea16 hours ago|

parent|

prev|

[-]

More like "about how long of a string do I need to run between two houses in the densest residential neighborhood of single-family homes in the US?"

by angoragoats16 hours ago|

parent|

prev|

[-]

Macs with unified memory are economical in terms of $/GB of video memory, and they match an optimized/home built GPU setup in efficiency (W/token), but they are slow in terms of absolute performance.

With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.

To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).

by utilize180816 hours ago|

parent|

prev|

[-]

Obviously going to depend on your definition of "decent". My impression so far is that you will need between 90GB to 100GB of memory to run medium sized (31B dense or ~110B MoE) models with some quantization enabled.

by cjbgkagh16 hours ago|

parent|

[-]

I’m running Gemma4 31B (Q8) on my 2 4090s (48GB) with no problem.

by Glemllksdf15 hours ago|

parent|

[-]

I have the same setup but tried paperclip ai with it and it seems to me that either i'm unable to setup it properly or multiply agents struggle with this setup. Especially as it seems that paperclip ai and opencode (used for connection) is blowing up the context to 20-30k

Any tips around your setup running this?

I use lmstudio with default settings and prioritization instead of split.

by cjbgkagh13 hours ago|

parent|

[-]

I asked AI for help setting it up. I use 128k context for 31B and 256k context for 26B4A. Ollama worked out of the box for me but I wanted more control with llama.cpp.

My command for llama-server:

llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -ngl 99 -sm layer -ts 10,12 --jinja --flash-attn on --cont-batching -np 1 -c 262144 -b 4096 -ub 512 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 8080 --timeout 18000

by littlestymaar16 hours ago|

parent|

prev|

[-]

No, GP is excessively restrictive. Llama.cpp supports RAM offloading out of the box.

It's going to be slower than if you put everything on your GPU but it would work.

And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.

by FusionX16 hours ago|

parent|

prev|

[-]

Aren't 4bits model decent? Since, this is an MOE model, I'm assuming it should have respectable tk/s, similar to previous MOE models.

by gunalx12 hours ago|

parent|

prev|

[-]

Running q3 xss with full and quantizised context as options on a 16gb gpu and still has pretty decent quality and fitting fine with up to 64k context.