Additional VRAM is needed for context.

This is a MoE model with only 3B active parameters per token, which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU, the slower it becomes, though.

reply
Isn't that some kind of gambling if you offload random experts onto the CPU?

Or is it only whole layers, which would affect all experts?

reply
Pretty sure all partial offload systems I’ve seen work by layers, but there might be something else out there.
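
To make the by-layer idea concrete, here is a minimal sketch of how such a split could be estimated. The function name and all numbers are hypothetical, and context (KV cache) memory is modeled as a flat reserve, which is a simplification; real runtimes decide this in their own way.

```python
def layers_on_gpu(n_layers: int, layer_gib: float,
                  vram_gib: float, reserve_gib: float) -> int:
    """Return how many whole layers fit in VRAM after reserving room for context."""
    usable = max(vram_gib - reserve_gib, 0.0)
    return min(n_layers, int(usable // layer_gib))

# Hypothetical model: 48 layers of 0.5 GiB each (24 GiB of weights),
# on a 16 GiB card with 2 GiB held back for context.
gpu = layers_on_gpu(48, 0.5, 16.0, 2.0)
print(f"{gpu} layers on GPU, {48 - gpu} on CPU")
```

Every layer left on the CPU side runs at CPU speed, which is why throughput drops as the split shifts.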
reply
I really want to know what M, K, XL, and XS mean in this context, and how to choose.

I searched all of the Unsloth docs and there seems to be no explanation at all.

reply
Q4_K is a type of quantization. It means that all weights will use at least 4 bits, quantized with the K method.

But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.

The S/M/L/XL is what tells you how many tensors get to use more bits.

The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).

Here's an example of the tensor types inside the S, M, and L variants of a Q4_K_ quant:

    S
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  136 tensors
    llama_model_loader: - type q5_0:   43 tensors
    llama_model_loader: - type q5_1:   17 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   55 tensors
    M
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   15 tensors
    llama_model_loader: - type q8_0:   83 tensors
    L
    llama_model_loader: - type  f32:  392 tensors
    llama_model_loader: - type q4_K:  106 tensors
    llama_model_loader: - type q5_0:   32 tensors
    llama_model_loader: - type q5_K:   30 tensors
    llama_model_loader: - type q6_K:   14 tensors
    llama_model_loader: - type q8_0:   84 tensors
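
For a quick comparison, the counts from the listing above can be tallied like this. It is only a sketch: it counts tensors, not bytes, so it says nothing about how large each tensor is, and the f32 tensors are left out because they are not quantized at all.

```python
# Tensor counts copied from the loader output above.
counts = {
    "S": {"f32": 392, "q4_K": 136, "q5_0": 43, "q5_1": 17, "q6_K": 15, "q8_0": 55},
    "M": {"f32": 392, "q4_K": 106, "q5_0": 32, "q5_K": 30, "q6_K": 15, "q8_0": 83},
    "L": {"f32": 392, "q4_K": 106, "q5_0": 32, "q5_K": 30, "q6_K": 14, "q8_0": 84},
}

def wider_than_q4(variant: str) -> int:
    """Count quantized tensors stored in a type wider than q4_K."""
    return sum(n for t, n in counts[variant].items() if t not in ("f32", "q4_K"))

for v in counts:
    total = sum(counts[v].values())
    print(f"{v}: {total} tensors, {counts[v]['q4_K']} at q4_K, "
          f"{wider_than_q4(v)} wider than q4_K")
```

In this particular listing, S to M moves 30 tensors out of q4_K, while M to L only bumps a single tensor from q6_K to q8_0.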
reply
They are different quantization types; you can read more here: https://huggingface.co/docs/hub/gguf#quantization-types
reply
Just start with q4_k_m and figure out the rest later.
reply
Thanks! I'd scanned the main content but I'd been blind to the sidebar on the far right.
reply
"16-bit BF16 69.4 GB"

Is that (BF16) a 16-bit float?

reply
Yes, it's a "Brain float", basically an ordinary 32-bit float with the low 16 mantissa bits cut off. Exact same range as fp32, lower precision, and not the same as the other fp16, which has less exponent and more mantissa.
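
The "cut off the low 16 bits" description can be tried directly. This is a minimal sketch with hypothetical helper names, and it truncates exactly as described; actual conversions usually round to nearest instead of truncating.

```python
import struct

def f32_bits(x: float) -> int:
    """Return the raw 32-bit pattern of x stored as an IEEE float32."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def to_bf16_bits(x: float) -> int:
    """Keep sign, 8 exponent bits and top 7 mantissa bits (plain truncation)."""
    return f32_bits(x) >> 16

def from_bf16_bits(b: int) -> float:
    """Widen a bf16 bit pattern back to float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

x = 3.14159
b = to_bf16_bits(x)
print(hex(b), from_bf16_bits(b))  # 0x4049 3.140625
```

Only two or three significant decimal digits survive, but the exponent field is untouched, so the range is the same as fp32.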
reply
The IEEE standard FP16 is an older 16-bit format, which has balanced exponent and significand sizes.

It was initially supported by GPUs, where it is especially useful for storing the color components of pixels. For geometry data, FP32 is preferred.

In CPUs, some support was first added in 2012, in Intel Ivy Bridge. Better support is provided in some server CPUs, and starting next year also in the desktop AMD Zen 6 and Intel Nova Lake.

BF16 is a format introduced by Google, intended only for AI/ML applications, not for graphics, so initially it was implemented in some of the Intel server CPUs and only later in GPUs. Unlike FP16, which is balanced, BF16 has great dynamic range, but very low precision. This is fine for ML but inappropriate for any other applications.
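
The range versus precision trade-off follows directly from the bit layouts (FP32: 8 exponent and 23 mantissa bits, FP16: 5 and 10, BF16: 8 and 7). A small sketch, with a hypothetical helper name, that derives the largest finite value and the spacing of values just above 1.0 for each:

```python
def fmt_stats(exp_bits: int, man_bits: int) -> tuple[float, float]:
    """Return (largest finite value, spacing just above 1.0) for an IEEE-style format."""
    bias = 2 ** (exp_bits - 1) - 1          # the top exponent code is reserved for inf/NaN
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    epsilon = 2.0 ** -man_bits
    return max_finite, epsilon

formats = {"FP32": (8, 23), "FP16": (5, 10), "BF16": (8, 7)}
for name, (e, m) in formats.items():
    max_finite, epsilon = fmt_stats(e, m)
    print(f"{name}: max ~ {max_finite:.4e}, epsilon = {epsilon:.3e}")
```

FP16 tops out at 65504, while BF16 reaches roughly the same maximum as FP32 but with steps of 1/128 near 1.0.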

Nowadays, most LLMs are trained predominantly using BF16, with a small number of parameters using FP32, for higher precision.

Then from the biggest model that uses BF16, smaller quantized models are derived, which use 8 bits or less per parameter, trading off accuracy for speed.
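
To give a feel for what "8 bits per parameter" costs in accuracy, here is a deliberately simplified sketch of block-wise absmax quantization to int8. It is not the actual GGUF or llama.cpp format (those pack data differently, and the K-quants are more elaborate); the function names and sample weights are made up for illustration.

```python
def quantize_block(weights: list[float]) -> tuple[float, list[int]]:
    """Map a block of floats to int8 codes in [-127, 127] plus one shared scale."""
    amax = max(abs(w) for w in weights)
    scale = amax / 127 if amax else 1.0
    return scale, [round(w / scale) for w in weights]

def dequantize_block(scale: float, codes: list[int]) -> list[float]:
    """Recover approximate weights from the codes and the shared scale."""
    return [scale * c for c in codes]

block = [0.12, -0.5, 0.33, 0.0, 0.9, -0.07, 0.41, -0.88]  # made-up weights
scale, codes = quantize_block(block)
restored = dequantize_block(scale, codes)
max_err = max(abs(a - b) for a, b in zip(block, restored))
print(codes)
print(f"scale = {scale:.5f}, worst-case error = {max_err:.5f}")
```

The rounding error per weight is bounded by half the scale, so blocks with one large outlier lose more precision on their small weights. Fewer bits means fewer codes and a coarser grid.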

reply
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format

Yes, but it’s a different format from standard fp16: it trades precision for greater dynamic range.

reply
Yes, it has 8 exponent bits like float32, instead of 5 like float16.
reply