If you download the release day quants with a tool that doesn’t automatically check HF for new versions you should check back again in a week to look for updated versions.
Some times the launch day quantizations have major problems which leads to early adopters dismissing useful models. You have to wait for everyone to test and fix bugs before giving a model a real evaluation.
For MiniMax 2.7 - there were NaNs, but it wasn't just ours - all quant providers had it - we identified 38% of bartowski's had NaNs. Ours was 22%. We identified a fix, and have already fixed ours see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not, but is working on it. We share our investigations always.
For Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were not optimal, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
On other fixes, we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
It might seem these issues are due to us, but it's because we publicize them and tell people to update. 95% of them are not related to us, but as good open source stewards, we should update everyone.
1. Split metadata into shard 0 for huge models so 10B is for chat template fixes - however sometimes fixes cause a recalculation of the imatrix, which means all quants have to be re-made
2. Add HF discussion posts on each model talking about what changed, and on our Reddit and Twitter
3. Hugging Face XET now has de-duplication downloading of shards, so generally redownloading 100GB models again should be much faster - it chunks 100GB into small chunks and hashes them, and only downloads the shards which have changed
Ideally the labs releasing the open models would work with Unsloth and the llama.cpp maintainers in advance to work out the bugs up front. That does sometimes happen, but not always.
We do get early access to nearly all models, and we do find the most pressing issues sometimes. But sadly some issues are really hard to find and diagnose :(
HF also provides SHA256 for eg https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/blob/main/U... is 92986e39a0c0b5f12c2c9b6a811dad59e3317caaf1b7ad5c7f0d7d12abc4a6e8
But agreed it's probs better to place them in a table
The purist in me feels the 50GB chunks are a temporary artifact of Hugging Face's uploading requirements, and the authoritative model file should be the merged one. I am unable to articulate any practical reason why this matters.
Though chat templates seem like they need a better solution. So many issues, seems quite fragile.
For serious fixes, sadly we have to re-compute imatrix since the activation patterns have changed - this sadly makes the entire quant change a lot, hence you have to re-download :(
We try our best as model distributors to fix them on day 0 or 1, but 95% of issues aren't our issues - as you mentioned it's the chat template or runtime etc
Users of the quantized model might be even made to think that the model sucks because the quantized version does.
An imperfect analogy might be the Linux kernel. Linus publishes official releases as a tagged source tree but most people who use Linux run a kernel that has been tweaked, built, and packaged by someone else.
That said, models often DO come from the factory in multiple quants. Here's the FP8 quant for Qwen3.6 for example: https://huggingface.co/Qwen/Qwen3.6-35B-A3B-FP8
Unsloth and other organizations produce a wider variety of quants than upstream to fit a wider variety of hardware, and so end users can make their own size/quality trade-offs as needed.
Qwen did release an fp8 version, which is a quantized version.
- Why is Qwen's default "quantization" setup "bad" - Who is Unsloth? - Why is his format better? What gains does a better format give? What are the downsides of a bad format? - What is quantization? Granted, I can look up this myself, but I thought I'd ask for the full picture for other readers.
https://unsloth.ai/docs/basics/unsloth-dynamic-2.0-ggufs is what might be helpful. You might have heard 1bit dynamic DeepSeek quants (we did that) - not all layers can be 1bit - important ones are in 8bit or 16bit, and we show it still works well.
Unsloth releases lower-quality versions of the model (Qwen in this case). Think about taking a 95% quality JPEG and converting it to a 40% quality JPEG.
Models are quantized to lower quality/size so they can run on cheaper/consumer GPUs.
Precision Quantization Tag File Size
1-bit UD-IQ1_M 10 GB
2-bit UD-IQ2_XXS 10.8 GB
UD-Q2_K_XL 12.3 GB
3-bit UD-IQ3_XXS 13.2 GB
UD-Q3_K_XL 16.8 GB
4-bit UD-IQ4_XS 17.7 GB
UD-Q4_K_XL 22.4 GB
5-bit UD-Q5_K_XL 26.6 GB
16-bit BF16 69.4 GBThis model is a MoE model with only 3B active parameters per expert which works well with partial CPU offload. So in practice you can run the -A(N)B models on systems that have a little less VRAM than you need. The more you offload to the CPU the slower it becomes though.
Or is it only layers but that would affect all Experts?
I searched all unsloth doc and there seems no explaination at all.
But if you're willing to give more bits to only certain important weights, you get to preserve a lot more quality for not that much more space.
The S/M/L/XL is what tells you how many tensors get to use more bits.
The difference between S and M is generally noticeable (on benchmarks). The difference between M and L/XL is less so, let alone in real use (ymmv).
Here's an example of the contents of a Q4_K_:
S
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 136 tensors
llama_model_loader: - type q5_0: 43 tensors
llama_model_loader: - type q5_1: 17 tensors
llama_model_loader: - type q6_K: 15 tensors
llama_model_loader: - type q8_0: 55 tensors
M
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 106 tensors
llama_model_loader: - type q5_0: 32 tensors
llama_model_loader: - type q5_K: 30 tensors
llama_model_loader: - type q6_K: 15 tensors
llama_model_loader: - type q8_0: 83 tensors
L
llama_model_loader: - type f32: 392 tensors
llama_model_loader: - type q4_K: 106 tensors
llama_model_loader: - type q5_0: 32 tensors
llama_model_loader: - type q5_K: 30 tensors
llama_model_loader: - type q6_K: 14 tensors
llama_model_loader: - type q8_0: 84 tensorsIs that (BF16) a 16-bit float?
It has been initially supported by GPUs, where it is useful especially for storing the color components of pixels. For geometry data, FP32 is preferred.
In CPUs, some support has been first added in 2012, in Intel Ivy Bridge. Better support is provided in some server CPUs, and since next year also in the desktop AMD Zen 6 and Intel Nova Lake.
BF16 is a format introduced by Google, intended only for AI/ML applications, not for graphics, so initially it was implemented in some of the Intel server CPUs and only later in GPUs. Unlike FP16, which is balanced, BF16 has great dynamic range, but very low precision. This is fine for ML but inappropriate for any other applications.
Nowadays, most LLMs are trained preponderantly using BF16, with a small number of parameters using FP32, for higher precision.
Then from the biggest model that uses BF16, smaller quantized models are derived, which use 8 bits or less per parameter, trading off accuracy for speed.
Yes, however it’s a different format from standard fp16, it trades precision for greater dynamic range.
With 16 GB you'll be only able to run a very compressed variant with noticable quality loss.
You can connect to that port with any browser, for chat.
Or you can connect to that port with any application that supports the OpenAI API, e.g. a coding assistant harness.
What's the minimum memory you need to run a decent model? Is it pretty much only doable by people running Macs with unified memory?
> https://www.dell.com/en-us/shop/cty/pdp/spd/dell-pro-max-fcm...
> https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...
> https://frame.work/products/desktop-diy-amd-aimax300/configu...
etc.
But yes, a modern SoC-style system with large unified memory pool is still one of the best ways to do it.
When I get home today I totally look forward to trying the unsloth variants of this out (assuming I can get it working in anything.) I expect due to the limited active parameter count it should perform very well. It's obviously going to be a long time before you can run current frontier quality models at home for less than the price of a car, but it does seem like it is bound to happen. (As long as we don't allow general purpose computers to die or become inaccessible. Surely...)
Right now I'm only able to run them in PCI-e 5.0 x8 which might not be sufficient. But, a cheap older Xeon or TR seems silly since PCI-e 4.0 x16 isn't theoretically more bandwidth than PCI-e 5.0 x8. So it seems like if that is really still bottlenecked, I'll just have to bite the bullet and set up a modern HEDT build. With RAM prices... I am not sure there is a world where it could ever be worth it. At that point, seems like you may as well go for an obscenely priced NVIDIA or AMD datacenter card instead and retrofit it with consumer friendly thermal solutions. So... I'm definitely a bit conflicted.
I do like the Arc Pro B70 so far. Its not a performance monster, but it's quiet and relatively low power, and I haven't run into any instability. (The AMDGPU drivers have made amazing strides, but... The stability is not legendary. :)
I'll have to do a bit of analysis and make sure there really is an interconnect bottleneck first, versus a PEBKAC. Could be dropping more lanes than expected for one reason or another too.
Arc Pro B70 seems unexpectedely slow? Or are you using 8-bit/16-bit quants.
I've heard that vLLM performs much better, scaling particularly better in the multi GPU case. The 4x B70 setup may actually be decent for the money given that, but probably worth waiting on it to see how the situation progresses rather than buying on a promise of potential.
A cursory Google search does seem to indicate that in my particular case interconnect bandwidth shouldn't actually be a constraint, so I doubt tensor level parallelism is working as expected.
3090 llama.cpp (container in VM)
unsloth/Qwen3.6-35B-A3B-GGUF:UD-Q4_K_XL 105 t/s
unsloth/gemma-4-26B-A4B-it-GGUF:UD-Q4_K_XL 103 t/s
Still slow compaired to the ggml-org/gpt-oss-20b-GGUF 206 t/s
But on my 3x 1080 Ti 1x TITAN V getto machine I learned that multi gpu takes a lot of tuning no matter what. With the B70, where Vulkan has the CPU copy problem, and SYCL doesn't have a sponsor or enough volunteers, it will probably take a bit of profiling on your part.There are a lot of variables, but PCIe bus speed doesn't matter that much for inference, but the internal memory bandwidth does, and you won't match that with PCIe ever.
To be clear, multicard Vulkan and absolutely SYCL have a lot of optimizations that could happen, but the only time two GPUs are really faster for inference is when one doesn't have enough ram to fit the entire model.
A 3090 has 936.2 GB/s of (low latency) internal bandwidth, while 16xPCIe5 only has 504.12, may have to be copied through the CPU, have locks, atomic operations etc...
For LLM inference, the bottleneck just usually going to be memory bandwidth which is why my 3090 is so close to the 5070ti above.
LLM next token prediction is just a form of autoregressive decoding and will primarily be memory bound.
As I haven't used the larger intel GPUs I can't comment on what still needs to be optimized, but just don't expect multiple GPUs to increase performance without some nvlink style RDMA support _unless_ your process is compute and not memory bound.
But I don't need Nano Banana very much, I need code. While it can, there's no way I would ever opt to use a local model on my machine for code. It makes so much more sense to spend $100 on Codex, it's genuinely not worth discussing.
For non-thinking tasks, it would be a bit slower, but a viable alternative for sure.
With this model, since the number of active parameters is low, I would think that you would be fine running it on your 16GB card, as long as you have, say 32GB of regular system memory. Temper your expectations about speed with this setup, as your system memory and CPU are multiple times slower than the GPU, so when layers spill over you will slow down.
To avoid this, there's no need to buy a Mac -- a second 16GB GPU would do the trick just fine, and the combined dual GPU setup will likely be faster than a cheap mac like a Mac mini. Pay attention to your PCIe slots, but as long as you have at least an x4 slot for the second GPU, you'll be fine (LLM inference doesn't need x8 or x16).
Any tips around your setup running this?
I use lmstudio with default settings and prioritization instead of split.
My command for llama-server:
llama-server -m /models/gemma-4-26B-A4B-it-UD-Q8_K_XL.gguf -ngl 99 -sm layer -ts 10,12 --jinja --flash-attn on --cont-batching -np 1 -c 262144 -b 4096 -ub 512 -ctk q8_0 -ctv q8_0 --host 0.0.0.0 --port 8080 --timeout 18000
It's going to be slower than if you put everything on your GPU but it would work.
And if it's too slow for your taste you can try the quantized version (some Q3 variant should fit) and see how well it works for you.
Also you need to check your context size, Ollama default to 4K if <24 Gb of VRAM and you need 64K minimum if you want claude to be able to at least lift a finger.
It's incomparably faster than any other model (i.e. it's actually usable without cope). Caching makes a huge difference.
Maybe I just don't understand how quantization works, but I thought quantization was a very nasty problem involving a lot of plumbing
for the most recent example, as of April 16, 2026 (today)
Turboquant isnt still added to GGUF
2. Qwen3.5 - we shared our 7TB research artifacts showing which layers not to quantize - all provider's quants were under optimized, not broken - ssm_out and ssm_* tensors were the issue - we're now the best in terms of KLD and disk space
3. MiniMax 2.7 - we swiftly fixed it due to NaN PPL - we found the issue in all quants regardless of provider - so it affected everyone not just us. We wrote a post on it, and fixed it - others have taken our fix and fixed their quants, whilst some haven't updated.
Note we also fixed bugs in many OSS models like Gemma 1, Gemma 3, Llama chat template fixes, Mistral, and many more.
Unfortunately sometimes quants break, but we fix them quickly, and 95% of times these are out of our hand.
We swiftly and quickly fix them, and write up blogs on what happened. Other providers simply just take our blogs and fixes and re-apply, re-use our fixes.
We already fixed ours. Bart hasn't yet but is still working on it following our findings.
blk.61.ffn_down_exps in Q4_K or Q5_K failed - it must be in Q6_K otherwise it overflows.
For the others, yes layers in some precision don't work. For eg Qwen3.5 ssm_out must be minimum Q4-Q6_K.
ssm_alpha and ssm_beta must be Q8_0 or higher.
Again Bart and others apply our findings - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
The 4th is Google themselves improving the chat template for tool calling for Gemma.
https://github.com/ggml-org/llama.cpp/issues/21255 was another issue CUDA 13.2 was broken - this was NVIDIA's CUDA compiler itself breaking - fully out of our hands - but we provided a solution for it.