We re-uploaded Gemma4 4 times - 3 were due to 20 llama.cpp bug fixes, some of which we helped solve as well. The 4th was an official Gemma chat template improvement from Google themselves, so that one was out of our hands. All providers had to re-fix their uploads, so it wasn't just us.

For MiniMax 2.7 - there were NaNs, but not just in ours - all quant providers had them. We found NaNs in 38% of bartowski's quants; ours was 22%. We identified a fix and have already applied it to ours - see https://www.reddit.com/r/LocalLLaMA/comments/1slk4di/minimax.... Bartowski has not yet, but is working on it. We always share our investigations.

For Qwen3.5 - we shared our 7TB of research artifacts showing which layers not to quantize. All providers' quants were suboptimal, not broken - the ssm_out and ssm_* tensors were the issue. We're now the best in terms of KLD and disk space - see https://www.reddit.com/r/LocalLLaMA/comments/1rgel19/new_qwe...
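(KLD here is the KL divergence of the quantized model's token probabilities against the full-precision model's - lower means the quant tracks the original more closely. A minimal per-position sketch; the function name is mine:)

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions over the same token vocab.

    p: probabilities from the full-precision model, q: from the quantized one.
    0.0 means the quant reproduces the original distribution exactly.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# identical distributions -> divergence of 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))  # 0.0
```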

On other fixes, we've also fixed bugs in many OSS models - Gemma 1, Gemma 3, Mistral, Llama chat templates, and many more.

It might seem these issues originate with us, but that's only because we publicize them and tell people to update. 95% of them are not related to us, but as good open source stewards, we should keep everyone updated.

reply
I just wanted to express gratitude to you guys - you do great work. However, it is a little annoying to have to redownload big models, and keeping up with the AI news and community sentiment is a full time job. I wish there were some mechanism somewhere (on your site or Huggingface or something) for displaying feedback or confidence in a model being "ready for general use" before kicking off 100+ GB model downloads.
reply
Hey thanks - yes agreed - for now we do:

1. Split metadata into shard 0 for huge models, so the ~10MB first shard is all you need for chat template fixes - however, sometimes fixes force a recalculation of the imatrix, which means all quants have to be re-made

2. Add HF discussion posts on each model explaining what changed, and post on our Reddit and Twitter

3. Hugging Face XET now does de-duplicated downloading of shards, so redownloading 100GB models should generally be much faster - it splits the 100GB into small chunks, hashes them, and only downloads the chunks which have changed
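The dedup in #3 can be sketched with stdlib hashing. Note this toy uses fixed-size chunks, whereas XET uses content-defined chunking so an insertion doesn't shift every later chunk boundary; function names and chunk size are mine, not XET's actual scheme:

```python
import hashlib

CHUNK = 4  # toy chunk size; real systems use ~KB-MB, content-defined chunks

def chunk_hashes(data: bytes) -> list[str]:
    """Split data into fixed-size chunks and hash each one."""
    return [hashlib.sha256(data[i:i + CHUNK]).hexdigest()
            for i in range(0, len(data), CHUNK)]

def changed_chunks(local: bytes, remote: bytes) -> list[int]:
    """Indices of remote chunks whose hash differs from the local copy."""
    old, new = chunk_hashes(local), chunk_hashes(remote)
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]

# a 1-byte fix in a huge file touches only the chunk containing it
v1 = b"AAAABBBBCCCC"
v2 = b"AAAABxBBCCCC"
print(changed_chunks(v1, v2))  # [1]
```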

reply
If you happen to know - is this also why LM Studio and Ollama model downloads often fail with a signature mismatch error?
reply
Ah thanks, I wasn't aware of #3, that should be a huge boon.
reply
Best policy is to just wait a couple of weeks after a major model is released. It's frustrating to have to re-download tens or hundreds of GB every few days, but the quant producers have no choice but to release early and often if they want to maintain their reputation.

Ideally the labs releasing the open models would work with Unsloth and the llama.cpp maintainers in advance to work out the bugs up front. That does sometimes happen, but not always.

reply
Yep, agreed - waiting at least 1 week is a good idea :)

We do get early access to nearly all models, and we sometimes find the most pressing issues. But sadly some issues are really hard to find and diagnose :(

reply
Please publish sha256sums of the merged GGUFs in the model descriptions. Otherwise it's hard to tell if the version we have is the latest.
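E.g., once the sums are published, verification is a one-liner with coreutils (a scratch file stands in for the real .gguf here):

```shell
# sketch: verifying a download against a published checksum
f=$(mktemp)
printf 'fake gguf bytes' > "$f"

sha256sum "$f"                        # prints "<sha256>  <path>"

# verify against a published "<sha256>  <filename>" line
published=$(sha256sum "$f" | awk '{print $1}')
echo "$published  $f" | sha256sum --check
```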
reply
Yep, we can do that - probs add a table. In general we post in the Discussions of model pages - e.g. https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/discussions...

HF also provides SHA256s - e.g. https://huggingface.co/unsloth/MiniMax-M2.7-GGUF/blob/main/U... is 92986e39a0c0b5f12c2c9b6a811dad59e3317caaf1b7ad5c7f0d7d12abc4a6e8

But agreed it's probs better to place them in a table

reply
Thanks! I know about HF's chunk checksums, but HF doesn't publish (or possibly even know) the merged checksums.
reply
Oh, for multi-file models? Hmm, ok, let me check that out
reply
Why do you merge the GGUFs? The 50 GB files are more manageable (IMO) and you can verify checksums as you say.
reply
I admit it's a habit that's probably weeks out of date. Earlier engines barfed on split GGUFs, but support is a lot better now. Frontends didn't always infer the model name correctly from the first chunk's filename, but once llama.cpp added the models.ini feature, that objection went away.

The purist in me feels the 50GB chunks are a temporary artifact of Hugging Face's uploading requirements, and the authoritative model file should be the merged one. I am unable to articulate any practical reason why this matters.

reply
Just curious: the fixes are not about the weights but about the templates, am I right?
reply
Appreciate the work of your team very much.

Though chat templates seem like they need a better solution - so many issues, and the mechanism seems quite fragile.

reply
What do you think about creating a tool which can just patch the template embedded in the .gguf file instead of forcing a re-download? The whole file hash can be checked afterwards.
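In principle it's doable: per the GGUF spec, metadata strings like the `tokenizer.chat_template` value are a little-endian uint64 length followed by UTF-8 bytes, so a patcher would walk the header to the value's offset and splice in the new template. A toy sketch of just the splice step (offset discovery omitted; names are mine):

```python
import struct

def splice_gguf_string(blob: bytes, offset: int, new_value: str) -> bytes:
    """Replace a u64-length-prefixed UTF-8 string at `offset` in raw bytes.

    GGUF metadata strings use this encoding; a real patcher would first
    parse the header to find where tokenizer.chat_template's value lives.
    """
    (old_len,) = struct.unpack_from("<Q", blob, offset)
    new_bytes = new_value.encode("utf-8")
    return (blob[:offset]
            + struct.pack("<Q", len(new_bytes)) + new_bytes
            + blob[offset + 8 + old_len:])

# toy blob: 4 magic bytes, then one length-prefixed string, then a tail
blob = b"GGUF" + struct.pack("<Q", 3) + b"old" + b"TAIL"
patched = splice_gguf_string(blob, 4, "newer")
```

The catch: a different-length template shifts all the bytes after it (including the tensor data), so a real patcher has to rewrite the rest of the file too - which is roughly what re-downloading a metadata-only first shard gives you.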
reply
Sadly it's not always chat template fixes :( But yes, we now split the first shard out as pure metadata (~10MB) for huge models - this includes the chat template etc. - so you only need to re-download that shard.

For serious fixes, we sadly have to re-compute the imatrix since the activation patterns have changed - this alters the entire quant substantially, hence the full re-download :(
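For context, the imatrix ("importance matrix") is built from activation statistics collected on calibration text - roughly, per-column mean squared activations - which is why changed activations invalidate every quant derived from it. A toy sketch of the statistic (not llama.cpp's actual implementation; names are mine):

```python
def imatrix_from_activations(activations: list[list[float]]) -> list[float]:
    """Mean squared activation per input column over calibration samples.

    Columns that see large activations get more weight when minimizing
    quantization error - so different activations -> different quants.
    """
    n = len(activations)
    dims = len(activations[0])
    return [sum(row[d] ** 2 for row in activations) / n for d in range(dims)]

# two calibration samples, two columns
print(imatrix_from_activations([[1.0, 2.0], [3.0, 4.0]]))  # [5.0, 10.0]
```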

reply
Not to mention that almost every model release has some (at least minor) issue in the prompt template and/or the runtime itself. So even when providers (not unsloth specifically - in general) claim "Day 0 support", pay extra attention to actual quality, as it takes a week or two before the issues have been hammered out.
reply
Yes, this is fair - we try our best to communicate issues. I think we're largely the only ones communicating that model A or B has been fixed, etc.

We try our best as model distributors to fix things on day 0 or 1, but 95% of issues aren't ours - as you mentioned, it's the chat template or runtime etc.

reply
I don't understand why the open source model providers don't also publish the quantized versions themselves?
reply
They sometimes do! Qwen, Google etc do them!
reply
Thank you very much for this comment! I was not aware of that.
reply