by Wowfunhappy 21 hours ago

by gcr 20 hours ago

There are two forms of compression relevant to LLMs:

1. Reduce the number of parameters

2. Reduce the resolution of each parameter (quantization)

For 1, changing the architecture is typically something only the labs producing the models can do, which is why each OSS model release tends to feature a small number of carefully chosen model sizes (for example, Gemma4 comes in e2B, e4B, 12B, 26Ba4B, and 31B sizes).

Generally, models with higher parameter counts have more world knowledge. For coding models, this shows up as a stronger command of uncommon libraries/languages. Very small models (<20B) also lack “smarts.”

Reducing the resolution of each parameter is easier, which is why lots of practitioners publish their own quantizations, but it makes it harder for a model to “think” fluently. Interacting with heavily quantized models feels like interacting with someone who didn’t get any sleep the night before.

Models that have higher-fidelity quantization take more RAM and have higher “smarts,” but don’t necessarily have more world knowledge. Models with aggressive quantization tend to be more likely to make rookie mistakes, emit malformed tool calls, get stuck in loops, or even exhibit signs of “neuroticism” / “distress” in their thinking tokens.

Parameter counts = world knowledge, quantization = “smarts.”

This is a soft rule of thumb; the difference isn’t very strong.
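
To make the two levers concrete, here is a rough back-of-the-envelope sketch of how parameter count and bits per parameter combine into weight size. The 30B figure and the bit widths are illustrative assumptions, not numbers from any specific release, and the estimate ignores the KV cache, activations, and the per-block scale overhead that real quantization formats add.

```python
def weight_footprint_gb(params_billion: float, bits_per_param: float) -> float:
    """Approximate size of the weights alone, in GB (10**9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_param
    return total_bits / 8 / 1e9


# Hypothetical 30B-parameter model at a few illustrative precisions.
for bits in (16, 8, 4):
    size = weight_footprint_gb(30, bits)
    print(f"30B @ {bits}-bit: ~{size:.0f} GB")
```

Halving the bits halves the footprint at a fixed parameter count, which is why quantization is the lever most people reach for first.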

by SirMadam 20 hours ago

SOTA LLM-specific compression achieves around 54%! https://arxiv.org/abs/2505.06252v3

by throwdbaaway 16 hours ago

On ZFS with zstd compression, I am getting 1.34x compressratio for the BF16 weights (across multiple models).

Here's the du output for GLM-5.2:

    $ du -s -BG /cube/models/zai-org/GLM-5.2/
    1099G   /cube/models/zai-org/GLM-5.2/
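
For comparing this with the percentages quoted elsewhere in the thread, a small sketch that converts a compression ratio into "fraction of original size". The implied logical size assumes the `du` figure above is the post-compression on-disk size, which the comment does not state outright.

```python
def fraction_of_original(compressratio: float) -> float:
    """Convert an N-x compression ratio into compressed/original size."""
    if compressratio <= 0:
        raise ValueError("compressratio must be positive")
    return 1.0 / compressratio


ratio = 1.34  # figure quoted in the comment
frac = fraction_of_original(ratio)
print(f"{ratio}x -> {frac:.1%} of original size, {1 - frac:.1%} saved")

# Assumption: 1099G is what du reports after compression.
on_disk_g = 1099
print(f"implied uncompressed size: ~{on_disk_g * ratio:.0f}G")
```

By this arithmetic, 1.34x works out to roughly three quarters of the original size.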

by walrus01 12 hours ago

> ...a bit of an odd question: how well do LLMs losslessly compress, as in for cold storage?

TBH this is close to the last-ranked cost consideration for being able to download and run this. Even though HDD and SSD prices have gone nuts as a result of the recent demand/shortage, it's not like 1.5TB of space costs a lot.

Even if you fed it into xz with the most CPU-intensive compression options and it didn't compress at all (e.g., like trying to xz an AV1 video, or whatever), it's still the cost of a single fast food hamburger meal in $/TB. The real concern is the RAM to run it.

But anyways, anecdotally, many 16-bit full precision GGUF files will compress to about 65% of original size with default xz options. I have a log here showing that's what IBM Granite 4.1 30b compressed to, which I'm keeping around but in lukewarm storage.
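
If you want to check a figure like this on your own files without compressing the whole thing, here is a minimal sketch using Python's standard `lzma` module, which writes the same `.xz` format. The function name, the sample size, and the `offset` parameter are my own choices for illustration; preset 6 is meant to mirror the default `xz` level. A sample from one part of a file may not be representative of the whole, so treat the result as a rough estimate.

```python
import lzma


def xz_ratio(path, sample_bytes=64 * 1024 * 1024, offset=0, preset=6):
    """Compress a sample of a file with xz and return compressed/original size.

    offset lets you skip a header or metadata region before sampling.
    """
    with open(path, "rb") as f:
        f.seek(offset)
        data = f.read(sample_bytes)
    if not data:
        raise ValueError("no data read; file is empty or offset is past the end")
    compressed = lzma.compress(data, preset=preset)
    return len(compressed) / len(data)
```

Calling `xz_ratio("model.gguf")` on a hypothetical file path returns a value like 0.65 if the sample compresses to 65% of its original size.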

by redox99 20 hours ago

Probably not at all, considering weights are randomly initialized.