For those who don't bother to click through profiles, Jeff really knows what he's talking about. Much of Meta/FAIR + community benefits from his code.
reply
I really love HN for this reason. Full of some of the brightest minds on the internet. Often the comments have very interesting information, instead of stupid knee-jerk reactions to post titles.
reply
Thanks Jeff -- can you point me to something written up about rANS? All I find online is turbulence modeling solutions; I presume this is not what you're referring to.

As we know, quantization is a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of any better lossless compression of BF16 weights out there?

The reason I ask is that DFloat11 seems relatively easy to plug into existing quantization workflows, but you seem dismissive of the paper -- I presume it's a gap in my understanding, and I'd like to understand.

reply
I don't know of any great write-ups unfortunately, but the rANS you're looking for is range asymmetric numeral systems.
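
If it helps, here's a toy single-state rANS codec I sketched (not from any particular write-up). Python big ints avoid renormalization; a real implementation keeps a fixed-width state and streams out bits or bytes as it goes:

    # Toy rANS over a tiny alphabet with fixed frequencies.
    freqs = {"a": 3, "b": 1}                 # symbol frequencies, total M = 4
    M = sum(freqs.values())
    cum, c = {}, 0
    for s, f in freqs.items():               # cumulative frequency table
        cum[s] = c
        c += f

    def encode(symbols):
        x = 1
        for s in reversed(symbols):           # encode in reverse so decoding runs forward
            f, cf = freqs[s], cum[s]
            x = (x // f) * M + cf + (x % f)
        return x

    def decode(x, n):
        out = []
        for _ in range(n):
            slot = x % M                       # which symbol's range the state falls in
            s = next(t for t in freqs if cum[t] <= slot < cum[t] + freqs[t])
            out.append(s)
            x = freqs[s] * (x // M) + slot - cum[s]
        return out

    msg = list("aababaaa")
    assert decode(encode(msg), len(msg)) == msg

The state grows by roughly log2(M/f) bits per symbol, which is how it approaches the entropy of the source.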
reply
There's a lot of material about ANS, e.g. gathered here: https://encode.su/threads/2078-List-of-Asymmetric-Numeral-Sy...
reply
> if you multiplied everything - hyperparameters, initialized weights, training data, etc in a network by 10^6 things will still work more or less the same since the upper range is hardly used (with the possible exception of some small number of special functions)

I doubt that very much. The thing is that inputs are multiplied by weights and added together in a neural network layer, and then the output becomes the input of the next layer, in a cycle that can repeat a hundred times or more. By the time you reach the final output layer, that 10^6 factor has been applied so many times that it has snowballed into a 10^600 factor.
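
To see the snowballing concretely, here's a small sketch I put together (not from the quoted comment): scale every weight matrix of a deep ReLU stack by k and the output scales by k to the power of the depth, because ReLU commutes with positive scaling.

    import numpy as np

    rng = np.random.default_rng(0)
    depth, width, k = 50, 64, 10.0   # k = 1e6 over 100 layers would overflow float64

    # He-style init keeps the unscaled activations well behaved
    weights = [rng.standard_normal((width, width)) * np.sqrt(2.0 / width)
               for _ in range(depth)]
    x = rng.standard_normal(width)

    def forward(ws, v):
        h = v
        for W in ws:
            h = np.maximum(W @ h, 0.0)   # ReLU layer
        return h

    base = forward(weights, x)
    scaled = forward([k * W for W in weights], x)

    # ReLU(k*W*h) == k*ReLU(W*h) for k > 0, so the ratio is k**depth
    print(np.linalg.norm(scaled) / np.linalg.norm(base))   # ~1e50 here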

reply
The DeepSeek-V3 paper details a quantisation method that applies scaling after the matmul but before accumulation to improve precision. This is different from a normal GEMM, where the scaling is left until the end; you can read more in section 3.3 of the paper below.

https://arxiv.org/html/2412.19437v2#S3
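
Roughly what I understand that to mean, as a numpy sketch rather than their code (int8 stands in for FP8; 128 is the group size along the inner dimension): each group has its own scale, and each group's partial product is rescaled and added into a higher-precision accumulator right away, instead of one scale being applied to the raw accumulated result at the end.

    import numpy as np

    K, GROUP = 1024, 128
    rng = np.random.default_rng(0)
    a = rng.standard_normal(K).astype(np.float32)
    b = rng.standard_normal(K).astype(np.float32)

    def quantize(v):
        # crude per-group symmetric quantization
        scale = np.abs(v).max() / 127.0
        q = np.round(v / scale).astype(np.int8)
        return q, scale

    acc = np.float32(0.0)
    for start in range(0, K, GROUP):
        qa, sa = quantize(a[start:start + GROUP])
        qb, sb = quantize(b[start:start + GROUP])
        partial = qa.astype(np.int32) @ qb.astype(np.int32)   # low-precision partial sum
        acc += np.float32(partial) * np.float32(sa * sb)      # scale before accumulating

    print(acc, float(a @ b))   # should agree to within roughly a percent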

reply
Note to others reading along: on the last appendix page, the OP paper reports that DFloat11 reduces tokens/sec by ~2-3x for the Llama-3.1-8B, Qwen-2.5-14B/32B, and Mistral-Small-24B models (the throughput penalty isn't reported for the others).

Using DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to the CPU.

Classic comp sci tradeoff between space and speed, no free lunch, etc.

reply
Was bfloat a mistake then? Wasn't the point of it to increase dynamic range?

At least the cost to truncate and zero fill is small.

reply
It makes you think: if we could rewind time, maybe we should have just allocated one more bit to half precision (6 exponent, 9 mantissa) and skipped this whole bfloat16 thing.
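
For the curious, here's a quick back-of-the-envelope comparison I sketched (the 1-6-9 split is the hypothetical format above, not a real standard) of what each 16-bit layout buys you in range versus precision:

    def fmt_stats(exp_bits, mant_bits):
        # IEEE-style layout: 1 sign bit, biased exponent, implicit leading 1
        bias = 2 ** (exp_bits - 1) - 1
        max_normal = (2 - 2.0 ** -mant_bits) * 2.0 ** bias
        min_normal = 2.0 ** (1 - bias)
        eps = 2.0 ** -mant_bits   # relative step size near 1.0
        return max_normal, min_normal, eps

    for name, e, m in [("fp16 (IEEE half)", 5, 10),
                       ("bf16", 8, 7),
                       ("hypothetical 1-6-9", 6, 9)]:
        mx, mn, eps = fmt_stats(e, m)
        print(f"{name:>20}: max ~{mx:.3g}, min normal ~{mn:.3g}, eps = {eps:.3g}")

So the 1-6-9 layout would top out around 4e9 with roughly 3 decimal digits of precision, sitting between fp16's 65504 and bf16's ~3.4e38.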
reply
Thanks for the fantastic explanation!

Would it be more efficient to calculate some kind of per-model or per-layer mean, and then only specify the deviations from it, maybe in fp8 or smaller?
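
To make the question concrete, here's roughly what I have in mind (a toy sketch; int8 residuals with a per-layer scale stand in for "fp8 or smaller"):

    import numpy as np

    rng = np.random.default_rng(0)
    # fake layer whose weights cluster around a nonzero mean
    layer = rng.normal(loc=0.02, scale=0.05, size=(1024, 1024)).astype(np.float32)

    mean = np.float32(layer.mean(dtype=np.float64))   # one scalar per layer
    residual = layer - mean
    scale = np.abs(residual).max() / 127.0            # per-layer scale
    q = np.round(residual / scale).astype(np.int8)    # 8-bit residuals

    reconstructed = mean + q.astype(np.float32) * scale
    rel_err = np.abs(reconstructed - layer).max() / np.abs(layer).max()
    print(f"max relative error ~{rel_err:.4f}")

Unlike DFloat11 that's lossy, of course, so it's really a quantization scheme rather than lossless compression.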

reply
Do you think there's a call for introducing an even smaller float that can pack more values into a SIMD register? Like a 12-bit one?
reply
The latest GPUs and TPUs support fp8. It's a big part of the efficiency gain in the latest systems. Blackwell also supports fp4.
reply