As we know, quantization is a critical tool for local LLM runners; RAM is typically the gating factor. Are you aware of any better lossless compression of BF16 weights out there?
The reason I ask is that DFloat11 seems relatively easy to plug into existing quantization workflows, but you seem dismissive of the paper -- I presume it's a gap in my understanding, and I'd like to close it.
I doubt that very much. The thing is, inputs are multiplied by weights and added together in a neural network layer, and the output then becomes the input to the next layer, a cycle that can repeat a hundred times or more. By the time you reach the final output layer, that 10^6 factor has been applied so many times that it has snowballed into a 10^600 factor.
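To spell out the compounding: the same factor applied once per layer, across roughly 100 layers, multiplies out to (10^6)^100 = 10^600. A back-of-the-envelope check (the layer count is just the round number used above):

```python
# Back-of-the-envelope: a per-layer multiplicative factor compounds
# exponentially with depth. Exact integer arithmetic, so no overflow.
per_layer_factor = 10**6   # the 10^6 factor referenced above
num_layers = 100           # assumed depth, per "a hundred times or more"

total = per_layer_factor ** num_layers
print(f"compounded factor: 10^{len(str(total)) - 1}")   # -> 10^600
```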
Using DFloat11, tokens/sec was higher only relative to running inference with some layers offloaded to CPU.
Classic comp sci tradeoff between space and speed, no free lunch, etc.
At least the cost to truncate and zero fill is small.
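Assuming "truncate and zero fill" refers to the FP32/BF16 round trip, here's a minimal numpy sketch of why that part is cheap -- it's just bit shifts, no arithmetic on the values themselves:

```python
import numpy as np

# BF16 is the top 16 bits of an FP32 value: FP32 -> BF16 drops the low
# 16 mantissa bits (truncate); BF16 -> FP32 shifts zeros back in (zero fill).
x = np.array([3.14159265], dtype=np.float32)

bits = x.view(np.uint32)
bf16 = (bits >> 16).astype(np.uint16)                        # truncate
restored = (bf16.astype(np.uint32) << 16).view(np.float32)   # zero fill

print(x[0], "->", restored[0])
```

(Real FP32-to-BF16 conversion usually rounds to nearest rather than truncating, but the cost is similarly tiny either way.)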
Would it be more efficient to calculate some kind of per-model or per-layer mean, and then store each weight only as its deviation from that mean in standard-deviation units, maybe in fp8 or smaller?
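Something like this rough sketch of the idea, with int8 standing in for "fp8 or smaller" and made-up shapes and clip range:

```python
import numpy as np

# Sketch: store the per-layer mean/std exactly, and each weight only as its
# quantized deviation from the mean, measured in std units.
def encode_layer(w: np.ndarray):
    mu, sigma = w.mean(), w.std()
    z = (w - mu) / sigma                        # deviations in std units
    z = np.clip(z, -4.0, 4.0)                   # assume ~4 sigma covers the tail
    q = np.round(z / 4.0 * 127).astype(np.int8) # 1 byte per weight + 2 scalars/layer
    return mu, sigma, q

def decode_layer(mu, sigma, q):
    return (q.astype(np.float32) / 127.0 * 4.0) * sigma + mu

w = np.random.normal(0.0, 0.02, size=(4096, 4096)).astype(np.float32)
mu, sigma, q = encode_layer(w)
w_hat = decode_layer(mu, sigma, q)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Unlike DFloat11, though, this isn't lossless: rounding the deviations discards mantissa bits, so it amounts to affine quantization with a per-layer mean and scale rather than compression that reproduces the exact BF16 values.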