Floating point is just an inefficient use of bits (its dynamic range is excessive for what weights actually need), especially during training, so compression will always be welcome there. Extreme quantization techniques (some of the <=4-bit methods, say) also tend to increase the entropy of the weights, which limits what lossless compression can do on top, so lossless and lossy compression (e.g., quantization) sometimes work against each other.
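A minimal sketch of that tension, not from the thread: the toy Gaussian "weights" and the crude uniform 4-bit quantizer below are assumptions for illustration. It compares how much zlib can losslessly squeeze out of raw float16 weights versus the same weights after 4-bit quantization; exact ratios will vary with the quantizer, but the quantized stream typically carries more entropy per stored bit and so compresses proportionally less.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Toy stand-in for model weights: bell-shaped, like most trained layers.
w = rng.normal(0, 1, 1_000_000).astype(np.float16)

# Lossless compression of raw float16: the sign/exponent bits are highly
# redundant for bell-shaped data, so zlib finds something to remove.
raw = w.tobytes()
print("fp16 :", len(raw), "->", len(zlib.compress(raw, 9)))

# Crude 4-bit quantization (assumed here, not any specific method):
# clip to roughly +/-4 sigma, map to 16 uniform levels, pack 2 per byte.
q = np.clip(np.round((w / 4.0 + 0.5) * 15), 0, 15).astype(np.uint8)
packed = (q[0::2] << 4) | q[1::2]
b = packed.tobytes()
print("4-bit:", len(b), "->", len(zlib.compress(b, 9)))
```

Comparing each output's compressed/original ratio shows the quantized representation is already closer to incompressible, which is the sense in which lossy quantization eats into lossless compression's headroom.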
If you have billions of dollars in inference devices, even reducing the number of devices you need for a given workload by 5% is very useful.
MI300X is 192GB HBM3, MI325X is 256GB HBM3e, MI355X should be 288GB HBM3e (and support FP4/FP6).
As long as AMD fixes the damn driver issues I've seen for over a decade.
Nvidia is about to release Blackwell Ultra with 288GB. Go back to maybe 2018 and the max was 16GB, if memory serves.
DeepSeek recently released a ~670GB model. A couple of years ago, Falcon's 180GB seemed huge.
We've been stuck with the same general caps on standard GPU memory since then, though, perhaps in part because generational upgrades have gone into memory bandwidth rather than capacity.
A one-time, effective ~30% reduction in model size simply isn't going to be some massive unlocker, in theory or in practice.
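A back-of-the-envelope sketch of why; the model and device sizes are illustrative assumptions borrowed from the numbers upthread, not vendor specs:

```python
import math

model_gb = 670   # e.g., a DeepSeek-V3-class checkpoint
device_gb = 192  # e.g., an MI300X-class accelerator

for reduction in (0.0, 0.30):
    size = model_gb * (1 - reduction)
    devices = math.ceil(size / device_gb)
    print(f"{reduction:.0%} smaller -> {size:.0f}GB -> {devices} devices")
# 0% -> 670GB -> 4 devices; 30% -> 469GB -> 3 devices. Helpful, but a
# one-time constant-factor saving, not a step change -- and in practice
# activations and KV cache eat into device memory on top of the weights.
```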