It does not; the decompression is memory-to-memory, one tensor at a time, so it's worse. They claim less than 200 GB/s on an A100, and their benchmarks suggest it's somewhere between 1.5-4x slower at batch size 1, depending on GPU and model. This overhead of course mostly disappears with a large enough batch size.

Other lossless codecs can hit 600 GB/s on the same hardware, so there should be some room for improvement. But the A100's raw memory bandwidth is 1.6 TB/s.
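To put those three numbers together, here is a toy model of batch-size-1 decode (my own sketch, not from their benchmarks): assume inference is bound by streaming the weights through HBM, every tensor is decompressed each forward pass, and decompression is fully serialized with compute. The serialization assumption is a worst case; real pipelines can overlap decompression with other work, which is presumably why measured slowdowns are lower.

```python
# Toy worst-case model: memory-to-memory decompression serialized with the
# compute pass. All numbers illustrative, not measured.
MEM_BW = 1.6e12     # A100 raw HBM bandwidth, bytes/s
RATIO = 0.7         # compressed size / original size (assumed)

def slowdown(decomp_bw):
    # t_plain = S / MEM_BW                        (stream full weights once)
    # t_comp  = RATIO * S / decomp_bw + S / MEM_BW (decompress, then stream)
    # The ratio t_comp / t_plain is independent of the model size S:
    return 1 + RATIO * MEM_BW / decomp_bw

for bw in (200e9, 600e9):
    print(f"decompression at {bw/1e9:.0f} GB/s -> {slowdown(bw):.1f}x slower")
    # -> 6.6x at 200 GB/s, 2.9x at 600 GB/s
```

Under this worst-case model even a 600 GB/s codec stays well short of free, which is why the overhead only washes out once the batch is large enough to amortize it.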

reply
My mental model says it might, much like DoubleSpace under DOS slightly sped up loading data from slow hard drives.
reply
If the model is 70% of the size, it will be 1/0.7 ≈ 1.43x the speed.
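As a sanity check on that arithmetic (a toy calculation assuming inference is purely memory-bandwidth-bound and decompression is free, which the thread above notes it isn't):

```python
# If decode time is just "bytes of weights streamed / memory bandwidth",
# shrinking the weights shrinks the time proportionally.
compression_ratio = 0.7           # compressed size / original size (assumed)
speedup = 1 / compression_ratio   # fewer bytes to stream -> faster decode
print(f"{speedup:.2f}x")          # -> 1.43x
```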
reply