
Sorting, compression algorithm and level, and data types can all have an impact. I noted elsewhere that a Boolean is getting represented as an integer. That's one bit vs 1-4 bytes.
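A rough way to see these effects, as a stdlib-only sketch rather than anything Parquet-specific: `zlib` stands in for the real codec, and the column contents, sizes, and the `pack_int32` helper are made up for illustration.

```python
import random
import struct
import zlib

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical column: 100,000 small integers (say, a score field).
values = [random.randint(0, 500) for _ in range(100_000)]


def pack_int32(column):
    """Serialize a column as little-endian 4-byte integers."""
    return struct.pack(f"<{len(column)}i", *column)


unsorted_raw = pack_int32(values)
sorted_raw = pack_int32(sorted(values))

# Same values, different order: sorting creates long runs of equal values.
unsorted_size = len(zlib.compress(unsorted_raw, 6))
sorted_size = len(zlib.compress(sorted_raw, 6))

# Same bytes, different compression level (1 = fastest, 9 = smallest).
fast_size = len(zlib.compress(unsorted_raw, 1))
best_size = len(zlib.compress(unsorted_raw, 9))

# A boolean flag stored as a 4-byte integer vs. packed 8 flags per byte.
flags = [random.random() < 0.5 for _ in range(100_000)]
flags_as_int32 = pack_int32([int(f) for f in flags])
flags_as_bits = bytes(
    sum(bit << i for i, bit in enumerate(flags[j:j + 8]))
    for j in range(0, len(flags), 8)
)

print("unsorted, level 6:", unsorted_size, "bytes")
print("sorted,   level 6:", sorted_size, "bytes")
print("unsorted, level 1:", fast_size, "bytes")
print("unsorted, level 9:", best_size, "bytes")
print("flags as int32:   ", len(flags_as_int32), "bytes before compression")
print("flags as bits:    ", len(flags_as_bits), "bytes before compression")
```

A real columnar writer adds dictionary and run-length encoding on top of the codec, so the absolute numbers will differ; the sketch only shows the direction of each effect.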

There is also flexibility in what you define as the dataset. Skinnier, more focused tables could save space compared with one wide table that covers everything, which will probably break up compressible runs of data.
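One way to read the point about runs, sketched with made-up data: a wide table can only be sorted one way, so columns unrelated to the sort key lose their runs, while narrow tables can each be sorted for their own column. The two-column table and the `run_count` helper here are assumptions for illustration, not anything from the dataset being discussed.

```python
import random
from itertools import groupby

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical wide table: two independent low-cardinality columns per row.
rows = [(random.randint(0, 9), random.randint(0, 9)) for _ in range(10_000)]


def run_count(column):
    """Number of runs a run-length encoder would have to store."""
    return sum(1 for _ in groupby(column))


# Wide table: a single sort order (by the first column) serves every column.
wide = sorted(rows, key=lambda r: r[0])
wide_runs_a = run_count(r[0] for r in wide)
wide_runs_b = run_count(r[1] for r in wide)

# Two narrow tables: each one is sorted for its own column.
narrow_runs_a = run_count(sorted(r[0] for r in rows))
narrow_runs_b = run_count(sorted(r[1] for r in rows))

print("wide table,   column a runs:", wide_runs_a)
print("wide table,   column b runs:", wide_runs_b)
print("narrow table, column a runs:", narrow_runs_a)
print("narrow table, column b runs:", narrow_runs_b)
```

The catch is that the narrow tables no longer line up row by row, so in practice each would need to carry a key to join back on, and that key costs space too.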

by xnx 4 hours ago

Parquet has a few compression options. Not sure which one they are using.

by hirako2000 4 hours ago

Plus Parquet isn't the least wasteful format; native DuckDB, for instance, compacts better. That's not just down to the compression algorithm, for which, as you say, Parquet has three main options.

by boznz 1 hour ago

... and remove all the political shit-slop since COVID/AI and it's probably under a gig.