
You can, and it works even better if you store small "headers" that tell you those offsets. Their design doesn't seem very amenable to this because it appears to be a single file, which is why a system that actually intends to scale would break things up. You then cache these headers and, on a cache hit, you know "the thing I want is in that chunk of the file, grab it". Throw in Bloom filters and now you have a query engine.

Works great for Parquet.
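
To make the idea concrete, here is a minimal sketch, not anyone's actual design. It assumes the object is split into chunks, each described by a small index entry (offset, length, Bloom filter) that the client keeps cached. All names (`Bloom`, `build_object`, `range_get`, `lookup`) are hypothetical, and `range_get` is an in-memory stand-in for an HTTP GET with a `Range` header.

```python
import hashlib
import json


class Bloom:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.field = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.field |= 1 << p

    def might_contain(self, key):
        return all((self.field >> p) & 1 for p in self._positions(key))


def build_object(chunks):
    """Concatenate chunks into one blob and return (blob, header index)."""
    blob, index = b"", []
    for rows in chunks:
        payload = json.dumps(rows).encode()
        bloom = Bloom()
        for key in rows:
            bloom.add(key)
        index.append({"offset": len(blob), "length": len(payload), "bloom": bloom})
        blob += payload
    return blob, index


def range_get(blob, offset, length):
    """Stand-in for a GET with `Range: bytes=offset-(offset+length-1)`."""
    return blob[offset:offset + length]


def lookup(blob, cached_index, key):
    """Use the cached index to read only chunks that might hold `key`."""
    fetched = 0
    for entry in cached_index:
        if not entry["bloom"].might_contain(key):
            continue  # definitely not in this chunk, so skip the read
        rows = json.loads(range_get(blob, entry["offset"], entry["length"]))
        fetched += 1
        if key in rows:
            return rows[key], fetched
    return None, fetched


# Hypothetical data: two chunks of key/value rows.
blob, index = build_object([{"a": 1, "b": 2}, {"c": 3}])
value, reads = lookup(blob, index, "c")
```

The index is the only thing that has to stay hot. A lookup then costs one ranged read per chunk whose filter says "maybe", instead of a scan of the whole object.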

by Sirupsen 8 hours ago

Yep! Beyond random reads (p99 of roughly 200 ms on larger ranges), range requests are essential for getting good download performance on a single file. A single (range) request can "only" drive ~500 MB/s, so you need multiple requests at different offsets in flight at once.

https://github.com/sirupsen/napkin-math
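
A rough sketch of the "multiple offsets" approach, under the assumption that you have some `fetch_range(start, end)` callable that performs one ranged GET (inclusive byte bounds, as in the HTTP `Range` header). The function names here are made up for illustration, and `fake_fetch` just slices an in-memory buffer so the example runs without a network.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(size, part_size):
    """Return inclusive (start, end) byte ranges covering [0, size)."""
    return [(start, min(start + part_size, size) - 1)
            for start in range(0, size, part_size)]


def parallel_download(fetch_range, size, part_size, workers=8):
    """Fetch all ranges concurrently and reassemble them in order."""
    ranges = split_ranges(size, part_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, so parts line up.
        parts = pool.map(lambda r: fetch_range(*r), ranges)
        return b"".join(parts)


# In-memory stand-in for the remote object and its ranged GET.
blob = bytes(range(256)) * 1000


def fake_fetch(start, end):
    return blob[start:end + 1]


downloaded = parallel_download(fake_fetch, len(blob), part_size=64_000)
```

Napkin math with the figure above: if one request tops out around 500 MB/s, eight concurrent ranges could in principle approach 4 GB/s, assuming the NIC, CPU, and the storage side don't become the bottleneck first.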

by UltraSane 10 hours ago

Amazon S3 Select enables SQL queries directly on CSV, JSON, or Apache Parquet objects, allowing retrieval of filtered data subsets to reduce latency and costs.

by staticassertion 9 hours ago

S3 Select is, very sadly, deprecated. It also supported HTTP RANGE headers! But they've killed it and I'll never forgive them :)

Still, it's nbd. You can cache a billion Parquet headers/footers on disk or in memory and get 90% of the performance (or better tbh).
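
For anyone curious what the footer caching looks like, here is a hedged sketch. It relies on the Parquet layout where a file ends with the footer metadata, then a 4-byte little-endian footer length, then the magic bytes `PAR1`. It only fetches and caches the raw footer bytes and does not decode the Thrift metadata. `fetch_range`, `cached_footer`, the cache key shape, and the fake file are assumptions for illustration.

```python
import struct

MAGIC = b"PAR1"
_footer_cache = {}  # (object key, version tag) -> raw footer bytes


def read_footer(fetch_range, size):
    """Fetch the raw footer with two ranged reads: 8-byte tail, then footer."""
    tail = fetch_range(size - 8, size - 1)
    if tail[4:] != MAGIC:
        raise ValueError("not a Parquet file: missing trailing magic")
    (footer_len,) = struct.unpack("<I", tail[:4])
    start = size - 8 - footer_len
    return fetch_range(start, start + footer_len - 1)


def cached_footer(key, version, fetch_range, size):
    """Return the footer from cache, fetching it only on a miss."""
    cache_key = (key, version)
    if cache_key not in _footer_cache:
        _footer_cache[cache_key] = read_footer(fetch_range, size)
    return _footer_cache[cache_key]


# Fake file with Parquet-style framing; the footer body is a placeholder.
footer = b"fake-thrift-metadata"
blob = MAGIC + b"row-group-bytes" * 10 + footer + struct.pack("<I", len(footer)) + MAGIC
calls = []


def fake_fetch(start, end):
    calls.append((start, end))
    return blob[start:end + 1]


first = cached_footer("data.parquet", "v1", fake_fetch, len(blob))
second = cached_footer("data.parquet", "v1", fake_fetch, len(blob))
```

The second call never touches storage. Once the footer is cached, a query can go straight to ranged reads for the row groups and column chunks it needs.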

by UltraSane 6 hours ago

Wow, I didn't know that. To be fair, now that S3 Tables exists, it is rather redundant.