How does this work in a production setup? Can this be set up like a server, or is it mostly for individual users to play around with data?

by orthoxerox 14 hours ago

The idea is that you treat data storage and data processing as two distinct tasks. You have your data in S3 or HDFS or a local directory and you run DuckDB on whatever single-node compute you have: a local machine or a container in a cluster.

There are companies that build cluster computing engines with DuckDB as the byte-cruncher at their heart, but usually it's used more like NumPy, Pandas or Polars on steroids. Or like SQLite, but for running OLAP queries.

by DanielHB 11 hours ago

In my previous job (working with electric vehicles) we had an AWS Batch job that pulled all the data from S3[1] into containers (one container per vehicle), pushed that data into DuckDB, and then ran some basic queries and data analysis.

The key thing is that this scaled horizontally pretty much forever. Since each vehicle had a fixed amount of data per year, we could tightly control the performance characteristics of the analysis. Adding more vehicles didn't make things slower, just linearly more expensive.

I vaguely remember the output from those containers also feeding some aggregate analysis (each vehicle's container would output some data that another job consumed to compute aggregates), but I don't remember the specifics.

[1]: I believe we used JSONL or Parquet format, but I didn't work on that part of the stack directly.

by blackoil 14 hours ago

It is an OLAP DB, so you can have a pipeline storing data as Parquet files in S3 and then use DuckDB to query them directly.

by jdw64 14 hours ago

Then it definitely makes sense. Scientists usually handle a lot of CSV files. Thank you.