
by steve_adams_86 15 hours ago

by jkubicek 15 hours ago

What do you use it for? I’m perpetually interested in using DuckDB, but it doesn’t seem to do anything I need.

by medvezhenok 14 hours ago

Basically like a locally hosted Snowflake - it only shines if you have enough data to analyze (100 MB - 100 GB is probably the sweet-spot range - less than that and the benefits are small, more than that and you risk flying too close to the sun with memory usage).

It has connectors for Postgres & other stores, so I find it faster to connect to a Postgres instance and pull all of the data from a table first. Even if the table is like 50GB, if you have 30 cores on the machine it will pull from Postgres using 30 cores in parallel, so it will only take a minute or two. After that, any analytical queries on the data are 10+ times faster in DuckDB than in native Postgres (GROUP BY, regexp_replace, count(distinct...) etc).
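For anyone who wants to try that workflow, here is a minimal sketch in Python. It assumes the `duckdb` package with its `postgres` extension; the connection string, the `public.events` table, and the `user_id`/`session_id` columns are made up for illustration, and `con` is expected to be a connection returned by `duckdb.connect()`.

```python
def postgres_snapshot_sql(dsn: str, table: str, local_name: str) -> list[str]:
    """Statements that copy one Postgres table into a local DuckDB table."""
    return [
        "INSTALL postgres",
        "LOAD postgres",
        # Attach the Postgres database read-only under the alias `pg`.
        f"ATTACH '{dsn}' AS pg (TYPE postgres, READ_ONLY)",
        # Materialize the table in DuckDB so later queries never hit Postgres.
        # Assumes the table lives in the `public` schema.
        f"CREATE OR REPLACE TABLE {local_name} AS SELECT * FROM pg.public.{table}",
    ]


def snapshot_and_count_sessions(con, dsn: str):
    """Pull the hypothetical `events` table, then run an analytical query locally."""
    for statement in postgres_snapshot_sql(dsn, "events", "events_local"):
        con.execute(statement)
    return con.execute(
        """
        SELECT user_id, count(DISTINCT session_id) AS sessions
        FROM events_local
        GROUP BY user_id
        ORDER BY sessions DESC
        LIMIT 10
        """
    ).fetchall()
```

Called as `snapshot_and_count_sessions(duckdb.connect("analytics.duckdb"), "dbname=shop host=localhost")`, this would leave a local copy in `analytics.duckdb` that can be re-queried without going back to Postgres.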

by sceadu 5 hours ago

In my experience it works OK with spilling to disk so I haven't had too much of a concern with memory usage... previously I had issues with it OOM'ing and failing (or maybe this was a skill issue?), but haven't had that happen recently.
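To make the spilling explicit rather than rely on defaults, something like the following caps memory and points DuckDB at a scratch directory. `memory_limit` and `temp_directory` are real DuckDB settings; the 4GB cap and the spill path are arbitrary values picked for illustration, and `con` is assumed to come from `duckdb.connect()`.

```python
def configure_spill(con, memory_limit: str = "4GB",
                    spill_dir: str = "/tmp/duckdb_spill") -> None:
    """Cap DuckDB's memory use and tell it where to write temporary files."""
    # Upper bound on memory before larger-than-memory operators offload to disk.
    con.execute(f"SET memory_limit = '{memory_limit}'")
    # Directory used for the temporary files created while spilling.
    con.execute(f"SET temp_directory = '{spill_dir}'")
```

Not every operation can spill, so this is a way to reduce OOM failures rather than a guarantee against them.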

by orthoxerox 14 hours ago

All kinds of data processing. For example, you download a million rows of metrics and load them in Excel to build pivot tables. It works, but now it's a billion rows. If you know SQL, it's a snap to point DuckDB at the source CSV or JSON and get the results in a second.
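As a rough illustration of pointing DuckDB at a CSV, the sketch below builds a pivot-style aggregation that reads the file in place. The `host` and `latency_ms` columns and the file name are hypothetical; `read_csv_auto` is DuckDB's CSV reader with type inference, and `con` is assumed to be a connection from `duckdb.connect()`.

```python
def slowest_hosts_sql(csv_path: str, limit: int = 5) -> str:
    """SQL that aggregates a metrics CSV in place, with no separate import step."""
    quoted = csv_path.replace("'", "''")  # escape quotes for the SQL string literal
    return f"""
        SELECT host, count(*) AS samples, avg(latency_ms) AS avg_latency_ms
        FROM read_csv_auto('{quoted}')
        GROUP BY host
        ORDER BY avg_latency_ms DESC
        LIMIT {int(limit)}
    """


def slowest_hosts(con, csv_path: str, limit: int = 5):
    """Run the aggregation on a DuckDB connection and return plain tuples."""
    return con.execute(slowest_hosts_sql(csv_path, limit)).fetchall()
```

For example, `slowest_hosts(duckdb.connect(), "metrics.csv")` would return one `(host, samples, avg_latency_ms)` tuple per host, slowest first.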

by skeeter2020 3 hours ago

The taste that hooked me: try it the next time you have a bunch of JSON data, CSVs or other data - local or remote - and someone wants some charts (for me it was "productivity" metrics from Jira combined with a bunch of other stuff). First, it is very easy/fast to load this data; DuckDB has a very liberal parsing engine and good connectors. Second, I used to worry a lot about my table definitions and cleaning data before structuring it. Not anymore! With DuckDB I find myself iteratively transforming data and creating new tables, combining sources, converting columns, slicing/dicing/rotating. It's very easy to "remix" data and there are functions or extensions for everything you might want to do. There's so little friction to get started that I've found it just naturally becomes the multitool in my toolbox.

This will give you some experience and you'll start to see applicable problem spaces for DuckDB in product areas, especially anything with BI or DW.
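A sketch of that load-then-remix loop, under loud assumptions: the JSON export, the table names, and the `issue_key`/`assignee`/`created`/`resolved` fields are invented for illustration and are not the actual Jira schema. `read_json_auto`, `TRY_CAST`, and `date_diff` are DuckDB functions; `con` is assumed to come from `duckdb.connect()`.

```python
def build_issue_tables(con, issues_json: str) -> None:
    """Load a raw export, then derive cleaner tables from it step by step."""
    # Step 1: load the export as-is and let DuckDB infer the column types.
    con.execute(
        f"CREATE OR REPLACE TABLE issues_raw AS "
        f"SELECT * FROM read_json_auto('{issues_json}')"
    )
    # Step 2: convert columns; TRY_CAST yields NULL instead of failing on bad values.
    con.execute(
        """
        CREATE OR REPLACE TABLE issues AS
        SELECT issue_key,
               assignee,
               TRY_CAST(created AS TIMESTAMP) AS created_at,
               TRY_CAST(resolved AS TIMESTAMP) AS resolved_at
        FROM issues_raw
        """
    )
    # Step 3: slice the cleaned table into something chartable.
    con.execute(
        """
        CREATE OR REPLACE TABLE cycle_time AS
        SELECT assignee,
               avg(date_diff('hour', created_at, resolved_at)) AS avg_hours
        FROM issues
        WHERE resolved_at IS NOT NULL
        GROUP BY assignee
        """
    )
```

Because every step is a `CREATE OR REPLACE TABLE`, any one of them can be edited and rerun without starting over, which is most of what makes the iteration cheap.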

by steve_adams_86 14 hours ago

The most interesting use case lately has been using it as the transformation and validation engine for a CLI that handles scientific data. Some datasets are small and could have been handled at the application layer, but some are quite massive (especially genomic data). DuckDB bundles with the CLI and travels around any platform, is super lightweight, allows for easily running in CI, on a user’s machine, against datasets of all sizes, and so on.

There are other embeddable options out there but I found DuckDB fit better for the potentially massive datasets, and also because of how naturally it ingests the types of data we work with, some of its unique features, and how trivial it was to learn and integrate with the project.

Otherwise I use it almost daily for doing guardrailed data exploration with LLMs. I prefer SQL over random DSLs in AWS or Sentry or what have you. I’ll ingest the data I need and just run SQL against it. I mentioned in another comment that I’ll tend to store more useful data (especially data I export routinely, like infra cost reports) on S3 and use a Rill instance to do basic exploration in a GUI (it will query remote parquet files).
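Querying Parquet on S3 directly from DuckDB might look roughly like the sketch below. The bucket path and the `service`/`usage_date`/`cost_usd` columns are hypothetical stand-ins for a cost report, S3 credentials are assumed to be configured separately, and `con` is assumed to come from `duckdb.connect()`. The `httpfs` extension is what DuckDB uses for `s3://` paths.

```python
def monthly_cost_sql(parquet_glob: str) -> list[str]:
    """Setup statements plus one query over remote Parquet files."""
    return [
        "INSTALL httpfs",
        "LOAD httpfs",
        f"""
        SELECT service,
               date_trunc('month', usage_date) AS month,
               sum(cost_usd) AS total_usd
        FROM read_parquet('{parquet_glob}')
        GROUP BY ALL
        ORDER BY month, total_usd DESC
        """,
    ]


def monthly_costs(con, parquet_glob: str):
    """Run the setup statements, then return the aggregated rows."""
    *setup, query = monthly_cost_sql(parquet_glob)
    for statement in setup:
        con.execute(statement)
    return con.execute(query).fetchall()
```

A call such as `monthly_costs(duckdb.connect(), "s3://my-bucket/cost-reports/*.parquet")` (bucket name invented) would scan the files where they sit, without a download step.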

by raihansaputra 9 hours ago

throwing in my 2 cents: It just replaced pandas for me. It's just so much easier to write sql against csv/json/whatever format data in jupyter/marimo notebooks through duckdb rather than reasoning through pandas. SQL is far more natural for me, and agents also work through it easily.

by skeeter2020 3 hours ago

really learning SQL (syntax, boolean logic, how queries are broken down, etc) way back in uni has been the single biggest pay-off of my entire career.

by wiredfool 8 hours ago

Few different use cases, other than just a general swiss army knife for vaguely tabular data.

* fastapi + duckdb + parquet for the backend for a relatively high profile website

* wasm duckdb + react for a few visualization websites

* yaml driven ETL from lots of sources, principally ugly spreadsheets, into usable data. More T than E or L really

by edweis 14 hours ago

I personally find it useful to search logs with AI

by steve_adams_86 14 hours ago

Yes, it’s amazing for giving rails and structure to data so you can be sure an LLM is making more sense than it might with grep and jq. It also allows a little more sanity at scale with jobs like this. You can get pretty crazy with parquet in S3 with an engine like duckdb. And it’s dirt cheap to keep that stuff hanging around for future reference and sanity checking your understanding of things.

For data I reference frequently, especially data I know will grow over time, I’ve started using Rill because it makes ad-hoc exploration very smooth and low-friction.

My process tends to be something like:

1. Explore logs or some other at least somewhat structured dataset

2. Use Claude to find useful patterns and determine how I might benefit from this data in ways I wasn’t yet aware of

3. See how often it’s useful for decision making

4. If it’s frequently useful, formalize it as a view in my Rill instance and refine the models to maximize their utility

by hilariously 8 hours ago

Honestly, as someone who's super SQL-focused and spends less time on Python, I can just write generic SQL to transform things in memory and do whatever I want. It's very helpful for that.