upvote
I guess I have pandas brain, because I definitely want to drop duplicates: 100% of the time I'm worried about duplicates, and 99% of the time the only thing I want to do with duplicates is drop them. When you've got 19 columns it's _really fucking annoying_ if the tool you're using doesn't have an obvious way to say `select distinct on () from my_shit`. A close second, at say 98% of the time, is wanting a count of duplicates as a sanity check, because I know to expect a certain amount of them. Pandas makes that easy too, in a way SQL makes really fucking annoying. There are a lot of parts of pandas that made me stop using it long ago, but first-class duplicate handling is not among them.

And the API is vastly superior to SQL in some respects from a user perspective, despite being all over the place in others. Dataframe select/filtering, e.g. `df = df[df.duplicated(keep='last')]`, is simple, expressive, obvious, and doesn't result in bleeding fingers. The main problem is the rest of the language around it: all the indentation, newlines, loops, functions and so on can be too terse or too dense, and much harder to read than SQL.
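For anyone following along, a quick sketch of the pandas side of this (toy data, purely illustrative). Note that `df.duplicated(keep='last')` marks everything *except* the last occurrence, so deduping while keeping the last copy means negating the mask:

```python
import pandas as pd

# Toy frame with one exact duplicate row (made-up data for illustration).
df = pd.DataFrame({"id": [1, 1, 2, 3], "val": ["a", "a", "b", "c"]})

# The "select distinct *" equivalent: drop exact duplicate rows.
deduped = df.drop_duplicates()

# The sanity check: count rows that duplicate an earlier row.
n_dupes = int(df.duplicated().sum())

# Keep only the last occurrence of each duplicate group.
last_only = df[~df.duplicated(keep="last")]
```

One line each for the three things the comment describes, with no `GROUP BY` boilerplate in sight.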

reply
Duplicates in source data are almost always a sign of bad data modeling, or of analysts and engineers disregarding a good data model. But I agree that this ubiquitous antipattern that nobody should be doing can still be usefully made concise. There should be a `select distinct *` operation.

And FWIW I personally hate writing raw SQL. But the problem with the API is not the data operations available, it's the syntax and lack of composability. It's English rather than ALGOL/C-style. Variables and functions, to the extent they exist at all, are second-class, making abstraction high-friction.
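Concretely, the ergonomics gap the thread is arguing about looks something like this (sqlite3, toy table): the distinct part is one keyword, while the duplicate-count sanity check is where SQL gets verbose.

```python
import sqlite3

# In-memory table with one duplicate row (illustrative data).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, val TEXT)")
con.executemany("INSERT INTO t VALUES (?, ?)", [(1, "a"), (1, "a"), (2, "b")])

# SQL's "drop duplicates": DISTINCT over every column at once.
distinct_rows = con.execute("SELECT DISTINCT * FROM t").fetchall()

# The duplicate *count* is clumsier: group on all columns, sum the extras.
(extra,) = con.execute(
    "SELECT COALESCE(SUM(n - 1), 0) FROM "
    "(SELECT COUNT(*) AS n FROM t GROUP BY id, val)"
).fetchone()
```

With 19 columns, that `GROUP BY id, val` becomes a 19-column list, which is the "bleeding fingers" part.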

reply
Oooh buddy how's the view from that ivory tower??

But seriously, I'm not always in control of upstream data. I get stuff thrown over to my side of the fence by an organization that just needs data jiggled around for one-off ops purposes. They're communicating with me via CSV files scraped from Excel files in their Shared Drive, kind of thing.

reply
Do what you gotta do, but most of my job for the past decade has been replacing data pipelines that randomly duplicate data with pipelines that solve duplication at the source, and my users strongly prefer it.

Of course, a lot of one-off data analysis has no rule other than "get a quick answer that no one will complain about"!

reply
I updated my OG comment for context. As an org we also help clients come up with pipelines but it's just unrealistic to do a top-down rebuild of their operations to make one-off data exports appeal to my sensibilities.
reply
I agree, sometimes data comes to you in a state that is beyond the point where rigor is helpful. And for some people that kind of data is most of their job!
reply
Duplicates are a sign of reality. Only where you have the resources to have dedicated people clean and organize data do you have well modeled data. Pandas is a power tool for making sense of real data.
reply
> Duplicates in source data are almost always a sign of bad data modeling

Nope. Duplicates in source data (INPUT) are natural and correct, and MUST be supported, or working with almost all data becomes impossible.

The actual problem is the OUTPUT. Duplicates in the OUTPUT need to be controlled and explicit. In general, we need each OUTPUT row to be unique by an N-key, but we probably don't need it to be unique across the rest of the columns. So, in the relational model, you want uniqueness over a combination of columns (rarely over ALL of them).
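A sketch of that INPUT/OUTPUT split (sqlite3, all table and column names hypothetical): duplicates flow freely into the raw input table, and uniqueness is enforced explicitly, and only on the key columns, at the output.

```python
import sqlite3

con = sqlite3.connect(":memory:")

# INPUT: raw events, duplicates are natural and allowed.
con.execute("CREATE TABLE raw_events (user_id INTEGER, day TEXT, amount REAL)")
con.executemany(
    "INSERT INTO raw_events VALUES (?, ?, ?)",
    [(1, "2024-01-01", 5.0), (1, "2024-01-01", 5.0), (2, "2024-01-01", 3.0)],
)

# OUTPUT: uniqueness declared explicitly on the N-key (user_id, day),
# not on every column.
con.execute(
    "CREATE TABLE daily_totals ("
    "user_id INTEGER, day TEXT, total REAL, "
    "UNIQUE (user_id, day))"
)
con.execute(
    "INSERT INTO daily_totals "
    "SELECT user_id, day, SUM(amount) FROM raw_events GROUP BY user_id, day"
)
rows = con.execute("SELECT * FROM daily_totals ORDER BY user_id").fetchall()
```

The `UNIQUE` constraint makes the output contract explicit: any pipeline change that starts producing duplicate keys fails at insert time instead of silently fanning out downstream.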

reply
You articulate your case well, thank you!

I always warn people (particularly junior people) though that blindly dropping duplicates is a dangerous habit because it helps you and others in your organization ignore the causes of bad data quickly without getting them fixed at the source. Over time, that breeds a lot of complexity and inefficiency. And it can easily mask flaws in one's own logic or understanding of the data and its properties.
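One habit that helps with that (a sketch, with made-up data): assert the expected duplicate count before dropping, so an upstream change fails loudly instead of being silently swallowed.

```python
import pandas as pd

# Hypothetical export where each record legitimately appears twice
# (say, once per source system), so we *expect* exactly 50% duplicates.
df = pd.DataFrame({"id": [1, 1, 2, 2], "val": ["a", "a", "b", "b"]})

n_dupes = int(df.duplicated().sum())
expected = len(df) // 2

# Fail loudly instead of silently dropping: a surprising count means the
# upstream process changed, which is worth chasing back to the source.
assert n_dupes == expected, f"expected {expected} duplicates, got {n_dupes}"

clean = df.drop_duplicates()
```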

reply
When I'm in pandas (or was, I don't use it anymore) I'm always downstream of some weird data process that ultimately exported to a CSV from a team that I know has very lax standards for data wrangling, or it is just not their core competency. I agree that duplicates are a smell but they happen often in the use-cases that I'm specifically reaching to pandas for.
reply
Exactly. It's not that getting rid of duplicates is bad, it's that they may be a symptom of something worse, e.g. incorrect aggregation logic.
reply
On reflection I think it's possible I may have missed the potential positive value of the post a bit. Maybe analyzing pandas gets you down to a set of data frame primitives that is helpful to build any API. Maybe the API you start with doesn't matter. I don't know. When somebody works hard to make something original, you should try to see the value in it, even if the approach is not one you would expect to be helpful.

I stand by my warnings against using pandas as a foundation for thinking about tabular data manipulation APIs, but maybe the work has value regardless.

reply
> There's no reason to invent a completely new API for them

Yes there is: SQL is one of many possible ways to interact with tabular data, so why should it be the only one? R data frames literally pioneered an alternative API. dplyr is fantastic for many reasons, one of them being that people like the verb-based approach.

Furthermore, I'd argue that dplyr is not particularly similar to SQL in how you actually use it or in how it's actually interpreted/executed.

As for the rest I feel like you're just stating your preferences as fact.

reply
> just by porting SQL to your language

You make it sound like writing an SQL parser and query engine is a trivial task. Have you ever looked at the implementation of a query engine to see what’s actually involved? You can’t just ‘build on SQL’, you have to build a substantial library of functions to build SQL on top of.

reply
Also it's not like dplyr is anything close to a "port" of SQL. You could in theory collect dplyr verbs and compile them to SQL, sure. That's what ORMs typically do, and what the Spark API does (and its descendants such as Polars).

"Porting" SQL to your language usually means inventing a new API for relational and/or tabular data access that feels ergonomic in the host language, and then either compiling it to SQL or executing it in some kind of array processing backend, or DataFusion if you're fancy like that.

reply
dplyr straightforwardly transpiles to SQL through the dbplyr package, so it's semantically pretty close to a port, even though the syntax is a bit different (better).
reply
Amen.

The author takes the 4 operations below and discusses some 3-operation thing from category theory. Not worth it, and not as clear as dplyr.

> But I kept looking at the relational operators in that table (PROJECTION, RENAME, GROUPBY, JOIN) and thinking: these feel related. They all change the schema of the dataframe. Is there a deeper relationship?

reply
I couldn’t agree more. But at the same time I try to stay quiet about it because SQL is the diamond in the rough that 95% of engineers toss into the trash. And I want minimal competition in a tight job market.
reply
"The only tool I'm willing to use is a hammer, and by god I'll turn everything into nails."
reply
SQL only works on well defined data sets that obey relational calculus rules. Pandas is a power tool for dealing with data as you find it. Without Pandas you are stuck with tools like Excel.
reply