You articulate your case well, thank you!

I always warn people (particularly junior people) though that blindly dropping duplicates is a dangerous habit because it helps you and others in your organization ignore the causes of bad data quickly without getting them fixed at the source. Over time, that breeds a lot of complexity and inefficiency. And it can easily mask flaws in one's own logic or understanding of the data and its properties.
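As a minimal sketch of that warning in pandas (the table and column names here are hypothetical), one can at least look at what is duplicated before dropping it:

```python
import pandas as pd

# Hypothetical data: order 2 appears twice with identical values.
orders = pd.DataFrame(
    {"order_id": [1, 2, 2, 3], "amount": [10.0, 25.0, 25.0, 7.5]}
)

# keep=False flags every member of a duplicated group, not only the repeats,
# which makes it easier to see where the duplicates came from.
dupes = orders[orders.duplicated(keep=False)]

# The default keep="first" flags only the extra copies, so the sum counts them.
n_extra = int(orders.duplicated().sum())

if n_extra:
    print(f"{n_extra} fully duplicated row(s); worth investigating before dropping:")
    print(dupes)

# Dropping is still available, but now it is a recorded, deliberate step.
deduped = orders.drop_duplicates()
```

Reporting the count somewhere visible is one way to keep the upstream cause from being silently ignored.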

by michaelbarton 3 hours ago

Exactly. It’s not that getting rid of duplicates is bad, it’s that they may be a symptom of something worse, e.g. incorrect aggregation logic.
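One hypothetical way this shows up in pandas is a join that fans out because a lookup table is not actually unique on its key. The sketch below uses made-up tables; `validate` is an existing parameter of `DataFrame.merge`:

```python
import pandas as pd

orders = pd.DataFrame({"customer_id": [1, 2], "amount": [100.0, 50.0]})

# Hypothetical lookup table that was supposed to have one row per customer,
# but customer 1 appears twice.
customers = pd.DataFrame(
    {"customer_id": [1, 1, 2], "region": ["EU", "EU", "US"]}
)

# The join silently duplicates customer 1's order, inflating the total.
joined = orders.merge(customers, on="customer_id")
inflated_total = joined["amount"].sum()  # 250.0, while the true total is 150.0

# Asking merge to check the expected relationship surfaces the real problem
# instead of leaving duplicates to be dropped afterwards.
try:
    orders.merge(customers, on="customer_id", validate="one_to_one")
    fan_out_detected = False
except pd.errors.MergeError:
    fan_out_detected = True
```

Calling `drop_duplicates()` on `joined` would happen to restore the right total here, but it would also remove two genuine orders that share the same customer and amount.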

by DangitBobby 7 hours ago

When I'm in pandas (or was, I don't use it anymore), I'm always downstream of some weird data process that ultimately exported to a CSV, from a team that I know has very lax standards for data wrangling or for whom it is just not a core competency. I agree that duplicates are a smell, but they happen often in the use cases I'm specifically reaching for pandas for.

by getnormality 8 hours ago

Duplicates in source data are almost always a sign of bad data modeling, or of analysts and engineers disregarding a good data model. But I agree that this ubiquitous antipattern that nobody should be doing can still be usefully made concise. There should be a select distinct * operation.

And FWIW I personally hate writing raw SQL. But the problem with the API is not the data operations available, it's the syntax and lack of composability. It's English rather than ALGOL/C-style. Variables and functions, to the extent they exist at all, are second-class, making abstraction high-friction.

by doug_durham 5 hours ago

Duplicates are a sign of reality. Only where you have the resources to have dedicated people clean and organize data do you have well-modeled data. Pandas is a power tool for making sense of real data.

by DangitBobby 7 hours ago

Oooh buddy how's the view from that ivory tower??

But seriously, I'm not always in control of upstream data. I get stuff thrown over to my side of the fence by an organization that just needs data jiggled around for one-off ops purposes. They are communicating with me via CSV files scraped from Excel files in their Shared Drive, that kind of thing.

by getnormality 7 hours ago

Do what you gotta do, but most of my job for the past decade has been replacing data pipelines that randomly duplicate data with pipelines that solve duplication at the source, and my users strongly prefer it.

Of course, a lot of one-off data analysis has no rules except to get a quick answer that no one will complain about!

by DangitBobby 7 hours ago

I updated my OG comment for context. As an org we also help clients come up with pipelines but it's just unrealistic to do a top-down rebuild of their operations to make one-off data exports appeal to my sensibilities.

by getnormality 7 hours ago

I agree, sometimes data comes to you in a state that is beyond the point where rigor is helpful. And for some people that kind of data is most of their job!

by mamcx 6 hours ago

> Duplicates in source data are almost always a sign of bad data modeling

Nope. Duplicates in source data (INPUT) are natural and correct, and MUST be supported, or working with almost any data becomes impossible.

The actual problem is the OUTPUT. Duplicates in the OUTPUT need to be controlled and explicit. In general, the OUTPUT needs rows that are unique by an N-column key, but probably not unique for the rest, so in the relational model you need uniqueness over a combination of columns (rarely over ALL of them).
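A minimal pandas sketch of that idea, assuming a hypothetical output table whose key is `(account_id, day)`; the column names and the "keep the last row" policy are illustrative assumptions, not something stated above:

```python
import pandas as pd

# Hypothetical output table; (account_id, day) is meant to be the key.
out = pd.DataFrame(
    {
        "account_id": [1, 1, 2, 2],
        "day": ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-01"],
        "balance": [10.0, 12.0, 5.0, 6.0],
    }
)
key = ["account_id", "day"]

# Rows that violate uniqueness on the key, even though no full row repeats.
violations = out[out.duplicated(subset=key, keep=False)]

# One explicit resolution policy (an assumption here): keep the last row per key.
resolved = out.drop_duplicates(subset=key, keep="last")

# Make the output contract explicit: the key must be unique after resolution.
assert not resolved.duplicated(subset=key).any()
```

Note that a plain `drop_duplicates()` over all columns would not catch these rows, because the two conflicting rows differ in `balance`.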