undefined

points

[-]

Duplicates are a sign of reality. Only where you have the resources to have dedicated people clean and organize data do you have well modeled data. Pandas is a power tool for making sense of real data.

by DangitBobby7 hours ago|

prev|

[-]

Oooh buddy how's the view from that ivory tower??

But seriously I'm not in always in control of upstream data, I get stuff thrown over to my side of the fence by an organization who just needs data jiggled around for one-off ops purposes. They are communicating to me via CSV file scraped from Excel files in their Shared Drive, kind of thing.

by getnormality7 hours ago|

parent|

[-]

Do what you gotta do, but most of my job for the past decade has been replacing data pipelines that randomly duplicate data with pipelines that solve duplication at the source, and my users strongly prefer it.

Of course, a lot of one-off data analysis has no rules but get a quick answer that no one will complain about!

by DangitBobby7 hours ago|

parent|

[-]

I updated my OG comment for context. As an org we also help clients come up with pipelines but it's just unrealistic to do a top-down rebuild of their operations to make one-off data exports appeal to my sensibilities.

by getnormality7 hours ago|

parent|

[-]

I agree, sometimes data comes to you in a state that is beyond the point where rigor is helpful. And for some people that kind of data is most of their job!

by mamcx6 hours ago|

prev|

[-]

> Duplicates in source data are almost always a sign of bad data modeling

Nope. Duplicates in source data(INPUT) is natural, correct and MUST be supported or almost all data become impossible.

What is the actual problem is the OUTPUT. Duplicates on the OUTPUT need to be controlled and explicit. In general, we need in the OUTPUT a unique rowby a N-key, but probably not need it to be unique for the rest, so, in the relational model, you need unique for a combination of columns (rarely, by ALL of them).