Yes, data can contain subtle errors that are expensive and difficult to find. But the second error in the article was so obvious that a bright 10-year-old would probably have spotted it.
But sometimes the "provenance" of the data is important. I want to know whether I'm getting data straight from some source (even with errors) rather than having some intermediary make fixes that I don't know about.
For example, in the case where they may have flipped the latitude and longitude, I don't want them to just automatically "fix" the data (especially not without disclosing that).
What they need to do is verify the outliers with the original gas station and fix the data from the source. But that's much more expensive.
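As a sketch of what flagging those outliers for verification might look like: assuming a pandas DataFrame with hypothetical "lat" and "lon" columns and a rough UK bounding box (the bounds and names here are my assumptions, not from the article), you can mark the rows that fall outside the box and note when swapping the coordinates would land back inside, without ever touching the published values:

```python
import pandas as pd

UK_LAT = (49.9, 60.9)   # rough UK latitude range (assumption)
UK_LON = (-8.7, 1.8)    # rough UK longitude range (assumption)

def in_uk(lat: float, lon: float) -> bool:
    return UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]

def flag_for_verification(df: pd.DataFrame) -> pd.DataFrame:
    """Mark rows outside the UK and note when swapping lat/lon would
    land back inside (a likely transposition). Never edits the values."""
    out = df.copy()
    out["outside_uk"] = ~out.apply(lambda r: in_uk(r["lat"], r["lon"]), axis=1)
    out["swap_would_fix"] = out["outside_uk"] & out.apply(
        lambda r: in_uk(r["lon"], r["lat"]), axis=1
    )
    return out
```

Rows where swap_would_fix is true go to someone who can check with the station; the dataset itself stays exactly as collected.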
The issue imo is that a person closer to the point where the data was collected or merged is probably better equipped to understand what may be wrong with it than a random person looking into that dataset. So I do not think it is unreasonable to have people in organisations take a second look at the datasets they publish.
The issue imo is that a person closer to the point where the data was collected or merged is probably better equipped to understand what may be wrong with it
You'd think so, but just like most other systems, these ones are often inherited or not well thought out, so the understanding is external and we can't assume expertise within.

Dropping the suspect rows can skew the dataset and lead to misinterpreted results if which rows are wrong is not completely random.
E.g. if all the data from a specific location (or year, etc.) is wrong, then this kind of cleaning would just completely exclude that location, which depending on the context may or may not be a problem. Or if values come out wrong above a specific threshold. Or any other way in which the errors are not randomly distributed.
Removing data is never a neutral choice, and which data is removed should always be taken into consideration.
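A toy illustration of that failure mode, with made-up numbers: if every price from one station happens to be corrupt, an outlier filter silently drops the whole station, and the "cleaned" average describes only part of the population:

```python
import pandas as pd

prices = pd.DataFrame({
    "station": ["a", "a", "b", "b"],
    "price":   [1.40, 1.45, 9999.0, 9999.0],  # every row from "b" is corrupt
})

cleaned = prices[prices["price"] < 100]
print(cleaned["price"].mean())  # 1.425 -- station "b" has vanished entirely
```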
Absolutely. If you have obviously wrong data your choices are generally:
1. Leave the bad data in.
2. Leave the bad data in and flag it as suspect.
3. Omit the bad data.
4. Correct the bad data.
Which is the best choice depends on context and requires judgement. But I find it hard to imagine any situation where option 1 is the right choice.
Obviously the best solution is to do basic validation as the data is entered, so that people can't add a location in the Indian Ocean to a UK dataset. It seems rather negligent that they didn't do this.
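A minimal sketch of that kind of entry-time check, assuming the same rough UK bounding box as above (the function name and bounds are hypothetical):

```python
def validate_uk_location(lat: float, lon: float) -> None:
    """Reject coordinates that cannot be in the UK before they are stored."""
    if not (49.9 <= lat <= 60.9):
        raise ValueError(f"latitude {lat} is outside the UK")
    if not (-8.7 <= lon <= 1.8):
        raise ValueError(f"longitude {lon} is outside the UK")

validate_uk_location(51.5, -0.1)     # London: accepted
# validate_uk_location(-4.6, 72.0)   # Indian Ocean: raises ValueError
```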
If you want something to blame, blame the system that allowed the data to be bad in the first place. You're pointing your finger at the wrong people and it's unreasonable of you to call them negligent.
Messy data is a signal. You're wrong to omit signal.
A better solution is to add a field to indicate that "the row looks funny to the person who published the data". Which, I guess, is useful to someone?
But deleting or changing data effectively corrupts the source data, and now I can't trust it.
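As a sketch of that flag-and-preserve approach (the column names and UK bounds are again my assumptions): append a boolean flag next to the untouched source columns, so consumers can filter if they want and provenance survives:

```python
import pandas as pd

def add_suspect_flag(df: pd.DataFrame) -> pd.DataFrame:
    """Append a boolean 'suspect' column; the published values are
    never modified, so the source data remains trustworthy."""
    out = df.copy()
    out["suspect"] = ~(
        out["lat"].between(49.9, 60.9) & out["lon"].between(-8.7, 1.8)
    )
    return out
```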