I do not disagree with that, but I am not sure what "raw data" means in cases like the ones the article talks about. The 1.700.000 is no more or less raw than 1.700,000. Most probably somebody messed up some decimals somewhere, or somebody imported a CSV into Excel and it misinterpreted the numbers because of different regional settings. Similar to swapped longitude/latitude. That sounds different to me from, say, noisy temperature data from sensors. Rather, it seems more like an issue that arose at the point of merging datasets together, which is already far from the data being raw.
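
As a minimal sketch of how that kind of mess-up can happen (using pandas here purely for illustration; Excel's CSV import behaves analogously depending on regional settings, and the file name and values are made up):

    import io
    import pandas as pd

    # The same CSV text, where "1.700" is meant as one thousand seven hundred
    # in a European locale.
    csv_text = "city,population\nSomewhere,1.700\n"

    # Default (US-style) parsing: "." is read as the decimal separator.
    us = pd.read_csv(io.StringIO(csv_text))
    print(us["population"].iloc[0])   # 1.7

    # European-style parsing: "." as thousands separator, "," as decimal.
    eu = pd.read_csv(io.StringIO(csv_text), thousands=".", decimal=",")
    print(eu["population"].iloc[0])   # 1700

Neither output is more "raw" than the other; the damage happens at import/merge time, not at collection time.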

The issue, imo, is that a person closer to the point where the data was collected or merged is probably better equipped to understand what may be wrong with it than a random person looking at the dataset. So I do not think it is unreasonable to have people in organisations take a second look at the datasets they publish.

When I say "raw", what I'm referring to is preserving the data's chain of custody. If I'm looking at the data with the intent to sue the respective government agency, then I have strong legal reasons to make sure the data has not been modified. If I start from the open data, for example, the agency will have their data person sign an affidavit making this very clear, and I will lose my case basically immediately.

  The issue imo is that a person closer to the point the data was collected or merged is probably better equipped with understanding of what may be wrong with it
You'd think so, but like most other systems, these are often inherited or not well thought out, so the understanding sits outside the organisation and we can't assume expertise within it.