It's exceptionally difficult to avoid the data being de-anonymised.

If an 'anonymised' medical record says the person was born 6th September 1969, received treatment for a broken arm on 1 April 2004, and received a course of treatment in 2009 after catching the clap on holiday in Thailand - that's enough bits of information to uniquely identify me.
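A rough back-of-envelope sketch of that "enough bits" argument: each fact rules out a fraction of the population, and once the accumulated bits exceed log2(population) the combination is likely unique. The probabilities below are illustrative guesses, not real incidence statistics.

```python
import math

# Illustrative sketch: how many bits of identifying information do a few
# medical facts carry against a population of ~67 million (roughly the UK)?
# All probabilities are assumed for illustration, not real statistics.
population = 67_000_000
bits_to_identify = math.log2(population)  # ~26 bits pins down one person

facts = {
    # fact: guessed probability that a random person matches it
    "born 6 Sep 1969": 1 / (365 * 80),              # exact birth date
    "broken arm treated 1 Apr 2004": 1 / 100_000,   # guessed incidence
    "STI treatment course in 2009": 1 / 500,        # guessed incidence
}

bits = sum(-math.log2(p) for p in facts.values())
print(f"{bits:.1f} bits of information vs {bits_to_identify:.1f} needed")
```

Even with generous estimates, three dated medical events comfortably clear the ~26-bit threshold.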

And medical researchers are usually very big on 'fully informed consent', so they can't gloss over that reality, hide it in fine print, or obfuscate it with flowery language. They usually have to make sure the participants really understand what they're agreeing to.

It might still work out fine, of course - 95% of people's medical histories don't contain anything particularly embarrassing, so you might be able to get plenty of participants anyway.

reply
... received a course of treatment in 2009 after catching the clap on holiday in Thailand

Yeah, sorry about that

reply
In my experience with health data, the dates are usually offset by a random but constant amount for each person (e.g. id 12345 will have all their dates shifted by +5 weeks) to avoid identification by dates.

Unfortunately the sequence of treatments and locations are usually enough to identify someone, especially if it's a rarer condition.
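A minimal sketch of that per-patient constant shift, assuming the offset is derived from a keyed hash of the patient ID (real pipelines may instead store offsets in a lookup table; the key and ID here are hypothetical):

```python
import datetime
import hashlib

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key

def date_offset(patient_id: str, max_weeks: int = 26) -> datetime.timedelta:
    """Deterministic per-patient shift in [-max_weeks, +max_weeks] weeks."""
    digest = hashlib.sha256(SECRET_KEY + patient_id.encode()).digest()
    weeks = int.from_bytes(digest[:4], "big") % (2 * max_weeks + 1) - max_weeks
    return datetime.timedelta(weeks=weeks)

def shift(patient_id: str, d: datetime.date) -> datetime.date:
    """Apply the same constant offset to every date for this patient."""
    return d + date_offset(patient_id)
```

Note that because every date for a given patient moves by the same amount, the intervals between their treatments are preserved exactly - which is precisely why the sequence of treatments can still re-identify someone.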

reply
Location data is very readily available, so you can easily correlate visits to a health facility with a treatment, and even with an offset, you can probably uniquely identify someone with 4 visits depending on the size of the medical facility.
reply
I had access to several health datasets for my research in the past. Date of birth was rarely given, especially for the bigger projects, where there were more resources to allocate to privacy protection. Nor were date of death, location, or individual visits to a facility for treatment. Typically the relevant variables are age (in years), treatment type and possibly number of cycles. That's probably insufficient to identify someone without access to hospital records - but if you have those, you have all of this data anyway.

Most researchers likely would want to summarize these data in a similar way anyway, so this works out nicely.

reply
The people who agreed to contribute their biodata did not consent to that.

If you want such a project, you need to start a new one with a different agreement. I doubt you could get as many volunteers to freely give away such intimate data to anyone who wants it, though.

reply
You mean giving anyone access to the data? Or open sourcing the code? If the latter, I think that's generally good practice: security through obscurity is never good for public infrastructure. In this case, UK Biobank has now switched to a remote access platform (not particularly secure, as the data was found for sale on Alibaba today), contracting it out to DNAnexus and Amazon. Private companies have no incentive to open source data unless mandated to do so.

In the EU, there is a bigger interest in building scalable but also secure platforms for health data. Hopefully good innovation will come from there.

reply
One of the most important "cons" is that without controls, fewer people will allow their data to be included in the datasets.
reply
That's a very important point. The people who opt out first are typically not a random fraction of the population, which makes it much harder to do anything with the resulting datasets: it becomes very hard to know whether your analyses are representative of the population or not.
reply
This is why it was such a big deal when that researcher at Cleveland State misappropriated UKBB data for a race-science study with Emil Kirkegaard. After he was fired, people on Twitter were all like "this is just suppression of science", but the reality is that what they did, contravening UKBB rules, constituted potentially an existential threat to the whole program.
reply
'Anonymisation' schemes are a little like encryption, in that they just get monotonically weaker over time as people work out attacks - but the consequences of the attacks tend to be much worse. I work in academic open data publishing, and the Netflix Prize de-anonymisation (https://arxiv.org/abs/cs/0610105) hangs over our heads.

But what this illustrates to me is that researchers are often just really careless, despite everything we make them agree to in data transfer agreements. It seems absurd to need little cubicles like this https://safepodnetwork.ac.uk/ (think Mission Impossible 1), but I do despair.

reply
They need to sell the data to fund the project
reply
This.
reply
Hard to do. The only parties with the required collection and tracking infrastructure are infinitely sue-able, so you need legal protection in case anything goes wrong.
reply
Really don't think this is any issue given the post we are commenting on...
reply