Monday, April 12, 2010

Anonymizing patient records for genomics

This article in the journal Nature points to a nice risk analysis and mitigation plan that gives researchers access to genetic information together with the diagnosis codes known for the patient. They have even added a mitigation, using minimum thresholds and grouping, to ensure that small populations do not show up in the diagnosis code pools.
To solve this problem, the new method allows researchers to set two parameters: the minimum number of patients (k) that should have the same set of codes, and a 'utility policy' which specifies how codes should be linked in the anonymized data. More
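
To make the k parameter concrete, here is a minimal sketch of the kind of check it implies: count how many patients share each exact set of diagnosis codes and flag any set shared by fewer than k patients. This is not the method from the article, and it ignores the utility policy entirely; the function name small_groups, the records list, and the code strings D1, D2, D3 are all made up for illustration.

```python
from collections import Counter


def small_groups(records, k):
    """Return each exact code set shared by fewer than k patients, with its count."""
    # frozenset makes each patient's code set hashable so it can be counted.
    counts = Counter(frozenset(codes) for codes in records)
    return {codes: n for codes, n in counts.items() if n < k}


# Hypothetical patients, one set of placeholder diagnosis codes per patient.
records = [
    {"D1", "D2"},
    {"D1", "D2"},
    {"D1", "D2"},
    {"D3"},
]

# With k=3, the lone patient with code D3 is the only group that is too small.
risky = small_groups(records, k=3)
```

Any set that comes back from a check like this would need to be grouped with other codes or suppressed before release, which is roughly the role the thresholds and grouping play in the description above.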

I really like the approach taken, as it looks at what minimal information is desired and determines through a risk assessment how to achieve that goal. From my read, they realized that they simply needed the known diagnosis values; they didn't need demographics or other indirect identifiers. At least that is all they say they are taking in the article.

I like this approach because it nicely follows the approach that I outlined in De-Identification is highly contextual. I hope that the ONC, when they test re-identification of protected data, looks carefully at this output and at the process they used to come to this conclusion. I do not expect that their output is reusable, because De-Identification is highly contextual.

Surely more investigation needs to be done, but I like that this group was willing to think critically about the minimal information that they needed for success.

2 comments:

  1. I heard an interesting discussion on de-identifying data last week. An investigator said his interpretation of the law is that a randomly generated identifier is the only way to guarantee anonymity -- that other algorithms can be (and have been) hacked. At the same time, that still may not be enough to mask a patient with a rare genetic disease, as above.

  2. Yes, a good pseudonym generator should be used: one that assigns what look like random strings and has no actual pattern to them. But a pseudonym is only needed if one needs to link data across multiple instances in time. In the above case they simply want a snapshot of the data at a given time, so they don't even need a pseudonym.

    What has been cracked multiple times are data sets where the mistake was leaving in too much information. It is amazing how little information someone using public sources needs in order to re-identify a person. Unfortunately, healthcare data is very specific to sex, age and physical environment, which turns out to be enough information.

    This is why I was glad that in the case given, they really looked at what they minimally needed to know and stuck with that only. They then did risk assessment to further work through issues. Neither of these steps is common, especially the second.
