Healthcare Exchange Standards: Anonymizing patient records for genomics

Monday, April 12, 2010

Anonymizing patient records for genomics

This article in the journal Nature points to a nice risk analysis and mitigation plan that gives researchers access to genetic information together with the diagnosis codes known for each patient. They have even added a mitigation, a minimum-size threshold combined with grouping of codes, to ensure that no pool of patients sharing the same diagnosis codes is too small.

To solve this problem, the new method allows researchers to set two parameters: the minimum number of patients (k) that should have the same set of codes, and a 'utility policy' which specifies how codes should be linked in the anonymized data.
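To make the two parameters concrete, here is a minimal sketch of the idea, not the algorithm from the article. It assumes each patient is reduced to a set of diagnosis codes, treats k as the smallest acceptable number of patients sharing one exact set, and models the utility policy as a plain mapping from specific codes to a group label. The function names, the patient IDs, the codes, and the policy are all made up for illustration.

```python
from collections import Counter


def code_set_counts(records):
    """Count how many patients share each exact set of diagnosis codes."""
    return Counter(frozenset(codes) for codes in records.values())


def violates_k(records, k):
    """Return the code sets that are shared by fewer than k patients."""
    return {codes for codes, n in code_set_counts(records).items() if n < k}


def apply_policy(records, policy):
    """Replace each code with its group label from the policy, if it has one."""
    return {
        pid: {policy.get(code, code) for code in codes}
        for pid, codes in records.items()
    }


# Hypothetical example data: patient ID -> set of diagnosis codes.
records = {
    "p1": {"250.00"},
    "p2": {"250.01"},
    "p3": {"250.00"},
    "p4": {"401.9"},
    "p5": {"401.9"},
}

# Hypothetical utility policy: these two codes may be linked into one group.
policy = {"250.00": "250.0x", "250.01": "250.0x"}

k = 2
before = violates_k(records, k)                        # p2 stands alone
after = violates_k(apply_policy(records, policy), k)   # nothing below k

print("below k before grouping:", before)
print("below k after grouping:", after)
```

In this toy data, one patient has a code set nobody else shares, so that set falls below k = 2. Once the policy links the two related codes, every remaining set is shared by at least two patients. A real implementation would also have to choose among many possible groupings while keeping the data useful, which this sketch does not attempt.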

I really like the approach taken: it starts from the minimal information the researchers need and uses a risk assessment to determine how to deliver it. From my reading, they realized that they only needed the known diagnosis codes; they did not need demographics or other indirect identifiers. At least, that is all the article says they are taking in.

I like this because it nicely follows the approach that I outlined in "De-Identification is highly contextual". I hope that the ONC, when they test re-identification of protected data, looks carefully at both this output and the process used to arrive at it. I do not expect the output itself to be reusable, precisely because de-identification is highly contextual.

Surely more investigation needs to be done, but I like that this group was willing to think critically about the minimal information they needed for success.

2 comments:

Kate, April 13, 2010 at 10:08 AM
I heard an interesting discussion on de-identifying data last week. An investigator said his interpretation of the law is that a randomly generated identifier is the only way to guarantee anonymity, and that other algorithms can be (and have been) hacked. At the same time, that still may not be enough to mask a patient with a rare genetic disease, as above.

John Moehrke, April 13, 2010 at 9:36 PM
Yes, a good pseudonym generator should be used: one that assigns what look like random strings and has no actual pattern to it. But a pseudonym is only needed if one needs to link data across multiple points in time. In the above case they simply want a snapshot of the data at a given time, so they don't even need a pseudonym.

What has been cracked multiple times are data sets where the mistake was leaving in too much information. It is amazing how little information someone needs in order to re-identify a record using public information. Unfortunately, healthcare data is very specific to sex, age, and physical environment, which turns out to be enough.

This is why I was glad that, in the case given, they really looked at what they minimally needed to know and stuck with only that. They then did a risk assessment to work through the remaining issues. Neither of these steps is common, especially the second.

