I am involved in much of the standards work in this space, actively working in IHE on a handbook and ISO on updates to the core standard on the subject. In all of these cases we are trying to make the 'art' of de-identification more measurable, repeatable, and approachable. Too often it is seen as too hard, more often it is seen as simple and thus mistakes are made. The goal I have is to make it clear.
Why De-Identify?First, one must understand that de-identification is just a method of lowering risk. The only way to get risk to zero is to have zero data. Even one data-element that one might consider to be purely clinical data does narrow down the population. Just to indicate that the weight of the subject is 203lbs will tell you much about the subject, if that value is 3lbs and you know the subject is a premature-baby, and if it is 403lbs it is clear you have limited the population. The first point is that all data are potentially identifiable, some data are less so.
Second, one must recognize that some data are outright Direct Identifiers. These data are in no uncertain terms identifiers. Full-Name is the most obvious. A Direct Identifier is something that is publicly known (knowable), therefore full-addresses, phone numbers, credit-card-numbers, and drivers-license-numbers. These items clearly can't be included in the de-identified data set. So they each need to be identified as a risk to be mitigated.
There are also a class of data that can be used in combination with other data in the data-set to identify a subject. Such as postal-codes, sex, date-of-birth, hospital identifier, or date-of-procedure. These are risky to be left in, so they need to be identified as potential to be mitigated.
The task of De-Identification is much like chemistry, bio-chemistry sometimes. One must understand the elements and how they interact. One must use various tools to separate or modify the elements. Each chemical process results in something useful for the purpose it was created. Some combinations of chemicals are very volitile, others benign, but all must be given respect.
De-Identification ProcedureThe procedure is simple. Ill include only the high-level, each step is more involved than I indicate here.:
- Identify what it is you want to do with data. This is your use-case. What are critical data attributes, and what are acceptable tolerances for each data attribute. You need to justify each element you want. You must also identify the acceptable level of risk, which includes assessment of the authorizations you have.
- Identify ALL of the data elements that you have. This is the data set that has not been de-identified. It might be a database, it might be a stream. You must identify all of the data, not just the data you are worried about. You then classify each attribute: Direct, Indirect, or simple data. Note that any unstructured data, otherwise known as free-text, must be considered Direct Identifier.
- Apply Mitigations, in theory. Given the use-case details you created in (1) and the data-element inventory you created in (2); apply the de-identification tools. (a) Redact - delete element, (b) Fuzz - modify within tolerance, (c) generalize - broader terms, or (d) replace - pseudonym. These are clearly not all the tools but the large categories of tools.
- Assess risk, in theory. How correlated are the data to a subject? Is this level of risk acceptable to the policy identified in (1)? Don't change your policy, that is the easy way out. Continue to apply mitigations. If further mitigations results in data that are not useful to your use-case, then you might need to change something else.
- Apply Mitigations to data-set and validate the results. As with any design-of-experiments one must be able to prove your theory. Is the resulting data just as de-identified as you expected? Is the resulting data useful for your use-case?