Sunday, June 29, 2014

De-Identifying free-text

Each time I blog on De-Identification, I get questions about free-text fields. Free-text fields are covered in the ISO 25237 healthcare specification on De-Identification, in DICOM, and in the IHE Handbook on De-Identification. They all cover the broader concept of “Non-Structured Data Variables”. This includes free-text fields, but also recognizes other data that are not well constrained upon input and storage. So this includes voice recordings and images, and even calls out the DICOM medical imaging standard. DICOM itself also addresses this problem space, pointing out that historically it has been common for radiology images to have identifying and routing information burned into the image.

Why is Free-Text a concern?

The specific problem with non-structured data elements/attributes/variables is that they could contain Direct Identifiers, Indirect Identifiers, or simply non-identifying data. It is the very fact that they are non-structured that results in this non-deterministic situation. Often these fields are simple text-editing fields where the clinician can write anything they want. They might be prompted to enter relevant information, such as a description of the disposition of the patient. However, without restrictions, the clinician could have put the patient name into the field.

Free-text is not safe

This is where the Intended Use-case of the resulting data-set comes into play. If the Intended Use-case has no need for these non-structured data fields, then simply drop them. My second rule of De-Identification is that by default you get ZERO data elements. The intended use-case needs to justify everything that is provided. Most of the time a free-text field is of no value to the intended use-case. Often they are included simply to future-proof a workflow; that is, there is a free-text field to handle ‘anything else’. So deleting the free-text field is most likely the right way to handle it.
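As a minimal sketch of the "ZERO data elements by default" rule: the field names, the sample record, and the `apply_default_deny` helper below are all made up for illustration. The point is only that nothing passes unless it is on a list that the intended use-case has justified, so an unjustified free-text field simply never makes it out.

```python
# Hypothetical allow-list: only fields the intended use-case has justified.
JUSTIFIED_FIELDS = {"age_group", "diagnosis_code"}


def apply_default_deny(record, justified=JUSTIFIED_FIELDS):
    """Return a copy of record holding only justified fields.

    Anything not listed, including free-text fields, is dropped.
    """
    return {name: value for name, value in record.items() if name in justified}


record = {
    "patient_name": "Jane Example",  # made-up data
    "age_group": "40-49",
    "diagnosis_code": "E11",
    "notes": "Jane Example seen today, doing well.",  # free-text, not justified
}

released = apply_default_deny(record)
print(released)
```

An allow-list rather than a block-list is the design choice that matters here: a new field added to the source system later is excluded until someone justifies it.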

Useful Free-text

However, there are times where, for example, a study is being done and the participating clinicians are instructed to put critical study information into a normally unused free-text field. Thus the field content is critical to the Intended Use-case of the resulting data-set. In this case, we know that the otherwise free-text field might really contain some structured data. So the easy thing to do is pre-process the free-text field to extract the information needed by the Intended Use-case into a structured and coded entry, and throw away the rest of the free-text field. Treat the new coded entry according to the normal processing rules: it must be determined whether it is a Direct Identifier, Indirect Identifier, or non-identifying data, and one must also look at the resulting values to determine whether they might themselves identify an individual.

What is important here is to recognize that you have converted the free-text into structured and coded values. So ultimately you are not passing free-text; the free-text field is destroyed.
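Here is a rough sketch of that extraction step. It assumes a hypothetical convention in which study clinicians were told to write entries such as "ARM=B" and "PAIN=7" into the notes field; the patterns, field names, and sample record are invented for illustration. Only the values that match the expected patterns survive, and the free-text field itself is not carried forward.

```python
import re

# Hypothetical convention: study clinicians were told to write entries such as
# "ARM=B" and "PAIN=7" into the otherwise unused notes field.
ARM_PATTERN = re.compile(r"\bARM=([AB])\b")
PAIN_PATTERN = re.compile(r"\bPAIN=(10|[0-9])\b")


def extract_study_values(free_text):
    """Pull out only the expected coded values; everything else is discarded."""
    arm = ARM_PATTERN.search(free_text)
    pain = PAIN_PATTERN.search(free_text)
    return {
        "study_arm": arm.group(1) if arm else None,
        "pain_score": int(pain.group(1)) if pain else None,
    }


def replace_free_text(record, field="notes"):
    """Return a copy of record with the free-text field replaced by coded values."""
    cleaned = {k: v for k, v in record.items() if k != field}
    cleaned.update(extract_study_values(record.get(field, "")))
    return cleaned


record = {
    "diagnosis_code": "M54.5",
    "notes": "Pt John Smith, ARM=B, reports PAIN=7 today.",  # made-up data
}
cleaned = replace_free_text(record)
print(cleaned)
```

The new `study_arm` and `pain_score` entries still need to go through the normal classification as Direct Identifier, Indirect Identifier, or non-identifying data; the extraction only guarantees that the patient name typed into the note does not ride along.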

Unstructured Image

It is much less likely that you can post-process images or voice into structured data, but I am not going to say it can’t be done. DICOM has a very comprehensive treatment of De-Identifying DICOM objects. Specifically, Annex E of Part 15 states:

5. The de-identifier should ensure that no identifying information that is burned in to the image pixel data either because the modality does not generate such burned in identification in the first place, or by removing it through the use of the Clean Pixel Data Option; see Section E.3. If non-pixel data graphics or overlays contain identification, the de-identifier is required to remove them, or clean them if the Clean Graphics option is supported. See Section E.3.3 The means by which burned in or graphic identifying information is located and removed is outside the scope of this standard.
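As the quoted text says, how burned-in identification is located and removed is outside the scope of the standard. The sketch below therefore assumes the hard part is already done: it takes a region that is already known to hold burned-in text and overwrites it. It uses a plain list-of-lists as a stand-in for pixel data; it is not the DICOM Clean Pixel Data Option itself and does not use any DICOM library, and `blank_region` is a made-up helper name.

```python
def blank_region(pixels, top, left, height, width, fill=0):
    """Return a copy of a 2D pixel grid with a rectangle overwritten by fill."""
    cleaned = [row[:] for row in pixels]  # copy so the input is not modified
    for r in range(top, min(top + height, len(cleaned))):
        for c in range(left, min(left + width, len(cleaned[r]))):
            cleaned[r][c] = fill
    return cleaned


# Toy 3x4 "image": pretend the first row holds a burned-in patient banner.
image = [
    [9, 9, 9, 9],
    [5, 5, 5, 5],
    [5, 5, 5, 5],
]
cleaned = blank_region(image, top=0, left=0, height=1, width=4)
print(cleaned)
```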


I am a fan of De-Identification when it is used properly and for the right reason. However, De-Identification is not the only tool; sometimes data simply should be properly managed, including with Access Controls and Audit Controls. The same conclusion holds for data that are De-Identified: unless you end up with the null-set, some risk remains and needs to be properly managed, including with Access Controls and Audit Controls.

Free-text fields, and all fields that you don't know to have a specific structure, need to be treated carefully. The best case is to delete their content; if you need part of the content, parse that information out into structured and coded values and discard the original free-text.

Friday, June 27, 2014

De-Identification: a process to reduce the risk of identification of entries in a data-set

It has been a very active De-Identification month. There have been many blog articles lately, some complaining about other blogs, others saying how overblown re-identification is. Many of these were inspired by the USA White House "President's Council of Advisors on Science and Technology (PCAST)", which produced an interesting paper on "Big Data and Privacy: A Technological Perspective".

I would like to say first: YOU ALL ARE RIGHT, yet also slightly twisted in your perspective.

Whenever the topic of De-Identification comes up, I am quick to remind the audience that "The only truly de-identified data are the null-set!". It is important that everyone understand that as long as there are any data, there is some risk. This is not unlike encryption, at the extremes, in that brute force can crack all encryption, but the 'key' (pun intended) is to make it so hard to brute force that the cost is simply too expensive (lengthy). Unlike encryption, however, de-identification is much less mature and much less effective.

There are plenty of cases where someone thought they had done a good enough job of de-identifying, only to be proven wrong. These cases are really embarrassing to those of us who are trying to use de-identification. But these cases almost always fail due to poor execution of the 'de-identification process'.

De-Identification is a process to reduce risk.

I have been working on the revision of the ISO 25237 healthcare specification on De-Identification. We are making it even more clear that this is just a risk reduction, not an elimination of risk. Often times the result of a de-identification process is a data-set that still has some risk. Thus the de-identification process must consider the Security and Privacy controls that will manage the resulting data-set. It is rare to lower the risk so much that the data-set needs no ongoing security controls.

The following is a visualization of this process. This shows that the top-most concept is de-identification, as a process. This process utilizes sub-processes: Pseudonymization and/or Anonymization. These sub-processes use various tools that are specific to the type of data element they operate on, and the method of risk reduction.

The presumption is that zero data are allowed to pass through the system. Each element must be justified by the intended use of the resulting data-set. This intended use of the data-set greatly affects the de-identification process.


De-Identification might leverage Pseudonymization where longitudinal consistency is needed. This might be to keep a bunch of records together that should be associated with each other, where without this longitudinal consistency they might get disassociated. This is useful to keep all of the records for a patient together, under a pseudonym. This also can be used to assure that, each time data are extracted into a de-identified set, new entries are associated with the same pseudonym. In Pseudonymization the algorithm used might be intentionally reversible, or intentionally non-reversible. A reversible scheme might be a secret lookup-table that, where authorized, can be used to discover the original identity. In a non-reversible scheme, a temporary table might be used during the process but is destroyed when the process completes.
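A minimal sketch of the lookup-table idea, covering both variants. The `Pseudonymizer` class, the "P-" prefix, and the sample identifiers are illustrative assumptions, not something taken from ISO 25237 or the IHE handbook. Pseudonyms are random, so the only link back to the original identity is the table itself.

```python
import secrets


class Pseudonymizer:
    """Assign a stable random pseudonym to each original identifier."""

    def __init__(self):
        self._forward = {}  # original identifier -> pseudonym
        self._reverse = {}  # pseudonym -> original identifier

    def pseudonym_for(self, original_id):
        """Return the same pseudonym every time for the same identifier."""
        if original_id not in self._forward:
            pseudonym = "P-" + secrets.token_hex(8)
            while pseudonym in self._reverse:  # extremely unlikely collision
                pseudonym = "P-" + secrets.token_hex(8)
            self._forward[original_id] = pseudonym
            self._reverse[pseudonym] = original_id
        return self._forward[original_id]

    def reidentify(self, pseudonym):
        """Reversible use: only for authorized holders of the secret table."""
        return self._reverse[pseudonym]

    def destroy(self):
        """Non-reversible use: discard the table once processing completes."""
        self._forward.clear()
        self._reverse.clear()


p = Pseudonymizer()
records = [("MRN-001", "visit 1"), ("MRN-002", "visit 1"), ("MRN-001", "visit 2")]
released = [(p.pseudonym_for(mrn), visit) for mrn, visit in records]
print(released)
```

Note the trade-off: once `destroy()` is called, a later extraction can no longer be tied to the same pseudonyms, so the non-reversible variant gives consistency only within a single run.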


Anonymization is the process and set of tools used where no longitudinal consistency is needed. Anonymization is also used alongside Pseudonymization, to address the remaining data attributes. Anonymization utilizes tools like Redaction, Removal, Blanking, Substitution, Randomization, Shifting, Skewing, Truncation, Grouping, etc.
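To make a few of these tools concrete, here is a small sketch of Truncation, Shifting, and Grouping. The function names, the three-character postal prefix, the ten-year age band, and the sample values are all assumptions picked for illustration; the right parameters depend on the intended use and the risk assessment.

```python
from datetime import date, timedelta


def truncate_postal_code(code, keep=3):
    """Truncation: keep only the leading characters of the code."""
    return code[:keep]


def shift_date(value, offset_days):
    """Shifting: move a date by a caller-supplied number of days."""
    return value + timedelta(days=offset_days)


def group_age(age, width=10):
    """Grouping: report an age band instead of the exact age."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print(truncate_postal_code("53202"))       # 532
print(shift_date(date(2014, 6, 29), -12))  # 2014-06-17
print(group_age(47))                       # 40-49
```

If intervals between events matter to the intended use, one offset would typically be applied to all dates of the same patient, so the spacing between visits is preserved while the actual dates are not.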

Each element allowed to pass must be justified. Each element must present the minimal risk, given the intended use of the resulting data-set. Thus where the intended use of the resulting data-set does not require fine-grain codes, a grouping of codes might be used.

Direct and Indirect Identifiers

The De-Identification process identifies three kinds of data: Direct identifiers, that by themselves identify the patient; Indirect identifiers, that provide correlation when used with other indirect identifiers or external knowledge; and non-identifying data, the rest of the data. Some also refer to indirect identifiers as 'pseudo identifiers'.

Usually a de-identification process is applied to a data-set made up of entries that have many attributes. An example is a spreadsheet, made up of rows of data organized by column.

The de-identification process, including pseudonymization and anonymization, is applied to all the data. Pseudonymization is generally used against direct identifiers, but might be used against indirect identifiers, as appropriate to reduce risk while maintaining the longitudinal needs of the intended use of the resulting data-set. Anonymization tools are used against all forms of data, as appropriate to reduce risk.
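Putting the pieces together for a spreadsheet-like data-set, a sketch might look like the following. The column names, their classification, and the choice of tool per column are hypothetical and hard-coded for one imagined intended use: one direct identifier is pseudonymized, the other is removed, an indirect identifier is grouped, and anything unclassified is dropped by default.

```python
import secrets

# Hypothetical classification of columns for one intended use.
COLUMN_KINDS = {
    "mrn": "direct",
    "name": "direct",
    "birth_year": "indirect",
    "lab_result": "non-identifying",
}


def deidentify_rows(rows, kinds=COLUMN_KINDS, key_column="mrn"):
    """Apply pseudonymization and anonymization column by column."""
    pseudonyms = {}  # temporary table; discarded when the function returns
    out = []
    for row in rows:
        new_row = {}
        for column, value in row.items():
            kind = kinds.get(column)
            if kind == "direct":
                if column == key_column:
                    # Pseudonymize: same original value -> same pseudonym.
                    if value not in pseudonyms:
                        pseudonyms[value] = "P-" + secrets.token_hex(8)
                    new_row["pseudonym"] = pseudonyms[value]
                # Other direct identifiers are removed.
            elif kind == "indirect":
                # Anonymize by grouping; here the only indirect column is an
                # integer birth year, reported as a decade.
                new_row[column] = f"{(value // 10) * 10}s"
            elif kind == "non-identifying":
                new_row[column] = value
            # Unclassified columns (e.g. free-text notes) are dropped.
        out.append(new_row)
    return out


rows = [  # made-up data
    {"mrn": "MRN-001", "name": "Jane Example", "birth_year": 1967,
     "lab_result": 5.4, "notes": "call Jane"},
    {"mrn": "MRN-001", "name": "Jane Example", "birth_year": 1967,
     "lab_result": 6.1, "notes": ""},
]
result = deidentify_rows(rows)
print(result)
```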

IHE De-Identification Handbook

Books on De-Identification

I just finished reading, and highly recommend, the book "Anonymizing Health Data: Case Studies and Methods to Get You Started", by Khaled El Emam. This is a good read for someone needing to understand the de-identification domain. It is not a reference or a deep instructional document; I presume his other books cover that. There are some really compelling real-world examples in this book. There is also a very nicely done high-level explanation of the quantitative mechanisms to assess residual risk on a resulting data-set, such as K-anonymity.
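For readers who have not met K-anonymity: a data-set is k-anonymous when every combination of indirect-identifier values that appears in it is shared by at least k entries. The sketch below is my own illustration, not taken from the book; the function name, column names, and rows are made up. It reports the size of the smallest such group.

```python
from collections import Counter


def k_anonymity(rows, indirect_identifiers):
    """Return the size of the smallest group of rows that share the same
    values for all of the listed indirect identifiers (0 for no rows)."""
    if not rows:
        return 0
    groups = Counter(
        tuple(row[name] for name in indirect_identifiers) for row in rows
    )
    return min(groups.values())


rows = [  # made-up data
    {"age_group": "40-49", "zip3": "532", "dx": "E11"},
    {"age_group": "40-49", "zip3": "532", "dx": "I10"},
    {"age_group": "40-49", "zip3": "532", "dx": "E11"},
    {"age_group": "50-59", "zip3": "532", "dx": "E11"},
    {"age_group": "50-59", "zip3": "532", "dx": "J45"},
]
print(k_anonymity(rows, ["age_group", "zip3"]))  # 2
```

A result of 1 would mean at least one entry is unique on its indirect identifiers, which is usually the signal to group or truncate further before release.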

References to Blog articles

Friday, June 6, 2014

FW: IHE ITI Published: PDQm, SeR, and De-Id Handbook

I will further explain these new supplements and the handbook in later posts.

IHE IT Infrastructure Technical Framework Supplements Published for Public Comment

The IHE IT Infrastructure Technical Committee has published the following supplements to the IHE IT Infrastructure Technical Framework for public comment in the period from June 6 through July 5, 2014:
  • Patient Demographics Query for Mobile (PDQm) 
  • Secure Retrieve (SeR)
The documents are available for download at Comments submitted by July 5, 2014 will be considered by the IHE IT Infrastructure Technical Committee in developing the trial implementation versions of the supplements. Comments can be submitted at

The committee has also published the following Handbook:
  • De-Identification
    • De-Identification Mapping (Excel file)
The documents are available for download at Comments on all documents are invited at any time and can be submitted at