Tuesday, August 10, 2010

Data Classification - a key vector enabling rich Security and Privacy controls

In this article I will introduce the concept of Data Classification, what it means to Healthcare IT, and how it can enable rich Security and Privacy controls. There are many aspects to rich Security and Privacy controls, so Data Classification is not THE answer, just like Consent is not THE answer. These many aspects of Security and Privacy controls work together.

The concept of Data Classification is not new; it has likely been around for thousands of years. The idea is that some data is rather PUBLIC and can be exposed to anyone, while other data is very SECRET, the kind of data one would share only with a very small number of people. I am not going to say that this concept has been around since the XXX ages, because I don't know and don't think it is important to look it up. I am sure you can look down in the comments and someone will have done the research, and they should get credit.

There is much talk in the US healthcare initiatives about "Segmentation". Segmentation is typically what one does to physically separate different Data Classifications. I think it is being spoken about in the US healthcare initiatives more as a direct synonym for what the classic Security community sees as Data Classifications. Even if Segmentation is the classic idea of physically separating data of different Data Classifications, we still need to understand Data Classifications. I will assume that the word "Segmentation" is being used so as not to be confused with Military Classification Levels. They have the same goal, but there are individuals that can't grok that.

One of the models that has received the most visibility (ahem) is the Military "Classification Levels". We have all seen movies where there is a piece of paper or folder with the words "Top Secret" stamped across it. This is a 'Classification' of the content of the paper/folder into one of the various Classifications. Typically there are 5 classifications that we are used to seeing: Unclassified, Restricted, Confidential, Secret, and Top Secret. We all know from the movies that Unclassified documents are available to everyone, yet Top Secret documents can be seen only by a very few.

This label on the document is 'metadata' that is carried by the document. This metadata is an outcome of the classification step, which is independent of the provisioning of users into Roles and Roles into Permissions. Thus there are independent processes for classifying data and for assigning the permissions that a specific user has. This use of metadata also decouples the data from the transportation mechanism: no matter how the document is transported, the metadata that explains the classification stays attached.

Confidentiality Code

Healthcare Standards have our own version of this, the confidentialityCode. Looking at this today, we would likely pick a different name, but this is the one we got. Almost without exception healthcare standards have a confidentialityCode associated with all objects. This is true of HL7 CDA (CCD), HL7 v3 messages, and HL7 v2 messages. This is true of DICOM objects. This is true of IHE XD* metadata about documents. This is a universal place to hold the sensitivity classification of the object. This is a way that a sender can inform a receiver of how to handle the data in their access controls. This is a way that a publisher of documents can indicate the sensitivity classification of a document.

Unfortunately the vocabulary for the confidentialityCode needs some work. I say unfortunately, but in reality this reflects that we have not been able to look at this value very consistently. So today the vocabulary for confidentialityCode is part Sensitivity Classification and part Confidentiality Classification. These are not necessarily incompatible ideas, but they are trying to encode different things in the same value. Whereas a Sensitivity Classification would say this is "Secret", the confidentiality code would explain why it is secret (e.g. HIV).

My experience is that most implementations understand this, and choose a small subset of the existing confidentialityCodes and use them as Sensitivity Classifications, similar to the Military Classification. Note that, as with the Military Classifications, these are clearly increasing in sensitivity and don't expose 'why' the classification is given. There are other codes, but they are inconsistent with Segmentation. I recommend against using the other HL7 confidentialityCodes (e.g. HIV, PSY, SDV).

Useful Value-Set from the HL7 ConfidentialityCode vocabulary (2.16.840.1.113883.5.25)
L -- Low -- Low sensitivity
N -- Normal -- Normal clinical data to be handled by normal 'good health care practice' rules
R -- Restricted -- Restricted clinical data, restricted to those having a care relationship
T -- Taboo -- Information not to be discussed with the patient
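
As a minimal sketch, this small value set could be represented in Python as an ordered classification. The enum name, the numeric ranks, and the helpers `parse_code` and `most_restrictive` are assumptions made for illustration; the HL7 vocabulary does not define these numbers, and ranking T above R is a simplification of this sketch.

```python
from enum import IntEnum

# OID of the HL7 ConfidentialityCode vocabulary, as given above.
CONFIDENTIALITY_CODE_SYSTEM = "2.16.840.1.113883.5.25"


class ConfidentialityCode(IntEnum):
    """Subset of codes used as a sensitivity classification.

    The numeric ranks are an assumption of this sketch; they only
    express "more sensitive than" so that codes can be compared.
    """
    L = 1  # Low
    N = 2  # Normal
    R = 3  # Restricted
    T = 4  # Taboo


def parse_code(value):
    """Map a code string to the enum, rejecting codes outside the subset."""
    try:
        return ConfidentialityCode[value]
    except KeyError:
        raise ValueError(f"code {value!r} is not in the chosen value set") from None


def most_restrictive(codes):
    """Return the most sensitive code among those given."""
    return max(codes)
```

Rejecting anything outside the subset keeps reason-style codes such as HIV or PSY from slipping into a field that is meant to carry only a sensitivity level.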

The process for establishing an object's confidentialityCode must include the healthcare organization's evaluation of how sensitive the information is AND how the patient would evaluate how sensitive the information is (See Magic section below).


And we have even seen cases where documents with high classification levels are "Redacted", and the Redacted version of the document is considered to be of a lower classification level. For example, when someone makes a request for release of a highly classified document, large segments of the document may be blocked out. These large blocks of text might be specific names of individuals that were involved but who were not considered important to the message of the text.

When the identifiers are being Redacted, we in healthcare should see this as De-Identification. That is the idea that one can remove the identifiers from health information and thus the information becomes less 'sensitive'. I very much believe in the ability to use De-Identification to provide new-life to the data, but also know many ways in which it can go wrong. See De-Identification is highly contextual.

Clinical trials are a domain where there is funding and motivation to establish meaningful redaction and transform profiles that can provide automatic processing. It has been common for years to use manual Redaction (Paper and Film) on the data shared with a clinical trial. This is not just for Privacy reasons, but also for clinical trial integrity (blinding of factors other than the one being studied). This experience with Redaction has resulted in updates to standards such as DICOM supplement 143. An important distinction is that in a clinical trial it is very clear what the purpose of the specific trial is, what data is needed, and what data is not needed; this is not true in general healthcare.

Removal of the sensitive information is a very useful tool. Done wrong, it can be a very bad tool. There are plenty of examples of where someone has 'thought' they had done proper redaction, but it turned out to be simply a display layer where the actual sensitive information was still present. 


Redaction can be used in a different way that is highly useful. One might remove the most sensitive information from a document leaving just the critical health values. This form of Redaction is often seen as a "Transform" of the data. When the data is in XML form, this form of Redaction is quite easy. One must be very careful to only allow information into the transform that is well-known to be safe. This means that any free-text fields need to be blocked.
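
A minimal sketch of such an allow-list transform in Python, using the standard library's `xml.etree.ElementTree`. The element names, attribute names, and sample document are hypothetical and far simpler than a real CDA document; the point is only the technique: copy what is explicitly listed as safe and drop everything else, including all free text.

```python
import xml.etree.ElementTree as ET

# Hypothetical allow-list: element name -> attributes that may be copied.
# Anything not listed here is dropped, along with everything beneath it.
ALLOWED = {
    "summary": set(),
    "medication": {"code", "codeSystem"},
    "result": {"code", "value", "unit"},
}


def redact(element):
    """Return an allow-listed copy of element, or None if it is not allowed."""
    allowed_attrs = ALLOWED.get(element.tag)
    if allowed_attrs is None:
        return None  # unknown element: block it and its children
    copy = ET.Element(element.tag)
    for name, value in element.attrib.items():
        if name in allowed_attrs:
            copy.set(name, value)
    # element.text and element.tail are never copied, so free text is blocked.
    for child in element:
        kept = redact(child)
        if kept is not None:
            copy.append(kept)
    return copy


SOURCE = """<summary>
  <patientName>Jane Example</patientName>
  <medication code="M1" codeSystem="example" note="started after relapse">see chart</medication>
  <result code="R1" value="5.4" unit="mmol/L"/>
  <comment>Free-text remark that may contain anything</comment>
</summary>"""

redacted = redact(ET.fromstring(SOURCE))
output = ET.tostring(redacted, encoding="unicode")
```

Because the transform builds a new tree instead of deleting from the original, anything the allow-list does not mention cannot leak through by omission.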

One of the most talked about health documents today is a C32. A C32 is a Medical Summary; its content may include administrative (e.g., registration, demographics, insurance) and clinical (e.g., problem list, medication list, allergies, test results) information. Generally this is the minimum of clinical information that is needed by almost every caregiver. Although this document is not about a specific episode of care, it could contain information that is highly sensitive. It is possible to provide two documents: one with the highly sensitive information, and a transform from which this highly sensitive information has been removed. The system doing the publication of the document knows best what information is specifically sensitive, so it is uniquely qualified to publish these two different documents. These documents could then be Segmented or Classified differently.

In the context of XDS, the original document could be registered with an "R". The Transform could be registered using the XDS relationship of Transform with an "N" classification. In this way someone can see that the two documents are related and access the one that the user is authorized to view. This publication of two independent but linked documents may be preferred by those that want Segmentation, as there is a clear line between the two documents. This eliminates the need to perform a transform at retrieve time, and reduces the risk of accidental inappropriate disclosure. For example, the audit log clearly indicates which of the two documents was used, rather than relying on a record of a real-time transform.
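
The selection between the two linked documents can be sketched in Python as follows. This is not the XDS registry API; `DocumentEntry`, `transform_of`, and `pick_document` are hypothetical names that only model the idea of an original registered as "R" and a linked transform registered as "N".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DocumentEntry:
    """Hypothetical, simplified stand-in for a registry entry."""
    entry_id: str
    confidentiality_code: str
    transform_of: Optional[str] = None  # entry_id of the original, if any


REGISTRY = [
    DocumentEntry("doc-original", "R"),
    DocumentEntry("doc-redacted", "N", transform_of="doc-original"),
]


def pick_document(registry, permitted_codes):
    """Prefer the original; fall back to a transform the user may view."""
    visible = [e for e in registry if e.confidentiality_code in permitted_codes]
    originals = [e for e in visible if e.transform_of is None]
    if originals:
        return originals[0]
    return visible[0] if visible else None
```

A user permitted to see both "N" and "R" would get the original, a user permitted only "N" would get the redacted transform, and a user permitted neither would get nothing. The entry that was returned is also what an audit log would record.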

Enabling Policy

Just as the Military Classifications separate the act of classifying data from the act of authorizing individuals, so does the confidentialityCode. So, let's look at a simple Truth-Table. Note that I added "Research Information". This is just to show that the limited set of confidentialityCodes can be extended. So imagine that information marked with Research Information is specifically published for research use. This Truth-Table also shows the results of a set of Policies.

.                                   ConfidentialityCode

Functional Role
L N R T Research Information
Administrative Staff X X
Dietary Staff X
General Care Provider X X
Direct Care Provider X X X
Emergency Care Provider X X
Researcher X
Patient or Legal Representative X X X

It is very possible for a Patient Consent to directly influence these. For example the patient may be given the ability to provide their Preferences for this Truth-Table, and if accepted that would become the Policy for their data. Thus the Consent Policy becomes the Truth-Table for that patient's data.

There could be a different Truth-Table for "Emergency Mode" of operation, such as under natural disaster like Katrina.
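
A truth-table of this kind maps naturally onto a lookup from Functional Role to the set of permitted confidentialityCodes. The Python sketch below is a minimal illustration; the role-to-code assignments are placeholders and are not a reconstruction of the table above, and letting emergency mode take precedence over the patient's table is an assumption of this sketch, not a stated policy.

```python
# Illustrative policy only: these assignments are placeholders.
DEFAULT_POLICY = {
    "Dietary Staff": frozenset({"N"}),
    "Direct Care Provider": frozenset({"N", "R"}),
    "Researcher": frozenset({"Research Information"}),
}

# Hypothetical table used during an "Emergency Mode" of operation.
EMERGENCY_POLICY = {
    "Emergency Care Provider": frozenset({"L", "N", "R"}),
}


def is_permitted(role, code, patient_policy=None, emergency=False):
    """Decide access from role and confidentialityCode alone.

    patient_policy, when given, is the truth-table accepted from the
    patient's consent and replaces the default table for their data.
    """
    if emergency:
        table = EMERGENCY_POLICY  # assumption: emergency mode overrides consent
    elif patient_policy is not None:
        table = patient_policy
    else:
        table = DEFAULT_POLICY
    # Roles missing from the table get no access (default deny).
    return code in table.get(role, frozenset())
```

Note that the function never looks inside the data: the decision depends only on the label the publisher attached and the role the user was provisioned with, which is the separation the confidentialityCode is meant to provide.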


The unsaid magic in all of this is that there is no well-known mechanism today to automate the Data Classification. We have the place to store and communicate the Data Classification in the confidentialityCodes, but we don't have rules to use to determine if the object should be described as R rather than N. We have Consent Directives that can explain how to handle these differently categorized data, capturing the truth-table. But we do not have a good way to determine for any given object, what should the confidentialityCode be. We leave this as a task for anyone publishing data. The one creating a document or object is the one most aware of the information that is in the document or object. Today this is the individual or system that is expected to make the hard decision. In XD*, the confidentialityCode is mandatory. Most simply say "N".

We do have some regulatory assistance. In the USA, 42 CFR Part 2 "Confidentiality of Alcohol and Drug Abuse Patient Records" makes clear the types of things that we should certainly mark as "R". But even with this very specific regulation text, it is hard to be exact. Often, when one thinks one has a well-known set of rules, a doctor will come along and point out that a combination of three factors would indicate that a patient is HIV positive. These combinations become really hard to track down.

This Magic is likely to continue for some time. Is this a bad thing? It is bad in that it is a part of the process that is not deterministic, so yes, we want it fixed. Many organizations have tried to come up with a way to do this. Ultimately we would like to encapsulate the rules that the patient wants used to determine the Data Classification in a Patient Preference or Consent. This is ongoing standards work.

Usually the magic is the publishing doctor selecting a radio-button: R vs N. They are usually right.


For now, I just want to get confidentialityCode used. Having information differentiated by the Data Classification enables Role-Based-Access-Control. Without it, all data is seen as viewable by everyone.



    1. Glen,

      It is always good to remind everyone that once someone has a copy of the health information, they have a copy. This is not only a fact of reality, but in Medical care this is a requirement of Medical Records.

      As to reclassification of the data, this can be done in different ways. The most inclusive is to publish a NEW document with modified metadata with the confidentialityCode and have this REPLACE the prior version. This would be relying on Policy that would restrict access to deprecated documents. Another option is through some metadata update, such as the new IHE XDS supplement. The metadata update also supports a delete, as in really delete. There are also administrative channels that under special circumstances can handle this even deeper. This of course only helps future accesses to the data object.

    2. I'd like to see your military data classification analogy extended beyond managing confidentiality to integrity as well, just as Biba took the same object classification paradigm in the Bell-LaPadula model to try to maintain the trustworthiness of data. While I wouldn't suggest trying to leverage the single confidentialityCode for more than one purpose, at some future state of IT-enabled health care when an electronic health record is really an aggregation of data from multiple sources, it would be great to have some assertion of validity or accuracy or integrity built in to the data, particularly if that data might be used in clinical decision support.

    3. Steve,

      I think what you are proposing is already a foundational aspect of CDA documents themselves, and of most transports that healthcare uses. For example the IHE XD* family of profiles has metadata attributes that would be used for this purpose regardless of the document format, including a SHA1 hash of the document, Size of the document, Author of the document, etc...

      In addition, for those that need non-repudiation, Healthcare standards organizations have endorsed the use of Digital Signatures. Specifically within IHE there is the Document Digital Signature Profile, which is a profile of XAdES/XML-Digital Signature.

    4. There is, IMHO, a large gap between technical integrity of the sort the standards you cite provide and an actual assertion of integrity or validity. For example, in the NHIN environment under the DURSA participants are obligated to ensure only that the data they send is a faithful copy of the data they have stored, but this says nothing about the underlying integrity of the data they have, the errors it might contain, or even what set of users has access to the data stored by the entity. With visions for personal health records that allow individuals/patients to add to or edit their own records, a downstream consumer might want to know which data in an aggregate view was provided by the patient and which came from a clinician, clearinghouse, health plan, or other source, as the user's reliance on the data might (and should) be influenced by its source. There's two separate (and very hard) problems here, neither of which can really be addressed by health IT standards alone: 1) underlying data quality in health records and 2) the potential for inadvertent or malicious introduction of erroneous data into health record systems that then propagate those errors through data sharing.

    5. Steve,

      There are many levels of integrity controls that address different areas. The integrity control that the NHIN-Exchange enforces is transport integrity control, and is intended only to assure that what went in one end of the communications pipe came out the other end. This is an important integrity control, but is not the only important integrity control.

      The XD* metadata controls are there to indicate that the document that you got is the document that was registered. That is that when the Document Source published that they had a document, the size and hash were a specific value. By the time you pull a copy, the Document Consumer can use this hash to prove that the document they got is the one that was registered. This proves long-term storage integrity (with some specific use-case exceptions).

      The next layer of XD* metadata values inform about authorship, practice setting, legal authenticator, language, author facility, and various date/times. I think this is the set of values that you are looking for. This would tell you what kind of organization or person published the information. If it was a patient at a PHR, then the patient is the author. The problem you might be pointing out is that often a health information exchange will not require that these values be filled out. If they are not there, or are filled with dummy values, then you are very correct that you have no idea of the provenance of the document. I would not blame a doctor for not trusting a document when they don't have good information about where and from whom it came. My point is that these metadata values are there to use, and should not be made optional and should not be ignored.

      The next level of integrity and authenticity controls would be inside the document. Such as those inside a CDA/CCD document. These are VERY important, and why I push for structured/coded documents over the use of things like PDF. But some content is legacy, which is a good reason to mandate the XD* metadata.

      And the final level is a document-based digital signature. This signature is not only a technical integrity control, but also indicates the purpose of the signature. Today there are no operational environments that I know of that require a document digital signature. This is mostly because the various costs involved in digital signatures are greater than the value that the signature brings. I expect National health exchanges will reinvigorate this discussion.

    6. Great post. On the topic of "integrity/authenticity", I think you might also be interested in ... don't do that!, which talks about reporting the kind of clinical decision making associated with the data.

    7. John, thanks for this great post.

      One minor comment: In the enabling policy table, shouldn't the Dietary Staff and Care Providers also have access to Low sensitivity data?

    8. very nice post. Thank you.
      I think I now understand the issue: there is no unique way to apply sensitivity levels to individual objects within a document based on Confidentialitycode for the document.
      You say this is ongoing standards work, which SDO is working on this?


    9. Madjid,

      There is confidentialityCode metadata at many levels of HL7 objects. So you can have different confidentialityCodes on different sections of a CDA document. But one must still pick a confidentialityCode for the whole document. This is likely the most restrictive classification found inside.

      The other point I am trying to make across the blog is to ask a practical question of the granularity that is reasonable today, vs what is possible in theory. I believe that as we create cross-enterprise health information exchanges, we should start with the document as the smallest object that we are going to try to control. This should not be seen as a move to never support smaller objects, as mentioned above. I just want to get the discussion and implementations going.

    10. Auto categorization of data based on file type, content, metadata, etc., helps knowledge professionals to analyze fragmented data and make storing, processing, and retrieving it easier.