Thursday, July 3, 2014

Book: The Lean Startup

I am sure that GE is not alone in adopting the concepts of ‘The Lean Startup’ by Eric Ries. As a GE employee I am surrounded by this, especially as a Design Engineer working on leading-edge products and standards development. There is hardly an internal meeting where the key words of The Lean Startup are not spoken. So learning what these terms really mean is important.

The Lean Startup is a fantastic book. I was afraid that it would be unreasonable, too focused on the loose-and-carefree world of startups. It is nothing of the kind. It very carefully lays out the Lean Startup methodology, showing both how to use it and when not to use it.

What is very nice about the book is its narrative approach. For each topic covered it includes an illustrative example or two, often a failure case in addition to a successful case. Adding to the credibility, the author draws on his own failures most of the time.

I work for a Medical Device vendor that is highly structured around ‘the waterfall’ design methodology. The apparent conflict between this and the Lean Startup methodology is covered. It turns out this has been thought through, and there are ways to get the best of both worlds.

I highly suggest reading the whole book. I learned things that are just not learnable through an executive overview or common training. In reading the book I learned that many people are using the terminology wrong, while others are using it right. The most abused words are “Pivot” and “MVP”. I can now distinguish people just using the buzzwords from those using the proper concepts. I am now much more comfortable with the business and design organization changes, as they are clearly using the proper concepts.

Lucky for me, GE has a book service where I can get these business-purpose books free. The cool part is that most of these books come in audio or e-book form. I got this one in audio form, which turned out to be a bunch of MP3 files. This would have been fine with my old MP3 player, but nowadays I am using an iPhone 5. I couldn’t figure out how to get simple MP3 files onto the iPhone, and I refuse to load iTunes. So I ended up with a web-based hack. Ultimately this hack didn’t hurt too badly. The MP3 files were each about an hour and a half long, just perfect for my workout. An added benefit is that the book is read by the author.

Wednesday, July 2, 2014

PCAST - Big Data: A Technological Perspective

The USA White House "President's Council of Advisors on Science and Technology (PCAST)" has produced an interesting paper on "Big Data: A Technological Perspective". It is a worthy read, and I think much of it is rather level-headed and good advice to the President. I highly recommend reading it. It would make a nice layout for a college-level course in big-data privacy.

USA Centric viewpoint is bad:

It is, however, extremely USA-centric. This is to be expected, as it was written by PCAST. But it is this USA-centric thinking that led the NSA and FBI to do things that are not internationally friendly, especially regarding privacy. As a result, the international community no longer views USA-based businesses as an appropriate place to store, or be allowed to manage, their data. This short-sighted viewpoint is killing the USA market for big data. I think this is the most important policy change that must happen: the USA government must behave better, especially regarding international perspectives. The USA can be both friendly to privacy and friendly to business. The two are not in conflict.

Data Collection:

The paper argues
Recommendation 1. Policy Attention should focus more on the actual uses of big data and less on its collection and analysis.
That is, there should not be regulations on data collection; it is improper use of data that should be regulated. I agree with the outcome of this statement, but not the means. The outcome is that regulation is needed to punish poor use of data. This follows the principle I have explained on regulations: regulations need to be written to the outcome, not the technology. Note their Recommendation 2 is specifically on the topic of regulating outcomes, not the technology.

What I don't like about Recommendation 1 is that it presumes that all data that are collected will be perfectly protected. I don't see how anyone can presume that any data are going to be perfectly protected. There are breaches of data all the time. The Recommendation totally ignores all the unintended use caused by a breach. I argue this happens more than the other uses.

This Recommendation 1 is implicitly guided by the concept that although we might not have a use for the data we are collecting today, there might be a use for it in the future; if we don't collect it now, it won't be there in the future. Storage space for unnecessary data is cheap. I understand this business intent. I just think that the risks of exposure are higher than the future benefit of undefined use.

I would simply augment Recommendation 1 to guide for gathering the minimum data that is necessary for the intended use.

De-Identification is good when used right:

I have already commented on the topic of De-Identification and anonymization, even for free-text. This is a case where the report outright says that De-Identification should not be relied upon, as it is too easily defeated, especially where data fusion is applied.
Anonymization is increasingly easily defeated by the very techniques that are being developed for many legitimate applications of big data. In general, as the size and diversity of available data grows, the likelihood of being able to re‐identify individuals (that is, re‐associate their records with their names) grows substantially. While anonymization may remain somewhat useful as an added safeguard in some situations, approaches that deem it, by itself, a sufficient safeguard need updating.
I don't think that this concept is in conflict with what I say. The PCAST report takes the perspective that data should be allowed to be gathered in full fidelity where they are highly protected. If one is going to highly protect the data, then they do not benefit from the reduced risk that de-identification brings. To this point I agree. It is far better to protect the original data than to expose data that are poorly de-identified.

I do however think that there are 'uses' of data where de-identification is appropriate, even if simply as a data-reduction technique: using the de-identification process to eliminate data elements that are not necessary for the intended use of the resulting data-set. This elimination of data elements ends up with a smaller data-set, more concentrated on the needed elements. The lower risk is an added benefit.

In all cases where de-identification is used, one must consider the residual risk of the resulting data-set. Unless that data-set is empty, there is some residual risk.
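As a minimal sketch of this data-reduction idea, consider a default-deny filter over a record: only the elements the intended use justifies survive. The field names here are invented for illustration; a real process would also transform the surviving elements, not just select them.

```python
# Hypothetical sketch: de-identification as data reduction. Only elements
# justified by the intended use of the resulting data-set pass through.
# All field names are invented for illustration.

INTENDED_USE_FIELDS = {"age_group", "diagnosis_code", "outcome"}

def reduce_record(record: dict) -> dict:
    """Default-deny: an element survives only if the intended use justifies it."""
    return {k: v for k, v in record.items() if k in INTENDED_USE_FIELDS}

record = {
    "patient_name": "Jane Doe",   # direct identifier: dropped
    "age_group": "60-69",         # indirect identifier: justified, kept
    "diagnosis_code": "E11.9",
    "outcome": "discharged",
    "free_text_note": "...",      # unjustified element: dropped
}

reduced = reduce_record(record)
# reduced == {"age_group": "60-69", "diagnosis_code": "E11.9", "outcome": "discharged"}
```

The result is a smaller data-set concentrated on the needed elements; the residual risk of what remains (here, the indirect identifiers) still has to be assessed.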


Overall I like this paper. It has some depth that is nice to have. I encourage a full read of the paper, as the executive overview doesn't cover everything. The above observations are not surprising given that there are three representatives from Google on the team that wrote this paper. I am surprised at the lack of many of the other big-data perspectives.

Sunday, June 29, 2014

De-Identifying free-text

Each time I blog on De-Identification, I get questions about free-text fields. Free-text fields are covered in the ISO 25237 healthcare specification on De-Identification, in DICOM, and in the IHE Handbook on De-Identification. They all cover the broader concept of “Non-Structured Data Variables”. This includes free-text fields, but also recognizes other data that are not well constrained upon input and storage, such as voice recordings and images, and even calls out the medical imaging standards of DICOM. DICOM also addresses this problem space, pointing out that historically it has been common for a Radiology image to have identifying and routing information burned into the image.

Why is Free-Text a concern?

So the specific problem with non-structured data elements/attributes/variables is that they could contain Direct Identifiers, Indirect Identifiers, or simply non-identifying data. It is the very fact that they are non-structured that results in this non-deterministic situation. Often these fields are simple text-editing fields where the clinician can write anything they want. They might be prompted to enter relevant information, like a description of the disposition of the patient. However, without restrictions, the clinician could have put the patient name into the field.

Free-text is not safe

This is where the Intended Use-case of the resulting data-set comes into play. If the Intended Use-case has no need for these non-structured data fields, then simply drop them. My second rule of De-Identification is that by default you get ZERO data elements: the intended use-case needs to justify everything that is provided. Most of the time a free-text field is of no value. Often it is included simply to future-proof a workflow; that is, there is a free-text field to handle ‘anything else’. So deleting the free-text field is usually the right way to handle it.

Useful Free-text

However, there are times where, for example, a study is being done and the participating clinicians are instructed to put critical study information into a normally unused free-text field. Thus the field content is critical to the Intended Use-case of the resulting data-set. In this case, we know that the otherwise free-text field really might contain some structured data. So the easy thing to do is pre-process the free-text fields, extract the information needed by the Intended Use-case into a structured and coded entry, and throw away the rest of the free-text field. Treat the new coded entry according to the normal processing rules, which means determining whether it is a Direct Identifier, Indirect Identifier, or non-identifying data; one must also look at the resulting values to determine whether they might themselves identify an individual.

What is important here is to recognize that you have converted the free-text into structured and coded values. So, ultimately you are not passing free-text, the free-text field is destroyed.
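A minimal sketch of that pre-processing step follows. The marker pattern (`ARM: <letter>`), the field names, and the record shape are all invented for illustration; the point is that one coded value is parsed out and the free-text field itself is never passed through.

```python
import re

# Hypothetical sketch: clinicians in a study were told to record the study
# arm as "ARM: A" somewhere in a free-text note. We parse that one value
# into a structured, coded field and destroy the free-text entirely.
# Pattern and field names are invented for illustration.

ARM_PATTERN = re.compile(r"ARM:\s*([A-Z])")

def extract_and_drop(record: dict) -> dict:
    out = dict(record)                       # don't mutate the caller's record
    note = out.pop("free_text_note", "") or ""   # free-text never passes through
    match = ARM_PATTERN.search(note)
    if match:
        out["study_arm_code"] = match.group(1)   # new structured, coded entry
    return out

rec = {"subject": "PSN-0042", "free_text_note": "Pt doing well. ARM: B. Jane Doe called."}
print(extract_and_drop(rec))
# {'subject': 'PSN-0042', 'study_arm_code': 'B'}
```

Note the stray patient name in the note never reaches the output: only the extracted code does, and that code must still be assessed like any other data element.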

Unstructured Image

It is much less likely that you can post-process images or voice into structured data, but I am not going to say it can’t be done. DICOM has a very comprehensive treatment of de-identifying DICOM objects, specifically Annex E of Part 15:

5. The de-identifier should ensure that no identifying information that is burned in to the image pixel data is present, either because the modality does not generate such burned in identification in the first place, or by removing it through the use of the Clean Pixel Data Option; see Section E.3. If non-pixel data graphics or overlays contain identification, the de-identifier is required to remove them, or clean them if the Clean Graphics option is supported. See Section E.3.3. The means by which burned in or graphic identifying information is located and removed is outside the scope of this standard.


I am a fan of De-Identification, when used properly and for the right reason. However, De-Identification is not the only tool to be used; sometimes data simply should be properly managed, including Access Controls and Audit Controls. This same conclusion is true of data that are De-Identified: unless you end up with the null-set, you will have some risk that needs to be properly managed, including Access Controls and Audit Controls.

Free-text fields, and all fields that you don't know to have a specific structure, need to be treated carefully. The best case is to delete their content; but if you need part of the content, then parse that information out into structured and coded values, discarding the original free-text.

Friday, June 27, 2014

De-Identification: a process to reduce the risk of identification of entries in a data-set

It has been a very active De-Identification month. There have been many blog articles lately, some complaining about other blogs, and other blogs saying how overblown re-identification is. Many of these were inspired by the interesting paper from the USA White House "President's Council of Advisors on Science and Technology (PCAST)" on "Big Data: A Technological Perspective".

I would like to say first: YOU ALL ARE RIGHT, yet also slightly twisted in your perspective.

Whenever the topic of De-Identification comes up, I am quick to remind the audience that "The only truly de-identified data are the null-set!" It is important that everyone understand that as long as there are any data, there is some risk. This is not unlike encryption at the extremes: brute force can crack all encryption, but the 'key' (pun intended) is to make it so hard to brute force that the cost is simply too expensive (lengthy). Unlike encryption, however, de-identification is far less mature and effective.

There are plenty of cases where someone thought they had done a good enough job of de-identifying, only to be proven wrong. These cases are really embarrassing to those of us that are trying to use de-identification. But these cases almost always fail due to poor execution of the 'de-identification process'.

De-Identification is a process to reduce risk.

I have been working on the revision of the ISO 25237 healthcare specification on De-Identification. We are making it even more clear that this is just a risk reduction, not an elimination of risk. Often the result of a de-identification process is a data-set that still has some risk. Thus the de-identification process must consider the Security and Privacy controls that will manage the resulting data-set. It is rare to lower the risk so much that the data-set needs no ongoing security controls.

Visualize this as follows: the top-most concept is de-identification, as a process. This process utilizes sub-processes: Pseudonymization and/or Anonymization. These sub-processes use various tools that are specific to the type of data element they operate on and the method of risk reduction.

The presumption is that zero data are allowed to pass through the system. Each element must be justified by the intended use of the resulting data-set. This intended use of the data-set greatly affects the de-identification process.


De-Identification might leverage Pseudonymization where longitudinal consistency is needed. This might be to keep a bunch of records together that should be associated with each other, where without this longitudinal consistency they might get disassociated. This is useful to keep all of the records for a patient together, under a pseudonym. It can also be used to assure that each time data are extracted into a de-identified set, new entries are associated with the same pseudonym. In Pseudonymization the algorithm used might be intentionally reversible, or intentionally not reversible. A reversible scheme might use a secret lookup-table that, where authorized, can be used to discover the original identity. In a non-reversible scheme, a temporary table might be used during the process but destroyed when the process completes.
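Both styles can be sketched minimally. This is an illustration, not a vetted scheme: the reversible variant keeps a secret lookup-table that must itself be highly protected, while the non-reversible variant uses a keyed hash whose key is generated per run and destroyed afterward, preserving longitudinal consistency within the run but not reversibility. All names are invented.

```python
import hmac
import hashlib
import secrets

# Hypothetical sketch of the two pseudonymization styles.

class ReversiblePseudonymizer:
    """Reversible: a secret lookup-table maps pseudonym back to identity."""

    def __init__(self):
        self._table = {}   # pseudonym -> identity; this table must be protected

    def pseudonym(self, patient_id: str) -> str:
        for psn, pid in self._table.items():
            if pid == patient_id:
                return psn                       # longitudinally consistent
        psn = f"PSN-{len(self._table) + 1:06d}"
        self._table[psn] = patient_id
        return psn

    def reidentify(self, psn: str) -> str:       # authorized use only
        return self._table[psn]

def one_way_pseudonym(patient_id: str, run_key: bytes) -> str:
    """Non-reversible: keyed hash; run_key is destroyed when the run completes."""
    return hmac.new(run_key, patient_id.encode(), hashlib.sha256).hexdigest()[:12]

rev = ReversiblePseudonymizer()
p = rev.pseudonym("MRN-12345")
assert rev.pseudonym("MRN-12345") == p        # same patient, same pseudonym
assert rev.reidentify(p) == "MRN-12345"       # reversible when authorized

key = secrets.token_bytes(32)                 # generated per run, then discarded
assert one_way_pseudonym("MRN-12345", key) == one_way_pseudonym("MRN-12345", key)
```

Once `key` is discarded, no one (including the operator) can re-associate the hashed pseudonyms with identities, which is exactly the non-reversible behavior described above.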


Anonymization is the process and set of tools used where no longitudinal consistency is needed. The Anonymization process is also used, where Pseudonymization has been applied, to address the remaining data attributes. Anonymization utilizes tools like Redaction, Removal, Blanking, Substitution, Randomization, Shifting, Skewing, Truncation, Grouping, etc.

Each element allowed to pass must be justified. Each element must present the minimal risk, given the intended use of the resulting data-set. Thus where the intended use of the resulting data-set does not require fine-grain codes, a grouping of codes might be used.
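A few of those per-element tools can be sketched as small functions. These are invented examples of Truncation, Grouping, and Shifting; the field choices and grouping rules are illustrative, not a recommended policy.

```python
import datetime
import random

# Hypothetical sketches of three anonymization tools named above.

def truncate_zip(zip_code: str) -> str:
    """Truncation: keep only the 3-digit ZIP prefix."""
    return zip_code[:3] + "XX"

def group_age(age: int) -> str:
    """Grouping: replace an exact age with a 10-year band."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def shift_date(d: datetime.date, offset_days: int) -> datetime.date:
    """Shifting: move all dates for one record-set by the same secret offset,
    preserving intervals between events while hiding the true dates."""
    return d + datetime.timedelta(days=offset_days)

offset = random.randint(-365, 365)   # chosen once per patient, then protected
print(truncate_zip("53188"))         # 531XX
print(group_age(67))                 # 60-69
```

The grouping example matches the point above: where the intended use does not require fine-grain values, a coarser band presents less risk while remaining useful.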

Direct and Indirect Identifiers

The De-Identification process identifies three kinds of data: Direct Identifiers, which by themselves identify the patient; Indirect Identifiers, which provide correlation when combined with other indirect identifiers or external knowledge; and non-identifying data, the rest of the data. Some also refer to indirect identifiers as 'pseudo identifiers'.

Usually a de-identification process is applied to a data-set made up of entries that have many attributes; for example, a spreadsheet made up of rows of data organized by column.

The de-identification process, including pseudonymization and anonymization, is applied to all the data. Pseudonymization is generally used against direct identifiers, but might be used against indirect identifiers, as appropriate to reduce risk while maintaining the longitudinal needs of the intended use of the resulting data-set. Anonymization tools are used against all forms of data, as appropriate to reduce risk.

IHE De-Identification Handbook

Books on De-Identification

I just finished reading, and highly recommend, the book "Anonymizing Health Data: Case Studies and Methods to Get You Started" by Khaled El Emam. This is a good read for someone needing to understand the de-identification domain. It is not a reference or deep instructional document; I presume his other books cover that. There are some really compelling real-world examples in this book. There is also a very nicely done, high-level explanation of the quantitative mechanisms used to assess residual risk on a resulting data-set, such as K-anonymity.


Friday, June 6, 2014

FW: IHE ITI Published: PDQm, SeR, and De-Id Handbook

I will further explain these new supplements and handbook in later posts.

IHE IT Infrastructure Technical Framework Supplements Published for Public Comment

The IHE IT Infrastructure Technical Committee has published the following supplements to the IHE IT Infrastructure Technical Framework for public comment in the period from June 6 through July 5, 2014:
  • Patient Demographics Query for Mobile (PDQm) 
  • Secure Retrieve (SeR)
The documents are available for download. Comments submitted by July 5, 2014 will be considered by the IHE IT Infrastructure Technical Committee in developing the trial implementation versions of the supplements.

The committee has also published the following Handbook:
  • De-Identification
    • De-Identification Mapping (Excel file)
The documents are available for download. Comments on all documents are invited at any time.

Friday, May 16, 2014


Like PCAST before it, we now have a JASON report to cause all kinds of chatter. If you don't know either of these, then you are not missing anything useful. The most frustrating part of both the PCAST and JASON reports is that they don't use the common terminology but insist on introducing new terminology. Both introduced the term "Atom" to indicate the granularity at which health data would be communicated and managed. The implication is that if we could communicate and manage the smallest of data, then we would have the most powerful of Health Information Exchanges. I don't disagree with this, but must point out that "Atom" is not a helpful word.

An Atom is not clear

First, we need to make clear that the definition of 'Atom' needs to be carefully defined based on use-case. As with the chemical use of "Atom", there are very large atoms and very small atoms; yet all atoms are made up of sub-components that must not be further broken apart (or bad things happen).

An Atom must be meaningful

The clinical example often used is that clearly one would not communicate just the systolic blood pressure, and often it isn't even enough to simply pair up systolic and diastolic taken at the same time, as one may also need to know whether this is a resting BP, exercising BP, drug-induced BP, or drug-influenced BP, and what the body position of the patient was. Further, it can be important to know whether the data were gathered by a calibrated machine, a drug-store machine, a trained professional, etc. These are not metadata; they are critical components of the clinical fact, aka the smallest Atom that can be communicated.

FHIR Atom is a Resource

FHIR is defining Atom in a reasonable way that is informed by history on 'clinical facts'. These Atoms in FHIR are 'Resources'.

Each Resource carries metadata, including pointers (a form of metadata) to the patient identity, provider identity, setting, etc. For each resource there is provenance, security-tags, conformance-tags, etc.

XDS Atom is a Document

Second, we need to recognize that sometimes the Atom that gets communicated needs to be larger and more self-contained. This is the case with HIEs today, where a "Document" is what defines an "Atom". In that case (e.g. XDS, XCA, XDM, XDR, MHD) there is appropriate metadata per Atom. There are some really critical concepts of a Document that Resources and Messages don't have.

FHIR Atoms can make Documents

FHIR can also take this analogy further and make Molecules, one example of which is a Document. This is especially confusing, as FHIR uses the ATOM syndication 'standard' to do this composition of multiple resources.

FHIR can also be the 'last mile' API to XDS, XCA, and XDR.

FHIR can also be used to access a decomposed Document. That is, a document can be submitted or communicated, then decomposed into parts which can be accessed using FHIR. One could even build a "Service" to which you send a CDA document; it decomposes the document, offers it up as FHIR resources, and flushes them when those resources are no longer needed.

Concluding Atom

We can NOT let the shiny new thing, aka FHIR, distract from very good progress on the Document Sharing model. We have already seen negative progress due to the Direct distraction. I am absolutely committed to FHIR as the future model, especially for the edge-device API. However, FHIR is still under development and unproven. Now is the time to work to get FHIR developed and proven.

I simply recommend we start with fully self-contained Atoms (Documents) and work toward more discrete Atoms (Resources). The last-mile API should prioritize use of FHIR at the Resource level; the backbone used between organizations should prioritize use of Documents.

Friday, April 4, 2014

Murky Research Award

I am going to take a page from Keith and his Ad Hoc Motorcycle Guy Harley Award. This is an authorized pillage of his idea. I thus create the Murky Research Award, with a tip of the hat to Car Talk - Click and Clack - Murky Research. I am constantly reminded of Murky Research when I explain to people how to pronounce my name. (Keith also recommended this title.) Sorry my graphic isn't as nice as the Ad Hoc Motorcycle Guy Harley Award.

The First Murky Research Award goes to Josh Mandel, who showed tremendous research abilities, transparency, and ultimate professionalism in his pursuit of knowledge on security vulnerabilities he discovered in some EHR products regarding malformed CDA (an XML form) documents that are not robustly sanitized and validated before being displayed using a simple stylesheet and an off-the-shelf browser (or browser framework). The details of this are far better explained by Josh.

Dear Strucdoc and Security WGs,

In this era of personal health records and Direct messaging, it's increasingly unrealistic to assume that an EHR can trust every (C-)CDA document that arrives in a clinician's inbox. Here's an article I've published on the SMART Platforms blog describing a set of security considerations for the display of potentially malicious C-CDA documents:

This post describes a set of security considerations that are probably well-known to many of you -- but that have been overlooked by multiple real-world EHR products, leading to serious vulnerabilities. 

Bringing "best practices" to real-world implementations is critical, and as a community we should think about how HL7 might help. (In this specific case, for example, by hardening stylesheets and including warnings that these stylesheets are unsafe for use with untrusted documents. In general, by advocating for well-defined vulnerability reporting protocols and bounty programs.)



Not only did Josh do the research into the deep details and write them up in exacting detail, but what you all don't yet know is that he has been working one-on-one with the vendor community to help them understand the problem, multiple times delaying his release to give a vendor another week. He did all this with the utmost discretion and professionalism. I know he is going to publish more, deeper details.

It is not easy for someone who knows this level of problem to be so professional and to utilize the rules of responsible disclosure. My hat goes off to Josh Mandel. Thank You.
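The general class of vulnerability here, a stylesheet copying untrusted document text straight into HTML rendered by a browser, can be illustrated with a minimal sketch. This is not Josh's actual finding or any vendor's code; it is a generic example of why escaping untrusted text matters before display.

```python
import html

# Generic illustration: if a viewer copies document text into HTML without
# escaping, an attacker-controlled CDA field becomes live markup (script)
# in the clinician's browser. Escaping renders it as inert text.

untrusted_field = '<script>alert("owned")</script>'   # attacker-supplied content

unsafe_html = f"<td>{untrusted_field}</td>"               # script would execute
safe_html = f"<td>{html.escape(untrusted_field)}</td>"    # displayed as text

assert "<script>" in unsafe_html
assert "<script>" not in safe_html
print(safe_html)
# <td>&lt;script&gt;alert(&quot;owned&quot;)&lt;/script&gt;</td>
```

Real hardening involves more than escaping one field (sanitizing the whole document, constraining the stylesheet, sandboxing the renderer), but this is the core failure mode: trusting content simply because it arrived as a well-formed document.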