Thursday, August 23, 2012

Effective Standards Evaluation

Guest blog by Karen Witting, co-chair of the IHE ITI Planning Committee 

The NwHIN Power Team has been tasked with creating metrics through which standards are assessed for Maturity and Adoptability.  Their Final Recommendation was presented to the HIT Standards Committee on 8/15/2012 and was met with approval from the committee's members.

The criteria they recommend are very detailed, requesting high/medium/low ratings on many different aspects of a standard.  The details of these criteria were also provided to the HIT Standards Committee.

The Power Team has done a very complete job, listing, for each metric, the attributes that would qualify a standard for each of the three ratings.  Some attributes are very specific, laying out concrete metrics for things like coordinated implementations and the age of the oldest known conforming implementation.  Others seem concrete on the surface but have hidden challenges, like "number of organizations supporting authorship and/or review".  The difference between the number of authors and the number of reviewers is pretty significant, and it seems confusing to mix the two together in one metric.  But most are based on the perception of the reader: things like "no references", "few references" and "numerous references", or "few users", "limited users" and "active users".

Effective Evaluation vs. Just Evaluation
While I might disagree with the clarity and usefulness of some of the metrics, my larger concerns are about the validity and level of detail the metrics call for.  For example, under Maturity, the Stability metric lists number of releases and problem history.  It suggests that standards with more releases are less stable than those with fewer releases.  But in reality the number of releases is only tangentially related to stability.  It is the significance of the change between releases that matters much more, and only those very familiar with the standard can assess the significance of a change.  The same can be said of problem history.  As such, these "metrics" are extremely subjective, especially in the hands of someone who is not intimately involved in the standard's development.  Simple lack of knowledge leads to reliance on anecdotal evidence, which produces a result just as subjective as would be achieved without the detailed criteria.

I observed this challenge as I listened to the NwHIN Power Team assess InfoButton as a test of their criteria.  Those doing the assessment were given only the InfoButton specification and Implementation Guide and asked to assess the standard on criteria that go well beyond anything a specification or implementation guide is designed to address.  No effort was made to gather evidence about implementation and deployment of the standard.  In fact, a paper assessing implementation of InfoButton would have provided very useful data to consider when making the assessment.  During the discussion I heard statements such as "I don't know anybody who has implemented this" when, in fact, the paper lists 17 implementing organizations spanning health IT vendors, healthcare organizations and knowledge publishers.

My understanding of the purpose of the NwHIN Power Team's criteria is to improve the process of assessing standards by creating metrics that enable the assessment to be done in a less subjective manner.  While the criteria developed by the NwHIN Power Team suggest that much more detailed knowledge goes into making the assessment, I'm not convinced that the criteria alone will reduce the subjective nature of assessing standards.  Having a detailed set of criteria is fine, and probably helpful for those not familiar with standards.  But equally important, if not more so, is gathering the data on which the assessment can be based.  Through my involvement in IHE I have seen this done many times, since for each new profile we do a similar type of standards assessment.  We always assess the standards we select for maturity and adoptability.  But we do this by gathering as much data as can reasonably be gathered through web searches, queries to supporting standards bodies, or requests sent through peer networks.

In the case of the InfoButton assessment my concerns may be unfounded, as this was more a test case than a true assessment.  But this same committee did do some real assessments of NwHIN standards, and I saw the same approach: very little time spent gathering data about the standard, and a lot of time spent talking among people who often have no first-hand experience with it.  This is not a good process for assessing standards.

My wish is that any group which uses the NwHIN Power Team's criteria ensure that, prior to doing any assessment, a thorough and complete data gathering task is completed, so that all reasonable data regarding the standard, its implementation and its deployment are available.  All assessments are subjective; there is no avoiding that.  But the more data used in making the assessment, the less subjective it will be.  Having detailed criteria is only useful if the group also invests in gathering all relevant data.

IBM disclaimer:  "This posting is Karen's personal opinion and doesn't necessarily represent IBM's positions, strategies or opinions."