Wednesday, January 31, 2024

Provenance use in AI

I have been engaged in a few initiatives around AI/ML, both inside healthcare and more broadly. They address a variety of needs, all of which use some variation of Provenance. The following is not a tutorial, but rather an outline of the various ways that Provenance is useful in AI. Useful does not mean that these are currently in use.

  1. Provenance on a dataset that is available for various uses, including being used as a learning dataset.
  2. Provenance on the learning dataset showing where each piece of data came from.
  3. Provenance on a ML model node showing which data influenced this node.
  4. Provenance on an AI output showing which nodes influenced this AI output (decision, observation, derivation, etc)
  5. Provenance on some action taken because of some AI output.
These steps are simplified and generalized. In particular, within various AI/ML architectures the concept of a node is not always identifiable. There is a push to use Provenance to enable explainable and trustworthy AI, one that can explain why an AI output came to be. So, the above presumes that some node(s) in the knowledge model are identifiable.

These Provenance artifacts are also illustrated here purely as provenance details. That is to say that the Provenance does not carry the inputs or outputs, but it certainly points at them. Thus, one can't look to Provenance to embody the "AI output"; that AI output would be encoded in some other artifact.

I also speak of Provenance broadly. Within FHIR, the FHIR Provenance works fine. Outside of FHIR, the W3C PROV model works fine. But it is also possible that one has some other metadata structure that carries the artifacts of Provenance.
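The five levels listed above can be pictured as a chain of records that only point at artifacts. The following is a minimal sketch, not FHIR Provenance or W3C PROV: the `ProvenanceRecord` class, the `trace_to_roots` helper, and all the reference strings are hypothetical names invented for illustration, and the chain is assumed to be free of cycles.

```python
from dataclasses import dataclass, field


@dataclass
class ProvenanceRecord:
    """Points at a target and at its inputs; carries no payload itself."""
    target: str                                        # reference to the described artifact
    sources: list[str] = field(default_factory=list)   # references to what it came from
    how: str = ""                                      # optional note on the derivation


def trace_to_roots(target: str, records: dict[str, ProvenanceRecord]) -> set[str]:
    """Walk back from a target to the artifacts that have no recorded sources."""
    record = records.get(target)
    if record is None or not record.sources:
        return {target}
    roots: set[str] = set()
    for source in record.sources:
        roots |= trace_to_roots(source, records)
    return roots


# Hypothetical chain: action <- AI output <- ML node <- learning dataset <- source datasets
records = {
    r.target: r
    for r in [
        ProvenanceRecord("action/1", ["output/1"], "acted on AI output"),
        ProvenanceRecord("output/1", ["node/7"], "inference"),
        ProvenanceRecord("node/7", ["learning-dataset/A"], "training"),
        ProvenanceRecord("learning-dataset/A", ["dataset/ehr-1", "dataset/ehr-2"], "import"),
    ]
}

print(sorted(trace_to_roots("action/1", records)))
```

Where a model has no identifiable nodes, the `node/7` hop would simply be absent and the output would point straight at the learning dataset.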

Provenance on dataset

This use of Provenance addresses the situation where those looking to train an AI/ML model need data. The data may already be known, but there may be cases where one searches a library of data for appropriate data. Here "appropriate" may include quality indicators, fit-for-use indicators, and authorization rights. These are typical "Provenance What" attributes. There are also the classic provenance attributes: Who owns the data, Where is the data, When was the data collected, and Why was the data collected.

The key here is to identify all the useful attributes that might be needed, and thus profile how that is expressed as part of Provenance. Some use-case needs:
  • How was this data collected? User questionnaire, Survey, Synthetic, Combination, Subset, etc
  • Is there a regulation covering this data?  Indicate the regulation
  • What region was this data collected within?
    • Is the data region locked?
  • Is the data about human subjects?
    • Is there subject authorization? 
    • Is the data de-identified? To what risk level?
  • Use obligations? Must be used in aggregation, must be de-identified, must get individual authorization, must be encrypted, etc
  • Allowed uses vs Forbidden uses?

Note that a source dataset may be derived from other source datasets. This is key to Provenance: being able to say that this data is derived from that data, and by what methodology. In this way a Provenance can indicate that a dataset imports three other datasets. That said, the above What attributes would also need to be combined in appropriate ways. For example, I pull in three EHR datasets with de-identification that supports longitudinal consistency; because the data are de-identified the original HIPAA regulation requirement is eliminated, yet the region covered is expanded. As such, there needs to be the ability to navigate back to the source of this derivation, but that pathway is likely privileged and so not navigable by all users.
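The three-EHR example above can be sketched as a small merge function. This is only an illustration under stated assumptions: the `DatasetAttributes` class, the `derive` function, the dataset names, and the region codes are all hypothetical, and the rule that de-identification lifts HIPAA simply mirrors the example rather than stating any general policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetAttributes:
    name: str
    regions: frozenset[str]
    regulations: frozenset[str]
    deidentified: bool
    derived_from: tuple[str, ...] = ()   # pointers back to the source datasets


def derive(name, sources, deidentify, lifted_by_deid=frozenset({"HIPAA"})):
    """Combine the What attributes of several sources into a derived dataset."""
    if not sources:
        raise ValueError("a derived dataset needs at least one source")
    regions = frozenset().union(*(s.regions for s in sources))        # region expands
    regulations = frozenset().union(*(s.regulations for s in sources))
    if deidentify:
        regulations -= lifted_by_deid    # assumption: mirrors the example in the text
    return DatasetAttributes(
        name=name,
        regions=regions,
        regulations=regulations,
        deidentified=deidentify or all(s.deidentified for s in sources),
        derived_from=tuple(s.name for s in sources),
    )


ehr_a = DatasetAttributes("ehr-a", frozenset({"US-WI"}), frozenset({"HIPAA"}), False)
ehr_b = DatasetAttributes("ehr-b", frozenset({"US-IL"}), frozenset({"HIPAA"}), False)
ehr_c = DatasetAttributes("ehr-c", frozenset({"US-MN"}), frozenset({"HIPAA"}), False)

combined = derive("combined-deid", [ehr_a, ehr_b, ehr_c], deidentify=True)
print(sorted(combined.regions), sorted(combined.regulations), combined.derived_from)
```

The `derived_from` pointers are what a privileged user would follow to navigate back to the sources; access control on that pathway is outside this sketch.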

Provenance on Learning dataset

This is closely related to the Provenance on the source dataset, but the distinction is that the source dataset doesn't always come with Provenance, whereas the learning dataset should know where all of its data came from. Thus, the use-case need here is more classic Provenance, holding simply where the data came from. This is not to say that one can't include the full details, but doing so would be unnecessary if one can navigate from the learning dataset provenance to the source dataset provenance. Being able to navigate from one kind of provenance to the other is a key feature of provenance.

If there is a specific obligation that comes with some source data, this might be traceable using Provenance as well. I would think a simplifying methodology would be to have the obligations managed independently, so that the obligations have their own Provenance back to the source of that obligation. In this way a learning dataset may have a functional obligation that is sourced from more than one source dataset. This is simply one obligation (rule) with many Provenance. 
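The idea of one obligation with many Provenance can be sketched as a simple grouping step. The function name `collect_obligations`, the dataset names, and the obligation labels below are hypothetical, chosen only to illustrate managing obligations independently of the datasets.

```python
from collections import defaultdict


def collect_obligations(source_obligations: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each distinct obligation to every source dataset it came from."""
    by_obligation: dict[str, list[str]] = defaultdict(list)
    for dataset, obligations in source_obligations.items():
        for obligation in obligations:
            by_obligation[obligation].append(dataset)
    return dict(by_obligation)


# Hypothetical obligations attached to three source datasets
sources = {
    "dataset/ehr-1": ["aggregate-only", "encrypt-at-rest"],
    "dataset/ehr-2": ["aggregate-only"],
    "dataset/survey-1": ["encrypt-at-rest", "no-commercial-use"],
}

for obligation, origins in collect_obligations(sources).items():
    print(obligation, "<-", origins)
```

Each key is a single rule the learning dataset must honor, and each list is the set of Provenance pointers back to where that rule originated.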

As in the source dataset discussion around derivation from multiple sources, the learning dataset would have a holistic Provenance that expresses the derived state, in addition to Provenance on each of the datasets that were imported.

Provenance on ML node

I will use the concept of an ML node as an identifiable portion of an ML knowledge model. If there is a very specific ML model concept of a node, that works for me, but I don't intend only that. I also know that some ML models don't have identifiable sub-divisions of the model; in that case Provenance is only possible back to the Provenance on the learning dataset. Thus, identifying an ML node is not always possible, but it certainly is important to explainable and trustworthy AI.

The details of how the node was derived from the identifiable data are likely to be less describable. But where the derivation can be explained, that explanation can be recorded in the Provenance as a how attribute.

Provenance on AI output

An AI model will take some input, apply the current model, and produce some output. This input, current model, and output are clearly attributes for the Provenance of that output. The key use-case here is to track that some output is attributable to AI, and attributable to a given model. One would then also be able to track these outputs by model; thus, if the model is found to be defective, those outputs can be re-evaluated or put into question.
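Tracking outputs by model can be sketched as a filter over Provenance records. The dictionaries below follow the general shape of FHIR R4 Provenance (`target`, `recorded`, `agent.who`, `entity.role`, `entity.what`), but representing the AI model as a `Device` agent is an assumption made for this sketch, and every reference and the `outputs_attributed_to` function are hypothetical.

```python
def outputs_attributed_to(model_ref: str, provenances: list[dict]) -> list[str]:
    """Return references to every output whose Provenance names the given model as agent."""
    affected: list[str] = []
    for prov in provenances:
        agents = {a.get("who", {}).get("reference") for a in prov.get("agent", [])}
        if model_ref in agents:
            affected.extend(t["reference"] for t in prov.get("target", []))
    return affected


# Hypothetical Provenance records; the output itself lives in the referenced Observation
provenances = [
    {
        "resourceType": "Provenance",
        "target": [{"reference": "Observation/ai-1"}],
        "recorded": "2024-01-30T10:00:00Z",
        "agent": [{"who": {"reference": "Device/model-v1"}}],
        "entity": [{"role": "source", "what": {"reference": "DocumentReference/input-1"}}],
    },
    {
        "resourceType": "Provenance",
        "target": [{"reference": "Observation/ai-2"}],
        "recorded": "2024-01-30T11:00:00Z",
        "agent": [{"who": {"reference": "Device/model-v2"}}],
        "entity": [{"role": "source", "what": {"reference": "DocumentReference/input-2"}}],
    },
]

# If model-v1 is later found to be defective, these outputs need re-evaluation
print(outputs_attributed_to("Device/model-v1", provenances))
```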

Here I have put some emphasis on the output being a subject of Provenance, so let me be clear that Provenance itself is not a way to encode the output. As with all artifacts, Provenance presumes that inputs, outputs, agents, algorithms, etc. are all encoded in some relevant and good standard and are able to be referenced by the Provenance.

Provenance on Actions taken because of AI output

This is getting a bit beyond AI/ML, but one uses an AI/ML to do something, and that something is what I am referring to here. I simply indicate that Provenance is applicable here too. So that one can indicate that some action was taken because of some output from an AI.


Provenance is not the core of AI/ML, but the general concept of Provenance is very valuable to the use of AI/ML.

Tuesday, January 30, 2024

VIP Patients in #FHIR

The FHIR security tag `VIP` is used to indicate that a patient's health information is considered to be highly confidential and requires heightened security measures. This may be due to the patient's public profile, occupation, or other factors. VIP is a designation of a person, not a designation of the data. 

To use the VIP security tag, add it to the security tags (the `meta.security` element) of the FHIR resource that contains the patient's health information. For example, the following shows the VIP security tag added to a Patient resource:

{
  "resourceType": "Patient",
  "id": "1234567890",
  "meta": {
    "security": [
      {
        "system": "",
        "code": "VIP"
      }
    ]
  },
  ... other content ...
}

This is an example of tagging the Patient resource to indicate that the patient is a VIP, and thus implies that all the data associated with this Patient needs to be treated as VIP patient data. Once the VIP security tag is added to the Patient, the patient's health information should be treated with heightened security measures. This may include restricting access to the information, encrypting the information, or auditing access to the information.
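A consuming system needs to detect the tag before it can apply heightened measures. The following is a minimal sketch of that check over a Patient resource parsed into a dictionary. The function name `has_security_code` is hypothetical, and because the example above leaves the code system unspecified, this sketch matches on the code alone; a real check should also compare the `system` value.

```python
def has_security_code(resource: dict, code: str) -> bool:
    """True if any meta.security label on the resource carries the given code."""
    labels = resource.get("meta", {}).get("security", [])
    return any(label.get("code") == code for label in labels)


vip_patient = {
    "resourceType": "Patient",
    "id": "1234567890",
    "meta": {"security": [{"system": "", "code": "VIP"}]},
}
ordinary_patient = {"resourceType": "Patient", "id": "42"}

print(has_security_code(vip_patient, "VIP"))       # tagged patient
print(has_security_code(ordinary_patient, "VIP"))  # no meta.security at all
```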

Here are some examples of how the VIP security tag might be used:
  • A hospital might use the VIP security tag to protect the health information of famous patients or patients who are in the public eye.
  • A government agency might use the VIP security tag to protect the health information of high-ranking officials or other sensitive individuals.
  • A research institution might use the VIP security tag to protect the health information of participants in sensitive clinical trials.
It is important to note that the VIP security tag is just one way to indicate that a patient's health information is considered to be highly confidential. There are other security tags that can be used, such as the Confidentiality or Sensitivity security tag codes. The specific security tags that are used will depend on the organization's policies and procedures.

Typically, access to VIP patient data is limited to a subset of the clinical staff, for example by clearance or role. This might be implemented purely in the security infrastructure or might leverage FHIR CarePlan or PractitionerRole. Accesses to VIP patient data will often trigger stricter scrutiny. On a regular basis (e.g. daily) all accesses to VIP patient data are reviewed, and inappropriate accesses are investigated, with potential corrective actions against the user.
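The regular review described above might be sketched as a filter over an access log. Everything here is an assumption for illustration: the log entry shape, the `review_vip_accesses` function, and the idea that "not on the cleared list" is a first-pass signal for investigation. Deciding what is actually inappropriate is a policy matter that a person would still need to judge.

```python
from datetime import date


def review_vip_accesses(access_log, vip_patients, cleared_users, day):
    """Return (all accesses to VIP patients on the day, the subset needing investigation)."""
    reviewed = [
        entry for entry in access_log
        if entry["patient"] in vip_patients and entry["when"] == day
    ]
    flagged = [entry for entry in reviewed if entry["user"] not in cleared_users]
    return reviewed, flagged


# Hypothetical access log entries
access_log = [
    {"user": "Practitioner/1", "patient": "Patient/vip-1", "when": date(2024, 1, 30)},
    {"user": "Practitioner/9", "patient": "Patient/vip-1", "when": date(2024, 1, 30)},
    {"user": "Practitioner/9", "patient": "Patient/other", "when": date(2024, 1, 30)},
    {"user": "Practitioner/9", "patient": "Patient/vip-1", "when": date(2024, 1, 29)},
]

reviewed, flagged = review_vip_accesses(
    access_log,
    vip_patients={"Patient/vip-1"},
    cleared_users={"Practitioner/1"},
    day=date(2024, 1, 30),
)
print(len(reviewed), [entry["user"] for entry in flagged])
```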

Standards for Accounting of Disclosures

I was asked lately if there are standards that support "Accounting of Disclosures". The use-case of Accounting of Disclosures is specific to the USA, but the broader concept is an expected Privacy Principle. The broader concept of an Access Report, or a Report of Data Uses, would inform a data subject of any use of their data, both those that were authorized by the patient (e.g. Consent) and those that were against that authorization. The USA concept of Accounting of Disclosures is a much smaller subset, and in my view a useless subset, as it is made up of only those uses of the data that the patient explicitly authorized outside the normal Treatment, Payment, and healthcare Operations.

So, are there standards? YES. The standards don't produce a human readable report, but rather would provide the raw material that is used to fill out a human readable report. This is an important distinction, although it is a common distinction between technical standard and User Experience. For example, the technical standards for encoding a lab result are not fit for patient consumption, but they are key contributors to the human readable report that is given to the patient. The report includes context setting, and assistance with understanding the details.

Are there interoperability standards?

Yes, there is a long history of Healthcare and general standards that are designed to support Accounting of Disclosures, Access Log, and many other use cases.

  • ASTM E2147 - Set up the concept of security audit logs for healthcare including accounting of disclosures
  • IETF RFC 3881 - Defined the Information Model (IETF rule forced this to be informative)
  • DICOM Audit Log Message - Made the information model Normative, defined Vocabulary, Transport Binding, and Schema
  • IHE ATNA - Defines the grouping with secure transport and access controls; and defined specific audit log records for specific IHE transactions.
  • NIST SP800-92 - Shows how to do audit log management and reporting - consistent with our model
  • HL7 PASS - Defined an Audit Service with responsibilities and a query interface for reporting use
  • ISO 27789 - Defined the subset of audit events that an EHR would need
  • ISO/HL7 10781 EHR System Functional Model Release 2
  • ISO 21089 Trusted End-to-End Information Flows

More specifically does FHIR have this?

Yes, the AuditEvent resource has as a use-case to provide support for Accounting of Disclosures. The AuditEvent resource is a collaboration between HL7, DICOM, and IHE.

In FHIR R4 -

IHE has a relevant Implementation Guide – Basic Audit Log Patterns (BALP)

Within the BALP IG, all of which is relevant to Security/Privacy audit log recording and access to that recording using FHIR, there is a specific profile of the AuditEvent resource for recording a known disclosure.

IHE has a supplement on ATNA that brings in FHIR AuditEvent

With this linkage between FHIR and ATNA, the events can be recorded using FHIR RESTful create, and can be accessed using FHIR search.
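As a rough sketch of the search side, the following builds a FHIR search URL for the AuditEvent records about one patient within a date range, using the `patient` and `date` search parameters defined for AuditEvent in FHIR R4. The base URL, the patient id, and the function name `audit_event_search_url` are hypothetical, and no request is actually sent; recording would be a RESTful create (a POST of an AuditEvent to the same endpoint).

```python
from urllib.parse import urlencode


def audit_event_search_url(base_url: str, patient_ref: str, start: str, end: str) -> str:
    """Build a search URL for AuditEvent records about a patient in a date range."""
    params = [
        ("patient", patient_ref),
        ("date", f"ge{start}"),   # recorded on or after start
        ("date", f"le{end}"),     # recorded on or before end
    ]
    # safe="/" keeps the reference readable; percent-encoding it would be equally valid
    return f"{base_url.rstrip('/')}/AuditEvent?{urlencode(params, safe='/')}"


url = audit_event_search_url(
    "https://fhir.example.org/fhir",   # hypothetical server base
    "Patient/1234567890",
    "2024-01-01",
    "2024-01-31",
)
print(url)
```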

Which brings up ATNA (Audit Trails and Node Authentication) which is the long-standing solution in IHE. 

Further, IHE governance expects that each Profile IHE writes should state how that Profile's transactions would be logged in the audit log. These would be in Volume 2, in the Security Considerations section.

Must I record using ATNA or FHIR AuditEvent?

No. One of the benefits of the supplement adding FHIR AuditEvent to ATNA is that it provides a search mechanism that produces a FHIR Bundle of AuditEvent records. These records do not need to be originally stored in ATNA or FHIR AuditEvent, just made available in FHIR AuditEvent format. This is much like clinical APIs to EHRs, which expose the clinical data as FHIR clinical resources while not mandating that the database format be FHIR.

Thus, a system can record the event using whatever mechanism it wants to, which might be native database and web-server formats.
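Exposing natively stored events in AuditEvent format might look like the sketch below. The native record shape, the function names, and all references are hypothetical. The output follows the general shape of the FHIR R4 AuditEvent resource and uses the `rest` code from the R4 audit-event-type code system as a generic event type; it does not claim to conform to the BALP disclosure profile, which constrains these elements further.

```python
from datetime import datetime, timezone


def to_audit_event(native: dict, observer_ref: str) -> dict:
    """Map a hypothetical native log record onto an R4 AuditEvent-shaped dictionary."""
    return {
        "resourceType": "AuditEvent",
        "type": {
            "system": "http://terminology.hl7.org/CodeSystem/audit-event-type",
            "code": "rest",
        },
        "action": "R",   # read
        "recorded": native["timestamp"].astimezone(timezone.utc).isoformat(),
        "outcome": "0",  # success
        "agent": [{"who": {"reference": native["user"]}, "requestor": True}],
        "source": {"observer": {"reference": observer_ref}},
        "entity": [{"what": {"reference": native["patient"]}}],
    }


def to_search_bundle(events: list[dict]) -> dict:
    """Wrap AuditEvent dictionaries in a searchset Bundle, as a search would return."""
    return {
        "resourceType": "Bundle",
        "type": "searchset",
        "total": len(events),
        "entry": [{"resource": event} for event in events],
    }


native_record = {
    "timestamp": datetime(2024, 1, 30, 14, 5, tzinfo=timezone.utc),
    "user": "Practitioner/42",
    "patient": "Patient/1234567890",
}
bundle = to_search_bundle([to_audit_event(native_record, "Device/ehr-server")])
print(bundle["total"], bundle["entry"][0]["resource"]["recorded"])
```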

Are there implementations of BALP?

Yes: The following commonly used FHIR Servers have BALP implemented within them. You just need to turn it on. For more details:


IHE is a recognized standards organization focusing on profiling standards. The use of AuditEvent is recognized broadly for support of Security and Privacy audit log requirements.