Monday, October 13, 2025

Modern view on Pseudonymization

For years, the terms 'anonymization' and 'pseudonymization' described distinct technical methods for de-identifying data. But if you're still thinking of them that way, you might be behind the times. Driven by regulations like GDPR and court decisions, the focus has shifted from pseudonymization as a method to 'pseudonymized' as a state of the dataset itself. The key question is who possesses the re-identification method. This subtle change has profound implications.

Ten years ago, I worked on the De-Identification Handbook with IHE and also on the Health Informatics Pseudonymization standard within ISO. At that time the concept of de-identification was broken down into two kinds: there was "anonymization" and there was "pseudonymization". Anonymization had no way to reverse; pseudonymization had some mechanism for reversing the pseudonymization. At the time these were seen as methods, not as descriptions of the resulting dataset. These methods would be used to define how data would be de-identified. The resulting dataset would then be analyzed for its risk of re-identification. That risk would include risks relative to the pseudonymization methodology.

Today IHE is working on updating the De-Identification Handbook. I'm no longer working on that project due to my employment situation. But while I was still working on it, the other subject matter experts were insisting on a very different meaning behind the words "pseudonymization" and "anonymization".

The following podcast by Ulrich Baumgartner really opened my eyes to how these words got a different meaning. They got a different meaning because they are used in a different context. Whereas before the words were used purely as descriptions of methodologies, today they are more dominantly used to describe a dataset that has either been pseudonymized or fully anonymized.

[The Privacy Advisor Podcast] Personal data defined? Ulrich Baumgartner on the implications of the CJEU's SRB ruling #thePrivacyAdvisorPodcast https://podcastaddict.com/the-privacy-advisor-podcast/episode/208363881



Today, because of GDPR, there is a bigger focus on the dataset than on the methodology. GDPR sees "pseudonymized" as describing a dataset that has been pseudonymized but is still in the hands of the organization that possesses the methodology to re-identify. The understanding is contextual: because that organization has the ability to undo the pseudonymization, the data are NOT de-identified. The data become de-identified when the re-identification mechanism is broken, that is to say when the dataset is passed to another party while the re-identification mechanism is NOT passed to that party.

This is the key point that is adding clarity for me. The organization that is using pseudonymization is preparing a dataset to give to someone else; the first-party organization already has the fully identified data, so the pseudonymized data is not something they intend to operate on. It is the NEXT party, the data processor, that gets the dataset and does NOT get the re-identification mechanism. It is this NEXT party that now has de-identified data.

I now understand the new diagram, which drew a distinction between identified data and anonymized data, with data transitioning from Fully-Identified -> Pseudonymized -> Anonymized. When I first saw this diagram, it did not align with the original methodology perspective, but it does follow from this contextual/relative perspective.

Overall, this understanding is consistent with the original "methodology" meaning of the words, but for some reason the GDPR courts needed to say out loud that the FIRST organization doesn't get the benefit of de-identification until they pass the data to the NEXT organization.

There are some arguments within the GDPR community as to whether it is ever possible to make anonymous data out of pseudonymous data, because SOME organization does have access to the re-identification mechanism. As long as someone has that ability, some courts see the data as potentially re-identifiable. That conclusion is not wrong on the blunt facts, but it does not recognize the controls in place to prevent inappropriate use of the re-identification mechanism. The current courts do see that there is a perceived pathway from pseudonymization to anonymization.

Pseudonymization is more like Encryption than Anonymization

The interesting emphasis at this point is that within Europe, under GDPR, pseudonymization of a dataset is much like encryption of a dataset. Both encryption and pseudonymization are seen purely as methodologies for protecting data; neither is a clear methodology for achieving anonymization.

Conclusion

GDPR has placed a different emphasis on pseudonymization, where the default meaning is that the data holder has used pseudonymization methods but still holds the re-identification key. This state of the data was never emphasized in the past, as ultimately the goal of pseudonymization is to produce a dataset that can be passed to another organization that does NOT get the re-identification keys. Whereas in the past we would have said that the other organization got a pseudonymized dataset without the ability to re-identify, GDPR would now say that the other organization got an anonymized dataset.

Friday, October 10, 2025

How are complex trust networks handled in http/REST/OAuth?

 > How are http/REST authorized in complex trust networks handled? 

I don't have all the answers; this has not been worked out. I am not holding back "the" answer, waiting for someone to ask.

In XCA today we use a network of trust (SAML signer certificate authorities and TLS certificate authorities), and the network communication also goes through "trusted intermediaries".

In OAuth there are no "trusted intermediaries". The search parameters and responses are always point-to-point between the one requesting and the one responding. The OAuth token used in that point-to-point request/response has been the hard thing to create. OAuth has a mechanism to "discover" whom a responding service trusts; this is advertised as well-known metadata at that responding service's endpoint. So the requester queries that well-known metadata, and from that data it then needs to figure out a trust arrangement between the requesting OAuth authorities and the responder's trusted OAuth issuers.
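As a sketch of that discovery step, assuming RFC 8414 style authorization-server metadata (the issuer URL and metadata document here are hypothetical examples, not any real deployment):

```python
import json
from urllib.parse import urlsplit, urlunsplit

def well_known_url(issuer: str) -> str:
    """Build the RFC 8414 well-known metadata URL for an issuer.

    RFC 8414 inserts the well-known path segment between the host
    and any existing path component of the issuer identifier.
    """
    parts = urlsplit(issuer)
    path = "/.well-known/oauth-authorization-server" + parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))

# A requester would GET this URL and parse the JSON body; here we
# parse a minimal sample document to show the fields of interest.
sample_metadata = json.loads("""{
  "issuer": "https://auth.example-community.org",
  "token_endpoint": "https://auth.example-community.org/token",
  "grant_types_supported": ["client_credentials",
      "urn:ietf:params:oauth:grant-type:token-exchange"]
}""")

print(well_known_url("https://auth.example-community.org"))
print(sample_metadata["token_endpoint"])
```

The `token_endpoint` and `grant_types_supported` fields are what the requester would use to decide how (and whether) it can obtain a token the responder will trust.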

A. Where no trusted third party is needed

The majority case, used very often today, is that the well-known OAuth metadata can be directly used by the client. The client asks that OAuth authority to create a new token, given the requester's token, for authorization to access the responder's system.

THIS is what everyone is doing today with client/server FHIR RESTful. This is what everyone targets to get their system working with OAuth.

The token has some lifetime and scope, and is used for multiple request/response exchanges. Again, this is normal for all uses of OAuth.
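A minimal sketch of the (A) token request, assuming a plain client_credentials grant and a SMART-style scope (the endpoint and scope are hypothetical; a real deployment would also authenticate the client, for example with a signed JWT assertion):

```python
from urllib.parse import urlencode

# Hypothetical token endpoint, as discovered from well-known metadata.
token_endpoint = "https://auth.example-community.org/token"

# Form body for a client_credentials grant; this is what would be
# POSTed to the token endpoint over TLS.
form = {
    "grant_type": "client_credentials",
    "scope": "system/Patient.read",  # assumed SMART-on-FHIR style scope
}
body = urlencode(form)
print(body)
```

The response would carry an access token plus its lifetime (`expires_in`), which is why the same token can be reused across many request/response exchanges.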

B. Where a trusted third party is needed

The hard work comes in where the requester does not have a trust relationship with that responder-defined OAuth authority, as in our use-cases where the requester and responder are in different communities. As with XCA, some trust authority is needed; and as with XCA, discovering who that trust authority is, is the job of directory services.


Ultimately the requesting system finds a trusted OAuth issuer and asks that a new token, given the requesting system's token, be generated targeting the responding system. Once this token is issued, the requester can do http/REST/FHIR directly to the responding service endpoint, using the internet for routing, with that last OAuth token. The responding system can verify that the OAuth token is valid.

In the healthcare scenario we might want to force an unusual nesting of prior tokens. In this way the responding service can record who made the request, why, and from where it came. This nesting is not typical and is considered complex to implement and parse.

see: OAuth 2.0 Token Exchange (RFC 8693)
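A sketch of what an RFC 8693 token-exchange request body looks like, assuming access tokens on both sides (the token values and resource URL are placeholders; the `actor_token` parameter is one way RFC 8693 supports the kind of nesting described above):

```python
from typing import Optional
from urllib.parse import urlencode

TOKEN_EXCHANGE = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"

def exchange_request(subject_token: str, resource: str,
                     actor_token: Optional[str] = None) -> str:
    """Build the form body for an RFC 8693 token-exchange request.

    subject_token carries the requester's existing token; the optional
    actor_token lets the issued token record the party acting on the
    subject's behalf (the nesting discussed above).
    """
    form = {
        "grant_type": TOKEN_EXCHANGE,
        "subject_token": subject_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "resource": resource,  # the responding service endpoint
    }
    if actor_token is not None:
        form["actor_token"] = actor_token
        form["actor_token_type"] = ACCESS_TOKEN_TYPE
    return urlencode(form)

# Plain exchange, and an exchange that also names an acting party.
body = exchange_request("requester-token", "https://fhir.responder.example/")
nested = exchange_request("requester-token", "https://fhir.responder.example/",
                          actor_token="intermediary-token")
print(body)
```

Each trusted issuer in the chain would receive a request like this and decide, for itself, whether to issue the new token.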

C. Where multiple trusted third parties are needed

I think that the (B) solution can be iterated or recursed on indefinitely.

SO:

The main point of OAuth is that you get a new OAuth token issued for a given target/scope based on the OAuth token that you have. EACH OAuth authority makes a permit-or-deny decision; hence an issued OAuth token is always a statement of authorization. If you were not authorized, you would not be issued a token.

In this way the authorization is established up-front, and the data transactions reuse that token until it expires. The up-front authorization may be expensive, but that token may be reused 1000 times in the 60 seconds it is good for (simplified for illustration's sake).
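That reuse pattern can be sketched as a simple token cache; the issuer callable, lifetime, and skew here are illustrative assumptions, not any particular library's API:

```python
import time
from typing import Callable, Optional, Tuple

class TokenCache:
    """Reuse an issued token until shortly before it expires, so the
    (expensive) authorization decision happens up-front only."""

    def __init__(self, issue_token: Callable[[], Tuple[str, float]],
                 skew: float = 5.0):
        self.issue_token = issue_token  # returns (token, lifetime_seconds)
        self.skew = skew                # refresh a little before expiry
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = time.monotonic()
        if self._token is None or now >= self._expires_at - self.skew:
            self._token, lifetime = self.issue_token()
            self._expires_at = now + lifetime
        return self._token

# Fake issuer to illustrate reuse: 1000 requests, one issuance.
calls = {"n": 0}
def fake_issuer():
    calls["n"] += 1
    return f"token-{calls['n']}", 60.0

cache = TokenCache(fake_issuer)
tokens = {cache.get() for _ in range(1000)}
print(calls["n"], tokens)  # → 1 {'token-1'}
```

All 1000 requests are served by the single token issued up-front, which is the economics the paragraph above describes.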

Caveat Emptor

I have no idea if the above is right. I think it is close, but I don't know.

I welcome commenters to correct me, especially if they can point at standards profiles that have been established, and especially if those standards profiles are established in general IT, not specific to healthcare. I am suspicious of healthcare experts who invent healthcare-specific standards profiles.