Healthcare Data Masking: Tokenization, HIPAA and More

Healthcare Data Masking: A Primer
Healthcare Data Masking: A Primer

When trying to protect your data from the nefarious souls that would like access to it (?), there are several options available that apply to very specific use cases. In order for us to talk about the different solutions – it is important to define all of the terms:

  • PII – Personally Identifiable Information – any data that could potentially identify a specific individual. Any information that can be used to distinguish one person from another and can be used for de-anonymizing anonymous data can be considered PII
  • GSA’s Rules of Behavior for Handling Personally Identifiable Information – This directive provides GSA’s policy on how to properly handle PII and the consequences and corrective actions that will be taken if a breach occurs
  • PHI – Protected Health Information – any information about health status, provision of health care, or payment for health care that can be lined to a specific individual
  • HIPAA Privacy Rule – The HIPAA Privacy Rule establishes national standards to protect individuals’ medical records and other personal health information and applies to health plans, health care clearinghouses, and those health care providers that conduct certain health care transactions electronically.  The Rule requires appropriate safeguards to protect the privacy of personal health information, and sets limits and conditions on the uses and disclosures that may be made of such information without patient authorization. The Rule also gives patients rights over their health information, including rights to examine and obtain a copy of their health records, and to request corrections.
  • Encryption – a method of protecting data by scrambling it into an unreadable form. It is a systematic encoding process which is only reversible with the right key.
  • Tokenization – a method of replacing sensitive data with non-sensitive placeholder tokens. These tokens are swapped with data stored in relational databases and files.
  • Data masking – a process that scrambles data, either an entire database or a subset. Unlike encryption, masking is not reversible; unlike tokenization, masked data is useful for limited purposes. There are several types of data masking:
    • Static data masking (SDM) masks data in advance of using it. Non production databases masked NOT in real-time.
    • Dynamic data masking (DDM) masks production data in real time
    • Data Redaction – masks unstructured content (PDF, Word, Excel)

Each of the three methods for protecting data (encryption, tokenization and data masking) have different benefits and work to solve different security issues . We’ll address them in a bit. For a visual representation of the three methods – please see the table below:


Original Value Encrypted Tokenized Masked
Last Name johnson 8UY%45Sj wjehneo simpson
First Name margaret 3%ERT22##$ owhksoes marge
SSN 585-88-9874 Mh9&o03ms)) 93nmvhf93na 345-79-4444


For protecting PHI data – encryption is superior to tokenization. You encrypt different portions of personal healthcare data under different encryption keys. Only those with the requisite keys can see the data. This form of encryption requires advanced application support to manage the different data sets to be viewed or updated by different audiences. The key management service must be very scalable to handle even a modest community of users. Record management is particularly complicated. Encryption works better than tokenization for PHI – but it does not scale well.

Properly deployed, encryption is a perfectly suitable tool for protecting PII. It can be set up to protect archived data or data residing on file systems without modification to business processes.

  • To protect the data, you must install encryption and key management services to protect the data – this only protects the data from access that circumvents applications
  • You can add application layer encryption to protect data in use
    • This requires changing applications and databases to support the additional protection
    • You will pay the cost of modification and the performance of the application will be impacted


For tokenization of PHI – there are many pieces of data which must be bundled up in different ways for many different audiences. Using the tokenized data requires it to be de-tokenized (which usually includes a decryption process). This introduces an overhead to the process. A person’s medical history is a combination of medical attributes, doctor visits, outsourced visits. It is an entangled set of personal, financial, and medical data. Different groups need access to different subsets. Each audience needs a different slice of the data – but must not see the rest of it. You need to issue a different token for each and every audience. You will need a very sophisticated token management and tracking system to divide up the data, issuing and tracking different tokens for each audience.

Data Masking

Masking can scramble individual data columns in different ways so that the masked data looks like the original (retaining its format and data type) but it is no longer sensitive data. Masking is effective for maintaining aggregate values across an entire database, enabling preservation of sum and average values within a data set, while changing all the individual data elements. Masking plus encryption provide a powerful combination for distribution and sharing of medical information

Traditionally, data masking has been viewed as a technique for solving a test data problem. The December 2014 Gartner Magic Quadrant Report on Data Masking Technology extends the scope of data masking to more broadly include data de-identification in production, non-production, and analytic use cases. The challenge is to do this while retaining business value in the information for consumption and use.

Masked data should be realistic and quasi-real. It should satisfy the same business rules as real data. It is very common to use masked data in test and development environments as the data looks like “real” data, but doesn’t contain any sensitive information.