Tag Archives: tokenization
- PII – Personally Identifiable Information – any data that could potentially identify a specific individual. Any information that can be used to distinguish one person from another and can be used for de-anonymizing anonymous data can be considered PII
- GSA’s Rules of Behavior for Handling Personally Identifiable Information – This directive provides GSA’s policy on how to properly handle PII and the consequences and corrective actions that will be taken if a breach occurs
- PHI – Protected Health Information – any information about health status, provision of health care, or payment for health care that can be linked to a specific individual
- HIPAA Privacy Rule – The HIPAA Privacy Rule establishes national standards to protect individuals’ medical records and other personal health information and applies to health plans, health care clearinghouses, and those health care providers that conduct certain health care transactions electronically. The Rule requires appropriate safeguards to protect the privacy of personal health information, and sets limits and conditions on the uses and disclosures that may be made of such information without patient authorization. The Rule also gives patients rights over their health information, including rights to examine and obtain a copy of their health records, and to request corrections.
- Encryption – a method of protecting data by scrambling it into an unreadable form. It is a systematic encoding process which is only reversible with the right key.
- Tokenization – a method of replacing sensitive data with non-sensitive placeholder tokens. These tokens are swapped with data stored in relational databases and files.
- Data masking – a process that scrambles data, either an entire database or a subset. Unlike encryption, masking is not reversible; unlike tokenization, masked data is useful for limited purposes. There are several types of data masking:
- Static data masking (SDM) masks data in advance of its use; non-production databases are masked ahead of time, not in real time
- Dynamic data masking (DDM) masks production data in real time
- Data Redaction – masks unstructured content (PDF, Word, Excel)
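The tokenization approach described above can be sketched as a token vault that swaps sensitive values for random placeholders. This is an illustrative, minimal sketch, not any vendor's API; class and method names are assumptions.

```python
import secrets

class TokenVault:
    """Minimal tokenization sketch: swap sensitive values for random tokens.

    The vault keeps the token -> value mapping so that authorized callers
    can de-tokenize later. Names here are illustrative only.
    """

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so the same value always maps to one token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "TKN-" + secrets.token_hex(8)
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        return self._token_to_value[token]

vault = TokenVault()
t = vault.tokenize("123-45-6789")
assert vault.detokenize(t) == "123-45-6789"    # reversible via the vault
assert vault.tokenize("123-45-6789") == t      # consistent per value
```

The key property is that the token itself carries no sensitive information; only the vault, kept under separate access control, can map it back.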
Each of the three methods for protecting data (encryption, tokenization, and data masking) has different benefits and solves different security problems; we'll address them shortly. For a visual comparison of the three methods, see the table below:

| Method | Reversible? | Format preserved? | Typical use |
|---|---|---|---|
| Encryption | Yes, with the right key | No | Archived data and data at rest; application-layer protection |
| Tokenization | Yes, via the token mapping | Yes | Sensitive fields (e.g., PCI data) in databases and files |
| Data masking | No | Yes | Test, development, and analytic data |
For protecting PHI, encryption is superior to tokenization: different portions of personal healthcare data are encrypted under different encryption keys, and only those with the requisite keys can see the data. This form of encryption requires advanced application support to manage the different data sets viewed or updated by different audiences, and the key management service must be very scalable to handle even a modest community of users. Record management is particularly complicated. Encryption works better than tokenization for PHI, but it does not scale well.
Properly deployed, encryption is a perfectly suitable tool for protecting PII. It can be set up to protect archived data or data residing on file systems without modification to business processes.
- To protect data at rest, you must install encryption and key management services – this only protects the data from access that circumvents applications
- You can add application layer encryption to protect data in use
- This requires changing applications and databases to support the additional protection
- You will pay the cost of modification, and application performance will be impacted
For tokenization of PHI, many pieces of data must be bundled up in different ways for many different audiences. Using the tokenized data requires de-tokenizing it (which usually includes a decryption process), and this introduces overhead. A person's medical history is a combination of medical attributes, doctor visits, and outsourced visits – an entangled set of personal, financial, and medical data. Different groups need access to different subsets; each audience needs a different slice of the data but must not see the rest of it. You therefore need to issue a different token for each audience, and a very sophisticated token management and tracking system to divide up the data, issuing and tracking the different tokens per audience.
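The per-audience token requirement can be sketched as a tokenizer keyed by (audience, value) pairs, so the same patient record yields uncorrelated tokens for each group. This is a hypothetical sketch; audience names and methods are illustrative assumptions.

```python
import secrets

class AudienceTokenizer:
    """Sketch of per-audience tokenization for PHI (illustrative only).

    Each audience (billing, research, ...) receives its own token for the
    same underlying value, so tokens cannot be correlated across audiences.
    """

    def __init__(self):
        self._tokens = {}   # (audience, value) -> token
        self._values = {}   # (audience, token) -> value

    def tokenize(self, audience: str, value: str) -> str:
        key = (audience, value)
        if key not in self._tokens:
            token = secrets.token_hex(8)
            self._tokens[key] = token
            self._values[(audience, token)] = value
        return self._tokens[key]

    def detokenize(self, audience: str, token: str) -> str:
        # An audience can only resolve tokens that were issued to it.
        return self._values[(audience, token)]

tok = AudienceTokenizer()
t_billing = tok.tokenize("billing", "patient-001")
t_research = tok.tokenize("research", "patient-001")
assert t_billing != t_research      # same value, different token per audience
```

Even this toy version shows why the tracking burden grows quickly: the mapping table scales with audiences times values, which is the management overhead the paragraph above describes.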
Masking can scramble individual data columns in different ways so that the masked data looks like the original (retaining its format and data type) but is no longer sensitive. Masking is effective for maintaining aggregate values across an entire database, enabling preservation of sum and average values within a data set while changing all the individual data elements. Masking plus encryption provides a powerful combination for distribution and sharing of medical information.
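One way masking can preserve aggregates while changing individual elements is shuffling: values are permuted across rows so no row keeps its true value, yet column-level sums and averages are untouched. This is a minimal sketch of one masking technique, not a full masking product.

```python
import random

def shuffle_mask(column, seed=None):
    """Mask a column by shuffling its values across rows (a sketch).

    Each row loses its true value, but column-level aggregates such as
    sum and average are preserved exactly, as are format and data type.
    """
    masked = list(column)
    random.Random(seed).shuffle(masked)
    return masked

salaries = [52000, 67000, 49000, 88000]
masked = shuffle_mask(salaries, seed=7)
assert sum(masked) == sum(salaries)            # aggregates preserved
assert sorted(masked) == sorted(salaries)      # same values, reassigned rows
```

Real masking tools combine techniques like this with substitution and format-preserving scrambling, since shuffling alone can still leak values when cardinality is very low.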
Traditionally, data masking has been viewed as a technique for solving a test data problem. The December 2014 Gartner Magic Quadrant Report on Data Masking Technology extends the scope of data masking to more broadly include data de-identification in production, non-production, and analytic use cases. The challenge is to do this while retaining business value in the information for consumption and use.
Masked data should be realistic and quasi-real. It should satisfy the same business rules as real data. It is very common to use masked data in test and development environments as the data looks like “real” data, but doesn’t contain any sensitive information.
Column-oriented Database Management Systems (CDBMS), also referred to as columnar databases and CBAT, have been getting a lot of attention recently in the data warehouse marketplace and trade press. Interestingly, some of the newer companies offering CDBMS-based products give the impression that this is an entirely new development in the RDBMS arena. This technology has actually been around for quite a while. But the market has only recently started to recognize the many benefits of CDBMS. So, why is CDBMS now coming to be recognized as the technology that offers the best support for very large, complex data warehouses intended to support ad hoc analytics? In my opinion, one of the fundamental reasons is the reduction in I/O workload that it enables.
Columnar Deduplication and Column Tokenization: Improving Database Performance, Security and Interoperability
For some time now, a special technique called columnar deduplication has been implemented by a number of commercially available relational database management systems. In today’s blog post, I discuss the nature and benefits of this technique, which I will refer to as column tokenization for reasons that will become evident.
Column tokenization is a process in which a unique identifier (called a Token ID) is assigned to each unique value in a column, and then employed to represent that value anywhere it appears in the column. Using this approach, data size reductions of up to 50% can be achieved, depending on the number of unique values in the column (that is, on the column’s cardinality). Some RDBMSs use this technique simply as a way of compressing data; the column tokenization process is integrated into the buffer and I/O subsystems, and when a query is executed, each row must be materialized and the Token IDs replaced by their corresponding values. At Informatica, column tokenization is the core of the technology behind the File Archive Service (FAS), part of the Information Lifecycle Management product family: the tokenized structure is actually used during query execution, with row materialization occurring only when the final result set is returned. We also use special compression algorithms to achieve further size reduction, typically on the order of 95%.
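The core mechanism is dictionary encoding, and it can be sketched in a few lines. This is an illustrative model of the technique, not FAS's actual implementation.

```python
def tokenize_column(values):
    """Dictionary-encode a column: one Token ID per unique value (a sketch).

    Low-cardinality columns compress well because each repeated value is
    stored once in the dictionary and referenced by a small integer ID.
    """
    dictionary = {}      # value -> Token ID
    token_ids = []
    for v in values:
        if v not in dictionary:
            dictionary[v] = len(dictionary)
        token_ids.append(dictionary[v])
    return dictionary, token_ids

def materialize(dictionary, token_ids):
    # Rebuild the original rows only when the final result set is needed.
    reverse = {tid: v for v, tid in dictionary.items()}
    return [reverse[t] for t in token_ids]

col = ["NY", "CA", "NY", "NY", "CA", "TX"]
d, ids = tokenize_column(col)
assert len(d) == 3                    # cardinality = number of unique values
assert materialize(d, ids) == col     # lossless round trip
```

A query such as a filter or group-by can operate directly on the integer Token IDs, which is why deferring materialization until the final result set saves so much I/O.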
Personally Identifiable Information is under attack like never before. Recently, two prominent institutions were attacked. What happened:
- A data breach at a major U.S. insurance company exposed over a million of its policyholders to identity fraud. The data stolen included Personally Identifiable Information such as names, Social Security numbers, driver’s license numbers, and birth dates. In addition to the cost to Nationwide of providing identity fraud protection to policyholders, this breach is creating fears that class action lawsuits will follow.
In a May 2012 survey by the Ponemon Institute, 66 percent said they are not confident their organization would be able to detect the loss or theft of sensitive personal information contained in systems operated by third parties, including cloud providers. In addition, the majority are not confident that their organization would be able to detect the loss or theft of sensitive personal information in their company’s production environment.
Which aspect of data security for your cloud solution is most important?
1. Is it to protect the data in copies of production/cloud applications used for test or training purposes? For example, do you need to secure data in your Salesforce.com Sandbox?
2. Is it to protect the data so that a user will see data based on her/his role, privileges, location and data privacy rules?
3. Is it to protect the data before it gets to the cloud?
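The second scenario, role-based dynamic masking, can be sketched as a mask applied on read, chosen by the caller's role. Roles, field names, and the masking rule here are illustrative assumptions, not any product's policy model.

```python
def mask_ssn(ssn: str) -> str:
    """Show only the last four digits, preserving the SSN format."""
    return "XXX-XX-" + ssn[-4:]

def view_record(record: dict, role: str) -> dict:
    """Sketch of dynamic data masking: mask on read, based on role.

    Privileged roles see the full record; everyone else gets a masked copy.
    The underlying stored data is never modified.
    """
    if role == "privacy_officer":
        return dict(record)             # full, unmasked view
    redacted = dict(record)
    redacted["ssn"] = mask_ssn(record["ssn"])
    return redacted

rec = {"name": "Pat Doe", "ssn": "123-45-6789"}
assert view_record(rec, "support")["ssn"] == "XXX-XX-6789"
assert view_record(rec, "privacy_officer")["ssn"] == "123-45-6789"
```

Because the mask is applied at read time, the same production record can safely serve audiences with different privilege levels.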
Compliance continues to drive people to action; in particular, compliance with contractual agreements for cloud infrastructure continues to drive investment. In addition, many organizations are supporting Salesforce.com as well as packaged solutions such as Oracle eBusiness, PeopleSoft, SAP, and Siebel.
Of the available data protection solutions, tokenization has been used and is well known for supporting PCI data and preserving the format and width of a table column. But because many tokenization solutions today require creating database views or changing application source code, it has been difficult for organizations to support packaged applications that don’t allow these changes. In addition, databases and applications take a measurable performance hit to process tokens.
What might work better is to dynamically tokenize data before it gets to the cloud: a transparent layer between the cloud and on-premise data integration would replace the sensitive data with tokens. In this way, no additional application code would be required.
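Such a transparent layer can be sketched as an outbound filter that swaps sensitive fields for tokens, keeping the mapping on-premise. Field names and the vault structure are illustrative assumptions.

```python
import secrets

SENSITIVE_FIELDS = {"ssn", "dob"}   # illustrative choice of sensitive fields

def tokenize_outbound(record, vault):
    """Sketch of a transparent on-premise layer that swaps sensitive
    fields for tokens before the record is sent to a cloud application.

    The token -> value mapping (the vault) never leaves the premises.
    """
    outbound = {}
    for field, value in record.items():
        if field in SENSITIVE_FIELDS:
            token = secrets.token_hex(8)
            vault[token] = value        # mapping stays on-premise
            outbound[field] = token
        else:
            outbound[field] = value
    return outbound

vault = {}
cloud_record = tokenize_outbound({"name": "Pat", "ssn": "123-45-6789"}, vault)
assert cloud_record["ssn"] != "123-45-6789"     # cloud never sees the real SSN
assert vault[cloud_record["ssn"]] == "123-45-6789"
```

Because the substitution happens in the integration layer, the cloud application stores and processes only tokens, with no changes to its own code.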
In the Ponemon survey, most said the best control is to dynamically mask sensitive information based on the user’s privilege level. After dynamically masking sensitive data, people said encrypting all sensitive information contained in the record is the best option.
The strange thing is that people recognize there is a problem but are not spending accordingly. In the same survey from Ponemon, 69% of organizations find it difficult to restrict user access to sensitive information in IT and business environments. However, only 33% say they have adequate budgets to invest in the necessary solutions to reduce the insider threat.
Is this an opportunity for you?
Hear Larry Ponemon discuss the survey results in more detail during a CSOonline.com/Computerworld webinar, Data Privacy Challenges and Solutions: Research Findings with Ponemon Institute, on Wednesday, June 13.
Recently, Oracle announced that its latest April critical patch update does not address the TNS Poison vulnerability uncovered by a researcher four years ago. In addition to this vulnerability from an attacker, organizations face data breaches from internal negligence and insiders. In a May 2012 survey by the Ponemon Institute, 50% say sensitive data contained in databases and applications has been compromised or stolen by malicious insiders such as privileged users. On top of that, 68% find it difficult to restrict user access to sensitive information in IT and business environments.
While databases offer basic security features that can be programmed and configured to protect data, they may not be enough and may not scale as your organization grows. The problem stems from the fact that application development and DBA teams need a solid understanding of each database vendor’s specific offerings to ensure that the security feature has been properly set up and deployed. If your organization has a number of different databases (Oracle, DB2, Microsoft SQL Server) and that number is growing, it can be costly to maintain all the database-specific solutions. Many Informatica customers have faced this problem and looked to Informatica to provide a complete, end-to-end solution that addresses database security on an enterprise-wide level.
Come talk to us at Informatica World and hear from our customers about how they’ve used Informatica to minimize the risk of breaches across a number of use cases including:
– Test data management
– Production support in off-shore projects
– Dynamically protecting PII or PHI data for research portals
– Dynamically protecting data in cross-border applications
At Informatica World, you can meet us in our sessions on Thursday, May 17, at the Aria in Las Vegas:
10:10 – 11:10 – Ensuring Data Privacy for Warehouses and Applications with Informatica Data Masking in Room Juniper 3
11:20 – 12:20 – Protecting Sensitive Data Using Informatica’s Test Data Management Solution in Room Starvine 12
Also come to the Informatica Data Privacy booth and lab for in depth demonstrations and presentations of our data privacy solutions and customer deployments.
Data breaches in healthcare have increased 32 percent in the past year and have cost the industry an estimated $6.5 billion annually, according to the Ponemon Institute. These breaches were largely attributed to employee handling of data and the increasing use of mobile devices. Forty-one percent of healthcare executives surveyed attributed data breaches related to protected health information (PHI) to employee mistakes. Half of the respondents said their organization does nothing to protect the information contained on mobile devices. “Healthcare data breaches are an epidemic,” said Dr. Larry Ponemon, chairman and founder, Ponemon Institute, in an announcement of the study results.
Why are healthcare data breaches becoming more common?
PHI data is in all production and test systems, as well as in the numerous copies of production systems created for test, training, and application development purposes. Beyond these systems, PHI data lives on servers inside and outside of the organization. As more mobile devices are used to access critical patient data, and doctors use their mobile devices to address medical issues from all over the country (if not the world), more sensitive patient data is exposed. In addition to PHI data such as Social Security numbers, a lot of the sensitive data that healthcare organizations hold is contained in textual notes, so the textual data also needs to be protected. And patient data needs to be protected not only within the hospital or healthcare organization: as patient data is used for clinical trial and research purposes, it is important to protect the data that leaves the organization.
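Protecting textual notes usually means redaction of identifiers in free text. A minimal pattern-based sketch is shown below; the patterns are illustrative only, and production redaction needs far broader coverage (names, addresses, dates, and so on).

```python
import re

# Illustrative patterns only; real PHI redaction requires much more coverage.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PHONE_RE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

def redact_note(text: str) -> str:
    """Sketch of data redaction for unstructured clinical notes."""
    text = SSN_RE.sub("[SSN REDACTED]", text)
    text = PHONE_RE.sub("[PHONE REDACTED]", text)
    return text

note = "Patient SSN 123-45-6789, callback 555-867-5309."
assert redact_note(note) == "Patient SSN [SSN REDACTED], callback [PHONE REDACTED]."
```

Redaction like this is what lets notes travel with research or clinical-trial data sets without carrying the identifiers along.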
To address these concerns, Informatica has seen organizations move towards an end-to-end, enterprise wide data privacy solution that enables them to:
– Consistently define sensitive data and set data privacy policies
– Identify where sensitive data lives throughout the organization
– Create subsets of production data for testing purposes, greatly reducing costs of managing test data (reducing hardware and software)
– Mask data according to all required PHI rules
– Report / provide audit trail that data has been masked and data is secure
Maintaining many individual privacy solutions can be both costly and risky. An enterprise-wide solution centralizes data privacy management, streamlining development and ongoing maintenance.
For more information on healthcare privacy challenges and how to address them, please join us in our upcoming webinar.