Tag Archives: Data Integration
This post is by Philip Howard, Research Director at Bloor Research.
One of the standard metrics used to support buying decisions for enterprise software is total cost of ownership. Typically, the other major metric is functionality. However, functionality is ephemeral. Not only does it evolve with every new release, but features that are relevant to today’s project may not be applicable to tomorrow’s needs. A broader metric than functionality is capability: how suitable is this product for a range of different project scenarios, and will it support both simple and complex environments?
Earlier this year Bloor Research published some research into the data integration market, which investigated exactly these issues: how often were tools reused, how many targets and sources were involved, and for what sorts of projects were products deemed suitable? We then compared these with total cost of ownership figures that we also captured in our survey. I will be discussing the results of our research live with Kristin Kokie, who is the interim CIO of Informatica, on Guy Fawkes’ Day (November 5th). I don’t promise anything explosive, but it should be interesting and I hope you can join us. The discussion will be vendor neutral (mostly: I expect that Kristin has a degree of bias).
To register for the webinar, click here.
At Western Union, a multi-billion dollar global financial services and communications company, data is recognized as a core asset. Like many other financial services firms, Western Union thrives on data, both for harvesting new business opportunities and for managing its internal operations. And like many other enterprises, Western Union isn’t just ingesting data from relational data sources. It is mining a number of new information-rich sources such as clickstream data and log data. With Western Union’s scale and speed demands, the data pipeline just has to work so the company can optimize the customer experience across multiple channels (e.g. retail, online, mobile) to grow the business.
Let’s level set on how important scale and speed are to Western Union. Western Union processes more than 29 financial transactions every second. Analytical performance simply can’t be the bottleneck for extracting insights from this blazing velocity of data. So to maximize the performance of their data warehouse appliance, Western Union offloaded data quality and data integration workloads onto a Cloudera Hadoop cluster. Using the Informatica Big Data Edition, Western Union capitalized on the performance and scalability of Hadoop while unleashing the productivity of their Informatica developers.
Informatica Big Data Edition enables data driven organizations to profile, parse, transform, and cleanse data on Hadoop with a simple visual development environment, prebuilt transformations, and reusable business rules. So instead of hand coding one-off scripts, developers can easily create mappings without worrying about the underlying execution platform. Raw data can be easily loaded into Hadoop using Informatica Data Replication and Informatica’s suite of PowerExchange connectors. After the data is prepared, it can be loaded into a data warehouse appliance for supporting high performance analysis. It’s a win-win solution for both data managers and data consumers. Using Hadoop and Informatica, the right workloads are processed by the right platforms so that the right people get the right data at the right time.
Using Informatica’s Big Data solutions, Western Union is transforming the economics of data delivery, enabling data consumers to create safer and more personalized experiences for Western Union’s customers. Learn how the Informatica Big Data Edition can help put Hadoop to work for you. And download a free trial to get started today!
Key findings from the report include:
- 65% of organizations cite data processing and integration as hampering distribution capability, with nearly half claiming their existing software and ERP systems are not suitable for distribution.
- Nearly two-thirds of enterprises have some form of distribution process, involving products or services.
- More than 80% of organizations have at least some problem with product or service distribution.
- More than 50% of CIOs in organizations with distribution processes believe better distribution would increase revenue and optimize business processes, with a further 38% citing reduced operating costs.
The core findings: “With better data integration comes better automation and decision making.”
This report is one of many I’ve seen over the years that come to the same conclusion. Most of those involved with the operations of the business don’t have access to the key data points they need; thus they can’t automate tactical decisions, nor can they “mine” the data to understand the true state of the business.
The more businesses deal with building and moving products, the more data integration becomes imperative. As this survey and others have found, the large majority cite “data processing and integration as hampering distribution capabilities.”
Of course, these issues go well beyond Australia. Most enterprises I’ve dealt with have some gap between the need to share key business data, in support of business processes and decision support, and what currently exists in terms of data integration capabilities.
The focus here is on the multiple values that data integration can bring. This includes:
- The ability to track everything as it moves from manufacturing, to inventory, to distribution, and beyond. You can bind these events to core business processes, such as automatically reordering parts to make more products and fill inventory.
- The ability to see into the past and into the future. Emerging approaches to predictive analytics allow businesses to finally see into the future, and to see what went truly right and truly wrong in the past.
While data integration technology has been around for decades, most businesses that both manufacture and distribute products have not taken full advantage of it. The reasons range from perceptions around affordability to the skills required to maintain the data integration flow. However, the truth is that you really can’t afford to ignore data integration technology any longer. It’s time to create and deploy a data integration strategy, using the right technology.
This survey is just an instance of a pattern. Data integration was considered optional in the past. With today’s emerging notions around the strategic use of data, clearly, it’s no longer an option.
The Informatica Cloud team has been busy updating connectivity to Hadoop using the Cloud Connector SDK. Updated connectors are available now for Cloudera and Hortonworks and new connectivity has been added for MapR, Pivotal HD and Amazon EMR (Elastic Map Reduce).
Informatica Cloud’s Hadoop connectivity brings a new level of ease of use to Hadoop data loading and integration. Informatica Cloud provides a quick way to load data from popular on premise data sources and apps such as SAP and Oracle E-Business, as well as SaaS apps, such as Salesforce.com, NetSuite, and Workday, into Hadoop clusters for pilots and POCs. Less technical users are empowered to contribute to enterprise data lakes through the easy-to-use Informatica Cloud web user interface.
Informatica Cloud’s rich connectivity to a multitude of SaaS apps can now be leveraged with Hadoop. Data from SaaS apps for CRM, ERP and other lines of business are becoming increasingly important to enterprises. Bringing this data into Hadoop for analytics is now easier than ever.
Users of Amazon Web Services (AWS) can leverage Informatica Cloud to load data from SaaS apps and on premise sources into EMR directly. Combined with connectivity to Amazon Redshift, Informatica Cloud can be used to move data into EMR for processing and then onto Redshift for analytics.
Self-service data loading and basic integration can be done by less technical users through Informatica Cloud’s drag-and-drop web-based user interface. This enables more of the team to contribute to and collaborate on data lakes without having to learn Hadoop.
Bringing the cloud and Big Data together to put the potential of data to work – that’s the power of Informatica in action.
Free trials of the Informatica Cloud Connector for Hadoop are available here: http://www.informaticacloud.com/connectivity/hadoop-connector.html
I participated in an EDM Council panel on BCBS 239 earlier this month in London and New York. The panel consisted of Chief Risk Officers, Chief Data Officers, and information management experts from the financial industry. BCBS 239 sets out 14 key principles requiring banks to aggregate their risk data so that banking regulators can avoid another 2008-style crisis, with a deadline of January 1, 2016. Earlier this year, the Basel Committee on Banking Supervision released the findings from a self-assessment by the Globally Systemically Important Banks (G-SIBs) of their readiness against 11 of the 14 BCBS 239 principles.
Given all of the investments made by the banking industry to improve data management and governance practices for ongoing risk measurement and management, I was expecting to hear signs of significant progress. Unfortunately, the discussion made clear there is still much work to be done to satisfy BCBS 239. Here is what we discussed in London and New York.
- It was clear that the “Data Agenda” has shifted quite considerably from IT to the business, as evidenced by the number of risk, compliance, and data governance executives in the room. Though it’s a good sign that the business is taking more ownership of data requirements, there was limited discussion of the importance of having capable data management technology, infrastructure, and architecture to support a successful data governance practice: specifically, capable data integration, data quality and validation, master and reference data management, metadata to support data lineage and transparency, and business glossary and data ontology solutions to govern the terms and definitions of required data across the enterprise.
- With regard to accessing, aggregating, and streamlining the delivery of risk data from disparate systems across the enterprise, much of today’s complexity comes from point-to-point integrations that access the same data from the same systems over and over again, creating points of failure and increasing the cost of maintaining the current state. The idea of replacing those point-to-point integrations with a centralized, scalable, and flexible data hub was clearly recognized as a need; however, it is difficult to envision given the enormous work required to modernize the current state.
- Data accuracy and integrity continue to be a concern when generating accurate and reliable risk data to meet normal and stress/crisis reporting requirements. Many in the room acknowledged a heavy reliance on manual methods implemented over the years. Automating data integration and the onboarding of risk data from disparate systems across the enterprise is important under Principle 3; however, much of what’s in place today was built as one-off projects against the same systems, accessing the same data and delivering it to hundreds if not thousands of downstream applications in an inconsistent and costly way.
- Data transparency and auditability were a popular conversation point in the room. The need to provide comprehensive data lineage reports explaining how data is captured, from where, how it’s transformed, and how it’s used remains a concern, despite advancements in technical metadata solutions, which are often not integrated with existing risk management data infrastructure.
- Lastly, there were big concerns regarding the ability to capture and aggregate all material risk data across the banking group and deliver it by business line, legal entity, asset type, industry, region, and other groupings to support identifying and reporting risk exposures, concentrations, and emerging risks. Unfortunately, this master and reference data challenge cannot be solved by external data utility providers, because banks have legal entity, client, counterparty, and securities instrument data residing in existing systems that must be able to cross-reference any external identifier for consistent reporting and risk measurement.
To sum it up, most banks admit they have a lot of work to do. Specifically, they must work to address gaps across their data governance and technology infrastructure. BCBS 239 is the latest and biggest data challenge facing the banking industry, and not just for the G-SIBs: mid-size firms will also be required to provide similar transparency to regional regulators who are adopting BCBS 239 as a framework for their local markets. BCBS 239 is not just a deadline; the principles set forth are a key requirement for banks to ensure they have the right data to manage risk, and to give industry regulators the transparency needed to monitor systemic risk across the global markets. How ready are you?
Which Method of Controls Should You Use to Protect Sensitive Data in Databases and Enterprise Applications? Part II
- Do you need to protect data at rest (in storage), during transmission, and/or when accessed?
- Do some privileged users still need the ability to view the original sensitive data or does sensitive data need to be obfuscated at all levels?
- What is the granularity of controls that you need?
- Datafile level
- Table level
- Row level
- Field / column level
- Cell level
- Do you need to be able to control viewing vs. modification of sensitive data?
- Do you need to maintain the original characteristics / format of the data (e.g. for testing, demo, development purposes)?
- Is response time latency / performance of high importance for the application? This can be the case for mission critical production applications that need to maintain response times in the order of seconds or sub-seconds.
In order to help you determine which method of control is appropriate for your requirements, the following table provides a comparison of the different methods and their characteristics.
A combination of protection methods may be appropriate based on your requirements. For example, to protect data in non-production environments, you may want to use persistent data masking to ensure that no one has access to the original production data, since no one needs it there. This is especially true if your development and testing are outsourced to third parties. In addition, persistent data masking allows you to maintain the original characteristics of the data to ensure test data quality.
In production environments, you may want to use a combination of encryption and dynamic data masking. This is the case if you would like to ensure that all data at rest is protected against unauthorized users, yet you need to protect sensitive fields only for certain sets of authorized or privileged users, but the rest of your users should be able to view the data in the clear.
The best method or combination of methods will depend on each scenario and set of requirements for your environment and organization. As with any technology and solution, there is no one size fits all.
Which Method of Controls Should You Use to Protect Sensitive Data in Databases and Enterprise Applications? Part I
- Which types of data should be protected?
- Which data should be classified as “sensitive?”
- Where is this sensitive data located?
- Which groups of users should have access to this data?
Because these questions come up frequently, it seems ideal to share a few guidelines on this topic.
When protecting the confidentiality and integrity of data, the first level of defense is authentication and access control. However, data with higher levels of sensitivity or confidentiality may require additional levels of protection, beyond regular authentication and authorization methods.
There are a number of control methods for securing sensitive data available in the market today, including:
- Encryption
- Persistent (Static) Data Masking
- Dynamic Data Masking
- Tokenization
- Retention management and purging
Encryption is a cryptographic method of encoding data. There are generally two methods of encryption: symmetric (using a single secret key) and asymmetric (using public and private keys). Although there are methods of deciphering encrypted information without possessing the key, a good encryption algorithm makes it very difficult to decode the encrypted data without knowledge of the key. Key management is usually a central concern with this method of control. Encryption is ideal for mass protection of data (e.g. an entire data file, table, partition, etc.) against unauthorized users.
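The symmetric case can be sketched in a few lines. This is a deliberately toy XOR cipher, written only to show the defining property that the same secret key both encrypts and decrypts; it is not secure, and real systems use vetted algorithms such as AES with proper key management:

```python
import secrets

def xor_cipher(data: bytes, key: bytes) -> bytes:
    """Toy symmetric cipher: XORing twice with the same key round-trips.
    For illustration only -- never use this to protect real data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

key = secrets.token_bytes(16)                    # the single shared secret key
ciphertext = xor_cipher(b"4111-1111-1111-1111", key)
plaintext = xor_cipher(ciphertext, key)          # decryption uses the same key
assert plaintext == b"4111-1111-1111-1111"
```

In an asymmetric scheme the encrypting key (public) and decrypting key (private) would differ, which removes the shared-secret distribution problem at the cost of slower operations.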
Persistent or static data masking obfuscates data at rest in storage. There is usually no way to retrieve the original data – the data is permanently masked. There are multiple techniques for masking data, including shuffling, substitution, aging, encryption, domain-specific masking (e.g. email address, IP address, credit card), dictionary lookup, and randomization. Depending on the technique, there may be ways to perform reverse masking – this should be used sparingly. Persistent masking is ideal for cases where no user should see the original sensitive data (e.g. test/development environments) and field-level data protection is required.
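Two of the techniques named above, substitution and shuffling, can be illustrated with a short sketch. The field names and record layout here are hypothetical; the point is that substitution preserves the format of a value while severing it from the real one, and shuffling preserves a column's value distribution while breaking the link to individual rows:

```python
import random

def mask_email(email: str, i: int) -> str:
    """Substitution: replace the local part with a synthetic value,
    keeping the original format so test data still looks realistic."""
    domain = email.split("@")[1]
    return f"user{i:04d}@{domain}"

def shuffle_column(values: list) -> list:
    """Shuffling: keep the real distribution of values but break the
    association between a value and its original row."""
    shuffled = values[:]
    random.shuffle(shuffled)
    return shuffled

rows = [{"email": "alice@example.com", "salary": 70000},
        {"email": "bob@example.com", "salary": 90000}]

# Persistent masking: the masked copy is what gets stored; the originals
# are discarded, so the operation is effectively irreversible.
masked = [dict(r, email=mask_email(r["email"], i)) for i, r in enumerate(rows)]
salaries = shuffle_column([r["salary"] for r in rows])
```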
Dynamic data masking de-identifies data when it is accessed. The original data is still stored in the database. Dynamic data masking (DDM) acts as a proxy between the application and database and rewrites the user / application request against the database depending on whether the user has the privilege to view the data or not. If the requested data is not sensitive or the user is a privileged user who has the permission to access the sensitive data, then the DDM proxy passes the request to the database without modification, and the result set is returned to the user in the clear. If the data is sensitive and the user does not have the privilege to view the data, then the DDM proxy rewrites the request to include a masking function and passes the request to the database to execute. The result is returned to the user with the sensitive data masked. Dynamic data masking is ideal for protecting sensitive fields in production systems where application changes are difficult or disruptive to implement and performance / response time is of high importance.
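The proxy behavior described above can be sketched as a query-rewriting function. The column name, masking expression, and role check here are all hypothetical stand-ins for what a real DDM product configures declaratively; the sketch only shows the two paths: pass-through for privileged users, rewrite for everyone else:

```python
# Hypothetical mapping of sensitive columns to SQL masking expressions.
SENSITIVE_COLUMNS = {"ssn": "CONCAT('XXX-XX-', RIGHT(ssn, 4))"}

def ddm_proxy(sql: str, user_role: str) -> str:
    """Sketch of a DDM proxy: rewrite the request so sensitive columns
    are wrapped in a masking function, unless the user is privileged."""
    if user_role == "privileged":
        return sql  # pass through unchanged; results return in the clear
    for column, masked_expr in SENSITIVE_COLUMNS.items():
        # Naive textual rewrite for illustration; a real proxy parses SQL.
        sql = sql.replace(column, masked_expr)
    return sql

# A privileged user's query is untouched; an analyst's query is rewritten
# so the database itself returns the masked form.
query = "SELECT name, ssn FROM customers"
print(ddm_proxy(query, "privileged"))
print(ddm_proxy(query, "analyst"))
```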
Tokenization substitutes a sensitive data element with a non-sensitive data element, or token. First-generation tokenization systems require a token server and a database to store the original sensitive data. Because the mapping from clear text to token is arbitrary, it is very difficult to reverse a token back to the original data without the token system. However, the existence of a token server and a database storing the original sensitive data makes them a potential security vulnerability, a bottleneck for scalability, and a single point of failure. Next-generation tokenization systems have addressed these weaknesses. Tokenization does require changes to the application layer to tokenize and detokenize when the sensitive data is accessed. It can be used in production systems to protect sensitive data at rest in the database store, when changes to the application layer can be made relatively easily to perform the tokenization/detokenization operations.
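A first-generation token vault of the kind described above can be sketched as follows. The class and token format are hypothetical; the essential properties are that the token has no mathematical relationship to the original value (unlike encryption), and that detokenizing is impossible without access to the vault itself, which is why the vault is both the security anchor and the single point of failure:

```python
import secrets

class TokenVault:
    """First-generation tokenization sketch: a vault maps each sensitive
    value to a random token; only the vault can reverse the mapping."""
    def __init__(self):
        self._to_token = {}   # sensitive value -> token
        self._to_value = {}   # token -> sensitive value

    def tokenize(self, value: str) -> str:
        if value not in self._to_token:
            # Random token: no key or formula can derive the value from it.
            token = "tok_" + secrets.token_hex(8)
            self._to_token[value] = token
            self._to_value[token] = value
        return self._to_token[value]

    def detokenize(self, token: str) -> str:
        return self._to_value[token]

vault = TokenVault()
t = vault.tokenize("4111-1111-1111-1111")         # application-layer change
assert vault.detokenize(t) == "4111-1111-1111-1111"
assert vault.tokenize("4111-1111-1111-1111") == t  # same value, same token
```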
Retention management and purging is more of a data management method to ensure that data is retained only as long as necessary. The best method of reducing data privacy risk is to eliminate the sensitive data. Therefore, appropriate retention, archiving, and purging policies should be applied to reduce the privacy and legal risks of holding on to sensitive data for too long. Retention management and purging is a data management best practice that should always be put to use.
With that said, the basic approaches to consider are from the top-down, or the bottom-up. You can be successful with either approach. However, there are certain efficiencies you’ll gain with a specific choice, and it could significantly reduce the risk and cost. Let’s explore the pros and cons of each approach.
Approaching data integration from the top-down means moving from the high-level integration flows down to the data semantics. Thus, you define an approach, perhaps even a tool-set (using requirements), and then define the flows, which are decomposed down to the raw data.
The advantages of this approach include:
The ability to spend time defining the higher levels of abstraction without being limited by the underlying integration details. Those charged with designing the integration flows don’t have to deal with the underlying sources and targets until later, as they break down the flows.
The disadvantages of this approach include:
The data integration architect does not consider the specific needs of the source or target systems, in many instances, and thus some rework around the higher level flows may have to occur later. That causes inefficiencies, and could add risk and cost to the final design and implementation.
For the most part, the bottom-up approach is the one most choose for data integration. Indeed, I use this approach about 75 percent of the time. The process is to start from the native data in the sources and targets, and work your way up to the integration flows. This typically means that those charged with designing the integration flows are more concerned with the underlying data semantic mediation than with the flows.
The advantages of this approach include:
It’s typically a more natural and traditional way of approaching data integration. Called “data-driven” integration design in many circles, this initially deals with the details, so by the time you get up to the integration flows there are few surprises, and there’s not much rework to be done. It’s a bit less risky and less expensive, in most cases.
The disadvantages of this approach include:
Starting with the details means that you could get so involved in the details that you miss the larger picture, and the end state of your architecture appears to be poorly planned, when all is said and done. Of course, that depends on the types of data integration problems you’re looking to solve.
No matter which approach you leverage, with some planning and some strategic thinking, you’ll be fine. However, there are different paths to the same destination, and some paths are longer and less efficient than others. As you pick an approach, learn as you go, and adjust as needed.