Next Gen Analytics Strategies: Optimize the Data Pipeline for Big Data
Welcome to the Next Gen Analytics Strategies blog series, in which I dive into five principal data management strategies for intelligent analytics. Today, we’ll examine optimizing a data pipeline for big data. When you architect a data management platform for analytics, certain components are fundamental to building out the data pipeline. These pipelines are becoming increasingly complex: they must handle hundreds of source systems (on-premises and in the cloud) and perhaps thousands of sensors and machines. A data pipeline must collect, integrate, cleanse, prepare, relate, protect, and deliver trusted data at scale and at the speed of business.
As I discussed in my last blog, step one is to catalog your data. Step two is to collect and ingest data into a data lake. Next, you need to integrate, cleanse, and master the data. You also need to curate, enrich, and prepare it for analysis (nowadays in a self-service fashion). And don’t forget: you must protect sensitive data to comply with industry regulations such as GDPR.
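To make these stages concrete, here is a minimal sketch in plain Python (not Informatica code; the field names and masking rule are illustrative assumptions) of the ingest → cleanse → protect flow described above:

```python
# Illustrative pipeline stages: ingest raw records, cleanse them,
# then mask sensitive fields for compliance (e.g., GDPR).

def ingest(raw_records):
    """Collect raw records from a source system into the pipeline."""
    return [dict(r) for r in raw_records]

def cleanse(records):
    """Standardize fields: trim whitespace, normalize case."""
    for r in records:
        r["name"] = r["name"].strip().title()
        r["email"] = r["email"].strip().lower()
    return records

def mask_sensitive(records, fields=("email",)):
    """Mask fields that fall under privacy regulations."""
    for r in records:
        for f in fields:
            if f in r:
                r[f] = "***MASKED***"
    return records

source = [{"name": "  ada LOVELACE ", "email": " Ada@Example.COM "}]
prepared = mask_sensitive(cleanse(ingest(source)))
print(prepared)
```

In a real platform each stage would be a managed, scalable service rather than a function, but the ordering — ingest first, cleanse next, protect before delivery — is the point.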
When architecting your data pipeline, you basically have three choices:
- Hand code the components. This may be a quick task for a POC, but it would take years to deploy into production with enterprise-class capabilities.
- Purchase point solutions. Most likely these weren’t designed to work together, so it’s on you to integrate all of these systems and manage a complex configuration of upgrades (driving up your total cost of ownership).
- Invest project-by-project in an enterprise-class, fully integrated data management platform that covers all your bases.
The right choice is number 3.
Informatica helps you optimize the entire data pipeline with products that scale for big data and increase productivity and efficiency using the CLAIRE™ engine (our AI and machine learning technology). Let’s take a closer look at some of these products to illustrate how Informatica helps you optimize your data pipeline for big data.
Future-proof development work
Of course, you expect Informatica to provide a no-code visual development environment with hundreds of pre-built transforms and connectors. But we’ve also increased developer productivity a hundred-fold with dynamic mapping templates, mass ingestion wizards, and intelligent structure discovery to automatically parse, ingest, and integrate dynamically changing data (such as IoT and social/mobile data).
All your development work is future-proof because the data flows (called mappings in Informatica-speak) are abstracted from the underlying execution engine. Years ago we ran big data workloads on MapReduce, then Hive on Tez, then Blaze (our own distributed processing engine), then Spark, and now Spark Streaming. As processing technologies change and evolve, you don’t have to rebuild your mapping logic. In other words, a mapping you defined, say, five years ago using Informatica Big Data Management (BDM) can run today on Spark without changing anything. If you had hand-coded the mapping logic or used a code generator, you would need to rewrite and retest everything. And to further simplify deployments, Informatica Big Data Management runs in the cloud on serverless architectures (e.g., Amazon AWS, Microsoft Azure), on-premises, and in hybrid environments.
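The idea of separating mapping logic from the execution engine can be sketched in a few lines of Python. This is not BDM’s implementation — just a toy illustration of why the same mapping runs unchanged when the engine underneath is swapped:

```python
# A mapping is just a list of transform steps, defined once.
mapping = [
    lambda rec: {**rec, "amount": rec["amount"] + 5},  # apply a surcharge
    lambda rec: {**rec, "currency": "USD"},            # enrich the record
]

class BatchEngine:
    """Applies each step to the whole record set at once."""
    def run(self, mapping, records):
        for step in mapping:
            records = [step(r) for r in records]
        return records

class StreamingEngine:
    """Applies all steps to one record at a time, as a stream would."""
    def run(self, mapping, records):
        out = []
        for r in records:
            for step in mapping:
                r = step(r)
            out.append(r)
        return out

data = [{"amount": 100}]
# Same mapping, two engines, identical result:
assert BatchEngine().run(mapping, data) == StreamingEngine().run(mapping, data)
```

Because the mapping knows nothing about the engine, retiring one engine for another never touches the transform logic — which is the abstraction the paragraph above describes.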
Now the world is moving toward real-time, event-based stream processing. If you’ve been building your mappings with Informatica Big Data Management, you can, with the flip of a checkbox, run them on Spark Streaming by licensing Informatica Big Data Streaming. Edge Data Streaming, a complementary product, collects real-time data at the edge from sensors, machines, log files, and other IoT devices, then streams it into or out of Kafka and other popular messaging systems and streaming frameworks such as Amazon Kinesis.
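The edge-collect-then-stream pattern looks roughly like this. The sketch below uses Python’s `queue.Queue` as a stand-in for a Kafka topic (the real products speak Kafka and Kinesis; the sensor fields and threshold here are invented for illustration):

```python
import json
import queue

broker = queue.Queue()  # stands in for a Kafka topic

def edge_collect(sensor_readings):
    """Collect readings at the edge and publish them to the broker."""
    for reading in sensor_readings:
        broker.put(json.dumps(reading))

def stream_process():
    """Consume messages and flag readings above a temperature threshold."""
    alerts = []
    while not broker.empty():
        reading = json.loads(broker.get())
        if reading["temp_c"] > 80:
            alerts.append(reading["sensor_id"])
    return alerts

edge_collect([{"sensor_id": "s1", "temp_c": 75},
              {"sensor_id": "s2", "temp_c": 91}])
print(stream_process())  # ['s2']
```

The decoupling matters: the edge collector and the stream processor only share the broker, so either side can scale or change independently.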
Nobody has the patience to wait for their data anymore, now that the success of any digital transformation depends critically on trusted, relevant data. The good news: with data prep tools like Informatica Enterprise Data Lake, enterprise data is available at your fingertips. And because it’s integrated with Informatica Enterprise Data Catalog, data analysts and data scientists can quickly find the data they need and prepare it using a simple, intuitive, Excel-like tool. Now that Informatica Enterprise Data Lake is integrated with Apache Zeppelin, collaboration on data science projects is greatly enhanced. Once the data is prepared, it can be saved and published to the data lake and immediately provisioned for IT deployment, since it auto-generates an Informatica Big Data Management mapping.
Don’t get bogged down in a data swamp
Here is a riddle: What do you get when you dump a ton of data into Hadoop? The answer: A data swamp. Unless you have Informatica Big Data Quality to identify inconsistent, non-standard data and anomalies and then apply data quality rules to cleanse and enrich the data, your data lake will quickly turn into an unmanageable, ungoverned data swamp. Informatica Big Data Quality can even validate and standardize millions of addresses at scale.
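To give a feel for what data quality rules do, here is a toy example (not Big Data Quality itself; the abbreviation table and ZIP rule are illustrative assumptions) of standardizing addresses and flagging anomalies:

```python
import re

# Simple suffix abbreviation table, as a stand-in for real address reference data.
ABBREV = {"STREET": "ST", "AVENUE": "AVE", "ROAD": "RD"}

def standardize_address(addr):
    """Uppercase, collapse whitespace, and normalize street suffixes."""
    addr = re.sub(r"\s+", " ", addr.strip().upper())
    return " ".join(ABBREV.get(w, w) for w in addr.split(" "))

def find_anomalies(records):
    """Flag records whose ZIP code is not exactly five digits."""
    return [r["id"] for r in records if not re.fullmatch(r"\d{5}", r["zip"])]

print(standardize_address("  123 main   Street "))  # 123 MAIN ST
print(find_anomalies([{"id": 1, "zip": "9410"},
                      {"id": 2, "zip": "94105"}]))  # [1]
```

A production tool applies thousands of such rules, backed by postal reference data, at cluster scale — but each rule is conceptually this small: detect, standardize, flag.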
Informatica Relate 360 is unique to Informatica and helps customers significantly increase the value of their big data analytics projects. Out of billions of records representing millions of people, it can identify customers, their households, and their social relationships, enriching customer master data with more contextual information. Then, by layering demographic, geographic, and psychographic data on top of those associations, organizations can uncover an intimate, holistic view of their customers.
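The householding idea can be illustrated with a deliberately simplified sketch (Relate 360 uses far more sophisticated probabilistic matching; the key function and sample records below are invented for illustration):

```python
from collections import defaultdict

def household_key(record):
    """Naive match key: normalized last name plus normalized address."""
    return (record["last_name"].strip().lower(),
            record["address"].strip().lower())

def group_households(records):
    """Group person records sharing a match key into households."""
    households = defaultdict(list)
    for r in records:
        households[household_key(r)].append(r["first_name"])
    return dict(households)

people = [
    {"first_name": "Ana",  "last_name": "Silva", "address": "9 Elm St"},
    {"first_name": "Luis", "last_name": "silva", "address": " 9 Elm St "},
    {"first_name": "Kim",  "last_name": "Park",  "address": "4 Oak Ave"},
]
print(group_households(people))
```

Real matching must tolerate nicknames, typos, and moved households, which is why production tools score candidate pairs probabilistically instead of using an exact key — but the output shape (people rolled up into households) is the same.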
All the products I’ve briefly described in this blog are part of the Informatica Intelligent Data Platform, powered by CLAIRE. This means your entire data pipeline is optimized for big data workloads, all enterprise metadata is unified, previously manual steps are automated, intelligent recommendations guide you through data management tasks, and the overall experience of working with and understanding data is greatly improved.
In my next blog I’ll discuss Next Gen Analytics Strategies: End-to-End Collaborative Data Governance.
Other Posts in the Next Gen Analytics Strategies Series: