Big Data: The Current State of ETL into and on Hadoop

In my previous post on this subject, I talked about the innovations of Hadoop as a new analytics engine and the innovations of Informatica in removing unmaintainable and complex hand-coding. In this post I want to drill into Informatica ETL and Hadoop to show why these two innovations are critical to augmenting traditional data processing approaches as companies begin to leverage Big Data for new analytics.

The good news is that Informatica now supports ETL for Hadoop (with some transformations on Hadoop currently in beta).

This means that companies can unleash the power of big data across the enterprise by leveraging the 100,000+ Informatica-trained developers available around the globe.  Organizations can now leverage Hadoop to cost-effectively store and process massive amounts of data on the order of petabytes without having to resort to the bad habits of hand-coding ETL.

As an example, Informatica customers like eHarmony are realizing the cost savings and benefits of using Informatica for ETL on Hadoop. Mike Olson (CEO of Cloudera) talked about this joint customer success during his keynote at Informatica World in June. Further details can be found in the InformationWeek article highlighting the eHarmony approach. eHarmony was able to increase customer subscriptions and reduce operational costs while reducing ETL processing time by 4X.

However, companies have come to realize that while Hadoop offers tremendous performance improvements and cost savings for big data ETL processing, it does not replace the information management systems in place today. An IT organization may choose to adopt Hadoop to supplement its existing systems for processing tens of terabytes or petabytes of multi-structured data in batch. It will still continue to use its data warehouse to support hundreds of concurrent users with fast query response times, and a more traditional grid computing architecture for near real-time ETL processing.

  • Informatica is the world’s number one independent provider of data integration software and, with its current beta release, provides the richest capabilities for ETL on Hadoop. Since Hadoop complements your information management architecture, data needs to flow efficiently and without bottlenecks from source systems, through ETL processing executed on Hadoop, to the target applications that receive the results. The first thing you need is the ability to consistently access big data without reinventing the wheel and without resorting to hand-coding or specialized knowledge about the source systems. This is the power of innovation in removing hand-coding.
  • Informatica provides near universal access, inside and outside the firewall, to both big transaction data (including relational databases, legacy mainframe systems, and cloud applications) and big interaction data (such as social media, web logs, market data, machine device data, and industry standards such as FIX, SWIFT, ACORD, NACHA, HL7, HIPAA, and EDI). Regardless of the complexity of the underlying technology, end users can concentrate on business logic without worrying about data access issues.
  • Informatica can load data into Hadoop in batch, trickle-feed, high-speed replication, or real-time streaming mode. You can choose the mode that best fits the volume, type, and usage patterns of your data. For example, when offloading data from source systems in bulk, you can use Informatica to replicate hundreds of gigabytes or terabytes an hour into Hadoop. You may decide to extract data in batch or changed-data-capture (CDC) mode, push filtering down to the source system, and pre-process data prior to loading into Hadoop. If data is generated in real time, such as from web sites or machine sensors, you can stream millions of records per second from web logs, machine devices, and market data directly into Hadoop.
  • With the beta release, Informatica provides an extensive library of prebuilt transformation capabilities, ranging from basic data type conversions and string manipulations, high-performance caching-enabled lookups, joiners, sorters, routers, and aggregations to more complex natural language processing, multi-structured data parsing, data quality rules, and data matching.
  • Informatica delivers these transformations with a model- and metadata-driven graphical development environment so developers can rapidly build ETL data flows for Hadoop. The current beta release enables ETL data flows to be easily deployed on Hadoop without any specialized source and target system knowledge or Hadoop expertise. In other words, ETL developers can focus on the data and transformation logic without having to worry about whether the ETL process is deployed on Hadoop or traditional processing platforms.
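
To make the extraction options in the list above a little more concrete, here is a minimal hand-written sketch of incremental extraction with the filter pushed down to the source. It is not Informatica's API or how the product implements CDC. It is a simplified, timestamp-based stand-in written in plain Python against an in-memory SQLite database, and the `orders` table, its columns, and the function names are all hypothetical. The point is only that the `WHERE` clause runs inside the source database, so only changed and relevant rows ever leave it.

```python
import sqlite3


def extract_changed_rows(conn, last_watermark, min_amount):
    """Fetch rows changed after last_watermark, filtered at the source.

    Both conditions are part of the SQL statement, so the source database
    does the filtering (the "pushdown") before any rows are transferred.
    Assumption: the hypothetical `orders` table has an `updated_at` column
    holding ISO-8601 strings, which sort correctly as plain text.
    """
    query = (
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? AND amount >= ? "
        "ORDER BY updated_at"
    )
    rows = conn.execute(query, (last_watermark, min_amount)).fetchall()
    # Advance the watermark to the newest row extracted, if there was one.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark


def build_demo_source():
    """Create a small in-memory source table for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 5.0, "2012-01-01T00:00:00"),
            (2, 50.0, "2012-01-02T00:00:00"),
            (3, 75.0, "2012-01-03T00:00:00"),
            (4, 2.0, "2012-01-04T00:00:00"),
        ],
    )
    return conn


if __name__ == "__main__":
    source = build_demo_source()
    batch, watermark = extract_changed_rows(source, "2012-01-01T12:00:00", 10.0)
    for row in batch:
        print(row)  # in a real pipeline, these rows would be written to Hadoop
    print("next watermark:", watermark)
```

Even this toy version has to track a watermark, assume a reliable timestamp column, and ignore deletes, which is the kind of hand-coded plumbing the tooling described above is meant to take off the developer's plate.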

Companies are adopting Informatica for ETL on Hadoop across industries and for many types of big data projects.  A very common use-case is offloading data storage and pre-processing from expensive database and data warehouse platforms to Hadoop for staging and ETL. Financial services companies are improving their fraud detection processes and risk and portfolio analysis. Telcos are processing massive volumes of call detail records (CDRs) to improve customer support and provide new location-based services.  Manufacturers are leveraging big data from machine device sensors to improve product quality and predictive maintenance.  Retailers are using big data to make next-best offer recommendations to increase customer up-sell and cross-sell opportunities. All of these projects require data integration and more specifically ETL.

The smart companies are using Informatica for ETL on Hadoop to realize the associated benefits and cost-savings:

  • Faster Time to Value – increase productivity and eliminate hand-coding, leverage existing Informatica resources and skills, easy-to-use visual development environment, repeatable development process paradigm, rich library of pre-built ETL transformations, near universal data access.
  • Wider Adoption – enabling wider adoption of ETL on Hadoop across projects, easier to administer and support, optimized end-to-end performance on Hadoop from source to target systems, optimal deployment in hybrid IT architectures.
  • Lower Risk – consistent and reliable ETL design and execution for all types of data, integrated profiling and data quality, data lineage for audits and compliance, vendor support from the leader in Data Integration.


This entry was posted in Big Data.
