As the framework architects and developers of Apache Hadoop MapReduce, we are always looking for ways to simplify the complex tasks associated with large-scale data processing. We want users and organizations to spend their time analyzing their growing data to gain valuable insights, not on menial tasks such as massaging that data for consumption or tediously parsing the complex structures in it. The Informatica HParser technology is extremely valuable in this regard.
For those new to Apache Hadoop, MapReduce is a parallel computing framework for processing large volumes of data. It addresses the four V’s of big data (as described by Forrester) that challenge existing data systems: volume, velocity, variety and variability. Together with the Hadoop Distributed File System (HDFS) and a handful of other important Apache Hadoop projects, it provides a massively scalable and highly reliable platform for storing, processing, managing and ultimately analyzing the ever-increasing data coming not only from transactional systems but also from unstructured sources such as server logs, customer interaction records, social media updates, email, PDFs, call detail records (CDRs) and so forth.
While extremely powerful and adept at providing insights into huge volumes of data, Apache Hadoop still presents challenges to enterprises trying to implement it. For example, complex data types make it harder to write the applications that analyze the data. One of the biggest pain points has always been parsing deeply hierarchical data, such as XML or JSON coming from clickstreams or logs. Previously, you needed to script this parsing manually in MapReduce, and in large organizations that could take weeks or even months.
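To make that pain point concrete, here is a minimal sketch of the kind of hand-written parsing logic described above. It assumes a Hadoop Streaming-style mapper written in Python that receives one JSON record per line; the record layout, the field names and the `flatten` and `map_line` helpers are hypothetical illustrations, not part of HParser or of any Hadoop API.

```python
import json


def flatten(value, prefix=""):
    """Yield (dotted_path, leaf_value) pairs for a nested JSON value."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from flatten(child, f"{prefix}[{index}]")
    else:
        # Leaf: a string, number, boolean or null.
        yield prefix, value


def map_line(line):
    """Turn one JSON log line into tab-separated key/value output lines."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return []  # skip malformed input rather than failing the whole task
    return [f"{path}\t{leaf}" for path, leaf in flatten(record)]


def map_lines(lines):
    """Apply map_line to an iterable of lines (for example, sys.stdin)."""
    for line in lines:
        yield from map_line(line)
```

Even this toy version has to decide how to name nested paths, how to index arrays and what to do with malformed records. Real clickstream or log formats multiply those decisions, which is where the weeks of hand scripting tend to go.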
Informatica has addressed this challenge head-on with HParser. This solution runs within MapReduce applications to exploit their inherent parallelism, and it turns manually scripted parsing tasks into standardized, tool-based ones. This means that the user community can become significantly more productive. It also means they can focus much more of their time on writing valuable applications that deliver more and better analytical insight to their business users. Informatica has also released a community edition of HParser that is available for free and is compatible with the Hortonworks Data Platform, powered by Apache Hadoop.
As the lead of the Next Generation of MapReduce in Apache Hadoop, I am excited that industry leaders like Informatica are creating solutions that simplify complex Hadoop tasks. As a co-founder of Hortonworks, I am excited that our two companies are working together to accelerate the adoption of Apache Hadoop. To learn more about Informatica’s HParser Community Edition, please visit the Informatica Marketplace. For more information on Hortonworks, please visit www.hortonworks.com.
For more from Arun on Hortonworks and Informatica HParser, please see this video: