Don’t Forget to Manage the Retention and Disposal of Data on Hadoop
In Mark Brunelli’s interview with James Kobielus of Forrester Research, “Forrester’s Kobielus: It’s time for a Hadoop standards body,” Mr. Kobielus argues that Hadoop is still a bit immature and needs to adopt standards. He goes on to note that when implementing Hadoop, “whether it’s through a data warehouse or Hadoop cluster, you’re talking about petabytes or multiple hundreds of terabytes worth of storage.” Hadoop, while designed to access these large data volumes (which can include social media data), does nothing to manage retention of that data. (more…)
Apache Hadoop MapReduce Meets Informatica Data Parsing
Guest blog from Arun C. Murthy, Founder & Architect, Hortonworks
As the framework architects and developers of Apache Hadoop MapReduce, we are always looking for ways to simplify the complex tasks associated with large-scale processing of data. We want users and organizations to spend their time on analyzing their growing data to gain valuable insights, not on menial tasks such as massaging their data for consumption or tediously parsing complex structures in their data. The Informatica HParser technology is extremely valuable in this regard. (more…)
Future Integration Needs: Embracing Complex Data
Hear from Informatica’s Karen Hsu on a new study’s findings and implications of big complex data.
For more on this see: Future Integration Needs: Embracing Complex Data
MDM and ACORD Standards: Synergies and Considerations
Hear from Informatica’s Karen Hsu on the new ACORD-certified Information Management solution that helps insurance organizations drive customer-centricity.
For more on this see: Master Data Management and ACORD Standards: Synergies and Considerations
Action Plan for Hadoop Data Integration: Conclusion of Hadoop Blog Series
I had the opportunity to review and comment on the draft of a new Hadoop technical guide. It’s great to see the published paper: Technical Guide: Unleashing the Power of Hadoop with Informatica. This guide outlines the following five steps to get started with Hadoop from a data integration perspective.
(1) Select the Right Projects for Hadoop Implementation
Choose projects that fit Hadoop’s strengths and minimize its disadvantages. Enterprises use Hadoop in data-science applications for log analysis, data mining, machine learning and image processing involving unstructured or raw data. Hadoop’s lack of a fixed schema works particularly well for answering ad-hoc queries and exploratory “what if” scenarios. Hadoop Distributed File System (HDFS) and MapReduce address both the growth in enterprise data volumes from terabytes to petabytes and more, and the increasing variety of complex multi-dimensional data from disparate sources. (more…)
Hadoop Security: Part 6 of Hadoop Series
Security is a work-in-progress for the Apache Hadoop project and sub-projects, as I discuss as part of an O’Reilly Hadoop tutorial, “Get started with Hadoop: from evaluation to your first production cluster”. Below are several of the security tips and best practices that I discuss in that article. (more…)
Video: Electronic Health Records Update
Richard Cramer, Chief Healthcare Strategist for Informatica shares some views on Electronic Health Record (EHR) adoption, including HITECH and Meaningful Use pressures. He also talks about the challenges that the future holds for EHRs.
Visit Informatica’s Healthcare pages for more on EMRs.
Hadoop Toolbox: Part 5 of Hadoop Series
Many organizations will mix and match individual Apache projects and sub-projects using Apache Hadoop’s loosely coupled architecture. This Hadoop toolbox provides a powerful set of tools and capabilities, but it does have some important limitations that can require a platform approach to address.
The Hadoop Distributed File System (HDFS) combines storage and processing in each data node. With HDFS, you can add new files or append to existing files, but you cannot replace a file without using a new filename. The append capability works well for adding new time-stamped logs as they come in, but can complicate storage of structured files. (more…)
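To make the add, append, and no-replace behavior described above concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the HDFS API: the class name AppendOnlyStore, its methods, and the file names are all hypothetical and exist only for illustration.

```python
class AppendOnlyStore:
    """Toy in-memory model of add/append/no-replace semantics (not the HDFS API)."""

    def __init__(self):
        self._files = {}

    def create(self, name, data):
        # New files are allowed, but an existing name cannot be replaced.
        if name in self._files:
            raise FileExistsError(f"{name} exists; write under a new filename")
        self._files[name] = bytes(data)

    def append(self, name, data):
        # Appending to an existing file is allowed.
        if name not in self._files:
            raise FileNotFoundError(name)
        self._files[name] += bytes(data)

    def read(self, name):
        return self._files[name]


store = AppendOnlyStore()

# Time-stamped log lines fit the append model well.
store.create("logs/app.log", b"2011-11-01 start\n")
store.append("logs/app.log", b"2011-11-01 stop\n")

# Changing a structured file means writing a new version under a new name.
store.create("customers_v1.csv", b"id,name\n1,Ann\n")
store.create("customers_v2.csv", b"id,name\n1,Anne\n")
```

In this model, correcting a single record in a structured file means writing a whole new version under a new name, which is one way to picture why structured storage gets more complicated than log appends.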
Dating With Data: Part 4 In Hadoop Series
eHarmony, an online dating service, uses Hadoop processing and the Hive data warehouse for analytics to match singles based on each individual’s “29 Dimensions® of Compatibility”, per a June 2011 press release by eHarmony and one of its suppliers, SeaMicro. According to eHarmony, an average of 542 eHarmony members marry daily in the United States. (more…)
Hadoop Extends Data Architectures: Part 3 In Hadoop Series
The list and diversity of NoSQL, “NewSQL”, cloud, grid, and other data architecture options seem to grow every year.
The Harry Potter books and movies were a particularly popular inspiration for project names. For example, to power features such as “People You May Know” and “Jobs You May Be Interested In”, LinkedIn uses Hadoop together with the Azkaban batch workflow scheduler and the Voldemort key-value store. We’ll see if the Twilight series has a similar impact on project names.


