Tag Archives: HDFS
As the framework architects and developers of Apache Hadoop MapReduce, we are always looking for ways to simplify the complex tasks associated with large-scale data processing. We want users and organizations to spend their time analyzing their growing data to gain valuable insights, not on menial tasks such as massaging their data for consumption or tediously parsing complex structures in their data. The Informatica HParser technology is extremely valuable in this regard. (more…)
Security is a work in progress for the Apache Hadoop project and its sub-projects, as I discuss in an O’Reilly Hadoop tutorial, “Get started with Hadoop: from evaluation to your first production cluster”. Below are several of the security tips and best practices from that article. (more…)
Many organizations will mix and match individual Apache projects and sub-projects using Apache Hadoop’s loosely coupled architecture. This Hadoop toolbox provides a powerful set of tools and capabilities, but it does have some important limitations that can require a platform approach to address.
The Hadoop Distributed File System (HDFS) combines storage and processing in each data node. With HDFS, you can add new files or append to existing files, but you cannot replace a file's contents without writing them under a new filename. The append capability works well for adding new time-stamped logs as they come in, but can complicate storage of structured files. (more…)
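The add, append, and no-replace behavior described in this excerpt can be sketched with a small in-memory model. This is only an illustration of the semantics as the excerpt states them, not the HDFS API: the `AppendOnlyStore` class, its methods, and the file names are all hypothetical.

```python
class AppendOnlyStore:
    """Toy in-memory model of add/append-only file semantics.

    Hypothetical illustration only; this is not the HDFS client API.
    """

    def __init__(self):
        self._files = {}

    def create(self, name, data=b""):
        # Adding a new file is allowed; reusing an existing name is not.
        if name in self._files:
            raise FileExistsError(f"{name} exists; write under a new filename")
        self._files[name] = bytes(data)

    def append(self, name, data):
        # Appending to an existing file is allowed.
        if name not in self._files:
            raise FileNotFoundError(name)
        self._files[name] += bytes(data)

    def read(self, name):
        return self._files[name]


store = AppendOnlyStore()

# Time-stamped log lines fit the append model naturally.
store.create("logs/2011-06-07.log", b"10:00 start\n")
store.append("logs/2011-06-07.log", b"10:05 job done\n")

# A corrected structured file cannot replace the old one in place,
# so each revision is written under a new (here, versioned) filename.
store.create("tables/users.v1.csv", b"id,name\n1,Ann\n")
store.create("tables/users.v2.csv", b"id,name\n1,Anne\n")
```

In this model, readers of structured data need some convention (such as the version suffix assumed here) to find the current file, which is one way the lack of in-place replacement can complicate storage.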
eHarmony, an online dating service, uses Hadoop processing and the Hive data warehouse for analytics to match singles based on each individual’s “29 Dimensions® of Compatibility”, per a June 2011 press release by eHarmony and one of its suppliers, SeaMicro. According to eHarmony, an average of 542 eHarmony members marry daily in the United States. (more…)
Doug Cutting may have the world’s most famous green-stuffed-elephant toy. While named after his son’s plaything, Hadoop is maturing and entering the business world, creating significant opportunities as well as challenges. “Hadoop is on a fast track to becoming the world’s pre-eminent scientific analytic platform”, notes Forrester Research Senior Analyst James G. Kobielus (Forrester Blogs, June 7, 2011).
This is the first in a series of Hadoop articles I’ll write for Informatica Perspectives. My focus for this series is to guide existing and prospective users of Apache Hadoop on best practices and tips so that their organizations can become more data centric. Having participated in Hadoop user communities, both local and virtual, for the last several years, I’m happy to share both innovative use cases and “areas to watch out for” in deploying and integrating Hadoop as part of a broader enterprise data architecture, drawn from my work with Hadoop pioneers and practitioners. I also bring a user perspective as a certified Hadoop system administrator. (more…)