Tag Archives: MapReduce
I recently had the pleasure of participating in a big data panel at the Pacific Crest investors' conference (replay available here). I was joined on the panel by Hortonworks, MapR, DataStax and Microsoft. There is clearly a lot of interest in the world of big data and how the market is evolving. I came away from the panel with four fundamental thoughts: (more…)
Quite a bit has happened on the topic of big data since my last post on Informatica Perspectives almost one and a half years ago. I have spent my career helping organizations get control over uncontrolled data growth, while industry visionaries promote this brave new world of big data. (more…)
As the framework architects and developers of Apache Hadoop MapReduce, we are always looking for ways to simplify the complex tasks associated with large-scale processing of data. We want users and organizations to spend their time on analyzing their growing data to gain valuable insights, not on menial tasks such as massaging their data for consumption or tediously parsing complex structures in their data. The Informatica HParser technology is extremely valuable in this regard. (more…)
Security is a work in progress for the Apache Hadoop project and its sub-projects, as I discuss in the O'Reilly Hadoop tutorial "Get started with Hadoop: from evaluation to your first production cluster". Below are several of the security tips and best practices covered in that article. (more…)
Many organizations will mix and match individual Apache projects and sub-projects using Apache Hadoop’s loosely coupled architecture. This Hadoop toolbox provides a powerful set of tools and capabilities, but it does have some important limitations that can require a platform approach to address.
The Hadoop Distributed File System (HDFS) combines storage and processing in each data node. With HDFS, you can add new files or append to existing files, but you cannot replace a file in place; you must write to a new filename instead. The append capability works well for adding new time-stamped logs as they come in, but it can complicate storage of structured files. (more…)
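Because HDFS only supports create and append, "replacing" a file in practice means writing the new content under a fresh name and then renaming it over the old one. A minimal sketch of that two-step pattern, using the local filesystem as a stand-in (the function names and the `.tmp` suffix are illustrative; on a real cluster you would use `hdfs dfs -put` plus `hdfs dfs -mv`, or the FileSystem Java API):

```python
import os

def replace_file(path, new_contents):
    """Emulate the HDFS pattern: write to a new filename, then rename it
    over the old file. HDFS itself has no in-place overwrite, so updates
    to structured files require this kind of two-step swap."""
    tmp_path = path + ".tmp"          # new filename, as HDFS requires
    with open(tmp_path, "w") as f:
        f.write(new_contents)
    os.replace(tmp_path, path)        # rename over the old file

def append_log(path, line):
    """Appending, by contrast, maps directly onto HDFS's append support,
    which is why time-stamped log ingestion fits HDFS so naturally."""
    with open(path, "a") as f:
        f.write(line + "\n")
```

The rename step is what keeps readers from ever seeing a half-written file, which matters on a shared cluster where other jobs may be consuming the same path.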
A post from the TABB Group
For the biggest swaps dealers, creation of their new OTC derivatives infrastructure will include rebuilding existing platforms, buying key elements from technology providers, leveraging technology already in place in other asset classes and, of course, building new platforms from scratch. This is not a buy-versus-build decision—it’s a careful balancing act of process and technology decisions to create a best-of-breed infrastructure. (more…)
The Harry Potter books and movies were a particularly popular inspiration for project names. For example, to power features such as "People You May Know" and "Jobs You May Be Interested In", LinkedIn uses Hadoop together with the Azkaban batch workflow scheduler and the Voldemort key-value store. We'll see if the Twilight series has a similar impact on project names.
Doug Cutting may have the world's most famous stuffed green elephant. While named after his son's plaything, Hadoop is maturing and entering the business world, creating significant opportunities as well as challenges. "Hadoop is on a fast track to becoming the world's pre-eminent scientific analytic platform", notes Forrester Research Senior Analyst James G. Kobielus (Forrester Blogs, June 7, 2011).
This is the first of a series of Hadoop articles I'll write for Informatica Perspectives. My focus for this series is to guide existing and prospective users of Apache Hadoop through best practices and tips so that organizations can become more data centric. After participating in Hadoop user communities, both local and virtual, for the last several years, I'm happy to share, from my work with Hadoop pioneers and practitioners, both innovative use cases and areas to watch out for when deploying and integrating Hadoop as part of a broader enterprise data architecture. I also bring a user perspective as a certified Hadoop system administrator. (more…)