Tag Archives: MapReduce

Top 5 Big Data Mistakes

I won’t say I’ve seen it all; I’ve only scratched the surface in the past 15 years. Below are some of the mistakes I’ve made or fixed during this time.

MongoDB as your Big Data platform

Why am I picking on MongoDB? Because it is the most abused NoSQL database at this point. Mongo has an aggregation framework that tastes like MapReduce, and even a very poorly documented Hadoop connector, but its sweet spot is as an operational database, not an analytical system.

RDBMS schema as files

You dumped each table from your RDBMS into a file, stored it on HDFS, and now plan to use Hive on it. You know that Hive is slower than an RDBMS; it’ll use MapReduce even for a simple select. Next, look at the sizes involved: you have flat files measured in single-digit kilobytes.

Hadoop does best on large sets of relatively flat data. I’m sure you can create an extract that’s more de-normalized.
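As a rough sketch of what a more de-normalized extract could look like, the Python below joins two per-table dumps into one flat file of self-contained rows. The table names, columns, and sample values are made up for illustration; a real extract would more likely be produced by a join in the source database before export.

```python
import csv
import io

# Hypothetical normalized tables, as they might be dumped one file per table.
CUSTOMERS_CSV = """customer_id,name,region
1,Acme,West
2,Globex,East
"""

ORDERS_CSV = """order_id,customer_id,amount
100,1,250.00
101,2,75.50
102,1,19.99
"""


def read_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def denormalize(orders, customers):
    """Join each order to its customer so every output row is self-contained."""
    by_id = {c["customer_id"]: c for c in customers}
    flat = []
    for order in orders:
        customer = by_id.get(order["customer_id"], {})
        flat.append({
            "order_id": order["order_id"],
            "amount": order["amount"],
            "customer_name": customer.get("name", ""),
            "region": customer.get("region", ""),
        })
    return flat


flat_rows = denormalize(read_rows(ORDERS_CSV), read_rows(CUSTOMERS_CSV))
for row in flat_rows:
    print(row)
```

Each output row carries the customer attributes alongside the order, so a query over the extract needs no join at read time.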

Data Ponds

Instead of creating a single Data Lake, you created a series of data ponds or a data swamp. Conway’s law has struck again; your business groups have created their own mini-repositories and data analysis processes. That doesn’t sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data, i.e., different answers for some of the same questions.

Schema-on-read doesn’t mean “Don’t plan at all”; it means “Don’t plan for every question you might ask.”
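A minimal sketch of that idea in Python, assuming raw events are kept as one JSON document per line: the data is stored once with consistent field names (the planning part), and each question applies its own schema when it reads. The event fields and the helper name `read_with_schema` are hypothetical.

```python
import json

# Hypothetical raw events stored as-is, one JSON document per line.
RAW_EVENTS = """\
{"user": "ann", "action": "login", "ms": 120}
{"user": "bob", "action": "purchase", "amount": 30.0}
{"user": "ann", "action": "purchase", "amount": 12.5, "coupon": "X1"}
"""


def read_with_schema(raw_text, fields):
    """Apply a schema at read time: keep only the requested fields, None if absent."""
    rows = []
    for line in raw_text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        rows.append({name: record.get(name) for name in fields})
    return rows


# Two different questions read the same stored data with different schemas.
purchases = [r for r in read_with_schema(RAW_EVENTS, ["user", "amount"])
             if r["amount"] is not None]
logins = [r for r in read_with_schema(RAW_EVENTS, ["user", "ms"])
          if r["ms"] is not None]

print(purchases)
print(logins)
```

Because both questions read the same single copy of the data, they cannot drift apart the way separate departmental extracts do.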

Missing use cases

Vendors, to escape the constraints of departmental funding, are selling the idea of the data lake. A byproduct is that the business has lost sight of real use cases. The data-lake approach can be valid, but you won’t get much out of it if you don’t have actual use cases in mind.

It isn’t hard to come up with use cases, but they are always an afterthought. The business should start thinking about use cases when its databases can’t handle the load.

SQL

You like SQL. Query languages and techniques have changed with time. Today, think of Pig as PL/SQL on steroids with maybe a touch of acid.

To do a larger bit of analytics, you may need a bigger tool set that may include Hive, Pig, MapReduce, R, and more.

Twitter @bigdatabeat

Posted in Architects, Big Data, Business Impact / Benefits, CIO, Hadoop

Big Data Innovation in Analytics (Hadoop) and Hand-coding (Informatica)

I recently had the pleasure of participating in a big data panel at the Pacific Crest investor’s conference (a replay is available). I was joined on the panel by Hortonworks, MapR, Datastax and Microsoft. There is clearly a lot of interest in the world of big data and how the market is evolving. I came away from the panel with four fundamental thoughts: (more…)

Posted in B2B, Big Data, Cloud Computing

A Return to Big Data

Quite a bit has happened on the topic of big data since my last post on Informatica Perspectives almost one and a half years ago. I have spent a career working with organizations on how to get control over their uncontrolled data growth, and industry visionaries are promoting this brave new world of big data. (more…)

Posted in Application ILM, Big Data, Informatica 9.5

Apache Hadoop MapReduce Meets Informatica Data Parsing

Guest blog from Arun C. Murthy, Founder & Architect, Hortonworks

As the framework architects and developers of Apache Hadoop MapReduce, we are always looking for ways to simplify the complex tasks associated with large-scale processing of data. We want users and organizations to spend their time on analyzing their growing data to gain valuable insights, not on menial tasks such as massaging their data for consumption or tediously parsing complex structures in their data. The Informatica HParser technology is extremely valuable in this regard. (more…)

Posted in B2B, Big Data, Marketplace, News & Announcements

Hadoop Security: Part 6 of Hadoop Series

Security is a work-in-progress for the Apache Hadoop project and sub-projects, as I discuss as part of an O’Reilly Hadoop tutorial, “Get started with Hadoop: from evaluation to your first production cluster”. Below are several of the security tips and best practices that I discuss in that article. (more…)

Posted in Big Data

Hadoop Toolbox: Part 5 of Hadoop Series

Many organizations will mix and match individual Apache projects and sub-projects using Apache Hadoop’s loosely coupled architecture. This Hadoop toolbox provides a powerful set of tools and capabilities, but it does have some important limitations that can require a platform approach to address.

The Hadoop Distributed File System (HDFS) combines storage and processing in each data node. With HDFS, you can add new files or append to existing files, but you cannot replace a file without using a new filename. The append capability works well for adding new time-stamped logs as they come in, but can complicate storage of structured files. (more…)

Posted in Big Data

Anatomy Of A New Swaps Infrastructure

A post from the TABB Group

For the biggest swaps dealers, creation of their new OTC derivatives infrastructure will include rebuilding existing platforms, buying key elements from technology providers, leveraging technology already in place in other asset classes and, of course, building new platforms from scratch. This is not a buy-versus-build decision—it’s a careful balancing act of process and technology decisions to create a best-of-breed infrastructure. (more…)

Posted in Ultra Messaging

Hadoop Extends Data Architectures: Part 3 In Hadoop Series

The list and diversity of NoSQL, “NewSQL”, cloud, grid, and other data architecture options seem to grow every year.

The Harry Potter books and movies were a particularly popular inspiration for project names. For example, to power features such as “People You May Know” and “Jobs You May Be Interested In”, LinkedIn uses Hadoop together with the Azkaban batch workflow scheduler and the Voldemort key-value store. We’ll see if the Twilight series has a similar impact on project names.

(more…)

Posted in Big Data

Hadoop Series: Hadoop Means Business

Doug Cutting may have the world’s most famous green-stuffed-elephant toy. While named after his son’s plaything, Hadoop is maturing and entering the business world, creating significant opportunities as well as challenges. “Hadoop is on a fast track to becoming the world’s pre-eminent scientific analytic platform”, notes Forrester Research Senior Analyst James G. Kobielus (Forrester Blogs, June 7, 2011).

Doug Cutting and the Original "Hadoop"

This is the first of a series of Hadoop articles I’ll write for Informatica Perspectives. My focus for this series is to guide existing and prospective users of Apache Hadoop on best practices and tips so that organizations can become more data centric. After participating in Hadoop user communities, both local and virtual, for the last several years, I’m happy to share both innovative use cases and “areas to watch out for” in deploying and integrating Hadoop as part of a broader enterprise data architecture, drawn from work with Hadoop pioneers and practitioners. I also bring a user perspective as a certified Hadoop system administrator. (more…)

Posted in Big Data