Top 5 Big Data Mistakes

Top 5 Big Data mistakes
Top 5 Big Data mistakes

I won’t say I’ve seen it all; I’ve only scratched the surface in the past 15 years. Below are some of the mistakes I’ve made or fixed during this time.

MongoDB as your Big Data platform

Ask yourself, why am I picking on MongoDB? The NoSQL database most abused at this point is MongoDB, while Mongo has an aggregation framework that tastes like MapReduce and even a very poorly documented Hadoop connector, its sweet spot is as an operational database, not an analytical system.

RDBMS schema as files

You dumped each table from your RDBMS into a file and stored that on HDFS, you now plan to use Hive on it. You know that Hive is slower than RDBMS; it’ll use MapReduce even for a simple select. Next, let’s look at row sizes; you have flat files measured in single-digit kilobytes.

Hadoop does best on large sets of relatively flat data. I’m sure you can create an extract that’s more de-normalized.

Data Ponds

Instead of creating a single Data Lake, you created a series of data ponds or a data swamp. Conway’s law has struck again; your business groups have created their own mini-repositories and data analysis processes. That doesn’t sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data, i.e., different answers for some of the same questions.

Schema-on-read doesn’t mean, “Don’t plan at all,” but it means “Don’t plan for every question you might ask.”

Missing use cases

Vendors, to escape the constraints of departmental funding, are selling the idea of the data lake. The byproduct of this is the business lost sight of real use cases. The data-lake approach can be valid, but you won’t get much out of it if you don’t have actual use cases in mind.

It isn’t hard to come up with use cases, but that is always an afterthought. The business should start thinking of the use cases when their databases can’t handle the load.


You like SQL. Query languages and techniques have changed with time. Today, think of Pig as PL/SQL on steroids with maybe a touch of acid.

To do a larger bit of analytics, you may need a bigger tool set like that may include Hive, Pig, MapReduce, R, and more.

Twitter @bigdatabeat