I won’t say I’ve seen it all; I’ve only scratched the surface in the past 15 years. Below are some of the mistakes I’ve made or fixed during this time.
MongoDB as your Big Data platform
Ask yourself: why am I picking on MongoDB? Because it’s the NoSQL database most often abused this way. While Mongo has an aggregation framework that tastes like MapReduce, and even a (poorly documented) Hadoop connector, its sweet spot is as an operational database, not an analytical system.
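To make the MapReduce comparison concrete, here is a minimal sketch of a Mongo aggregation pipeline in Python with pymongo; the "shop" database, "orders" collection, and field names are hypothetical.

```python
# A minimal sketch, assuming a hypothetical "shop" database with an
# "orders" collection holding "status" and "amount" fields.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# $match filters documents (the map side); $group aggregates per key
# (the reduce side), hence the MapReduce flavor.
pipeline = [
    {"$match": {"status": {"$in": ["shipped", "delivered"]}}},
    {"$group": {"_id": "$status", "total": {"$sum": "$amount"}}},
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["total"])
```

That is perfectly serviceable at operational scale; the trouble starts when pipelines like this are asked to scan billions of documents that belong in an analytical system.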
RDBMS schema as files
You dumped each table from your RDBMS into a file, stored it on HDFS, and now plan to run Hive on it. You already know Hive is slower than an RDBMS; it launches MapReduce jobs even for a simple SELECT. Next, look at file sizes: you have flat files measured in single-digit kilobytes. HDFS and MapReduce are built for large files; thousands of tiny ones waste NameNode memory and spawn a map task apiece.
Hadoop does best on large sets of relatively flat data. I’m sure you can create an extract that’s more de-normalized.
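Here is a minimal sketch of what that extract might look like, assuming HiveServer2 is reachable via pyhive and that the hypothetical "orders" and "customers" tables came from the RDBMS dumps.

```python
# A minimal sketch, assuming HiveServer2 on localhost:10000 and
# hypothetical "orders" and "customers" tables loaded from the dumps.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)
cursor = conn.cursor()

# Join the per-table dumps once into a single wide extract, stored in
# a columnar format, instead of re-joining tiny files on every query.
cursor.execute("""
    CREATE TABLE orders_flat STORED AS ORC AS
    SELECT o.order_id, o.order_date, o.amount,
           c.customer_id, c.name, c.region
    FROM orders o
    JOIN customers c ON o.customer_id = c.customer_id
""")
```

One wide table gives MapReduce big sequential reads and spares the NameNode the overhead of tracking thousands of kilobyte-sized files.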
Data ponds instead of a data lake
Instead of creating a single data lake, you created a series of data ponds, or a data swamp. Conway’s law has struck again: your business groups have created their own mini-repositories and data-analysis processes. That doesn’t sound bad at first, but with different extracts and different ways of slicing and dicing the data, you end up with different views of the data, i.e., different answers to some of the same questions.
Schema-on-read doesn’t mean, “Don’t plan at all,” but it means “Don’t plan for every question you might ask.”
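As a sketch of what that lighter-weight planning looks like: lay an external Hive table over raw files already on HDFS and apply the schema only at query time. The HDFS path and columns below are hypothetical, and pyhive is just one way to issue the DDL.

```python
# A minimal sketch of schema-on-read; path and columns are hypothetical.
from pyhive import hive

cursor = hive.Connection(host="localhost", port=10000).cursor()

# The files already sit on HDFS untouched; this statement only records
# a schema, which is applied when the table is actually queried.
cursor.execute("""
    CREATE EXTERNAL TABLE raw_clicks (
        ts STRING,
        user_id STRING,
        url STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION '/data/raw/clicks'
""")
```

Tomorrow’s question can get its own external table over the same untouched files; that is exactly the flexibility schema-on-read buys you.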
Missing use cases
Vendors, trying to escape the constraints of departmental funding, are selling the idea of the data lake. The byproduct is that the business loses sight of real use cases. The data-lake approach can be valid, but you won’t get much out of it if you don’t have actual use cases in mind.
It isn’t hard to come up with use cases, but too often they are an afterthought. The business should start thinking about use cases at the latest when its databases can no longer handle the load.
For heavier analytics, you may need a bigger tool set, one that may include Hive, Pig, MapReduce, R, and more.
The hype around big data is certainly top of mind for executives at most companies today, but what I am really seeing is companies finally making the connection between innovation and data. Data as a corporate asset is now getting the respect it deserves as part of a business strategy to introduce innovative new products and services and to improve business operations. The most advanced companies have C-level executives responsible for delivering top- and bottom-line results by managing their data assets to their maximum potential. The Chief Data Officer and Chief Analytics Officer own this responsibility and report directly to the CEO.
Many organizations will mix and match individual Apache projects and sub-projects using Apache Hadoop’s loosely coupled architecture. This Hadoop toolbox provides a powerful set of tools and capabilities, but it does have some important limitations that can require a platform approach to address.
The Hadoop Distributed File System (HDFS) combines storage and processing in each data node. With HDFS, you can add new files or append to existing files, but you cannot replace a file in place; you must write a new file under a different name and swap it in. The append capability works well for adding new time-stamped logs as they come in, but can complicate storage of structured files.
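A minimal sketch of both behaviors, using the third-party Python hdfs (WebHDFS) client; the NameNode URL, user, and paths are hypothetical.

```python
# A minimal sketch with the third-party "hdfs" (WebHDFS) client;
# the NameNode URL, user, and paths are hypothetical.
from hdfs import InsecureClient

client = InsecureClient("http://namenode:50070", user="hdfs")

# Appending time-stamped log records works naturally on HDFS.
client.write("/logs/app/2014-06-01.log", data=b"new log line\n", append=True)

# Replacing a structured file in place does not: write under a new
# name, then swap it in with a rename (and drop the old copy).
client.write("/data/products/part-00000.tmp", data=b"rebuilt contents\n")
client.delete("/data/products/part-00000")
client.rename("/data/products/part-00000.tmp", "/data/products/part-00000")
```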
The Harry Potter books and movies were a particularly popular inspiration for project names. For example, at LinkedIn, to empower features such as “People You May Know” and “Jobs You May Be Interested In”, LinkedIn uses Hadoop together with an Azkaban batch workflow scheduler and Voldemort key-value store. We’ll see if the Twilight series has a similar impact on project names.
Enterprises use Hadoop in data-science applications that improve operational efficiency, grow revenues or reduce risk. Many of these data-intensive applications use Hadoop for log analysis, data mining, machine learning or image processing.
Commercial, open-source, or internally developed data-science applications have to tackle a lot of semi-structured, unstructured, or raw data. They benefit from Hadoop’s combination of storage and processing in each data node spread across a cluster of cost-effective commodity hardware. Hadoop’s lack of a fixed schema works particularly well for answering ad-hoc queries and exploratory “what if” scenarios.
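As an illustration of the log-analysis case, here is a minimal Hadoop Streaming sketch in Python that counts requests per HTTP status code. The assumption that the status code is the ninth whitespace-separated field matches Apache’s combined log format but is otherwise hypothetical.

```python
# ---- mapper.py ----
# Emit (status_code, 1) per request. Assumes Apache combined log
# format, where the status code is the ninth whitespace-separated field.
import sys

for line in sys.stdin:
    fields = line.split()
    if len(fields) > 8:
        print("%s\t1" % fields[8])

# ---- reducer.py ----
# Streaming hands the reducer mapper output sorted by key, so counts
# for each status code arrive contiguously and can be summed in order.
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, count))
```

You would launch the pair with the hadoop-streaming jar, passing the scripts via -mapper and -reducer; counting status codes is the “hello world” of Hadoop log analysis.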
Doug Cutting may have the world’s most famous stuffed green elephant. While named after his son’s plaything, Hadoop is maturing and entering the business world, creating significant opportunities as well as challenges. “Hadoop is on a fast track to becoming the world’s pre-eminent scientific analytic platform”, notes Forrester Research Senior Analyst James G. Kobielus (Forrester Blogs, June 7, 2011).
This is the first of a series of Hadoop articles I’ll write for Informatica Perspectives. My focus for this series is to guide existing and prospective users of Apache Hadoop with best practices and tips so that organizations can become more data-centric. After participating in Hadoop user communities, both local and virtual, for the last several years, I’m happy to share both innovative use cases and “areas to watch out for” from my work with Hadoop pioneers and practitioners on deploying and integrating Hadoop as part of a broader enterprise data architecture. I also bring a user perspective as a certified Hadoop system administrator.