Tag Archives: HIVE
The hype around big data is certainly top of mind with executives at most companies today but what I am really seeing are companies finally making the connection between innovation and data. Data as a corporate asset is now getting the respect it deserves in terms of a business strategy to introduce new innovative products and services and improve business operations. The most advanced companies have C-level executives responsible for delivering top and bottom line results by managing their data assets to their maximum potential. The Chief Data Officer and Chief Analytics Officer own this responsibility and report directly to the CEO. (more…)
Many organizations will mix and match individual Apache projects and sub-projects using Apache Hadoop’s loosely coupled architecture. This Hadoop toolbox provides a powerful set of tools and capabilities, but it does have some important limitations that can require a platform approach to address.
The Hadoop Distributed File System (HDFS) combines storage and processing in each data node. With the HDFS file system, you can add new files or append to existing files, but not replace files without use of a new filename. The append capability works well for adding new time-stamped logs as they come in, but can complicate storage of structured files. (more…)
The Harry Potter books and movies were a particularly popular inspiration for project names. For example, at LinkedIn, to empower features such as “People You May Know” and “Jobs You May Be Interested In”, LinkedIn uses Hadoop together with an Azkaban batch workflow scheduler and Voldemort key-value store. We’ll see if the Twilight series has a similar impact on project names.
Enterprises use Hadoop in data-science applications that improve operational efficiency, grow revenues or reduce risk. Many of these data-intensive applications use Hadoop for log analysis, data mining, machine learning or image processing.
Commercial, open source or internally developed data-science applications have to tackle a lot of semi-structured, unstructured or raw data. They benefit from Hadoop’s combination of storage and processing in each data node spread across a cluster of cost-effective commodity hardware. Hadoop’s lack of fixed-schema works particularly well for answering ad-hoc queries and exploratory “what if” scenarios.
Doug Cutting may have the world’s most famous green-stuffed-elephant toy. While named after his son’s plaything, Hadoop is maturing and entering the business world, creating significant opportunities as well as challenges. “Hadoop is on a fast track to becoming the world’s pre-eminent scientific analytic platform”, notes Forrester Research Senior Analyst James G. Kobielus (Forrester Blogs, June 7, 2011).
This is the first of a series of Hadoop articles I’ll write for Informatica Perspectives. My focus for this series is to guide an existing or prospective user of Apache Hadoop on best practices and tips so that organizations can become more data centric. After participating in Hadoop user communities, both local and virtual, for the last several years, I’m happy to share from work with Hadoop pioneers and practitioners both innovative use cases and “areas to watch out for” in deploying and integrating Hadoop as part of a broader enterprise data architecture. I also bring a user perspective as a certified Hadoop system administrator. (more…)