This post was written by guest author Dale Kim, Director of Industry Solutions at MapR Technologies, a valued Informatica partner that provides a distribution for Apache Hadoop that ensures production success for its customers.
Apache Hadoop is growing in popularity as the foundation for an enterprise data hub. An Enterprise Data Hub (EDH) extends and optimizes the traditional data warehouse model by adding complementary big data technologies. It focuses your data warehouse on high-value data by reallocating less frequently used data to an alternative platform, and it aggregates data from previously untapped sources to give you a more complete picture of your data.
So you have your data, your warehouses, your analytical tools, your Informatica products, and you want to deploy an EDH… now what about Hadoop?
Requirements for Hadoop in an Enterprise Data Hub
Let’s look at the characteristics required to meet your EDH needs in a production environment:

- Enterprise-grade reliability: high availability, disaster recovery, and data protection
- Interoperability with existing applications and tools
- Multi-tenancy
- Security
- Operational capabilities: fast reads, writes, and updates

You already expect these from your existing enterprise deployments. Shouldn’t you hold Hadoop to the same standards? Let’s discuss each topic:
Enterprise-grade is about the features that keep a system running, i.e., high availability (HA), disaster recovery (DR), and data protection. HA helps a system run even when components (e.g., computers, routers, power supplies) fail. In Hadoop, this means no downtime and no data loss, but also no work loss. If a node fails, you still want jobs to run to completion. DR with remote replication or mirroring guards against site-wide disasters. Mirroring needs to be consistent to ensure recovery to a known state. Using file copy tools won’t cut it. And data protection, using snapshots, lets you recover from data corruption, especially from user or application errors. As with DR replicas, snapshots must be consistent, in that they must reflect the state of the data at the time the snapshot was taken. Not all Hadoop distributions can offer this guarantee.
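The snapshot-consistency requirement can be illustrated with a tiny sketch. This is a pure Python illustration with made-up names, not Hadoop code: a consistent snapshot is a copy taken atomically at a point in time, so writes made afterwards never leak into it.

```python
class Store:
    """Toy key-value store illustrating point-in-time snapshots."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def snapshot(self):
        # A consistent snapshot: a copy of the state as it exists
        # right now; later writes do not affect it.
        return dict(self._data)

store = Store()
store.put("balance", 100)
snap = store.snapshot()
store.put("balance", 250)  # a write made after the snapshot was taken
# snap still reflects the state at snapshot time: {"balance": 100}
```

A real Hadoop distribution achieves this with copy-on-write metadata rather than a full copy, but the guarantee to look for is the same: the snapshot reflects one instant in time.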
Hadoop interoperability is an obvious necessity. Features like a POSIX-compliant, NFS-accessible file system let you reuse existing, file system-based applications on Hadoop data. Support for existing tools lets your developers get up to speed quickly. And integration with REST APIs enables easy, open connectivity with other systems.
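As a concrete example of that REST connectivity, Hadoop’s WebHDFS API exposes file system operations over plain HTTP. Here is a minimal sketch of building a request URL; the host name is hypothetical, and the port is an assumption (9870 is the NameNode’s default HTTP port in Hadoop 3, while older releases used 50070).

```python
from urllib.parse import urlencode

def webhdfs_url(host, path, op, port=9870, **params):
    """Build a WebHDFS REST URL for the given file system operation."""
    query = urlencode({"op": op, **params})
    return f"http://{host}:{port}/webhdfs/v1{path}?{query}"

# List a directory's status over HTTP (fetch with any HTTP client):
url = webhdfs_url("namenode.example.com", "/data/warehouse", "LISTSTATUS")
# -> http://namenode.example.com:9870/webhdfs/v1/data/warehouse?op=LISTSTATUS
```

Because it is just HTTP and JSON, any system that can make a web request can read from or write to the cluster, with no Hadoop client libraries required.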
You should be able to logically divide clusters to support different use cases, job types, user groups, and administrators as needed. To avoid a complex multi-cluster setup, choose a Hadoop distribution with multi-tenancy capabilities that simplify the architecture. This reduces the risk of error and avoids duplicating data and effort.
Security should be a priority to protect against the exposure of confidential data. You should assess how you’ll handle authentication (with or without Kerberos), authorization (access controls), over-the-network encryption, and auditing. Many of these features should be native to your Hadoop distribution, and there are also strong security vendors that provide technologies for securing Hadoop.
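As a toy illustration of the authorization piece (and not any particular vendor’s API), path-level access controls can be modeled as a map from path and group to the set of permitted actions. The paths and group names below are invented for the example.

```python
# Hypothetical ACL table: path -> group -> allowed actions.
ACLS = {
    "/data/finance": {"finance_team": {"read", "write"}, "auditors": {"read"}},
    "/data/public":  {"everyone": {"read"}},
}

def is_authorized(path, group, action):
    """Deny by default: allow only actions explicitly granted to the group."""
    return action in ACLS.get(path, {}).get(group, set())

is_authorized("/data/finance", "auditors", "read")   # True
is_authorized("/data/finance", "auditors", "write")  # False
```

Real deployments layer authentication (such as Kerberos) underneath checks like this, and add auditing so that every access decision leaves a trail.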
Any large-scale deployment needs fast read, write, and update capabilities. Hadoop can support the operational requirements of an EDH with integrated, in-Hadoop databases like Apache HBase™ and Accumulo™, as well as MapR-DB (the MapR NoSQL database). This in-Hadoop model helps simplify the overall EDH architecture.
Using Hadoop as a foundation for an EDH is a powerful option for businesses. Choosing the correct Hadoop distribution is the key to deploying a successful EDH. Be sure not to take shortcuts – especially in a production environment – as you will want to hold your Hadoop platform to the same high expectations you have of your existing enterprise systems.
In my last blog, I talked about the dreadful experience of cleaning raw data by hand as a former analyst a few years back. Well, the truth is, I was not alone. At a recent data mining Meetup event in the San Francisco Bay Area, I asked a few analysts: “How much time do you spend cleaning your data at work?” “More than 80% of my time” and “most of my days,” said the analysts, and “it is not fun.”
But check this out: there are over a dozen Meetup groups focused on data science and data mining here in the Bay Area, where I live. Those groups put on events multiple times a month, with topics often around hot, emerging technologies such as machine learning, graph analysis, real-time analytics, new algorithms for analyzing social media data, and, of course, anything Big Data. Cool BI tools, new programming models, and algorithms for better analysis are a big draw for data practitioners these days.
That got me thinking… if what the analysts told me is true, i.e., they spend 80% of their time prepping data and only the remaining 20% analyzing it and visualizing the results (which, BTW, “is actually fun,” to quote one data analyst), then why are they drawn to events focused on tools that help them only 20% of the time? Why wouldn’t they want to explore technologies that address the dreadful 80%: the data scrubbing they complain about?
Having been there myself, I thought perhaps a little self-reflection would help answer the question.
As a student of math, I love data and am fascinated by the stories I can discover in it. My two-year math program in graduate school focused primarily on building math models to simulate real events, and on using those formulas to predict the future or look for meaningful patterns.
I used BI and statistical analysis tools while at school, and continued to use them at work after I graduated. Those tools were great in that they helped me get to results and see what was in my data, so I could develop conclusions and make recommendations for my clients based on those insights. Without BI and visualization tools, I would not have delivered any results.
That was the fun and glamorous part of my job as an analyst. But when I was not creating nice charts and presentations to tell the stories in my data, I was spending a great amount of time, sometimes into the wee hours, cleaning and verifying my data. I was convinced that was part of my job and I just had to suck it up.
It was only a few months ago that I stumbled upon data quality software – it happened when I joined Informatica. At first I thought they were talking to the wrong person when they started pitching me data quality solutions.
As it turns out, the concept of data quality automation is highly relevant and extremely intuitive to me, and to anyone who deals with data on a regular basis. Data quality software automates the data cleansing process, and it is much faster and more accurate than doing the work by hand. To put that in math terms: if a data quality tool can reduce the data cleansing effort from 80% of an analyst’s time to 40% (BTW, this is hardly a random number; some of our customers have reported much better results), analysts can free up 40% of their time and use it to do the things they like: playing with data in BI tools, building new models, running more scenarios, producing different views of the data, and discovering things they could not before, all with clean, trusted data. No more bored-to-death experiences. What they are left with is improved productivity, more accurate and consistent results, compelling stories about their data, and, most importantly, the freedom to focus on the work they enjoy. Not too shabby, right?
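To make that idea concrete, here is an illustrative sketch of the kind of repetitive cleansing a data quality tool automates: trimming, normalizing, validating, and de-duplicating records. This is plain Python with made-up fields, not Informatica’s product.

```python
import re

def clean_records(rows):
    """Automated cleansing pass: trim whitespace, normalize email case,
    drop incomplete or invalid rows, and de-duplicate."""
    seen, cleaned = set(), []
    for row in rows:
        name = row.get("name", "").strip()
        email = row.get("email", "").strip().lower()
        # Drop rows missing a name or carrying a malformed email address.
        if not name or not re.match(r"[^@\s]+@[^@\s]+\.[^@\s]+$", email):
            continue
        key = (name.lower(), email)
        if key in seen:
            continue  # duplicate record
        seen.add(key)
        cleaned.append({"name": name, "email": email})
    return cleaned

rows = [
    {"name": " Ada Lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},  # duplicate
    {"name": "", "email": "missing@example.com"},           # incomplete
]
cleaned = clean_records(rows)  # one clean, de-duplicated record
```

Every step in that loop is something I once did by hand in a spreadsheet; automating them is exactly where the reclaimed hours come from.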
I am excited about trying out the data quality tools we have here at Informatica. My fellow analysts, you should start looking into them too. And I will check back in soon with more stories to share.
I have a little fable to tell you…
This fable has nothing to do with Big Data, but instead deals with an Overabundance of Food and how to better digest it to make it useful.
And it all started when this SEO copywriter from IT Corporation walked into a bar, pub, grill, restaurant, liquor establishment, and noticed 2 large crowded tables. After what seemed like an endless loop, an SQL programmer sauntered in and contemplated the table problem. “Mind if I join you?” he asked. Since the tables were partially occupied and there were no virtual tables available, the host looked out on the patio of the restaurant at 2 open tables. “Shall I do an outer join instead?” asked the programmer. The host considered their schema and assigned 2 seats to the space.
The writer told the programmer to look at the menu, bill of fare, blackboard – there were so many choices but not enough real nutrition. “Hmmm, I’m hungry for the right combination of food, grub, chow, to help me train for a triathlon” he said. With that contextual information, they thought about foregoing the menu items and instead getting in the all-you-can-eat buffer line. But there was too much food available and despite its appealing looks in its neat rows and columns, it seemed to be mostly empty calories. They both realized they had no idea what important elements were in the food, but came to the conclusion that this restaurant had a “Big Food” problem.
They scoped it out for a moment, and then the writer did an about face, reversal, change in direction, and the SQL programmer did a commit and quick pivot toward the buffer line, where they did a batch insert of all of the food, even the BLOBs of spaghetti, mashed potatoes and jello. There was far too much and it was far too rich for their tastes and needs, but they binged and consumed it all. You should have seen all the empty dishes at the end – they even caused a stack overflow. Because it was a batch binge, their digestive tracts didn’t know how to process all of the food, so they got a stomach ache from “big food” ingestion – and it nearly caused a core dump – in which case the restaurant host would have assigned his most dedicated servers to perform a thorough cleansing and scrubbing. There was no way to do a rollback at this point.
It was clear they needed relief. The programmer did an ad hoc query to JSON, their Server who they thought was Active, for a response about why they were having such “big food” indigestion, and did they have packets of relief available. No response. Then they asked again. There was still no response. So the programmer said to the writer, “Gee, the Quality Of Service here is terrible!”
Just then, the programmer remembered a remedy he had heard about previously, and so he spoke up. “Oh, it’s very easy: just SELECT Vibe.Data.Stream FROM INFORMATICA WHERE REAL-TIME IS NOT NULL.”
Informatica’s Vibe Data Stream enables streaming food collection for real-time Big food analytics, operational intelligence, and traditional enterprise food warehousing from a variety of distributed food sources at high scale and low latency. It enables the right food ingested at the right time when nutrition is needed without any need for binge or batch ingestion.
And so they all lived happily ever after and all was good in the IT Corporation once again.
Download Now and take your first steps to rapidly developing applications that sense and respond to streaming food (or data) in real-time.
Forget degrees from Harvard or MIT; forget NoSQL, Hadoop or OBIEE. These are all powerful tools, but they will not win you the face-off. It starts with who you are (or are not), who or what you are going up against, and what has happened in the past. Why should martial arts be a job requirement for Chief Data Officers? I boiled it down to three simple reasons to help you understand.
I started practicing Kendo three years ago, and every single practice it surprises me how inadequate I still am, how much I can glean from my opponent to predict future behavior, and how unimportant the “background noise” really is. Even on a good day, a strike from a teenager or a retiree who has been practicing for a decade or more will remind me that I got only 1% better compared to last month and still have a long way to go. At our last practice, one of my Senseis told me that even the higher ranks get their Ki-Ken-Tai-Ichi (alignment of spirit, sword, and body) right maybe half the time. It’s a life lesson every time.
These three observations are probably also true of many one-on-one sports where adversaries study each other for more than just a couple of seconds before the next swing or shot. If you ask me, most Western sports are about endurance, strength, mindset and strategy, with a heavy focus on the physical aspects. Kendo is 90% strategy and mindset. That is why six- and sixty-year-olds alike can excel at it. It is more akin to chess with baseball bats.
You study your opponent from the second he walks up to the chair in the middle of the podium for a gentlemanly exchange of cerebral willpower, but in the end you will smack him relentlessly with a bat. Everything you do is directly driven by how you feel, what you think, how your opponent moves, and what your opponent feels and thinks. The goal is not to react to a hand being raised but to anticipate your opponent’s next move based on their most recent actions. By the time your eyes (and, a tenth of a second later, your brain) register the right hand going up to strike toward you, you’ve already lost; it is too late to react.
You are effectively analyzing core data domains and key attributes, like posture. Business data requires the same rigor and focus on the essential. There is also a tremendous amount of process (formalities like repeated bowing) and deeper meaning in everything you do; call it “Governance”.
A Chief Data Officer (CDO) needs to mind the same aspects in his or her existence.
- You are not the professional you think you are (humility)
- Someone else always knows something you don’t, so every additional bit helps to predict future actions (willingness to learn)
- How to eliminate all the noise detracting from the ultimate goal (focus)
In reality, the data problem was not solved long ago; something new can still be learned to combat this age-old problem. The learning comes into play when we are willing to listen to people who have fixed, or seen fixed, similar problems in another environment, not necessarily the same industry or department. The third point is that political and technical detractors, like procurement processes, M&A, new leadership or transactional volume spikes from more applications, will continue to pop up. However, it is on you, the CDO, to uncover, isolate and preach that a broken process may not always be the root cause of a business issue, and as such it needs to be put in perspective. As I always say, “throwing bad data at a better process” just saves you a step but still renders errors, rework and bad decisions.
So what does this mean in “real” terms?
- Seek and accept opinions frequently, even if they don’t match your issue perfectly. Often a customer is a customer is a customer….admit it. Your business model may not be that special after all.
- Watch what the others do on a fundamental level, i.e. becoming data-driven organizations. These could be competitors, partners, organizations you (should) admire.
- Internalize and socialize what the organization’s core asset and goal is, the thing that will move the needle the most. Often it will be your intelligence (read: information, or data).
I will leave you with these thoughts and invite you to sit down, cross your legs, close your eyes and get all esoteric on me, young grasshopper, but please envision what your organization should look like and how it should make its money five years from now. Throwing more resources at new problems, ignoring core data issues and reacting only when things bubble up in greater numbers will likely not cut it.
And here is where I will bow out. Take a moment and think about it; how does your take on life influence your assessment of what you encounter in your workplace?