Tag Archives: Hadoop
In my last blog, I talked about the dreadful experience of cleaning raw data by hand as a former analyst a few years back. Well, the truth is, I was not alone. At a recent data mining Meetup event in the San Francisco Bay Area, I asked a few analysts: “How much time do you spend on cleaning your data at work?” “More than 80% of my time” and “most of my days,” the analysts said, adding that “it is not fun.”
But check this out: there are over a dozen Meetup groups focused on data science and data mining here in the Bay Area where I live. Those groups put on events multiple times a month, with topics often around hot, emerging technologies such as machine learning, graph analysis, real-time analytics, new algorithms for analyzing social media data, and of course, anything Big Data. Cool BI tools and new programming models and algorithms for better analysis are a big draw for data practitioners these days.
That got me thinking: if what the analysts told me is true, i.e., they spend 80% of their time on data prep and only the remaining 20% analyzing the data and visualizing the results (which, by the way, “is actually fun,” to quote one of them), then why are they drawn to events focused on tools that can only help them 20% of the time? Why wouldn’t they want to explore technologies that can help address the dreadful 80%, the data scrubbing they complain about?
Having been there myself, I thought perhaps a little self-reflection would help answer the question.
As a student of math, I love data and am fascinated by the good stories I can discover in it. My two-year math program in graduate school was primarily focused on learning how to build fabulous math models to simulate real events, and then use those formulas to predict the future or look for meaningful patterns.
I used BI and statistical analysis tools while at school, and continued to use them at work after I graduated. That software was great in that it helped me get to results and see what was in my data, so I could develop conclusions and make recommendations based on those insights for my clients. Without BI and visualization tools, I would not have delivered any results.
That was the fun and glamorous part of my job as an analyst. But when I was not creating nice charts and presentations to tell the stories in my data, I was spending time, a great amount of time, sometimes up to the wee hours, cleaning and verifying my data. I was convinced that was part of my job and I just had to suck it up.
It was only a few months ago that I stumbled upon data quality software – it happened when I joined Informatica. At first I thought they were talking to the wrong person when they started pitching me data quality solutions.
Turns out, the concept of data quality automation is a highly relevant and extremely intuitive subject for me, and for anyone who deals with data on a regular basis. Data quality software offers an automated process for data cleansing that is much faster and delivers more accurate results than a manual process. To put that in math context: if a data quality tool can reduce the data cleansing effort from 80% to 40% (by the way, this is hardly a random number; some of our customers have reported much better results), analysts can free up 40% of their time from scrubbing data and use it to do the things they like: playing with data in BI tools, building new models, running more scenarios, producing different views of the data, and discovering things they may not have been able to before, all with clean, trusted data. No more bored-to-death experience; what they are left with is improved productivity, more accurate and consistent results, compelling stories about data, and most important, the chance to focus on doing the things they like. Not too shabby, right?
I am excited about trying out the data quality tools we have here at Informatica. My fellow analysts, you should start looking into them too. I will check back in soon with more stories to share.
I have a little fable to tell you…
This fable has nothing to do with Big Data, but instead deals with an Overabundance of Food and how to better digest it to make it useful.
And it all started when this SEO copywriter from IT Corporation walked into a bar, pub, grill, restaurant, liquor establishment, and noticed 2 large crowded tables. After what seemed like an endless loop, an SQL programmer sauntered in and contemplated the table problem. “Mind if I join you?” he asked. Since the tables were partially occupied and there were no virtual tables available, the host looked out on the patio of the restaurant at 2 open tables. “Shall I do an outside join instead?” asked the programmer. The host considered their schema and assigned 2 seats to the space.
The writer told the programmer to look at the menu, bill of fare, blackboard – there were so many choices but not enough real nutrition. “Hmmm, I’m hungry for the right combination of food, grub, chow, to help me train for a triathlon” he said. With that contextual information, they thought about foregoing the menu items and instead getting in the all-you-can-eat buffer line. But there was too much food available and despite its appealing looks in its neat rows and columns, it seemed to be mostly empty calories. They both realized they had no idea what important elements were in the food, but came to the conclusion that this restaurant had a “Big Food” problem.
They scoped it out for a moment, and then the writer did an about face, reversal, change in direction, and the SQL programmer did a commit and quick pivot toward the buffer line, where they did a batch insert of all of the food, even the BLOBs of spaghetti, mashed potatoes and Jell-O. There was far too much and it was far too rich for their tastes and needs, but they binged and consumed it all. You should have seen all the empty dishes at the end – they even caused a stack overflow. Because it was a batch binge, their digestive tracts didn’t know how to process all of the food, so they got a stomach ache from “Big Food” ingestion – and it nearly caused a core dump – in which case the restaurant host would have assigned his most dedicated servers to perform a thorough cleansing and scrubbing. There was no way to do a rollback at this point.
It was clear they needed relief. The programmer did an ad hoc query to JSON, their Server who they thought was Active, for a response about why they were having such “big food” indigestion, and did they have packets of relief available. No response. Then they asked again. There was still no response. So the programmer said to the writer, “Gee, the Quality Of Service here is terrible!”
Just then, the programmer remembered a remedy he had heard about previously and so he spoke up. “Oh, it’s very easy just <SELECT>Vibe.Data.Stream from INFORMATICA where REAL-TIME is NOT NULL.”
Informatica’s Vibe Data Stream enables streaming food collection for real-time Big Food analytics, operational intelligence, and traditional enterprise food warehousing from a variety of distributed food sources at high scale and low latency. It enables the right food to be ingested at the right time, when nutrition is needed, without any need for binge or batch ingestion.
And so they all lived happily ever after and all was good in the IT Corporation once again.
Download Now and take your first steps to rapidly developing applications that sense and respond to streaming food (or data) in real-time.
A full house, lots of funny names and what does it all mean?
Cloudera, Appfluent and Informatica partnered today at Informatica World in Las Vegas to deliver together a one-day training session, Introduction to Hadoop and Big Data. A technology overview, best practices, and how to get started were on the agenda. Of course, we needed to start off with a little history: processing and computing mattered in the old days too, and even then it was hard to do and very expensive.
Today it’s all about scalability. What Cloudera does is “spread the data and spread the processing,” with Hadoop optimized for scanning lots of data. It’s the Hadoop Distributed File System (HDFS) that slices up the data, taking one slice after another and distributing them across the cluster. MapReduce is then used to spread the processing. How does spreading the data and the processing help us with scalability?
When we spread the data and processing, we need a way to index into the data. How do we do this? We add the Gets and Puts: Get a Row, Put a Row. Basically this is what helps us find a row of data easily. The potential for processing millions of rows of data is more and more a reality for many businesses today. Once we can find and process a row of data easily, we can focus on our data analysis.
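The “spread the data, spread the processing” idea from the session can be sketched in plain Python. This is a toy stand-in for the real framework, not Hadoop itself: each input string plays the role of an HDFS block, and the three functions mimic the map, shuffle, and reduce phases that MapReduce runs across nodes.

```python
from collections import defaultdict

def map_phase(records):
    """Map: emit (key, value) pairs from each input record."""
    for line in records:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does across nodes."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped values for each key."""
    return {key: sum(values) for key, values in groups.items()}

# Each "slice" stands in for an HDFS block that would live on a different node.
slices = ["big data big insight", "big data"]
counts = reduce_phase(shuffle(map_phase(slices)))
print(counts)  # {'big': 3, 'data': 2, 'insight': 1}
```

The scalability comes from the fact that the map and reduce steps have no shared state, so the framework can run them on as many nodes as there are data slices.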
Data analysis: what’s important to you and your business? Appfluent gives us the map to identify data and workloads to offload and archive to Hadoop. It helps us assess what is not necessary to load into the data warehouse. With the exponential growth in the volume and variety of data, the data warehouse will soon cost too much unless we identify what to load and what to offload.
Informatica has the tools to help you process your data: tools that understand Hadoop and that you already use today. This helps you manage these volumes of data in a cost-effective way. Add to that the ability to reuse what you have already developed, and these new tools and technologies become genuinely exciting.
In this Big Data and Hadoop session, #INFA14, you will learn:
- Common terminologies used in Big Data
- Technologies, tools, and use cases associated with Hadoop
- How to identify and qualify the most appropriate jobs for Hadoop
- Options and best practices for using Hadoop to improve processes and increase efficiency
Live action at Informatica World 2014, May 12 9:00 am – 5:00 pm and updates at:
People are obsessed with data. Data captured from our smartphones. Internet data showing how we shop and search — and what marketers do with that data. Big Data, which I loosely define as people throwing every conceivable data point into a giant Hadoop cluster with the hope of figuring out what it all means.
Too bad all that attention stems from fear, uncertainty and doubt about the data that defines us. I blame the technology industry, which, in the immortal words of Cool Hand Luke, has had a “failure to communicate.” For decades we’ve talked the language of IT and left it up to our direct customers to explain the proper care and feeding of data to their business users. Small wonder it’s way too hard for regular people to understand what we, as an industry, are doing. After all, how can we expect others to explain the do’s and don’ts of data management when we haven’t clearly explained it ourselves?
I say we need to start talking about the ABCs of handling data in a way that’s easy for anyone to understand. I’m convinced we can because, if you think about it, everything you learned about data you learned in kindergarten: it has to be clean, safe and connected. Here’s what I mean:
Data cleanliness has always been important, but it takes on real urgency with the move toward Big Data. I blame Hadoop, the underlying technology that makes Big Data possible. On the plus side, Hadoop gives companies a cost-effective way to store, process and analyze petabytes of nearly every imaginable data type. But that is also the problem, as companies go through the enormous time suck of cataloging and organizing vast stores of data. Put bluntly, big data can be a swamp.
The question is how to make it potable. This isn’t always easy, but it’s always, always necessary. It begins, naturally, by ensuring the data is accurate, de-duped and complete.
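Those three criteria, accurate, de-duped, complete, can be made concrete with a small sketch. This is an illustrative toy, not any vendor's cleansing engine; the field names and normalization rules are assumptions chosen for the example.

```python
def clean(records):
    """Normalize, de-dupe, and drop incomplete customer records (illustrative)."""
    seen = set()
    result = []
    for rec in records:
        # Completeness: skip records missing a required field.
        if not rec.get("name") or not rec.get("email"):
            continue
        # Accuracy: normalize casing and stray whitespace before comparing.
        name = rec["name"].strip().title()
        email = rec["email"].strip().lower()
        # De-duplication: treat the normalized email as the record key.
        if email in seen:
            continue
        seen.add(email)
        result.append({"name": name, "email": email})
    return result

raw = [
    {"name": "ann lee ", "email": "Ann@Example.com"},
    {"name": "Ann Lee", "email": "ann@example.com"},   # duplicate after normalizing
    {"name": "", "email": "bob@example.com"},          # incomplete record
]
print(clean(raw))  # [{'name': 'Ann Lee', 'email': 'ann@example.com'}]
```

Note that the duplicate is only visible after normalization; that ordering, normalize first, then de-dupe, is what makes automated cleansing beat eyeballing a spreadsheet.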
Now comes the truly difficult part: Knowing where that data originated, where it’s been, how it’s related to other data and its lineage. That data provenance is absolutely vital in our hyper-connected world where one company’s data interacts with data from suppliers, partners, and customers. Someone else’s dirty data, regardless of origin, can ruin reputations and drive down sales faster than you can say “Target breach.” In fact, we now know that hackers entered Target’s point-of-sales terminals through a supplier’s project management and electronic billing system. We won’t know for a while the full extent of the damage. We do know the hack affected one-third of the entire U.S. population. Which brings us to:
Obviously, being safe means keeping data out of the hands of criminals. But it doesn’t stop there. That’s because today’s technologies make it oh so easy to misuse the data we have at our disposal. If we’re really determined to keep data safe, we have to think long and hard about responsibility and governance. We have to constantly question the data we use, and how we use it. Questions like:
- How much of our data should be accessible, and by whom?
- Do we really need to include personal information, like social security numbers or medical data, in our Hadoop clusters?
- When do we go the extra step of making that data anonymous?
And as I think about it, I realize that everything we learned in kindergarten boils down to the ethics of data: How, for example, do we know if we’re using data for good or for evil?
That question is especially relevant for marketers, who have a tendency to use data to scare people, for crass commercialism, or to violate our privacy just because technology makes it possible. Use data ethically, and we can help change how it is used.
In fact, I believe that the ethics of data is such an important topic that I’ve decided to make it the title of my new blog.
Stay tuned for more musings on The Ethics of Data.
I’m glad to hear you feel comfortable explaining data to your friends, and I completely understand why you’ll avoid discussing metadata with them. You’re in great company – most business leaders also avoid discussing metadata at all costs! You mentioned during our last call that you keep reading articles in the New York Times about this thing called “Big Data” so as promised I’ll try to explain it as best I can.
Data Warehouse Optimization (DWO) is becoming a popular term that describes how an organization optimizes their data storage and processing for cost and performance while data volumes continue to grow from an ever increasing variety of data sources.
Data warehouses are reaching their capacity much too quickly as the demand for more data and more types of data forces IT organizations into very costly upgrades. Further compounding the problem, many organizations don’t have a strategy for managing the lifecycle of their data. It is not uncommon for much of the data in a data warehouse to be unused or infrequently used, or for too much compute capacity to be consumed by extract-load-transform (ELT) processing. This is sometimes the result of one-off business reports that are no longer used, or of staging raw data in the data warehouse.

A large global bank’s data warehouse was exploding with 200TB of data, forcing it to consider an upgrade that would cost $20 million. The bank discovered that much of the data was no longer being used and could be archived to lower-cost storage, thereby avoiding the upgrade and saving millions. The same bank continues to retire data monthly, resulting in ongoing savings of $2-3 million annually. A large healthcare insurance company discovered that fewer than 2% of its ELT scripts were consuming 65% of its data warehouse CPU capacity; it is now looking at Hadoop as a staging platform to offload the storage of raw data and the ELT processing, freeing up the data warehouse to support hundreds of concurrent business users. And a global media & entertainment company saw its data increase 20x per year, and the associated costs increase 3x within 6 months, as it on-boarded more data such as web clickstream data from thousands of web sites and in-game telemetry data.
In this era of big data, not all data is created equal. Most raw data, originating from machine log files, social media, or years of original transaction data, is considered to be of lower value, at least until it has been prepared and refined for analysis. This raw data should be staged in Hadoop to reduce storage and data preparation costs, while data warehouse capacity should be reserved for refined, curated and frequently used datasets. It’s time, therefore, to consider optimizing your data warehouse environment to lower costs, increase capacity, optimize performance, and establish an infrastructure that can support growing data volumes from a variety of data sources. Informatica has a complete solution available for data warehouse optimization.
The first step in the optimization process, as illustrated in Figure 1 below, is to identify inactive and infrequently used data and ELT performance bottlenecks in the data warehouse. Step 2 is to offload the data and ELT processing identified in step 1 to Hadoop. PowerCenter customers have the advantage of Vibe, which allows them to map once and deploy anywhere, so ELT processing executed through PowerCenter pushdown capabilities can be converted to ETL processing on Hadoop as part of a simple configuration step during deployment. In step 3, most raw data, such as original transaction data, log files (e.g. Internet clickstream), social media, sensor device, and machine data, is staged in Hadoop. Informatica provides near-universal connectivity to all types of data so that you can load data directly into Hadoop; you can even replicate entire schemas and files into Hadoop, capture just the changes, and stream millions of transactions per second, such as machine data, into Hadoop. The Informatica PowerCenter Big Data Edition makes every PowerCenter developer a Hadoop developer without having to learn Hadoop, so all ETL, data integration and data quality can be executed natively on Hadoop using readily available skills while increasing productivity up to 5x over hand-coding. Informatica also provides data discovery and profiling tools on Hadoop to help data science teams collaborate and understand their data. The final step is to move the resulting high-value, frequently used datasets prepared and refined on Hadoop into the data warehouse that supports your enterprise BI and analytics applications.
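The spirit of step 1, finding cold data that is a candidate for offloading, can be approximated with nothing more than access statistics. The sketch below is a hypothetical illustration, not part of Informatica's or Appfluent's products; the table names, fields, and the 180-day threshold are all assumptions.

```python
from datetime import date

def offload_candidates(tables, today, cold_days=180):
    """Flag tables that haven't been touched in `cold_days` as staging candidates."""
    candidates = []
    for t in tables:
        idle = (today - t["last_accessed"]).days
        if idle >= cold_days:
            candidates.append((t["name"], t["size_tb"]))
    return candidates

# Hypothetical warehouse access statistics.
stats = [
    {"name": "txn_2009_raw", "size_tb": 40, "last_accessed": date(2013, 1, 5)},
    {"name": "cust_dim",     "size_tb": 2,  "last_accessed": date(2013, 11, 1)},
]
print(offload_candidates(stats, date(2013, 11, 15)))
# [('txn_2009_raw', 40)]
```

In practice the bottleneck analysis also weighs CPU consumed by ELT jobs (recall the insurer whose 2% of scripts ate 65% of capacity), but the principle is the same: measure first, then offload the cold, heavy workloads.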
To get started, Informatica has teamed up with Cloudera to deliver a reference architecture for data warehouse optimization so organizations can lower infrastructure and operational costs, optimize performance and scalability, and ensure enterprise-ready deployments that meet business SLAs. To learn more, please join the webinar A Big Data Reference Architecture for Data Warehouse Optimization on Tuesday, November 19 at 8:00am PST.
Figure 1: Process steps for Data Warehouse Optimization
I missed Strata this year, so I can only report back what I heard from my team. I was out on the road talking with customers while the gang was at Strata, talking to customers and prospective customers of their own. That said, the conversations they had with cool new Hadoop companies and the conversations I had were quite similar. Lots of talk about trials on Hadoop, but outside of the big internet firms, some startups focused on solving “big data” problems, and some Wall Street firms, most companies are still kicking the Hadoop tires.
Which reminds me of a picture my neighbor took of a presentation that he saw on Hadoop. The presenter had a slide with a rehash of an old joke that went something like this (I am paraphrasing here as I don’t have the exact quote):
“Hadoop is a lot like teenage sex. Everyone says they do it, but most are not. And for those who are doing it, most of them aren’t very good at it yet. “
So if you haven’t gotten started on your Hadoop project, don’t worry, you aren’t as far behind as you think.
An explosion in mobile devices and social media usage has been the driving force behind large brands using big data solutions for deep, insightful analytics. In fact, a recent mobile consumer survey found that 71% of people used their mobile devices to access social media.
With social media becoming a major avenue for advertising, and mobile devices being the medium of access, there are numerous data points that global brands can cross-reference to get a more complete picture of their consumer, and their buying propensities. Analyzing these multitudes of data points is the reason behind the rise of big data solutions such as Hadoop.
However, Hadoop itself is only one Big Data framework, and it comes in several different flavors. Facebook, which has called itself the owner of the world’s largest Hadoop cluster, at 100 petabytes, outgrew Hadoop’s capabilities and is looking into technology that would allow it to abstract its Hadoop workloads across several geographically dispersed datacenters.
When it comes to analytics projects that require intensive data warehousing, there is no one-size-fits-all answer for Big Data, as the use cases can be extremely varied, ranging from short-term to long-term. Deploying Hadoop clusters requires specialized skills and proper capacity planning. In contrast, Big Data solutions in the cloud such as Amazon Redshift allow users to provision database nodes on demand in a matter of minutes, without large outlays for infrastructure such as servers and datacenter space. As a result, cloud-based Big Data can be a viable alternative for short-term analytics projects, as well as for sandbox environments to test out larger Big Data integration projects. Cloud-based Big Data may also make sense in situations where only a subset of the data is required for analysis, as opposed to the entire dataset.
With cloud integration, much of the complexity of connecting to data sources and targets is abstracted away. Consequently, when a cloud-based Big Data deployment is combined with a cloud integration solution, it can result in even more time and cost savings and get the projects off the ground much faster.
We’ll be discussing several use cases around cloud-based Big Data in our webinar on August 22nd, Big Data in the Cloud with Informatica Cloud and Amazon Redshift, with special guests from Amazon on the event.
We discussed Big Data and Big Data integration last month, but the rise of Big Data and the systemic use of data integration approaches and technology continues to be a source of confusion. As with any evolution of technology, assumptions are being made that could get many enterprises into a great deal of trouble as they move to Big Data.
Case in point: the rise of big data gave many people the impression that data integration is not needed when implementing big data technology. The notion is that if we consolidate all of the data into a single cluster of servers, then the integration is systemic to the solution. Not the case.
As you may recall, we made many of the same mistakes around the rise of service oriented architecture (SOA). Don’t let history repeat itself with the rise of cloud computing. Data integration, if anything, becomes more important as new technology is layered within the enterprise.
Hadoop’s storage approach leverages a distributed file system that maps data wherever it sits in a cluster. This means that massive amounts of data reside in these clusters, and you can map and remap the data to any number of structures. Moreover, you’re able to work with both structured and unstructured data.
As covered in a recent ReadWrite article, the movement to Big Data does indeed come with built-in business value. “Hadoop, then, allows companies to store data much more cheaply. How much more cheaply? In 2012, Rainstor estimated that running a 75-node, 300TB Hadoop cluster would cost $1.05 million over three years. In 2008, Oracle sold a database with a little over half the storage (168TB) for $2.33 million – and that’s not including operating costs. Throw in the salary of an Oracle admin at around $95,000 per year, and you’re talking an operational cost of $2.62 million over three years – 2.5 times the cost, for just over half of the storage capacity.”
Thus, if these data points are indeed correct, Hadoop clearly enables companies to hold all of their data on a single cluster of servers. Moreover, this data really has no fixed structure. “Fixed assumptions don’t need to be made in advance. All data becomes equal and equally available, so business scenarios can be run with raw data at any time as needed, without limitation or assumption.”
While this process may look like data integration to some, the heavy lifting of supplying these clusters with data is always a data integration problem, requiring the right enabling technology. Indeed, consider what’s required to move data into Big Data systems and you’ll realize why additional strain is placed upon the data integration solution. A Big Data strategy that leverages Big Data technology increases, not decreases, the need for a solid data integration strategy and a sound data integration technology solution.
Big Data is a killer application that most enterprises should at least consider. The business strategic benefits are crystal clear, and the movement around finally being able to see and analyze all of your business data in real time is underway for most of the Global 2000 and the government. However, you won’t achieve these objectives without a sound approach to data integration, and a solid plan to leverage the right data integration technology.
Science fiction represents some of the most impactful stories I’ve read throughout my life. By impactful, I mean the ideas have stuck with me 30 years since I last read them. I recently recalled two of these stories and realized they represent two very different paths for Big Data. One path, quite literally, was towards enlightenment. Let’s just say the other path went in a different direction. The amazing thing is that both of these stories were written between 50-60 years ago.