Tag Archives: Big Data
Unlike some of my friends, I truly enjoyed History in high school and college. I particularly appreciated biographies of favorite historical figures because they put a human face on the past and gave it meaning and color. I also vowed at the time to navigate my life and future by the principle attributed to Harvard professor Jorge Agustín Nicolás Ruiz de Santayana y Borrás, better known as George Santayana: “Those who cannot remember the past are condemned to repeat it.”
So that’s a little ditty about my history with history.
Fast-forwarding to the present, where I have carved out a career in technology, and in particular enterprise software, I’m afforded a great platform for talking with lots of IT and business leaders. When I do, I usually ask them, “How are you implementing advanced projects that help the business become more agile, effective, or opportunistically proactive?” They usually answer something along the lines of “this is the age and renaissance of data science and analytics,” and then end up talking exclusively about their meat-and-potatoes business intelligence projects and how 300 reports now run their business.
Then, when I probe their answers in more depth, I am once again reminded of THE history quote and think to myself that there’s an amusing irony at play here. Most Business Intelligence systems today are designed to “remember” and report on the historical past through large data warehouses holding a gazillion transactions, along with basic but numerous shipping and billing histories and maybe assorted support records.
But when it comes right down to it, business intelligence “history” is still just that: history. Nothing is really learned and applied right when and where it counts – when it would have made all the difference had the company been able to react in time.
So, in essence, by relying on standalone BI systems as they are designed today, companies are indeed condemned to repeat the past: the lesson arrives too late to act on, so the same mistakes will be made again and again.
This means the challenge for BI is to reduce latency, measure the pertinent data, sensors, and events, and become scalable – extremely scalable – and flexible enough to handle the volume and variety of the forthcoming data onslaught.
There’s a part 2 to this story, so keep an eye out for my next blog post, History Repeats Itself (Part 2).
I’m glad to hear you feel comfortable explaining data to your friends, and I completely understand why you’ll avoid discussing metadata with them. You’re in great company – most business leaders also avoid discussing metadata at all costs! You mentioned during our last call that you keep reading articles in the New York Times about this thing called “Big Data,” so, as promised, I’ll try to explain it as best I can.
I missed Strata this year, so I can only report back what I heard from my team. While the gang was at Strata talking to customers and prospective customers, I was out on the road talking with customers myself. That said, the conversations they had with the cool new Hadoop companies and my own conversations were quite similar: lots of talk about Hadoop trials, but outside of the big internet firms, a handful of startups focused on solving “big data” problems, and some Wall Street firms, most companies are still kicking the Hadoop tires.
Which reminds me of a picture my neighbor took of a presentation that he saw on Hadoop. The presenter had a slide with a rehash of an old joke that went something like this (I am paraphrasing here as I don’t have the exact quote):
“Hadoop is a lot like teenage sex. Everyone says they do it, but most are not. And for those who are doing it, most of them aren’t very good at it yet. “
So if you haven’t gotten started on your Hadoop project, don’t worry, you aren’t as far behind as you think.
My wife invited my new neighbors over for dinner this past Saturday night. They are a French couple with a super-cute 5-year-old son. Dinner was nice, and like most expats in the San Francisco Bay Area, he is in high tech. His company is a successful internet company in Europe but has had a hard time penetrating the U.S. market, which is why they moved to the Bay Area. He is starting up a satellite engineering organization in Palo Alto, and he asked me where he can find good “big data” engineers. He is having a hard time finding people.
This is a story I am hearing quite a bit from the customers I have been talking to as well. They want to start up big data teams but can’t find enough skilled engineers who understand how to develop in Pig or Hive or YARN or whatever is coming next in the Hadoop/MapReduce world.
This reminds me of when I worked in the telecom software business 20 years ago, when everyone was looking at technologies like DCE and CORBA to build out distributed computing environments and solve complex problems that couldn’t be solved easily on a single computing system. If you don’t know what DCE or CORBA are, that’s OK – that is kind of the point. They were distributed computing development platforms that failed because they were too damn hard, and there just weren’t enough people who could understand how to use them effectively. Now, DCE and CORBA were not trying to solve the same problems as Hadoop, but the basic point still stands: they were damn hard, and the reality is that programming on a Hadoop platform is damn hard as well.
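To see what “damn hard” means in practice, consider the programming model itself. Below is a toy word count written in the MapReduce style in plain Python – a sketch of the model only, not actual Hadoop code – and even this stripped-down version forces the developer to think in separate map, shuffle, and reduce phases:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in a line of input.
    for word in line.split():
        yield word.lower(), 1

def reducer(word, counts):
    # Reduce phase: sum all the counts emitted for one word.
    return word, sum(counts)

def mapreduce(lines):
    # Shuffle phase: sort and group the intermediate pairs by key,
    # as the framework would do between mappers and reducers.
    pairs = sorted(kv for line in lines for kv in mapper(line))
    return dict(
        reducer(word, (count for _, count in group))
        for word, group in groupby(pairs, key=itemgetter(0))
    )

counts = mapreduce(["big data is big", "data is everywhere"])
# counts -> {"big": 2, "data": 2, "everywhere": 1, "is": 2}
```

On a real cluster, each of those phases also drags in job configuration, serialization, and failure handling, which is exactly the complexity the tooling vendors are now racing to hide.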
So could Hadoop fail, just like CORBA and DCE? I doubt it, for a few key reasons. First, there is a considerable amount of venture and industrial investment going into making Hadoop work. Not since Java has there been such a concerted effort by the industry to make a new technology successful. Second, much of that investment is going into graphical development environments and applications that use the storage and compute power of Hadoop but hide its complexity. That is what Informatica is doing with PowerCenter Big Data Edition: we are making it possible for data integration developers to parse, cleanse, transform, and integrate data using Hadoop as the underlying storage and engine, without the developer having to know anything about Hadoop. The same thing is happening at the analytics layer, the data prep layer, and the visualization layer.
Bit by bit, software vendors are hiding the underlying complexity of Hadoop so organizations won’t have to hire an army of big data scientists to solve interesting problems. They will still need a few of them, but not so many that Hadoop will end up like those other technologies that most Hadoop developers have never even heard of.
Power to the elephant. And more later about my dinner guest and his super-cute 5-year-old son.
Everyone knows that Informatica is the Data Integration company that helps organizations connect their disparate software into a cohesive and synchronous enterprise information system. The value to business is enormous and well documented in the form of use cases, ROI studies and loyalty / renewal rates that are industry-leading.
Event Processing, on the other hand, is a technology that has been around for only a few years and has yet to reach Main Street in Systems City, IT. But if you look at how event processing is being used, it’s amazing that more people haven’t heard of it. The idea at its core (pun intended) is very simple: monitor your data and events – those things that happen on a daily, hourly, or minute-by-minute basis – look for important patterns that are positive or negative indicators, and set up your systems to take action automatically when those patterns come up – like notifying a sales rep when a pattern indicates a customer is ready to buy, or stopping a transaction when a pattern shows your company is about to be defrauded.
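The core idea fits in a few lines of code. Here is a minimal sketch of that monitor-match-act loop – an invented illustration of the pattern, not Informatica’s engine – which flags a card when three or more transactions arrive within a sliding 60-second window:

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 60   # how far back the pattern looks
THRESHOLD = 3         # how many events within the window trip it

def make_detector(window=WINDOW_SECONDS, threshold=THRESHOLD):
    recent = defaultdict(deque)  # card_id -> timestamps of recent events
    def on_event(card_id, timestamp):
        q = recent[card_id]
        q.append(timestamp)
        # Drop events that have slid out of the time window.
        while q and timestamp - q[0] > window:
            q.popleft()
        # True means "take action now": alert a rep, block the transaction.
        return len(q) >= threshold
    return on_event

detect = make_detector()
events = [("card-1", 0), ("card-1", 10), ("card-2", 15), ("card-1", 20)]
alerts = [card for card, t in events if detect(card, t)]
# card-1's third event inside the window trips the pattern; card-2 does not
```

The point of the sketch is the shape of the problem: the decision is made as each event streams in, not in a report run the next morning.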
Since this is an Informatica blog, you probably have a decent set of “muscles” in place already, so why, you ask, would you need 6-pack abs? Because 6-pack abs are a good indication of a strong core musculature and are the basis of a stable and highly athletic body. The parallel holds for companies: in today’s competitive business environment, you need strength, stability, and agility to compete. And since IT systems increasingly ARE the business, if your company isn’t performing as strong, lean, and mean as possible, you can be sure your competitors will be looking to implement every advantage they can.
You may also be wondering why you would need something like Event Processing when you already have good Business Intelligence systems in place. The reality is that it’s not easy to monitor and measure useful but sometimes hidden data, event, sensor, and social media sources, or to discern which patterns have meaning and which turn out to be false positives. But the real difference is that BI usually reports to you after the fact, when the value of acting on the situation has diminished significantly.
So while muscles are important to be able to stand up and run, and good quality, strong muscles are necessary to do heavy lifting, it’s those 6 pack abs on top of it all that give you the mean lean fighting machine to identify significant threats and opportunities amongst your data, and in essence, to better compete and win.
We’ve posted three compelling new articles to the Potential at Work for Information Leaders site, including:
- “Will the real Chief Data Officer please stand up?” Some question the need for a new C-level position, arguing that a company’s chief information officer should be the one to oversee an organization’s data. Others argue the CIO is stretched too thin already and a new type of leader must emerge. Where do you stand?
- “Introducing a ‘define once, govern everywhere’ data management style” The sanity afforded by defining data standards only once and applying them anywhere will create time to investigate innovative uses for that data. Information leaders will be much more successful if they spend less time managing projects to recode the same rules across every new application, and instead work with business partners to identify new information opportunities.
- “Rise of the machines: the Internet of Things” Are devices that track our every move poised to unlock new potential in humankind or are they just downright invasive? While privacy remains a critical consideration, this article illustrates the global potential if we can effectively leverage big data to harness the emerging Internet of Things.
For these articles and many more, check out the Potential at Work for Information Leaders community today. Available in nine languages, this site will continue to feature fresh, new ideas to promote the value of information management from a variety of top technology leaders.
“Raw data is both an oxymoron and a bad idea. On the contrary, data should be cooked with care.” Geoff Bowker made this statement in 2005, and it served as the opening line of a recent talk by Kate Crawford, principal researcher at Microsoft Research New England and a visiting MIT professor, who urged that big data be adopted and handled cautiously.
In her keynote at the recent DataEDGE 2013 conference, held at the University of California at Berkeley, Crawford said the time is now to have a discussion on the implications big data is having on business and society.
She outlined the six myths that have arisen around big data:
Myth #1: Big data is new. References to big data began to pop up in the literature in the late 1990s, but this is something some prominent industries, such as financial services firms and oil companies, have been wrestling with for decades, Crawford says. What is new, however, “is the fact that a lot of the tools of big data are becoming more easily reached by a lot more people. We’re having an explosion in ideas, creativity and imagination in terms of what we can do with these technologies.” This is the time to discuss the implications of big data, she adds, because much of it will be invisible within a few years as the tools and technologies mature. “Really usable systems and really good technologies disappear,” she states. “The easier they are to use, the harder they are to see.”
Myth #2: Big data is objective. Actually, big data sets can be very biased, Crawford states. For example, she says, she pored over 20 million tweets sent out about Hurricane Sandy, which flooded her neighborhood in Manhattan last year. While the tweets tell a compelling story about how residents coped, they mainly represent the views of younger, more well-to-do Manhattan residents. “If we look a little closer at the tweets, most were coming out of Manhattan, which has a higher concentration of people using smartphones, and a higher concentration of Twitter users – a subset of a subset. There were very few tweets coming from the far more affected areas, such as Breezy Point or the Far Rockaways. Because we don’t have the data from those places, we essentially have very privileged urban stories. We have to be really clear who we’re talking about; we have to think about what this data really represents,” she says.
Myth #3: Big data doesn’t discriminate. “There’s a myth that says essentially because you’re dealing with large data sets, you can somehow avoid group-level prejudice,” Crawford cautions. She pointed to a recent study of the Facebook “likes” of 60,000 people that found such data can be used to identify a person’s race, sexual orientation, religious views, political leanings, and even whether they are a previous drug or alcohol user. “The researchers also expressed a set of concerns that this data can be bought by anyone. Ultimately, employers can make decisions about individuals based on this data.”
Myth #4: Big data makes cities smarter. While big data goes a long way toward improving the management of city problems, it also may under-represent communities. “Not all data is created or collected equally – there are always certain communities of people who are going to be left out of those data sets,” Crawford says. For example, last year the city of Boston released an app called StreetBump, which automatically registered potholes by passively collecting GPS data from drivers’ smartphones. The program collected a great deal of data on potholes. However, she adds, “wealthier, younger citizens are more likely to have smartphones, and therefore wealthy areas with younger people would get more attention, while areas with older residents with less money would get fewer resources.”
Myth #5: Big data is anonymous. Crawford cited a recent study, published in Nature, which determined that individuals could be identified with no more than four data points, including their cell phone number. Before the advent of personal technology, it took about 12 data points to identify an individual. “It’s very difficult to make data anonymous – even with two randomized data points, it’s possible to identify 50 percent of people.” Another big data initiative, the smart grid being adopted by electric utilities, will capture a wealth of data – from energy usage to “when you have friends over, when you are sleeping. This is some very intimate data.”
Myth #6: You can opt out of big data. There are suggestions that people will be able to protect their privacy if they pay a fee for web services to opt out of tracking, rather than using services for free in exchange for giving up some information. Crawford cautions that this would result in a two-tier system, which “turns private data into a luxury good rather than a public good.”
Rather than making data privacy and management an individual choice, Crawford urges a more public discussion on “the way that the data is essentially flowing between corporations, individuals and governments.”
The drive to achieve competitive advantage with Big Data is creating a lot of interesting opportunities for managers and professionals working in the data analytics space. In some cases, new job categories not imaginable just a few years back are being created, and are in demand. “Data scientist” is only one of these descriptions, and there are jobs that don’t require a Ph.D. in statistics. Sometimes it just takes a little creative thinking to move one’s career in a new and different direction.
In a recent Forbes post, J. Maureen Henderson, head of a market research firm, discussed the ways emerging Big Data expertise is being leveraged for business problems. One University of Tennessee student, for example, is pursuing a post-graduate degree in Big Data analytics to “tell stories from data,” noting that in a previous job, she saw that “there were plenty of talented ‘data’ people and plenty of the talented ‘business’ people; however, the people who could do both were extraordinarily valuable to the firm and to my team’s ability to solve problems. That really got my wheels turning, and I started thinking about what other problems I might be able to solve if I knew more about analytics.”
In the process of exploring the avenues by which big data will deliver value to businesses, some interesting new job titles and descriptions are emerging across the industry. The new generation of jobs being spurred by Big Data are often a blend of stats-savvy and business-savvy skillsets and activities.
Here is a sampling of a few of these blended positions that have recently appeared at online recruiting sites:
Industry analytics manager (pharmaceutical): “Collaborate with cross-functional partners in industry analytics, market analysis and strategy, the managed care contracting organization and brand marketing teams to consult with and deliver deep insights and actionable strategic and tactical recommendations on access and reimbursement drivers of the business. Demonstrate ability to break complex problems down into distinct parts, simplify complexity, and manage uncertainty.”
Data scientist/machine learning expert (online consumer site): “Data science team is looking for a data scientist to work on machine learning, data mining and information retrieval problems. Perform complex analysis on very large data sets using data mining, machine learning and graph analysis algorithms. Build complex predictive models that scale to petabytes of data. Define metrics, understand A/B testing and statistical measurement of model quality. Work closely with product, engineering, and marketing teams to identify, collect and analyze data to answer critical product questions.”
Data anthropologist (marketing firm): “Leverage huge data set and powerful analytical tools to give the public more insight into the digital world. Keep up-to-the-minute with current events and understand the online ecosystem—publishers, consumers, advertisers, analysts, bloggers, vendors, journalists and others who make the internet buzz. Craft stories to create buzz, backed by our data, and share them as tweets or blog posts or press releases or white papers or whatever best suits the material.”
Data scientist/data lover (scholarship fund): “Define and implement the social media measurement strategies and business intelligence analytics that align with marketing and business objectives; perform qualitative, statistical and quantitative analysis; producing meaningful marketing KPI dashboards and delivering routine and ad hoc, cross-channel performance reports with actionable insight. The candidate should be able to identify correlations in cause and effect of email and online/MR and social media integrated campaigns resulting in increased individual donations and stakeholder engagement.”
Systems engineer – big data (game publisher): “Players continue to rack up billions of hours of play — all of it logged, all of the logs frankly rather useless until our lab-coated Big Data scientists work their black magics, transmogrifying unwieldy petabytes through the careful application of open-source and proprietary technologies and bucketloads of intellectual elbow grease. As Systems Engineer – Big Data you’ll provide ongoing support for data warehouse and data services infrastructure and systems, ensuring the Big Data van keeps rolling, come hell, high water or technical difficulties. Your exceptional communication skills will help you as you smooth the transition from raw data to actionable insight about the players.”
Data visualization engineer, streaming platforms (streaming video provider): “Own and build new, high-impact visualizations in our insight tools to make data both understandable and actionable. Develop rich interactive graphics, data visualizations of large amount of structured data, preferably in-browser. To deliver operational insights quickly and effectively, we need an excellent suite of interactive tools and dynamic dashboards. Our goal is to raise our operational insight capabilities to a whole new level of excellence that enables us to continuously improve our product while ensuring a flawless experience for our customers.”
An explosion in mobile devices and social media usage has been the driving force behind large brands using big data solutions for deep, insightful analytics. In fact, a recent mobile consumer survey found that 71% of people used their mobile devices to access social media.
With social media becoming a major avenue for advertising, and mobile devices being the medium of access, there are numerous data points that global brands can cross-reference to get a more complete picture of their consumer, and their buying propensities. Analyzing these multitudes of data points is the reason behind the rise of big data solutions such as Hadoop.
However, Hadoop itself is only one Big Data framework, and it comes in several different flavors. Facebook, which has called itself the owner of the world’s largest Hadoop cluster, at 100 petabytes, has outgrown Hadoop’s capabilities and is looking into technology that would allow it to abstract its Hadoop workloads across several geographically dispersed datacenters.
When it comes to analytics projects that require intensive data warehousing, there is no one-size-fits-all answer for Big Data, as the use cases can be extremely varied, ranging from short-term to long-term. Deploying Hadoop clusters requires specialized skills and proper capacity planning. In contrast, cloud-based Big Data solutions such as Amazon Redshift allow users to provision database nodes on demand in a matter of minutes, without having to account for large infrastructure outlays such as servers and datacenter space. As a result, cloud-based Big Data can be a viable alternative for short-term analytics projects, as well as for sandbox environments used to test out larger Big Data integration projects. It may also make sense in situations where only a subset of the data is required for analysis, as opposed to the entire dataset.
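To give a feel for what “provisioning in minutes” looks like, here is a hypothetical sketch of spinning up a small Redshift sandbox from code. The parameter names mirror the AWS Redshift CreateCluster API, but the cluster name, node type, and credentials below are made up for illustration:

```python
# Hypothetical parameters for an on-demand Redshift sandbox cluster.
# All identifier and credential values here are placeholders.
cluster_params = {
    "ClusterIdentifier": "analytics-sandbox",  # made-up cluster name
    "NodeType": "dw1.xlarge",                  # example node class
    "ClusterType": "single-node",              # no capacity planning up front
    "MasterUsername": "admin",
    "MasterUserPassword": "CHANGE-ME-123",     # placeholder, never hard-code
    "DBName": "sandbox",
}

# With the AWS SDK for Python installed and credentials configured,
# the provisioning call would be roughly:
#
#   import boto3
#   boto3.client("redshift").create_cluster(**cluster_params)
#
# Tearing the sandbox down afterward is a single delete_cluster call,
# which is what makes short-term projects economical.
```

Contrast that handful of parameters with the hardware procurement and capacity planning a comparable on-premise Hadoop deployment would require.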
With cloud integration, much of the complexity of connecting to data sources and targets is abstracted away. Consequently, when a cloud-based Big Data deployment is combined with a cloud integration solution, it can result in even more time and cost savings and get the projects off the ground much faster.
We’ll be discussing several use cases around cloud-based Big Data in our webinar on August 22nd, Big Data in the Cloud with Informatica Cloud and Amazon Redshift, with special guests from Amazon joining the event.