The greatest challenge to big data management and analysis isn’t necessarily the technical underpinnings, but rather, lingering executive confusion and uncertainty about what it is and what it can do for their organizations.
The main issue – and root of executive befuddlement – is the abject and ongoing confusion about what, exactly, is meant by “big data.” It’s certainly a hyped-up term for something that has been around for a long time. If you had a one-terabyte database around the turn of the century, you had big data, that’s for sure. If you had a 500-megabyte database in 1990, that would have been big data.
So the “volume” has always been there, and has always been a relative measure. The same goes for the “variety” aspect of big data. Unstructured data – such as Word documents or machine log data – has been floating around organizations for decades now. How about the “velocity”? Real-time processing has been on corporate radar screens for well over a decade.
So, what’s changed that we suddenly see this data as an enabler, a game-changer, opening up the gates to a brave new world of analytics-driven purpose? The rise of relatively cheap open-source tools and platforms, for one. Capturing and analyzing large volumes of fast-moving data of various structures once required very expensive equipment and consulting assistance. The expensive consultants may still be needed, but the technology is now within reach for many organizations.
With this in mind, it is interesting to see that business leaders are warming up to the possibilities of big data, as a new industry survey shows. But exactly what they think they’re warming up to is still a big question mark. The survey, conducted among 500 business and IT executives by CompTIA, shows the big data phenomenon has caught the eyes of executives. The vast majority of organizations, 78%, say they feel more positive about big data as a business initiative this year compared to a similar survey conducted a year ago. And, remarkably, 57% feel they’ve made progress in moving in the right direction with data-driven programs, compared with 37% the year before.
To their credit, the CompTIA study’s authors question how accurately these findings actually translate to progress on the big data front. They note that while this year’s survey finds 42% of respondents claiming to be engaged in some sort of big data initiative – more than double the share a year ago (19%) – such initiatives may be “big data” in name only. “This may stem from confusion or reflect the possibility of different users interpreting the concept of big data in different ways,” they observe.
So what we have is a lot of organizations diving into what they see as “big data” projects because that’s what everybody tells them they should be doing. But how much of this is simply the same types of data management and analytics projects that were already under way five or 10 years ago?
To really be making the most of big data as we understand it today, organizations should be addressing the following questions:
How much unstructured data is coursing through the organization, and how much of it is worth harvesting? It’s usually easy to measure the amount of structured data, such as that stored in relational databases or data warehouses, but unstructured data is a huge question mark. In many cases, management is clueless about what types of unstructured assets (user-generated files, machine-generated data) are actually available. It’s going to take a lot of research and discovery to uncover the unstructured data assets that are truly meaningful for the business.
Does the current data architecture support the introduction and integration of new data sources? Most traditional data architectures are fairly rigid, built to support the inputs and outputs of relational data. Efforts involving other forms of data tend to be one-off projects, in which connectors or interfaces are hand-built for a single purpose and then forgotten. Reaching out and exploring new and varied types of data requires an architecture in which new sources can be rapidly and seamlessly introduced, without the usual silos.
Is the organization moving to an analytics culture? Big data analytics will never be “big” if it is available only to a few select decision makers or analysts. Big data will pack its punch when it enables decision makers at all levels of the organization – from customer care centers to production floors to the executive suite – to access analytics from various data sources. Even more helpful would be a way for decision makers to reach analytical tools and back-end data sources through self-service approaches.
“Raw data is both an oxymoron and a bad idea. On the contrary, data should be cooked with care.” This was a statement made by Geoff Bowker in 2005, and served as the opening lines of a recent talk by Kate Crawford, principal researcher at Microsoft Research New England and a visiting MIT professor, who urged that big data be adopted and handled cautiously.
In her keynote at the recent DataEDGE 2013 conference, held at the University of California at Berkeley, Crawford said the time is now to have a discussion on the implications big data is having on business and society.
She outlined the six myths that have arisen around big data:
Myth #1: Big data is new. References to big data began to pop up in the literature in the late 1990s, but this is something some prominent industries, such as financial services firms and oil companies, have been wrestling with for decades, Crawford says. What is new, however, “is the fact that a lot of the tools of big data are becoming more easily reached by a lot more people. We’re having an explosion in ideas, creativity and imagination in terms of what we can do with these technologies.” This is the time to discuss the implications of big data, she adds, because much of it will be invisible within a few years as the tools and technologies mature. “Really usable systems and really good technologies disappear,” she states. “The easier they are to use, the harder they are to see.”
Myth #2: Big data is objective. Actually, big data sets can be very biased, Crawford states. For example, she says, she pored through 20 million tweets sent out about Hurricane Sandy, which flooded her neighborhood in Manhattan last year. While the tweets tell a compelling story about how residents coped, they mainly represent the views of younger, more well-to-do Manhattan residents. “If we look a little closer at the tweets, most were coming out of Manhattan, which has a higher concentration of people using smartphones, and a higher concentration of Twitter users – a subset of a subset. There were very few tweets coming from the far more affected areas, such as Breezy Point or the Far Rockaways. Because we don’t have the data from those places, we essentially have very privileged urban stories. We have to be really clear who we’re talking about, we have to think about what this data really represents,” she says.
Myth #3: Big data doesn’t discriminate. “There’s a myth that says essentially because you’re dealing with large data sets, you can somehow avoid group-level prejudice,” Crawford cautions. She pointed to a recent study of the Facebook “likes” of 60,000 people that found such data can be used to identify a person’s race, sexual orientation, religious views, political leanings, and even whether they have previously used drugs or alcohol. “The researchers also expressed a set of concerns that this data can be bought by anyone. Ultimately, employers can make decisions about individuals based on this data.”
Myth #4: Big data makes cities smarter. While big data goes a long way to improve the management of city problems, it also may under-represent communities. “Not all data is created or collected equally – there are always certain communities of people who are going to be left out of those data sets,” Crawford says. For example, last year, the city of Boston released an app called StreetBump, which automatically registered potholes by passively collecting GPS data from drivers’ smartphones. The program collected a great deal of data on potholes. However, she adds, “wealthier younger citizens are more likely to have smartphones, and therefore, wealthy areas with younger people would get more attention, while areas with older residents with less money will get fewer resources.”
Myth #5: Big data is anonymous. Crawford cited a recent study, published in Nature, which determined that individuals could be identified with no more than four data points, including their cell phone number. Before the advent of personal technology, it took about 12 data points to identify an individual. “It’s very difficult to make data anonymous – even with two randomized data points, it’s possible to identify 50 percent of people.” Another big data initiative, the smart grid being adopted by electric utilities, will capture a wealth of data – from energy usage to “when you have friends over, when you are sleeping. This is some very intimate data.”
Myth #6: You can opt out of big data. There are suggestions that people will be able to protect their privacy if they pay a fee for web services to opt out of tracking, versus using services for free in exchange for giving up some information. Crawford cautions that this will result in a two-tier system, which “turns private data into a luxury good rather than a public good.”
Rather than making data privacy and management an individual choice, Crawford urges a more public discussion on “the way that the data is essentially flowing between corporations, individuals and governments.”
The drive to achieve competitive advantage with Big Data is creating a lot of interesting opportunities for managers and professionals working in the data analytics space. In some cases, new job categories not imaginable just a few years back are being created, and are in demand. “Data scientist” is only one of these descriptions, and there are jobs that don’t require Ph.Ds in statistics. Sometimes, it just takes a little creative thinking to move one’s career in a new and different direction.
In a recent Forbes post, J. Maureen Henderson, head of a market research firm, discussed the ways emerging Big Data expertise is being leveraged for business problems. One University of Tennessee student, for example, is pursuing a post-graduate degree in Big Data analytics to “tell stories from data,” noting that in a previous job, she saw that “there were plenty of talented ‘data’ people and plenty of the talented ‘business’ people; however, the people who could do both were extraordinarily valuable to the firm and to my team’s ability to solve problems. That really got my wheels turning, and I started thinking about what other problems I might be able to solve if I knew more about analytics.”
In the process of exploring the avenues by which big data will deliver value to businesses, some interesting new job titles and descriptions are emerging across the industry. The new generation of jobs being spurred by Big Data is often a blend of stats-savvy and business-savvy skillsets and activities.
Here is a sampling of a few of these blended positions that have recently appeared at online recruiting sites:
Industry analytics manager (pharmaceutical): “Collaborate with cross-functional partners in industry analytics, market analysis and strategy, the managed care contracting organization and brand marketing teams to consult with and deliver deep insights and actionable strategic and tactical recommendations on access and reimbursement drivers of the business. Demonstrate ability to break complex problems down into distinct parts, simplify complexity, and manage uncertainty.”
Data scientist/machine learning expert (online consumer site): “Data science team is looking for a data scientist to work on machine learning, data mining and information retrieval problems. Perform complex analysis on very large data sets using data mining, machine learning and graph analysis algorithms. Build complex predictive models that scale to petabytes of data. Define metrics, understand A/B testing and statistical measurement of model quality. Work closely with product, engineering, and marketing teams to identify, collect and analyze data to answer critical product questions.”
Data anthropologist (marketing firm): “Leverage huge data set and powerful analytical tools to give the public more insight into the digital world. Keep up-to-the-minute with current events and understand the online ecosystem—publishers, consumers, advertisers, analysts, bloggers, vendors, journalists and others who make the internet buzz. Craft stories to create buzz, backed by our data, and share them as tweets or blog posts or press releases or white papers or whatever best suits the material.”
Data scientist/data lover (scholarship fund): “Define and implement the social media measurement strategies and business intelligence analytics that align with marketing and business objectives; perform qualitative, statistical and quantitative analysis; producing meaningful marketing KPI dashboards and delivering routine and ad hoc, cross-channel performance reports with actionable insight. The candidate should be able to identify correlations in cause and effect of email and online/MR and social media integrated campaigns resulting in increased individual donations and stakeholder engagement.”
Systems engineer – big data (game publisher): “Players continue to rack up billions of hours of play — all of it logged, all of the logs frankly rather useless until our lab-coated Big Data scientists work their black magics, transmogrifying unwieldy petabytes through the careful application of open-source and proprietary technologies and bucketloads of intellectual elbow grease. As Systems Engineer – Big Data you’ll provide ongoing support for data warehouse and data services infrastructure and systems, ensuring Big Data van keeps rolling, come hell, high water or technical difficulties. Your exceptional communication skills will help you as you smooth the transition from raw data to actionable insight about the players.”
Data visualization engineer, streaming platforms (streaming video provider): “Own and build new, high-impact visualizations in our insight tools to make data both understandable and actionable. Develop rich interactive graphics, data visualizations of large amount of structured data, preferably in-browser. To deliver operational insights quickly and effectively, we need an excellent suite of interactive tools and dynamic dashboards. Our goal is to raise our operational insight capabilities to a whole new level of excellence that enables us to continuously improve our product while ensuring a flawless experience for our customers.”
In June, I was invited to present and participate in a panel discussion at a special program on Big Data at Stevens Institute of Technology in Hoboken, New Jersey.
But my role wasn’t to join the other speakers and help pay homage to the power and potential of Big Data. Rather, I was asked by the organizer, professor Lem Tarshis, to play “Devil’s Advocate,” and talk about the issues and challenges Big Data brings up.
Indeed, there has been some pushback against Big Data, alleging that its potential for knowledge advancement is being over-promised, that its legal implications are not well understood, and that it may be outright dangerous for business leaders to base decisions on erroneous assumptions.
I began my talk with a little bit of history – close to 30 years ago, to be exact:
On September 26, 1983, the United States was rebuilding its nuclear arsenal, the Soviet Union was still the Evil Empire, and there was no trust between the two superpowers. In fact, the leaders of the Soviet Union were almost paranoid that the U.S. was planning a surprise attack against them. NATO was conducting war exercises at the time. Everyone was on hair-trigger alert. On the night of September 26th, the officer in charge of the Soviet Air Defense Forces was ill, so another officer, Stanislav Yevgrafovich Petrov, filled in.
Not long after the shift started, the center received a warning from one of its satellites that an ICBM launch had just taken place from the United States. All the systems were flashing red. Petrov looked at it and decided: It’s just one missile. If they were attacking, they wouldn’t just launch a single missile. So he overrode the attack warning. But then the center was alerted that a second missile had been launched from the midwestern U.S. Still, Petrov was undaunted. Then, there were alarms for a third launch. Then a fourth launch. Then a fifth launch.
I imagine many Soviet apparatchiks would have reflexively hit that red launch button at that point. But Petrov kept his cool. He had no information confirming whether the US launch reports were real or erroneous. He only had his gut at that moment. But something in his gut told him that this wasn’t the real thing. And he chose not to put through an order for a massive Soviet missile retaliation.
It turns out Petrov’s gut instinct was correct, of course. The stationary Soviet satellite above the continental U.S. was actually picking up glints of sunlight that were coming over the horizon, and mistaking them for missile launches. The data that was streaming into the Soviet command center was erroneous data.
But that was 1983, a long time ago, with old Soviet technology, right? Our systems and data feeds are all perfect and flawless now, right?
Well, technology is more advanced, and yes, misreading Big Data doesn’t have to mean the end of the world. But perhaps every organization could use a Stanislav Petrov on staff. Someone who thinks critically, who can question the results the data is providing and put it into context.
Consider how, just a couple of months ago, someone hijacked the AP Twitter account with a false report of an attack on the White House. Sensing the immediate swoon in stocks, the high-frequency trading algorithms kicked into high gear and sent major US stock indexes into a nosedive, all in three minutes’ time.
A recent survey of 300 financial executives released by Experian finds that most executives feel they lack enough accurate information to successfully perform daily operations or make decisions. The main challenges identified by respondents are outdated information, linking different sources of information and inaccurate data. On average, companies thought that 25 percent of their data was inaccurate. Only 13 percent of companies thought the problems with their data were small enough that they did not require further investment.
In big data scenarios, you have managers not trained in statistics making bet-the-business decisions based on data of unknown quality originating from unvetted sources. Data analysts and scientists can write the algorithms that extract the data, but they aren’t necessarily in a position to understand the business implications.
That’s why, even though Big Data analytics is providing a lot of new types of information organizations can act on, business leaders and managers still need to understand the sources of this data, and how systems are delivering the information they will bet the business on. What is the source of the information? Are there other potential sources that will help build a conclusion? And, very importantly: What is the context of this data?
To be successful at Big Data, it’s incumbent upon organizations to encourage critical thinking among business users of the data.
No matter how well integrated and powerful your back-end resources may be for managing Big Data, it’s all for naught if information can’t be effectively delivered and presented over that last 100 feet to decision-makers. It’s kind of like having a sophisticated power grid supporting the generation and transmission of electricity, but the consumer at home can’t figure out where the switch is to turn on the lamp.
That’s where data visualization can make all the difference. Yes, graphical displays of data have been around for more than a couple of decades now. I remember, back in the earliest days of the PC revolution, using a package called Harvard Graphics, which did a nice job of converting rows and columns of data into snazzy bar charts or pie charts. The spreadsheet makers recognized the power of visual representations of data, and also incorporated graphical capabilities into their products.
Now, there is an emerging class of front-end visualization tools that convert data points into visual displays – often stunning – that enable users to spot anomalies or trends in seconds. They are also referred to as 3D visualizations, but there is a fourth dimension involved as well – time. Interfaces can be moved back in time – or forward if predictive analytics is available – to show how selected scenarios will change within a specified timeline.
If you want an illustration of what visualization can look like for enterprises, let’s broaden our horizons for a moment – really broaden our horizon. The Google Data Arts Team recently designed an interactive 3D map of the universe called “100,000 Stars.”
The 100,000 Stars interface enables you to zoom in on our own planet, then zoom out to the solar system, with our Sun at the center, then zoom over to the closest adjoining star and its solar system. Click on specific stars and planets, and you will get a brief description. Zoom out further, and you see we’re actually in one of the arms of the pinwheel of the Milky Way galaxy.
Imagine similar visualizations for business problem and opportunity areas, and you get what I mean – it’s out of this world. You can plot your data points, as well as even plot time to see how trends unfold. You already see this with those weather maps that move two, three days into the future. It turns the data into a physical object that you can view from different angles or timespans. It really brings data alive, and drives home any points that need to be made.
And we’re not just talking about “spacy” visualizations either. You may have seen, on some websites, the use of word “clouds,” for example. The terms getting the most usage are in the largest fonts, so at a glance, a viewer can see what the hottest topic may be.
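To make the word-cloud idea concrete, here is a minimal Python sketch of the arithmetic behind one: count how often each term appears, then scale the counts to font sizes. The stop-word list, the 10-to-48-point range and the function names are my own assumptions for illustration, not taken from any particular visualization product.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "for", "on"}


def term_counts(text):
    """Count how often each non-stop-word term appears (case-insensitive)."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def font_sizes(counts, min_pt=10, max_pt=48):
    """Scale each term's count linearly to a font size between min_pt and max_pt."""
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    span = hi - lo
    sizes = {}
    for term, n in counts.items():
        # If every term has the same count, give them all the smallest size.
        fraction = (n - lo) / span if span else 0.0
        sizes[term] = round(min_pt + fraction * (max_pt - min_pt))
    return sizes


if __name__ == "__main__":
    sample = "Big data, big analytics: data visualization makes big data usable."
    sizes = font_sizes(term_counts(sample))
    for term, size in sorted(sizes.items(), key=lambda kv: -kv[1]):
        print(f"{term}: {size}pt")
```

A rendering layer would still have to lay the words out on a page; the sketch covers only the part that decides which terms end up biggest.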
In his latest book, Data Points: Visualization That Means Something, Nathan Yau makes the case for applying visualization against the toughest business and societal problems, as well as to uncover new opportunities that could not be considered previously in our flatter, 2D world. Ultimately, with data visualization, one can’t help but spot the trend or anomaly almost instantaneously:
“When you look at visualization for the first time, your eyes dart around trying to find a point of interest. Actually, when you look at anything, you tend to spot things that stand out, such as bright colors, shapes that are bigger than the rest, or people who are on the long tail of the height curve.”
In a report published in 2011, Tony DeSantis, Mathew Gentile and Rich Simon, all of Deloitte, provide a down-to-earth, on-the-ground example of how visualization can deliver business value: spotting potentially fraudulent invoices within an enterprise accounts payable department. “A traditional detection technique would be to list the invoice or purchase order numbers on a spreadsheet and sort them to identify numbers that are repeated, occur out of sequence, or increase by unusually small amounts over time, which suggests that the vendor has few or no other customers,” they point out. A visual graphic, on the other hand, will quickly make such anomalies blindingly obvious.
As DeSantis and his team put it: “Visual analytics builds on humans’ natural ability to absorb a greater volume of information in visual than in numeric form, and to perceive certain patterns, shapes and shades more easily than others. Using mathematical techniques to evaluate patterns and outliers, effective visuals can translate multidimensional data such as frequency, time and relationships into an intuitive picture.”
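For comparison, the traditional sort-and-scan check on invoice numbers might look something like the Python sketch below. It is only an illustration under my own assumptions: the tuple layout, the function name `flag_invoices` and the `small_step` threshold are hypothetical and do not come from the Deloitte report.

```python
from collections import defaultdict
from datetime import date


def flag_invoices(invoices, small_step=3):
    """Return {vendor: [(invoice_number, reason), ...]} for suspicious invoices.

    `invoices` is a list of (vendor, invoice_number, invoice_date) tuples.
    `small_step` is a hypothetical threshold chosen for illustration.
    """
    by_vendor = defaultdict(list)
    for vendor, number, issued in invoices:
        by_vendor[vendor].append((issued, number))

    flags = defaultdict(list)
    for vendor, rows in by_vendor.items():
        rows.sort()  # order each vendor's invoices by date, then by number
        seen = set()
        highest = None  # highest invoice number seen so far for this vendor
        for issued, number in rows:
            if number in seen:
                flags[vendor].append((number, "repeated"))
            elif highest is not None and number < highest:
                flags[vendor].append((number, "out of sequence"))
            elif highest is not None and number - highest <= small_step:
                flags[vendor].append((number, "small increment"))
            seen.add(number)
            highest = number if highest is None else max(highest, number)
    return dict(flags)


if __name__ == "__main__":
    sample = [
        ("Acme", 1001, date(2013, 1, 5)),
        ("Acme", 1002, date(2013, 2, 5)),
        ("Globex", 5120, date(2013, 1, 10)),
        ("Globex", 5890, date(2013, 2, 10)),
        ("Globex", 5890, date(2013, 2, 20)),
        ("Globex", 5400, date(2013, 3, 10)),
    ]
    for vendor, reasons in flag_invoices(sample).items():
        print(vendor, reasons)
```

Even in this toy form, the output is a list that someone has to read line by line, which is the gap the visual approach is meant to close.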
Big Data isn’t a technology or solution set that gets dropped into an organization, ready to deliver compelling insights that will put the business on an upward trajectory of intelligence and prosperity. Rather, it is a gradually building wave that an organization’s leaders will need to learn to ride, or else get swamped. Understanding and working effectively with big data will take a lot of practice.
That’s the theme of a new book co-authored by Michael Minelli, vice president of information services for MasterCard Advisors, along with Michele Chambers, formerly general manager and VP of Big Data analytics at IBM, and Ambiga Dhiraj, head of client delivery for Mu Sigma.
In the book, Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses, Minelli, Chambers and Dhiraj lay out the ways organizations can prepare to consume big data analytics.
1) Consider who is handling the “last mile” in data analysis: You need people who can look at the big picture with big data, and be able to explain its implications to the business. The authors quote Dr. Usama Fayyad, who talks about the crucial last mile in data analytics – the people “who are basically there to deliver the results of the analysis and put them in terms the business can understand. This last-mile group is made up of data analysts who know enough about the business to present to the CMO or the CEO.” At issue is the ability to find and hire these people, which is not an easy task. Also, a mistake many organizations make is putting these people to work on tactical assignments. “That’s a mistake, because these are people who can help develop and guide strategy, move the needle, and grapple with big issues,” Fayyad is quoted as saying.
2) Introduce the power of “geospatial intelligence”: Geospatial intelligence involves the gathering and analysis of data to form more of a 3D view of what’s happening around the organization. It’s about “using data about space and time to improve the quality of predictive analysis.” Minelli and his co-authors quote IBM’s Jeff Jonas: “It’s going to come from weaving together data that has traditionally not been woven together.” This means location data generated from sensors and smartphones, as well as social media data.
3) Separate the signal from the noise: With so much data and extremely large datasets, there’s going to be a lot of noise, with a lot of conflicting signals. “As data gets larger, it becomes increasingly difficult to fully grasp the meaning and magnitude of the data through exploratory analysis,” the authors state. The best way to help analysts decipher the nuggets of information needed is through visualization tools. For example, a “word cloud” of relevant terms plucked from a site or journal – the more mentions, the larger the font – will provide, at a glance, the topics mentioned most often.
4) Collaborate: “successful analytics is a collaborative endeavor,” Minelli and his co-authors state. “The first step in the process is to take your analytics intent beyond your core team and sell it to a wider group of decision makers – the prospective daily consumers of analytics in your organization.”
5) Learn to lead: “organizations that successfully consume analytics are driven by leadership, which builds consensus in the organization and allows for moving ahead without the need to have everyone on board every step of the way,” the authors state. “Strong leadership has been found to be the most important trigger in the wider analytics adoption in organizations.”
6) Measure, measure, measure: “Use analytics to measure itself,” Minelli and his co-authors urge. They add that hard numbers actually aren’t necessary to gauge any progress – the availability of analytics may elevate discussions and awareness of what the business needs. “One often but profound change in organizations is the maturing of a culture of objective debates, arguments and viewpoints driven by data and not just ‘gut feel,’” the authors state.
7) Change your incentives: Big data analytics implementations will shake up the flow of information across the organization, and thus rearrange the hierarchy. Such projects will “bring in new stakeholders in employees’ decisions as well as higher levels of oversight,” the authors point out. “Sometimes, a general tendency of status quo bias exists, and employees do not want to venture out of their comfort zone. You need to create robust incentives to overcome these barriers.”
Hosting Big Data applications in the cloud has compelling advantages. Scale doesn’t become as overwhelming an issue as it is within on-premises systems. IT will no longer feel compelled to throw more disks at burgeoning storage requirements, and performance becomes the contractual obligation of someone else outside the organization.
Cloud may help clear up some of the costlier and thornier problems of attempting to manage Big Data environments, but it also creates some new issues. As Ron Exler of Saugatuck Technology recently pointed out in a new report, cloud-based solutions “can be quickly configured to address some big data business needs, enabling outsourcing and potentially faster implementations.” However, he adds, employing the cloud brings some risks as well.
Data security is one major risk area, and I could write many posts on this. But management issues also present other challenges. Too many organizations see cloud as a cure-all for their application and data management ills, but broken processes are never fixed when new technology is applied to them. There are also plenty of risks with the misappropriation of big data, and the cloud won’t make these risks go away. Exler lists some of the risks that stem from over-reliance on cloud technology, from the late delivery of business reports to the delivery of incorrect business information, resulting in decisions based on incorrect source data. Sound familiar? The gremlins that have haunted data analytics and management for years simply won’t disappear behind a cloud.
Exler makes three recommendations for moving big data into cloud environments – note that the solutions he proposes have nothing to do with technology, and everything to do with management:
1) Analyze the growth trajectory of your data and your business. Typically, organizations will have a lot of different moving parts and interfaces. And, as the business grows and changes, it will be constantly adding new data sources. As Exler notes, “processing integration or hand off points in such piecemeal approaches represent high risk to data in the chain of possession – from collection points to raw data to data edits to data combination to data warehouse to analytics engine to viewing applications on multiple platforms.” Business growth and future requirements should be analyzed and modeled to make sure cloud engagements will be able “to provide adequate system performance, availability, and scalability to account for the projected business expansion,” he states.
2) Address data quality issues as close to the source as possible. Because both cloud and big data environments have so many moving parts, “finding the source of a data problem can be a significant challenge,” Exler warns. “Finding problems upstream in the data flow prevents time-consuming and expensive reprocessing that could be needed should errors be discovered downstream.” Such quality issues have a substantial business cost as well. When data errors are found, it becomes “an expensive company-wide fire drill to correct the data,” he says.
3) Build your project management, teamwork and communication skills. Big data and cloud projects involve so many people and components from across the enterprise that they require coordination and interaction among various specialists, subject matter experts, vendors, and outsourcing partners. “This coordination is not simple,” Exler warns. “Each group involved likely has different sets of terminology, work habits, communications methods, and documentation standards. Each group also has different priorities; oftentimes such new projects are delegated to lower priority for supporting groups.” Project managers must be leaders and understand the value of open and regular communications.
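Exler’s recommendations are about management, not tooling, but the second one, catching data problems close to the source, has a simple mechanical side that a short Python sketch can illustrate. Everything specific here is an assumption of mine: the field names, the date format and the rule that amounts cannot be negative are a hypothetical schema, not something from the Saugatuck report.

```python
from datetime import datetime

# Hypothetical schema for incoming records.
REQUIRED_FIELDS = ("customer_id", "amount", "recorded_at")


def validate(record):
    """Return a list of problems found in one incoming record (empty if clean)."""
    problems = [f"missing {f}" for f in REQUIRED_FIELDS if record.get(f) in (None, "")]
    if problems:
        return problems
    try:
        if float(record["amount"]) < 0:
            problems.append("negative amount")
    except (TypeError, ValueError):
        problems.append("amount is not a number")
    try:
        datetime.strptime(record["recorded_at"], "%Y-%m-%d")
    except (TypeError, ValueError):
        problems.append("recorded_at is not YYYY-MM-DD")
    return problems


def split_at_source(records):
    """Separate clean records from rejected ones before anything is loaded downstream."""
    clean, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record, problems))
        else:
            clean.append(record)
    return clean, rejected


if __name__ == "__main__":
    incoming = [
        {"customer_id": "C1", "amount": "19.99", "recorded_at": "2013-06-01"},
        {"customer_id": "", "amount": "5", "recorded_at": "2013-06-01"},
    ]
    clean, rejected = split_at_source(incoming)
    print(len(clean), "clean;", len(rejected), "rejected")
```

Keeping each rejected record alongside the reasons it failed gives the source system’s owner something to act on, rather than leaving the error to surface in a downstream report.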
There are organizations truly reaping the rewards of Big Data, and then there are those who are just trying to catch up. What are the Big Data “leaders” doing that the “laggards” are missing?
Evolving from Chaos to Competitiveness: The Emerging Architecture of Next-Generation Data Integration
To compete on Big Data and analytics, today’s always-on enterprise needs a well-designed evolving high-level architecture that continuously provides trusted data originating from a vast and fast-changing range of sources, often with different formats, and within different contexts.
To meet this challenge, the art and science of data integration is evolving, from duplicative, project-based silos that have consumed organizations’ time and resources to an architectural approach, in which data integration is based on sustainable and repeatable data integration practices – delivering data integration automatically anytime the business requires it.
Last fall, The New York Times’ resident numbers geek Nate Silver provided a lesson in predictive analytics for the whole world to see, crunching big data to predict, with almost pinpoint accuracy, the winner of the U.S. presidential election.
The success of this high-profile project thrust big data analytics into the limelight, but there are many other, somewhat more mundane applications that yield even more unforeseen revelations.