Does Increased Margarine Consumption Raise the Divorce Rate? (The Big Data Blind Spot)
At a presentation a couple of months back, I heard Forrester’s Mike Gualtieri talk about an interesting conundrum in big data analysis – separating correlation from causation. It starts with a human tendency to “fill in the blanks” when figuring out what data is telling us, and it exposes what just may be a “Big Data Blind Spot.”
Gualtieri notes that the divorce rate in the state of Maine from 2000 to 2009 follows the same path as per capita margarine consumption.
Perhaps consuming more margarine does somehow rearrange one’s brain cells, leading to discontent in a marriage. More likely, other factors drive both trends, entirely independent of one another. But can business decision makers determine whether such correlations in their own data – such as sales in specific regions, demographics, and so on – are interrelated, or merely coincidental? What happens when they marshal significant organizational resources on the basis of a flawed correlation?
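The margarine example is easy to reproduce in a few lines: any two series that merely trend in the same direction over time will show a high Pearson correlation, with or without any causal link. The figures below are illustrative placeholders, not the actual Maine statistics:

```python
# Two hypothetical declining time series, 2000-2009 (illustrative values,
# not the actual Maine divorce or margarine figures).
divorce_rate = [5.0, 4.7, 4.6, 4.4, 4.3, 4.1, 4.2, 4.2, 4.2, 4.1]
margarine_lbs = [8.2, 7.0, 6.5, 5.3, 5.2, 4.0, 4.6, 4.5, 4.2, 3.7]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson(divorce_rate, margarine_lbs), 2))
```

On these made-up numbers the coefficient comes out close to 1 – about as “strong” a correlation as real data ever shows, despite the absence of any causal mechanism connecting the two.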
The Big Data Blind Spot
There is a large blind spot in the big data revolution. Even with the most massive data sets on the planet, one cannot rush to judgment just because one slice of the data appears to point to a cause. Ian Chipman, writing at Stanford University’s Graduate School of Business site, explored this blind spot, and what one data expert, Susan Athey, professor of economics at Stanford Graduate School of Business, proposes to do about it. “It’s relatively easy these days to automatically classify complex things like text, speech, and photos, or to predict website traffic tomorrow,” the article explains. “It’s a whole different ballgame to ask a computer to explore how raising the minimum wage might affect employment or to design an algorithm to assign optimal treatments to every patient in a hospital.”
Athey outlined problematic scenarios brought on by big data. For example, “one question that comes up in businesses is whether a firm should target resources on retaining customers who have a high risk of attrition, or churn. Predicting churn can be accomplished with off-the-shelf machine-learning methods. However, the real problem is calculating the best allocation of resources, which requires identifying those customers for whom the causal effect of an intervention, such as offering discounts or sending out targeted emails, is the highest. That’s a harder thing to measure; it might require the firm to conduct a randomized experiment to learn where intervention has the biggest benefits.”
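Athey’s churn scenario can be sketched with a small simulation. Assume – hypothetically – that a retention offer cuts churn sharply for high-value customers but barely moves low-value ones; a randomized experiment recovers that difference in causal effect, which a pure churn-prediction model would never surface. All segment names and probabilities below are invented for illustration:

```python
import random

random.seed(0)

# Hypothetical ground truth: the intervention (say, a discount email) cuts
# churn by 15 points for high-value customers but only 2 points for
# low-value ones. All probabilities are invented for illustration.
def churned(segment, treated):
    base = {"high_value": 0.30, "low_value": 0.20}[segment]
    lift = {"high_value": 0.15, "low_value": 0.02}[segment]  # true causal effect
    return random.random() < base - (lift if treated else 0.0)

# Randomized experiment: treatment assigned by coin flip within each segment.
outcomes = []
for i in range(20000):
    segment = "high_value" if i % 2 else "low_value"
    treated = random.random() < 0.5
    outcomes.append((segment, treated, churned(segment, treated)))

# Estimated uplift = control churn rate minus treated churn rate, per segment.
est = {}
for seg in ("high_value", "low_value"):
    treated_c = [c for s, t, c in outcomes if s == seg and t]
    control_c = [c for s, t, c in outcomes if s == seg and not t]
    est[seg] = sum(control_c) / len(control_c) - sum(treated_c) / len(treated_c)
    print(f"{seg}: estimated churn reduction from intervention = {est[seg]:.3f}")
```

An off-the-shelf churn model would simply rank customers by risk; the experiment instead shows *where the intervention actually pays off*, which is the allocation question Athey describes.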
4 Ways to Address the Big Data Blind Spot
1) Testing: Many analysts employ what is called a “randomized controlled experiment,” Athey explains. Drug testing is where this is commonly seen: “A randomly selected group of people with a particular illness is given the drug while a second group with the same illness is given a placebo. If a significant portion of the first group gets better, the drug is probably the cause.” However, as Athey observes, “such experiments are not feasible in many real-world settings.” In organizations, there may be a wide range of factors affecting the outcome of experiments, from employee attitudes to ever-shifting market forces.
2) Data science: Another approach is to bring in data scientists – either through proactive hiring or training staff. These individuals can apply the necessary statistical rigor and testing to data correlations to determine whether they reflect genuine causal relationships or mere coincidence.
3) Machine learning: Data scientists are a great addition to any enterprise, but there aren’t enough data scientists to go around to meet the demand created by the ever-growing volume of data sets and models, Gualtieri warns. “It requires a lot of compute, and a lot of productivity from data scientists to maintain those models.” Ultimately, this needs to be addressed through greater automation, he continued, “to improve the productivity of the data scientist. We have a skills gap, and we have to make the data scientists we have 1,000 times more productive.” Emerging machine learning technologies and approaches may help alleviate these burdens.
4) Critical thinking: Finally, there is a need for good old-fashioned critical thinking to be applied to data analytics. Ultimately, humans run the organizations, and need to understand how data insights further the needs of the organization. There is already too much reliance on data, which can breed a dependency among decision makers. Just as machine learning keeps refreshing algorithms with new data, human minds need to be constantly refreshed with new insights. Analytics need to be constantly reviewed and tested, and often, common sense needs to prevail.
As Athey puts it: “If you’re just trying to crunch big data and not thinking about everything that can go wrong in confusing correlation and causality, you might think that putting a bigger machine on your problem is going to solve things. But sometimes the answer’s just not in the data.”