Category Archives: Big Data
I found a truly cringe-worthy article today that shows what popular websites looked like more than a decade ago and what they look like today. Looking back at what was cutting-edge in 1996, or even 2006, is like fingernails on a chalkboard compared to the modern homepages of popular sites.
These websites are still well-used today, staying with the times and leading the way we design digital experiences. The key was change over many years of research and understanding of user experience. These sites stay modern, adapting to different web-enabled devices and experiences that the end user will encounter. Common among them are beautiful imagery, clear calls to action, and a sophisticated understanding of what people want on a homepage.
Can you imagine if any of those sites had stayed the same and never changed? We would not be using them today if that were the case. Their popularity would wane. Change is never easy, but it is usually necessary to stay relevant.
Web designers in 1996 could not imagine what the internet would be like in 2015, although they would probably agree there was a lot of potential. A modern equivalent is the implications of big data throughout the enterprise.
Data-driven marketers today are wondering how they can gain insight from big data. The answer? The ability to change is the connection between big data and insight. Data-driven marketers today know that their roles are changing: 68% of marketers think that marketing has seen more changes in the last two years than it has in the past 50 years, according to a recent survey. The changes are due to a renewed focus on customer experience within their jobs, and the need to use big data to improve that experience.
Big data should drive insights that change businesses, but is the real reason marketers aren’t sure about how to use big data tied to the change that it requires? Leading change in an organization is never easy, but it is definitely necessary.
What insights do you want from big data, and what value can you derive from them? If your reason for using big data is customer behavior insights, how will knowing how a customer behaves influence any changes in your approach?
The National Retail Federation recently surveyed retailers, who reported these three top reasons for using big data:
- Analyzing customer behavior (56 percent)
- Bringing together different data sources (49 percent)
- Improving personalization (48 percent)
What are your reasons for using big data?
Data-driven marketers can drown in too much information if they look at massive datasets without a question in mind that they want to answer. The question being asked often implies that a business must change to stay modern and relevant to its customers. Could concern over a need for great change be the roadblock to data-driven marketers who could be using data for valuable insights?
Big data has gotten a lot of buzz in the last few years. Data-driven marketers can move the big data concept from fuzzy, unrealized potential to a major part of how their business operates successfully.
Learn more in this white paper for marketers, The Secret to a Successful Customer Journey.
I recently got to meet with a very enlightened insurance company that was actively turning its SWOTT analysis (with the second T being trends) into concrete action. They shared with me that they view their go-forward “right to win” as being determined by the quality of customer experience they deliver through their traditional channels and, increasingly, through digital channels. One marketing leader joked early on that “it’s no longer about the money; it is about the experience”. The marketing and business leaders I met with made it extremely clear that they have a sense of urgency to respond to what they saw as significant market changes on the horizon. What this company wanted to achieve was a single view of customer across each of its distribution channels as well as its agent population. Typical of many businesses today, they had determined that they needed an automated, holistic view into things like customer history. Smartly, this business wanted to bring together its existing customer data with its customer leads.
Using Data to Increase the Percentage of Customers That Are Cross-Sold
Taking this step was seen as allowing them to understand when an existing customer is also a lead for another product. With this knowledge, they wanted to provide special offers to accelerate that customer’s conversion from lead to multi-product customer. What they wanted to do here reminded me of the capabilities of 58.com, eBay, and other Internet pure plays. The reason for doing this well was described recently by Gartner, which suggests that business success is increasingly determined by what it calls “business moments”. Without a first-rate experience that builds upon what it already knows about its customers, this insurance company worries it could be at increasing risk from Internet pure plays. Just as important, the degree of cross-sell is for many businesses a major determinant of whether a customer is profitable or not.
Getting Customer Data Right is Key to Developing a Winning Digital Experience
To drive a first rate digital experience, this insurance company wanted to apply advanced analytics to a single view of customer and prospect data. This would allow them to do things like conduct nearest neighbor predictive analysis and modeling. In this form of analysis, “the goal is to predict whether a new customer will respond to an offer based on how other similar customers have responded” (Data Science for Business, Foster Provost, O’Reilly, 2013, page 147).
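The idea behind nearest-neighbor prediction can be sketched in a few lines of plain Python. Everything below is illustrative: the customer features, labels, and distance choice are hypothetical, and a production system would use a proper analytics library rather than this toy implementation.

```python
from collections import Counter
import math

def knn_predict(history, new_customer, k=3):
    """Predict a label for new_customer from the k most similar past customers.
    history: list of (feature_vector, label) pairs with numeric features."""
    def distance(a, b):
        # Plain Euclidean distance between two feature vectors.
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Sort past customers by similarity and keep the k closest.
    nearest = sorted(history, key=lambda pair: distance(pair[0], new_customer))[:k]
    # Majority vote among the neighbors' labels.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical customers: (age, policies_held, years_as_customer) -> responded to offer?
history = [
    ((34, 1, 2), True),
    ((36, 2, 3), True),
    ((58, 4, 20), False),
    ((61, 3, 25), False),
    ((29, 1, 1), True),
]
print(knn_predict(history, (33, 1, 2)))  # nearest neighbors are young, single-policy responders
```

In practice the features would be scaled and far more numerous, but the core logic—predicting a new customer’s response from how similar customers responded—is exactly this.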
What has been limiting this business, like so many others, is that its customer data is scattered across many enterprise systems. For just one division, they have more than one Salesforce instance. Yet this company’s marketing team knew that to keep its customers, it needed to be able to service them omnichannel and establish a single, unified customer experience. To make this happen, they needed, for the first time, to share holistic customer information across their ecosystems. At the same time, they knew they would need to protect their customers’ privacy—i.e., only certain people would be able to see certain information. They wanted the ability to selectively mask data by role, protecting consumers by allowing only certain users (in defense parlance, those with a need to know) to see a subset of the holistic set of information collected. When asked about the need for a single view of customer, the digital marketing folks openly shared that they perceived the potential for external digital market entrants—à la Porter’s five forces of competition. This firm saw these entrants either taking market share from them or effectively disintermediating them from their customers over time as more and more customers move their insurance purchasing to the Web. Given the risk, their competitive advantage needed to move to knowing their customers better and being able to respond to them better on the web, including responding to new entrants trying to win those customers, in the language of Theodore Levitt.
Competing on Customer Experience
In sum, this insurance company smartly felt that it needed to “compete on customer experience” (a new phrase for me), and this required superior knowledge of existing and new customers. This means they needed as complete and correct a view of customers as possible, including addresses, connection preferences, and, increasingly, social media responses. It also means competitively responding directly to those who have honed their skills in web design, social presence, and advanced analytics. To do this, they will create predictive capabilities that make use of their superior customer data. Clearly, without this prescient thinking, this moment could become like the strategic collision of Starbucks and fast food vendors, where the desire to grow forced competition between the existing player and new entrants wanting to claim a portion of the existing player’s business.
Blogs and Articles
Last week I had the opportunity to attend the Data Mania industry event hosted by Informatica. The afternoon event was a nice mix of industry panels with technical and business speakers from companies that included Amazon, Birst, AppDynamics, Salesforce, Marketo, Tableau, Adobe, Informatica, Dun & Bradstreet and several others.
A main theme of the event that came through with so many small, medium and large SaaS vendors was that everyone is increasingly dependent on being able to integrate data from other solutions and platforms. The second part of this was that customers increasingly expect the data integration requirements to work under the covers so they can focus on the higher level business solutions.
I really thought the four companies presented by Informatica as the winners of their Connect-a-Thon contest were the highlight. Each of these solutions highlighted some great aspects of data integration.
Databricks provides a cloud platform for big data processing. The solution leverages Apache Spark, an open source engine for big data processing that has seen a lot of adoption. Spark is the engine for the Databricks Cloud, which adds several enterprise features for visualization, large-scale Spark cluster management, workflow and integration with third-party applications. Having a big data solution means bringing data in from a lot of SaaS and on-premises sources, so Databricks built a connector to Informatica Cloud to make it easier to load data into the Databricks Cloud. Again, it’s a great example of an ecosystem where higher-level solutions can leverage third-party infrastructure.
ThoughtSpot provides a search-based BI solution. The general idea is that a search-based interface gives a much broader group of users, with little training, access to the power of enterprise business intelligence tools. It reminds me of some other solutions that fall into the enterprise 2.0 area and do everything from expert location to finding structured and unstructured data more easily. They wrote a nice blog post explaining why they built the ThoughtSpot Connector for Informatica Cloud. The main reason is that using Informatica to handle the data integration lets them focus on their own solution, the end-user-facing BI tools. It’s a good example of SaaS providers choosing either to roll their own data integration or to leverage other providers as part of their solution.
BigML provides some very interesting machine learning solutions. The simple summary would be that they are trying to create beautiful visualization and predictive modeling tools. The solution greatly simplifies the process of iterating on models and visualizing them. Their gallery of models has several very good examples. Again, in this case BigML built a connector to Informatica Cloud for SaaS and on-premises integration, working in conjunction with the existing BigML REST API. BigML wrote a great blog post on their connector that goes into more detail.
FollowAnalytics had one of the more interesting demonstrations because their solution was very different from the other three. They have a mobile marketing platform that is used to drive end-user engagement and measure that engagement. They also uploaded their Data Mania integration demo here. They are mostly leveraging data integration to provide access to important data sources that can help drive customer engagement in their platform. Given that their end users are marketing or business analysts, they just expect to be able to easily get the data they want and need to drive marketing analysis and engagement.
My takeaway from talking to many of the SaaS vendors was that there is a lot of interest in leveraging higher-level infrastructure, platform and middleware services as they mature to meet the real needs of SaaS vendors, so that those vendors can focus on their own solutions. In many cases, the ecosystem may already be more ready than vendors realize.
I won’t say I’ve seen it all; I’ve only scratched the surface in the past 15 years. Below are some of the mistakes I’ve made or fixed during this time.
MongoDB as your Big Data platform
Why am I picking on MongoDB? Because at this point it is the most abused NoSQL database. While Mongo has an aggregation framework that tastes like MapReduce, and even a very poorly documented Hadoop connector, its sweet spot is as an operational database, not an analytical system.
RDBMS schema as files
You dumped each table from your RDBMS into a file and stored it on HDFS, and now you plan to use Hive on it. You know that Hive is slower than an RDBMS; it will use MapReduce even for a simple select. Next, look at row sizes: you have flat files measured in single-digit kilobytes.
Hadoop does best on large sets of relatively flat data. I’m sure you can create an extract that’s more de-normalized.
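As a rough illustration of what such an extract might look like, here is a minimal Python sketch (with made-up tables) that joins a normalized customer table into each order, so every output row is flat and self-contained before it lands on HDFS:

```python
# Hypothetical RDBMS extracts: a customer lookup table and an orders table.
customers = {
    1: {"name": "Acme Corp", "region": "West"},
    2: {"name": "Globex", "region": "East"},
}
orders = [
    {"order_id": 100, "customer_id": 1, "total": 250.0},
    {"order_id": 101, "customer_id": 2, "total": 75.5},
]

def denormalize(customers, orders):
    """Join each order with its customer so every row carries all its context."""
    for o in orders:
        c = customers[o["customer_id"]]
        yield {**o, "customer_name": c["name"], "region": c["region"]}

flat = list(denormalize(customers, orders))
```

The point is the shape of the output, not the tooling: wide, flat records let Hadoop scan the data without re-joining tables on every query.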
Instead of creating a single Data Lake, you created a series of data ponds or a data swamp. Conway’s law has struck again; your business groups have created their own mini-repositories and data analysis processes. That doesn’t sound bad at first, but with different extracts and ways of slicing and dicing the data, you end up with different views of the data, i.e., different answers for some of the same questions.
Schema-on-read doesn’t mean, “Don’t plan at all,” but it means “Don’t plan for every question you might ask.”
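To make the distinction concrete, here is a small hypothetical Python sketch of schema-on-read: the raw JSON lines are stored untouched, and structure is imposed only when a specific question (here, revenue per user) is asked:

```python
import json

# Raw event lines stored as-is; no schema was enforced at write time.
raw_events = [
    '{"user": "u1", "action": "click", "page": "/home", "ts": 1}',
    '{"user": "u2", "action": "purchase", "amount": 19.99, "ts": 2}',
    '{"user": "u1", "action": "purchase", "amount": 5.00, "ts": 3}',
]

def revenue_by_user(lines):
    """Impose just enough structure, at read time, to answer one question."""
    totals = {}
    for line in lines:
        event = json.loads(line)
        if event.get("action") == "purchase":
            totals[event["user"]] = totals.get(event["user"], 0.0) + event["amount"]
    return totals

print(revenue_by_user(raw_events))
```

A different question tomorrow (say, clicks per page) would read the same raw lines with a different projection; what you still had to plan for is that the lines are parseable and carry the fields your questions will need.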
Missing use cases
Vendors, to escape the constraints of departmental funding, are selling the idea of the data lake. The byproduct is that the business loses sight of real use cases. The data-lake approach can be valid, but you won’t get much out of it if you don’t have actual use cases in mind.
It isn’t hard to come up with use cases, yet they are almost always an afterthought. The business should start thinking about use cases before its databases can no longer handle the load.
To do a larger bit of analytics, you may need a bigger tool set, one that may include Hive, Pig, MapReduce, R, and more.
As I have shared within the posts of this series, businesses are using analytics to improve their internal and external-facing business processes and to strengthen their “right to win” within the markets in which they operate. Like healthcare institutions across the country, UPMC is striving to improve its quality of care and business profitability. One educational healthcare CEO put it to me this way: “if we can improve our quality of service, we can reduce costs while we increase our pricing power”. In UPMC’s case, they believe that the vast majority of their costs are concentrated in a fraction of their patients, but they want to prove this with real data and then use this information to drive their go-forward business strategies.
Getting more predictive to improve outcomes and reduce costs
Armed with this knowledge, UPMC’s leadership wanted to use advanced analytics and predictive modeling to improve clinical and financial decision making. Taking this action was seen as producing better patient outcomes and reducing costs. A focus area for analysis involved creating “longitudinal records” for the complete cost of providing particular types of care. For those not versed in time series analysis, longitudinal analysis uses a series of observations obtained from many respondents over time to derive a relevant business insight. When I was involved in healthcare, I used this type of analysis to interrelate employee and patient engagement results with healthcare outcomes. In UPMC’s case, they wanted to use this type of analysis to understand, for example, the total end-to-end cost of a spinal surgery. UPMC wanted to look beyond the cost of surgery and account for pre-surgery care and recovery-related costs. To do this for the entire hospital, however, meant bringing together data from hundreds of sources across UPMC and outside entities, including labs and pharmacies. By having this information, UPMC’s leadership saw the potential to create an accurate and comprehensive view that could be used to benchmark future procedures. Additionally, UPMC saw the potential to automate the creation of patient problem lists or examine clinical practice variations. But like the other case studies we have reviewed, these steps required trustworthy and authoritative data that could be accessed with agility and ease.
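A longitudinal cost record of the kind described could be sketched roughly as follows. The patients, sources, and dollar amounts below are invented for illustration, not UPMC data; the point is simply rolling per-phase events from many systems into one end-to-end record per patient:

```python
from collections import defaultdict

# Hypothetical cost events gathered from lab, pharmacy, and hospital systems.
cost_events = [
    {"patient": "P1", "phase": "pre-surgery", "source": "lab",      "cost": 800.0},
    {"patient": "P1", "phase": "surgery",     "source": "hospital", "cost": 22000.0},
    {"patient": "P1", "phase": "recovery",    "source": "pharmacy", "cost": 1200.0},
    {"patient": "P2", "phase": "surgery",     "source": "hospital", "cost": 21000.0},
]

def longitudinal_record(events):
    """Roll events up into one end-to-end cost record per patient."""
    records = defaultdict(lambda: {"phases": defaultdict(float), "total": 0.0})
    for e in events:
        rec = records[e["patient"]]
        rec["phases"][e["phase"]] += e["cost"]
        rec["total"] += e["cost"]
    return dict(records)

records = longitudinal_record(cost_events)
print(records["P1"]["total"])  # 24000.0: surgery plus pre-surgery and recovery costs
```

The hard part in real life is not this aggregation but getting hundreds of sources to agree on patient identity and cost definitions first, which is exactly why the data integration work comes before the analysis.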
UPMC starts with a large, multiyear investment
In October 2012, UPMC made a $100 million investment to establish an enterprise analytics initiative that would bring together, for the first time, clinical, financial, administrative, genomic and other information in one place. Tom Davenport, the author of Competing on Analytics, suggests in his writing that establishing an enterprise analytics capability represents a major step forward because it allows enterprises to answer the big questions, to better tie strategy and analytics, and to finally rationalize application integration and business intelligence spending. As UPMC put its plan together, it realized that it needed to touch more than 1,200 applications. It also realized that it needed one system to manage data integration, master data management, and eventually complex event processing capabilities. At the same time, it addressed the people side of things by creating a governance team to manage data integrity improvements, ensuring that trusted data populates enterprise analytics and providing transparency into data integrity challenges. One of UPMC’s goals was to provide self-service capabilities. According to Terri Mikol, a project leader, “We can’t have people coming to IT for every information request. We’re never going to cure cancer that way.” Here is an example of the promise realized within the first eight months of this project: researchers were able to integrate—for the first time ever—clinical and genomic information on 140 patients previously treated for breast cancer. Traditionally, these data have resided in separate information systems, making it difficult—if not impossible—to integrate and analyze dozens of variables. The researchers found intriguing molecular differences in the makeup of pre-menopausal vs. post-menopausal breast cancer, findings which will be explored further. For UPMC, this initial cancer insight is just the starting point of their efforts to mine massive amounts of data in the pursuit of smarter medicines.
Building the UPMC Enterprise Analytics Capability
To create their enterprise analytics platform, UPMC determined it was critical to establish “a single, unified platform for data integration, data governance, and master data management,” according to Terri Mikol. The solution required a number of key building blocks. The first was data integration to collect and cleanse data from hundreds of sources and organize it into repositories that enable fast, easy analysis and reporting by and for end users.
Specifically, the UPMC enterprise analytics capability pulls clinical and operational data from a broad range of sources, including systems for managing hospital admissions, emergency room operations, patient claims, health plans, electronic health records, as well as external databases that hold registries of genomic and epidemiological data needed for crafting personalized and translational medicine therapies. UPMC has integrated quality checked source data in accordance with industry-standard healthcare information models. This effort included putting together capabilities around data integration, data quality and master data management to manage transformations and enforce consistent definitions of patients, providers, facilities and medical terminology.
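To illustrate in miniature what enforcing a consistent definition of “patient” can look like, here is a hypothetical Python sketch (not UPMC’s actual implementation): two source systems describe the same person differently, and a normalization step maps both onto one master shape before the records are merged.

```python
def normalize_patient(record):
    """Map a source-specific patient record onto one consistent definition."""
    return {
        # Sources disagree on the key name for the identifier.
        "patient_id": record.get("id") or record.get("patient_id"),
        # Names arrive with inconsistent casing and whitespace.
        "name": record["name"].strip().title(),
        "dob": record["dob"],  # assumes sources already use ISO dates
    }

# Hypothetical records for the same patient from two systems.
admissions = {"id": "123", "name": " jane DOE ", "dob": "1980-04-02"}
claims = {"patient_id": "123", "name": "Jane Doe", "dob": "1980-04-02"}

a, c = normalize_patient(admissions), normalize_patient(claims)
print(a == c)  # the two sources now agree on one master record
```

Real master data management handles far messier cases (probabilistic matching, survivorship rules), but the principle is the same: agree on one definition, then force every source into it.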
As noted, the cleansed and harmonized data is organized into specialized genomics databases, multidimensional warehouses, and data marts. The approach makes use of traditional data warehousing as well as big data capabilities to handle unstructured data and natural language processing. UPMC has also deployed analytical tools that allow end users to exploit the data enabled by the Enterprise Analytics platform. The tools drive everything from predictive analytics and cohort tracking to business and compliance reporting. And UPMC did not stop there. If their data had value, then it needed to be secured. UPMC created data audits and data governance practices, and implemented a dynamic data masking solution that ensures data security and privacy.
As I have discussed, many firms are pushing point silo solutions into their environments, but as UPMC shows, this limits their ability to ask the bigger business questions or, in UPMC’s case, to discover things that can change people’s lives. Analytics are more and more a business enabler when they are organized as an enterprise analytics capability. I have also come to believe that analytics have become a foundational capability for every firm’s right to win: they inform a coherent set of capabilities and establish a firm’s go-forward right to win. For this, UPMC is a shining example of getting things right.
Author Twitter: @MylesSuer
On March 25th, Josh Lee, Global Director for Insurance Marketing at Informatica and Cindy Maike, General Manager, Insurance at Hortonworks, will be joining the Insurance Journal in a webinar on “How to Become an Analytics Ready Insurer”.
Register for the Webinar on March 25th at 10am Pacific/ 1pm Eastern
Josh and Cindy exchange perspectives on what “analytics ready” really means for insurers, and today we are sharing some of our views (join the webinar to learn more). Josh and Cindy offer perspectives on the five questions posed here. Please join Insurance Journal, Informatica and Hortonworks on March 25th for more on this exciting topic.
See the Hortonworks site for a second posting of this blog and more details on exciting innovations in Big Data.
- What makes a big data environment attractive to an insurer?
CM: Many insurance companies are using new types of data to create innovative products that better meet their customers’ risk needs. For example, we are seeing insurance for “shared vehicles” and new products for prevention services. Much of this innovation is made possible by the rapid growth in sensor and machine data, which the industry incorporates into predictive analytics for risk assessment and claims management.
Customers who buy personal lines of insurance also expect the same type of personalized service and offers they receive from retailers and telecommunication companies. They expect carriers to have a single view of their business that permeates customer experience, claims handling, pricing and product development. Big data in Hadoop makes that single view possible.
JL: Let’s face it, insurance is all about analytics. Better analytics leads to better pricing, reduced risk and better customer service. But here’s the issue: existing data stores are costly for storing vast amounts of data and too inflexible to adapt to the changing needs of innovative analytics. Imagine kicking off a simulation or modeling routine one evening, only to return in the morning and find it incomplete, or lacking data that requires a special request of IT.
This is where big data environments are helping insurers. Larger, more flexible data sets allow longer series of analytics to be run, generating better results. And imagine doing all that at a fraction of the cost and time of traditional data structures. Oh, and heaven forbid you ask a mainframe to do any of this.
- So we hear a lot about Big Data being great for unstructured data. What about traditional data types that have been used in insurance forever?
CM: Traditional data types are very important to the industry – they drive our regulatory reporting and much of our performance management reporting. This data will continue to play a very important role in the insurance industry and for individual companies.
However, big data can now enrich that traditional data with new data sources for new insights. In areas such as customer service and product personalization, it can make the difference between cross-selling the right products to meet customer needs and losing the business. For commercial and group carriers, the new data provides the ability to better analyze risk needs, price accordingly and enable superior service in a highly competitive market.
JL: Traditional data will always be around. I doubt that I will outlive a mainframe installation at an insurer, which makes me a little sad. And for many rote tasks, like financial reporting, a sales report, or a commission statement, those sources are sufficient. However, the business of insurance is changing in leaps and bounds. Innovators in data science are interested in correlating those traditional sources with other creative data to find new products, or areas to reduce risk. There is just a lot of data that is either ignored or locked in obscure systems and needs to be brought into the light. Whether that data is structured or unstructured doesn’t matter; Big Data can assist there.
- How does this fit into an overall data management function?
JL: At the end of the day, a Hadoop cluster is another source of data for an insurer. More flexible, more cost effective and higher speed; but yet another data source for an insurer. So that’s one more on top of relational, cubes, content repositories, mainframes and whatever else insurers have latched onto over the years. So if it wasn’t completely obvious before, it should be now. Data needs to be managed. As data moves around the organization for consumption, it is shaped, cleaned, copied and we hope there is governance in place. And the Big Data installation is not exempt from any of these routines. In fact, one could argue that it is more critical to leverage good data management practices with Big Data not only to optimize the environment but also to eventually replace traditional data structures that just aren’t working.
CM: Insurance companies are blending new and old data and looking for the best ways to leverage “all data”. We are witnessing the development of a new generation of advanced analytical applications to take advantage of the volume, velocity, and variety in big data. We can also enhance current predictive models, enriching them with the unstructured information in claim and underwriting notes or diaries along with other external data.
There will be challenges. Insurance companies will still need to make important decisions on how to incorporate the new data into existing data governance and data management processes. The Chief Data or Chief Analytics officer will need to drive this business change in close partnership with IT.
- Tell me a little bit about how Informatica and Hortonworks are working together on this?
JL: For years Informatica has been helping our clients to realize the value in their data and analytics. And while enjoying great success in partnership with our clients, unlocking the full value of data requires new structures, new storage and something that doesn’t break the bank for our clients. So Informatica and Hortonworks are on a continuing journey to show that value in analytics comes with strong relationships between the Hadoop distribution and innovative market leading data management technology. As the relationship between Informatica and Hortonworks deepens, expect to see even more vertically relevant solutions and documented ROI for the Informatica/Hortonworks solution stack.
CM: Informatica and Hortonworks optimize the entire big data supply chain on Hadoop, turning data into actionable information to drive business value. By incorporating data management services into the data lake, companies can store and process massive amounts of data across a wide variety of channels including social media, clickstream data, server logs, customer transactions and interactions, videos, and sensor data from equipment in the field.
Matching data from internal sources (e.g. very granular data about customers) with external data (e.g. weather data or driving patterns in specific geographic areas) can unlock new revenue streams.
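A toy version of that internal/external matching might look like the following Python sketch. The ZIP codes, premiums, and risk attributes are invented for illustration; the idea is simply enriching granular internal customer records with external geographic data keyed on a shared attribute:

```python
# Hypothetical internal customer records keyed by ZIP code.
customers = [
    {"policy": "A1", "zip": "80302", "premium": 1200},
    {"policy": "B2", "zip": "33101", "premium": 950},
]
# Hypothetical external data (e.g. severe-weather exposure) by the same ZIPs.
external_by_zip = {
    "80302": {"hail_days_per_year": 9},
    "33101": {"hurricane_risk": "high"},
}

# Merge external attributes into each internal record; customers in ZIPs
# with no external data pass through unchanged.
enriched = [
    {**c, **external_by_zip.get(c["zip"], {})}
    for c in customers
]
print(enriched[1])
```

In a real big data environment this join would run over millions of records across Hadoop, but the enrichment pattern is the same.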
See this video for a discussion on unlocking those new revenue streams. Sanjay Krishnamurthi, Informatica CTO, and Shaun Connolly, Hortonworks VP of Corporate Strategy, share their perspectives.
- Do you have any additional comments on the future of data in this brave new world?
CM: My perspective is that, over time, we will drop the reference to “big” or “small” data and get back to referring simply to “data”. The term big data has been useful to describe the growing awareness of how new data types can help insurance companies grow.
We can no longer use “traditional” methods to gain insights from data. Insurers need a modern data architecture to store, process and analyze data—transforming it into insight.
We will see an increase in new market entrants in the insurance industry, and existing insurance companies will improve their products and services based upon the insights they have gained from their data, regardless of whether that was “big” or “small” data.
JL: I’m sure that even now there is someone locked in their mother’s basement playing video games and trying to come up with the next data storage wave. So we have that to look forward to, and I’m sure it will be cool. But, if we are honest with ourselves, we’ll admit that we really don’t know what to do with half the data that we have. So while data storage structures are critical, the future holds even greater promise for new models, better analytical tools and applications that can make sense of all of this and point insurers in new directions. The trend that won’t change anytime soon is the ongoing need for good quality data, data ready at a moment’s notice, safe and secure and governed in a way that insurers can trust what those cool analytics show them.
Please join us for an interactive discussion on March 25th at 10am Pacific Time/ 1pm Eastern Time.
Register for the Webinar on March 25th at 10am Pacific/ 1pm Eastern
EpicMix is a website, data integration solution and web application that provides a great example of how companies can provide more value to their customers when they think about data-ready architecture. In this case the company is Vail Resorts and it is great to look at this as an IoT case study since the solution has been in use since 2010.
The basics of EpicMix
* RFID technology embedded into lift tickets provides the ability to collect data for anyone using one at any Vail-managed resort. Vail realized they had all these lift tickets being worn, and there was an opportunity to use them to collect data that could enhance the experience of their guests. It is also a very clever way to collect data on skiers to help drive segmentation and marketing decisions.
* EpicMix just works. If a guest wants to take advantage, all they have to do is register on the website or download the mobile app for their Android or iOS smartphone. Having a low barrier to entry is important to getting people to try out the app, and even if people do not use the EpicMix website or app, Vail is still able to leverage the data they generate to better understand what people do on the mountain. (Vail has a detailed information policy and opt-out policy.)
* Value-added features beyond data visibility. What makes the solution more interesting are the features that go beyond just tracking skiing performance. These include private messaging between guests while on the mountain, sharing photos with friends, integration with personal social media accounts, and the ability to earn badges and participate in challenges. These go beyond first-generation solutions that would just track performance and nothing else.
This is the type of solution that qualifies as both an IoT personal productivity solution and a business productivity solution.
- For skiers, it provides information on their activity, plus communication and social media sharing.
- For Vail, it enables better understanding of guests, better communication, the ability to offer additional services and benefits, and insight into how to use resources and deploy employees.
The EpicMix solution was made possible by capturing data that was not previously being collected and making it useful to its users, the skiers and guests. Having used EpicMix and similar performance-tracking solutions, I find the added communication and collaboration features are what set it apart, and the ease of getting started makes it a great example of how fresh data can come from anywhere.
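To make the data-collection idea concrete, here is a minimal sketch of the kind of per-guest aggregation an RFID lift-scan stream enables. The event fields, names, and numbers are all assumptions for illustration; Vail's actual schema is not public in this post.

```python
from collections import Counter
from dataclasses import dataclass

# A toy model of the lift-scan events an RFID lift ticket might generate.
# Field names and values are hypothetical, not Vail's actual schema.
@dataclass
class LiftScan:
    guest_id: str
    lift: str
    vertical_feet: int

scans = [
    LiftScan("g1", "chair-7", 1200),
    LiftScan("g1", "chair-2", 900),
    LiftScan("g2", "chair-7", 1200),
]

# Per-guest vertical-feet totals: the raw material for badges,
# challenges, and the segmentation decisions described above.
vertical = Counter()
for s in scans:
    vertical[s.guest_id] += s.vertical_feet

print(vertical)  # e.g. Counter({'g1': 2100, 'g2': 1200})
```

The same aggregation, run per lift instead of per guest, is what would tell the resort where to deploy staff on a busy day.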
In the future it is easy to imagine features that streamline ordering services for guests (a table reservation at the restaurant for après-ski, say), or Vail leveraging the data to make business decisions, such as real-time offers to guests on the mountain or to frequent visitors on their next trip. Perhaps we will even see new ski-oriented wearables like XON bindings integrate with solutions like EpicMix, making even more data available without a second smartphone application.
Information for this post comes from Mapleton Hill Media and Vail Resorts.
In the 2011 film Moneyball, Billy Beane showed the sports industry how to use data analytics to acquire statistically optimal players for the Oakland A’s. In the last four years, advancements in data collection, preparation, aggregation, and advanced analytics technology have made it possible to broaden the scope of analytics beyond the game and the player, drastically changing the shape of an industry with a long history built on tradition.
Last week, MIT Sloan held its 9th annual Sports Analytics Conference in Boston, MA. Amidst the six-foot snow banks, sports fanatics and data scientists came together at this sold-out event to discuss the increasing role of analytics in the sports industry. This year’s conference agenda spanned topics from game statistics and modeling, player contract and salary negotiations, dynamic ticket pricing, and referee calls to improving fan experiences.
This latter topic, improving fan experiences, is one that has seen a boost in technology innovation such that data is more readily available for use in analytics. For example, newer NFL stadiums are wifi connected throughout so that fans can watch replays on their devices, tweet, and share selfies during the game. With mobile devices connected to the stadium’s wifi, franchises can drive revenue generating marketing campaigns to their home fan base throughout the game.
More important, however, is the need to keep the Millennial Generation interested in watching games live. According to an article posted by TechRepublic, college students are more likely to leave a game at halftime if they cannot connect to the internet or use social media. Teams need to keep fans in the stadiums, so the goal must be to ensure the fan experience in a live venue matches what fans can experience at home.
Innovation in advanced analytics and Big Data platforms such as Hadoop gives sports analysts the ability to access significant volumes of detailed data resulting in greater modeling accuracy. Streamlined data preparation tools speed the process from receiving raw data to delivering insight. Advanced analytics offered in the cloud as a service offers team owners and managers access to predictive analytics tools without having to manage and staff large data centers. Better visualization applications provide an effective way to communicate what the data means to those without a math degree.
Applying these innovations to new data sources, combined with the broader advancement of analytics in sports, will be game changing far beyond what Billy Beane was able to accomplish with the Oakland A’s.
Our congratulations to the winners of the top research papers submitted at the MIT Sloan Sports Analytics Conference: Who Is Responsible For A Called Strike? and Counterpoints: Advanced Defensive Metrics for NBA Basketball. It will be interesting to see how these models make an impact, with Spring Training and March Madness just around the corner. Maybe next year we will see a submission on the dependence of football pressure on atmospheric conditions and its impact on the NFL playoffs (PV = nRT), and get a data-driven explanation of Deflategate.
Recently, I got to attend the Predictive Analytics Summit in San Diego. It felt great to be in a room full of data scientists from around the world—all my hidden statistics, operations research, and even modeling background came back to me instantly. I was most interested to learn what this vanguard was doing, as well as any lessons learned that could be shared with the broader analytics audience. Presenters ranged from Internet leaders to more traditional companies like Scotts Miracle Gro. Brendan Hodge of Scotts Miracle Gro in fact said that, as a 125-year-old company, he feels like “a dinosaur at a mammal convention.” So in the space that follows, I will share my key takeaways from some of the presenters.
Fei Long from 58.com
58.com is the Craigslist, Yelp, and Monster of China. Fei shared that 58.com is using predictive analytics to recommend resumes to employers and to drive more intelligent real-time bidding for its products. Fei said that 58.com has 300 million users—about the population of the United States. Most interesting, Fei said that predictive analytics has driven a 10-20% increase in 58.com’s click-through rate.
Ian Zhao from eBay
Ian said that eBay is starting to increase the footprint of its data science projects. He said that historically the focus for eBay’s data science was marketing, but today eBay is applying data science to sales and HR. Provost and Fawcett agree in “Data Science for Business” by saying that “the widest applications of data mining techniques are in marketing for tasks such as target marketing, online advertising, and recommendations for cross-selling”.
Ian said that in the non-marketing areas, they are finding a lot less data; it is scattered across data sources and requires much more cleansing. Ian is using techniques like time-series analysis and ARIMA to look at employee attrition. One particularly interesting finding is a strong correlation between attrition and bonus payouts. Ian said it is critical to leave ample time for data prep, and that the process should start with data exploration and discovery, including confirming that data is available for hypothesis testing. Sometimes, he said, data prep can include imputing data that is not available in the data set and validating data summary statistics. Beyond that, data scientists need to dedicate time and resources to determining which factors are drivers. When talking with the business, data scientists should talk in terms of likelihood, because business people in general do not understand statistics. It is important as well that data scientists ask business people the “so what” questions and narrow things down to a dollar impact.
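The correlation finding can be illustrated with a minimal sketch. The numbers below are invented toy data, not eBay's; the point is only the mechanics of checking whether departures line up with payout months.

```python
import numpy as np

# Hypothetical monthly series (toy data, not eBay's):
# 1 marks a bonus-payout month, and attrition counts departures.
bonus_payout = np.array([0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1])
attrition    = np.array([2, 1, 7, 2, 3, 8, 1, 2, 9, 2, 1, 8])

# Pearson correlation between payout months and departures.
r = np.corrcoef(bonus_payout, attrition)[0, 1]
print(f"correlation between bonus payouts and attrition: {r:.2f}")
```

In practice this quick check would precede the heavier time-series work (ARIMA and the like), since a correlation this visible is often the "so what" the business actually needs.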
Barkha Saxena from Poshmark
Barkha is trying to model the value of user growth. This matters because Poshmark wants to be the #1 community-driven marketplace and to use data to create a “personal boutique experience”. With 700,000 transactions a day, they are measuring customer lifetime value by implementing a cohort analysis. Most interesting in Barkha’s data, she discovered repeatable performance across cohorts. In their analysis, different models work better depending on the data, so a lot of time goes into procedurally determining the best model fit.
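A cohort analysis of the kind described boils down to grouping users by when they first transacted and tracking activity by "age" since that first transaction. Below is a minimal pandas sketch on invented toy data (not Poshmark's), just to show the mechanics.

```python
import pandas as pd

# Toy transaction log (illustrative only): one row per purchase,
# with "month" counted from some launch date.
tx = pd.DataFrame({
    "user":  ["a", "a", "b", "b", "c", "c", "d"],
    "month": [0,    1,   0,   2,   1,   2,   1],
})

# A user's cohort is the month of their first purchase.
cohort = tx.groupby("user")["month"].min().rename("cohort")
tx = tx.join(cohort, on="user")
tx["age"] = tx["month"] - tx["cohort"]  # months since first purchase

# Active users per cohort at each age: the core cohort table.
table = tx.pivot_table(index="cohort", columns="age",
                       values="user", aggfunc="nunique")
print(table)
```

Summing revenue instead of counting users in the same pivot turns this into a customer-lifetime-value estimate per cohort, which is where the "repeatable performance across cohorts" observation would show up.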
Meagan Huth from Google
Meagan said that Google is creating something they call People Analytics. They are trying to make all people decisions by science and data. They want to make it cheaper and easier to work at Google. They have found through their research that good managers lower turnover, increase performance, and increase workplace happiness. The most interesting thing she says they have found is that the best predictor of being a good manager is being a good coach. They have developed predictive models around text threads, including those in employee surveys, to ensure they have the data needed to improve.
Hobson Lane from Sharp Labs
Hobson reminded everyone of the importance of the Nyquist criterion: you need to sample at more than twice the highest frequency present in the signal. This is especially important for organizations moving to the so-called Internet of Things, since many of these devices have extremely high data event rates. Hobson also discussed the importance of looking at the variance around the line drawn in a regression analysis; sometimes multiple lines can be drawn. He also discussed the problem of not having enough data to support the complexity of the decision that needs to be made.
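The Nyquist point is easy to demonstrate numerically: undersample a signal and it becomes indistinguishable from a slower alias. The frequencies and sample rate below are arbitrary choices for illustration.

```python
import numpy as np

# Sampling a 3 Hz sine at only 4 samples/s (below the 6 samples/s
# Nyquist rate for a 3 Hz signal) makes it indistinguishable from
# its 1 Hz alias. All numbers here are illustrative.
fs = 4.0                        # sample rate: too slow for 3 Hz
t = np.arange(16) / fs          # sample instants
undersampled = np.sin(2 * np.pi * 3 * t)
alias        = np.sin(2 * np.pi * (3 - fs) * t)  # alias folded to |3 - 4| = 1 Hz

print(np.allclose(undersampled, alias))  # prints True
```

This is exactly the IoT risk Hobson raised: a sensor polled too slowly does not merely miss fast events, it silently reports a plausible-looking but wrong slow signal.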
Ravi Iyer from Ranker
Ravi started by saying Ranker is a Yelp for everyone else. He then discussed the importance of having systematic data; a nice quote from him is “better data = better predictions”. Ravi also discussed response bias: asking about Coke in general can lead to a different answer than asking about Coke at a movie. Interestingly, he said their research shows that millennials are really all about “the best”. I see this happening every time I take my children out to dinner: there is no longer a cheap dinner out.
Ranjan Sinha at eBay
Ranjan discussed the importance of customer-centric commerce and creating predictive models around it. At eBay, they want to optimize the customer experience and improve their ability to make recommendations. eBay is finding customer expectations are changing. For this reason, they want customer context to be modeled by looking at transactions, engagement, intent, account, and inferred social behaviors. With modeling completed, they are using complex event processing to drive a more automated response to data. An amazing example: for Valentine’s Day, they use data about a man’s partner to predict the items he should get for his significant other.
Andrew Ahn from LinkedIn
Andrew is using analytics to create what he calls an economic graph and to make professionals more productive. One area where he personally is applying predictive analytics is LinkedIn’s sales solutions. In LinkedIn Sales Navigator, they display potential customers based upon the salesperson’s demographic data; effectively, the system makes lead recommendations. However, they want to de-risk this interaction for sales professionals and potential customers. Andrew says that, at the same time, they have found through data analysis that small changes in a LinkedIn profile can lead to big changes. Putting this together, they have created something they call the Social Selling Index. It looks at predictors they have determined are statistically relevant, including member demographics, site engagement, and social network. The SSI score is viewed as a predictive index. Andrew says they are trying to go from serendipity to data science.
Robert Wilde from Slacker Radio
Robert discussed the importance of simplicity and elegance in model building, then walked through a set of modeling issues to avoid. He said that modelers need to own the discussion of causality and cause and effect, and of how these can bias data interpretation. He also stressed looking at data variance: what does one do when a regression line doesn’t have a single point fall on it? Likewise, what do you do when correlation is strong, weak, or mistaken? Is it X or Y that has the relationship? Worse yet, what do you do when a correlation is coincidental? This led to a discussion of forward and reverse causal inference. For these reasons, Robert argued strongly for principal component analysis, which avoids choosing a regression direction and the causal bias that comes with it. At the same time, he suggested that models should be valued by weighing complexity against error rates.
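To show why PCA sidesteps the "is it X or Y" question, here is a minimal sketch on synthetic data: rather than regressing one variable on the other, it finds the direction of greatest variance, which is symmetric in the two variables. This is a generic illustration, not anything Robert presented.

```python
import numpy as np

# Minimal PCA via SVD on synthetic, strongly correlated 2-D data.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
data = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

centered = data - data.mean(axis=0)          # PCA requires centered data
_, s, vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)              # variance share per component

print(f"first component explains {explained[0]:.1%} of the variance")
```

Note the method treats both columns identically: neither is designated the "response", which is the sense in which it avoids the directional bias of ordinary regression. The `explained` vector also feeds directly into Robert's complexity-versus-error trade-off, since it tells you how little is lost by dropping components.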
Parsa Bakhtary from Facebook
Parsa has been looking at which games generate revenue for Facebook and which do not; Facebook amazingly has over 1,000 revenue-bearing games. For this reason, Facebook wants to look at the lifetime value of customers for Facebook games, the dollar value of a relationship. There is a problem, however, Parsa said: only 20% of players pay for their games. Parsa argued that customer lifetime value (a concept developed in the 1950s) doesn’t really work for apps, where everyone’s lifetime is not the same; additionally, social and mobile gamers are not particularly loyal. He says he therefore has to model individual games over their first 90 days, across all periods of joining, and then look at the cumulative revenue curves.
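The cumulative revenue curve itself is simple to construct; the modeling work lies in comparing curves across games and join cohorts. The sketch below uses randomly generated toy revenue, not Facebook data.

```python
import numpy as np

# Hypothetical daily revenue for one game's first 90 days (toy numbers).
rng = np.random.default_rng(1)
daily = rng.exponential(scale=100, size=90)  # day-by-day revenue
curve = np.cumsum(daily)                     # cumulative revenue curve

# Curves like this, built per game and per join cohort, are what get
# compared in place of a one-size-fits-all lifetime-value figure.
print(f"day-90 cumulative revenue: ${curve[-1]:,.0f}")
```

A game whose curve flattens early is monetizing only its launch spike, while one that keeps climbing has the durable payer base that a classic lifetime-value model assumes.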
So we have seen here a wide variety of predictive analytics techniques being used by today’s data scientists. To me this says that predictive analytical approaches are alive and kicking. This is good news and shows that data scientists are trying to enable businesses to make better use of their data. Clearly, a key step that holds data scientists back today is data prep. While it is critical to leave ample time for data prep, it is also essential to get quality data to ensure models are working appropriately. At the same time, data prep needs to support imputing data that is not available within the original data set.
Solution Brief: Data Prep
Author Twitter: @MylesSuer
Despite more than $30 billion in annual spending on Big Data, successful big data implementations elude most organizations. That’s the sobering assessment of a recent Capgemini study of 226 senior executives, which found that only 13 percent feel they have truly made any headway with their big data efforts.
The reasons for Big Data’s lackluster performance include the following:
- Data is in silos or legacy systems, scattered across the enterprise
- No convincing business case
- Ineffective alignment of Big Data and analytics teams across the organization
- Most data locked up in petrified, difficult-to-access legacy systems
- Lack of Big Data and analytics skills
Actually, there is nothing new about any of these issues; the perceived problems with Big Data initiatives so far map closely to the failed expectations of many other technology-driven initiatives. First, there’s the hype, which tends to get way ahead of any actual well-functioning case studies. Second, there’s the notion that managers can simply take a solution of impressive magnitude and drop it on top of their organizations, expecting overnight delivery of profits and enhanced competitiveness.
Technology, and Big Data itself, is but a tool that supports the vision, well-designed plans and hard work of forward-looking organizations. Those managers seeking transformative effects need to look deep inside their organizations, at how deeply innovation is allowed to flourish, and in turn, how their employees are allowed to flourish. Think about it: if line employees suddenly have access to alternative ways of doing things, would they be allowed to run with it? If someone discovers through Big Data that customers are using a product differently than intended, do they have the latitude to promote that new use? Or do they have to go through chains of approval?
Big Data may be what everybody is after, but Big Culture is the ultimate key to success.
For its part, Capgemini provides some high-level recommendations for better baking transformative values into Big Data initiatives, based on its observations of best-in-class enterprises:
The vision thing: “It all starts with vision,” says Capgemini’s Ron Tolido. “If the company executive leadership does not actively, demonstrably embrace the power of technology and data as the driver of change and future performance, nothing digitally convincing will happen. We have not even found one single exception to this rule. The CIO may live and breathe Big Data and there may even be a separate Chief Data Officer appointed – expect more of these soon – if they fail to commit their board of executives to data as the engine of success, there will be a dark void beyond the proof of concept.”
Establish a well-defined organizational structure: “Big Data initiatives are rarely, if ever, division-centric,” the Capgemini report states. “They often cut across various departments in an organization. Organizations that have clear organizational structures for managing rollout can minimize the problems of having to engage multiple stakeholders.”
Adopt a systematic implementation approach: Surprisingly, even the largest and most sophisticated organizations that run everything by process don’t necessarily approach Big Data this way, the report states. “Intuitively, it would seem that a systematic and structured approach should be the way to go in large-scale implementations. However, our survey shows that this philosophy and approach are rare. Seventy-four percent of organizations did not have well-defined criteria to identify, qualify and select Big Data use-cases. Sixty-seven percent of companies did not have clearly defined KPIs to assess initiatives. The lack of a systematic approach affects success rates.”
Adopt a “venture capitalist” approach to securing buy-in and funding: “The returns from investments in emerging digital technologies such as Big Data are often highly speculative, given the lack of historical benchmarks,” the Capgemini report points out. “Consequently, in many organizations, Big Data initiatives get stuck due to the lack of a clear and attributable business case.” To address this challenge, the report urges that Big Data leaders manage investments “by using a similar approach to venture capitalists. This involves making multiple small investments in a variety of proofs of concept, allowing rapid iteration, and then identifying PoCs that have potential and discarding those that do not.”
Leverage multiple channels to secure skills and capabilities: “The Big Data talent gap is something that organizations are increasingly coming face-to-face with. Closing this gap is a larger societal challenge. However, smart organizations realize that they need to adopt a multi-pronged strategy. They not only invest more in hiring and training, but also explore unconventional channels to source talent.” Capgemini advises reaching out to partner organizations for the skills needed to develop Big Data initiatives. These can be employee exchanges, or “setting up innovation labs in high-tech hubs such as Silicon Valley.” Startups may also be another source of Big Data talent.