Back to the Future with Spark

July 3rd, 2015 marked the 30th anniversary of one of my all-time favorite movies, "Back to the Future." To see the greatest possibilities offered by Big Data, data scientists will need to go back to the future, using machine learning to drive data insights. What is new in this future is the number of parallelized data processing platforms available for Big Data applications.

Data scientists at the recent Strata + Hadoop World conference in San Jose, Calif., talked about the complexity of predictive machine learning algorithms and the sheer number of these models. They were concerned that this could limit the use of machine learning in an enterprise because, while the power of machine learning techniques scales with the data, training times increase exponentially. Further, with several models and growing masses of data, iterative machine learning becomes a bottleneck. As a result, models are run on samples rather than full or near-full data sets, which compromises accuracy and predictability. [1]

Too many models

Using Hadoop along with machine learning software, you can run larger sets of data against existing learning models, which can lead to better predictions. These models can improve decision making around pricing, fraud prevention, underwriting, and marketing.

In the insurance industry, the bread-and-butter actuarial work that goes back generations makes it a hotspot for statistical machine learning algorithms that predict outcomes. However, data size, model complexity, and the number of iterations required to train models successfully impose processing limitations.

Time is of the essence

Machine learning has underpinned analytics for many years; the notion is so familiar that we hardly speak of it anymore. [2] What empowers machine learning today is that you can now process much more data, drawing on tremendous computational power spread across many machines rather than a single computer.

Time sensitivity comes from the fact that these analyses are supposed to lead to concrete actions.

For a retailer, it is important to identify buyer characteristics quickly so that teams can act on them. There also needs to be enough time in the sales cycle to get analytical information to marketing and sales personnel, who in turn must create product packages that appeal to customers. Imagine the value if you could produce a probability measure of who will buy and an estimate of how much they will spend; the combination can be very powerful. The challenge is building as many predictive models as you need, quickly enough to allow the resulting actions to be customized.
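To make that concrete, here is a minimal sketch of the "who will buy" half of that combination, written against Spark's MLlib library (the subject of the next section). The customers.csv file, its layout, and the example feature values are hypothetical, chosen purely for illustration; the "how much will they spend" half could be an analogous regression model trained on buyers' historical order values.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object BuyerPropensity {
  def main(args: Array[String]): Unit = {
    // local[*] for a laptop; point at a cluster master in production.
    val sc = new SparkContext(
      new SparkConf().setAppName("BuyerPropensity").setMaster("local[*]"))

    // Hypothetical training file: each line is "label,feature1,feature2,..."
    // where label is 1.0 (bought) or 0.0 (did not buy).
    val training = sc.textFile("customers.csv").map { line =>
      val fields = line.split(',').map(_.toDouble)
      LabeledPoint(fields.head, Vectors.dense(fields.tail))
    }.cache()

    // Fit a logistic regression classifier on the full, distributed data set.
    val model = new LogisticRegressionWithLBFGS().run(training)
    model.clearThreshold() // predict() now returns a probability, not a 0/1 label

    // Probability that a customer with these (hypothetical) features will buy.
    val pBuy = model.predict(Vectors.dense(3.0, 12.0, 1.0))
    println(f"Probability of purchase: $pBuy%.3f")

    sc.stop()
  }
}
```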

Spark makes sense

MLlib is the Apache Spark data processing engine's scalable machine learning library. It consists of common learning algorithms and utilities, including classification, regression, clustering, collaborative filtering, and dimensionality reduction, as well as underlying optimization primitives. It has often seen use in what might be described as new-age machine learning applications, for example the recommendation engines found on many websites.
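As a flavor of how compact such an engine can be, here is a minimal sketch using MLlib's alternating least squares (ALS) implementation of collaborative filtering. The ratings.csv file, its "userId,productId,rating" layout, and the parameter values are assumptions chosen purely for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

object ProductRecommender {
  def main(args: Array[String]): Unit = {
    // local[*] for a laptop; point at a cluster master in production.
    val sc = new SparkContext(
      new SparkConf().setAppName("ProductRecommender").setMaster("local[*]"))

    // Hypothetical ratings file: each line is "userId,productId,rating".
    val ratings = sc.textFile("ratings.csv").map { line =>
      val Array(user, product, rating) = line.split(',')
      Rating(user.toInt, product.toInt, rating.toDouble)
    }

    // Factorize the user-product rating matrix: 10 latent features,
    // 10 ALS iterations, regularization parameter 0.01.
    val model = ALS.train(ratings, 10, 10, 0.01)

    // Top 5 product recommendations for (hypothetical) user 42.
    model.recommendProducts(42, 5).foreach(println)

    sc.stop()
  }
}
```

Because ALS runs on Spark's distributed engine, the same few lines scale from a laptop to a cluster simply by changing the master URL.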

Twitter @bigdatabeat

Some of the content for this blog was provided by [1] Ryan Michaluk and [2] Lou Carvalheira.
