Enterprises use Hadoop in data-science applications that improve operational efficiency, grow revenues or reduce risk. Many of these data-intensive applications use Hadoop for log analysis, data mining, machine learning or image processing.
Commercial, open source or internally developed data-science applications must handle large volumes of semi-structured, unstructured or raw data. They benefit from Hadoop’s combination of storage and processing in each data node spread across a cluster of cost-effective commodity hardware. Hadoop’s lack of a fixed schema works particularly well for answering ad-hoc queries and exploratory “what if” scenarios.
At a large entertainment company profiled in a PwC Technology Forecast, the technology shared-services group uses Hadoop as an integration mashup for diverse departmental data, analyzing patterns across different but connected customer activities such as attendance at theme parks, purchases from stores, and viewership of cable television programming. This complete view of customer activity drives up-sell revenue through better-optimized recommendation engines and personalized marketing offers that would not have been possible with siloed departmental data. This is particularly true with the advent of new sales channels and new forms of customer engagement through social media analytics.
Hadoop is not a replacement for master data management (MDM). Lumping data from disparate sources into a Hadoop “data bag” can empower ad-hoc analysis for marketing up-sell or churn management, but does not by itself solve broader business or compliance problems with inconsistent, incomplete or poor quality data that may vary by business unit or by geography.
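As a minimal illustration of this “data bag” pattern, heterogeneous raw records from different source systems can be rolled up into a fixed-schema summary before loading into a warehouse or analysis tool. This is a hypothetical sketch; the record layouts and field names are illustrative, not any particular company’s schema:

```python
from collections import defaultdict

def preaggregate(raw_records):
    # Roll heterogeneous "data bag" records up to one fixed-schema
    # row per customer; field names here are illustrative assumptions.
    totals = defaultdict(lambda: {"purchases": 0, "page_views": 0})
    for rec in raw_records:
        customer = rec.get("customer_id")
        if customer is None:
            continue                      # tolerate incomplete records
        if rec.get("type") == "purchase":
            totals[customer]["purchases"] += 1
        elif rec.get("type") == "click":
            totals[customer]["page_views"] += 1
    return dict(totals)

raw = [
    {"customer_id": "c1", "type": "purchase", "sku": "A"},   # store system
    {"customer_id": "c1", "type": "click", "url": "/home"},  # web logs
    {"customer_id": "c2", "type": "click"},
    {"type": "click"},                                       # no customer id
]
rows = preaggregate(raw)
```

Note that the sketch simply skips records with no customer identifier; that tolerance is exactly why such a pipeline enables ad-hoc analysis but does not, by itself, address the data-quality and consistency problems that MDM exists to solve.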
Hadoop is powering entrants in data as a service and other new “blue ocean” market spaces, as described in the book Blue Ocean Strategy by W. Chan Kim and Renée Mauborgne. A few examples:
- StumbleUpon provides web content personalized to individual interests, using a combination of MySQL, HBase and Hive.
- An online video advertising platform uses Hadoop and the Infobright open-source analytic database to enable publishers, advertisers, ad networks and media groups to monitor more than a dozen engagement metrics including percentage viewed/completed, pause/resume and muting.
- For medical research, Hadoop powers DNA sequencing analysis in multiple genomic research projects.
- A Netherlands-based research center is launching a Hadoop production facility for scientists working on frontier research in bioinformatics, computer science, information retrieval, natural language processing and other disciplines.
At a large bank that presented at the Yahoo! 2010 Hadoop Summit, Hadoop lets quantitative analysts study entire data sets with billions of records, rather than samples, to assess the market, credit and operational risk, and the revenue lift, of new and existing financial products. Complete data sets are especially helpful for credit card fraud detection, where isolated purchases spread across time or location may reveal patterns that indicate stolen credit card numbers.
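To make the fraud-detection idea concrete, here is a minimal MapReduce-style sketch. The transaction fields and thresholds are assumptions for illustration, not the bank’s actual pipeline; the point is that grouping every purchase by card across the full data set surfaces patterns a sample would miss:

```python
from collections import defaultdict

def map_phase(transactions):
    # Map step: key each purchase by (card, hour) so the shuffle groups
    # all of a card's activity within that hour together.
    for card, hour, city in transactions:
        yield (card, hour), city

def reduce_phase(keyed_purchases):
    # Reduce step: collect the distinct cities seen per (card, hour).
    cities_by_key = defaultdict(set)
    for key, city in keyed_purchases:
        cities_by_key[key].add(city)
    # A card purchasing in several cities within one hour is suspicious.
    return sorted({card for (card, _hour), cities in cities_by_key.items()
                   if len(cities) > 1})

transactions = [
    ("card-1", 9, "New York"),
    ("card-1", 9, "Singapore"),  # same hour, distant city: flagged
    ("card-2", 9, "Boston"),
    ("card-2", 11, "Boston"),
]
suspects = reduce_phase(map_phase(transactions))
```

On a cluster, the map and reduce functions would run as Hadoop jobs over billions of records; the in-memory version here only shows the shape of the computation.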
Note that, as technologies at the data storage and processing layer, HDFS and MapReduce extend rather than replace fraud management systems. To return to the large bank example: at the same time it invested in Hadoop and in columnar and R-language extensions to its Teradata enterprise data warehouse, it implemented a SAS high-performance risk management offering to power credit-risk modeling, scoring and loss forecasting. These work in concert with the other elements of its data architecture, which include IBM Cognos business intelligence, Tableau reporting and a SAP global ERP reporting system.
To turn to another risk management use case, a large power company uses Hadoop to store and analyze environmental sensor data for critical infrastructure testing of its smart energy grid and individual generators. The company can improve network performance, scan historical logs for forensics after a problem occurs, and pinpoint weaknesses to help prevent power outages.
Over time, more enterprise application vendors will bundle and support Hadoop directly in their software stacks, but that is not commonplace yet. One example in digital security is Zettaset (previously GOTO Metrics), which offers a Hadoop-enabled security event management application that analyzes logs from multiple security systems, including firewalls, intrusion detection, network switches and host-based sensors.
Improve Operational Efficiency
Hadoop improves operational efficiency by processing and storing the immense quantities of log data generated by networks, mobile devices, applications, online advertising and sensors more cost-effectively than many alternative architectures can. A large mobile device manufacturer uses Hadoop to process and understand ever-growing volumes of mobile-device usage data, and several online advertising networks use Hadoop for clickstream log analysis, data mining and machine learning. A global leader in car navigation software uses Hadoop for location-based content processing, including machine learning algorithms for statistical categorization, deduplication, aggregation and curation.
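Clickstream log analysis is a natural fit for MapReduce because raw log lines need no fixed schema up front. The following word-count-style sketch is hypothetical; the log format and field positions are assumptions, not any vendor’s actual layout:

```python
from collections import Counter

def mapper(log_lines):
    # Map step: emit (page, 1) per request; the raw line is parsed
    # on read, with no schema imposed at load time.
    for line in log_lines:
        fields = line.split()
        if len(fields) >= 3:          # tolerate malformed lines
            yield fields[2], 1        # assumed position of the requested page

def reducer(pairs):
    # Reduce step: sum the click counts per page.
    counts = Counter()
    for page, n in pairs:
        counts[page] += n
    return counts

log_lines = [
    "2010-06-01T09:00 user1 /home",
    "2010-06-01T09:01 user2 /products",
    "2010-06-01T09:02 user1 /home",
    "corrupt line",                   # skipped by the length check
]
page_views = reducer(mapper(log_lines))
```

In production, the same mapper and reducer logic would typically run via Hadoop Streaming or a native MapReduce job over terabytes of logs spread across the cluster.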
Hadoop can also save time and money in image processing. As described in a NYTimes.com blog post, The New York Times used Hadoop and Amazon Web Services to convert its historical archives from TIFF images to PDFs for online access. In another example, a large eCommerce company extended its Teradata enterprise data warehouse with Hadoop for image processing and deep data mining. Related uses include comparing satellite images of the same location to automatically identify changes over time, or comparing texts to more quickly find similarities and differences among documents.
As shown in several of these use cases, Hadoop can serve as a “data bag” for data aggregation and pre-processing before loading into a data warehouse. At the same time, organizations can offload data from an enterprise data warehouse into Hadoop to create virtual sandboxes for use by data analysts. My next blog will go into more detail on the architectural options.