Making IoT Payoff With Big Data Streaming Analytics

The Internet of Things (IoT) attracts a lot of media attention in the world of Big Data, with the promise to improve service, reduce fraud, and deliver innovative products. But none of this happens overnight. Most Big Data initiatives that strive to take advantage of IoT are implemented in phases. Many organizations start their big data journey with data warehouse optimization to reduce costs and to establish Hadoop as part of the foundation of an architecture that can support their big data projects. The vision is to build out a data lake that can store, process, and manage all types of data, including social media and IoT machine data, at any scale, primarily for data exploration and discovery of new business insights.

By adding MDM and real-time streaming capabilities to a Big Data initiative, you can create a more complete view of customers and deliver real-time operational intelligence that optimally predicts next-best offers to customers, improves fraud detection and cyber security, and improves total customer experience.

The use cases for Big Data Streaming Analytics require more sophisticated data platforms than traditional streaming analytics. For example, one Informatica customer uses big data streaming analytics as part of its risk reduction program to proactively reduce money transfer fraud and money laundering. Fraud and anti-money laundering (AML) detection requires a big data management platform that can operationalize a predictive model to detect and respond to fraudulent and money-laundering patterns in real time. These predictive models evolve and improve over time through an iterative and agile approach to Big Data Streaming Analytics:

  • First, discover insights into fraudulent patterns and cyber security threats that may develop over months or even years.
  • Second, analytic models are hypothesized, tested, and validated using a variety of data types, including relational transaction data and IoT machine data.
  • Third, once a model with sufficient predictive power is discovered, it is implemented as a data pipeline that detects and responds to fraudulent events and cyber security threats in real time.
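The three steps above can be sketched end to end. This is a deliberately minimal illustration, not any vendor's actual model: the historical transfer amounts, the account names, and the "three standard deviations above the mean" rule are all hypothetical stand-ins for a real discovery-and-validation process.

```python
import statistics

# Step 1 (discovery): historical money-transfer amounts per account,
# standing in for months of warehoused transaction data. Values are
# made up for illustration.
history = {
    "acct-1": [120.0, 95.0, 130.0, 110.0, 101.0],
    "acct-2": [40.0, 55.0, 48.0, 60.0, 52.0],
}

# Step 2 (hypothesize and validate): a simple model that flags any
# transfer more than k standard deviations above the account's mean.
def fit_model(history):
    return {
        acct: (statistics.mean(xs), statistics.stdev(xs))
        for acct, xs in history.items()
    }

# Step 3 (operationalize): apply the validated model as a per-event
# check inside the real-time pipeline.
def is_suspicious(model, acct, amount, k=3.0):
    mean, stdev = model[acct]
    return amount > mean + k * stdev

model = fit_model(history)
print(is_suspicious(model, "acct-1", 5000.0))  # True: unusually large transfer
print(is_suspicious(model, "acct-1", 115.0))   # False: within normal range
```

Real fraud and AML models are far richer than a threshold rule, but the shape is the same: parameters learned offline from historical data, then evaluated per event in the stream.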

The only proven way to support such an iterative and holistic approach to Big Data Streaming Analytics use cases like this one is a Big Data Management platform that can:

Acquire all types of data at any latency:  Ingest all types of data (e.g., machine, relational, social, log files) of any size, at a variety of latencies, with high availability and reliability.  Machine data is generated, collected, and streamed in real time to persistent and scalable data storage and processing platforms such as Hadoop.  Relational data (e.g., transactions and profiles) is typically ingested in near real time (micro-batch) or batch into the same data platform.  All of this data builds up what is commonly referred to as a data lake or data hub.
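A minimal sketch of what multi-latency ingestion into a shared store might look like. The in-memory "lake", the record shapes, and the function names are all illustrative assumptions, not any particular product's API; a real platform would write to durable storage such as HDFS.

```python
from collections import defaultdict

# Toy stand-in for the data lake: one list of records per source type.
lake = defaultdict(list)

def ingest_stream(record):
    """Real-time path: append a single machine event as it arrives."""
    lake["machine"].append(record)

def ingest_micro_batch(records):
    """Near-real-time path: append a small batch of relational rows."""
    lake["relational"].extend(records)

# One streamed sensor event plus one micro-batch of transactions.
ingest_stream({"sensor": "atm-7", "temp_c": 41})
ingest_micro_batch([{"txn_id": 1, "amount": 250.0},
                    {"txn_id": 2, "amount": 80.0}])
print(len(lake["machine"]), len(lake["relational"]))  # 1 2
```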

Discover insights in big data at scale:  Profile, integrate, cleanse, match, and deliver big data at scale for exploratory data analysis and for validating statistical and machine learning models.  This crucial step combines data of varying types, formats, and sizes into the data sets used to build predictive analytic models.  It is typically a scalable batch process executed in MapReduce, Spark, or proprietary engines via YARN.  It requires collaborative tools so that data scientists, analysts, engineers, and stewards can architect and deliver the trusted and secure data pipelines that feed the analytic models.
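The cleanse-and-match part of that batch step can be illustrated in miniature. The raw records, field names, and exact-match-on-email rule below are assumptions made for the sketch; production matching is typically fuzzy and runs distributed in an engine like Spark rather than in local Python.

```python
# Raw customer records as they might land in the lake: inconsistent
# casing, stray whitespace, numbers stored as strings, duplicates.
raw = [
    {"id": 1, "email": " Ana@Example.com ", "amount": "120.50"},
    {"id": 2, "email": "ana@example.com",   "amount": "120.50"},  # duplicate
    {"id": 3, "email": "bo@example.com",    "amount": "75.00"},
]

def cleanse(rec):
    """Standardize one record: normalize the email, parse the amount."""
    return {"email": rec["email"].strip().lower(),
            "amount": float(rec["amount"])}

def match_and_dedupe(records):
    """Keep the first record per match key (here, the cleansed email)."""
    seen, out = set(), []
    for rec in map(cleanse, records):
        if rec["email"] not in seen:
            seen.add(rec["email"])
            out.append(rec)
    return out

clean = match_and_dedupe(raw)
print(len(clean))  # 2 records survive after matching
```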

Operationalize insights into actionable results:  Implement predictive models and insights as data pipelines at multiple latencies, based on business requirements for complexity, speed, accuracy, throughput, reliability, and cost.  Once an insight is discovered or a predictive model is validated, it needs to be implemented as a data pipeline that meets the latency requirements, typically through real-time data collection, streaming, and CEP engines.  The most efficient, reliable, and agile approach is to operationalize the same data pipeline that was designed and built during the discovery phase in which the model was validated.
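To make the CEP idea concrete, here is a hedged sketch of one event-pattern rule: flag an account that makes three or more transfers inside a 60-second window. The window length, the event threshold, and the account name are illustrative assumptions, and a real CEP engine would express this declaratively rather than in hand-written Python.

```python
from collections import defaultdict, deque

WINDOW_S = 60    # sliding-window length in seconds (assumed)
MAX_EVENTS = 3   # events within the window that trigger an alert (assumed)

# Per-account queue of recent event timestamps.
recent = defaultdict(deque)

def on_event(acct, ts):
    """Process one streamed transfer; return True if the pattern fires."""
    q = recent[acct]
    q.append(ts)
    # Evict timestamps that have slid out of the window.
    while q and ts - q[0] > WINDOW_S:
        q.popleft()
    return len(q) >= MAX_EVENTS

# Three rapid transfers trip the rule; a later, isolated one does not.
alerts = [on_event("acct-9", t) for t in (0, 10, 20, 300)]
print(alerts)  # [False, False, True, False]
```

The same pattern logic can be validated offline against historical events and then attached to the live stream, which is exactly the "operationalize the discovery-phase pipeline" approach described above.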

There are point solutions that can partially address each of the three key requirements listed above, but Informatica V10 is the first to deliver a complete solution for Big Data Streaming Analytics.