Fast and Furious: Designing for the Fast Data Lane: Transporting and Processing Streams
In the earlier post, we discussed how stream processors such as Informatica VDS can process events at the edge. In this post, we look at how these events are transported to the corporate data center for streaming analytics.
As you’re lounging in front of the TV flipping channels, a product in an infomercial catches your eye. While you watch and listen as the host describes the product, you pick up your tablet or smartphone to do more research and see very specific, targeted promotions for that product. Have you ever wondered how the company can react to your interest so quickly?
In this example, your interaction with the company’s website captures events, transports your data points for further analysis (such as matching or enriching the data), and sends a response back to you, the consumer, all in real time. This is the power of the fast data lane: the ability to collect, transport, process, and analyze events, and deliver an outcome, instantly.
The transportation of events from producer to consumer can be a complex art form. Integrating application events into an overall data system can be accomplished with a message broker that mediates communication between systems. Once an application publishes an event to the broker, other applications can consume it asynchronously at whatever latency suits them (real-time, mini-batch, or batch).
To facilitate the movement of millions or billions of events, Apache Kafka was developed as a fast, distributed, scalable, and durable event transport system built on a modern take on the publish-subscribe model. Although Kafka may sound like a traditional message queue, one of its differentiators is fault-tolerant storage: messages are replicated to multiple servers and committed to disk, which protects against data loss. Kafka has become the de facto real-time publish-subscribe system that helps organizations move events captured by stream processors to consumer frameworks such as Spark Streaming for analytic workloads. Alternatively, cloud-based services such as Azure Event Hubs, AWS Kinesis, and Google Cloud Pub/Sub offer fully managed event transport (platform as a service) that can reliably move massive amounts of data with low latency and at low cost.
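The key property of this model, that producers and consumers are decoupled and each consumer reads the log at its own pace, can be illustrated with a minimal in-memory sketch. This is plain Python standing in for a broker, not the Kafka API itself; the `Broker` class, topic name, and consumer ids are hypothetical:

```python
from collections import defaultdict

class Broker:
    """Toy in-memory publish-subscribe broker (illustration only)."""

    def __init__(self):
        # Each topic keeps an append-only log, as Kafka does per partition.
        self.logs = defaultdict(list)
        # Each (topic, consumer) pair tracks its own read offset,
        # so consumers progress independently of one another.
        self.offsets = defaultdict(int)

    def publish(self, topic, event):
        self.logs[topic].append(event)

    def consume(self, topic, consumer_id):
        """Return events this consumer has not yet seen; callable at any latency."""
        key = (topic, consumer_id)
        start = self.offsets[key]
        events = self.logs[topic][start:]
        self.offsets[key] = len(self.logs[topic])
        return events

broker = Broker()
broker.publish("clickstream", {"user": "u1", "page": "product"})
broker.publish("clickstream", {"user": "u2", "page": "checkout"})

# Two independent consumers each read the full log at their own pace.
analytics = broker.consume("clickstream", "analytics")
archive = broker.consume("clickstream", "archive")
```

A real broker adds what the toy leaves out: partitioning for scale, replication for durability, and retention policies, but the publish/consume contract the applications see is the same.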
In the next phase, streaming analytics solutions consume the events from Kafka (or any of the cloud-based services mentioned above), transform them (filtering, cleansing, aggregating, or enriching), and either send real-time alerts or notifications back to Kafka or persist the events to a data store (such as Hadoop) for batch analysis, where trained models built by data scientists are applied.
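Those four transformations can be sketched as chained stages over a stream of events. This is a minimal illustration in plain Python generators, not any particular streaming engine; the event fields and the product catalog are assumptions for the example:

```python
# Each stage consumes an iterable of events and yields transformed ones,
# mimicking how a streaming engine chains operators over an unbounded stream.

def filter_events(events):
    # Filter: drop events that carry no user id.
    return (e for e in events if e.get("user"))

def cleanse(events):
    # Cleanse: normalize the page field (trim whitespace, lowercase).
    for e in events:
        yield {**e, "page": e["page"].strip().lower()}

def enrich(events, catalog):
    # Enrich: join each event with reference data (here, a product catalog).
    for e in events:
        yield {**e, "category": catalog.get(e["page"], "unknown")}

def aggregate(events):
    # Aggregate: count events per category (a windowed count in a real engine).
    counts = {}
    for e in events:
        counts[e["category"]] = counts.get(e["category"], 0) + 1
    return counts

raw = [
    {"user": "u1", "page": " Blender "},
    {"user": None, "page": "blender"},   # dropped by the filter
    {"user": "u2", "page": "Blender"},
]
catalog = {"blender": "kitchen"}
result = aggregate(enrich(cleanse(filter_events(raw)), catalog))
# result: {"kitchen": 2}
```

The aggregated counts are what would flow onward as an alert back to Kafka or be persisted for batch analysis; a production engine applies the same operator chain continuously over windows rather than once over a finite list.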
In the upcoming blog, we will dive into how Informatica’s streaming solution helps build a real-time data pipeline that allows organizations to prepare and process data in streams, collecting, transforming, and joining data from a variety of sources, leveraging Kafka or other message queues to scale to billions of events with a processing latency of under a second.
This blog is part of the series on stream processing and analytics. Catch the series: