Apparently, everyone’s favorite words these days are “big data.” But just because some new tools and techniques promise the potential of absorbing and analyzing huge amounts of data from a variety of sources, it does not mean that installing Hadoop in your enterprise will automatically help you get new insights from existing and “big” data any faster.
In fact, before you install that software, let’s think a bit about some core challenges faced by anyone developing big data applications (or, in fact, any applications requiring data reusability): accessibility, latency, and quality. Let’s look at each of these in a little more detail.
Most applications developed to support siloed or business function-related processes are built with the presumption that the data sets are persisted in some kind of organization. Whether that organization is simplistic (files with fixed-field columns) or more sophisticated (such as entity-relationship models, etc.), the application programmer can depend on the accessibility of the data to support the acute business needs. Data reuse and data repurposing change the rules somewhat, especially when the data sets are sourced from alternate persistent or streamed models.
For example, your business intelligence applications want to aggregate and analyze data taken from a number of different sources, and that means that some methods must be created and invoked to extract the data from those sources and migrate that data into some target mechanism for analysis and/or reporting. And recognizing that the data sources may originate from a wide variety of systems or streams, software vendors evolved tools for accessing data from the original source in preparation for transformation and loading into data warehouses that would then feed the analysis and reporting engines.
As long as each piece of the analysis and reporting framework is distinct (and in many cases that remains true), solving the accessibility challenge using extraction, transformation, and loading (ETL) tools is “good enough.” This first generation of data integration techniques (at this point, at least) has effectively become a commodity, becoming bundled with other data management tools, or independently developed via open source channels. But as the expected timeframe for analysis becomes more compressed, the diversity of data source formats expands, and the volumes of data increase, this “phased extraction and transformation” for the purposes of data integration begins to seem insufficient.
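To make the phased approach concrete, here is a minimal sketch of the extraction, transformation, and loading steps described above. The source data, region lookup, and table names are all hypothetical, and SQLite stands in for the data warehouse:

```python
import csv
import io
import sqlite3

# Hypothetical source extracts in two different formats (assumed for illustration).
ORDERS_CSV = "order_id,region,amount\n1,EAST,100.50\n2,WEST,75.25\n3,EAST,20.00\n"
REGIONS = [{"code": "EAST", "name": "Eastern"}, {"code": "WEST", "name": "Western"}]

def extract(csv_text):
    """Extract: pull rows out of the source's native format."""
    return list(csv.DictReader(io.StringIO(csv_text)))

def transform(rows, region_lookup):
    """Transform: cast types and conform region codes to warehouse names."""
    names = {r["code"]: r["name"] for r in region_lookup}
    return [(int(r["order_id"]), names[r["region"]], float(r["amount"]))
            for r in rows]

def load(rows, conn):
    """Load: persist the conformed rows into the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract(ORDERS_CSV), REGIONS), conn)
totals = conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region ORDER BY region"
).fetchall()
print(totals)  # [('Eastern', 120.5), ('Western', 75.25)]
```

Each phase runs to completion before the next begins, which is exactly the property that becomes a liability as data volumes grow and analysis windows shrink.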
Meeting performance needs is yet another fundamental challenge, especially with increased demands for absorption of data sources whose sizes are orders of magnitude greater than those experienced in the past. The issue is not necessarily the ability to extract and transform the data, but rather the ability to do so within the narrowed time frame for discovery and subsequent exploitation of actionable knowledge. More simply, the second challenge involves reducing data latency, or eliminating the throttling effects of limited I/O channels and network bandwidth through which ever-increasing data volumes need to be pushed.
Data latency is not a new issue; it has been present throughout the history of computing. There have always been delays in moving data from larger and slower platforms to smaller and faster ones. A primary example is the microprocessor itself: a CPU with a small number of registers must perform its calculations on data streamed into main memory (which is slower than the registers) from its storage on disk (which is slower still than main memory).
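One common way to soften the latency problem at the software level is to process records as they arrive rather than staging the full extract first. The sketch below is a hypothetical illustration of that idea using Python generators; the feed and field names are assumptions:

```python
from typing import Iterable, Iterator

def stream_records(source: Iterable[str]) -> Iterator[dict]:
    """Parse records one at a time instead of staging the whole extract."""
    for line in source:
        ts, value = line.strip().split(",")
        yield {"ts": ts, "value": float(value)}

def running_total(records: Iterator[dict]) -> Iterator[float]:
    """Emit an updated aggregate as each record arrives, so downstream
    consumers see results before the source has been fully read."""
    total = 0.0
    for rec in records:
        total += rec["value"]
        yield total

# A hypothetical feed; in practice this could be a file handle or socket.
feed = ["2024-01-01,10.0", "2024-01-02,5.5", "2024-01-03,4.5"]
stream_totals = list(running_total(stream_records(feed)))
print(stream_totals)  # [10.0, 15.5, 20.0]
```

Nothing here makes the I/O channel itself faster; the gain is that computation overlaps with delivery instead of waiting behind it, much as a CPU works on data already in its registers while more streams in from memory.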
Addressing the Dilemma with Data Federation
You might see that these two issues are somewhat aligned: the first, accessibility, deals with pulling the data from its sources, while the second, latency, concerns the speed at which that data is delivered to its intended target. The way data integration techniques evolved led to a solution approach that seems to address both of these issues. Data federation enables a service-based approach in which the extraction componentry is largely abstracted away from the integration experience, while still providing some level of performance in delivering the data when it is needed. In the next post, we’ll look at some key similarities between solutions for addressing the latency issue at the hardware level and data federation.
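The federated idea can be sketched as a thin layer that queries each source in place, on demand, through a common interface, rather than copying everything into a staging area first. This is a minimal illustration, not a real federation engine; the class names, sources, and schema are all assumptions:

```python
import sqlite3

class SqlSource:
    """Wraps a relational source behind a uniform fetch() interface."""
    def __init__(self, conn, query):
        self.conn, self.query = conn, query
    def fetch(self, region):
        return self.conn.execute(self.query, (region,)).fetchall()

class ListSource:
    """Wraps an in-memory (e.g., application-extract) source the same way."""
    def __init__(self, rows):
        self.rows = rows
    def fetch(self, region):
        return [(r["id"], r["amount"]) for r in self.rows if r["region"] == region]

def federated_query(sources, region):
    """Pull matching rows from every source at request time -- no staged copy."""
    results = []
    for s in sources:
        results.extend(s.fetch(region))
    return sorted(results)

# Hypothetical sources: a SQL table and a list of records from another system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [(1, "EAST", 100.0), (2, "WEST", 40.0)])
crm_rows = [{"id": 3, "region": "EAST", "amount": 25.0}]

sources = [SqlSource(conn, "SELECT id, amount FROM sales WHERE region = ?"),
           ListSource(crm_rows)]
rows = federated_query(sources, "EAST")
print(rows)  # [(1, 100.0), (3, 25.0)]
```

The consuming application sees one merged result set and never learns where each row physically lives, which is the abstraction the federation approach offers.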
A different facet of latency is the delay associated with the workflow stages of the data integration development process. This process begins with the business user providing requirements to IT, and then progresses through various stages: IT prioritizing the work against a backlog, design, development, testing, and finally delivery of the federated data or report. In the worst case, this process can take many months, and as in a children’s game of telephone, the resulting output often bears little resemblance to what was originally requested. Without the business user getting involved early in the process and being able to work with IT to iterate on the results, accessing and merging heterogeneous data in real time is simply a technical exercise.