In my last post, I introduced the concept of data federation, which for now I would like to differentiate from data virtualization – a term that I’ll bring into focus in a bit. But first, we explored two issues: data accessibility and data latency. In recent years, data accessibility services have matured greatly, to the point where one can largely abstract those services from the downstream consumer (or “reuser”) of data.
Let’s deal first with the performance issue. I noted that data latency has always been a concern, especially in the hardware world, so let’s borrow one of their solutions: a cache. A hardware cache is a type of memory that is much faster than main memory and can hold more items than the CPU registers, but is still relatively small. Values stored in the cache can be accessed much faster than those in main memory. One aspect of data federation applies the same principle: enabling a virtual cache for data accessed from the sources reduces the latency of pulling data from those sources.
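To make the principle concrete, here is a minimal sketch of a read-through cache in Python. The names (`DataCache`, `fetch_from_source`) and the TTL policy are my own illustrative assumptions, not part of any particular federation product:

```python
import time

class DataCache:
    """A minimal read-through cache: values fetched from a slow source
    are kept locally so repeat reads skip the source entirely."""

    def __init__(self, fetch_fn, ttl_seconds=300):
        self.fetch_fn = fetch_fn   # pulls a value from the underlying source
        self.ttl = ttl_seconds     # how long a cached value stays fresh
        self._store = {}           # key -> (value, time cached)

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None:
            value, cached_at = entry
            if time.time() - cached_at < self.ttl:
                return value               # fast path: serve from cache
        value = self.fetch_fn(key)         # slow path: go to the source
        self._store[key] = (value, time.time())
        return value

calls = []
def fetch_from_source(key):
    calls.append(key)              # stand-in for a slow remote query
    return key.upper()

cache = DataCache(fetch_from_source)
cache.get("customer_42")           # hits the source
cache.get("customer_42")           # served from cache; source untouched
print(len(calls))                  # -> 1
```

The second read never touches the source – that saved round trip is exactly the latency reduction the virtual cache buys us.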
There is one potential penalty to be paid, in that changes to the underlying stored data (either in memory or on disk) become inconsistent with the data in the cache. In many cases, though, this can be addressed through low-latency messaging and other latency-reducing (yet consistency-assuring) methods, which we will look at in a future blog series.
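The staleness problem, and the messaging-based fix, can be sketched in a few lines. The class and method names here are hypothetical, and the “message” is simulated by a direct method call rather than a real messaging system:

```python
class MessageDrivenCache:
    """A cache whose entries are invalidated by change notifications
    from the source, rather than by polling."""

    def __init__(self, fetch_fn):
        self.fetch_fn = fetch_fn
        self._store = {}

    def get(self, key):
        if key not in self._store:
            self._store[key] = self.fetch_fn(key)
        return self._store[key]

    def on_change_message(self, key):
        # A change notification from the source drops the stale entry,
        # so the next read re-fetches the current value.
        self._store.pop(key, None)

source = {"balance": 100}
cache = MessageDrivenCache(lambda k: source[k])
print(cache.get("balance"))         # -> 100
source["balance"] = 250             # underlying stored data changes...
print(cache.get("balance"))         # -> 100 (stale: cache is now inconsistent)
cache.on_change_message("balance")  # ...a low-latency message invalidates the entry
print(cache.get("balance"))         # -> 250 (re-fetched, consistent again)
```

The middle read shows the penalty in action; the notification restores consistency without the cache having to poll the source.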
That lets us look at how the data accessing services are layered in: for the purposes of an analytical system, once the sources are identified, their access routines can be abstracted to pull the data into the virtual cache, which then rapidly provides it to the reporting and analysis applications. Problem solved, right? Hold on a second there…
We have solved two main problems: getting the data and providing it quickly. That addresses the technical issues, but it ignores a much more critical problem with the content. We actually face a more fundamental challenge: inconsistency and inaccuracy of data across the different sources prevent us from effectively merging and consolidating that data.
That inconsistency and inaccuracy can exist on multiple layers:
- At a value layer, in which data values from one source are inconsistent with those from another source. For example, a state field in one data set allows for only the 50 United States and the District of Columbia, but another data set’s state field allows for US territories as well.
- At the format layer, in which the same values are represented differently across data sets. Using state again, one source might use full state names while another uses the two-character postal abbreviations.
- At the structural layer, in which the same concepts are managed using different data types, such as CHAR(2) vs. VARCHAR(25).
- At the conceptual layer, in which entities from different sources have slightly different meanings. For example, the customer database from sales may contain prospects that are not yet customers in the finance system.
- At the quality layer, in which data errors introduce outright inaccuracies.
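The first two layers above can be illustrated with a small sketch. The source records, the lookup table, and the `normalize_state` helper are all hypothetical, and the domains are reduced to toy size:

```python
# Two sources represent "state" differently: source A uses full names and
# admits only states (plus DC); source B uses two-character postal
# abbreviations and admits US territories as well.
STATE_TO_ABBREV = {"Maryland": "MD", "Virginia": "VA", "Puerto Rico": "PR"}
US_STATES_AND_DC = {"MD", "VA", "DC"}   # toy domain for the stricter source

def normalize_state(value):
    """Map either representation onto the postal abbreviation."""
    return STATE_TO_ABBREV.get(value, value)

source_a = [{"name": "Acme", "state": "Maryland"}]   # full names, states only
source_b = [{"name": "Globex", "state": "PR"}]       # abbreviations, territories too

merged = [{**rec, "state": normalize_state(rec["state"])}
          for rec in source_a + source_b]

# The format-layer conflict is resolved, but the value-layer conflict
# remains: "PR" is a valid state value in source B but not in source A.
out_of_domain = [r for r in merged if r["state"] not in US_STATES_AND_DC]
print(merged[0]["state"])        # -> "MD"
print(out_of_domain[0]["name"])  # -> "Globex"
```

Note that normalization alone fixes only the representation; deciding what to do with values outside one source’s domain is a business decision, not a technical one.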
The fact that we can federate data sets using access services does not mean that what we have accessed can simply be glommed together into the analytical environment. The inconsistency and inaccuracy of source content throws a monkey wrench into the federation solution. Additionally, a key point to note is the need for a business user to sign off on consistency and quality – as the business user knows the data best. In my next post, we’ll examine these issues in greater detail and formulate some ideas for how they can be solved with a comprehensive approach to data virtualization.