Data federation techniques help mitigate both the accessibility and the latency issues, but a data virtualization approach still has to address the quality of the content. In the ETL world, data inconsistencies and inaccuracies are handled in a separate data profiling and data quality phase. This works nicely when the data has already been loaded into a separate staging area. With federation, however, not all of the data sits in a segregated staging area. Loosely coupled integration with data quality tools may address part of the problem, but loose coupling cannot handle situations in which the inconsistencies are not immediately visible.
This becomes even more of an issue when considering how the data is consumed. In many scenarios, the role of the data virtualization layer is to provide homogenized access to heterogeneous sources from a semantically consistent perspective. The consumers of that data want to execute queries directly against the view, without having to qualify the underlying sources. The data virtualization layer must serve those views with high performance, integrating optimizations into the query engine that do not violate any underlying semantic constraints.
A comprehensive data virtualization solution has to provide a deeper level of knowledge about the structure, formats, and semantics of the data sources. This type of solution goes beyond just delivering data and becomes a provisioning layer that identifies data inconsistencies and inaccuracies from both the structural and the semantic perspectives and, more importantly, can distinguish among those issues and fold their resolution directly into the data virtualization layer.
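As a rough illustration of folding issue resolution into the virtualization layer, the following Python sketch standardizes a field against shared reference data at query time and records the rows it cannot resolve instead of passing them through silently. Every name here (`COUNTRY_REFERENCE`, `VirtualCustomerView`, the `crm` and `billing` sources, and the sample rows) is hypothetical and chosen for illustration only; a real product would expose its own mechanisms for this.

```python
from dataclasses import dataclass, field

# Hypothetical shared reference data: maps source-specific spellings
# of a country to one canonical code.
COUNTRY_REFERENCE = {
    "us": "US", "usa": "US", "united states": "US",
    "de": "DE", "germany": "DE",
}


@dataclass
class VirtualCustomerView:
    """Illustrative view over several in-memory sources (name -> list of row dicts)."""
    sources: dict
    issues: list = field(default_factory=list)

    def rows(self):
        """Yield standardized rows; log rows whose country cannot be resolved."""
        for name, source_rows in self.sources.items():
            for row in source_rows:
                raw = str(row.get("country", "")).strip().lower()
                canonical = COUNTRY_REFERENCE.get(raw)
                if canonical is None:
                    # Semantic issue: value is not in the shared reference data.
                    self.issues.append(
                        (name, row.get("id"), "unknown country: %r" % row.get("country"))
                    )
                    continue
                yield {"id": row["id"], "country": canonical, "source": name}


# Made-up sample data standing in for two heterogeneous sources.
crm = [{"id": 1, "country": "USA"}, {"id": 2, "country": "Atlantis"}]
billing = [{"id": 3, "country": " de "}]

view = VirtualCustomerView({"crm": crm, "billing": billing})
result = list(view.rows())
```

In this sketch the consumer queries only the view; the standardization rule and the list of unresolved records live inside the layer rather than in a separate staging step.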
In other words, to truly provision high-quality, consistent data with minimal latency from a heterogeneous set of sources, a data virtualization framework must provide at least these capabilities:
- Access methods for a broad set of data sources, both persistent and streaming
- Early involvement of the business user to create virtual views without help from IT
- Software caching to enable rapid access in real time
- Consistent views into the underlying sources
- Query optimizations to retain high performance
- Visibility into the enterprise metadata and data architectures
- Views into shared reference data
- Accessibility of shared business rules associated with data quality
- Integrated data profiling for data validation
- Integrated application of advanced data transformation rules that ensure consistency and accuracy
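To make two of these capabilities more concrete (software caching and a consistent view over several sources), here is a minimal Python sketch. It assumes each source can be wrapped in a zero-argument fetch function; the class `CachedFederatedView`, the source names, the sample rows, and the 60-second time-to-live are all assumptions made for the example, not features of any particular product.

```python
import time


class CachedFederatedView:
    """Illustrative federated view with a simple per-source time-to-live cache."""

    def __init__(self, fetchers, ttl_seconds=60.0, clock=time.monotonic):
        self._fetchers = fetchers      # source name -> zero-argument callable returning rows
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}               # source name -> (fetched_at, rows)

    def _rows_for(self, name):
        now = self._clock()
        cached = self._cache.get(name)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]           # still fresh: skip the round trip to the source
        rows = list(self._fetchers[name]())
        self._cache[name] = (now, rows)
        return rows

    def query(self, predicate=lambda row: True):
        """Return matching rows from all sources without the caller naming them."""
        return [
            row
            for name in self._fetchers
            for row in self._rows_for(name)
            if predicate(row)
        ]


# Hypothetical sources: one persistent, one standing in for a stream snapshot.
calls = {"orders_db": 0}


def fetch_orders():
    calls["orders_db"] += 1
    return [{"order_id": 1, "total": 40}, {"order_id": 2, "total": 90}]


def fetch_stream():
    return [{"order_id": 3, "total": 120}]


view = CachedFederatedView({"orders_db": fetch_orders, "order_stream": fetch_stream})
large = view.query(lambda r: r["total"] > 50)
again = view.query()   # served from the cache; the sources are not fetched again
```

A time-based cache is only one possible choice; it trades freshness for latency, so the time-to-live would need to reflect how stale each consumer can tolerate the data being.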
What differentiates a comprehensive data virtualization framework from a simplistic layering of access and caching services is that it goes beyond data federation. It is not only about heterogeneity and latency; it must also incorporate the methodologies that are standardized within the business processes to ensure semantic consistency for the business. If you truly want to exploit the data virtualization layer for performance and quality, the meaning of the data, and the differences in how it is used, must be engineered directly into the implementation. Most importantly, make sure the business user signs off on the data that is being virtualized for consumption.
