Those moving to Big Data, and that is a lot of enterprises right now, should also consider the need for data integration to support their new data platform. In many cases, the use of proper data integration procedures and technology is an afterthought. However, with a bit of planning and the right data integration technology, the transition to Big Data can be a smooth and productive one. Here are a few things to consider:
Data quality becomes even more important. Big Data systems, whether they live in the cloud or the data center, manage massive amounts of data, both structured and unstructured. As the volume and variety grow, so does the priority of managing data quality.
While there are many ways to deal with data quality, Big Data systems (typically Hadoop) hold massive amounts of data distributed across many storage instances. In many cases, it's more practical to address data quality within the data integration technology itself, as the data moves in and out of the clusters.
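A minimal sketch of what in-flight quality checking can look like in the integration layer, assuming a simple record shape and two illustrative rules (required ID, numeric amount); the rules and field names are assumptions, not any particular product's API:

```python
# Hypothetical sketch: apply data-quality rules as records move through
# the integration layer, quarantining rejects instead of loading them.

def is_valid(record):
    """Illustrative rules: a non-empty 'id' and a numeric 'amount'."""
    if not record.get("id"):
        return False
    try:
        float(record.get("amount", 0))
    except (TypeError, ValueError):
        return False
    return True

def integrate(records):
    """Split a batch into clean rows (to load) and rejects (to quarantine)."""
    clean, rejects = [], []
    for r in records:
        (clean if is_valid(r) else rejects).append(r)
    return clean, rejects

batch = [
    {"id": "a1", "amount": "19.99"},
    {"id": "", "amount": "5.00"},            # missing id -> rejected
    {"id": "a2", "amount": "not-a-number"},  # bad numeric -> rejected
]
clean, rejects = integrate(batch)
```

The point is architectural: the checks run once, at the point where data crosses into or out of the cluster, rather than being chased across many distributed storage instances afterward.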
Data integration performance becomes more critical. Moving a few MB to and from a database is something that many data integration solutions can handle. Now try a few GB, or several hundred GB an hour, to support integration with Big Data clusters.
The data integration technology you pick has to keep up, not only now but as the database grows organically over time, including sharp increases in the size and complexity of the data being sent to it.
Performance has to be considered during the design and selection of the technology. If you hit a wall down the road, it's typically too late to start swapping in new solutions. Treat data integration like engine design for aircraft: design in performance that may one day get you out of trouble. Once you're heading for the ground, it's too late.
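The headroom argument above can be made concrete with a back-of-the-envelope check: given a measured throughput and a load window, will the pipeline still fit after growth? The numbers and the 3x growth factor below are illustrative assumptions, not benchmarks:

```python
# Hypothetical capacity check for an integration pipeline: does today's
# throughput leave room for projected data growth?

def hours_to_move(volume_gb, throughput_gb_per_hour):
    """Time to move a given volume at a given sustained rate."""
    return volume_gb / throughput_gb_per_hour

def has_headroom(volume_gb, throughput_gb_per_hour, window_hours,
                 growth_factor=3.0):
    """True if the pipeline still meets its window after the data grows."""
    grown = volume_gb * growth_factor
    return hours_to_move(grown, throughput_gb_per_hour) <= window_hours

# Today: 200 GB nightly at 100 GB/hour fills 2 hours of a 4-hour window.
# At 3x growth, the same pipeline needs 6 hours and blows the window.
print(has_headroom(200, 100, 4))  # False: no headroom at 3x growth
```

Running this kind of check during technology selection, with your own growth projections, is the "design in performance" step: a solution that barely fits today fails exactly when the data gets interesting.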
Data virtualization becomes more valuable. The ability to keep the physical database behind a common virtual structure becomes more of an advantage when dealing with Big Data systems. Consider the complexity of the data, and the fact that we're dealing with both structured and unstructured data. A common virtual schema that provides a better representation of the data for one or many use cases is a huge advantage when you consider that the Big Data system will be leveraged for both traditional BI and operational BI.
The data integration solution is typically the best place for this data virtualization layer to exist. It provides the mapping between the physical database and a virtual schema without requiring a restaging of the data. Those moving to Big Data should look into data virtualization as they finalize their design, and consider the use cases it will serve.
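The mapping idea can be sketched in a few lines: a virtual schema is just a set of field mappings resolved on read, so the physical data never has to be restaged. All field names below are hypothetical:

```python
# Minimal sketch of a virtualization-style mapping: physical records are
# projected through a virtual schema at query time, not copied or reloaded.

VIRTUAL_SCHEMA = {
    "customer_id": "cust_no",          # virtual field -> physical column
    "full_name":   "cust_nm",
    "region":      "sales_region_cd",
}

def as_virtual(physical_record, schema=VIRTUAL_SCHEMA):
    """Project a physical record through the virtual schema on read."""
    return {v_field: physical_record.get(p_field)
            for v_field, p_field in schema.items()}

row = {"cust_no": 1001, "cust_nm": "Acme Corp", "sales_region_cd": "EMEA"}
print(as_virtual(row))
```

Because consumers see only the virtual fields, the physical layout underneath can change (or differ across clusters) without breaking the BI and operational use cases built on top.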