Darren Cunningham, in his recent blog post How to Migrate To The Cloud, made some great points about using a staging area for data integration in cloud computing. The reasons he would leverage a staging area for cloud computing include:
- It enables better business control before the data is pushed from one system to the other.
- It enables tracking and reconciliation of a business process.
- It enables the addition of new sources or targets through reuse, instead of building a spaghetti plate of point-to-point direct interfaces, which fits the SOA paradigm.
- It breaks the dependencies between the two systems, enabling asynchronous synchronization, or synchronous synchronization with different data set sizes (single message or bulk).
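The decoupling in the last point can be sketched in a few lines. This is a minimal illustration, not a real integration product: the staging list stands in for a staging table or queue, and all names are hypothetical. The key property is that neither system ever calls the other directly.

```python
# Minimal sketch of a staging-area flow: source and target each talk only to
# the staging store, so they can run asynchronously and at different batch sizes.

staging_area = []  # stands in for a staging table or message queue

def stage_from_source(records, batch_id):
    """Source side: push raw records into staging, tagged for reconciliation."""
    for seq, record in enumerate(records):
        staging_area.append({"batch_id": batch_id, "seq": seq, "payload": record})

def load_to_target(target):
    """Target side: drain staging at its own pace, independent of the source."""
    while staging_area:
        entry = staging_area.pop(0)
        target.append(entry["payload"])

source_records = [{"id": 1, "name": "Acme"}, {"id": 2, "name": "Globex"}]
target_system = []

stage_from_source(source_records, batch_id=1)
load_to_target(target_system)
# target_system now mirrors source_records, yet the two systems never
# interacted directly -- the staging area broke the dependency between them
```

Because every staged entry carries a `batch_id` and sequence number, tracking and reconciliation of a batch (the second bullet above) falls out of the same structure.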
The reality of data integration is that the patterns of use are largely dependent upon the requirements. While real-time or near-time data integration is required in certain circumstances, the use of a staging area for data integration is gaining ground as more and more corporate data sets reside in clouds.
As Darren points out above, the core reason is control of the data as it moves from system to system, typically on-premise to the cloud, and back again. The use of an intermediary staging area, where the data can be viewed, manipulated, and cleaned, ensures that data quality checks and any required data governance occur consistently.
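A cleansing step of the kind described might look like the following sketch. The field names and rules are illustrative assumptions; the point is that bad records are held back in staging for review rather than pushed on to the target.

```python
# Hypothetical data-quality step run inside the staging area before data
# moves on to the target system.

def clean_staged_records(records):
    """Normalize records and split them into cleaned vs. rejected."""
    cleaned, rejected = [], []
    for rec in records:
        email = (rec.get("email") or "").strip().lower()
        if "@" in email:
            cleaned.append({**rec, "email": email})   # normalized copy
        else:
            rejected.append(rec)                      # held for review
    return cleaned, rejected

staged = [
    {"id": 1, "email": "  Alice@Example.COM "},
    {"id": 2, "email": "not-an-email"},
]
cleaned, rejected = clean_staged_records(staged)
# cleaned contains the normalized record for id 1;
# rejected holds id 2, which never reaches the target
```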
Moreover, most on-premise to cloud computing problem domains typically deal with more than one system. The use of staging allows you to easily add and remove systems, using the staging area as a place where several data sources are combined, processed, and retransmitted to the target system. When attempting this with real-time or near-time technology, the transformation becomes complex, and thus difficult to manage and execute.
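To make the multi-source point concrete, here is a sketch that merges two hypothetical feeds in the staging area on a shared key. The feed names and fields are assumptions for illustration; what matters is that adding a third source is one more argument, not a new point-to-point interface.

```python
# Sketch: combine several source feeds in the staging area, keyed on a shared
# customer id, then emit one merged record per customer for the target.

crm_feed = [{"cust_id": 1, "name": "Acme"}, {"cust_id": 2, "name": "Globex"}]
billing_feed = [{"cust_id": 1, "balance": 250.0}, {"cust_id": 2, "balance": 0.0}]

def combine_in_staging(*feeds):
    """Merge any number of feeds into one record per cust_id."""
    merged = {}
    for feed in feeds:
        for rec in feed:
            merged.setdefault(rec["cust_id"], {}).update(rec)
    return [merged[k] for k in sorted(merged)]

for_target = combine_in_staging(crm_feed, billing_feed)
# Each resulting record carries fields from both feeds; a new source system
# is just one more feed passed to combine_in_staging.
```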
In the world of SOA we often argue for loose coupling, and a staging area makes that possible. Because the systems are largely decoupled, any changes to the source or target systems are easily managed within the staging area with simple tweaks to the transformation and translation logic.
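One way such a "simple tweak" plays out: if the target renames a field, only a mapping in the staging layer changes, while the source and target code stay untouched. The field names below are hypothetical.

```python
# Illustrative translation logic living in the staging layer. When a target
# schema changes, only FIELD_MAP is edited -- neither endpoint is touched.

FIELD_MAP = {"cust_id": "customer_id", "name": "company_name"}  # assumed names

def translate(record, field_map=FIELD_MAP):
    """Rename fields per the map; unmapped fields pass through unchanged."""
    return {field_map.get(key, key): value for key, value in record.items()}

translate({"cust_id": 7, "name": "Initech", "region": "west"})
# → {"customer_id": 7, "company_name": "Initech", "region": "west"}
```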
When leveraging a staging area for on-premise to cloud computing integration, there are a few key areas of guidance that I would provide.
First, the design of the integration path is very important. Take the time to understand the source and target schemas, and design the transformation and routing logic accordingly.
Second, when selecting technology, make sure to understand your own use cases, and any growth or changes that will occur within your problem domain. Many enterprises purchase only for what they need now, and end up with layers upon layers of different integration technologies, with none truly solving their problems.
Finally, consider the latency and bandwidth of the Internet. In many instances, attempting to transfer gigabytes of data just won't happen in minutes, as it does from on-premise system to on-premise system. You need to account for this within your data integration design, and consider the growth of the data sets over time.
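A back-of-the-envelope check during the design phase can make this concrete. The link speed and data volume below are assumptions; plug in your own numbers.

```python
# Rough transfer-time estimate for moving bulk data over a WAN link.
# Ignores protocol overhead and latency effects, so treat it as a lower bound.

def transfer_hours(gigabytes, megabits_per_sec):
    """Hours to move `gigabytes` of data over a link of `megabits_per_sec`."""
    megabits = gigabytes * 8 * 1000  # GB -> megabits (decimal units)
    return megabits / megabits_per_sec / 3600

# 50 GB over a 100 Mbps Internet link:
transfer_hours(50, 100)  # about 1.1 hours at best
# The same 50 GB over a 10 Gbps on-premise LAN takes well under a minute,
# which is why plans that work in the data center can stall over the Internet.
```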