From Hand Coding to Automated, Intelligent Data Integration – A Year in the Life of a Data Management Team

At a Monday morning scrum meeting, Sam, VP of Data Management, discusses plans to modernize the current on-premises data infrastructure into a cloud-based data warehouse and data lake. This IT-led modernization project will deliver scalability, elasticity, agility, and better performance. The project plan includes a data ingestion and processing framework to process and store data in a cloud lakehouse. Sam asks whether such a solution can be built, operationalized, and supported for the new initiative.

Tim, the lead data engineer, takes on the challenge and commits to delivering a data ingestion and processing framework, up and running in the cloud, within one week. Feeling confident, Tim and his four developers get down to hand coding the solution for the five data sources identified for this pilot.

The first data source is relatively easy to work with, as the data is well structured. Trouble starts when Tim moves to the next datasets. The data comes from an OLAP system; however, the base tables that support the OLAP system are derived from a CRM system, a mainframe system, and an MDM system. Tim's code needs to connect to, parse, integrate, and cleanse these varied data types, which are siloed in different systems. What's more, Tim is not even sure how to maintain the quality of the data. And, as the lead engineer, Tim has a new dilemma: he needs to figure out whether and how the code will scale to process the large volume of data.
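To make the problem concrete, here is a minimal, hypothetical sketch of the per-source, hand-coded approach a team like Tim's typically starts with. The source systems, field layouts, and cleansing rules below are invented for illustration; the point is that every new source needs its own bespoke extract-and-cleanse function, so effort grows with every system added.

```python
import csv
import io

def extract_crm(raw: str) -> list[dict]:
    # Hypothetical CRM export: assume comma-separated values with a header row.
    return list(csv.DictReader(io.StringIO(raw)))

def extract_mainframe(raw: str) -> list[dict]:
    # Hypothetical mainframe extract: assume fixed-width fields
    # (id in columns 0-5, name in columns 6-25).
    rows = []
    for line in raw.splitlines():
        rows.append({"id": line[0:6].strip(), "name": line[6:26].strip()})
    return rows

def cleanse(rows: list[dict]) -> list[dict]:
    # Minimal cleansing: trim whitespace and drop rows missing an id.
    cleaned = []
    for row in rows:
        row = {k: v.strip() for k, v in row.items()}
        if row.get("id"):
            cleaned.append(row)
    return cleaned

# Each source flows through its own custom parser before cleansing.
crm_rows = cleanse(extract_crm("id,name\n101, Ada \n,orphan\n"))
mf_rows = cleanse(extract_mainframe("000102Grace Hopper\n"))
```

Multiply this by every connector, transformation, and scheduling concern, and the maintenance burden Sam later identifies becomes clear.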

And now the deliverable due date doesn’t seem achievable, potentially setting the team back months or more.

Hand Coding in Hindsight

The scenario above fits any IT organization supporting a cloud-based AI or analytics initiative that charges its technical developers with designing and developing a hand-coded data integration solution.

After several months of project delays, Sam jots down a few lessons learned, based on the team’s initial attempts to hand code a data integration solution:

  • Hand coding is expensive – The price of a homegrown solution rises as complexity increases, whether in the form of more sources, more targets, more advanced data transformations, or simply more scheduling. Sam also notes that the bulk of the hidden costs lie in operating and maintaining a hand-coded integration solution.
  • Hand coding needs recoding – For each cloud vendor, Tim and his developers needed to reengineer or recode the solution based on the vendor’s technologies, preventing reuse of the existing solution.
  • Hand coding needs optimization – After development and testing, Tim’s team needs to make sure the code runs optimally in production, leading to yet more tweaking of their hand-coded solution.
  • Hand coding needs skilled developers – Sam notices that all the inner workings of this hand-coded solution reside with Tim and his four data engineers. No one else in the organization, not even the operations team, knows how to debug the solution in case of a production failure. This means the development team is always on call. Even worse, if a data engineer leaves the company, there is no one skilled enough to take over.

These are only a few of the lessons Sam learned from the first phase of this project. Beyond them, Sam thinks about reusability: the team built a solution for one specific system, but the enterprise has 70-plus systems, and each would require its own customized solution. Sam also manages the operations team – how will that team deploy and maintain 70-plus hand-coded solutions?

The Light-Bulb Moment: Cloud Data Management

Nine months after kicking off the initial modernization project, Sam realizes he needs to re-evaluate his cloud data management strategy. With the benefit of hindsight, he recognizes that this strategy needs to improve productivity, reduce manual work, and increase efficiencies through automation and scale. Sam outlines the requirements for a cloud data management solution that includes:

  • The ability for both business and IT users to understand the data landscape, through a common enterprise metadata foundation that provides end-to-end lineage and visibility across all environments
  • The ability to reuse business logic and data transformation, which improves developer productivity and allows business continuity as it promotes integrity and consistency of reuse
  • The ability to abstract the data transformation logic from the underlying data processing engine, future-proofing it against the rapidly changing cloud environment
  • The ability to connect to a variety of sources, targets, and endpoints without the need for custom code connectivity
  • The ability to process data efficiently with a highly performant, scalable, and distributed serverless data processing engine or the ability to leverage cloud data warehouse pushdown optimization
  • The ability to operate and maintain data pipelines with minimal interruptions and cost
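The reuse and abstraction requirements above can be sketched in miniature. In this hypothetical, metadata-driven example, transformation logic is declared once as configuration and a single generic runner applies it, so onboarding a new source means adding metadata rather than writing new code. The pipeline names, config shape, and field mappings are all illustrative assumptions, not a real product's API.

```python
# Declarative metadata: each pipeline is described, not coded.
PIPELINES = {
    "crm_customers": {
        "rename": {"cust_id": "customer_id"},   # map source field to canonical name
        "required": ["customer_id"],            # rows missing these are dropped
    },
    "mdm_customers": {
        "rename": {"party_key": "customer_id"},
        "required": ["customer_id"],
    },
}

def run_pipeline(name: str, rows: list[dict]) -> list[dict]:
    # One engine-agnostic runner for every pipeline. Because the logic is
    # declarative, it could in principle be pushed down to a cloud data
    # warehouse or executed on a serverless engine without a rewrite.
    spec = PIPELINES[name]
    out = []
    for row in rows:
        row = {spec["rename"].get(k, k): v for k, v in row.items()}
        if all(row.get(col) for col in spec["required"]):
            out.append(row)
    return out

result = run_pipeline(
    "crm_customers",
    [{"cust_id": "101", "name": "Ada"}, {"cust_id": "", "name": "x"}],
)
```

Contrast this with the per-source functions of a hand-coded approach: here, supporting a 71st system is a new entry in `PIPELINES`, not a new codebase.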

Sam has come to realize that his team’s hand-coded data integration solution leads to higher costs, especially in enterprise environments. He wants to avoid the mistakes that prevent organizations from implementing successful cloud analytics initiatives. Like many IT leaders, Sam is looking to de-risk cloud modernization projects and lower costs by avoiding hand coding and limited point solutions.

A year after that first scrum meeting, Sam has invested in an intelligent, automated data integration solution. The solution bridges the gap between on-premises and cloud deployments and accelerates time to value. Sam’s team now has native access and visibility into varied data types, enabling them to standardize across that variety and to validate, verify, and process the data without rebuilding everything from scratch. The result? The company can now reap the benefits of a successful cloud lakehouse implementation.

Next Steps