The Need for Full Life-cycle Data Management
Over the last 35-40 years we have been struggling with many aspects of enterprise data management. It seems to me that in large part this is because neither the software industry nor its customers have realized the importance of the full data life-cycle. What do I mean by the full data life-cycle? In this series of three posts I will start to explore its many phases. In many ways this is nothing new: just think for a few minutes about what can happen to a piece of data from the time it’s created or ingested from an external source. The list of possible happenings is pretty long.
Let’s start at the very beginning of the full data life-cycle and pick a few key areas:
Being able to access any type of data from any current or future source
The requirement is straightforward: the ability to access any type of data from any application, system, database or other source of data. Data is often spread across numerous systems, including legacy mainframes and other “legacy” systems such as IMS, VSAM, DB2, SQL Server, PeopleSoft and so on. It’s important to ensure that any current and future data sources can be integrated easily, including data feeds from Facebook, Twitter and other social platforms that could well be required as more external data is drawn in for better prevention and detection of fraud, waste and abuse. There is already a sometimes bewildering array of data sources, with only more to come (IoT, sensor data and who knows what):
o Traditional relational databases such as DB2, SQL Server and Oracle
o Legacy mainframe sources such as IMS/DB, VSAM, Adabas, DB2, IDMS
o Packaged applications such as SAP, Siebel, Oracle Financials
o Appliances such as Greenplum, Teradata
o Social media such as Facebook, Twitter and LinkedIn
o Hadoop distributions such as Cloudera, Hortonworks and MapR
o Cloud applications such as Salesforce.com, Workday, NetSuite
o Self-serve analytical tools such as QlikView, Tableau, Redshift, SFDC Wave
o NoSQL databases such as MongoDB and Apache Cassandra (including the DataStax distribution)
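One way to picture "any current or future source" is a single small interface that every source implements, so that downstream steps never care where a row came from. What follows is only a minimal sketch under that assumption: the names DataSource, CsvSource, SqliteSource and read_all are made up for illustration, and an in-memory SQLite table and a CSV string stand in for real systems.

```python
import csv
import io
import sqlite3
from typing import Dict, Iterable, Iterator, List, Protocol


class DataSource(Protocol):
    """Hypothetical common interface: every source yields rows as dicts."""

    def rows(self) -> Iterator[Dict[str, object]]: ...


class CsvSource:
    """Stands in for a flat-file or data-feed source."""

    def __init__(self, text: str) -> None:
        self._text = text

    def rows(self) -> Iterator[Dict[str, object]]:
        yield from csv.DictReader(io.StringIO(self._text))


class SqliteSource:
    """Stands in for a relational database source."""

    def __init__(self, conn: sqlite3.Connection, query: str) -> None:
        self._conn = conn
        self._query = query

    def rows(self) -> Iterator[Dict[str, object]]:
        cursor = self._conn.execute(self._query)
        columns = [col[0] for col in cursor.description]
        for record in cursor:
            yield dict(zip(columns, record))


def read_all(sources: Iterable[DataSource]) -> List[Dict[str, object]]:
    """Pull rows from every source through the same interface."""
    return [row for source in sources for row in source.rows()]


# Illustrative data only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada')")

combined = read_all([
    SqliteSource(conn, "SELECT id, name FROM customers"),
    CsvSource("id,name\n2,Grace\n"),
])
```

Adding a future source then means writing one more class with a rows() method. Note that the CSV row carries its id as text while the database row carries an integer, exactly the kind of inconsistency the discovery step has to surface.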
Discovering the data you have to deal with
At first glance this might seem to be at worst unnecessary or at best a trivial piece of the puzzle. Beware of complacency. Getting a full understanding of the data you are dealing with is vital if the follow-on activities are to stand the best chance of success. And why is this, you might ask? The answers range from the simple (you can’t deal effectively with data you don’t understand) to the complex: if you want to start treating data as a valuable organizational asset, the first step on that journey is to gain a thorough understanding of the data. That understanding needs to have two components. The first is the structural and content-based quality of the data: are field values conflated, null, not standardized in format and so on? The second is equally if not more important: how well does the data conform to the “business rules” that so often are buried in the minds of experienced business analysts and subject matter experts? Those experts need to be deeply involved and actively participate in building the profiling rules that identify and quantify adherence to those buried business rules. With this in place you can start to produce data quality scorecards that measure changes in quality over time (more on this later).
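To make the two kinds of profiling rule a little more concrete, here is a minimal sketch. The field names (zip, status, close_date), the rules themselves and the sample records are all hypothetical; real rules would come from your analysts and subject matter experts, and a real profiling tool does far more than this.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Record = Dict[str, Optional[str]]


@dataclass
class ProfilingRule:
    name: str
    check: Callable[[Record], bool]  # returns True when the record passes


def score(records: List[Record], rules: List[ProfilingRule]) -> Dict[str, float]:
    """Return the fraction of records that pass each rule (a tiny scorecard)."""
    if not records:
        return {rule.name: 0.0 for rule in rules}
    return {
        rule.name: sum(1 for r in records if rule.check(r)) / len(records)
        for rule in rules
    }


ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")  # assumed US ZIP or ZIP+4 format

rules = [
    # Structural and content checks: null values, non-standard formats.
    ProfilingRule("zip_not_null", lambda r: bool(r.get("zip"))),
    ProfilingRule(
        "zip_standard_format",
        lambda r: bool(r.get("zip")) and ZIP_RE.match(r["zip"]) is not None,
    ),
    # Hypothetical business rule: a closed account must carry a close date.
    ProfilingRule(
        "closed_accounts_have_close_date",
        lambda r: r.get("status") != "CLOSED" or bool(r.get("close_date")),
    ),
]

# Illustrative records only.
records = [
    {"zip": "02139", "status": "OPEN", "close_date": None},
    {"zip": "2139", "status": "CLOSED", "close_date": "2015-03-01"},
    {"zip": None, "status": "CLOSED", "close_date": None},
    {"zip": "02139-4307", "status": "OPEN", "close_date": None},
]

scorecard = score(records, rules)
```

Running the same rules against each new extract and storing the resulting pass rates with a date is one simple way to see whether quality is improving or slipping over time.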
Migrating data from the old to the new system
Often this is seen as a trivial piece of the implementation, when in fact our experience shows that it can be among the most problematic and time-consuming phases. The requirement is straightforward: the ability to access any type of data from any system or other source of data, and then to transform it to meet the data structures, formats and quality requirements of the new system. This often entails accessing mainframe legacy systems such as IDMS, IMS, VSAM, Adabas and so on, and then delivering the newly structured data to a relational DBMS or even to a cloud platform. The migration requirements become more complex when the new system implementation is phased over time, as the data will need to be synchronized bidirectionally between the old and the new systems.
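As a rough illustration of the "access, then transform" step, the sketch below parses a fixed-width legacy record and reshapes it for a new schema. The record layout, field names and target column names are all invented for the example; a real migration would take them from the legacy copybook or file definition and the new system's data model.

```python
from datetime import datetime
from typing import Dict, List, Tuple

# Hypothetical fixed-width layout: (field name, start offset, length).
LEGACY_LAYOUT: List[Tuple[str, int, int]] = [
    ("cust_id", 0, 6),
    ("name", 6, 20),
    ("opened", 26, 8),    # date stored as CCYYMMDD
    ("balance", 34, 9),   # amount in cents, zero padded
]


def parse_legacy(line: str) -> Dict[str, str]:
    """Slice one fixed-width line into named, whitespace-trimmed fields."""
    return {
        name: line[start:start + length].strip()
        for name, start, length in LEGACY_LAYOUT
    }


def transform(legacy: Dict[str, str]) -> Dict[str, object]:
    """Map legacy fields to the (assumed) structure of the new system."""
    return {
        "customer_id": int(legacy["cust_id"]),
        "full_name": legacy["name"].title(),
        # strptime raises ValueError on a malformed date, so bad values
        # fail loudly instead of slipping into the new system.
        "opened_on": datetime.strptime(legacy["opened"], "%Y%m%d").date().isoformat(),
        "balance_cents": int(legacy["balance"]),
    }


# Illustrative record only.
line = "000042" + "JANE DOE".ljust(20) + "19980315" + "000012550"
row = transform(parse_legacy(line))
```

In a phased implementation you would also need the reverse mapping, from the new structure back to the legacy layout, to keep the two systems synchronized in both directions.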
Ensuring that the new system has the highest quality data
It’s easy to assume that the data coming from the old system is up to snuff: “it worked for the old system so it will be fine for the new system”. Experience shows that this is most often not the case. The data coming from the old system is most often of poor quality: missing data, misfielded data, bad addresses, close duplicates within and across sources and so on. It is essential that this data is cleansed before being delivered to the new system. Cleansing that data effectively requires a comprehensive approach that includes: profiling to understand current quality issues; parsing to extract meaningful data from concatenated fields; address cleansing and validation; augmentation with geocoding, householding and maybe data from external sources such as D&B; and identifying, matching and merging close duplicates before delivery to the new system.
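Close-duplicate matching is the step people most often underestimate, so here is a deliberately simple sketch of the idea: normalize each value, then compare every pair with a string similarity score. It uses difflib from the Python standard library; the 0.85 threshold and the sample records are assumptions for illustration, and production matching engines use far more sophisticated, field-aware techniques.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations
from typing import List, Tuple


def normalize(value: str) -> str:
    """Lower-case, replace punctuation with spaces and collapse whitespace."""
    value = re.sub(r"[^a-z0-9 ]", " ", value.lower())
    return " ".join(value.split())


def similarity(a: str, b: str) -> float:
    """Similarity between 0.0 and 1.0 after normalization."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def close_duplicates(records: List[str], threshold: float = 0.85) -> List[Tuple[int, int]]:
    """Return index pairs whose similarity meets the (assumed) threshold."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(records), 2)
        if similarity(a, b) >= threshold
    ]


# Illustrative records only.
records = [
    "John Smith, 12 Main St.",
    "JOHN SMITH 12 Main St",
    "Jon Smith, 12 Main St",
    "Mary Jones, 4 Elm Road",
]

pairs = close_duplicates(records)
```

Comparing every pair does not scale to millions of records, and a single threshold on one concatenated string is crude. It is enough, though, to show why candidate pairs need review or merge rules before anything is delivered to the new system.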
In the next post we will look at several more key aspects of the full data life-cycle, including modeling, resolving and mastering. Meanwhile, keep in mind that this is a tremendously difficult problem for most organizations to fix. That is not necessarily because of the technical aspects, which are tough, but because of organizational inertia, the feeling that “this is my data, leave it be” and other ownership-related issues.