Big data and related technologies such as Hadoop present significant opportunities and challenges to businesses. Nearly everybody in IT reports that they are actively evaluating big data technologies. And, just as you would expect, they are in a variety of stages of implementation. So, who has time to think about data governance when dealing with a massive change like this?
First, you have to get your arms around the new technology, right? Actually, this is exactly the right time to think about data governance for big data: before the wild, untamed data from outside the company starts getting mixed with your potentially more trustworthy, tamed internal data.
Consider this example: A leading pharmaceutical company told me that, in the past, all of their research data had been internally generated and, as a result, was trusted. Their internal research was the proprietary edge they held over their competitors. Times have changed. There is now a massive amount of research available through public sources on the Internet. Any company trying to compete in pharmaceuticals would be crazy not to take advantage of this wealth of outside research, if for no other reason than to avoid repeating other people’s mistakes. But “free” third-party data (like free puppies) brings issues with it:
- How trusted is the source of this information?
- How current is the information?
- Is there other research related to this information? Internal or external?
- Was the information correctly moved into the company? For example: Was it loaded correctly? Were the data transformations done correctly?
- Can the people who need to use this data find it?
Everybody working with data from sources outside their company’s firewalls (examples include social-interaction data, mobile data and data from B2B partners) will soon be dealing with the same issues, plus a few new ones. These include:
- What is the structure of the external data? A technology like Hadoop, for example, imposes no structure on write; it is schema-on-read. If you are going to combine Hadoop data with other internal data, as many organizations do, you will need to apply metadata with technologies such as Hive or HCatalog, or store the data in a NoSQL datastore such as HBase.
- Will the data structure enable you to create data lineage: visual diagrams of the flow of data through your organization so that you can understand and manage it?
- Is the data correct? Can it be cleansed of errors or ambiguities?
- Where is the best place to handle data cleansing: at the source (such as the Hadoop Distributed File System) or at the application level, where the application data experts are matching it with their internal transactional data? There are good arguments for both approaches.
- How do we clearly label the owner and source of this data?
- How do we put a ranking on the data that represents how trusted the source is?
- Who is moving and transforming the data? Is this somebody in IT or somebody in another organization? Do we want to treat the data differently if it is not from an IT source?
- How is the data matched, related and linked to enrich customer, product and other master data?
- Does the data contain sensitive information, such as personally identifiable information (PII), that needs to be masked (e.g., Social Security and credit card numbers) for regulatory compliance?
- How do you search for and find data that may already exist, and may already have been curated, so you avoid creating endless copies of data and further adding to the cost of managing big data?
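To make the schema-on-read point above concrete, here is a minimal Python sketch of the idea: the raw file carries no structure, and a schema is applied only at read time. The column names and file contents are hypothetical; Hive and HCatalog perform the same role at scale over files in HDFS rather than in-process like this.

```python
import csv
import io

# Schema-on-read: the raw data is just untyped text; structure is
# imposed only when the data is read. The schema is a list of
# (column name, type converter) pairs. All names here are illustrative.
SCHEMA = [("trial_id", str), ("compound", str), ("efficacy", float)]

def read_with_schema(raw_text, schema):
    """Apply the schema to untyped raw rows at read time."""
    rows = []
    for record in csv.reader(io.StringIO(raw_text)):
        rows.append({name: cast(value)
                     for (name, cast), value in zip(schema, record)})
    return rows

# Raw external data, stored exactly as it arrived (no schema on write).
raw = "T-001,aspirin,0.72\nT-002,ibuprofen,0.64\n"
rows = read_with_schema(raw, SCHEMA)
```

Because the schema lives outside the data, two teams can read the same raw files with different schemas, which is exactly why governing that metadata matters.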
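On the PII question above, masking can be illustrated with a short Python sketch. The regular expressions below are hypothetical, simplified patterns for U.S. Social Security and card numbers; real compliance work would rely on a vetted data-masking tool, not ad hoc regexes.

```python
import re

# Simplified, illustrative patterns only: SSNs in NNN-NN-NNNN form
# and 16-digit card numbers with optional space/dash separators.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-(\d{4})\b")
CARD_RE = re.compile(r"\b(?:\d{4}[ -]?){3}(\d{4})\b")

def mask_pii(text):
    """Mask SSNs and card numbers, keeping only the last four digits."""
    text = SSN_RE.sub(r"***-**-\1", text)
    text = CARD_RE.sub(r"****-****-****-\1", text)
    return text
```

Keeping the last four digits is a common compromise: records stay matchable for support and reconciliation, while the identifying value itself never leaves the governed zone.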
The time to think about data governance for big data is up-front when the big data projects are being architected. Trying to catch up and govern the projects later will be much harder to accomplish. Here are a few recommendations:
- Think about your data governance processes up front, before the data starts flowing.
- Do not attempt to govern everything. Think about where the high-value and high-risk parts of these initiatives are and focus there. Ask: what has the highest business value to the organization and what would be the impact of bad data in this business initiative?
- Assign data stewards for these new sources of data and hold them accountable for the quality of their data.
- Make sure that the data stewards assign business terms and clear, unambiguous definitions, classifications and taxonomies to better manage and standardize use of the new sources of data.
- Start to define a way to rate the quality of the data sources in a way that everybody can agree upon.
- Take a serious look at where to cleanse and where to match new data as it’s captured and starts flowing through the environment.
- Don’t think of big data as something separate. It is all part of your overall data governance process and should be treated that way.
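A source-quality rating that everybody can agree on (per the recommendations above) can start as a simple weighted scorecard. The criteria and weights below are purely hypothetical; in practice the data stewards would agree on them, and they would live in governed configuration rather than code.

```python
# Hypothetical scorecard for rating trust in a data source.
# Criteria and weights are illustrative, not prescriptive.
WEIGHTS = {"provenance": 0.4, "freshness": 0.3, "load_checks": 0.3}

def trust_score(ratings):
    """Weighted average of per-criterion ratings (each on a 0-5 scale)."""
    return sum(WEIGHTS[c] * ratings[c] for c in WEIGHTS)

# Example ratings a steward might assign to two sources.
internal_lab = {"provenance": 5, "freshness": 4, "load_checks": 5}
public_web = {"provenance": 2, "freshness": 5, "load_checks": 2}
```

Even a crude score like this gives downstream users a shared, comparable signal about how much weight to put on externally sourced data.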
Think about your data governance plans in your architectural stages, before the initiative is implemented. Being proactive will be much more effective than trying to retrofit data governance, like closing the barn door after the horse has left. And, just like data governance in general, the key to success will be to manage only what is really important to the company and to be able to show a clear ROI for your efforts. Not to be sensational here, but the takeaway I want to share is this: If you are not designing data governance into your big data initiatives, you are going to be competing with companies that are. For more information: Check out the Ralph Kimball white paper, Newly Emerging Best Practices in Big Data, and also see Rob Karel’s blogs on data governance.