To level set, let’s make sure you understand my definition of dark data. I prefer using visualizations when I can so, picture this: the end of the first Indiana Jones movie, Raiders of the Lost Ark. In this scene, we see the Ark of the Covenant, stored in a generic container, being moved down the aisle in a massive warehouse full of other generic containers. What’s in all those containers? It’s pretty much anyone’s guess. There may be a record somewhere, but, for all intents and purposes, the materials stored in those boxes are useless.
Applying this to data, once a piece of data gets shoved into some generic container and is stored away, just like the Arc, the data becomes essentially worthless. This is dark data.
Opening up a government agency to all its dark data can have significant impacts, both positive and negative. Here are couple initial tips to get you thinking in the right direction:
- Begin with the end in mind – identify quantitative business benefits of exposing certain dark data.
- Determine what’s truly available – perform a discovery project – seek out data hidden in the corners of your agency – databases, documents, operational systems, live streams, logs, etc.
- Create an extraction plan – determine how you will get access to the data, how often does the data update, how will handle varied formats?
- Ingest the data – transform the data if needed, integrate if needed, capture as much metadata as possible (never assume you won’t need a metadata field, that’s just about the time you will be proven wrong).
- Govern the data – establish standards for quality, access controls, security protections, semantic consistency, etc. – don’t skimp here, the impact of bad data can never really be quantified.
- Store it – it’s interesting how often agencies think this is the first step
- Get the data ready to be useful to people, tools and applications – think about how to minimalize the need for users to manipulate data – reformatting, parsing, filtering, etc. – to better enable self-service.
- Make it available – at this point, the data should be easily accessible, easily discoverable, easily used by people, tools and applications.
Clearly, there’s more to shining the light on dark data than I can offer in this post. If you’d like to take the next step to learning what is possible, I suggest you download the eBook, The Dark Data Imperative.
What I love about the cloud is it has something of value to offer practically any government organization, regardless of size, maturity, point of view, approach. Even for the most conservative IT shops, there are use cases that just plain make sense. And with the growing availability of FEDRAMP certified offerings, it’s becoming easier to procure. But, thinking realistically, for reasons of law, budget, time, architecture, we know the cloud will not be the solution for every public sector problem. Some applications, some data will never leave your agency’s premises. And here in lies the new complexity. You have applications and data on-prem. You have applications and data in the cloud. And you have business requirements that require these apps to work together, to share data.
So, now that you have a hybrid environment, what can you do about? Let’s face it, we can talk about technology, architecture and approaches all day long, but, it always comes down to this, what should be done with the data. You need answers to questions such as; Is it safe? Is it accessible? It is reliable? How do I know if the integrity has been compromised? What about the quality? How error-prone is the data? How complete is the data? How do we manage it across this new hybrid landscape? How can I get data from a public cloud application to my on-prem data warehouse? How can I leverage the flexibility of public IaaS to build a new application that will need access to data that is also required for an on-prem legacy application?
I know many government IT professional are wrestling with these questions and seeking solutions. So, here’s an interesting thought. Most of these questions are not exactly new, they are just taking on the added context of the cloud. Prior to the cloud, many agencies discovered answers in the form of a data integration platform. The platform is used to ensure every application, every user has access to the data they need to perform their mission or job. I think of it this way. The platform is a “standardized” abstraction layer that ensures all your data gets to where it needs to be, when it needs to be there, in the form it needs to be in. There are hundreds of government IT shops using such an approach.
Here’s the good news. This approach to integrating data can be extended to include the cloud. Imagine placing “agents” in all the places where your data needs to live, the agents capable of communicating with each other to integrate, alter or move data. Now add to this the idea of a cloud-based remote control that allows you to control all the functions of the agents. Using such a platform now enables your agency to tie on-prem systems to cloud systems, minimizing the effect of having multiple silos of information. Now government workers and warfighters will have the ability to more quickly get complete, accurate data, regardless of where it originates and citizens will benefit from more effectively delivered services.
How would such an approach change your ideas on how to leverage the cloud for your agency? If you live near the Washington, DC area, you may wish to drop in on the Government Cloud Computing and Data Center Conference & Expo. One of my colleagues, Ronen Schwartz will be discussing this topic. For those not in the vicinity, you can learn more here.
I just finished reading a great article from one of my former colleagues, Bill Franks. He makes a strong argument that Big Data is not inherently good or evil anymore than money is. What makes Big Data (or any data as I see it) take on a characteristic of good or evil is how it is used. Same as money, right? Here’s the rest of Bill’s article.
Bill framed his thoughts within the context of a discussion with a group of government legislators who I would characterize based on his commentary as a bit skittish of government collecting Big Data. Given many recent headlines, I sincerely do not blame them for being concerned. In fact, I applaud them for being cautious.
At the same time, while Big Data seems to be the “type” of data everyone wants to speak about, the scope of the potential problem extends to ALL data. Just because a particular dataset is highly structured into a 20 year old schema that does not exclude it from misuse. I believe structured data has been around for so long people are comfortable with (or have forgotten about) the associated risks.
Any data can be used for good or ill. Clearly, it does not make sense to take the position that “we” should not collect, store and leverage data based on the notion someone could do something bad.
I suggest the real conversation should revolve around access to data. Bill touches on this as well. Far too often, data, whether Big Data or “traditional”, is openly accessible to some people who truly have no need based on job function.
Consider this example – a contracted application developer in a government IT shop is working on the latest version of an existing application for agency case managers. To test the application and get it successfully through a rigorous quality assurance process the IT developer needs a representative dataset. And where does this data come from? It is usually copied from live systems, with personally identifiable information still intact. Not good.
Another example – Creating a 360 degree view of the citizens in a jurisdiction to be shared cross-agency can certainly be an advantageous situation for citizens and government alike. For instance, citizens can be better served, getting more of what they need, while agencies can better protect from fraud, waste and abuse. Practically any agency serving the public could leverage the data to better serve and protect. However, this is a recognized sticky situation. How much data does a case worker from the Department of Human Services need versus that of a law enforcement officer or an emergency services worker need? The way this has been addressed for years is to create silos of data, carrying with it, its own host of challenges. However, as technology evolves, so too should process and approach.
Stepping back and looking at the problem from a different perspective, both examples above, different as they are, can be addressed by incorporating a layer of data security directly into the architecture of the enterprise. Rather than rely on a hodgepodge of data security mechanisms built into point applications and silo’d systems, create a layer through which all data, Big or otherwise, is accessed.
Through such a layer, data can be persistently and/or dynamically masked based on the needs and role of the user. In the first example of the developer, this person would not want access to a live system to do their work. However, the ability to replicate the working environment of the live system is crucial. So, in this case, live data could be masked or altered in a permanent fashion as it is moved from production to development. Personally identifiable information could be scrambled or replaced with XXXXs. Now developers can do their work and the enterprise can rest assured that no harm can come from anyone seeing this data.
Further, through this data security layer, data can be dynamically masked based on a user’s role, leaving the original data unaltered for those who do require it. There are plenty of examples of how this looks in practice, think credit card numbers being displayed as xxxx-xxxx-xxxx-3153. However, this is usually implemented at the application layer and considered to be a “best practice” rather than governed from a consistent layer in the enterprise.
The time to re-think the enterprise approach to data security is here. Properly implemented and deployed, many of the arguments against collecting, integrating and analyzing data from anywhere are addressed. No doubt, having an active discussion on the merits and risks of data is prudent and useful. Yet, perhaps it should not be a conversation to save or not save data, it should be a conversation about access