I have harped on one specific aspect of data sharing and uncontrolled data repurposing: when the data consumer is not part of the requirements analysis and design processes, that consumer’s perception of data semantics must be based on his or her own reinterpretation. For example, if I access a data set from data.gov that lists failed banks, I am fed a CSV file with the following column headers:
| Bank Name | City | State | CERT # | Acquiring Institution | Closing Date | Updated Date |
With no metadata, it is up to me, the consumer, to figure out exactly what is meant by each of these columns. I might be particularly confused by “Updated Date” – is that the date the record was updated, or what? And without searching for (and finding) this web page, I would not know that a “CERT #” is defined as “(Cert, FDIC Certificate Number) A unique number assigned by the FDIC to identify Institutions and for the issuance of insurance certificates.” Also, the details of “Acquiring Institution” are vague as well. I think you get the point: as a data consumer accessing data, I may be forced to derive my own understandings of the data elements, which may conflict with the original intent.
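To make the reinterpretation problem concrete, here is a minimal Python sketch of what a consumer-side “data dictionary” might look like. The sample row is entirely hypothetical (not a real record), and every description other than the quoted CERT # definition is my own guess, which is exactly the point:

```python
import csv
import io

# Hypothetical sample of the failed-bank CSV; the row values below are
# placeholders for illustration, not real FDIC records.
raw = """Bank Name,City,State,CERT #,Acquiring Institution,Closing Date,Updated Date
Example Bank,Anytown,XX,12345,Example Acquirer,1-Jan-20,2-Jan-20
"""

# A consumer-side data dictionary. Only the CERT # entry comes from the
# FDIC glossary definition quoted above; the others are my own
# reinterpretations -- which is the problem being described.
data_dictionary = {
    "CERT #": ("FDIC Certificate Number: a unique number assigned by the "
               "FDIC to identify Institutions and for the issuance of "
               "insurance certificates."),
    "Updated Date": "My guess: the date the record was last updated?",
    "Acquiring Institution": "My guess: the bank that assumed the deposits?",
}

# Walk the file and pair each value with whatever meaning I managed to
# assemble; columns with no entry get flagged as undocumented.
reader = csv.DictReader(io.StringIO(raw))
for row in reader:
    for column, value in row.items():
        meaning = data_dictionary.get(column, "No metadata available")
        print(f"{column} = {value!r}: {meaning}")
```

Note that the dictionary lives entirely on the consumer side; nothing in the CSV itself ties a column to its intended definition.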
This becomes an issue for a few reasons:
1) There is a misalignment between the intent of data provision (or transparency) and the actual use: the data providers cannot anticipate every possible use of the data, and consequently cannot fully integrate the consumers’ data quality requirements into the design of the system for collecting, processing, and delivering the data.
2) The data providers are also likely to be subject matter experts, and they may rely on that expertise to presume a shared understanding of data semantics. Consumers may or may not have the same expertise, which can lead to inconsistencies in analysis and results between the provider and the consumers.
3) There are no standards binding the consumers, which effectively allows any and all reinterpretations to be considered “valid.”
4) Once the data is released, it can be copied and manipulated without any further input and/or control from the providers.
In essence, releasing data into the public domain changes the character of the embedded information: some aspects remain defined by the producer, while all the others are defined by a potentially wide variety of consumers. Controlling the use of the data once it is out of the barn implies a different kind of data governance, one established by mutual agreement between the producer and each individual consumer, and then with the consumer community as a whole. We’ll look at that in the next post…