Data Curation with an Intelligent Data Catalog

It’s not often that I run into the word “metadata” when I indulge in my favorite weekend habit of reading the Sunday New York Times (NYT). A few months ago, the paper published a special section on how NYT is digitizing and bringing back to life photographs it published more than fifty years ago. In this case, the metadata for each photograph took the form of handwritten notes on the back, recording when the picture was published along with other related information. NYT partnered with Google Cloud to digitize both the photographs and the metadata. The digitized metadata provided invaluable context to bring these pictures back to life for today’s readers. (You can read about the digitization project here, and see ongoing storytelling features from the archives at the NYT “Past Tense” site.)

A picture is worth 1,000 words, and more if you’ve got the metadata

A simple picture of a Christmas tree, paired with a picture of the back of the photo, grabbed my attention. The picture by itself doesn’t say much. It looks like a typical New York City street with a Christmas tree. But the stamps and notes on the back of the picture tell a much richer story. They indicate when and where the picture was taken, along with the caption that accompanied it. This was a picture of a nation in mourning, taken two days after President Harry Truman passed away. With the digitized metadata from the back of the picture, the photograph can be linked to the related NYT article and to other online content, providing valuable historical and social context for today’s readers.

Now, just imagine doing this for the thousands of pictures in NYT’s archives and making the entire archive easily searchable. NYT is doing this through its partnership with Google Cloud. The pictures and their associated metadata are digitized first. Image and text recognition then automatically unearths important context (e.g., when and where the picture was taken). Human curation can enhance this further by linking to additional historical and social context. Suddenly a simple picture is curated and transformed into a “mini museum” with relevant historical context through a combination of machine and human intelligence.

Transforming data into information and intelligence

Now, if a picture that can supposedly speak a thousand words can be enriched so effectively with a few words of contextual information, just imagine the need for this context in today’s complex data landscape. Thousands of new datasets are created every day, and different types of users across the business demand to understand and access this data quickly. After all, a dataset is just a collection of numbers and text. At best, you may have meaningful descriptions and column headers to guide you; at worst, just a cryptic set of column names. So how do you broaden visibility and understanding of all this data beyond the small closed circle of data owners and subject matter experts?

The first step is to catalog and index all data assets across the enterprise, so they can be searched using a simple interface and simple business terms – just like NYT made its photo archives searchable by first digitizing the pictures and the associated metadata. But the search results are only as good as the metadata associated with the datasets. Some technical metadata often comes from the source data system (e.g., the schema definition for a database table). But to enable business users to find relevant data, it’s critical that the technical datasets are enriched with business context. This is easier said than done when you have hundreds of thousands of datasets – both structured and unstructured – across the enterprise. AI/ML-driven automation plays a critical role in enabling this at scale (for a deeper look at the role of AI in data cataloging, read this EMA report: AI/ML-powered Data Management Enables Digital Transformation). For instance, domains and entities (such as date, location, customer, and product) can be automatically identified, and datasets can be tagged with this information. Business terms and definitions that are typically defined in a business glossary can be automatically associated with technical data assets. Now the data can be searched using simple business terms.
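
To make the tagging and glossary lookup described above more concrete, here is a minimal Python sketch. It is not how any particular catalog product works: it stands in simple regular-expression rules for the AI/ML-driven detection, and every name in it (DOMAIN_PATTERNS, GLOSSARY, Dataset, infer_domains, tag_dataset, search, and the table names) is a hypothetical chosen for illustration.

```python
import re
from dataclasses import dataclass, field

# Hypothetical domain detectors: a domain tag and the pattern that a
# column's sample values are expected to match in full.
DOMAIN_PATTERNS = {
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
}

# Hypothetical business glossary: business term -> domain tags it covers.
GLOSSARY = {
    "customer contact": {"email"},
    "order date": {"date"},
    "location": {"us_zip_code"},
}


@dataclass
class Dataset:
    name: str
    columns: dict  # column name -> list of sample values (strings)
    tags: set = field(default_factory=set)


def infer_domains(values, threshold=0.8):
    """Return the domains whose pattern matches at least `threshold`
    of the non-empty sample values."""
    values = [v for v in values if v]
    if not values:
        return set()
    found = set()
    for domain, pattern in DOMAIN_PATTERNS.items():
        hits = sum(1 for v in values if pattern.fullmatch(v))
        if hits / len(values) >= threshold:
            found.add(domain)
    return found


def tag_dataset(dataset):
    """Tag a dataset with every domain detected in any of its columns."""
    for samples in dataset.columns.values():
        dataset.tags |= infer_domains(samples)
    return dataset


def search(catalog, business_term):
    """Return names of datasets whose tags overlap the glossary entry
    for `business_term` (case-insensitive); unknown terms match nothing."""
    wanted = GLOSSARY.get(business_term.lower(), set())
    return [d.name for d in catalog if wanted & d.tags]


# Two datasets with cryptic technical names, as in the "worst case" above.
catalog = [
    tag_dataset(Dataset("tbl_0193", {
        "c1": ["2019-01-04", "2019-02-11"],
        "c2": ["a@example.com", "b@example.org"],
    })),
    tag_dataset(Dataset("tbl_0207", {"x": ["10001", "94105-1234"]})),
]

print(search(catalog, "order date"))
print(search(catalog, "Location"))
```

In this sketch, a business user searching for “order date” finds tbl_0193 without ever needing to know that the relevant column is called c1. The threshold parameter is an assumption that lets a column with a few dirty values still be tagged.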

AI + human expertise

As powerful as these AI-driven capabilities can be, it’s equally important to complement them with the shared data knowledge and human expertise that’s distributed across the enterprise. Business stakeholders can enrich the data with contextual business descriptions, custom tags, and annotations. Data consumers can rate and review datasets, further enriching the data with valuable usage-based context. And in a virtuous cycle, all of this human input can make the platform more intelligent, enabling more automation and delivering more relevant search results with richer contextual information, ultimately leading to a deeper understanding of the data for business users. A dataset is no longer just a collection of numbers and text. It’s transformed into a treasure trove of valuable information through data curation and enrichment. To learn more about how Informatica’s intelligent data catalog enables this at enterprise scale, visit the Informatica Enterprise Data Catalog website.
