Playing I Spy with your Data?
Try this: upload a recent photo of yourself to Google image search. I did that last week, and the result that came back was that I am a “person”. Which I guess is a compliment? I mean I wasn’t labeled a “lamp” or a “frog”.
Image recognition is a hot topic right now, with applications in fields ranging from social networks to security management. There are many machine learning approaches to image recognition. One commonly used approach is similarity-based clustering using the Jaccard coefficient, which quantifies intersection over union, and the Bray-Curtis coefficient, which quantifies compositional dissimilarity. These methods measure how similar and how different groups of objects are, and use those measurements to cluster the instances into groups.
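To make the two coefficients concrete, here is a minimal sketch in Python. The function names and the sample lists are made up for illustration, and treating two empty inputs as identical is a convention chosen here, not something the coefficients dictate.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets count as identical
    return len(a & b) / len(a | b)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity over value counts.

    0.0 means the same composition, 1.0 means nothing in common.
    """
    counts_a, counts_b = Counter(a), Counter(b)
    total = sum(counts_a.values()) + sum(counts_b.values())
    if total == 0:
        return 0.0  # assumption: two empty inputs are not dissimilar
    # For every value present on both sides, keep the smaller of the two counts.
    shared = sum(min(counts_a[v], counts_b[v]) for v in counts_a.keys() & counts_b.keys())
    return 1 - 2 * shared / total


left = ["red", "red", "green", "blue"]
right = ["red", "green", "green", "yellow"]

print(jaccard(left, right))      # 2 shared values out of 4 distinct -> 0.5
print(bray_curtis(left, right))  # shared counts: red 1, green 1 -> 1 - 2*2/8 = 0.5
```

Jaccard only looks at which distinct values appear, while Bray-Curtis also takes into account how often each value appears, which is why the two are useful together.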
The first step of this methodology is identifying the key values of the object portrayed in the image. For a face, these are the eyes, hair, nose, mouth, ears, the distances between features, and so on. Next, each feature is matched to the group of the same features in similar known images in the data store. Finally, the engine scores how similar to, and how different from, the known faces the new face is. The score determines whether the object belongs to this group of images and, if it does, the engine returns a label match.
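The score-then-label step can be sketched in a few lines of Python. Everything here is hypothetical: the three-number feature vectors, the names in `known`, and the `threshold` value are invented for illustration, and a real face recognition engine extracts far richer features than this.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative feature vectors."""
    denom = sum(u) + sum(v)
    if denom == 0:
        return 0.0
    return sum(abs(x - y) for x, y in zip(u, v)) / denom


# Hypothetical feature vectors: (eye distance, nose length, mouth width), arbitrary units.
known = {
    "alice": [(6.2, 4.8, 5.1), (6.0, 5.0, 5.0)],
    "bob": [(7.1, 5.9, 4.2)],
}


def label(features, known, threshold=0.05):
    """Return the label of the closest known group, or None if nothing is close enough."""
    best_label, best_score = None, threshold
    for name, examples in known.items():
        # Score against the most similar tagged example in this group.
        score = min(bray_curtis(features, example) for example in examples)
        if score <= best_score:
            best_label, best_score = name, score
    return best_label


print(label((6.1, 4.9, 5.0), known))  # close to the "alice" examples
print(label((1.0, 9.0, 2.0), known))  # too far from every group, so no label
```

A lower threshold means fewer wrong labels but more images left untagged, and the more tagged examples a group has, the better the chance that a new image lands close to one of them.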
How does that relate to Data Management? Well, one of the key requirements for achieving digital transformation is knowing what information you have and finding it when you need it. For this purpose, we have developed the CLAIRE™ engine, a data similarity machine learning module. As with image recognition, we want to recognize data, group it, and tag it correctly so users can find it. First, data is clustered based on column features. Then, data overlap is computed for the unique values in each of these clusters. Finally, the most promising pairs are chosen for computing data similarity using the Bray-Curtis and Jaccard coefficients.
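The three steps just described can be illustrated with a small Python sketch. This is not how CLAIRE is implemented. The column signature in `column_features`, the `min_overlap` cutoff, and the sample column names are all assumptions made up for this example, and it assumes every column has at least one value.

```python
from collections import Counter, defaultdict
from itertools import combinations


def column_features(values):
    """Coarse, hypothetical column signature: (all values numeric?, rounded mean length)."""
    is_numeric = all(str(v).replace(".", "", 1).isdigit() for v in values)
    mean_len = round(sum(len(str(v)) for v in values) / len(values))
    return (is_numeric, mean_len)


def jaccard(a, b):
    """Share of distinct values the two columns have in common."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


def bray_curtis(a, b):
    """Compositional dissimilarity based on how often each value occurs."""
    ca, cb = Counter(a), Counter(b)
    shared = sum(min(ca[v], cb[v]) for v in ca.keys() & cb.keys())
    return 1 - 2 * shared / (sum(ca.values()) + sum(cb.values()))


def similar_columns(columns, min_overlap=0.5):
    # Step 1: cluster columns that share the same feature signature.
    clusters = defaultdict(list)
    for name, values in columns.items():
        clusters[column_features(values)].append(name)

    results = []
    for names in clusters.values():
        for a, b in combinations(sorted(names), 2):
            # Step 2: cheap overlap check on the unique values only.
            ua, ub = set(columns[a]), set(columns[b])
            overlap = len(ua & ub) / min(len(ua), len(ub))
            if overlap < min_overlap:
                continue  # not a promising pair, skip the full comparison
            # Step 3: compute both coefficients for the promising pairs.
            results.append((a, b, jaccard(ua, ub), bray_curtis(columns[a], columns[b])))
    return results


columns = {
    "orders.country": ["US", "DE", "FR", "US"],
    "customers.country_code": ["US", "DE", "US", "IT"],
    "orders.amount": ["10.50", "20.00", "7.25", "3.10"],
}

for pair in similar_columns(columns):
    print(pair)  # only the two country columns are compared and reported
```

Clustering first keeps the expensive comparisons inside small groups of look-alike columns, so a numeric amount column is never compared against a country code column at all.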
Take, for example, one of our customers, a global energy corporation that was looking to streamline its analytics for business users. Its data environment was growing rapidly, with approximately 10,000 databases, 70,000 reports and 4,000 ETL/ELT workflows. The existing, manually documented metadata system was hard to use, maintain and refresh, so the company's data scientists spent 80% of their time finding data rather than analyzing it. This customer implemented Informatica Enterprise Data Catalog powered by CLAIRE to automatically extract metadata from all of its data resources and to cluster and tag the data, so both data analysts and data scientists can use semantic search to find relevant data sets for analysis. For this customer, data discovery time was reduced from 4-6 weeks to minutes.
Back to me. I wasn’t fair to the Google engine. I uploaded an image of myself taken at an angle, which made it difficult for Google to extract the key features of my face. The engine could detect enough “values” from my face to identify human features, but it didn’t have enough data to determine whether I am female or male. The other problem the engine had was that it did not have other tagged images of me. If I had uploaded a picture of, say, Kim Kardashian, there would have been enough similar images in its search engine to tag it not just as a person or a woman but as The Kardashian.
To stop playing “I spy with my little eye” with your enterprise, get more information about Informatica Enterprise Data Catalog. Read about other examples where metadata-driven artificial intelligence can take the drudgery out of discovering and working with your data.