Before We March Into AI, Let’s Make Sure Our Data is Good


There’s been quite a bit of buzz and investment in cognitive computing and all its associated pieces – machine learning, natural language processing and deep learning. It’s good to see that the AI concept has finally gained traction, as it has been the subject of excitement on and off for three decades now. Each AI wave has ended in some disappointment as enterprises found it difficult to apply the technology to everyday business problems and opportunities.

Maybe this time, things will be different, and we will see AI progress to new levels. With machine learning and deep learning as the wind in its sails, AI may offer ways for systems and applications to serve businesses and customers with a minimum of the human blood, sweat and tears it takes to make it all flow.

There is one important and necessary element to this success: good data. The data that will renew and refresh AI-driven algorithms or applications must be timely, trustworthy and beyond reproach. Otherwise, we will see some businesses spectacularly automate themselves off a cliff.

I can’t emphasize the importance of trust in data enough when it comes to depending on it for the insights that will power a business.

This point was underscored by Ophir Tanz and Cambron Carter, both of GumGum, an artificial intelligence company, in a recent TechCrunch post.

As they note, “despite our world being quite literally deluged by data — currently about 2.5 quintillion bytes a day, for those keeping tabs — a good chunk of it is not labeled or structured, meaning that for most current forms of supervised learning, it’s unusable.”

Deep learning – along with the rest of AI – “depends on a steady supply of the good, structured and labeled stuff,” they state. The problem, they explain, is that “data is fed to machines through an elaborate sausage press that dissects, analyzes and even refines itself on the fly. This process is considered supervised learning in that the giant piles of data fed to the machines have been painstakingly labeled in advance.” Unstructured data, they argue, needs a more automated way to be labeled and indexed. Right now, the process is cumbersome or non-existent for images, graphics and documents.
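To make the “painstakingly labeled in advance” point concrete, here is a minimal sketch of supervised learning – a nearest-centroid classifier written in plain Python. The dataset, labels and numbers are illustrative assumptions, not from the article; the point is that every training example must arrive with a human-supplied label before the machine can learn anything.

```python
# Minimal sketch of supervised learning: a nearest-centroid classifier.
# All data below is a toy illustration; note every example carries a label.

def train(labeled_data):
    """Compute one centroid (mean feature vector) per label."""
    sums, counts = {}, {}
    for features, label in labeled_data:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def predict(centroids, features):
    """Assign the label whose centroid is closest (squared distance)."""
    def dist(label):
        return sum((a - b) ** 2 for a, b in zip(centroids[label], features))
    return min(centroids, key=dist)

# Supervised learning cannot start without these human-supplied labels.
labeled = [([1.0, 1.0], "spam"), ([1.2, 0.9], "spam"),
           ([5.0, 5.1], "ham"), ([4.8, 5.3], "ham")]
model = train(labeled)
print(predict(model, [1.1, 1.0]))  # closest to the "spam" centroid
```

Remove the labels from `labeled` and `train` has nothing to learn from – which is exactly the bottleneck the authors describe.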

As much of the data now being built into analytics and AI-based solutions is unstructured – and that data is necessary to get a more complete view of the enterprise and customer – there needs to be a better way to “train” machines to label and make sense of it. Tanz and Carter urge the adoption of “unsupervised” training for machines to build their own awareness of the distinctions seen across unstructured data – much in the way newborn babies and children develop their recognition and cognition skills. “Short of hiring people to label data — which is a thing, by the way, and it’s pricey — or all of the companies of the world suddenly agreeing to open up all their proprietary data and distribute it happily to scientists across the globe, then the answer to the shortage of good training data is not having to rely on it at all. Rather than working toward the goal of getting as much training data as possible, the future of deep learning may be to work toward unsupervised learning techniques.”
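For contrast with the labeled-data example above, a classic unsupervised technique is k-means clustering: the algorithm is handed raw points with no labels at all and groups them by similarity on its own. This is a minimal plain-Python sketch with illustrative data; the naive initialization (first k points) is an assumption for brevity, not a production choice.

```python
# Minimal sketch of unsupervised learning: k-means clustering.
# No labels are supplied; the algorithm discovers the groups itself.

def kmeans(points, k, iterations=10):
    """Cluster points into k groups; returns centroids and assignments."""
    def nearest(p, cents):
        return min(range(k),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(p, cents[i])))

    centroids = [list(p) for p in points[:k]]  # naive init: first k points
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Update step: move each centroid to the mean of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centroids[i] = [sum(xs) / len(cluster) for xs in zip(*cluster)]
    return centroids, [nearest(p, centroids) for p in points]

# Unlabeled points: two obvious groups, but nothing tells the algorithm so.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1),
          (5.0, 5.1), (4.8, 5.3), (5.2, 4.9)]
centroids, assignments = kmeans(points, k=2)
print(assignments)  # first three points share one cluster, last three the other
```

The trade-off the authors hint at is visible here: no labeling cost, but the clusters come back unnamed – a human (or another system) must still decide what each group means.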

There’s quite a bit of ground to be covered in the years ahead with this approach. As we stand on the verge of robust AI and data-driven capabilities, this is an essential stage to reach. Otherwise, AI and its components will be limited to structured or relational data, which provides only a one-dimensional view of customers and enterprise operations.
