Getting Started with AI and ML: #1: Getting the Data!

Artificial Intelligence

Every day after waking up I hear about the promises and benefits of machine learning and artificial intelligence: self-driving cars, folks asking KITT to recommend songs for the day, the list goes on and on. So I think, you know what would be fantastic? If I could somehow predict the outcomes of sporting events using machine learning (ML) and artificial intelligence (AI). How cool would it be if every year I could be the person who wins the March Madness sports pool at Informatica? Like many of you I wanted to learn more about this topic, so I did a little research, and it appears ML/AI has not yet progressed to where it can do this for us. Maybe that seems obvious to you, but from my perspective, how is this use case different from a machine predicting a song or a beverage you might enjoy? In all cases we have lots of data documenting previous outcomes, so why can't a machine make a recommendation on the winning team with some degree of accuracy?

The fact this isn’t yet possible implies to me there is an edge to what ML/AI can do, perhaps like the edge of a cliff: if you cross it, you’ll start plunging toward earth with very limited options. Basically, it is the point where the complexity of the task you are asking the machine to do becomes so overwhelming that building a realistic model becomes impossible. My next thought is: how can I understand this conceptual edge and be able to explain why and where it exists? One option is to read a bunch of articles, but a better option is to create a task or project that starts me on this journey. I’m sure there is a third option where I find someone really smart who could explain it to me in five minutes, but what is the fun in that path? With that option I wouldn’t have anything to write about. As part of my journey I’m kicking off a series of posts sharing the details of how I am getting started, with the hope that others can benefit, and also to provide an avenue for folks to give feedback.

Getting access to the data you need to fuel AI

Since we know predicting sports outcomes isn’t a good place to start, I need to find something else that presents enough complexity to make sure I’m really learning and absorbing these concepts. Building a recommendation engine for craft beers is what I landed on: there is plenty of public data, and I have an app on my phone which sort of already does this. To ensure the results are relevant, I need to use data that is current and that contains a subset related to people I know. This will enable me to test the results as I’m building.

Step 1 of this process is to start collecting data. Untappd is the application on my phone I use to rate craft beers, and to get the data from their website I coded a Java function using HtmlUnit, a headless Java browser package, to consume the source HTML and write it into a text file. If you are curious, here is a snippet of the Java code that writes the HTML text out to a string variable.

// Load the page, then serialize its DOM back out as a string
HtmlPage usrPage = (HtmlPage) webClient.getPage(p_url);

String outputHTML = usrPage.asXml();
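The snippet above stops at the string variable; getting that string into a text file is straightforward with the JDK's Files API. Here is a minimal, self-contained sketch of that last step (the file name and sample HTML are illustrative, and the HtmlUnit call is stubbed out with a literal so the sketch runs on its own):

```java
import java.nio.file.Files;
import java.nio.file.Path;

public class SaveHtml {
    // Write the captured page source to a text file for later parsing.
    static Path save(String outputHTML, String fileName) throws Exception {
        Path out = Path.of(fileName);
        Files.writeString(out, outputHTML);
        return out;
    }

    public static void main(String[] args) throws Exception {
        // In the real pipeline, outputHTML comes from usrPage.asXml();
        // a literal stands in for it here.
        String outputHTML = "<html><body>sample page</body></html>";
        Path out = save(outputHTML, "untappd_page.html"); // illustrative name
        System.out.println(Files.readString(out).contains("sample page")); // true
    }
}
```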

Step 2 is to parse the HTML text and capture the values I’m most interested in. I could extend my Java code to accomplish this parsing, but as I’m constructing the logic I find it helpful to see visually how well it is performing. Informatica provides a transformation with this capability, and after dropping the configured transformation into a pipeline I am able to start processing the HTML text to pull out the values I want. As an example, below is the logic I constructed to capture the name of the beer for each review. You can see the opening/closing text markers as well as the regular expression used to clean up characters that are not needed. If I had coded this in Java, those opening/closing markers plus the cleanup would be the same.
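I can't reproduce the transformation itself here, but the same opening/closing-marker idea can be sketched in plain Java. The markers and the cleanup pattern below are made up for illustration; the real Untappd markup differs:

```java
import java.util.regex.Pattern;

public class BeerNameParser {
    // Hypothetical opening/closing markers; the real Untappd markup differs.
    private static final String OPEN  = "<p class=\"beer-name\">";
    private static final String CLOSE = "</p>";
    // Strip anything that is not a letter, digit, space, or common punctuation.
    private static final Pattern CLEANUP = Pattern.compile("[^A-Za-z0-9 .&'-]");

    public static String extractBeerName(String html) {
        int start = html.indexOf(OPEN);
        if (start < 0) return null;              // opening marker not found
        start += OPEN.length();
        int end = html.indexOf(CLOSE, start);
        if (end < 0) return null;                // closing marker not found
        String raw = html.substring(start, end);
        return CLEANUP.matcher(raw).replaceAll("").trim();
    }

    public static void main(String[] args) {
        String html = "<p class=\"beer-name\">  Pliny the Elder ##</p>";
        System.out.println(extractBeerName(html)); // Pliny the Elder
    }
}
```

The same extract-between-markers-then-clean pattern applies to every field on the page; only the marker strings change per field.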

You may ask why I would want to take this approach. Why not call a published API? It really boils down to the fact that I wanted to create a long-term mechanism I can use to collect data from any external source, regardless of whether an API exists or not. As an example, ESPN no longer allows access to their API but does allow browser access to most of their site.

As I’m writing this, I have most of what I’ve described here built, and I’ve used it to collect over 30,000 reviews, with the plan to add more each week. That is more than enough for me to get started on the next step: what do I do with this data once I have it?

A very smart person I work with suggested I look at k-means clustering, a machine learning technique that can help me start making recommendations from the data I have. I’ll need to learn more about it, and it seems like a good starting point for the next post in this series.

If you have an interesting approach you are using to get up to speed on ML/AI topics, or have feedback on the approach I came up with, please drop a comment. Feedback is always appreciated.