Information = Data + R

Over and over, when talking with people who are starting to learn data science, I hear the same frustration: “I don’t know which programming language to start with.”

Moreover, it’s not just programming languages; it’s also software systems like Tableau, SPSS, etc. There is an ever-widening range of tools and programming languages and it’s difficult to know which one to select.

I get it. When I started focusing heavily on data science a few years ago, I reviewed all of the popular languages and tools at the time: Python, R, SAS, and D3, not to mention a few that, in hindsight, really aren’t that great for analytics, like Perl, Bash, and Java. I once read a suggestion to use arcane Unix tools like awk and sed.

There are so many suggestions, so much material, and so many options that it becomes difficult to know what to learn first. There’s a mountain of content, and it’s hard to know where to find the “gold nuggets”: the things that will bring you the highest return on the time you invest.

That’s the crux of the problem. The fact is that time is limited. Learning a new programming language is a large investment of your time, so you need to be strategic about which one you select. To be clear, some languages will yield a very high return on your investment. Other languages are purely auxiliary tools that you might use only a few times per year.

Let me make this easy for you: learn R first. Here’s why:

R is becoming the “lingua franca” of data science

R is becoming the lingua franca for data science. That’s not to say that it’s the only language, or that it’s the best tool for every job. It is, however, among the most widely used languages in the field, and its popularity is rising, as the surveys below show.

As I’ve noted before, O’Reilly Media conducted a survey in 2014 to understand the tools that data scientists are currently using. They found that R is the most popular programming language (if you don’t count SQL as a “proper” programming language).

Looking more broadly, there are other rankings that look at programming language popularity in general. For example, Redmonk measures programming language popularity by examining discussion (on Stack Overflow) and usage (on GitHub). In their latest rankings, R placed 13th, the highest of any statistical programming language. Redmonk also noted that R has been rising in popularity over time.

A similar ranking by TIOBE, which ranks programming languages based on search engine results, indicates a strong year-over-year rise for R.

Keep in mind that the Redmonk and TIOBE rankings cover all programming languages, not just statistical ones. Even in that broader field, R now ranks among the most popular and most commonly used overall.

Data wrangling

It’s often said that 80% of the work in data science is data manipulation. More often than not, you’ll need to spend a significant amount of your time “wrangling” your data: putting it into the shape you want. R has some of the best data management tools you’ll find.

The dplyr package in R makes data manipulation easy. It is the tool I wish I had years ago. When you “chain” the basic dplyr verbs together, you can dramatically simplify your data manipulation workflow.
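As a minimal sketch of what chaining looks like, assuming the dplyr package is installed and using R’s built-in mtcars dataset as stand-in data (the columns chosen and the 100-horsepower cutoff are purely illustrative):

```r
library(dplyr)

# mtcars ships with R: one row per car model.
# Each dplyr function (often called a "verb") takes a data frame and
# returns a data frame, so %>% can pass one step's result into the next.
mpg_by_cyl <- mtcars %>%
  filter(hp > 100) %>%                 # keep cars with more than 100 horsepower
  group_by(cyl) %>%                    # one group per cylinder count
  summarise(mean_mpg = mean(mpg),      # average fuel economy per group
            n_cars   = n()) %>%        # number of cars per group
  arrange(desc(mean_mpg))              # best fuel economy first

print(mpg_by_cyl)
```

Read top to bottom, the chain says: take the data, then filter, then group, then summarise, then sort. Without chaining, the same steps would need nested function calls or several intermediate variables.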

Data visualization

ggplot2 is one of the best data visualization tools around, as of 2015. What’s great about ggplot2 is that as you learn the syntax, you also learn how to think about data visualization.

I’ve said numerous times that there is a deep structure to all statistical visualizations. There is a highly structured framework for thinking about and creating all data visualizations. ggplot2 is based on that framework. By learning ggplot2, you will learn how to think about visualizing data.
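To make that framework a little more concrete, here is a small sketch, assuming ggplot2 is installed and borrowing R’s built-in mtcars data purely for illustration. A plot is described as data, a mapping from variables to visual properties, and one or more geometric layers:

```r
library(ggplot2)

# Data plus aesthetic mappings: which variable goes on which visual channel.
base_plot <- ggplot(mtcars, aes(x = wt, y = mpg))

# The same data and mappings, drawn with different geometric layers.
scatter      <- base_plot + geom_point()
scatter_line <- base_plot +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)  # add a fitted straight line

print(scatter_line)
```

Because the data and mappings are stated once, changing the kind of chart is mostly a matter of swapping or adding layers.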

Moreover, when you combine ggplot2 with dplyr (using the chaining methodology), finding insight in your data becomes almost effortless.
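One way the combination might look, again as a sketch on the built-in mtcars data, with the grouping variable and axis labels chosen only for illustration:

```r
library(dplyr)
library(ggplot2)

# Summarise with dplyr, then hand the result straight to ggplot2.
# Note the switch from %>% (between dplyr steps) to + (between plot layers).
mpg_plot <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg)) %>%
  ggplot(aes(x = factor(cyl), y = mean_mpg)) +
  geom_col() +
  labs(x = "Cylinders", y = "Mean miles per gallon")

print(mpg_plot)
```

The summary table never needs its own variable: the question (“how does fuel economy vary by cylinder count?”) goes from raw data to chart in one expression.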

Machine learning

Finally, there’s machine learning. While I think most beginning data science students should wait to learn machine learning (it is much more important to learn data exploration first), machine learning is an important skill. When data exploration stops yielding insight, you need stronger tools.

When you’re ready to start using (and learning) machine learning, R has some of the best tools and resources.

One of the best, most referenced introductory texts on machine learning, An Introduction to Statistical Learning, teaches machine learning using the R programming language. Additionally, the Stanford Statistical Learning course uses this textbook, and teaches machine learning in R.

Summary: Learn R, and focus your efforts

Once you start to learn R, don’t get “shiny new object” syndrome.

You’re likely to see demonstrations of new techniques and tools. Just look at some of the dazzling data visualizations that people are creating.

Seeing other people create great work (and finding out that they’re using a different tool) might lead you to try something else. Trust me on this: you need to focus. Don’t get “shiny new object” syndrome. You need to be able to devote a few months (or longer) to really diving into one tool.

And as I noted above, you really want to build up your competence in skills across the data science workflow. You need to have solid skills at least in data visualization and data manipulation. You need to be able to do some serious data exploration in R before you start moving on.

Spending 100 hours on R will yield vastly better returns than spending 10 hours each on 10 different tools. In the end, your time ROI will be higher if you concentrate your efforts. Don’t get distracted by the “latest, sexy new thing.”

Twitter @bigdatabeat
