The R vs Python debate has been around a long time. Choosing between these immensely popular languages has been the source of countless infographics, Twitter-wars, and blog posts.
For me, R has always been the data cleaning, manipulation, plotting, and stats-focussed language and Python has been the place that I made models suitable for deployment into production and built complex deep learning systems. I’ve always used both.
Recently though, I’ve been a little frustrated by the Python ecosystem. A project I’m currently working on needed to use some of the more advanced techniques from survival analysis and, not being able to find any implementations in Python, I used R for an end-to-end prediction project for the first time in a long while. I realised just how much I missed it.
For anyone who’s considering which language to learn, here are my (somewhat unorganised and perhaps vitriolic) thoughts on why R is better than Python.
dplyr vs pandas
This section should really be called the Pipe Operator. This guy (%>%) is one of the best inventions in data programming. As someone with extensive SQL development experience, I find that performing a manipulation one step at a time feels like a cleaner way of using CTEs or temporary tables.
In pandas you have two choices: store step-by-step manipulations in multiple DataFrame objects or suffer through the pandas documentation trying to write a one-operation manipulation. R allows you to take a few fundamental operations (mutate, group_by, select) and perform incredibly complex transformations that are all coherently linked.
I know that there are some options for bringing the pipe operator to Python. In my experience these are lacking and require you to think in the reverse order of what you’d like to achieve, but YMMV.
Plus, if you want to interact with your R data in a SQL-y way, you only need to install sqldf!
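For readers coming from Python, here is a rough sketch of what "one step at a time" means. It is not dplyr and not a real pipe operator: pipe, keep_positive and total_by_team are hypothetical names made up for this illustration, and the data is a toy list of dicts rather than a DataFrame.

```python
from functools import reduce


def pipe(value, *steps):
    """Pass value through each step in order, left to right."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical toy data: one dict per row.
rows = [
    {"team": "a", "score": 10},
    {"team": "a", "score": 14},
    {"team": "b", "score": 7},
    {"team": "b", "score": -3},
]


def keep_positive(rs):
    """Filter step: drop rows with a non-positive score."""
    return [r for r in rs if r["score"] > 0]


def total_by_team(rs):
    """Group-and-summarise step: sum the scores per team."""
    totals = {}
    for r in rs:
        totals[r["team"]] = totals.get(r["team"], 0) + r["score"]
    return totals


# Reads in the order the work happens: filter first, then summarise.
result = pipe(rows, keep_positive, total_by_team)
print(result)  # {'a': 24, 'b': 7}
```

The point is only the reading order: each step takes the output of the one before it, which is the property that makes %>% chains easy to follow.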
I hate Matplotlib.
In much the same way that the pipe operator allows you to take a data.frame and manipulate it in a logically linked way, ggplot2 does the same for visualisations.
The grammar of graphics theory that ggplot2 is built around is very sound and intuitive. That has not been my experience with pyplot. Pyplot seems confused: you have to initialise a bunch of different objects (a figure, axes, and some others that I haven’t so much forgotten as never learnt). Resizing the charts is a pain. And what the hell is with tight_layout? Should that not just be the default?
You add objects to plots layer by layer in both packages, but only in R does it seem like that’s what you’re actually doing.
RStudio vs Jupyter notebooks
Obviously, you can use Jupyter notebooks with R. But you cannot use RStudio with Python. The Jupyter notebook is a great invention but I seriously wonder if it’s the best way to do data science.
It was invented to be centrally hosted, to be accessed by lots of different collaborators. But that’s not the way it’s being used by 90% of data scientists - most people use it like an individual sandbox.
Have you ever tried to install a package from pip in a Jupyter notebook so that you don’t have to stop and restart the server? In RStudio, you can do that in the terminal without missing a beat.
How about zooming into a chart - in a notebook you have to change code and rerun the cell, in RStudio you just click on a button that says zoom.
If you want to take a look at your data in RStudio, you can just open the data.frame in a new tab, browse, and then go back to what you were doing. In a notebook, though, you have to write code (data.head() or similar), find what you were looking for and then either delete the cell or live with a notebook that has a randomly placed sample of your data in it.
What if you want to look at your data and sort it? I won’t go into it - suffice it to say, it’s easier in RStudio, because RStudio works with a decades-old invention - the mouse.
R has a unique syntax. That’s the polite way of saying it. But this unique syntax is a gift in disguise for people who come from any kind of software development background. It changes the way you think.
You’re not in Kansas anymore, objects do not rule supreme and middle-management types are still scratching their heads as to whether or not Agile principles apply - welcome to data science (previously known as statistics, population: more than the natives would like.)
Python is one of the best general purpose programming languages ever designed. But since when does most widely applicable == best?
Bonus features of the R syntax: whitespace doesn’t matter, leaving the parentheses off of a function call shows you the source code for that function, you can just type the raw string of feature names when you fit a model - no need to subset your data using square brackets and many, many more.
The history of data science is the history of statistics. I think that the wide adoption of Python over R will cause a wider but shallower adoption of machine learning techniques. Statistics is the backbone of every outstanding (and robust) achievement in data science.
Because fit summaries, metrics, and test statistics are not quite as baked in to the most popular Python frameworks as they are in R, users can treat these things as afterthoughts - they should not be afterthoughts. A statistically sound discovery will prove more applicable, accurate (obviously) and profitable than one that is not.
It’s tough to task-switch to R if you’re already working in Python and just need to fit a quick linear regression, but calling summary(fit) will reward you for your bravery over and over again.
There are two things I want to say about packages in R vs Python.
While it might not be true (for long) that CRAN has more data-related packages than PyPI does, it is fair to say that the ones on CRAN cover a much broader field of ideas. As I mentioned, I’ve been working on a survival analysis project. I had to stop using Python (and the otherwise fantastic lifelines package) when I had to switch to using the Andersen-Gill alteration to the Cox Proportional Hazards model.
I Googled for so long. I debated coding it up myself to share with everyone else. And then I caved, gave CRAN a quick search and found the answer nearly immediately.
If you need to do anything other than just tweak the hyperparameters of the most common models, you should at least take a look at what packages are available in R.
The second thing I’d like to mention is package consistency. This is kind of tied in to the previous section about statistics. Every modelling package I’ve used in R has exposed the same (or very similar) APIs for accessing metrics, summaries and fit statistics. They could all be fit using the standard R formula (Target~Feature_1+Feature_2). The inconsistency between Scikit-Learn and Tensorflow and Lifelines and PyTorch can really rub a person the wrong way.
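If you have never seen an R formula, the sketch below shows, in plain Python, the information that a string like Target~Feature_1+Feature_2 carries. It is a deliberately minimal illustration and not how R implements formulas: parse_formula and design are hypothetical helpers, only "+" terms are handled, and real formulas also support interactions, transformations and more.

```python
def parse_formula(formula):
    """Split 'y ~ x1 + x2' into (target, [features]); handles '+' terms only."""
    target, _, rhs = formula.partition("~")
    features = [term.strip() for term in rhs.split("+") if term.strip()]
    return target.strip(), features


def design(rows, formula):
    """Pull the feature matrix X and target vector y out of a list of row dicts."""
    target, features = parse_formula(formula)
    X = [[row[name] for name in features] for row in rows]
    y = [row[target] for row in rows]
    return X, y


# Hypothetical toy data, including a column the formula ignores.
rows = [
    {"Target": 1.0, "Feature_1": 2.0, "Feature_2": 3.0, "Unused": 9.0},
    {"Target": 4.0, "Feature_1": 5.0, "Feature_2": 6.0, "Unused": 9.0},
]

X, y = design(rows, "Target~Feature_1+Feature_2")
print(X)  # [[2.0, 3.0], [5.0, 6.0]]
print(y)  # [1.0, 4.0]
```

One short string names the target and the features, so there is no separate subsetting step before fitting, and the same string can be handed to different modelling functions.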
ISLR and ESL
The Introduction to Statistical Learning and the Elements of Statistical Learning are, for my money, the best two books on data science. They both use R.
With the right abstraction skills, you can port this knowledge from R to Python, but it’s so much easier to use the books as a reference (after you’ve read them through) and just pull code, packages, and tips right from the pages.
I think the reason that books of this caliber don’t (yet) exist for Python is the problem with consistency that I mentioned above. The authors of ISLR and ESL can go broad and deep with their explanations without the reading feeling fragmented and piecemeal.
There are a lot of good machine learning books that use Python, but these mostly focus on one particular framework or library. They also deal with syntax a lot, and deployment, and charting, and data manipulation. Most of the Python books are aimed at the same level: the introductory student who wants to get their hands dirty.
These two books go for a different market - those up for the challenge of becoming competent data scientists who are more concerned with results than syntax.
(P.S. If anyone knows of any books similar to ISLR and ESL that use Python, please do let me know!)
Some problems with R
R is not all plain sailing, and I’d never pretend otherwise. It can be slow at times. And the syntax really is unique, which may present more of a challenge than many would like to undertake.
One of the biggest issues (in my humble opinion) with R, though, is the emerging domination of the community by the RStudio team.
Now, before you come for me with pitchforks, let me explain.
The work Hadley Wickham has done on the tidyverse is incredible, and RStudio is the best IDE for data science. But one of the best things about R is the contributions of the community.
As more and more people look into data science as a career and hobby, and as more online courses teach these tools, fewer users will become individual contributors. This may be inevitable, and I will continue to happily use all the tools the RStudio team provide (Shiny, flexdashboard, rmarkdown, etc.), but it would be a great shame to lose the work of all those people because they’re afraid that their contributions won’t equal those of the giants in the language.
Of course, the inverse could also be true - the increase in productivity (and articles like this) could bring more people to R, and allow them to contribute helpful tools for the whole community to use. So please, give R a try!
The final analysis
Overall, I’m switching to R (for now) because of all the gains I get in workflow when I use it in place of Python. The GUI of RStudio is an amazing tool. Packages that implement exotic statistical models are abundant. And ggplot charts still look the best, for my money.
And while R is not the best language for creating an API, or a web app, or doing any kind of OOP, it is the best language for keeping this data scientist productive.