If you’re looking to become a professional data scientist, you’re going to need to learn at least one programming language. But how do you decide between Python and R, the two most popular languages for data analysis? If you’re interested in learning about their respective strengths and weaknesses, read on!
As a data scientist, you will probably also need to learn Structured Query Language, or SQL. SQL is the de facto language of relational databases, where most corporate information still resides. But SQL only gives you the ability to retrieve the data, not to clean it up or run models against it, and that’s where Python and R come in.
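To make that division of labor concrete, here is a minimal sketch using only Python’s built-in sqlite3 module. The orders table, its columns, and the clean helper are all made up for illustration; a real project would point at your own database and would probably lean on a library such as pandas for the cleanup.

```python
import sqlite3

# Hypothetical in-memory table standing in for a corporate database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("  Alice ", "10.50"), ("bob", "n/a"), ("Carol", "7.25")],
)

# SQL handles the retrieval step.
rows = conn.execute(
    "SELECT customer, amount FROM orders ORDER BY rowid"
).fetchall()
conn.close()


def clean(rows):
    """Tidy customer names and drop rows whose amount is not a number."""
    cleaned = []
    for customer, amount in rows:
        try:
            value = float(amount)
        except ValueError:
            continue  # skip unusable amounts such as "n/a"
        cleaned.append((customer.strip().title(), value))
    return cleaned


# Python handles the cleanup and the analysis.
cleaned_orders = clean(rows)
total = sum(amount for _, amount in cleaned_orders)
print(cleaned_orders, total)
```

The query only fetches what is stored, messy values included; deciding what counts as a usable row happens afterwards in Python.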
A little background on R
R was created by Ross Ihaka and Robert Gentleman, two statisticians at the University of Auckland in New Zealand. It was initially released in 1995, and the first stable version (1.0.0) followed in 2000. It’s an interpreted language (you don’t need to run your code through a compiler before executing it) and has an extremely powerful suite of tools for statistical modeling and graphing.
For programming nerds, R is an implementation of S, a statistical programming language developed in the 1970s at Bell Labs, and it was inspired by Scheme, a dialect of Lisp. It’s also extensible, making it easy to call R objects from a number of other programming languages.
R is free and has become increasingly popular at the expense of traditional commercial statistical packages like SAS and SPSS. Most users write and edit their R code using RStudio, an Integrated Development Environment (IDE) for coding in R.
A little background on Python
Python has also been around for a while. It was initially released in 1991 by Guido van Rossum as a general purpose programming language. Like R, it’s also an interpreted language, and has a comprehensive standard library which allows for easy programming of many common tasks without having to install additional libraries. It’s also available for free.
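As a small illustration of what the standard library covers on its own, the sketch below parses some CSV text and computes summary statistics without installing anything. The names and scores are invented for the example.

```python
import csv
import io
import statistics

# Made-up CSV data; in practice this would come from a file on disk.
raw = io.StringIO("name,score\nana,82\nben,91\ncy,77\n")

# csv.DictReader turns each row into a dict keyed by the header line.
scores = [int(row["score"]) for row in csv.DictReader(raw)]

mean_score = statistics.mean(scores)
median_score = statistics.median(scores)
print(f"mean={mean_score:.1f} median={median_score}")
```

For anything beyond basics like these, you would normally reach for the third-party data science libraries.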
For data science, there are a number of extremely powerful Python libraries. There’s NumPy (efficient numerical computations), pandas (a wide range of tools for data cleaning and analysis), and statsmodels (common statistical methods). You also have TensorFlow, Keras and PyTorch, all libraries for building artificial neural networks, the models behind deep learning systems.
These days, many data scientists using Python write and edit their code using Jupyter Notebooks. Jupyter Notebooks allow for the easy creation of documents that are a mix of prose, code, data and visualizations, making it easy to document your process and for other data scientists to review and replicate your work.
Picking a language
Historically, there has been a fairly even split in the data science community. Typically, data scientists with a stronger academic or statistical background preferred R, whereas those with more of a programming background tended to prefer Python.
The strengths of Python
When compared to R, Python is . . .
General purpose: Python is a general purpose programming language. It’s great for statistical analysis, but Python will be the more flexible, capable choice if you want to build a website for sharing your results or a web service to integrate easily with your production systems.
Increasingly popular: In the September 2019 TIOBE index of the most popular programming languages, Python is the third most popular language (and has grown by over 2% in the last year), whereas R has dropped over the last year from 18th to 19th place.
Better for deep learning: Most serious deep learning projects use either TensorFlow or PyTorch. Both work really well with Python, and while there is now an R interface for TensorFlow, much more deep learning work is being done with Python than with R. That matters more and more as deep learning becomes applicable to an increasingly wide range of domains: it started off with computer vision, and it’s now becoming the default approach for most Natural Language Processing tasks as well.
There are still plenty of jobs where R is required, so if you have the time it doesn’t hurt to learn both, but I’d suggest that these days, Python is becoming the dominant programming language for data scientists and the better first choice to focus on.
Head of Data Science
Peter is a veteran technologist, CTO, entrepreneur, and longtime educator, having taught digital literacy at Columbia and authored numerous programming books.