Historically, many data scientists didn’t use “software development” tools like version control systems. These days, as their code becomes more sophisticated and they are increasingly influenced by their software engineering partners, it’s becoming important to learn how to skillfully use a version control system like Git.
In this brief, hands-on introduction, you’ll learn just enough Git so that when you get a job as a data scientist you’ll be able to keep track of your changes and share them with your peers.
What is version control anyway?
Version control systems allow you to keep track of the changes you’ve made to your work over time. It’s a little like “track changes” in Google Docs, but the difference is that you can save changes across a set of files, not just within an individual file. Most version control systems also support the idea of branching, allowing different people to make different sets of changes to the same underlying files, and then to merge their work together later.
How do data scientists use version control?
As a data scientist, even when you’re working with a single file (say a Jupyter Notebook), you can keep track of your changes using a version control system (usually it’ll be Git). It allows you to save your work periodically, which makes it easy to revert your notebook back to an earlier version.
As your project becomes more complex, version control becomes even more valuable. It’s quite common to start off a project with a single Jupyter Notebook. Over time, that notebook becomes so full of little functions to clean up your imported data that it becomes hard to focus on the important parts of the notebook.
A good way to solve that problem is to break out those functions into separate Python files that you can call using a single line of code. In that way, anyone looking to understand your project can get a high level view from the Jupyter Notebook, and then they can always dig into your supporting Python files if they want to understand the nuances of your data cleaning scripts. This strategy also makes it easier to write automated unit tests to confirm that your data cleaning scripts apply the transformations that you expect to various kinds of inputs.
Once your project has multiple files that you need to keep in sync, a version control system like Git is particularly useful as it will allow you to make a set of changes across multiple files and then “commit” them together so you can easily get all of the files back to that state in the future just by “checking out” that commit.
If you don’t have Git installed, follow the official installation procedure for your operating system before continuing.
Let’s start off by defining a few key concepts that will help when we’re talking about Git:
A repository - This is Git’s name for a project. It includes all of the files in the project along with all of the information about how they have changed over time. If you have a full copy of a repository (often referred to as a “repo”), you can view the current state of the project, but you can also view any previous state that the project used to be in.
A commit - In Git, history is made up of a series of commits, which are stored in the repository. Every time you make a meaningful set of changes to your project, you should commit them so that you can always get back to the project in that state in the future.
The Staging Area - This is like a shopping basket for version control. It’s where you load up the sets of changes that you’d like to put in your next commit, so if you have edited three files, but want to make one commit with two of them and another commit with the third, you just “stage” the first two using the git add command, then commit them with an appropriate message, and then add and commit the last file separately.
Getting started with Git
Let’s do a little hands-on practice with Git. If you’re on Windows, open the “Git Bash” program; if you’re running Mac or a flavor of Linux, just open up a terminal window. On a Windows machine, it’s important not to just open up PowerShell or the default terminal - the Unix commands used in this tutorial may not work correctly there.
Go to a directory somewhere within your home directory (so you have write permissions). Let’s make sure you are not already in a directory that is part of a Git repository (unlikely, but it happens):
> git status
fatal: not a git repository (or any of the parent directories): .git
Good. We asked Git for the status of the repository we were in, and it let us know we’re not in a git repo. That's good — creating one Git repo inside of another will confuse both you and Git!
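If you’d rather get a yes/no answer than read an error message, one option is git rev-parse --is-inside-work-tree, which succeeds only when the current directory is inside a Git working tree. The sketch below is just an illustration: it assumes a Unix-style shell such as Git Bash, and the temporary directory and the inside_repo variable are names made up for the example.

```shell
# Move to a fresh temporary directory (assumed not to be inside any Git repo).
scratch_dir=$(mktemp -d)
cd "$scratch_dir"

# rev-parse exits with a non-zero status outside a repo; we hide its output
# and only look at whether it succeeded.
if git rev-parse --is-inside-work-tree >/dev/null 2>&1; then
  inside_repo=yes
else
  inside_repo=no
fi

echo "Inside a Git repo? $inside_repo"
```

If this reports that you are already inside a repo, move somewhere else before running git init.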
Now let’s create a new Git repo and directory all in one step:
> git init my_first_repo
Initialized empty git repository in /Users/peterbell/Dropbox/code/my_first_repo/.git/
Perfect. So it created a repository under the directory I was in. Let’s use the Unix “change directory (cd)” command to go there:
> cd my_first_repo
my_first_repo git:(master)
OK, so my terminal tells me when I’m in a Git repo by showing the git:(master) message. Yours may not, depending on how your shell is set up. Let’s see the status of this project:
> git status
On branch master
No commits yet
nothing to commit (create/copy files and use "git add" to track)
Cool. Don’t worry if you see slightly different messages — they vary by operating system and Git version, but the bottom line is that Git is telling us that we don’t have any commits yet, we’re on the “master” branch (the main branch) and there aren’t any files here to save into version control.
Let’s just check that you have the basic configuration for Git so that when you save files it knows your name and email address.
> git config --global user.name
Peter Bell
With the command above, we’re accessing the configuration settings for Git on your computer. The --global flag means we’re looking at the configuration settings that will apply to all of the projects you work on logged in as this user on this machine. The uncommon --system flag accesses settings that are shared across all users on your machine. Lastly, the --local flag accesses settings for a specific project - so it only works if you’re within a Git repo when you run the command.
When you pass a key to git config without a value (in this case, the user.name key), it returns the existing value. If you also pass a value, it sets that value.
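To see both behaviours without touching your real settings, here is a small sketch that writes a project-level (--local) value and then reads it back. The throwaway repository and the name "Project Specific Name" are placeholders invented for this example, not part of the tutorial’s project.

```shell
# Create a throwaway repo so --local has a project to write to.
# (The directory and the name used here are placeholders for illustration.)
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo"

# Key plus value: sets user.name for this repo only.
git config --local user.name "Project Specific Name"

# Key without a value: reads the setting back. Local settings take
# precedence over global ones, so this prints the name set above.
effective_name=$(git config user.name)
echo "$effective_name"
```

Because the value was written with --local, it lives in this one repo’s configuration and your global name is left alone.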
Now depending on your setup you might have seen your name, nothing, a message that Git hasn’t been set up properly, or even an error message that a file could not be found. If you see anything other than your name, set your name like this:
> git config --global user.name 'Your Name'
> git config --global user.name
Your Name
And you should now see your name.
Let’s do the same for your email address:
> git config --global user.email
email@example.com
If it doesn’t have the value you want, set it to something. No quotation marks required:
> git config --global user.email firstname.lastname@example.org
> git config --global user.email
firstname.lastname@example.org
There are a lot of other settings but at least Git now knows what name and email address to save with your commits.
Adding some files
The easiest way to create a test file is to use the Unix command “touch.” If the file exists, it’ll just update the timestamp. If it doesn’t, it’ll create a blank file we can then add into version control.
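If you want to convince yourself of both behaviours, the following sketch tries them in a temporary directory. The file name notes.txt and the variable names are made up for the illustration.

```shell
# Work in a temporary directory so no real files are affected.
scratch_dir=$(mktemp -d)
cd "$scratch_dir"

# First call: the file doesn't exist yet, so touch creates it empty.
touch notes.txt
if [ -s notes.txt ]; then created_empty=no; else created_empty=yes; fi

# Put some text in the file, then touch it again: the content survives,
# only the modification time is updated.
echo "hello" > notes.txt
touch notes.txt
content_after_touch=$(cat notes.txt)

echo "Created empty? $created_empty"
echo "Content after second touch: $content_after_touch"
```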
So let’s create three files. They won’t have any content, but we’ll give them names that we might use when working on a real data science project.
> touch index.ipynb
> touch import.py
> touch clean.py
> git status
On branch master
No commits yet
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        clean.py
        import.py
        index.ipynb
nothing added to commit but untracked files present (use "git add" to track)
OK, so we’re still on the master branch. We haven’t committed (saved into permanent history in Git) yet, and the three files are “untracked” — Git isn’t really paying much attention to them until we add them.
Now imagine we want to make an initial commit for the Jupyter Notebook file (the index.ipynb) and then another commit for the import and cleaning scripts.
> git add index.ipynb
> git status
On branch master
No commits yet
Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
        new file:   index.ipynb
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        clean.py
        import.py
So this is telling us that when we do make a commit now, the index.ipynb file is the one that’ll get saved. Let’s do that:
> git commit -m 'Add Jupyter Notebook file'
[master (root-commit) 998db10] Add Jupyter Notebook file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 index.ipynb
OK, so what’s going on here? Firstly, I told Git to make a commit - to save this set of changes into history. I passed it the -m flag to pass a message for the commit. And I followed the -m with the message I wanted to associate with this commit, enclosed within single or double quotes. The purpose of the commit message is to make it easier for anyone in the future to understand what changes I made and why I made them.
It’s important to know that, by default, every commit requires two things: a commit message and at least one added, modified, renamed or deleted file. If you don’t pass a commit message with -m, Git will throw you into whatever text editor it’s configured to use (look out - it might be something a little cryptic like vi) so you can write one, and it will abort the commit if you leave the message empty.
And what does the response mean? Well, it’s telling us we’re still on master and that we have just made the root (very first) commit. It’s giving us the first 7 characters of the hexadecimal SHA-1 hash that uniquely identifies each commit in a Git repository, and it’s sharing my commit message and how many files were changed. In this case we added 1 file, but didn’t add or remove any lines of content because the file was empty. It also shows me the file within the commit (index.ipynb), and it says “create mode 100644”, which records that the file is a regular, non-executable file and which you can pretty much ignore for now.
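To see how the short identifier relates to the full one, you can ask Git for both with git rev-parse. This is only a sketch in a throwaway repository: the directory, the "Example User" identity and the variable names are placeholders, and your hashes will differ from anyone else’s.

```shell
# Throwaway repo with one commit. The identity below is a placeholder that
# is stored only in this repo's local config.
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo"
git config --local user.name "Example User"
git config --local user.email "user@example.com"
touch index.ipynb
git add index.ipynb
git commit -q -m "Add Jupyter Notebook file"

# Full identifier of the latest commit, and its abbreviated form.
full_hash=$(git rev-parse HEAD)
short_hash=$(git rev-parse --short HEAD)

echo "full:  $full_hash"
echo "short: $short_hash"
```

The short form is simply the beginning of the full hash, which is why Git can accept either when you refer to a commit.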
Cool. And what’s our current Git status now?
> git status
On branch master
Untracked files:
  (use "git add <file>..." to include in what will be committed)
        clean.py
        import.py
nothing added to commit but untracked files present (use "git add" to track)
Perfect. So it sees that we still have two untracked files. Let’s add and commit them.
> git add .
There are a lot of ways of adding files to the “staging area” in Git. You can name them one at a time (git add clean.py import.py). You can match a set of files using a glob pattern (git add *.py), or you can just add all of the new and changed files in the current directory and below (git add .).
Whichever approach you take, that adds the other two new files to the staging area.
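The difference between these approaches is easier to see when there is a file the glob does not match. The sketch below uses a throwaway repository and an extra, hypothetical notes.txt file, then lists what is staged with git diff --cached --name-only.

```shell
# Throwaway repo with two Python files and one text file (notes.txt is a
# made-up file added only to show the difference).
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo"
touch clean.py import.py notes.txt

# The shell expands *.py to clean.py and import.py before Git sees it,
# so only the two Python files are staged.
git add *.py
staged_after_glob=$(git diff --cached --name-only)

# "git add ." stages everything new or changed from this directory down,
# which picks up notes.txt as well.
git add .
staged_after_dot=$(git diff --cached --name-only)

echo "After the glob:"; echo "$staged_after_glob"
echo "After the dot:";  echo "$staged_after_dot"
```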
> git status
On branch master
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
        new file:   clean.py
        new file:   import.py
So all we have to do is commit them:
> git commit -m 'Add import and cleaning scripts'
[master 625e7a1] Add import and cleaning scripts
 2 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 clean.py
 create mode 100644 import.py
Great - it’s made a new commit on master (625e7a1 in my case - for you it’ll be different, as the hash is based in part on the name, email address and timestamp recorded in this and previous commits) and added two new files (but no lines of text, because in this simple tutorial example they were both blank files).
Congratulations! You just created a new Git repo, and staged and committed some files.
What’s with the staging area?
Now, you might be asking the quite reasonable question “why do we have to run two separate commands - git add and then git commit - just to save our work?”
Firstly, it’s not something you’ll have to do all the time. As a data scientist, you spend most of your time modifying files - typically your Jupyter Notebook and maybe a few supporting Python files. When you’re modifying files, Git gives a shortcut of git commit -am "your message here", which will both add modified files that Git is already tracking and commit them in a single line, so most of the time you only have to type a single command.
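One thing worth trying for yourself is what the shortcut does with a brand-new file. The sketch below, in a throwaway repository with a placeholder identity and a hypothetical explore.py file, modifies a tracked file, creates an untracked one, and then checks what is left over after git commit -am.

```shell
# Throwaway repo; the identity is a placeholder stored only in this repo.
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo"
git config --local user.name "Example User"
git config --local user.email "user@example.com"

# First commit: clean.py becomes a tracked file.
touch clean.py
git add clean.py
git commit -q -m "Add cleaning script"

# Modify the tracked file and create a brand-new, untracked one.
echo "# drop rows with missing values" >> clean.py
touch explore.py

# -a stages the modified tracked file; -m supplies the message.
git commit -q -am "Document cleaning step"

# Whatever is still listed here was not included in the commit.
left_over=$(git status --porcelain)
echo "$left_over"
```

The new file is still reported as untracked (the ?? marker), so files Git has never seen still need an explicit git add.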
But the real power of the staging area is whenever you make multiple changes and then want to go back and sort them into separate commits.
You might well ask “why bother having a bunch of different commits?” This is a particularly common question from software developers who have used older version control systems like Subversion, where committing is a slower process and it’s common for devs to just code all day and save their changes with a message along the lines of “stuff I did on Monday!”
The reason it’s important to create meaningful commit messages with one commit for each kind of change made (“updated visualization”, “added one hot encoding of categorical data”, etc) is so that when you or your team go back to your log, it’s easy to understand how you got here, and to find and perhaps even revert (undo) anything that’s problematic. It’s the same reason you don’t name all your variables “a”, “b” and “c” - the computer wouldn’t mind, but it will not make your life easier the next time you pick up the code and try to figure out what it’s all about!
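As a rough illustration of what a log of small, well-named commits looks like, the sketch below makes two commits in a throwaway repository and prints them with git log --oneline. The file contents, the placeholder identity and the commit messages are invented for the example.

```shell
# Throwaway repo; the identity is a placeholder stored only in this repo.
demo_repo=$(mktemp -d)
git init -q "$demo_repo"
cd "$demo_repo"
git config --local user.name "Example User"
git config --local user.email "user@example.com"

# One commit per kind of change, each with a message that says what changed.
echo "plot placeholder" > index.ipynb
git add index.ipynb
git commit -q -m "Update visualization"

echo "# one-hot encode categorical columns" > clean.py
git add clean.py
git commit -q -m "Add one-hot encoding of categorical data"

# One line per commit, newest first: abbreviated hash plus message.
git log --oneline

commit_count=$(git rev-list --count HEAD)
latest_subject=$(git log -1 --format=%s)
echo "Commits in history: $commit_count"
```

Each line of that log names one change, which is what makes it practical to find a problematic commit later.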
There’s a lot to learn about Git. We haven’t covered branching, pushing and pulling from a remote server, undoing your changes, more advanced configuration settings, or how to check out previous commits, but once you understand the basic principles of the staging area, you’ll be ahead of a lot of people who have been using Git for a while. And keep an eye out for more articles in this series over the upcoming weeks!
Head of Data Science
Peter is a veteran technologist, CTO, entrepreneur, and longtime educator, having taught digital literacy at Columbia and authored numerous programming books.