An Intro To Git

I don’t like git. Now that I’ve said that, I also don’t like the way using git is taught to new users. It results in problems like this:

I feel like the person in the alt text most of the time, and it’s absolutely not the fault of the people don’t understand this thing. I’m not going to claim this “uses a beautiful distributed graph theory tree model (it doesn’t), or “pretty simple” (it’s not); instead, I’m going to present the git tutorial I wish someone else would’ve showed me.

One final note: there is more than one way to do most things in git. I am opting for the most cohesive and that which I find to apply to most scenarios.

Undo

One thing we are going to do is memorize a few shell commands. My first Java tutorial showed all of the boilerplate - the class, the public static void main(String[] args), and asked the reader to take it on faith that this would all make sense later. So I’m going to do that here, just this once.

The first thing you need to know is how to undo the previous action. Undo comes in two flavors: git reset --hard, and git reset --hard HEAD@{1}. (How’s that for obtuse boilerplate?)

The first is more of an “abort” - it’s used when an “operation” in progress has gone awry, or was the wrong operation. (I’ll define an operation more formally in a bit. If an operation is not in progress, this command is safe.)

The second is an undo proper: after a git operation has completed, to back it out, you run the obtuse long thing. Note that git reset itself counts as an operation, so to redo, one effectively just undoes again. Git does keep a state log much farther into the past, so it is possible to undo more than one operation, but I need to explain some other things before we get there.

Hello, world

Beyond starting with how to fix it when it breaks, I also really appreciate tutorials that use examples. In order to perform examples, though, we will need something to work with: a repository for our project. To do this, we will make our own using git init. Here is one way it can be called:

frozencemetery@kirtar:~$ mkdir testrepo
mkdir: created directory 'testrepo'
frozencemetery@kirtar:~$ cd testrepo
frozencemetery@kirtar:~/testrepo$ git init .
Initialized empty Git repository in /home/frozencemetery/testrepo/.git/

Note that one can git init . in a directory that is not empty, and git will not touch the existing files. In fact, git tries to be as apathetic about the world around it as possible; you can even move the repository around, or rename it, and git won’t care (it won’t even notice).

That’s all fine and good, but most of the time we don’t work with repositories created in this way. Most of the time, “someone else” made the repository - whether that person is another person running git init for their codebase, or a piece of software (e.g., GitHub) running per user request.

How do we get this code? Well, for example, to get the source for this website, one could do this:

frozencemetery@kirtar:/tmp$ git clone https://github.com/frozencemetery/mivehind.net
Cloning into 'mivehind.net'...
remote: Counting objects: 277, done.
remote: Compressing objects: 100% (4/4), done.
remote: Total 277 (delta 0), reused 0 (delta 0), pack-reused 273
Receiving objects: 100% (277/277), 2.12 MiB | 3.14 MiB/s, done.
Resolving deltas: 100% (139/139), done.
frozencemetery@kirtar:/tmp$ cd mivehind.net 
frozencemetery@kirtar:/tmp/mivehind.net$ ls -A
about.md  _config.yml  css       .git        _includes   _layouts  _posts
assets    COPYING      feed.xml  .gitignore  index.html  LICENSE

And it’s all there. Be careful - there will have been new commits since this writing, and not all outputs will match exactly.

Dotfiles?

In the above example, there are two entries whose names start with a dot, and so are invisible in normal ls. The first is the “.git” directory. This contains internal git state - everything git knows about the repository. It’s very interesting to look at what’s in there, if you’re me and are interested in the inner workings of version control. Otherwise I don’t recommend it.

The other dotfile, “.gitignore”, is more interesting. As the name might suggest, it is a list of filename (patterns) for git to ignore. So if you look at mine:

frozencemetery@kirtar:/tmp/mivehind.net$ cat .gitignore 
*~
out.html

What does it mean to ignore a file? Well, git, as version control software, has a notion of what files it “tracks”. Files can be in any of three states: tracked, untracked, and ignored. I use git status to compare this:

frozencemetery@kirtar:/tmp/mivehind.net$ echo "foo" > out.html
frozencemetery@kirtar:/tmp/mivehind.net$ echo "foo" > untracked
frozencemetery@kirtar:/tmp/mivehind.net$ echo "foo" > feed.xml 
frozencemetery@kirtar:/tmp/mivehind.net$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)
    
        modified:   feed.xml
        
Untracked files:
  (use "git add <file>..." to include in what will be committed)
          
        untracked
              
no changes added to commit (use "git add" and/or "git commit -a")

There’s a lot going on in the output of that command, so let’s go through some of it (and we’ll do more in a bit).

First, we edit the contents of, in order, an ignored file, an untracked file, and a tracked file. Then, when we ask git about the state of the world (git status), it tells us nothing about the first file (because it’s ignored), that an untracked file exists (because we just made it), and that a file has changed (modified).

Now, as a demonstration, here’s what undo does:

frozencemetery@kirtar:/tmp/mivehind.net$ git reset --hard
HEAD is now at e2840c0 [post] Write about numbers
frozencemetery@kirtar:/tmp/mivehind.net$ git status
On branch master
Your branch is up-to-date with 'origin/master'.
Untracked files:
  (use "git add <file>..." to include in what will be committed)
  
        untracked
    
nothing added to commit but untracked files present (use "git add" to track)

Note that it only undid the changes to the file(s) that git tracked - the untracked file and the ignored file are left alone.

Basic commit workflow

git status really likes to talk about commits, so I’ll humor it. But first: what is a commit? It sounds simple, but…

The answer is nontrivial, and requires understanding how git works internally because it provides no abstraction over this concept. In git, a commit is a snapshot of the repository contents (in its entirety), at a particular point in time, with a message, that has a notion of the commit that preceded it (which git calls parent). In order to accomplish this last bit, each commit has an associated commit hash that is (hopefully) unique. (It’s currently sha1, though I hope it changes for collision reasons.)

Git organizes commits into branches. Branches are just readable pointers for the hashes, and can be moved to point to a different hash. (This will make sense in a moment.) The default branch is named master, so that is the branch we are on now, as git status informed us. More formally, HEAD is pointed at the tip of the master branch, where HEAD is a pointer at to the hash which reflects the repository state.

So let’s say, for the sake of example, we wanted to include the “untracked” file in my blog. Well, the first thing we should do is switch to another branch because most projects consider it bad practice to develop on master. That would look like this:

frozencemetery@kirtar:/tmp/mivehind.net$ git checkout -b new_file
Switched to a new branch 'new_file'
frozencemetery@kirtar:/tmp/mivehind.net$ git status
On branch new_file
Untracked files:
  (use "git add <file>..." to include in what will be committed)
  
       untracked
    
nothing added to commit but untracked files present (use "git add" to track)

git checkout is a command which manipulates both HEAD and the repository contents simultaneously. In this case, we have asked it to create a new branch, called “new file”, which points to the same hash as HEAD (which, one will recall, is at the tip of master). And once it’s done that, git checkout will switch HEAD to point to it and update repository contents to match.

Since they point to the same hash, no changes are actually made to the repository contents. However, if we were switching to an existing branch (call git checkout new_file instead), it is possible that there would have been changes to repository contents.

Git requires committing changes to happen in two steps: first, we stage what we want to commit, and then we actually commit it. But we also can only commit changes to tracked files. Fortunately, the command for adding a file to git’s tracked file index and the command for staging are the same:

frozencemetery@kirtar:/tmp/mivehind.net$ mv untracked f
frozencemetery@kirtar:/tmp/mivehind.net$ git add f
frozencemetery@kirtar:/tmp/mivehind.net$ git status
On branch new_file
Changes to be committed:
  (use "git reset HEAD <file>..." to unstage)
  
    new file:   f

I renamed the file first because this will get really confusing otherwise. But look at that: our file is ready to go. If we wanted to make a multi-file change, we could also run git add more times, but let’s not for now. Instead, we’ll go ahead and make a new commit. To do this, we just run git commit. There’s no fancy syntax highlighting here because this command will spawn an editor for you to type a message explaining your commit. Nah, here it is:

frozencemetery@kirtar:/tmp/mivehind.net$ git commit
[new_file 902e6ed] HEY I MADE A COMMIT
 1 file changed, 1 insertion(+)
 create mode 100644 f

A brief note on commit messages: the git tooling - and most projects - expects your commit messages to consist of a single, <50 character message. This can be followed by a <~78 character-wrapped paragraph separated by a blank line. There can also be colon-delimted tags (think HTTP headers, if you’re familiar). Commit message styling is something people care a lot about.

Of course, we probably wanted to see this commit. Not to worry; git allows us to view the commit history in a branch with git log. This will open a pager if the history does not fit in the terminal, but the top will look something like this:

commit 902e6edd93b9cf6f9de636a07a00c0c3c7f30151
Author: Robbie Harwood <ihate@spam>
Date:   Sun Feb 19 17:08:01 2017 -0500

    HEY I MADE A COMMIT
    
    HOW DO I TURN OFF CAPS LOCK AGAIN
    
    Resolves: #37

commit e2840c0c1cce1778261311378ca73ce8abfd89de
Author: Robbie Harwood <itsreal@bad>
Date:   Sun Feb 12 20:36:33 2017 -0500

    [post] Write about numbers

So, for each commit, it shows the hash, the message, (a few other things,) and then the first parent is just the next one in the list.

Pull requests and remotes

Git purists will say that pull request procedure is not part of git proper. To which I say that, while perhaps true, it does the reader a disservice to ignore it. Since this is the direction most tools are heading, and because this is how I handle the ~~victim~~ example repository, I assume something that works like GitHub.

So we go into our tool and click the “fork” button. (Please don’t actually do any of this to my website’s repo unless you have noncontrived changes to contribute.) A fork is your own copy of a (generally read-only) upstream repository, which therefore shares history and intends to contribute back changes. Once done, we need to tell git about this fork. Git tracks non-local versions of the repository as well as the local one; these are called remotes. So in order to add our fork, we would do something like this:

frozencemetery@kirtar:/tmp/mivehind.net$ git remote add my_fork https://github.com/my_user/mivehind.net
frozencemetery@kirtar:/tmp/mivehind.net$ git remote -v
my_fork https://github.com/my_user/mivehind.net (fetch)
my_fork https://github.com/my_user/mivehind.net (push)
origin https://github.com/frozencemetery/mivehind.net (fetch)
origin https://github.com/frozencemetery/mivehind.net (push)

At which point we can say git fetch my_fork, which will tell git to update its cache of the my_fork remote’s state. To go the other way, we use git push -u my_fork to create the current branch on our fork and update it with our contents. (Thereafter, we can invoke git push on the branch, since git tracks which remote a branch is tied to, or tracks.)

From there, it’s back into the web interface to make a PR from our fork’s branch.

Receiving a pull request

Suppose you were me, received this pull request, and decided to add it to the repository. Depending on how the project works, I would do one of two things, which I will call the merge workflow and the rebase workflow.

Most web tools, GitHub included, favor the merge workflow by providing a button to do it for you. Supposing I wanted to do this myself, though, I would first fetch your fork (add remote first), and then generate a merge commit. A merge commit is a special kind of commit that is identical to a normal commit except that it has multiple parent commits. The easiest way to generate a merge commit is to run git merge my_fork/new_file, where my_fork is the remote and new_file is the branch, which will create a commit uniting the new_file branch from the fork onto the current branch. A merge commit typically merges a smaller, development branch onto a main branch. Many people do not like merge commits because having multiple parents creates a nonlinear history, which is more difficult to work with later.

This contrasts with the rebase workflow, which is harder to execute but results in a cleaner repository history. Here, confusingly, one also runs git merge, but does it slightly differently: git merge --ff-only my_fork/new_file. If it all works, then the current branch looks as if the commits in new_file had happened on top of it originally. That is, no merge commit is generated. I’ll get into what happens for failure in both workflows in just a moment.

As an example of this in the wild: the Linux kernel uses the rebase workflow (without a web tool) for each subsystem, and then the subsystem maintainers periodically ask Linus to pull their subsystem into the mainline kernel branch using merge workflow.

Conflicts and rebaseing

If there have been changes to origin’s master branch since the fork’s branch was created, then both workflows may fail. (Rebase will always fail; merge will only fail if both branches have modified the same section of the same file.)

git merge, in the merge workflow, will prompt the operator (that’s us) to fix the conflict: git status reveals what is wrong, and git sets off the problematic regions of the file with “<<<” and “>>>”. git add the files once fixed, and then commit when done.

The rebase workflow is so named because of the way these conflicts are resolved. Typically the problem of failure here is given back to the contributor of the pull request, which I think is bad, but since it’s common I need to explain it.

First, get the changes to origin’s master branch. Then we run git rebase to edit history. This is a very dangerous operation. Or rather, it would be, if I hadn’t opened with how to undo. It looks like this:

frozencemetery@kirtar:/tmp/mivehind.net$ git checkout master
Switched to branch 'master'
Your branch is up-to-date with 'origin/master'.
frozencemetery@kirtar:/tmp/mivehind.net$ git pull origin
(Your output will vary depending on the state)
frozencemetery@kirtar:/tmp/mivehind.net$ git checkout new_file 
Switched to branch 'new_file'
frozencemetery@kirtar:/tmp/mivehind.net$ git rebase -i master

Hey look, a wild new command appeared! git pull is just a handy shortcut: it runs git fetch followed by git merge --ff (which will not generate a merge commit unless there is a conflict one needs to resolve).

The final invocation in that block performs what we call an interactive rebase. It will open an editor displaying the actions to be performed. Here, we’re just using it as a sanity check: it should show only commits from the fork’s branch, but sometimes it gets confused. git rebase is a very powerful history editing tool, and I’m not going to be able to explain it all here. Many people prefer not to work with it at all, and history editing is actively discouraged in other version control systems (e.g., mercurial).

Save and close when you’re done staring at the abyss, and git will carry out the changes. If there is a problem, or you requested it stop for editing, git status now gives helpful prompts about how to proceed. (There used to be a much longer section here.)

Reflog (or, undo explained)

Recall that git uses HEAD as a reserved name pointer to the checked out repository state. So check this out:

frozencemetery@kirtar:/tmp/mivehind.net$ git reflog
902e6ed HEAD@{0}: checkout: moving from master to new_file
e2840c0 HEAD@{1}: checkout: moving from new_file to master
902e6ed HEAD@{2}: commit: HEY I MADE A COMMIT
e2840c0 HEAD@{3}: checkout: moving from master to new_file
e2840c0 HEAD@{4}: reset: moving to HEAD
e2840c0 HEAD@{5}: clone: from https://github.com/frozencemetery/mivehind.net

(If you’ve been following this tutorial closely, yours will not match mine.)

What are we looking at? Well, it’s the history of where HEAD has been. HEAD@{0} is the same as HEAD - that is, the current state. And HEAD@{1} is the previous position, and so on. Note that git records the operation, including the type, as well as a truncated version of the commit hash. (People, including tools such as GitHub will use these truncated hashes in place of the full one; they really shouldn’t since it makes collision even more likely. The kernel has had a non-malicious instance of truncated hash collision already.)

But I’ve only explained half the story. git reset is a command that moves head. Passing “–hard” causes it to also adjust the repository directory to match; without “–hard” it will not change files on disk. And the reason “multiple undo” is nontrivial is that its own HEAD movement is recorded:

frozencemetery@kirtar:/tmp/mivehind.net$ git reset --hard HEAD@{1}
HEAD is now at e2840c0 [post] Write about numbers
frozencemetery@kirtar:/tmp/mivehind.net$ git reflog
e2840c0 HEAD@{0}: reset: moving to HEAD@{1}
902e6ed HEAD@{1}: checkout: moving from master to new_file
e2840c0 HEAD@{2}: checkout: moving from new_file to master
902e6ed HEAD@{3}: commit: HEY I MADE A COMMIT
e2840c0 HEAD@{4}: checkout: moving from master to new_file
e2840c0 HEAD@{5}: reset: moving to HEAD
e2840c0 HEAD@{6}: clone: from https://github.com/frozencemetery/mivehind.net

Extra: notation

Git has many different ways to delimit commits and commit ranges. If I’ve written this well, you should be able to now read most of man gitrevisions. Git has its own idiosyncratic documentation style that it’s worth getting used to eventually.

In particular, I recommend understanding “~”, “^”, and “..”; one may also find use for “…” on occasion. It is also important to be aware of which commands use “remote/branch” and which use “remote branch”. More on this at the end.

Extra: working with release branches

Occasionally, one may wish to apply a specific commit from another branch onto the current branch. Supposedly there was debate about including this functionality into git at all (though one can beat on the tool to do pretty much anything, especially with git rebase), but I have found it very useful in managing release branches. A release branch is a branch which is expected to be slower-moving than master; generally, it only receives new commits for bugfixes, and eventually it stops being supported.

Typically, these bugfixes will land in the master branch first, and then the stable branch maintainer will apply them. The invocation is git cherry-pick -x commithash (where commithash is of course the hash of the target commit, or a pointer to it, or a commit range). I recommend the use of “-x”, which will record the commit hash we cherry-picked from in the new commit. Remember that since commits include parent information, this will not be the same hash.

Extra: stash

Certain git operations demand full control over the tree and will complain if there are changes which have not been committed. The easiest things to do are of course commit the changes or discard them, but sometimes this isn’t possible.

Enter git stash. Running this command creates a temporary commit with your changes. One can then perform the persnickety operation, and then run git stash pop to restore the changes.

This is a very “hold my beer” kind of operation: its very common to push a stash and forget to pop it later. Since there are no commit messages logged, these are almost always incomprehensible when discovered. Incidentally, it is possible to make multiple stash commits, and they function as a stack; I do not recommend it.

Final thoughts

This is pretty much all of the git I use regularly. There is of course more out there. A lot more, actually:

frozencemetery@kirtar:~$ apropos ^git | wc -l
187

Historically, git has become extremely prevalent due to its speed. Its user interface is not intuitive at all; the user must be aware of its internal workings. The design did not believe in abstraction, and there are parts that were very clearly not designed as part of the whole. Please do not make software like this.