Diving into Git

This week I decided to convert my Ledger repository over to Git. Previously I’d been using Subversion for about 4 years, and CVS for 1 year before that. There was a brief flirtation with Darcs and Mercurial, but neither ever attracted me enough to convert the repository officially.

Why did I choose Git? Actually, I’d looked at Git before, maybe a year ago, and decided it was too complex and funky. But some recent articles – and new versions of Git – prompted me to look again. Yes, it still looks complex, but then again, UNIX is complex and I’ve never stopped loving that since I made my first terminal connection. In fact, when you look at Git in terms of the UNIX philosophy, rather than as a single application, it starts making a whole lot more sense. (It was written by a UNIX-ish kernel developer, after all.)

Migrating my official repository represented a special challenge, because I decided I wanted my entire history, not just the Subversion parts of it. I mean, I wanted to pull the CVS repo out of the archives and thread it along with the Subversion repo into a nice, coherent history going all the way back to version 0.1.

With other tools – even Mercurial – I would have shied away from such an undertaking. But Git not only made it possible, it was even straightforward and rather fun to do. This article chronicles my adventures in manually pasting together a version control history, and how powerfully Git was able to handle this task – which would have been patently impossible using CVS or Subversion.

Importing the CVS history

The first step was to import the CVS history. The Ledger project began in late August 2003, but I didn’t start using version control to track it until 24 Sep. Luckily I had an old backup image on my laptop and was able to start hacking right away. The command to pull this initial history into Git was:

$ mkdir ledger.cvs; cd ledger.cvs; git init
$ git-cvsimport -d /tmp/cvs -v -m -p -Z,9 ledger

In this case, /tmp/cvs is where I copied the CVS repository from my backup image, since CVS requires write access in order to do file and folder locking. The command ran very quickly, since the history was only 1 year long. After it completed, I was able to run git log right away and see what my initial commits looked like.

Importing the Subversion history

The next step was to import my Subversion history from the SourceForge server. Actually, I copied it to my local disk first which made things go much quicker. I imported this into a new repo, running the command from the same parent directory as ledger.cvs above:

$ git svn clone -s --no-metadata --prefix=svn/ \
    file:///tmp/svn/ledger

I used the --no-metadata flag because I didn’t want git-svn-id tags littering my commit comments with uselessly redundant information. Since I don’t plan on using Subversion again for this project, there was no need to retain the tracking info.

After about 30 minutes the command completed, and presented me with a repository where the trunk, and every branch and tag, existed as remote branches. When I ran git log svn/trunk (the --prefix=svn/ option puts every remote branch under svn/), I saw all my Subversion history.

Rebasing one history on top of another

The Subversion history was started by checking in the contents of my source tree at some particular moment in time. The question is, how do I now base the Subversion history on the CVS history, in such a way that the connection is seamless? It turns out this is incredibly easy to do with an amazingly powerful command: git-rebase.

I’ll go ahead and do this work in yet another repository, just to show how easily Git handles these kinds of things:

$ mkdir ledger.all; cd ledger.all; git init

$ git remote add cvs ../ledger.cvs/.git
$ git remote add svn ../ledger/.git

$ git fetch cvs   # bring in all the CVS commits
$ git fetch svn   # bring in all the Subversion commits

Now that both histories existed in one repository, I needed just one more bit of information. Namely, I needed to know the base commit of the Subversion tree (the first checkin I ever made to it). This checkin looks like a bunch of file adds, since all I did was copy in a big set of files.

$ git log svn/master | tail -10

I wrote this commit’s hash number down and kept it in a safe place. I really only needed to know the first 6 or 7 characters. Let’s assume it was bd39abb.
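As an alternative to scanning the tail of the log, git rev-list can name a parentless commit directly. Here is a minimal sketch in a throwaway repository; the file names, commit messages, and the demo identity are made up for illustration and are not from my real history.

```shell
set -eu
# Throwaway identity so commits work on an unconfigured machine (illustrative values).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_PAGER=cat

repo=$(mktemp -d)
cd "$repo"
git init -q

echo one > file.txt; git add file.txt; git commit -q -m "initial import"
echo two > file.txt; git commit -q -am "second change"

# A root commit is one with no parents; --max-parents=0 selects exactly those.
root=$(git rev-list --max-parents=0 HEAD)
git log -1 --format='%h %s' "$root"
```

In the combined repository described above, the equivalent would be git rev-list --max-parents=0 svn/master.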

Next I needed to know if anything significant had changed between the last CVS commit and the first SVN commit. This would mean any changes made during the transition between version control systems. Ideally there would be none, but you never know. I went ahead and applied these changes as a patch within a new local branch, which was based on the old CVS history:

$ git checkout -b cvs-work cvs/master
$ git diff cvs/master..bd39abb | patch -p1
$ find . -type f | xargs git add
$ git commit -m "Changes between CVS and Subversion"

What this did was create a branch whose final commit is identical to the starting state of the Subversion branch. It should be painless now to “rebase” the Subversion branch so that the parent of its first commit becomes the last commit of the CVS history:

$ git checkout -b svn-work svn/master
$ git rebase cvs-work

This command took a while, since it effectively “re-committed” every single commit object in the entire Subversion history. Also, since the first commit is now a no-op – the one where I checked in the current state of my files into Subversion – it just disappeared altogether from the history. The output from git log now shows my entire history from beginning to end.
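The whole bridge-and-rebase sequence can be tried safely on a toy repository. The sketch below is only an approximation of what I did: the branch names old-work, new-work, and bridge are hypothetical stand-ins for the CVS and Subversion imports, and it uses git apply --index where I used patch and git add. It assumes a reasonably recent Git, where rebase replays an unrelated history from its root.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_PAGER=cat

repo=$(mktemp -d)
cd "$repo"
git init -q

# "Old" history, standing in for the CVS import.
git checkout -q -b old-work
echo v1 > main.c; git add main.c; git commit -q -m "old: first version"
echo v2 > main.c; git commit -q -am "old: second version"

# Unrelated "new" history, standing in for the Subversion import. Its root
# commit is a snapshot that differs slightly from the last old commit.
git checkout -q --orphan new-work
git rm -q -rf .
echo v3 > main.c; echo notes > README
git add main.c README; git commit -q -m "new: snapshot import"
echo v4 > main.c; git commit -q -am "new: later change"
new_root=$(git rev-list --max-parents=0 new-work)
orig_tree=$(git rev-parse 'new-work^{tree}')

# Bridge commit: carry the old tip forward to the state of the new root.
git checkout -q -b bridge old-work
git diff --binary old-work "$new_root" | git apply --index
git commit -q -m "Changes between old and new history"
git diff --quiet bridge "$new_root"   # exits 0 only if the two trees match

# Replay the new history on top of the bridge.
git checkout -q new-work
git rebase -q bridge
git log --oneline
```

If the bridge commit is right, the final tree is unchanged by the rebase and the old commits become ancestors of the new branch.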

I did encounter a problem here with commits that had no checkin comment. In that case, I had to supply a “no comment” string manually, and then resume the rebase operation with git rebase --continue. And if at any time I had decided against the rebase operation, or if there were major problems, a simple git rebase --abort would have put me right back where I started.

With the svn-work branch now representing my entire history from start to finish, I decided to make it my new local master:

$ git branch -D master
$ git branch -m svn-work master

Cleaning up history

There was a time during my Subversion days when I hastily checked in over 15 megabytes worth of dependent tool chains, thinking it would be easier for my users to obtain the exact version I was using. Many commits later I decided against this, but there was no way to avoid the fact that Subversion holds onto your mistakes forever, permanently cluttering the repository with these dead files. What I wanted to know was, can I clean those turds out of my Git history, thus reducing my ridiculously large 77 MB repository (before packing, 31 MB after)?

The answer was a surprisingly easy Yes; and one made possible, again, by the glorious rebase command.

The first step was to find two different commits: the one where I added the tool chain tarballs, and the one where I removed them. This can be done fairly quickly using the log command:

$ git log --stat

I just searched for .gz, since I knew all the tarballs ended with it. Sure enough, they were checked in by commit 87abc32 and removed by commit 7734ff0.
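Instead of paging through the full --stat output, git log can be asked to show only the commits that added or deleted matching paths. This is a sketch on a made-up repository (the file name tools.tar.gz and the commit messages are assumptions for illustration):

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_PAGER=cat

repo=$(mktemp -d)
cd "$repo"
git init -q

echo code > main.c; git add main.c; git commit -q -m "initial"
echo blob > tools.tar.gz; git add tools.tar.gz; git commit -q -m "add tool chain"
echo more >> main.c; git commit -q -am "unrelated work"
git rm -q tools.tar.gz; git commit -q -m "remove tool chain"

# Show only commits that added (A) or deleted (D) a path ending in .gz.
git log --diff-filter=AD --name-status --format='%h %s' -- '*.gz'

added=$(git log --diff-filter=A --format=%s -- '*.gz')
removed=$(git log --diff-filter=D --format=%s -- '*.gz')
echo "added by: $added / removed by: $removed"
```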

To edit a repository’s history, use the rebase command with its interactive option, starting it from the parent of the first commit you want to change:

$ git rebase -i 87abc32^

This command says: starting with the parent of commit 87abc32, I want the ability to rewrite, delete, or re-order all the commits that come after it. After Git thinks for a moment, your editor opens a file with a bunch of lines that begin with “pick”. If you were to write this file out now and exit – not making any changes – it would reapply every commit in the file starting with the first. Rewriting commits changes their ids, so you can’t do this if you have observers pulling from your repository. Do it only in local branches, or before you publish your repo, as was my case here.

What I needed was to find the line pick 7734ff0 and move it right after the first line, which was pick 87abc32. I then changed the word “pick” to “squash” in the second line, meaning that I wanted rebase to put the two commits together, resulting in a commit whose diff represented the cumulative changes of the two. Since the first commit added the files (among other things), and the second commit removed them, the final result will be a commit with no tarballs in it at all, just all the other changes that happened in 87abc32.

It took about a minute for this to run, but at the end I was able to look at my new log and not see any trace of a tarball anywhere.
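The same reorder-and-squash edit can be scripted, which makes it easy to rehearse on a scratch repository first. In this sketch the helper reorder-todo.sh is hypothetical: it stands in for the manual editing session by writing the todo list I would otherwise have typed, and Git is pointed at it through the GIT_SEQUENCE_EDITOR environment variable. The file names and messages are invented.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_PAGER=cat

repo=$(mktemp -d)
cd "$repo"
git init -q

echo code > main.c; git add main.c; git commit -q -m "initial"
echo blob > tools.tar.gz; echo notes > NOTES
git add tools.tar.gz NOTES; git commit -q -m "add notes and tool chain"
echo more >> main.c; git commit -q -am "unrelated work"
git rm -q tools.tar.gz; git commit -q -m "remove tool chain"

ADD=$(git rev-parse HEAD~2)   # commit that added the tarball
MID=$(git rev-parse HEAD~1)   # unrelated commit in between
DEL=$(git rev-parse HEAD)     # commit that removed the tarball
export ADD MID DEL

# Hypothetical stand-in for the editor: move the removal right after the
# addition and mark it "squash", leaving the unrelated commit last.
cat > "$repo/.git/reorder-todo.sh" <<'EOF'
#!/bin/sh
printf 'pick %s\nsquash %s\npick %s\n' "$ADD" "$DEL" "$MID" > "$1"
EOF
chmod +x "$repo/.git/reorder-todo.sh"

# GIT_EDITOR=true accepts the combined commit message unchanged.
GIT_SEQUENCE_EDITOR="$repo/.git/reorder-todo.sh" GIT_EDITOR=true \
    git rebase -q -i "$ADD^"

git log --stat --format='%h %s'
```

After the rebase, no commit reachable from the branch mentions the tarball, while the other file added by the same commit survives.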

“Bring out your dead”

The size of my .git directory, however, was still a dismaying 77 MB. I ran git prune – to remove the repository objects no longer being referenced – but it didn’t change. What was going on? I then ran these commands:

$ git fsck --unreachable
$ git fsck --lost-found
dangling commit ....
dangling blob ....

Although the --unreachable option didn’t show anything as being available for pruning, the --lost-found option showed me the very commits I had just removed, and their associated blobs (the tarballs I was concerned about). But why was Git still holding onto them?

It turns out that Git has a very, very cool feature where it keeps track of every change you make to your repository. Say, for example, that you “pop off” the most recent commit in your branch, effectively deleting it:

$ git reset --hard HEAD^

This command removes the last commit from your repository’s history and resets your working tree to match the new HEAD. It’s like the commit never happened, and so it should be gone forever now, right? Well, the real answer is: not yet.

Git still holds a pointer to your commit in the form of a “reflog”. The reflog keeps track of every change you make to the repository, allowing you to examine and possibly recover them. For example, if you used the reflog command right after your reset command you might see something like this:

$ git reflog
bc180ef... HEAD@{0}: reset --hard HEAD^: updating HEAD

Each entry even has a hash value, and it names a perfectly regular commit: the one HEAD pointed to once the recorded operation finished. The reflog itself is not a commit but a running log of where HEAD and each branch have pointed, so the entry just before this one, HEAD@{1}, names the commit you dropped. Here are a few commands you can use to examine these entries more closely:

$ git cat-file -t bc180ef    # prove to me that it's a commit
$ git ls-tree -l bc180ef     # what data is it holding onto?
$ git show HEAD@{1}          # show me a patch of what I dropped
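To see the reflog acting as a safety net, here is a small sketch that drops a commit and then brings it back. It runs in a scratch repository with invented file contents; HEAD@{0} is where HEAD points now, and HEAD@{1} is where it pointed before the most recent change.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_PAGER=cat

repo=$(mktemp -d)
cd "$repo"
git init -q

echo one > file.txt; git add file.txt; git commit -q -m "keep me"
echo two > file.txt; git commit -q -am "drop me"
dropped=$(git rev-parse HEAD)

git reset -q --hard HEAD^          # "pop off" the newest commit
git reflog -2                      # the two most recent HEAD movements

# The previous reflog entry still names the dropped commit.
git show --stat --format='%h %s' 'HEAD@{1}'

git reset -q --hard 'HEAD@{1}'     # bring the dropped commit back
git log --oneline
```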

Because this commit exists in your repository’s reflog, all the trees and blobs it references will continue to live on. How long? For commits no longer reachable from any branch, the default is 30 days. Which means that git prune and git gc will not actually reclaim the space taken up by that commit for another month.

In the case of my giant tarballs I wanted to realize the space savings now. So I needed to prune the reflog itself such that no commit anywhere would reference my dead tarballs:

$ git reflog expire --expire=1.minute refs/heads/master

$ git fsck --unreachable      # now I see those tarball blobs!
$ git prune                   # hasta la vista, baby
$ git gc                      # cleanup and repack the repo

These commands wiped out the reflog history for the specified branch (master in this case), cleaned up all the dead space, and squeezed out the redundant bits. That 77 MB unpacked repository became a nicely packed, 2.1 MB one.
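The effect of expiring the reflog can be checked directly by asking Git whether a dropped blob still exists. This sketch uses a scratch repository and a tiny stand-in file named big.tar.gz. Note two swaps from what I ran: it uses --expire=now with --all, because freshly created entries are younger than a one-minute cutoff, and it lets git gc --prune=now do the pruning.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_PAGER=cat

repo=$(mktemp -d)
cd "$repo"
git init -q

echo small > file.txt; git add file.txt; git commit -q -m "base"
echo payload > big.tar.gz; git add big.tar.gz; git commit -q -m "add tarball"
blob=$(git rev-parse HEAD:big.tar.gz)

git reset -q --hard HEAD^          # drop the tarball commit from the branch

# The reflog still references the dropped commit, so its blob survives.
git cat-file -e "$blob" && echo "before: blob still stored"

git reflog expire --expire=now --all   # forget every reflog entry
git gc -q --prune=now                  # repack and delete unreachable objects

if git cat-file -e "$blob" 2>/dev/null; then
    state=present
else
    state=gone
fi
echo "after: blob is $state"
```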

The reality wasn’t quite so easy

Figuring all this out took me some time: about 16 straight hours, and the need to restart the whole process maybe 20 times. But once I got the hang of it, I found that Git’s various component tools make a whole lot of sense. There is real power here, waiting to be tapped by higher-level commands and interfaces. The kind of surgery I was able to perform – in real-time – was far beyond anything I’d ever experienced in the realm of version control systems.

And it was fast!! I rarely ever had to wait long for a change to happen, even though I was rewriting years of change history.

After this experience, far from being put off by the learning curve, I’m completely sold now. I feel like my data is wholly under my control, not subject to arbitrary things like version numbers or branch labels, etc. Everything is just a commit to Git, and the objects linked to those commits. Chain commits together from parent to child and you have a history; if a commit has multiple children, that’s a branch, while multiple parents represent a merge. How much simpler can you get?
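That parent-and-child picture is easy to inspect. The following sketch builds a tiny history with one branch and one merge (branch and file names are invented) and then asks Git for the parents of the merge commit:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
export GIT_PAGER=cat

repo=$(mktemp -d)
cd "$repo"
git init -q

echo a > a.txt; git add a.txt; git commit -q -m "root"
base=$(git rev-parse --abbrev-ref HEAD)   # whatever the default branch is called

# Two children of the same commit: that is a branch.
git checkout -q -b side
echo b > b.txt; git add b.txt; git commit -q -m "side work"
git checkout -q "$base"
echo c > c.txt; git add c.txt; git commit -q -m "main work"

# One commit with two parents: that is a merge.
git merge -q --no-ff -m "merge side" side

parents=$(git log -1 --format=%P HEAD)    # %P prints the parent hashes
set -- $parents
nparents=$#
echo "merge commit has $nparents parents"
git log --graph --oneline
```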

I’ve found that sometimes, the simpler a concept is, the more complex its explanation becomes – because true simplicity allows for the greatest range of expressive forms.