This week I decided to convert my Ledger repository over to Git. Previously I’d been using Subversion for about 4 years, and CVS for 1 year before that. I flirted briefly with Darcs and Mercurial, but neither attracted me enough to convert the repository officially.
Why did I choose Git? Actually, I’d looked at Git before, maybe a year ago, and decided it was too complex and funky. But some recent articles – and new versions of Git – prompted me to look again. Yes, it still looks complex, but then again, UNIX is complex and I’ve never stopped loving that since I made my first terminal connection. In fact, when you look at Git in terms of the UNIX philosophy, rather than as a single application, it starts making a whole lot more sense. (It was written by a UNIX-ish kernel developer, after all).
Migrating my official repository represented a special challenge, because I decided I wanted my entire history, not just the Subversion parts of it. I mean, I wanted to pull the CVS repo out of the archives and thread it along with the Subversion repo into a nice, coherent history going all the way back to version 0.1.
With other tools – even Mercurial – I would have shied away from such an undertaking. But Git not only made it possible, it was even straightforward and rather fun to do. This article chronicles my adventures at manually pasting together a version control history, and how powerfully Git was able to handle this task – which would have been patently impossible using CVS or Subversion.
Importing the CVS history
The first step was to import the CVS history. The Ledger project began in late August 2003, but I didn’t start using version control to track it until 24 Sep. Luckily I had an old backup image on my laptop and was able to start hacking right away. The command to pull this initial history into Git was:
$ mkdir ledger.cvs; cd ledger.cvs; git init
$ git-cvsimport -d /tmp/cvs -v -m -p -Z,9 ledger
In this case, /tmp/cvs is where I copied the CVS repository from my backup
image, since CVS requires write access in order to do file and folder locking.
The command ran very quickly, since the history was only 1 year long. After it
completed, I was able to run git log right away and see what my initial
commits looked like.
Importing the Subversion history
The next step was to import my Subversion history from the SourceForge server.
Actually, I copied it to my local disk first, which made things go much
quicker. I imported this into a new repo, running the command from the same
parent directory as ledger.cvs above:
$ git svn clone -s --no-metadata --prefix=svn/ \
file:///tmp/svn/ledger
I used the --no-metadata flag because I didn’t want git-svn-id tags littering
my commit comments with redundant information. Since I don’t plan on using
Subversion again for this project, there was no need to retain the tracking
info.
After about 30 minutes the command completed, and presented me with a
repository where the trunk, and every branch and tag, existed as remote
branches. When I ran git log trunk, I saw all my Subversion history.
Rebasing one history on top of another
The Subversion history was started by checking in the contents of my source
tree at some particular moment in time. The question is, how do I now base the
Subversion history on the CVS history, in such a way that the connection is
seamless? It turns out this is incredibly easy to do with an amazingly
powerful command: git-rebase.
I’ll go ahead and do this work in yet another repository, just to show how easily Git handles these kinds of things:
$ mkdir ledger.all; cd ledger.all; git init
$ git remote add cvs ../ledger.cvs/.git
$ git remote add svn ../ledger/.git
$ git fetch cvs # bring in all the CVS commits
$ git fetch svn # bring in all the Subversion commits
Now that both histories existed in one repository, I needed just one more bit of information. Namely, I needed to know the base commit of the Subversion tree (the first checkin I ever made to it). This checkin looks like a bunch of file adds, since all I did was copy in a big set of files.
$ git log svn/master | tail -10
I wrote this commit’s hash number down and kept it in a safe place. I really
only needed to know the first 6 or 7 characters. Let’s assume it was bd39abb.
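As an aside, Git can also print a branch's root commit directly, which saves reading the tail of the log. The following is a minimal sketch in a throwaway repository, not the command used in this article: the scratch directory, file name, and demo identity are made up for illustration, and it assumes a Git recent enough to support the --max-parents option of rev-list.

```shell
#!/bin/sh
# Throwaway demo: find the first (parentless) commit on a branch.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q

# Made-up identity so commits work on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo one > file.txt
git add file.txt
git commit -q -m "initial import"
first=$(git rev-parse HEAD)        # remember the very first commit

echo two >> file.txt
git commit -q -am "second change"

# List commits reachable from HEAD that have no parents, i.e. the root.
root=$(git rev-list --max-parents=0 HEAD)
echo "root commit: $(git rev-parse --short "$root")"
```

In the combined repository described here, the same query would be pointed at svn/master instead of HEAD.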
Next I needed to know if anything significant had changed between the last CVS commit and the first SVN commit. This would mean any changes made during the transition between version control systems. Ideally there would be none, but you never know. I went ahead and applied these changes as a patch within a new local branch, which was based on the old CVS history:
$ git checkout -b cvs-work cvs/master
$ git diff cvs/master..bd39abb | patch -p1
$ git add -A # stage additions, changes, and deletions made by the patch
$ git commit -m "Changes between CVS and Subversion"
This gave me a branch whose final commit is identical to the starting state of the Subversion branch. It should be painless now to “rebase” the Subversion branch so that the parent of its first commit becomes the last commit of the CVS history:
$ git checkout -b svn-work svn/master
$ git rebase cvs-work
This command took a while, since it effectively “re-committed” every single
commit object in the entire Subversion history. Also, since the first commit
(the one where I checked the current state of my files into Subversion) is now
a no-op, it just disappeared altogether from the history. The output from
git log now shows my entire history from beginning to end.
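The same stitching can be tried safely on a toy repository. This is only a sketch under stated assumptions: the branch names mirror the ones above, but the files, commit messages, and demo identity are invented, the two toy histories happen to line up exactly (so no "changes between" commit is needed), and it assumes a reasonably recent Git, where a replayed commit that no longer changes anything should be dropped automatically.

```shell
#!/bin/sh
# Throwaway demo: rebase one unrelated history on top of another.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q

# Made-up identity so commits work on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# History 1 stands in for the CVS import.
echo "v1" > main.c
git add main.c
git commit -q -m "cvs: first"
git branch -M cvs-work
echo "v2" >> main.c
git commit -q -am "cvs: last"

# History 2 stands in for Subversion: an unrelated root commit whose
# tree is identical to the tip of cvs-work, followed by later work.
git checkout -q --orphan svn-work
git commit -q -m "svn: initial checkin of existing files"
echo "v3" >> main.c
git commit -q -am "svn: later change"
want=$(git rev-parse 'HEAD^{tree}')   # final file contents before the rebase

# Replay the Subversion commits on top of the CVS history.
git rebase -q cvs-work
got=$(git rev-parse 'HEAD^{tree}')    # final file contents after the rebase

git log --oneline
echo "commits in combined history: $(git rev-list --count HEAD)"
```

The check worth making afterwards is that the final tree is unchanged and that cvs-work is now an ancestor of svn-work; the commit ids themselves are all new.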
I did encounter a problem here with commits that had no checkin comment. In
that case, I had to supply a “no comment” string manually, and then resume the
rebase operation with git rebase --continue. And if at any point I had decided
against the rebase operation, or if there had been major problems, a simple
git rebase --abort would have put me right back where I started.
With the svn-work branch now representing my entire history from start to
finish, I decided to make it my new local master:
$ git branch -D master
$ git branch -m svn-work master
Cleaning up history
There was a time during my Subversion days when I hastily checked in over 15 megabytes worth of dependent tool chains, thinking it would be easier for my users to obtain the exact version I was using. Many commits later I decided against this, but there was no way to avoid the fact that Subversion holds onto your mistakes forever, permanently cluttering the repository with these dead files. What I wanted to know was, can I clean those turds out of my Git history, thus reducing my ridiculously large 77 MB repository (before packing, 31 MB after)?
The answer was a surprisingly easy Yes; and one made possible, again, by the
glorious rebase command.
The first step was to find two different commits: the one where I added the
tool chain tarballs, and the one where I removed them. This can be done fairly
quickly using the log command:
$ git log --stat
I just searched for .gz, since I knew all the tarballs ended with it. Sure
enough, they were checked in by commit 87abc32 and removed by commit 7734ff0.
To edit a repository’s history, use the rebase command with its interactive
option, starting it from the parent of the first commit you want to change:
$ git rebase -i 87abc32^
This command says: starting with the parent of commit 87abc32, I want the
ability to rewrite, delete, or re-order all the commits that come after it.
What you should see, after Git thinks for a bit, is a file with a bunch of
lines that begin with “pick”. If you were to write this file out now and exit,
without making any changes, it would reapply every commit in the file starting
with the first. Once you do change something, every commit from that point on
gets a new commit id, so you can’t do this if you have observers pulling from
your repository. Do it only in local branches, or before you publish your
repo, as was my case here.
What I needed was to find the line pick 7734ff0 and move it right after the
first line, which was pick 87abc32. I then changed the word “pick” to “squash”
in the second line, meaning that I wanted rebase to put the two commits
together, resulting in a commit whose diff represented the cumulative changes
of the two. Since the first commit added the files (among other things), and
the second commit removed them, the final result will be a commit with no
tarballs in it at all, just all the other changes that happened in 87abc32.
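To try the reorder-and-squash edit without risking a real repository, here is a small scripted sketch. Everything in it is hypothetical: the file names, commit messages, demo identity, and the little edit-todo.sh helper, which only stands in for editing the todo list by hand. It relies on the GIT_SEQUENCE_EDITOR and GIT_EDITOR environment variables to drive the interactive rebase unattended.

```shell
#!/bin/sh
# Throwaway demo: squash an "add tarball" commit with its later removal.
set -e
demo=$(mktemp -d)
mkdir "$demo/repo"
cd "$demo/repo"
git init -q

# Made-up identity so commits work on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo "int main(void) { return 0; }" > main.c
git add main.c
git commit -q -m "base"

echo "pretend tarball" > tools.tar.gz
echo "notes" > NOTES
git add tools.tar.gz NOTES
git commit -q -m "add tool chain and notes"

echo "/* tweak */" >> main.c
git commit -q -am "unrelated work"

git rm -q tools.tar.gz
git commit -q -m "remove tool chain"

# Stand-in for hand-editing the todo list: hold line 2, turn line 3
# into a squash, then put the held line back after it.
cat > "$demo/edit-todo.sh" <<'EOF'
#!/bin/sh
sed -e '2{h;d;}' -e '3{s/^pick/squash/;G;}' "$1" > "$1.new" && mv "$1.new" "$1"
EOF
chmod +x "$demo/edit-todo.sh"

# Rewrite the last three commits; GIT_EDITOR=true accepts the combined message.
GIT_SEQUENCE_EDITOR="$demo/edit-todo.sh" GIT_EDITOR=true git rebase -q -i HEAD~3

# Count how many commits in the rewritten history still touch the tarball.
hits=$(git log --format= --name-only | grep -c 'tools.tar.gz' || true)
echo "commits touching the tarball: $hits"
```

The squashed commit keeps the NOTES file it also added, which mirrors the point above: only the add-then-remove pair cancels out, and the other changes survive.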
It took about a minute for this to run, but at the end I was able to look at my new log and not see any trace of a tarball anywhere.
“Bring out your dead”
The size of my .git directory, however, was still a dismaying 77 MB. I ran
git prune, to remove the repository objects no longer being referenced, but
the size didn’t change. What was going on? I then ran these commands:
$ git fsck --unreachable
$ git fsck --lost-found
dangling commit ....
dangling blob ....
Although the --unreachable option didn’t show anything as being available for
pruning, the --lost-found option showed me the very commits I had just
removed, and their associated blobs (the tarballs I was concerned about). But
why was Git still holding onto them?
It turns out that Git has a very, very cool feature where it keeps track of every change you make to your repository. Say, for example, that you “pop off” the most recent commit in your branch, effectively deleting it:
$ git reset --hard HEAD^
This command removes the last commit from your repository’s history and resets
your working tree to match the new HEAD. It’s like the commit never happened,
and so it should be gone forever now, right? Well, the real answer is: not
yet.
Git still holds a pointer to your commit in the form of a “reflog”. The reflog
records every move of HEAD and of your branch tips, allowing you to examine
earlier states and possibly recover them. For example, if you used the reflog
command right after your reset command you might see something like this:
$ git reflog
bc180ef... HEAD@{0}: reset --hard HEAD^: updating HEAD
Each entry leads with a hash, and that hash names a perfectly ordinary commit: the one HEAD pointed to once the operation finished. So HEAD@{0} is where you are now, after the reset, and the commit you just “popped off” sits one entry further back, at HEAD@{1}. Git did not create a new commit to record the reset; the reflog is simply a log of where each ref has pointed. Here are a few commands you can use to examine the dropped commit more closely:
$ git cat-file -t HEAD@{1} # prove to me that it's a commit
$ git ls-tree -l HEAD@{1} # what data is it holding onto?
$ git show HEAD@{1} # show me a patch of what I dropped
Because this commit exists in your repository’s reflog, all the blobs it
references, and the file copies reflecting those changes, will continue to
live on. How long? The default is 30 days. Which means that git prune and
git gc will not actually reclaim the space taken up by that commit for another
month.
In the case of my giant tarballs I wanted to realize the space savings now. So I needed to prune the reflog itself such that no commit anywhere would reference my dead tarballs:
$ git reflog expire --expire=1.minute refs/heads/master
$ git fsck --unreachable # now I see those tarball blobs!
$ git prune # hasta la vista, baby
$ git gc # cleanup and repack the repo
These commands wiped out the reflog history for the specified branch (master in this case), cleaned up all the dead space, and squeezed out the redundant bits. That 77 MB unpacked repository became a nicely packed, 2.1 MB one.
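The effect of the reflog on pruning is easy to watch in a scratch repository. The sketch below is an illustration rather than a transcript of what was run here: the file names and demo identity are made up, and it uses a blunter variant of the commands above (expiring every reflog with --expire=now --all, then git gc --prune=now), which assumes a Git that supports those options.

```shell
#!/bin/sh
# Throwaway demo: a dropped commit's blob survives until the reflog lets go.
set -e
demo=$(mktemp -d)
cd "$demo"
git init -q

# Made-up identity so commits work on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo keep > keep.txt
git add keep.txt
git commit -q -m "keep"

echo "pretend this is a big tarball" > big.bin
git add big.bin
git commit -q -m "add big file"
blob=$(git rev-parse HEAD:big.bin)     # object id of the file we will drop

# Pop off the commit; the reflog still remembers it.
git reset -q --hard HEAD^
if git cat-file -e "$blob" 2>/dev/null; then before=present; else before=gone; fi

# Forget all reflog entries, then prune unreachable objects immediately.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after=present; else after=gone; fi

echo "before expiring the reflog: $before"
echo "after expiring and pruning: $after"
```

Expiring every reflog at once is convenient for a repository that has not been published yet, but it also throws away the safety net for every other branch, so the per-branch form shown above is the gentler choice.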
The reality wasn’t quite so easy
Figuring all this out took me some time: about 16 straight hours, and maybe 20 restarts of the whole process. But once I got the hang of it, I found that Git’s various component tools make a whole lot of sense. There is real power here, waiting to be tapped by higher-level commands and interfaces. The kind of surgery I was able to perform, in real time, was far beyond anything I’d ever experienced in the realm of version control systems.
And it was fast!! I rarely ever had to wait long for a change to happen, even though I was rewriting years of change history.
After this experience, far from being put off by the learning curve, I’m completely sold now. I feel like my data is wholly under my control, not subject to arbitrary things like version numbers or branch labels, etc. Everything is just a commit to Git, and the objects linked to those commits. Chain commits together from parent to child and you have a history; if a commit has multiple children, that’s a branch, while multiple parents represent a merge. How much simpler can you get?
I’ve found that sometimes, the simpler a concept is the more complex its explanation becomes – because true simplicity allows for the greatest range of expressive forms.