Git has sometimes been described as a versioning file-system which happens to support the underlying notions of version control. And while most people do simply use Git as a version control system, it remains true that it can be used for other tasks as well.
For example, if you ever need to store mutating data in a series of snapshots, Git may be just what you need. It’s fast, efficient, and offers a large array of command-line tools for examining and mutating the resulting data store.
To support this kind of usage – for the upcoming purpose of maintaining issue
tracking data in a Git repository – I’ve created a Python class that wraps
Git as a basic shelve
object.
Here is how you normally use the standard shelve
module:
import shelve
= shelve.open('data.db')
data
# data.db may or may not have existed on disk before now. If not,
# We're Manipulating an Empty Dictionary. If so, we can examine or
# modify the previous run's state data. In both cases, the database
# is manipulated like a standard Python dictionary.
= "Hello, world!"
data[key] # Write out changes to the dictionary
data.sync()
del data[key]
# Close and clean up, sync'ing only if necessary data.close()
This provides the simplest kind of database, without any query language or
notion of whether previous state did or did not exist. Both of those are
services you’d have to layer on top of the shelve
object if you wanted them.
Now consider gitshelve
. Whereas the Python shelve
module stores your data by
pickling all of the dictionary values, I pass whatever data you place in the
dictionary straight on to Git’s standard input. In the default mode, this
means you work strictly with string data:
import gitshelve
= gitshelve.open(repository = '/tmp/data.git')
data
= "Hello, world!"
data[key] # Repository is created if it doesn't exist
Data.Sync()
del data[key]
data.close()
The interface is identical, but with the Git version you can now examine the resulting repository’s yourself, using regular Git commands:
$ GIT_DIR=/tmp/data.git git log
By default, the commits have no associated comment text, but the sync
method
doesn’t accept parameters. If you wish to add transaction notes, use the
commit
method instead:
"This is a comment") data.commit(
You can store data this way either in a separate repository, or in named
branches within any repository. If the repository
argument is not given, the
named branch within the current Git repository is used. An exception will be
raised, however, if you do this and there is no Git repository related to the
current directory.
# I'm expecting to use the 'data' branch of the current repository, but
# I ran the script in a directory unknown to Git!
= gitshelve.open(branch = 'data')
data
# It appears to work, because no Git commands are run until the last
# possible moment
'foo/bar/hello.txt'] = "Hello!"
data[
# This raises an exception, because there is no current repository. To fix
# it, either run "git init", or use a specific 'repository' argument above.
"I just said hello") data.commit(
The really nice thing about using Git this way is that you get all of its best features for free.
Added non-text values
If you have a need to store non-textual values, you’ll have to let gitshelve
know how to deal with them. I don’t do any such handling by default, because
of the big chance of doing the wrong thing, and having you not find out about
it until it’s much too late. Just pickling data like shelve
does isn’t very
smart, for example, because it will wreak havoc on Git’s merge algorithms
should you ever need to incorporate new data from another source.
So, let’s see how to add a custom data translator. First, you need to subclass
a new type of gitbook
, which is the wrapper used to interface with the blobs
in the Git repository. There are only two methods you need to override:
class my_gitbook(gitshelve.gitbook):
def serialize_data(self, data):
return object_to_string(data)
def deserialize_data(self, data):
return object_from_string(data)
Now you must define object_to_string
and object_from_string
, which should
examine the types of the objects passed and turn them into merge-friendly
string as appropriate. Certain forms of XML work well for this job, as do
ini-style configuration files in some cases. It’s up to you and what works
best for your usage.
Once you have this new class type, you must pass it to the gitshelve.open
function:
= gitshelve.open(repository = '/tmp/foo', book_type = my_gitbook) data
Making things even faster
Every time you open a gitshelve
, it must walk through the assoicated branch
and determine its contents in order to build the key/value relationships in
the dictionary. If you find that this ever gets slow, what you can do is just
pickle the gitshelve! The only caveat is that you must take care to delete it
if the HEAD you created it from is different from the current HEAD. Here’s an
example:
import gitshelve
import cPickle
import os
= None
data if os.path.isfile('data.cache'):
= open('data.cache', 'rb')
fd = cPickle.load(fd)
data
# I'm using an arbitrary file name here, __HEAD__
if data['__HEAD__'] != data.current_head():
= None # Out of date, we can't use it
data
if not data:
= gitshelve.open(branch = 'data')
data '__HEAD__'] = data.current_head()
data[
# ... for data sets with enormous quantities of tiny files, this
# could really speed things up ...
Where can you get it?
The gitshelve
module is being maintained as part of the git-issue
project,
which is yet another attempt to bring distributed bug tracking to Git.
Actually, I tend to support multiple repositories as data backends, but right
now Git is my initial focus. You can clone the project and test it out as
such:
git clone git://github.com/jwiegley/git-issues.git
cd git-issues
python t_gitshelve.py
If see “OK” at the end of the unit tests, you’re good to go! There isn’t much
documentation on gitshelve.py itself right now, beyond this blog entry, but
then again the shelve
-like interface is simple enough that you really
shouldn’t need much more.
Or if you prefer, you can just browse the project at the GitHub project page.