Git has sometimes been described as a versioning file-system which happens to support the underlying notions of version control. And while most people do simply use Git as a version control system, it remains true that it can be used for other tasks as well.

For example, if you ever need to store mutating data in a series of snapshots, Git may be just what you need. It’s fast, efficient, and offers a large array of command-line tools for examining and mutating the resulting data store.

To support this kind of usage — for the upcoming purpose of maintaining issue tracking data in a Git repository — I’ve created a Python class that wraps Git as a basic shelve object.

Here is how you normally use the standard shelve module:

import shelve

data = shelve.open('data.db')

# data.db may or may not have existed on disk before now.  If not,
# we're manipulating an empty dictionary.  If so, we can examine or
# modify the previous run's state data.  In both cases, the database
# is manipulated like a standard Python dictionary.

key = 'greeting'   # shelve keys must be strings
data[key] = "Hello, world!"
data.sync()        # Write out changes to the dictionary

del data[key]
data.close()       # Close and clean up, sync'ing only if necessary

This provides the simplest kind of database, without any query language or notion of whether previous state did or did not exist. Both of those are services you’d have to layer on top of the shelve object if you wanted them.
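As a rough illustration of layering one of those services on top, here is a minimal sketch that reports whether a shelf already held state from an earlier run. The sentinel key and the open_shelf helper are my own hypothetical names, not part of the shelve module:

```python
import os
import shelve
import tempfile

# Hypothetical marker key; shelve itself knows nothing about it.
SENTINEL = "__initialized__"

def open_shelf(path):
    """Open a shelf and report whether an earlier run had initialized it."""
    db = shelve.open(path)
    existed = SENTINEL in db
    if not existed:
        db[SENTINEL] = True   # mark the shelf so later runs can tell
        db.sync()
    return db, existed

def demo():
    """Open the same shelf twice inside a throwaway directory."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.db")

        db, existed_first = open_shelf(path)
        db["greeting"] = "Hello, world!"
        db.close()

        db, existed_second = open_shelf(path)
        value = db["greeting"]
        db.close()
        return existed_first, existed_second, value
```

Calling demo() opens a brand-new shelf, stores a value, then reopens it, so the first flag should be false and the second true.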

Now consider gitshelve. Whereas the Python shelve module stores your data by pickling all of the dictionary values, I pass whatever data you place in the dictionary straight on to Git’s standard input. In the default mode, this means you work strictly with string data:

import gitshelve

data = gitshelve.open(repository = '/tmp/data.git')

key = 'greeting'
data[key] = "Hello, world!"
data.sync()                  # Repository is created if it doesn't exist

del data[key]
data.close()

The interface is identical, but with the Git version you can now examine the resulting repository yourself, using regular Git commands:

$ GIT_DIR=/tmp/data.git git log
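To make "passing data to Git's standard input" a little more concrete: Git stores each value as a blob whose ID is the SHA-1 of a short header followed by the raw bytes. The sketch below reproduces that ID in plain Python. It illustrates Git's blob format only; I am assuming, not claiming, that gitshelve writes its values through a command along the lines of git hash-object. The function name is hypothetical.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Compute the object ID Git assigns to a blob with these contents."""
    # Git hashes "blob <size>\0" followed by the raw content bytes.
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# The empty blob has a well-known ID:
# git_blob_id(b"") -> "e69de29bb2d1d6434b8b29ae775ad8c2e48c5391"
```

Because the ID depends only on the bytes, storing the same string under two keys costs one blob, which is part of why this kind of store stays compact.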

By default, commits carry no comment text, and the sync method doesn’t accept parameters. If you wish to add transaction notes, use the commit method instead:

data.commit("This is a comment")

You can store data this way either in a separate repository, or in named branches within any repository. If the repository argument is not given, the named branch within the current Git repository is used. An exception will be raised, however, if you do this and there is no Git repository related to the current directory.

# I'm expecting to use the 'data' branch of the current repository, but
# I ran the script in a directory unknown to Git!
data = gitshelve.open(branch = 'data')

# It appears to work, because no Git commands are run until the last
# possible moment
data['foo/bar/hello.txt'] = "Hello!"

# This raises an exception, because there is no current repository.  To fix
# it, either run "git init", or use a specific 'repository' argument above.
data.commit("I just said hello")
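Since the failure only shows up at commit time, you may prefer to check up front. The following is a minimal sketch of such a check, using only the standard library and the git rev-parse command; the helper name is hypothetical and nothing like it is claimed to exist in gitshelve:

```python
import subprocess

def inside_git_repository(path="."):
    """Return True if `path` belongs to a Git repository, else False."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--git-dir"],
            cwd=path,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # git is not installed, or `path` does not exist
        return False
    # git exits with status 0 only when it finds a repository
    return result.returncode == 0
```

A script could call this before opening the shelve with a branch argument and print a friendlier message than the exception raised at commit time.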

The really nice thing about using Git this way is that you get all of its best features for free.

Adding non-text values

If you have a need to store non-textual values, you’ll have to let gitshelve know how to deal with them. I don’t do any such handling by default, because of the big chance of doing the wrong thing, and having you not find out about it until it’s much too late. Just pickling data like shelve does isn’t very smart, for example, because it will wreak havoc on Git’s merge algorithms should you ever need to incorporate new data from another source.

So, let’s see how to add a custom data translator. First, you need to subclass a new type of gitbook, which is the wrapper used to interface with the blobs in the Git repository. There are only two methods you need to override:

class my_gitbook(gitshelve.gitbook):
    def serialize_data(self, data):
        return object_to_string(data)

    def deserialize_data(self, data):
        return object_from_string(data)

Now you must define object_to_string and object_from_string, which should examine the types of the objects passed and turn them into merge-friendly strings as appropriate. Certain forms of XML work well for this job, as do ini-style configuration files in some cases. It’s up to you and what works best for your usage.
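As one possible shape for these two functions, here is a minimal sketch that uses JSON rather than XML or ini files. That choice is mine, purely for brevity. It sorts keys and puts each entry on its own line, so that two people editing different fields tend to touch different lines:

```python
import json

def object_to_string(obj):
    """Serialize to text that diffs and merges line by line."""
    # sort_keys gives a stable order; indent puts one entry per line.
    return json.dumps(obj, indent=2, sort_keys=True, ensure_ascii=False) + "\n"

def object_from_string(text):
    """Inverse of object_to_string."""
    return json.loads(text)
```

This only handles what JSON can represent (dictionaries, lists, strings, numbers, booleans and None); richer objects would need their own mapping to and from those types.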

Once you have this new class type, you must pass it to the gitshelve.open function:

data = gitshelve.open(repository = '/tmp/foo', book_type = my_gitbook)

Making things even faster

Every time you open a gitshelve, it must walk through the associated branch and determine its contents in order to build the key/value relationships in the dictionary. If you find that this ever gets slow, what you can do is just pickle the gitshelve! The only caveat is that you must take care to discard the pickled copy if the HEAD you created it from is different from the current HEAD. Here’s an example:

import gitshelve
import cPickle
import os

data = None
if os.path.isfile('data.cache'):
    fd = open('data.cache', 'rb')
    data = cPickle.load(fd)
    fd.close()

    # I'm using an arbitrary file name here, __HEAD__
    if data['__HEAD__'] != data.current_head():
        data = None       # Out of date, we can't use it

if data is None:
    data = gitshelve.open(branch = 'data')
    data['__HEAD__'] = data.current_head()

    # Write the cache so the next run can skip walking the branch
    fd = open('data.cache', 'wb')
    cPickle.dump(data, fd)
    fd.close()

# ... for data sets with enormous quantities of tiny files, this
#     could really speed things up ...

Where can you get it?

The gitshelve module is being maintained as part of the git-issues project, which is yet another attempt to bring distributed bug tracking to Git. Actually, I intend to support multiple repositories as data backends, but right now Git is my initial focus. You can clone the project and test it out like so:

git clone git://github.com/jwiegley/git-issues.git
cd git-issues
python t_gitshelve.py

If you see “OK” at the end of the unit tests, you’re good to go! There isn’t much documentation on gitshelve.py itself right now, beyond this blog entry, but then again the shelve-like interface is simple enough that you really shouldn’t need much more.

Or if you prefer, you can just browse the project at the GitHub project page.

 

17 Responses to Using Git as a versioned data store in Python

  1. moncho says:

    It would be interesting to make it a backend for Shove.

  2. Phil says:

    > For example, if you ever need to store mutating data in a series of snapshots, Git may be just what you need.

    I’ve thought about this… do you think Git could be used to support real-time collaborative editing (like Gobby) from within Emacs or other editors?

  3. John Wiegley says:

    It could be used for collaborative editing by automatically taking a snapshot every time someone saves. The only time it would seem to run slowly is if you exceed the “loose object” threshold, at which point Git will auto-garbage collect your repository. You can turn this off by running “git config --global gc.auto 0” in the repo before you start things.

    With that said, in Emacs you would add a function to `after-save-hook’ that calls out to Git and runs “git-commit” and auto-pushes. You would have to handle merge conflicts on pull if several of you push changes at the same time.

  4. Phil says:

    Interesting. I’ve envisioned it as something that would run on an idle hook, but on save might be easier to implement.

    The advantage of the more immediate version is that you could tell git to ignore merge conflicts and always take the newer version since it would be more visible to the user.

  5. Charles Duffy says:

    Ya know, this is much easier to do in bzr — its complete functionality is exposed via a built-in Python API.

    I once wrote a decorator (as part of a certificate management system) that effectively did a “bzr commit” with the method name and its arguments as the commit message if the wrapped operation succeeded, and did a rollback on the filesystem if an exception was passing through. SCM integration with Python is indeed a shiny thing at times.

  6. hgshelve says:

    If you are using Python why not use something better?

    http://piranha.org.ua/blog/2008/05/19/hgshelve/

  7. Ittay Dror says:

    Why store the issues in a separate branch? Storing them alongside the normal source files has the benefit that the issues’ status is consistent with the source. So if I fixed an issue in ‘newver’ branch (some changes to source files and change the issue’s status), but not yet merged it to HEAD, then in HEAD the issue is still open. After the merge the issue will be closed.

  8. John Wiegley says:

    hgshelve: I’m glad you’ve ported this idea to hgshelve. You say on your blog that gitshelve can only store strings, implying it could not be used to store objects. However, in the blog entry I showed how to extend gitshelve to support arbitrary object string serializers. This is how I handle issue data in git-issues, except that I use XML instead of JSON.

  9. John Wiegley says:

    Ittay: I really like where you’re headed. I’ll give it some more thought and see what I can come up with. In the meantime, if you flesh out your design and want to send me something by e-mail, I would be quite happy to consider it!

  10. Ittay Dror says:

    Very high level design (we can formalize more via email if you’d like):

    Each issue has a list of properties (e.g., assignee, creator, status, comments). They can be simple or lists. Each can also be either ‘global’ or ‘branched’. ‘global’ properties are stored in a dedicated branch (where the issue’s unique id is also stored). This branch is not used for normal development. ‘branched’ properties are stored in the same branches used for development, probably in a subdir under the top dir (something like .issues).

    This means that merging of branches only affects some properties and that some properties can be the same regardless of what branch you’re in (so creator, comments etc. are global, but status is branched)

    When viewing an issue, it is composed from the branched and global properties.

    I think that with careful granularity of branched properties (put each per file, not in one file, or make one file, but sorted), merging of two branches should not create conflicts, unless when called for (e.g., someone changed the priority of an issue in two branches)

    Hope I am clear enough,
    Ittay

  11. Anonymous says:

    Ittay, I’m also interested in the approach you’re presenting. Please let me know if you get something going.

  12. Uldis Bojars says:

    A cool idea :)

    Would it be a good idea to use Python’s pickle module for serialising and restoring objects in object_to_string and object_from_string? Since it is native to Python it should work quite well unless it does not satisfy the criteria of being merge-friendly.

  13. John Wiegley says:

    Sure, you can do that, you just couldn’t use git diff or git log -p anymore, since the contained data would be binary. I left it open the way that I did so that others could use XML, JSON, Pickle, etc.

  14. Uldis Bojars says:

    That is not entirely true: “By default, the pickle data format uses a printable ASCII representation.” – See http://docs.python.org/lib/node315.html

    Not trying to say that you should have used pickle. It is good that users have the option to choose their own serialisation.

    Just wondering why pickle as a possible serialisation was not mentioned and if there are some reasons to avoid using it in gitshelve.

  15. John Wiegley says:

    Oh, in that case there was no reason at all that I avoided pickle; it just so happened that I needed XML and so made the design more abstract to accommodate that from the beginning.

  16. Hi,

    I wrote something similar for Ruby. Actually it was part of my blog engine and some day I realized, that your library is basically the same in Python.

    http://matthias-georgi.de/2008/12/git-store-using-git-as-versioned-data-store-in-ruby

  17. John Wiegley says:

    Hi Matthias, but your link comes up 404?
