Your Git Repository in a Database: Pluggable Backends in Libgit2


Git has a well-known and well-defined structure for how it stores data. In the .git directory of every Git repository you can expect to find certain things: objects for the data, refs for the branch and tag pointers, and so on. Additionally, everything here is stored in flat files, though some of their formats are a bit more involved than others.

However, it turns out this is not the only way you can store data in a Git repository. You can actually use a relational or a NoSQL database, or an in-memory data structure, or something like Amazon S3. This is made possible by the pluggable backends provided by the libgit2 library.

What This Means

Using alternative Git storage solutions is probably most interesting for services or products that provide Git hosting – like we do at Deveo. Use cases for hosting providers include:

  • Caching of Git data for lightning-fast access, by using an in-memory backend or a Memcached or Redis backend with fallbacks to traditional file storage.
  • Building a fault-tolerant storage solution, or even a multisite replication solution, by storing data in a modern database system designed for this purpose. Voldemort, Riak, and Cassandra would all fit the bill here.

Outside of hosting, I can imagine there are plenty of use cases for pluggable storage when incorporating Git access into tools and libraries.

The Two Datastores of a Git Repository

A Git repository is actually quite a simple thing, though you would never guess that when looking at the surface area of the Git user interface. There are just two main structures at the heart of a Git repo, on which everything else is based: the object database and the ref database.

The Object Database

The object database is where all the data is stored. The contents of all files, the structures of directories, the commits, everything, goes in the object database. However, what’s remarkable about the object database is that it’s essentially nothing but a key-value store.

Git’s object database is content-addressed: the key under which a value is stored is the (SHA1) hash of that value. That has some interesting further implications. Changing a value would also change its hash, and therefore its key, so the values in the object database are essentially immutable and you don’t need an update operation.

What’s left is a basic data structure with essentially four operations:

get_keys()
read(key_or_prefix)
add(key, value)
delete(key)

It’s easy to see you don’t necessarily need flat file storage to implement something like this! Git’s default, file-based object database is just one implementation of the abstract concept.
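
To make the idea concrete, here is a minimal sketch of such a store kept entirely in memory. It is not how Git or libgit2 implement anything: the names toy_odb, toy_odb_add, toy_odb_read and toy_odb_delete are made up for illustration, the fixed-size table is an arbitrary simplification, and a 64-bit FNV-1a hash stands in for SHA1 so the example has no dependencies.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ODB_CAPACITY 64

/* One stored object. The key is derived from the value, never chosen by the caller. */
typedef struct {
    uint64_t key;    /* stand-in for the 20-byte SHA1 digest */
    char    *value;
    size_t   len;
    int      used;
} odb_entry;

typedef struct {
    odb_entry entries[ODB_CAPACITY];
} toy_odb;

/* FNV-1a, used here only as a placeholder for SHA1. */
static uint64_t toy_hash(const void *data, size_t len)
{
    const unsigned char *p = data;
    uint64_t h = 0xcbf29ce484222325ULL;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL;
    }
    return h;
}

static odb_entry *toy_odb_find(toy_odb *db, uint64_t key)
{
    for (size_t i = 0; i < ODB_CAPACITY; i++)
        if (db->entries[i].used && db->entries[i].key == key)
            return &db->entries[i];
    return NULL;
}

/* add: stores a copy of the value and reports its key.
 * Adding identical content twice is a no-op, so no update operation exists. */
int toy_odb_add(toy_odb *db, const void *value, size_t len, uint64_t *key_out)
{
    uint64_t key = toy_hash(value, len);
    *key_out = key;
    if (toy_odb_find(db, key))
        return 0;
    for (size_t i = 0; i < ODB_CAPACITY; i++) {
        odb_entry *e = &db->entries[i];
        if (!e->used) {
            e->value = malloc(len ? len : 1);
            if (!e->value)
                return -1;
            memcpy(e->value, value, len);
            e->len = len;
            e->key = key;
            e->used = 1;
            return 0;
        }
    }
    return -1; /* store is full */
}

/* read: returns the stored bytes, or NULL if the key is unknown. */
const char *toy_odb_read(toy_odb *db, uint64_t key, size_t *len_out)
{
    odb_entry *e = toy_odb_find(db, key);
    if (!e)
        return NULL;
    *len_out = e->len;
    return e->value;
}

/* delete: frees the stored copy; returns -1 if the key is unknown. */
int toy_odb_delete(toy_odb *db, uint64_t key)
{
    odb_entry *e = toy_odb_find(db, key);
    if (!e)
        return -1;
    free(e->value);
    memset(e, 0, sizeof *e);
    return 0;
}
```

Because the caller never picks the key, writing the same content twice lands on the same entry, which is the property that makes the real object database safe to cache and replicate.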

The Ref Database

The ref database stores a Git repository’s references - the branches, tags, and HEAD.

Just like the object database, the ref database is also essentially a key-value store. The keys are the names of the references, and the values are SHA1 hashes, which in turn typically correspond to commit objects in the object database. (A symbolic reference such as HEAD usually holds the name of another reference instead.)

The values of a ref database are mutable, which is a key difference when compared to the object database. The commit that master points to may change over time. That means there’s a slight difference in the operations that a ref database must provide:

get_keys()
read(key)
write(key, value)
rename(old_key, new_key)
delete(key)
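
The difference from the object database is easiest to see in code. The following is a rough in-memory sketch, not libgit2 code: toy_refdb and its functions are hypothetical names, symbolic references are left out, and targets are stored as 40-character hex strings purely for simplicity.

```c
#include <string.h>

#define REFDB_CAPACITY 32
#define REF_NAME_MAX   128
#define SHA1_HEX_LEN   40

typedef struct {
    char name[REF_NAME_MAX];
    char target[SHA1_HEX_LEN + 1]; /* hex SHA1 of the object pointed to */
    int  used;
} toy_ref;

typedef struct {
    toy_ref refs[REFDB_CAPACITY];
} toy_refdb;

static toy_ref *toy_refdb_find(toy_refdb *db, const char *name)
{
    for (size_t i = 0; i < REFDB_CAPACITY; i++)
        if (db->refs[i].used && strcmp(db->refs[i].name, name) == 0)
            return &db->refs[i];
    return NULL;
}

/* write: creates the ref, or moves an existing ref to a new target. */
int toy_refdb_write(toy_refdb *db, const char *name, const char *target)
{
    if (strlen(name) >= REF_NAME_MAX || strlen(target) != SHA1_HEX_LEN)
        return -1;
    toy_ref *r = toy_refdb_find(db, name);
    for (size_t i = 0; !r && i < REFDB_CAPACITY; i++)
        if (!db->refs[i].used)
            r = &db->refs[i];
    if (!r)
        return -1; /* store is full */
    strcpy(r->name, name);
    strcpy(r->target, target);
    r->used = 1;
    return 0;
}

/* read: returns the current target, or NULL if the ref does not exist. */
const char *toy_refdb_read(toy_refdb *db, const char *name)
{
    toy_ref *r = toy_refdb_find(db, name);
    return r ? r->target : NULL;
}

/* rename: fails if the old name is missing or the new name is taken. */
int toy_refdb_rename(toy_refdb *db, const char *old_name, const char *new_name)
{
    toy_ref *r = toy_refdb_find(db, old_name);
    if (!r || strlen(new_name) >= REF_NAME_MAX || toy_refdb_find(db, new_name))
        return -1;
    strcpy(r->name, new_name);
    return 0;
}

/* delete: removes the ref; returns -1 if it does not exist. */
int toy_refdb_delete(toy_refdb *db, const char *name)
{
    toy_ref *r = toy_refdb_find(db, name);
    if (!r)
        return -1;
    memset(r, 0, sizeof *r);
    return 0;
}
```

Here toy_refdb_write overwrites the target of an existing name in place, which is exactly the operation the object database never needs.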

Libgit2

Libgit2 is an implementation of Git written in pure C. It’s designed to be an alternative to the Git reference implementation, providing easy linkage to other libraries and applications. It is actually the basis of the Git language bindings in many programming languages.

One of the less advertised features of libgit2 is that it has pluggable backends, which means that instead of storing the object database and the ref database in the way Git usually does it – in flat files – you can provide your own backend implementation and do whatever you want. Let’s see how that works.

The Libgit2 Object Database Backend

The libgit2 object database code accesses data through functions in a C struct git_odb_backend, defined in git2/sys/odb_backend.h. It basically has the functions described above, with some additional functions for convenience (reading object headers only, streaming access, writing a packfile).

There are two built-in implementations for this struct that ship with libgit2. They implement the two object storage formats that Git traditionally supports:

  • odb_loose implements the loose file format backend. It accesses each object in a separate file within the objects directory, with the name of each file corresponding to the SHA1 hash of its contents.
  • odb_pack implements the packfile backend. It accesses the objects in Git packfiles, a file format used both for space-efficient storage of objects and for transferring objects when pushing or pulling.

As you create a Git object database, you can provide any instance of the git_odb_backend struct, including a custom-built one. This lets you plug in your own implementations, as we’ll see later in this article.
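
The general shape of such a backend, a struct of function pointers that a concrete implementation embeds as its first member, can be illustrated without libgit2 itself. The sketch below is only a simplified analogue: toy_backend, mem_backend and mem_backend_new are hypothetical names, and the three callbacks are far smaller than the real ones declared in git2/sys/odb_backend.h.

```c
#include <stdlib.h>
#include <string.h>

#define MEM_CAPACITY 16

/* The generic interface. Callers only ever see this struct. */
typedef struct toy_backend toy_backend;
struct toy_backend {
    int  (*read)(toy_backend *self, const char *key, const char **value_out);
    int  (*write)(toy_backend *self, const char *key, const char *value);
    void (*free)(toy_backend *self);
};

/* A concrete backend embeds the interface as its FIRST member, so a pointer
 * to the interface can be cast back to the concrete type inside callbacks. */
typedef struct {
    toy_backend parent;
    char  *keys[MEM_CAPACITY];
    char  *values[MEM_CAPACITY];
    size_t count;
} mem_backend;

static char *dup_str(const char *s)
{
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy)
        memcpy(copy, s, n);
    return copy;
}

static int mem_read(toy_backend *self, const char *key, const char **value_out)
{
    mem_backend *b = (mem_backend *)self;
    for (size_t i = 0; i < b->count; i++) {
        if (strcmp(b->keys[i], key) == 0) {
            *value_out = b->values[i];
            return 0;
        }
    }
    return -1; /* not found */
}

static int mem_write(toy_backend *self, const char *key, const char *value)
{
    mem_backend *b = (mem_backend *)self;
    const char *existing;
    if (mem_read(self, key, &existing) == 0)
        return 0; /* objects are immutable, so an existing key is left alone */
    if (b->count == MEM_CAPACITY)
        return -1;
    char *k = dup_str(key);
    char *v = dup_str(value);
    if (!k || !v) {
        free(k);
        free(v);
        return -1;
    }
    b->keys[b->count] = k;
    b->values[b->count] = v;
    b->count++;
    return 0;
}

static void mem_free(toy_backend *self)
{
    mem_backend *b = (mem_backend *)self;
    for (size_t i = 0; i < b->count; i++) {
        free(b->keys[i]);
        free(b->values[i]);
    }
    free(b);
}

/* Constructor: allocates the concrete backend and hands out the interface. */
int mem_backend_new(toy_backend **out)
{
    mem_backend *b = calloc(1, sizeof *b);
    if (!b)
        return -1;
    b->parent.read  = mem_read;
    b->parent.write = mem_write;
    b->parent.free  = mem_free;
    *out = &b->parent;
    return 0;
}
```

The code that uses the backend calls only through the function pointers, so swapping the in-memory arrays for a database client would not change any caller.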

The Libgit2 Ref Database Backend

You can also provide a custom backend for the ref database, which makes it possible to have a Git repository that uses no flat files at all. The technique libgit2 uses for this is essentially the same as with the object database. There is a struct git_refdb_backend, defined in git2/sys/refdb_backend.h, with functions for the different access operations.

There is just one implementation of the ref database backend that ships with libgit2: The file system backend refdb_fs, which accesses the refs in the refs directory of a repository.

Existing Alternative Backends

In addition to the built-in backends already mentioned, the libgit2-backends repository maintained by the libgit2 team provides a few custom object database backends.

These are not only useful by themselves, but they also provide a nice starting point for writing a custom backend of your own.

Setting It Up

Let’s look at how to actually use these alternative backends.

With the built-in backends, you would normally call git_repository_open with the file system path of a directory holding the usual .git contents, such as the loose object database, the packfiles, and the refs.

With custom backends, we instead call git_repository_wrap_odb and pass in our own object database, to which we attach a custom backend.

Let’s say we have custom backends written for the Voldemort database, with the following constructor functions:

int git_odb_backend_voldemort(git_odb_backend **backend_out, git_repository *repo, const char *repo_id, const char *bootstrap_url, const char *store_name);
int git_refdb_backend_voldemort(git_refdb_backend **backend_out, git_repository *repo, git_refdb *refdb, const char *bootstrap_url, const char *store_name);

Here’s how we can set up a Git repository backed by those backends:

git_repository    *repo;
git_odb           *odb;
git_odb_backend   *voldemort_odb_backend;
git_refdb         *refdb;
git_refdb_backend *voldemort_refdb_backend;
int               error = 0;

/* Create an empty object database and wrap a repository around it. */
error = git_odb_new(&odb);
if (!error)
  error = git_repository_wrap_odb(&repo, odb);
/* Construct the Voldemort object backend and add it with priority 1. */
if (!error)
  error = git_odb_backend_voldemort(&voldemort_odb_backend, repo, "my_repo", "tcp://localhost:6666", "git_odb");
if (!error)
  error = git_odb_add_backend(odb, voldemort_odb_backend, 1);
/* Do the same for the ref database, then attach it to the repository. */
if (!error)
  error = git_refdb_new(&refdb, repo);
if (!error)
  error = git_refdb_backend_voldemort(&voldemort_refdb_backend, repo, refdb, "tcp://localhost:6666", "git_refdb");
if (!error)
  error = git_refdb_set_backend(refdb, voldemort_refdb_backend);
if (!error)
  git_repository_set_refdb(repo, refdb);

  • With git_odb_new we construct an object database without any backends.
  • With git_repository_wrap_odb we construct a Git repository backed by this object database.
  • With git_odb_backend_voldemort we construct the Voldemort object database backend.
  • With git_odb_add_backend we plug the Voldemort object database backend into the object database. Object databases support multiple backends, and the order in which lookups are done is based on a priority number. We give the Voldemort backend priority 1.
  • With git_refdb_new we construct a ref database without any backends.
  • With git_refdb_backend_voldemort we construct the Voldemort ref database backend, just like we did with the object database.
  • With git_refdb_set_backend we plug the Voldemort ref database backend into the ref database.
  • With git_repository_set_refdb we finally plug the ref database into our repository, and we have a functioning repository we can read from and write to.

In place of the Voldemort backends, you could of course use one of your own implementations, or one of the existing custom implementations from libgit2-backends. You could even provide multiple custom object database backends by adding them with different priorities. This can come in very handy when implementing caching, for example.
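
To show why priorities are handy for caching, here is a small self-contained sketch of priority-ordered lookup. It does not use libgit2: toy_store, toy_odb and the functions below are hypothetical, the sketch assumes that backends with a higher priority number are consulted first, and the step that copies a found object into the faster backends is a design choice of this example rather than something the library is claimed to do.

```c
#include <stddef.h>
#include <string.h>

#define MAX_BACKENDS 4
#define MAX_OBJECTS  8

/* A trivial backend that stores borrowed pointers; keys and values
 * must outlive the store (string literals in the usage below). */
typedef struct {
    int         priority;
    const char *keys[MAX_OBJECTS];
    const char *values[MAX_OBJECTS];
    size_t      count;
    int         reads; /* how many lookups reached this backend */
} toy_store;

typedef struct {
    toy_store *backends[MAX_BACKENDS]; /* kept sorted, highest priority first */
    size_t     count;
} toy_odb;

const char *store_get(toy_store *s, const char *key)
{
    s->reads++;
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->keys[i], key) == 0)
            return s->values[i];
    return NULL;
}

int store_put(toy_store *s, const char *key, const char *value)
{
    if (s->count == MAX_OBJECTS)
        return -1;
    s->keys[s->count] = key;
    s->values[s->count] = value;
    s->count++;
    return 0;
}

/* Insert the backend so the array stays sorted by descending priority. */
int toy_odb_add_backend(toy_odb *odb, toy_store *store, int priority)
{
    if (odb->count == MAX_BACKENDS)
        return -1;
    store->priority = priority;
    size_t i = odb->count;
    while (i > 0 && odb->backends[i - 1]->priority < priority) {
        odb->backends[i] = odb->backends[i - 1];
        i--;
    }
    odb->backends[i] = store;
    odb->count++;
    return 0;
}

/* Ask each backend in priority order and stop at the first hit. */
const char *toy_odb_read(toy_odb *odb, const char *key)
{
    for (size_t i = 0; i < odb->count; i++) {
        const char *value = store_get(odb->backends[i], key);
        if (value) {
            /* Warm every higher-priority backend that missed. */
            for (size_t j = 0; j < i; j++)
                store_put(odb->backends[j], key, value);
            return value;
        }
    }
    return NULL;
}
```

With a fast store added at priority 2 and a slow one at priority 1, the first read of an object falls through to the slow store, and later reads of the same object are answered by the fast one alone.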

If you’re not working in raw C, you can take a look at all the language bindings based on libgit2 to see how you might be able to achieve this in your programming language. For example, I’ve customized the Rugged Ruby bindings a bit so that I can plug in custom backends from Ruby code.
