Introducing a per-commit key/value store for Git

Jannis Pohlmann jannis.pohlmann at codethink.co.uk
Wed Jan 2 15:56:35 GMT 2013


Hey,

first of all, thanks for the feedback!

On 13-01-02 15:27:58, Richard Maw wrote:
> On Wed, Jan 02, 2013 at 02:37:01AM +0100, Jannis Pohlmann wrote:
> > The source code is available at
> > 
> >   https://github.com/Jannis/gitpercs
> > 
> > Please have a skim through the code and comments (esp. the main doc
> > string for the Store class in gitpercs/store.py). I'd appreciate
> > feedback on the current design. I reckon especially Daniel might
> > come up with remarks wrt the usage of Git internals here. ;)
> > 
> >   - Jannis
> 
> Validating key format with a regular expression is overkill unless the
> format changes. You just need a set operation like the following.
> 
> valid_chars = string.digits + string.ascii_letters + '-_/:'
> any((c in valid_chars) for c in key)

I think you mean all()? But yes, I agree with your point.
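
For reference, a minimal sketch of the check without a regular expression.
The alphabet is the one from the quoted suggestion; the function name
is_valid_key and the decision to reject empty keys are assumptions made
for this example, not what gitpercs currently does.

```python
import string

# Alphabet taken from the suggestion quoted above.
VALID_CHARS = frozenset(string.digits + string.ascii_letters + '-_/:')


def is_valid_key(key):
    """Return True if key is non-empty and uses only allowed characters.

    Rejecting the empty key is an assumption of this sketch.
    """
    # Subset test: equivalent to all(c in VALID_CHARS for c in key).
    return bool(key) and set(key) <= VALID_CHARS
```

With any() instead, a key would pass as soon as a single character was
valid, which is why all() (or the subset test) is needed here.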

> Interestingly, the '.' character is not valid, though git itself uses it
> as a configuration path separator.

There's no good reason for this. All we need is a sensible format that
works for our use cases. Caching rendered web content is probably the
trickiest one here, as people will likely want to use URL paths as
keys. But if that involves more than just basic characters, they can
always convert between URL paths and keys by encoding the paths with
base64, using an alphabet that is allowed.
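
A rough sketch of such a conversion, assuming the URL-safe base64
alphabet (letters, digits, '-' and '_') with the '=' padding stripped,
since '=' is not in the key alphabet discussed above. The function names
url_path_to_key and key_to_url_path are made up for illustration and are
not part of gitpercs.

```python
import base64


def url_path_to_key(path):
    """Encode an arbitrary URL path as a key made of [A-Za-z0-9_-] only."""
    encoded = base64.urlsafe_b64encode(path.encode('utf-8')).decode('ascii')
    # Drop the '=' padding; it can be reconstructed from the length.
    return encoded.rstrip('=')


def key_to_url_path(key):
    """Decode a key produced by url_path_to_key back into the URL path."""
    # Restore the padding so the length is a multiple of four again.
    padding = '=' * (-len(key) % 4)
    return base64.urlsafe_b64decode(key + padding).decode('utf-8')
```

The downside is that the keys are no longer human-readable, so this is
only worth doing for paths that don't fit the plain key format.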

> I don't think that there's a use case for needing snapshots of the state
> of every annotation together, so I would have multiple percs refs named
> after the sha1 of the commit they annotate, so the commits only have the
> property names and values.

That would be an option. In the initial implementation I decided against
it because it would potentially generate a lot of refs. Overall, I think
what I'd go for is refs like

  refs/percs/<sha1>

rather than

  refs/heads/percs/<sha1>

because the latter might conflict with real branches.
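
As a sketch, the ref name for a commit could be built like this. The
helper name percs_ref_for_commit and the strict check for a full
40-character lowercase hex sha1 are assumptions for the example.

```python
PERCS_REF_PREFIX = 'refs/percs/'
HEX_DIGITS = frozenset('0123456789abcdef')


def percs_ref_for_commit(sha1):
    """Return the name of the percs ref annotating the given commit.

    Only full, lowercase 40-character sha1s are accepted in this sketch.
    """
    if len(sha1) != 40 or not set(sha1) <= HEX_DIGITS:
        raise ValueError('not a full sha1: %r' % (sha1,))
    # Outside refs/heads/, so it cannot clash with a real branch.
    return PERCS_REF_PREFIX + sha1
```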

> I don't think having a tree for every path component is a good idea, it
> makes a lot of trees and complicates the code significantly.
> The only benefits I know of are:
>   1. it produces a smaller top-level tree
>   2. it is less likely to run into problems checking out the tree
>      on systems with small directory entry size limits but implausibly
>      large numbers of properties
>   3. applications don't need to escape or strip / components in property
>      names

Yep. Although 3 could be solved by replacing slashes with characters
outside the allowed key alphabet internally.

> Since this could be involved in page caches for bottle, I'm guessing
> it's in for point 3, since you could have the processed page as the
> .value and the key be the relative path to the page, in which case a
> checkout is a static snapshot of the page.
> 
> However, being able to do that requires that the web server redirects to
> the .value file, or the format is changed so the last path component is
> the blob.

I'd expect web applications to load the .value files via libgit2 rather
than redirecting to checked out versions of these files.

> In summary; you could special case the format such that a checkout of
> the tree would become useful, or the application needing to substitute
> '/', but at the cost of creating a lot of potentially redundant tree
> objects.

I think you're right and we should avoid creating 1+ trees for every
single key/value pair. I'll think about this for a bit. Flat trees with
keys represented as blobs with names like

  foo
  foo:bar:baz
  bla:1231423:bla

might be ok as well if gitpercs converts between / and : internally so
that applications can still use

  foo
  foo/bar/baz
  bla/1231423/bla

transparently. Does that make sense?
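
Roughly, the conversion could look like the sketch below. It assumes
that ':' is then no longer allowed in application-facing keys, because
otherwise the mapping would not be reversible. The function names are
hypothetical and not in gitpercs today.

```python
def external_to_internal(key):
    """Map an application key like 'foo/bar/baz' to the flat blob name."""
    if ':' in key:
        # Assumption: ':' is reserved as the internal separator.
        raise ValueError("':' is not allowed in keys: %r" % (key,))
    return key.replace('/', ':')


def internal_to_external(name):
    """Map a flat blob name like 'foo:bar:baz' back to the application key."""
    return name.replace(':', '/')
```

With this, 'foo/bar/baz' is stored as a single blob named 'foo:bar:baz'
in one flat tree, and applications never see the ':' form.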

  - Jannis

-- 
Jannis Pohlmann
Senior Software Developer
Codethink Limited
http://codethink.co.uk




More information about the baserock-dev mailing list