Introducing a per-commit key/value store for Git

Richard Maw richard.maw at codethink.co.uk
Wed Jan 2 16:57:14 GMT 2013


On Wed, Jan 02, 2013 at 04:56:35PM +0100, Jannis Pohlmann wrote:
> Hey,
> 
> first of all, thanks for the feedback!
> 
> On 13-01-02 15:27:58, Richard Maw wrote:
> > Validating key format with a regular expression is overkill unless the
> > format changes. You just need a set operation like the following.
> > 
> > valid_chars = string.digits + string.ascii_letters + '-_/:'
> > any((c in valid_chars) for c in key)
> 
> I think you mean all()? But yes, I agree with your point.

I did mean all() :)
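
For the record, a minimal sketch of the corrected check. The function
name is made up for illustration, and rejecting the empty key is my own
assumption, since all() on its own accepts an empty string:

```python
import string

# Characters allowed in a key, taken from the set proposed above.
VALID_CHARS = frozenset(string.digits + string.ascii_letters + '-_/:')


def is_valid_key(key):
    """Return True if key is non-empty and uses only allowed characters."""
    # Assumption: empty keys are rejected; all() alone would accept ''.
    return bool(key) and all(c in VALID_CHARS for c in key)
```

With this, 'foo/bar:baz' passes and 'foo.bar' is rejected because of
the '.'.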

> > Interestingly, the '.' character is not valid, though git itself uses it
> > as a configuration path separator.
> 
> There's no good reason for this. All we need is a sensible format that
> works for our use cases. Caching rendered web content is probably the
> most tricky one here as people will likely want to use URL paths
> as keys. But if that involves more than just basic characters, they can
> always convert between URL paths and keys encoded using base64 and a
> basic alphabet that is allowed.

RFC 3986 (URIs) allows pretty much all printable characters apart from
"<>#%\"", and % can still appear as the escape character.

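The base64 conversion could look roughly like the sketch below. It
assumes the key alphabet is digits, ASCII letters and '-_/:', so it uses
the URL-safe base64 alphabet and strips the '=' padding, which is not in
that set. The function names are hypothetical, not part of gitpercs:

```python
import base64


def path_to_key(url_path):
    """Encode an arbitrary URL path as a key using only [A-Za-z0-9_-]."""
    raw = url_path.encode('utf-8')
    encoded = base64.urlsafe_b64encode(raw).decode('ascii')
    # '=' is not in the allowed key alphabet, so drop the padding.
    return encoded.rstrip('=')


def key_to_path(key):
    """Reverse path_to_key() by restoring the padding before decoding."""
    padding = '=' * (-len(key) % 4)
    return base64.urlsafe_b64decode(key + padding).decode('utf-8')
```

The cost is that the keys are no longer human-readable, so this only
makes sense for applications that never inspect them by hand.
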
> > I don't think that there's a use case for needing snapshots of the state
> > of every annotation together, so I would have multiple percs refs named
> > after the sha1 of the commit they annotate, so the commits only have the
> > property names and values.
> 
> That would be an option. In the initial implementation I decided against
> it because it would potentially generate a lot of refs. Overall, I think
> what I'd go for is refs like
> 
>   refs/percs/<sha1>
> 
> rather than
> 
>   refs/heads/percs/<sha1>
> 
> because the latter might conflict with real branches.

Agreed; refs/heads/percs/<sha1> would also clutter `git branch` output.
However, refs/percs/<sha1> will complicate fetching, since by default
git only fetches refs/heads and refs/tags.
For the caching use-case I don't see a problem though, since you can
always generate the pairs in refs/percs yourself, or change your fetch
refspecs.
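
Changing the fetch refspecs could be scripted along these lines. This is
only a sketch: the helper names are made up, and it assumes the
refs/percs namespace discussed above and a remote called "origin". The
only git feature it relies on is `git config --add remote.<name>.fetch`:

```python
import subprocess

# Assumption: annotations live under refs/percs/ on both sides.
PERCS_REFSPEC = '+refs/percs/*:refs/percs/*'


def add_percs_refspec_command(remote='origin'):
    """Build the git command that adds the extra fetch refspec."""
    return ['git', 'config', '--add',
            'remote.%s.fetch' % remote, PERCS_REFSPEC]


def add_percs_refspec(repo_dir, remote='origin'):
    """Run the command in repo_dir; later fetches include refs/percs."""
    subprocess.check_call(add_percs_refspec_command(remote), cwd=repo_dir)
```

After that, a plain `git fetch origin` should bring the refs/percs refs
along with the branches.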

> > Since this could be involved in page caches for bottle, I'm guessing
> > it's in for point 3, since you could have the processed page as the
> > .value and the key be the relative path to the page, in which case a
> > checkout is a static snapshot of the page.
> > 
> > However, being able to do that requires that the web server redirects to
> > the .value file, or the format is changed so the last path component is
> > the blob.
> 
> I'd expect web applications to load the .value files via libgit2 rather
> than redirecting to checked out versions of these files.
> 
> > In summary: you could special-case the format such that a checkout of
> > the tree would become useful, or require the application to
> > substitute '/', but at the cost of creating a lot of potentially
> > redundant tree objects.
> 
> I think you're right and we should avoid creating 1+ trees for every
> single key/value pair. I'll think about this for a bit. Flat trees with
> keys represented as blobs with names like
> 
>   foo
>   foo:bar:baz
>   bla:1231423:bla
> 
> might be ok as well if gitpercs converts between / and : internally so
> that applications can still use
> 
>   foo
>   foo/bar/baz
>   bla/1231423/bla
> 
> transparently. Does that make sense?

If : and / are both valid separators then that would make foo:bar and
foo/bar synonyms. Such a clash is unlikely, but it would be confusing.

Control characters aren't allowed in keys, so if the separator doesn't
need to be printable, how about the substitute character?
http://en.wikipedia.org/wiki/Substitute_character

If it needs to be printable, it can be one of ' <>#"'.
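
A rough sketch of the internal conversion, assuming the separator is a
character that can never appear in a valid key. It defaults to the
substitute character (0x1A); the function names are hypothetical:

```python
# Assumption: SUB (0x1A) is the internal separator; any character that
# is not valid in a key would work the same way.
SUB = '\x1a'


def to_blob_name(key, sep=SUB):
    """Map an application key such as 'foo/bar' to a flat blob name."""
    if sep in key:
        # Keys containing the separator would be ambiguous on the way back.
        raise ValueError('key contains the internal separator')
    return key.replace('/', sep)


def from_blob_name(name, sep=SUB):
    """Map a flat blob name back to the key the application used."""
    return name.replace(sep, '/')
```

Because the separator is outside the key alphabet, foo:bar and foo/bar
map to different blob names and the round trip is unambiguous.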




More information about the baserock-dev mailing list