On Mon, 2016-10-17 at 11:15 +0100, Sam Thursfield wrote:
Hello
On Mon, Oct 17, 2016 at 8:51 AM, Tristan Van Berkom
<tristan.vanberkom(a)codethink.co.uk> wrote:
>
> I'm trying to cut some fat and reduce complexity here, and making (yet
> another ?) case for throwing away git tree shas from the cache key
> algorithm in favor of the git commit sha instead.
...
>
> In any case, I am mostly curious if people on this list have first-hand
> knowledge of just how much disk space on the artifact servers we save
> by trading away the simplicity of just using commit shas.
We've used tree SHA1 instead of commit SHA1 since pretty much forever,
so I doubt anyone knows.
Thanks for your reply Sam !
>
> The benefits of using the tree sha are:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> o Two commit shas may possibly point to the same tree with identical
>   sources.
>
>   So in the case that definitions are modified to point to a new
>   commit that has the same tree, and in that definitions set the said
>   modified chunk has all the same dependencies, which themselves
>   have not changed, this would potentially result in the caching of
>   2 identical artifacts under differing cache keys.
>
> In other words: It can happen approximately never.
It happens regularly in definitions if you mandate the use of `git
merge --no-ff` when merging branches. I.e. you create a commit on a
branch, build it, send it for review, receive +2 on the mailing list,
merge it with `git merge --no-ff`, and now you have a useless rebuild.
I think this concern may have been valid back in a distant past when
definitions were stored in branches of the mirrored sources on trove.
At this time, neither the commit sha nor the tree sha of the
*definitions* module can affect a cache key calculation. This is
entirely limited to the refs used to point to sources we intend to
build: commits to definitions themselves do not affect cache keys
except inasmuch as they modify build instructions, dependencies,
etc.
This dates back to when we had a Linux Kernel-style patch review
process for definitions. Now we use Gerrit, so this use case probably
doesn't matter to anyone anymore.
It also occurs if you have a local commit and you amend the commit
message, commit author, commit date etc.
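For illustration, here is a minimal Python sketch of how git derives
object IDs, assuming the standard SHA-1 object format (a header with
the type and length, then the body). The helper names
(`git_object_id`, `tree_id`, `commit_id`) and the author string are
made up for this example, and only flat trees of regular files are
handled. It shows the case described above: the same tree under a
different commit message gets a different commit SHA1.

```python
import hashlib

# Hypothetical identity and timestamp, used for both author and committer.
WHO = "A U Thor <author@example.com> 1476700000 +0000"


def git_object_id(obj_type, body):
    """SHA-1 of '<type> <length>\\0<body>', the way git names its objects."""
    header = "{} {}".format(obj_type, len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_id(files):
    """Tree ID for a flat directory of regular files (name -> bytes)."""
    body = b""
    for name in sorted(files):
        blob = bytes.fromhex(git_object_id("blob", files[name]))
        # Each entry is '<mode> <name>\0' followed by the raw 20-byte blob ID.
        body += b"100644 " + name.encode() + b"\0" + blob
    return git_object_id("tree", body)


def commit_id(tree, message, parents=()):
    """Commit ID: covers the tree ID, parents, identities and message."""
    lines = ["tree " + tree]
    lines += ["parent " + p for p in parents]
    lines += ["author " + WHO, "committer " + WHO, "", message]
    return git_object_id("commit", ("\n".join(lines) + "\n").encode())


tree = tree_id({"README": b"hello\n"})
original = commit_id(tree, "Add README")
amended = commit_id(tree, "Add a README file")  # only the message differs

print("tree:    ", tree)
print("original:", original)
print("amended: ", amended)
```

Both commits record the same `tree` line, so a cache key based on the
tree SHA1 would be shared between them, while one based on the commit
SHA1 would not.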
I asked about this in #git on irc actually; a changed commit date or
anything about a changed commit at all apparently changes the tree.
The tree has nothing to do with "whether the underlying sources are the
same", rather 2 commits can only reference the same tree when separate
branches refer to a history which is the same; i.e. the commit objects
are themselves the same.
Not sure I understand this entirely, but I think it would be too
complex for git to bother trying to track whether a given commit
changes sources or not.
>
> The benefits of using the commit sha are mostly obvious:
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>
> o Ability to make cache key calculations with only the set of
>   target definitions, without having access to gits or "tree servers"
I agree this is a big argument...
That said, I kinda like the extensions to Git's remote capabilities
that the tree server provides. It's not just tree SHA1s, it also
allows you to list files in a given commit, which has been useful to
me in the past.
We could use the remote Git server or CGit server directly to provide
the tree SHA1s, instead of us needing to maintain our own custom tree
server code, would you still be making this case?
It's a good question: if we didn't have to use our own homebrewed tree
server, and this was a feature we could use with standard git, would I
still make this argument ?
I think yes. I know that the Baserock way is to build everything from
gits, and for good reasons, but I'm not convinced that the tooling
should enforce this.
Without this deep investment in relying on git trees, the tooling is
free to decide that a cache key can be resolved with a tarball url +
sha256 sum if it wants to - leaving usage of a trove as a best practice
recommendation, especially targeted at user bases who are particularly
interested in long term repeatability (rebuilding exactly the same
system 10 years from now).
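As a rough sketch of what this looser coupling could look like, the
function below derives a cache key from whatever identifies the
source, plus the build instructions and the dependencies' keys. This
is not ybd's actual algorithm: the function name `cache_key`, the
dictionary fields, the URLs and the hashes are all placeholders made
up for this example.

```python
import hashlib
import json


def cache_key(name, source, build_instructions, dependency_keys):
    """Hash everything that can influence the artifact into one key.

    `source` is any JSON-serializable description that pins the input
    exactly, e.g. a git commit sha, or a tarball url plus sha256 sum.
    """
    payload = {
        "name": name,
        "source": source,
        "build": build_instructions,
        "deps": sorted(dependency_keys),
    }
    # sort_keys makes the serialization, and therefore the key, deterministic.
    encoded = json.dumps(payload, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


# Placeholder sources; neither needs a git server or tree server to resolve.
git_source = {
    "kind": "git",
    "repo": "git://example.org/hello",
    "commit": "0123456789abcdef0123456789abcdef01234567",
}
tarball_source = {
    "kind": "tarball",
    "url": "https://example.org/hello-1.0.tar.xz",
    "sha256": "0" * 64,
}

base = cache_key("base", git_source, ["make", "make install"], [])
hello = cache_key("hello", tarball_source, ["./configure", "make"], [base])
print(hello)
```

Everything needed is already in the definitions, so the key can be
computed offline.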
I would envision a future where Baserock continues to be a shining
example of using trove, and where definitions/defslib/ybd become more
flexible and extensible tooling with other consumers outside of
the Baserock ecosystem. For that I want looser coupling with git/trove
and more extensibility.
Cheers,
-Tristan
>
> o Fewer moving parts; if for example we would build GNOME modules
>   with ybd/definitions, why would we bother with a trove for the
>   GNOME modules when they are all hosted at git.gnome.org ? Would
>   we have to implement a tree server to optimize the builds ?
>
>   Same goes for some infrastructure and CI setup with gitlab,
>   does gitlab provide its own tree server ?
The idea of Trove is that all your source code is mirrored in one
place in case somebody, for example, hacks into kernel.org and it goes
offline for 3 months. Or you're using code from Gitorious and it
suddenly goes offline. That kind of thing. It's not there purely to
resolve tree SHA1s.
...
I'm well aware how much of a pain it is having to involve a Git server
in the cache key calculations so I'm not really against what you're
proposing. But watch out for the extra rebuilds.
Sam