Hi Emmet,
First, thanks for this email; it's very helpful! My thoughts on this are
a bit scattered, as you can probably tell from my confusing replies
below ...
(for those who want a concise summary of my response, try this: "I think
baserock/ repos should move to Gerrit and delta/ repos should stay in
Trove").
On 13/12/14 15:40, Emmet Hikory wrote:
> I've been advocating the use of gerrit as a patch tracker for some
> time, and am gladdened to see more people supporting this idea, and the
> work of the operations team to put test gerrit instances up in the
> Baserock infrastructure. However, gerrit is a git server, and I am a
> little worried that we may lose some of the advantages we have enjoyed
> from the use of gitano for Baserock curated archives.
>
> My understanding is that gitano stores all history and all ACLs in a
> gitano-managed git repository, so that short of generating collisions in
> the underlying datastore, one can be confident that one has a traceable
> record of everything that has happened to the repositories managed by
> gitano. I also believe that gitano has no means of tracking candidates
> that ought be merged to arbitrary targets, or easily allowing
> individuals to self-generate credentials allowing them to upload
> candidates against a given branch.
>
> From what I can tell, gerrit has the opposite feature set:
> specifically a rich means to manage registration, candidate submission,
> and review, all with a documented API that has inspired a diverse
> ecosystem of support tooling to interact with these services.
> Conversely, configuration, state, and access controls are stored
> externally, so that one would need to maintain a complete transaction
> log for all changes to the hosting server(s) to ensure that the state of
> the repositories were accurate.
My understanding is that Gerrit stores all this in an SQL database -- is
there no SQL database that could provide a sufficient guarantee of
consistency if configured appropriately? Or is it just much more
difficult? I don't have a clear idea of the scale of the problem you
want solved.
Having been working with Gerrit last week, it seems like something of a
resource hog too. Gitano seems much more lightweight. I think this is a
good reason to stick with Gitano where we can.
> For archive curation, I believe that we should continue to recommend
> the use of gitano as a central storage facility, and further that we
> should continue to use gitano to manage the archive used to generate the
> reference systems. While there is a knowledge horizon imposed by
> mirroring, in that we cannot validate the accuracy of commits not
> applied directly to the archive mirror, the use of lorry provides us
> with some degree of comfort that we have accurately captured the commits
> made upstream, and will not lose commits removed upstream, and that we
> have some traceability to this process. If we change to a git server
> that does not provide the same level of guarantee as gitano, there is a
> much greater potential for rogue commits to be present, making it less
> safe for organisations to mirror the Baserock archive, and that the
> Baserock project cannot have as much confidence in systems deployed by
> the operations team.
>
> For patch tracking, we clearly want to use gerrit, as gitano has no
> such features. In the case of repositories for which Baserock is
> upstream (lorry, morph, definitions, etc.), a sensible model might be to
> have the "upstream" repository be on gerrit, and lorry this into the
> curated archive, as with any other upstream component. In the event
> that the "upstream" archive is compromised or corrupted, the project can
> confidently detect the issue and safely recover from the curated archive
> mirror.
I agree it'd be good to have the canonical baserock/baserock/* repos
(those of them which aren't obsolete) hosted in Gerrit.

I'm not sure I agree that the Baserock 'archive' needs to mirror them:
we could do a backup of the contents of the Gerrit server in the same
way we do backups of the contents of the current git.baserock.org
server; we don't need to (ab)use the curated archive for that purpose.
And currently our deletion policy for the archive server (which is
"don't delete anything, ever") causes a bit of pain when developing
Baserock components: if one creates a repo and later realises it should
have a different name, or shouldn't exist at all, then we end up with
dead repos that we can't remove: see
http://git.baserock.org/cgi-bin/cgit.cgi/baserock/
Troves that are downstream of the toplevel one could have a
configuration that mirrors both Baserock's Gerrit and Baserock's
'archive'.
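To sketch what that mirroring might look like, here's a .lorry file a
downstream Trove could use to pull the Gerrit-hosted definitions repo
(the repository path and the gerrit.baserock.org URL are illustrative,
not an agreed layout):

```json
{
    "baserock/baserock/definitions": {
        "type": "git",
        "url": "git://gerrit.baserock.org/baserock/baserock/definitions"
    }
}
```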
> In the case of repositories for which Baserock is not upstream,
> things are a little more complicated. Adding a new repository to gerrit
> for a single review for a patch expecting to land upstream, and then
> switching the lorry back later seems horridly complicated, as is using
> any other means of causing an upstream mirror to exist. However, and
> especially for repositories for which upstream does not use git, it may
> be difficult to cause a given patch to appear somewhere suitable for
> lorrying. In the presence of a patch tracker, the habit of review on a
> mailing list may fade, and so there is no obvious way in which members
> of the Baserock community might propose patches to non-Baserock
> components in advance of them landing upstream.
Here's a view I had on this, which originally came from something you
said about wanting everyone to work upstream. It follows these rules:
1. If you make a change to a project, it should be sent upstream. The
Baserock project's 'archive' should make it easy to manage and test
these changes before sending them for review to the upstream project,
but it should not make it easy to maintain branches in the Baserock
'archive' that are never sent upstream.
2. If you are making a change to a project which can't be sent upstream,
you have made a fork, and become an upstream maintainer yourself. The
fork should not be directly hosted in the Baserock archive.
For case (1), the review should be done using whatever review tooling
the upstream project chooses to use. While the branch is being worked
on, it can be pushed directly into the curated archive (as long as the
branch is in the baserock/ namespace) using Gitano. For (2) the fork
should be hosted and maintained in Gerrit, and changes should be
reviewed by whichever people in the Baserock project are taking
responsibility for the fork. (If that's only one person, forking perhaps
wasn't a good idea).
In the past we've done some review of patches against upstream projects
on the baserock-dev mailing list, even when the patches are just being
backported from other upstream branches. I don't think this makes sense:
only the associated change to definitions needs to go through the normal
Baserock review process. When patches to U-Boot, GCC or GStreamer
appear on the mailing list I ignore them because I have no way of giving
a review beyond "yes, this definitely came from that other branch."
The ugliest part of this setup is managing user accounts: new
contributors can sign up to Gerrit themselves, but can't sign up to
Gitano themselves. I think this isn't too bad: most new contributors
probably won't immediately need to create multiple branches in delta/
repos. If they do need to do something that drastic, ideally they'd
discuss it on the mailing list anyway, at which point we could create
them a Gitano account able to create branches in the baserock/
namespace on the 'archive' server.
> Some of the automation used by the Baserock project adds more
> potential complication to the set of available choices. To date,
> cache.baserock.org has been populated for any commits that reach master
> of the primary repository for definitions. In the event that we
> separate the archive curation and patch tracking facilities as described
> above, how long does it take for a given change to mirror, and from
> which source should we be populating the cache? My personal preference
> is to populate the cache pre-merge, so as to provide a more robust cache
> to any humans validating results, but this may require unacceptable
> storage capacities.
> For pre-merge testing automation, the situation is a bit simpler.
> In the case of a change in definitions, the automation can check out the
> candidate, and then safely run against the curated archive (assuming a
> sufficiently small mirroring time that the referenced commits have landed).
> In the case of a change to other Baserock components (e.g. morph), this
> can be modeled as a change to definitions, where the repo and ref for the
> changed component references the ref of the candidate in the gerrit repo,
> and everything else references the curated archive, avoiding the need
> to keep track of which repository to test externally.
>
> For integration automation, if the automation is tracking potential
> changes in the curated archive, this means that they are available in
> the curated archive, so candidate definitions branches using these
> references may be safely submitted to pre-merge testing automation
> as described above.
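To make that modelling concrete, a chunk entry in a candidate
definitions stratum might look something like this (a sketch only: the
stratum layout is abbreviated, and the change number and patchset in
the Gerrit ref are made up):

```yaml
# Sketch of chunk entries in a definitions stratum. The Gerrit change
# ref and repo URL are hypothetical.
chunks:
- name: morph
  repo: git://gerrit.baserock.org/baserock/baserock/morph
  # A Gerrit candidate ref: refs/changes/<last 2 digits>/<change>/<patchset>
  ref: refs/changes/34/1234/2
- name: linux
  # Everything else still references the curated archive.
  repo: upstream:linux
  ref: master
```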
I didn't realise until Friday that right now the Mason doesn't actually
test or build 'master' at all. It makes sense that it doesn't, because
if Zuul does all the merges to 'master', and Zuul isn't broken, then we
know that 'master' will always work. But right now Zuul doesn't do all
the merges to 'master', and it will be some time before we can actually
gate pushes to 'master' based on the tests done by Mason. So it would
make sense to keep the existing Mason running as well, for the time
being. This lets us defer the question of how to populate
cache.baserock.org.
About testing definitions that point at branches in delta/ projects, and
needing the archive to be up to date with the branch that was pushed: I
think this is a good argument for continuing to push in-development
branches in delta/ repos into the archive directly.
About testing Morph as part of integration into a Baserock system:
That'll always be pretty slow, if we test building a full system with
Morph. Morph has a test suite in its repo which runs in a few minutes,
covers different things to the rebuild-everything-in-a-devel-system
Mason test, and is usually enough to prove changes.
> I would be interested in the opinions of others as to how to model
> this well, gaining the advantages of both gerrit and gitano to allow
> both effective patch tracking and reliable and traceable curated archives.
> I am especially interested in suggestions that allow Baserock users to
> configure patch tracking for downstream repositories, as one of the
> advantages of a curated archive is the ability to land branches that
> differ from upstream in one or more ways. I would hope that such solutions
> might also be usable by the Baserock project to allow the operations team
> to land patches in various components to support the needs of the
> Baserock infrastructure.
Here's a sketch of something that I think is achievable with a few days
of ops work.
archive.baserock.org (and/or 'trove.baserock.org', in keeping with the
concept that 'Trove is upstream in a box'):
- runs lorry-controller and lorries the wide world of source code
- uses Gitano as a Git server
- user accounts created on request by Baserock ops team
- pushes to refs in baserock/ namespace allowed by any user
- not sure what happens to the existing baserock/ repos in here --
perhaps we keep them up to date from gerrit.baserock.org, to avoid
breaking things for existing Troves, but discourage anyone new from
using these mirrors.
gerrit.baserock.org:
- all baserock/ projects hosted here, including any upstream projects we
are maintaining forks of.
- uses Gerrit git server, and Gerrit patch tracking
- not sure about accounts policy -- allow anyone in, for now?
- has 'Verified' flag that is set by Mason after testing is done
- Firehose pushes candidate branches for definitions.git here
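For the 'Verified' flag, Gerrit lets a project define the label in its
project.config, stored on the refs/meta/config branch. Roughly something
like this, though the exact function and values would need deciding:

```ini
# Sketch of a "Verified" label in Gerrit's project.config
# (refs/meta/config branch); values here are a guess at what we'd want.
[label "Verified"]
    function = MaxWithBlock
    value = -1 Fails
    value = 0 No score
    value = +1 Verified
```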
git.baserock.org: not sure how this would work :) We could do some
clever path-based forwarding so that requests are forwarded to the
appropriate server (Gerrit or Trove), but I've never tried to set up
something like that, maybe the approach sucks and I don't realise. Would
be nice if one cgit instance could browse both Git servers, too: I don't
know if that's possible. Clever request forwarding isn't necessary for
Morph to be able to use 2 git servers: we could change the 'baserock:'
prefix to point to 'git://gerrit.baserock.org' easily enough.
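For instance, something like this in morph.conf would repoint the
'baserock:' prefix (a sketch: I haven't checked this expansion syntax
against a current morph, and the push URL Gerrit wants may differ):

```ini
# Hypothetical morph.conf fragment: pull from Gerrit's git daemon,
# push over ssh. %s is replaced with the repository name.
[config]
repo-alias = baserock=git://gerrit.baserock.org/baserock/%s#ssh://git@gerrit.baserock.org/baserock/%s
```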
Downstream troves: mirror both archive.baserock.org and
gerrit.baserock.org. Configuration of downstream Troves is something we
need to consider soon anyway, because the default of "everyone mirrors
everything" will probably become increasingly impractical, if we
continue adding more stuff to trove.baserock.org, and continue to
advertise that setting up a local Trove is easy.
thanks for reading ...
Sam
--
Sam Thursfield, Codethink Ltd.
Office telephone: +44 161 236 5575