We currently handle definitions that need to pull multiple source
repositories together in an ad-hoc way.
For gcc we import the gmp, mpc and mpfr source trees in by checking them
into our delta branch directly.
Some upstreams handle importing the sources by adding submodules.
However, we have to make them fetch sources from our Trove for
reproducibility and latency reasons, so we end up having to add a delta
on top to change where the submodules fetch their sources from.
This works against our goal of minimising the delta, so we need a way
to encode in our builds how to handle components that need sources
from multiple repositories to function.
Our current approach of using submodules introduces a _lot_ of extra work
when we need to update multiple submodules recursively, so we need a
simpler mechanism.
To solve this, I propose we extend the source repository information from
just (repo, ref) to a list [(repo, ref, path?, submodule-commit?)].
So rather than:

    - name: foo

we extend it to be able to take a "sources" field:

    - name: foo
      sources:
      - repo: upstream:foo
        ref: baserock/morph # used to validate commit is anchored
                            # properly and as a hint for where
                            # you ought to merge changes to
      - repo: delta:gnulib
        path: .gnulib # where to put the cloned source
        submodule-commit: feedbeef… # check that this matches, so you
                                    # can be told if you need to update
                                    # your delta
The `path` field specifies where this tree should be placed within the
parent tree. It is optional, defaulting to `.`.
If multiple paths clash after normalisation, the build fails at source
creation time.
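As a sketch of how that clash check might look (field names follow the proposal; the normalisation rule and error wording are my assumptions):

```python
import posixpath

def normalise(path):
    # "" and "." collapse to "."; "./x" and "x/" both become "x"
    return posixpath.normpath(path or ".")

def check_no_clashes(sources):
    """Fail at source creation time if two sources land on one path."""
    seen = {}
    for src in sources:
        p = normalise(src.get("path", "."))
        if p in seen:
            raise ValueError("path %r requested by both %s and %s"
                             % (p, seen[p], src["repo"]))
        seen[p] = src["repo"]
```

With this shape, `./x` and `x/` count as the same path and trigger the failure, while the parent source at `.` and a sub-source at `.gnulib` coexist.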
A sub-source can be placed where there is no existing entry at that
path, over an empty directory, or over a git submodule whose commit
matches the `submodule-commit` field.
If there is a file, a symlink, a non-empty directory, or a submodule that
doesn't match `submodule-commit` at that path, the build fails.
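The placement rules above can be sketched like this (the `(kind, detail)` tree-entry representation is a simplified stand-in for inspecting the parent's git tree, not real morph code):

```python
def can_place_subsource(entry, submodule_commit):
    """Decide whether a sub-source may be placed over a tree entry.

    `entry` is None (nothing at that path) or a (kind, detail) pair.
    For kind "dir", detail is the list of entries; for "submodule",
    detail is the commit recorded in the parent tree.
    """
    if entry is None:
        return True                       # no existing entry: fine
    kind, detail = entry
    if kind == "dir" and not detail:
        return True                       # empty directory: fine
    if kind == "submodule":
        # allowed only when the recorded commit matches submodule-commit
        return detail == submodule_commit
    return False                          # file, symlink, non-empty dir
```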
The `submodule-commit` check exists as a safeguard against the parent
repository being updated and requiring a new version which your specified
commit isn't appropriate for.
If you get a build failure because the submodule isn't appropriate, then
you have to check whether the version you specify works, then update
the `submodule-commit` in your delta accordingly.
Cache key changes
This shouldn't require any cache key changes to existing definitions,
but builds that make use of multiple source repositories will also hash
the commit tree and path.
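A sketch of the idea (the "tree" field name, the JSON encoding and SHA-256 are my assumptions for illustration, not morph's actual cache-key scheme):

```python
import hashlib
import json

def source_cache_key(sources):
    """Hash the (repo, tree, path) triples in a stable order, so the
    commit tree and path of every source repository affect the key."""
    triples = sorted((s["repo"], s["tree"], s.get("path", "."))
                     for s in sources)
    return hashlib.sha256(
        json.dumps(triples).encode("utf-8")).hexdigest()
```

Sorting the triples makes the key independent of the order sources are listed in, while any change to a tree or path changes the key.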
Alternative solutions for submodules
We could continue to use the current model, and deal with the pain of
having to make multiple branches in multiple repositories to satisfy
the change to the repository paths.
We could have a magic look-up table to replace repository urls when we
parse the submodules.
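For illustration, such a look-up table might be as simple as this (both URLs are hypothetical):

```python
# Hypothetical mapping from upstream submodule URLs to Trove mirrors
TROVE_URL_MAP = {
    "git://git.savannah.gnu.org/gnulib.git":
        "git://trove.example.org/delta/gnulib",
}

def rewrite_submodule_url(url):
    """Rewrite a URL parsed from .gitmodules to fetch from the Trove;
    unknown URLs pass through unchanged."""
    return TROVE_URL_MAP.get(url, url)
```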
We could use git-replace to magically make git do the url correction, but
then we'd have to handle replacements properly, rather than dropping them.
I see there are a few packages maintained in
more than one repository in Trove, for example 'autoconf'
and 'autoconf-tarball'; does Baserock use both?
Also, is it possible to request tar packages from a
git repository using 'git-archive' as a service?
IIUC the Baserock project has been maintaining a cache server for morph
artifacts for some time (source and actual server), and recently
Sam + Lauren did some work to create a pseudo cache server for the
bit-for-bit investigation.
I'd like to establish a cache server for ybd, with the following requirements:
- can serve artifacts requested by ybd (and morph?), identified by
- can receive artifacts posted by/during runs of ybd (and morph?)
- provided the runner is authenticated (preferably by ssh key?)
- can deal with multiple simultaneous submissions of (notionally the
same) cache artifact - only one should land
- monitors available space, and cleans out least-recently used (so can
run unattended 24/7 without manual maintenance)
I'm also interested in the following, but they are not hard requirements:
- minimum dependencies
- minimum custom code
- runs on a baserock system
- deployable on standard cloud infrastructure
- no known vulnerabilities, and capability for ongoing system security
- no need for high-availability (if the service is down, ybd/morph can
fall back to building locally)
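To make the "only one should land" and least-recently-used points concrete, here is a minimal sketch, assuming one file per cache key on a POSIX filesystem (this is not morph-cache-server's actual code):

```python
import os
import shutil
import tempfile

def store_artifact(cache_dir, key, src_path):
    """Land an artifact atomically: concurrent submissions of the same
    key race harmlessly, because the final rename is atomic on POSIX,
    so exactly one copy lands under the key."""
    fd, tmp = tempfile.mkstemp(dir=cache_dir)
    os.close(fd)
    shutil.copyfile(src_path, tmp)
    os.rename(tmp, os.path.join(cache_dir, key))

def prune_lru(cache_dir, max_bytes):
    """Delete least-recently-used artifacts until under max_bytes,
    so the server can run unattended without filling its disk."""
    paths = [os.path.join(cache_dir, n) for n in os.listdir(cache_dir)]
    entries = sorted((os.path.getatime(p), p) for p in paths)
    total = sum(os.path.getsize(p) for _, p in entries)
    for _, p in entries:                  # oldest access time first
        if total <= max_bytes:
            break
        total -= os.path.getsize(p)
        os.unlink(p)
```

A real server would also need the authentication and HTTP layers, but the storage core can stay this small.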
Maybe others are aware of other things to think about for this kind of
service.
Would it make sense to base this on morph-cache-server, or something
else? If mcs, would the result be of interest for Baserock's cache
infrastructure too?
I am sorry for the delay in reply.
> > It looks the local trove server takes the git.baserock.org as upstream
> > for all the opensource packages..
> Yes, the default configuration when you set up a Trove is for it to
> mirror every repo from git.baserock.org.
> > Is it still possible in runtime to convert them to track their opensource upstream,
> > so the 'origin' remote points to their respective upstream providers.
> What exactly do you want to achieve? I ask because there's no 'start
> mirroring everything from upstream instead of git.baserock.org' button
> you can push, but there may be a way we can achieve what you want.
> If you really want to turn your Trove into something doing exactly the
> same as git.baserock.org you might be able to simply merge master of
> into your Trove's local-config/lorries.git repo. But that will cause
> problems for the repos where upstream doesn't use Git, where
> git.baserock.org converts from Subversion, Mercurial or CVS to Git. The
> problem is that if you run git-svn twice on the same SVN repo, you get
> two different Git repos out -- the commit SHA1s that git-svn creates in
> the Git repo will be different each time. So if you want to do this,
> it's probably better to deploy a fresh Trove, if that's possible.
OK, I will try a fresh Trove with the lorries from master.
I believe that will also solve the problem of tracking the other
version control systems.
> Also, be aware that the resource requirements are higher for a Trove
> that mirrors directly from upstream, because it has to do extra work to
> convert from SVN, CVS, Mercurial etc.
Yes, I will see how it performs on my current setup; I see we have
31, 13 and 1 packages from SVN, CVS and Mercurial respectively.
One of the main aims of Baserock is to be able to reproduce a build in
its entirety regardless of when and where the build is performed.
However, due to the nature of building software from different
upstreams, this isn't a goal that can be achieved and then "it's done,
we can forget about it"; being able to reproduce a build and all
components therein bit-for-bit is an ongoing process and needs to be
maintained.
Work has already started on implementing a process of working towards
reproducibility, and there are a number of ways to take this work
forward:
* Engage with other upstream projects
There has already been some interaction between us and the Debian
Reproducible project at OFTC/#debian-reproducible, regarding the use
of libfaketime in order to set a consistent build time for all
components. There may be value in offering to collaborate with Debian
Reproducible, and any other projects aiming for reproducibility; both
sides could benefit from a greater pool of people investigating
reproducibility. However, this may end up being more time/resource
consuming than we could handle at present.
* Set up a continuous builder for Baserock
Following the instructions on the deterministic builds section of
wiki.baserock.org, a user could set up a continuous builder to check
reproducibility of components across builds, as done here. This sets
up a number of builds of a system and shows how the SHA1 of an artifact
varies across builds. This is suitable for long term monitoring of
upstream artifacts; any changes introduced that break reproducibility
would be reflected here.
* Run YBD build/check script
Run the YBD script to get SHA1 values for all components across two
consecutive builds of a given system, defined by the user. The aim of
this is to output a breakdown of all items that are not reproducible,
and to see if there is any commonality between non-deterministic files
(e.g. all gzip files differing due to MTIME being encoded in the header).
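The gzip case is easy to demonstrate with a short stdlib-only sketch: identical payloads compress to different bytes purely because the header encodes MTIME, and pinning the timestamp restores bit-for-bit output:

```python
import gzip
import io

def gzip_bytes(data, mtime):
    """Compress `data`, forcing the MTIME field in the gzip header."""
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=mtime) as f:
        f.write(data)
    return buf.getvalue()

# Same payload, different MTIME: the compressed files differ, so the
# artifact is not reproducible. Forcing a fixed mtime (e.g. 0, in the
# spirit of the libfaketime approach) makes the output deterministic.
```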
Any comments regarding issues with the current suggestions for going
forward with reproducibility, or any suggestions not mentioned here,
are welcome.
There was a lot of discussion on IRC yesterday around whether ybd
should be lorried, forked and/or included in Baserock definitions. I
undertook to bring the conversation to the list, hence this email. I
apologise for the length; there is a lot of ground to cover.
For those who may not have heard, ybd is a small tool for integrating
software stacks. It was originally mentioned here in November last year,
but the subsequent discussion at that time focused mainly on other
topics.
ybd builds on many key lessons and principles (and some of the code,
and the morphologies/definitions) we have evolved in the Baserock
project since work started in 2011. It re-implements the core
functionality of `morph build` and `morph deploy`.
I've been hacking on ybd in my own time for approximately 8 months,
while discussing what I was learning with the community on #baserock. I
based the work on github, rather than try to work on g.b.o, because I
wanted to move as fast as I could with minimum friction. Some folks
contributed patches, and over the last few months when the going got
tough I asked a few Codethink folks to help out on company time, for a
couple of weeks at a time.
As a result I can say with confidence that ybd has taken less than 150
effort days in total so far. This is a tiny number, but then it's a tiny
codebase. Both of those things really matter to me, perhaps more than
they matter to others. I'm strongly in favour of achieving more with
less. Less code, and less effort. I believe this leads to less cost, and
more time available to do other interesting things.
By my own measures, ybd is already more successful than I could have
hoped. My main original aim was to have something that could build and
deploy definitions, in less than 5000 lines of code. As of now ybd does
git collection and cache, build, artifact cache and deploy, in
approximately 1600 lines total.
I had other drivers too:
1) Last year I was told by various people that I didn't know what I was
talking about, I didn't know how to write software, I didn't know how to
manage, etc. ybd has given me a chance to re-ground myself and
reality-check my own capabilities and limitations.
2) Lots of folks that I talk to about Baserock 'get it' from my
explanation... but then give up as soon as they actually try to use the
software. Baserock's barriers to entry for new users are just too high.
3) Over my years as a morph user I've been extremely frustrated at
times. Morph does some amazing stuff, better than anything else before
it, but there are so many rough edges and wrong decisions in the code
that I've found myself screaming. And I know that at least some other
users feel the same.
4) When looking at the problems Baserock has been designed to solve, I
think we need to make more progress, faster. In my view we can get
somewhat bogged down with friction resulting from our existing
processes.
5) I've tried to introduce quite a lot of ideas into Baserock, but many
are either too hard or too radical to be accepted in the status quo. A
simple example is my contention that it's possible to develop very highly
reliable software without a test framework, and that this is preferable
in some situations. Another is my proposal that we should drop
morphology/stratum/chunk and settle on names that humans can easily
understand.
At this point, I am happy that ybd is useful progress for all of the
above:
1) It has been working reliably for some months, satisfying my need for
re-grounding.
2) Getting started with ybd is just

    pip install pyyaml sandboxlib
    git clone git://github.com/devcurmudgeon/ybd
    git clone git://git.baserock.org/baserock/baserock/definitions
3) ybd can run on a range of Linuxes, with much less setup than Morph,
no need for a Baserock download and no need for command-line fu. As a
result DrNic, who I think has never even visited the Baserock website,
let alone downloaded a Baserock image and configured morph, was able
this week to get a Concourse CI pipeline running ybd in under a day.
4) and 5) remain to be seen.
And actually that is the main reason for this long email. I think it's
time I (and hopefully the Baserock community) consider how ybd moves
forward, and how it fits relative to the rest of Baserock.
- Should ybd upstream move to use gerrit and git.baserock.org?
- If ybd stays at github:
  - should it be included in Baserock systems?
  - should Baserock fork it?
- How does ybd affect the future of morph?
To be clear, as ybd maintainer I'm currently reluctant to adopt
Baserock's gerrit process. I'm certainly open to considering it, if
there's sufficient interest from the community. But note that my main
validation of what works in ybd so far has been to run ybd, and in
general I prefer to encourage contributions with minimum friction. At
some point I'd like to evolve the slickest possible workflow and
'release' mechanism for ybd, and I already have a strong hunch that
gerrit is not going to be that. Github is working ok for me so far.
Also, I should state that many of the ideas I intend to explore are
still radical/controversial, or at least of less interest to others
here. For example
- I really, really, really want bit-for-bit reproducibility.
- I hope to overhaul the definitions format, repeatedly, ideally to
make definitions (and ybd) play nicely with other projects that are
managing large software manifests (for example Cloud Foundry). Although
there's been a lot of deep thinking on definitions and versioning in
Baserock, I believe our current work is demonstrating that we haven't
yet found a good enough solution. 'Good enough' for me means we need to
be able to change schema and content trivially, and repeatedly, with no
disruption.
- ybd randomises build order. As a result I believe that several
instances of ybd sharing an artefact cache could function as a distbuild
network (not 100% efficient, but simple and robust).
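The randomised-order idea can be sketched like this (not ybd's actual scheduler): each instance picks randomly among components whose dependencies are already built, so independent instances sharing an artefact cache mostly end up working on different components.

```python
import random

def randomised_build_order(deps):
    """deps maps component -> set of dependencies. Yield a valid build
    order, choosing randomly among the currently-buildable components."""
    built = set()
    remaining = set(deps)
    while remaining:
        ready = [c for c in sorted(remaining) if deps[c] <= built]
        if not ready:
            raise ValueError("dependency cycle")
        choice = random.choice(ready)
        yield choice
        built.add(choice)
        remaining.discard(choice)
```

Duplicate work only occurs when two instances happen to pick the same ready component before either has published its artefact, which is the "not 100% efficient, but simple" trade-off described above.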
- I'm toying with the idea of forking definitions into ybd, at least
temporarily to see what the pros-and-cons are :-)
I hope this is of interest and look forward to feedback.
There was a discussion today on the #baserock IRC channel about how to
handle the need for multiple build-essential strata with different
libc's, compilers and busybox configurations.
Here is the text of the discussion:
[13:48:00] <tiagogomes_> did anyone looked at the possibility of using
llvm in build essential? It needs python to build...
[13:48:56] <richard_maw> tiagogomes_: not to my knowledge
[15:59:12] <paulsher1ood> tiagogomes_: you mean instead of gcc, or
earlier in stage1/stage2?
[16:03:44] <tiagogomes_> paulsher1ood isn't instead of gcc the same as
earlier in stage1/stage2?
[16:04:29] <tiagogomes_> anyway building from scratch using llvm is not
feasible without adding a bunch of chunks to build-essential
[16:05:57] <paulsher1ood> tiagogomes_: there is stage1-gcc, stage2-gcc,
gcc. i think trying to have llvm in stage1/stage2 would be overkill but
i may be wrong
[16:06:38] <tiagogomes_> paulsher1ood it is. llvm needs python to build...
[16:07:38] <paulsher1ood> if it's just that, i wouldn't see it as a barrier
[16:07:51] <paulsher1ood> but again i may be wrong
[16:08:21] <pedroalvarez> does it really need to go in build-essential>?
[16:08:42] <tiagogomes_> paulsher1ood you would have to move a few
chunks to build essential, which is very bad
[16:10:53] <tiagogomes_> pedroalvarez depends. If you want all stage 3
chunks built with llvm from scratch... yes
[16:13:13] <tiagogomes_> anyway, last time that I heard (FOSDEM) Linux
was still not building with llvm
[16:13:22] <pedroalvarez> I see
[16:18:08] <paulsher1ood> i wonder if we could end up with
build-essential-gcc, and build-essential-llvm etc?
[16:18:42] <paulsher1ood> and separate stage1 and stage2 in to
[16:19:05] <tiagogomes_> I thought that was the plan, having different
build essential for each combination of compiler and libc
[16:19:56] -*- paulsher1ood wonders where the 'plan' is :)
[16:20:35] <rdale> i've currently got an openwrt and openwrt-musl build
[16:21:07] <pedroalvarez> tiagogomes_: so, it needs python to build. but
once built, does it need python to run?
[16:21:11] <paulsher1ood> rdale:but are they branches of
build-essential, or separately named?
[16:21:31] <paulsher1ood> tiagogomes_: using latest morph? or ybd?
[16:22:04] <rdale> currently they are both called build-essential in two
different branches, but i'm thinking how to improve that
[16:22:06] <tiagogomes_> Using latest morph
[16:28:47] <paulsher1ood> rdale: i'd suggest we go with
build-essential-variant (where variant for you would be musl)?
[16:29:47] <SotK> tiagogomes_: that is definitely my bad :/
[16:29:56] <rdale> paulsher1ood: yes, that's what i was thinking. then
we'll need a little script to fix up the actual build-essential name in
the strata and system before building
[16:30:20] <paulsher1ood> rdale: erk.... that needs a *bit* more thought :)
[16:30:29] <rdale> yes
[16:31:20] <paulsher1ood> time to try to get to the bottom of this
semantics and versioning pit, i think
[16:56:41] <persia> On supporting multiple libcs: I think that we should
have multiple strata/systems that depend differently on different libs
(and not every strata needs to support every libc), so
build-essential-variant seems right to me.
[16:57:15] <persia> I would expect the reference development system to
always use glibc, but other systems might be better with musl, uclibc, etc.
[16:58:13] <rdale> another reason to have different build-essentials,
apart from libc's, is when you need different busybox config options to
the default ones
It looks the local trove server takes the git.baserock.org as upstream
for all the opensource packages..
Is it still possible in runtime to convert them to track their opensource upstream,
so the 'origin' remote points to their respective upstream providers.
Last week I moved all the deployment extensions into a subdirectory
in definitions, this week I want to stop them depending on morphlib
and cliapp altogether.
The first step for this is to move the common code needed from
morph into definitions, and stop that depending on morphlib and
cliapp. I have a branch which does this which needs to be
reviewed somehow, but can't be sent to Gerrit since it contains
merge commits of a bunch of history from morph. Last week Richard
Maw reviewed a similar branch to move the deployment extensions,
and I'd appreciate it if someone could do the same for this.
Thanks for the help,