We have a weakness in the way Baserock systems are bootstrapped, and
today it finally bit me!

Right now Morph ignores the question of whether a chunk built in
'bootstrap' mode will run on other systems. Morph assumes that a chunk
built on an x86_64 Baserock system will function the same on any other
x86_64 Baserock system. So for example, the cache key of stage1-binutils
will be the same on any two Baserock systems, assuming it's the same
architecture, same build instructions and same source code commit.
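
To make the assumption concrete, here is a minimal sketch of that kind of cache key. It is not Morph's actual code: the function name `cache_key`, the field names, the placeholder values and the choice of SHA-256 over JSON are all assumptions made for illustration. The only point is that nothing about the host system goes into the hash.

```python
import hashlib
import json


def cache_key(arch, build_instructions, source_commit):
    """Hypothetical cache key: hashes only the three inputs named above."""
    payload = json.dumps(
        {
            "arch": arch,
            "build-instructions": build_instructions,
            "commit": source_commit,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# The same chunk described on two different hosts (placeholder values).
INSTRUCTIONS = ["./configure --prefix=/tools", "make", "make install"]
key_on_old_host = cache_key("armv7lhf", INSTRUCTIONS, "0123abcd")
key_on_new_host = cache_key("armv7lhf", INSTRUCTIONS, "0123abcd")

# No host-specific input is hashed, so both hosts agree on the key and
# will happily share one cached artifact.
print(key_on_old_host == key_on_new_host)
```
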
Anyone who has been in software more than 5 minutes will know this is
fantasy. But it's a very convenient fantasy. I'm not sure how many
people have actually found themselves on the wrong side of this
assumption so far -- it can't be many, or the mailing list would be full

I'll quickly sum up the problem it caused me. On ARM hard-float systems,
the upgrade from EGLIBC 2.15 to GLIBC 2.20 introduced an ABI break:
/lib/ld-linux.so.3 became /lib/ld-linux-armhf.so.3. So dynamically
linked programs built on an ARM hard-float system using EGLIBC 2.15
won't actually run on a GLIBC 2.20 system: they'll give a 'not found'
error. I had an ARM Mason set up to use a shared artifact cache which
was already being used by a much older ARM distbuild network. The older
distbuild had built and uploaded the bootstrap chunks stage1-binutils,
stage1-gcc etc. The new Mason was working fine with these until it had
to rebuild a bootstrap chunk: I think it was stage2-make. At this point
it tried to *run* the tools from stage1-binutils and stage1-gcc to build
make, and they didn't work because they were linked against the wrong
ld.so. Result: a confusing build failure!
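
The 'not found' error comes from the kernel failing to find the dynamic loader that the binary names in its ELF PT_INTERP header, not from the binary itself being missing. As a rough diagnostic sketch (my own helper names, not part of Morph), something like the following reads that path out of an executable and checks whether it exists on the current system:

```python
import os
import struct

PT_INTERP = 3  # program header type that holds the interpreter path


def interp_from_elf_bytes(data):
    """Return the PT_INTERP path of an ELF image, or None if it has none."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF file")
    is_64bit = data[4] == 2             # EI_CLASS: 1 = 32-bit, 2 = 64-bit
    end = "<" if data[5] == 1 else ">"  # EI_DATA: 1 = little-endian
    if is_64bit:
        (phoff,) = struct.unpack_from(end + "Q", data, 0x20)
        phentsize, phnum = struct.unpack_from(end + "HH", data, 0x36)
    else:
        (phoff,) = struct.unpack_from(end + "I", data, 0x1C)
        phentsize, phnum = struct.unpack_from(end + "HH", data, 0x2A)
    for i in range(phnum):
        base = phoff + i * phentsize
        (p_type,) = struct.unpack_from(end + "I", data, base)
        if p_type != PT_INTERP:
            continue
        if is_64bit:
            (offset,) = struct.unpack_from(end + "Q", data, base + 8)
            (size,) = struct.unpack_from(end + "Q", data, base + 32)
        else:
            (offset,) = struct.unpack_from(end + "I", data, base + 4)
            (size,) = struct.unpack_from(end + "I", data, base + 16)
        return data[offset:offset + size].rstrip(b"\0").decode()
    return None  # statically linked, or otherwise has no interpreter


def interpreter_missing(path):
    """True if the binary at `path` asks for a loader that does not exist."""
    with open(path, "rb") as f:
        interp = interp_from_elf_bytes(f.read())
    return interp is not None and not os.path.exists(interp)
```

On the GLIBC 2.20 system I would expect `interpreter_missing()` to return True for the stage1 tools pulled from the old cache, since they ask for /lib/ld-linux.so.3.
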
I think we can fix this ABI break by creating a compatibility symlink in
/lib, and I'll try doing that. But it highlights a hole in our story of
'everything is reproducible and it won't randomly break for you'. Here
is a list of things we could do to tie that hole closed.

Note when I say "host system" below I'm referring to the Baserock system
that is running `morph build`. Not VM hosts.

1) Include the cache key of the host system in the cache key of each
chunk built in 'bootstrap' mode.
- assumes Morph is running on a Baserock system
- makes cache.baserock.org less useful, because unless you're running
the exact same build-system as the Mason that built the artifacts, your
Morph will come up with different cache keys for the same commit of
definitions and will build everything locally.
2) Include the cache keys of certain chunks from the host system in the
cache key of each chunk built in 'bootstrap' mode.
- assumes Morph is running on a Baserock system, and makes assumptions
about the makeup of that system
- non-trivial to implement
At present the list of chunks would be something like: glibc, binutils,
gcc. The list would have to be maintained by a human in definitions.git,
and we'd have to make a judgement call about where to draw the line.
Once you start digging you realise every component *could* affect how
something is built, so I think going down this road is a bit pointless,
but it might be a worthwhile stepping stone while we try to achieve (1).
3) Always build bootstrap chunks locally
- will make getting started slower for everyone (each new user will need
to build GCC twice locally before doing anything else)
4) Statically link bootstrap chunks.
- requires some big changes to the bootstrap; I don't know if it would
actually be possible to statically link everything in stage1 and stage2
- stage1 and stage2 artifacts would be bigger
- still vulnerable to broken versions of GCC/G++/Binutils generating bad
- will only fix our current example Linux/GNU C/C++/Fortran bootstrap
(the 'build-essential' stratum), will not solve the problem for everyone
(although the principle would remain the same)
5) Make sure we never break our reference bootstrap
- this is our current approach, but it clearly doesn't always work, and
it continually requires developers and reviewers to think about whether a
change might break the bootstrap on any of our platforms
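
To illustrate the difference between options (1) and (2), here is a small sketch. Everything in it is hypothetical: the function names, the dictionary layout, the placeholder key strings and the use of SHA-256 over JSON are my assumptions for the example, not how Morph computes keys today.

```python
import hashlib
import json


def _digest(fields):
    payload = json.dumps(fields, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def bootstrap_key_option1(chunk_fields, host_system_key):
    """Option (1): mix the whole host system's cache key into the chunk key."""
    return _digest({"chunk": chunk_fields, "host-system": host_system_key})


def bootstrap_key_option2(chunk_fields, host_chunk_keys,
                          relevant=("glibc", "binutils", "gcc")):
    """Option (2): mix in only a hand-maintained list of host chunk keys."""
    selected = {name: host_chunk_keys[name] for name in relevant}
    return _digest({"chunk": chunk_fields, "host-chunks": selected})


# Placeholder description of a bootstrap chunk.
chunk = {"name": "stage1-binutils", "arch": "armv7lhf", "commit": "0123abcd"}

# Two hosts with the same toolchain chunks but different system keys,
# e.g. a build-system and a devel-system from the same definitions.
toolchain = {"glibc": "g-new", "binutils": "b-1", "gcc": "c-1"}
build_host = {"system": "build-system-key", "chunks": toolchain}
devel_host = {"system": "devel-system-key", "chunks": toolchain}

# A much older host whose C library differs.
old_host = {"system": "old-system-key",
            "chunks": {"glibc": "g-old", "binutils": "b-1", "gcc": "c-1"}}

opt1_build = bootstrap_key_option1(chunk, build_host["system"])
opt1_devel = bootstrap_key_option1(chunk, devel_host["system"])
opt2_build = bootstrap_key_option2(chunk, build_host["chunks"])
opt2_devel = bootstrap_key_option2(chunk, devel_host["chunks"])
opt2_old = bootstrap_key_option2(chunk, old_host["chunks"])
```

In this sketch option (1) gives the build-system and devel-system hosts different keys for the same chunk, so they cannot share artifacts, while option (2) lets them share and still separates the old host because its glibc key differs. That is the trade-off: (2) shares more, but only catches differences in chunks somebody remembered to list.
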
Anyone have any more suggestions for how we could solve this? Any

In writing this email I've convinced myself that (1) is really the only
option and so we should make it happen sooner rather than later, on the
assumption that the longer we leave it the more painful it will become.
If the Mason systems are always running latest master of definitions,
then as long as you are also running latest master, you'll be able to
use artifacts from cache.baserock.org. Except if you run a devel-system
instead of a build-system you won't, because it'll have a different
cache key. I have a possible idea for how to fix this if we moved to
allowing arbitrary nesting of components, but I think this email has
gone on a while already, and it's not something that is going to get
done next week.

Please let me know your thoughts! This is a bit of a mindbender so don't
be afraid to ask questions. I know that Emmet has expressed a desire for
(1) multiple times already, but I think it's only been in IRC so I
haven't tried to hunt out previous discussions from the mailing list.

Sam Thursfield, Codethink Ltd.
Office telephone: +44 161 236 5575