Background
==========
There's been a lot of work done in the Baserock project (and many, many
other projects!) around how best to take advantage of build machines
with many cores, and/or setups with multiple build-servers (eg
distributed team, build farm, cloud). Baserock's distbuild functionality
has been developed and redeveloped to work on multiple architectures,
and there have been several implementations of mason (continuous
integration), and some exploratory implementations of firehose
(automated integration).
Broadly our requirement is to split large jobs into parallelizable
component tasks, and spread them across available systems, processors
and/or cores. Then we integrate the results as artifacts, and publish
them. We want to be able to initiate jobs as users, alongside
automatically-generated jobs triggered by changes from upstreams. We
want the solution to work reliably on several processor architectures,
both big-endian and little-endian. We want to output the built artifacts
automatically. And we want to minimise wall clock time waiting for
builds.
Current Approach
================
For users working on a single machine, we have:
- a commandline tool (morph, ybd) to work out the build order and run
the component tasks
- a local cache of artifacts made by the tool, or downloaded from a
server of published pre-built artifacts
For users with access to a build farm (distbuild):
- a scheduler/controller to break up the jobs and farm out the
individual tasks
- build workers to do the tasks (running on a range of architectures),
using the command line tool (morph only)
- local cache of built/downloaded artifacts for each worker
- shared cache for uploaded/published artifacts
For continuous integration (mason):
- a server which initiates jobs on a distbuild network and presents the
results/status via the web
For automated integration (firehose):
- a server which monitors a set of git repos on another server, and
triggers jobs for distbuild algorithmically
New theory
==========
Some months ago it occurred to me that it might be possible to simplify
the distbuild approach, and also improve the mason/firehose approach, by
just having a gang of workers and throwing all the jobs at all of them.
I know this seems ridiculous - all the workers will attempt all the
work, constantly racing against each other. But what if each worker
randomises its task order, and they all share an artifact cache? There
will still be some races, but once a task is completed later workers
will find the artifact, so no need to re-do the work.
So in theory the overall wall-clock time for a gang of workers could be
less than for one worker.
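To make that concrete, here is a minimal sketch of what one worker in
the gang does. This is not ybd's actual code: the component attributes
('cache_key', 'dependencies'), the shared dict-like cache and the
build_one() callable are all placeholders for illustration.

    import random
    import time

    def gang_worker(components, cache, build_one):
        # 'components' is a dependency-ordered list; each item has a
        # 'cache_key' and a list of its dependencies' cache-keys.
        todo = list(components)
        while todo:
            # pick randomly among tasks whose dependencies are already
            # cached, so workers tend to spread out over the job
            ready = [c for c in todo
                     if all(d in cache for d in c.dependencies)]
            if not ready:
                time.sleep(1)  # a dependency is still being built elsewhere
                continue
            component = random.choice(ready)
            todo.remove(component)
            if component.cache_key in cache:
                continue       # lost a race: the work is already done
            cache[component.cache_key] = build_one(component)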
But our workloads are complex:
- some tasks can be parallelised (eg build-commands for some
components)
- some can not (eg configure-commands, or creation of build-essential)
- sometimes jobs will mostly rely on existing cached artifacts (eg just
rebuilding python projects)
- some require rebuild of pretty much everything (eg change to GCC or
glibc).
As a result it was not obvious whether the proposed simplification
would actually work in practice, short of testing a real
implementation.
Practice
========
A couple of weeks ago I got around to making ybd parallelizable, with
atomic updates to the artifact cache and non-clashing sandbox areas.
This weekend I've had a chance to test it on some realistic scenarios.
I've been using an AWS m4.10xlarge machine with 40 cores and 160GB of
memory (2.4 GHz Intel Xeon® E5-2676 v3 Haswell, $2.50 per hour). I've
also run the same builds on my MacBookPro to get comparable times.
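The two changes that matter for running several instances against one
cache are atomic cache updates and a separate sandbox per instance.
Here is a minimal sketch of one way to do that - the paths and
function names are illustrative, not ybd's actual layout:

    import os
    import shutil
    import tempfile

    def cache_artifact(cache_dir, cache_key, built_tree):
        # Stage the tarball in the same filesystem as the cache, then
        # rename it into place: rename() is atomic, so other workers
        # never see a half-written artifact and the last writer wins.
        os.makedirs(cache_dir, exist_ok=True)
        staging = tempfile.mkdtemp(dir=cache_dir)
        tarball = shutil.make_archive(
            os.path.join(staging, cache_key), 'gztar', built_tree)
        os.rename(tarball, os.path.join(cache_dir, cache_key))
        shutil.rmtree(staging)

    def make_sandbox(base_dir, instance_id, cache_key):
        # Each worker instance builds in its own directory, so
        # concurrent builds of the same component can't clash.
        path = os.path.join(base_dir, 'sandbox-%s' % instance_id, cache_key)
        os.makedirs(path, exist_ok=True)
        return path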
The results seem quite promising:
Scenario 1:
-----------
Target is build-system-x86_64. No pre-built artifacts are present in
the cache:
- MacBookPro, instances = 1, max-jobs = 6: [TOTAL] Elapsed time
02:39:04 [1]
- AWS m4.10xlarge, instances = 1, max-jobs = 60: [TOTAL] Elapsed time
01:26:37 [2]
- AWS m4.10xlarge, instances = 6, max-jobs = 10: [TOTAL] Elapsed time
01:04:29 [3]
So the gang achieved a time saving of 25%, which seems worth having.
Note that some of the longest tasks in this scenario don't parallelise,
since everything depends on build-essential.
Scenario 2:
-----------
Target is build-system-x86_64. Artifacts for base-system-x86_64-generic
already exist in the cache:
- MacBookPro, instances = 1, max-jobs = 6: [TOTAL] Elapsed time
00:45:25 [4]
- AWS m4.10xlarge, instances = 1, max-jobs = 60: [TOTAL] Elapsed time
00:19:33 [5]
- AWS m4.10xlarge, instances = 6, max-jobs = 10: [TOTAL] Elapsed time
00:08:36 [6]
Actually the final artifact was completed by instance 5 at 00:07:08,
but the rest of the instances took a long time to notice and stop. So
the gang delivered the result in 36% of the single-instance time, a
saving of 64%. Much more of the work in this scenario could be done in
parallel.
Scenario 3:
-----------
Target is all the x86_64 systems in ci.morph. No pre-built artifacts
are present in the cache:
- MacBookPro, instances = 1, max-jobs = 6: [TOTAL] Elapsed time
07:13:37 [7]
- AWS m4.10xlarge, instances = 1, max-jobs = 60: [TOTAL] Elapsed time
02:53:51 [8]
- AWS m4.10xlarge, instances = 6, max-jobs = 10: [TOTAL] Elapsed time
01:46:54 [9]
Incidentally my first attempt at the single-instance run crashed on
gstreamer, after 535 tasks of the allotted 652 (no idea why). I've seen
some other occasional failures over the last few weeks, and this
highlights that the gang approach is more robust against
hard-to-reproduce errors - if one worker fails, it doesn't stop the
whole job.
Next steps
==========
There's still work to do, for example:
- 6 instances, 10 max-jobs was chosen on a hunch, but it may be worth
trying various combinations, plotting the results and working out the
sweet spot. This will likely be system-specific, varying with
available memory, I/O bandwidth, core count and clock speed.
- ybd doesn't have a reliable/scalable cache-server and upload
mechanism yet, so I've not tried true distributed building.
- not sure how this could best work for multi-arch distbuild use-cases
- currently ybd just skips anything that doesn't match its host
architecture.
- there's clearly room for optimisation (eg kill the gang once the
target artifact is delivered - see the sketch below)
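For that last point, the simplest form is probably a small watcher
that polls the shared cache and stops the remaining workers once the
target's artifact has appeared. A rough sketch, with illustrative
names only:

    import os
    import signal
    import time

    def stop_gang_when_done(cache_dir, target_cache_key, worker_pids):
        # Once the final artifact is in the cache there is nothing
        # left worth racing for, so tell the other workers to stop.
        target = os.path.join(cache_dir, target_cache_key)
        while not os.path.exists(target):
            time.sleep(10)
        for pid in worker_pids:
            os.kill(pid, signal.SIGTERM)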
br
Paul
Note that ybd logs have lots of information in them, but the styling
may be obscure at first - for example:
- outputs from all gang workers are mixed together as they happen
- the instructions the workers run are mixed in too
- and so is the tail from the log of any crashed instruction
- for the gangs, a single digit on its own at the start of a line
denotes which worker is logging.
- all situations where a race occurred are logged with the word 'Bah!'
in them, so they are easy to grep (see the sketch after this list)
- all timings are logged with the word 'Elapsed' in them
- the counter shows [artifacts done/artifacts to do/total artifacts]
- for a gang, no individual worker instance will reach the 'artifacts
to do' total
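As a rough illustration of the grep-ability, a few lines of Python
will pull the interesting lines out of a gang log and count the races
per worker (the log filename is just a placeholder):

    from collections import Counter

    races = Counter()
    with open('ybd.log') as log:
        for line in log:
            if 'Bah!' in line:
                races[line[0]] += 1   # leading digit = worker instance
            elif 'Elapsed' in line:
                print(line.rstrip())
    print('races per worker:', dict(races))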
[1] http://sprunge.us/PhQS
[2] http://sprunge.us/HYGf
[3] http://sprunge.us/MjQX
[4] http://sprunge.us/HVIV
[5] http://sprunge.us/FPfJ
[6] http://sprunge.us/RGgH
[7] http://sprunge.us/UAZa
[8] http://sprunge.us/bWea
[9] http://sprunge.us/THBA