tb3 — more efficient tinderboxing – You can't take the sky from me.

Stop right there I gotta know right now before we go any further

…

Let me sleep on it and I’ll give you an answer in the morning

— Paradise by the Dashboard Light, Bat out of Hell, Meat Loaf

So, I did some work recently to possibly make our tinderboxes more efficient and scalable — which is a bit ironic as I recently hinted others at Paul Grahams advise to “do things that do not scale”. At LibreOffice we currently have tinderbox setup that served us as good as it could in the first years: It gave a quick overview of the basic health of current development branch of LibreOffice. But LibreOffice takes some time to build and test and with 50-100 commits to master each day it is playing catch-up with a moving target.

And whle they did a good job at this, they also have a few distinct weaknesses: For one, these tinderboxes would also mail everyone who commited on a branch since the last known good build if they were unhappy. Since they do not know anything about each other, with a generic breaker each tinderbox would do that on its own. In a tragic imitation of a certain comic this would result in the incremental Linux tinderbox reporting after 5 minutes something went wrong, with all the other tinderboxes dribbling in with the same message over time, finalized by the full Windows build tinderbox excitedly reporting to 200 people (as a slow builder would have more commits between builds) that something was amiss — possibly hours after it was fixed again. This resulted in these messages being filtered away by most users and even worse: the Windows tinderbox reports, which should be the most useful of them, as most developers use Linux as development platform, being easily ignored as “someone else broke it”.

So I set out improve the situation with the initial goal:

to start make tinderboxes being able to coordinate
to make it possible to easily collate the information from multiple builders
while leaving the control over what is build with the owner of the tinderbox (as most of these boxes are sponsored, we dont want to make them into drones)
for slow platforms like Windows or ARM enable bisecting a breaker as the frequency of builds is too low for those in the commit range to feel personally responsible
while bisecting a breaker, also keep an eye one the branch moving forward (as in: dont try to bisect a breaker further when it was fixed in the meantime)

And I am happy to report to have reached this initial goal with tb3 which is a tinderbox coordinator written in Python3 and having as many lines of codes for unittests as for the product itself. So how is tb3 intended to work?

Leaving control over what is build with the owner of the tinderbox

tb3 is build around the idea, that the information about the state of the source is collected and managed by a central “tinderbox coordinator” and one or more tinderboxes go to it to:

ask for something to build, giving the coordinator a branch and a platform that they are interested to work for
report that they have started to build a certain state and give an estimate on when they will be finished
report that they have finished to build a certain state and give a result

Note that the first two steps are separate: The tinderbox is essentially just asking for a suggestion on what to build — its not promising to actually follow these proposals. It can come back and report to be building something completely different(*). Now the proposals the coordinator hands out come with a score. Just looking at a classical tinderbox mode, which will always build the current HEAD of a branch on a specific platform, the score of the highest ranking proposal will be equal to the number of commits since the last finished build. With tb3, a tinderbox can watch multiple branches (e.g. a development branch and a release branch) and commit itself to building the one which saw the most commits since the last finished. It can also use multipliers and use something like “if there are 10 times as many new commits on the development branch as on the release branch, then build that, otherwise stick to the release branch” or use limits: “I only want run a build if there are at least 5 new commits”.

Coordinating multiple tinderboxes

So how do we coordinate multiple tinderboxes and ensure that e.g. if someone pushes 9 commits to master, we do not get five Linux tinderboxes to build that last commit and then sprinkle everyones mailbox over the next hour? Here is where the “coordinator” part truly kicks in. The first tinderbox that asks for something to build will get proposals with scores as shown by the green line in the chart below: The highest score is the “9” of the newest commit — the commit that has the biggest distance from the last build. If the first tinderbox reported to have taken on that proposed build, what would a second tinderbox that also asks to build something see? It makes little sense to give it the same build as the first tinderbox. Optimistically assuming that tinderbox will report something back, the best thing this second box can do is build something with the biggest distance to to the finished build and to the build running on the first tinderbox. As such, the coordinator will send it scored as denoted by the blue line and if the tinderbox accepts it will build commit 5 — which is why a third tinderbox asking for something to build, while the other two are running, will get proposals as per the pink line and thus be suggested to build commit 3.

proposal scores with tinderboxes just started

Trusting tinderboxes … a bit

Now these tinderboxes “promised” to build some commit. But can we give the tinderbox unconstrained trust? E.g. should we never ever tell any other tinderbox to build this one commit, because some other tinderbox promised to build it? The answer is obviously no: As a tinderbox is a gift, the owner should be allowed to reboot or reassign a tinderbox for other tasks at any time with imprudence. This is why the tinderbox gives the coordinator an estimated duration for its build and the tinderbox coordinator “reserves” this commit for that time. As you did see in the last chart the commit that just had a tinderbox running got scores of zero. As time goes by the coodinator looses trust in the tinderbox to still report back: the chart below shows the scores given after twice the time the tinderbox gave as an estimate has passed. You see the blue line now scores highest at commit 6, not commit 5 and the pink line scores highest at commit 5, not commit 3 — so as the coordinator looses trust in the running tinderboxes to come back, it again proposes to do builds closer to the already scheduled ones.

proposed scores with tinderbox results overdue

Another thing to note is that the highest score is rising: While in the first chart, each running tinderbox lowered the highest score by one (green line: highest at 9, blue line: highest at 8, pink line: highest at 7) after twice the time has passed, the highscores are all around 9 again.

Bisecting a breaker

Should a branch be broken, it usually would be very helpful if the tinderboxes would help bisecting. This is especially true for slow platforms and builds like Windows, ARM or the document load torturer by Markus. However, we do not want the tinderbox to over fixate on that, as our branch is a moving target. If there is a build breaker somewhere in a range of 256 commits, we do not want a slow tinderbox to bust away for 8 builds to find the offending one, and while doing that leave the head of the branch unwatched for a long time. So by default, the bisecting proposals have a highscore that is equal to the number of commits to bisect still. As such, by default, a tinderbox will be told to bisect — as long as:

the head of the branch is still broken
there are more commits in the bisect range, than there are new commit on the branch.

proposal scores of commits in a range to bisect

Otherwise, the tinderbox will be told to build the latest commit, to check if the branch is still broken or fixed in the meantime. As such the coordinator will guard against commiting tinderboxes to bisect a breaker that was already fixed. Therefore the coordinator knows a few more states than plain ‘good’ or ‘bad’ for a commit:

UNKNOWN — nothing known yet
RUNNING — a tinderbox is currently claiming to run this commit
GOOD — a tinderbox was happy with it
BAD — a tinderbox was unhappy with it
ASSUMED_GOOD — not tested, but the previous and the next finished build were good
ASSUMED_BAD — not tested, but the previous and the next finished build were bad
POSSIBLY_BREAKING — not tested, but the previous finished build was good and the next finished build was bad
POSSIBLY_FIXING — not tested, but the previous finished build was bad and the next finished build was good
BREAKING — this one was bad, while the previous commit was good

Here is some example output

$ ./tb3-show-history --repo ~/checkouts/core.git --platform linux --branch 65134fb75c3e94b7869fb6d490f88bf4b252760e --history-count 10
65134fb75c3e94b7869fb6d490f88bf4b252760e started on 2013-07-25 17:27:30.383767 with builder ubuntu-tinderbox and finished on 2013-07-25 17:40:41.226494 -- artifacts at 65134fb75c3e94b7869fb6d490f88bf4b252760e-137476605045.out, state: BAD (took 0:13:10.842727)
6100d94078d37cb1413a0e45460cee480ba3e211 started on None with builder None and finished on None -- artifacts at None, state: ASSUMED_BAD
24d46ea66485ff8b5bca49ec587b41547787bf42 started on None with builder None and finished on None -- artifacts at None, state: ASSUMED_BAD
d041980a7aad0e6d111752ca98db42f9853a3c6b started on 2013-07-25 17:40:52.587150 with builder ubuntu-tinderbox and finished on 2013-07-25 17:53:04.204549 -- artifacts at d041980a7aad0e6d111752ca98db42f9853a3c6b-137476685269.out, state: BAD (took 0:12:11.617399)
3b28ec6855e5df0629427752d7dafae1f0a277d4 started on None with builder None and finished on None -- artifacts at None, state: ASSUMED_BAD
cca0b9ae02603ab88ec7d8810aab2a8a1b4efda2 started on 2013-07-25 18:08:01.201013 with builder ubuntu-tinderbox and finished on 2013-07-25 18:20:39.536451 -- artifacts at cca0b9ae02603ab88ec7d8810aab2a8a1b4efda2-137476848124.out, state: BREAKING (took 0:12:38.335438)
767b02bd7614059dd80d0cd1be306d9b63291f31 started on 2013-07-25 17:53:14.745394 with builder ubuntu-tinderbox and finished on 2013-07-25 18:07:42.527839 -- artifacts at 767b02bd7614059dd80d0cd1be306d9b63291f31-137476759480.out, state: GOOD (took 0:14:27.782445)
c852f83bc4d91de51c61ad4be0edf1b848247eaa started on None with builder None and finished on None -- artifacts at None, state: ASSUMED_GOOD
0d874ee2e452ea67c03a27bf1a7f26d0ffc617dc started on None with builder None and finished on None -- artifacts at None, state: ASSUMED_GOOD
ff14c3b595ebe71153f97ebb8871cf024ea76959 started on 2013-07-25 17:12:58.024727 with builder ubuntu-tinderbox and finished on 2013-07-25 17:27:17.439374 -- artifacts at ff14c3b595ebe71153f97ebb8871cf024ea76959-137476517809.out, state: GOOD (took 0:14:19.414647)

Some details and missing bits

The coordinator stores the results in git notes as JSON objects. This has multiple advantages: There is no need for a external database and the state of the notes are under revision control. It also has one disadvantage: Its not exactly quick. However the revision control can help to mitigate that mostly — as e.g. a webfrontend can easily ask: “what changed on the state since I last polled you?” and do incremental updates from there.

Which brings me to the missing bits: The stuff that tells the world the state of the repo on a webfrontend, RSS feed, IRC Bots or via email digests. The second missing bit is some kind of privilege separating between the tinderboxes and the coordinator. tb3 is currently churning away on the Sun Ultra 24 that I donated to the Document Foundation doing duty as an Ubuntu tinderbox, but coordinator and tinderbox are still running on the same account — even though as separate processes. As setuid for scripts is messy business, I plan to give tb3 a trivial REST-like interface on a non-public HTTP server. In addition to being able to offload the authentication and authorization problems outside of tb3 to something considering it a solved problem, it also makes integration in webfrontends etc. simple (esp. given that all the data is in JSON already anyway.)

In the long run, the scoring of tb3 also should make it easier for the buildbots that do duty on gerrit to make a call on if they should test build something there or if their help is more needed for tinderbox duty.

tl;dr

tb3 can:

coordinate multiple tinderboxes working on the same build scenario or branch
coordinate one tinderbox working on multiple build scenarios or multiple branches
make tinderboxes bisect without loosing sight of the head of a branch
especially help tests and builds that are painfully slow

They can also create builds for bibisect along the way, but that is a story for another day.

(*) This is helpful for some test suites like e.g. subsequentcheck. If you do a build as proposed by the coordinator, you can cheaply report back the result of the build only. And since you then can just the subsequentcheck test suite on top of the build of that commit (and only on that commit), you can then report to be running these tests and report the results without ever caring if the coordinator thinks this commit has as high priority for this.

postscriptum: Yeah, I know, I promised to be on vacation now and not harass you with any posts, but this is a scheduled blogpost and as such does not count.