Unbreakable builds
[ vmassol ] 15:18, Wednesday, 29 December 2004

Let's create Unbreakable Builds

Of my last two development projects, one had a strong sense of quality and excellence in general, and continuous build failures were the exception (about 3-4 per week for a team of 30 developers). The other was quite the opposite: everyone was surprised when the continuous build passed (there were about 5 build breaks a day on average for a team of 40 developers). I'm sure this is also pretty common on other projects. Obviously the best approach is to build (pun intended) build awareness in the team. However, for this to happen you'll need strong evangelists, who may not always be available, and other circumstances may make it difficult.

A thought struck me about a year back: what if we were able to prevent the continuous build from failing by design? There's a French saying that goes something like "it's better to prevent than to cure". I think this is definitely a good idea to apply to continuous build failures. Why not make a continuous build system that cannot fail? At that time I thought it was a nice idea (I had meant to blog about it but I forgot) but I could not see very well how it could work. Now, a year later, I really think it's a nice idea and I'd like to explore it.

The architecture

A potential basic architecture is shown in figure 1.

The general principle is to catch the commit data before they get committed to the SCM, to perform a build and to perform the actual commit only if the build is successful. Here are the detailed steps:

  1. The developer performs a commit using his favorite SCM client tool. Note that it is best if the tool is able to perform the commit asynchronously so that the developer can continue working on something else.
  2. The committed data are intercepted using a pre-commit hook script (all modern SCMs support this). This script is in charge of doing 2 things:
    • Finding out the list of projects to be built. Say that the commit contains 5 files belonging to 2 different projects: we need to rebuild these 2 projects. The algorithm for finding out the projects to which the changed sources belong can be as simple as a mapping between the file paths (which contain the project name) and the project name.
    • Creating a build job and pushing it on a queue. The reason for the queue is that building all the projects on the machine that hosts the SCM is not going to be scalable. We want the SCM to be as responsive as before. Hence the queue.
  3. We need build machines to perform the actual build. They could be dedicated build machines that continuously process build jobs. They could also be developer workstations. The concept is to have one or several build kicker applications installed on those machines. The "continuous build kicker" will continuously get a job from the build job queue and build it, whereas the "idle build kicker" will only pick a job to build when the machine is idle (hey, look around you and see how many machines are unused because people are on holiday, sick, in a meeting, etc. That's a lot of power).
  4. The build kickers start by updating their workspace to have the latest files for the projects associated with the changed files. Then they try to "merge" the changed files into their workspace (note: this may be the tricky part to implement unless the SCM offers a way in the pre-commit hook to get the full file - I need to explore this). If they cannot succeed, they stop with an error message that flows back to the user. This can happen if someone else has been working on the same source and their change has made it to the SCM before ours has. If the merge succeeds, the build kicker starts the build. The build has to be relatively quick, so you should not build all the projects. I suggest building the modified projects and the ones that directly depend on them so that an API break can be detected (more on that below).
  5. When the build is finished (or if an error occurs), the build kicker sends the result back to the pre-commit hook (using an RPC mechanism for example).
  6. If the result is positive, the pre-commit script performs the real commit to the SCM.
  7. The resulting message is returned to the user. In case of error the user would see, for example, the build console log.
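
The steps above can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions, not a working hook for any particular SCM: the path layout (first path segment is the project name), the BuildJob class, and the build and commit callables are all hypothetical, and an in-process queue stands in for whatever job queue and RPC mechanism a real system would use.

```python
from dataclasses import dataclass
from queue import Queue
from typing import Callable, Iterable


def projects_for(changed_files: Iterable[str]) -> set[str]:
    """Map file paths to project names.

    Assumption: the first path segment is the project name, so
    "billing/src/Invoice.java" belongs to project "billing".
    """
    return {path.split("/", 1)[0] for path in changed_files}


@dataclass
class BuildJob:
    """Hypothetical description of one pending commit."""
    author: str
    changed_files: list[str]
    projects: set[str]


def pre_commit_hook(author: str, changed_files: Iterable[str],
                    job_queue: Queue) -> BuildJob:
    """Step 2: find the affected projects and push a build job on the queue."""
    files = list(changed_files)
    job = BuildJob(author, files, projects_for(files))
    job_queue.put(job)
    return job


def build_kicker(job_queue: Queue,
                 build: Callable[[BuildJob], tuple[bool, str]],
                 commit: Callable[[BuildJob], None]) -> tuple[bool, str]:
    """Steps 3 to 7: take one job, build it, commit only if the build passed.

    Returns (success, log) so the log can be sent back to the committer.
    """
    job = job_queue.get()
    ok, log = build(job)
    if ok:
        commit(job)
    return ok, log
```

In this sketch a failed build simply never reaches commit(), which is the whole point of the design: the repository only ever receives changes that have already built.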

Advantages

Here are the advantages of such a system:

  • Does not break other developers upon a build failure. All developers can work uninterrupted even though they can still work on HEAD in a continuous integration fashion
  • Lowers the effort required to get a CI system working, and thus helps teams adopt CI
  • Prevents breakage of APIs. In step 4 above, we mentioned that a good strategy is to build not only the projects that have changes but also all projects that directly use those projects (one level). This will allow detecting unwanted API breakages.
  • Increases self-confidence when committing, which (I hope) will make it easier to get developers to commit continuously
  • Allows you to continue working on your own machine (instead of having to wait for the current build to free the CPU, which is being used at 100%!). You now get your own PBS (Personal Build Server)
  • Forces atomic commits!
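
The "one level" rule for catching API breakages can be sketched as follows. The dependency map and project names are made up for illustration; a real build would read them from its project descriptors.

```python
# Hypothetical dependency map: project -> projects it depends on directly.
DEPENDS_ON = {
    "core": [],
    "billing": ["core"],
    "web": ["billing"],
    "reports": ["core"],
}


def projects_to_build(changed, depends_on):
    """Return the changed projects plus their direct (one-level) dependents."""
    changed = set(changed)
    dependents = {
        project
        for project, deps in depends_on.items()
        if changed.intersection(deps)
    }
    return changed | dependents
```

With this map, a change to "core" triggers "core", "billing" and "reports", but not "web", which only depends on "core" indirectly.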

Questions/Issues

I'm sure you're now burning with tons of remarks/questions showing why it wouldn't work :-) Here's what I've currently thought about. If you have any opinion or other questions, I'd love to hear them.

Q: What happens if someone else also commits a change to the same file?

It works in the same way as usual. The build kicker will try to "merge" the changes after having done a workspace update and if it cannot, the user will get an error explaining that the merge failed. The user will then need to perform an update on his local machine and resolve the conflict.
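
As a rough illustration, the build kicker could detect the situation by comparing the revision each changed file was based on with the revision currently in the repository. This sketch is deliberately stricter than a real merge (which can often combine non-overlapping edits to the same file), and the revision maps are hypothetical inputs that the SCM would have to supply.

```python
def stale_files(base_revisions, head_revisions):
    """Return changed files whose repository revision has moved on.

    base_revisions: path -> revision the developer started from
    head_revisions: path -> revision currently in the repository
    """
    return sorted(
        path
        for path, base in base_revisions.items()
        if head_revisions.get(path) != base
    )
```

An empty result means nobody else touched those files; anything else is a candidate for the "merge failed, please update" message.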

Q: Imagine I perform a commit and I start working on a new feature. Then my commit is rejected because of a failure. How do I fix this without losing my current changes?

Answer 1: This is actually relatively similar to what you're currently doing. Imagine you're committing something. Then you start working on something new and the continuous build tells you 2 hours later that your change has broken something. The difference is that your changes have been committed so you can easily create a new workspace and fix it there. We could do the same here by having the pre-commit hook actually make your changes available through a URL (sent in the commit answer) as a patch so that it is easy for you to apply it to a fresh new checkout.

Answer 2: You wait till the build is finished on the server. You can perform other activities like documenting, reading, thinking, designing, writing new classes, new tests, etc. Basically you work on stuff that does not conflict with the past changes. Actually this is probably what you're currently doing when your build is running, as it is eating all your CPU...

Q: Doesn't it take too long to build?

You need to ensure your build is taking as little time as possible. I think 5-10 minutes should be ok. The best way to achieve this is probably to use binary dependencies instead of rebuilding dependent projects (a la Maven), except maybe direct dependencies. You'll still need a continuous build running continuously to produce fresh binary dependencies. I guess it's also best to use an SCM client that can do asynchronous commits in order to let you continue working while the commit is in progress.

Q: What if I want to modify an API but I want each project to modify its own files?

Several options:

  • You could go through a deprecation cycle.
  • You could do the refactoring on one machine only (not always possible).
  • You could also plan it. In any case, an API breakage has to be planned and communicated. Thus you could say: on that day, at such an hour, we're going to commit this break and we have 1 day to fix all our dependent projects. When this happens you can turn off this "unbreakable build" feature for the day.

The interesting point here is that you *want* the API breakage to be detected as the default instead of the opposite.

Conclusion

It seems to me this would be particularly useful on big projects with lots of developers. It should also be useful for introducing continuous integration on an existing project, as it lowers the discipline required of everyone. Obviously this is just an idea that I haven't tested yet. I'm very keen to see this in action. If any of you have any experience with it, please share it. I'm planning to spend some time trying to implement it. If you're interested in helping out, let me know too.


Comments

There used to be an SCM product, I think named Aegis, that was supposed to be another solution to this problem. If I remember correctly, you could define conditions (like regression tests) that had to pass before the repository was updated. I never used it, so I have no idea how well it worked.

--Craig Cottingham, December 29, 2004 09:01 PM

That would be better. Forcing a run of all unit tests before a check-in is better than reacting later to a broken build. Unit tests should run in less than a minute with sufficient use of mock objects. Of course acceptance/functional tests could still break, but it's less likely if unit tests are at 100%.

--Michael Slattery, December 29, 2004 10:58 PM

How about properly training your staff on software development, and increasing team communication? That would be much simpler, and solve the problem without creating a big fucking mess of a build system like you have proposed here. This sort of thing treats developers like idiots. Either your developers ARE idiots (in which case, it's your own dumb fault for hiring them/working there), or they are capable of learning proper development techniques.

--Dave, December 30, 2004 03:20 PM

My big concern is the "Doesn't it take too long to build" one. I agree with your assessment that 5-10 minutes is a reasonable maximum. But builds on reasonable sized projects take much longer. Do we really have to solve the (much harder!) problem of incremental builds (ie, re-compiling only those things that depend on the changed files -- even going as far as re-running only those tests that touch changed code) to implement unbreakable builds?

--Michael Chermside, December 30, 2004 04:10 PM

Hi Michael,

I should probably have explained this part more. I am also working on a big project and our build takes at least 2 hours. But that's the *full* build. Our project is divided into smaller projects and each subproject can be built standalone. In order to do this we are using binary dependencies of the other dependent projects. The main continuous build is in charge of continuously producing fresh binary dependencies. A single project build should not take more than 5-10 minutes.

In any case, it should not take longer than what the developer is currently building on his local machine (it's the same build!). That said, it would be great if we could also build directly dependent projects as this would prevent non-planned API breakages. I'm still unsure how far we could go with this idea as it may take longer. Maybe randomly choosing a single directly dependent project would do the trick (even if not perfect and not guaranteeing the "unbreakable" part).
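
The random-sampling idea could look like this small sketch. The function name and the project names in the usage are made up; the rng parameter exists only so the choice can be made repeatable.

```python
import random


def sample_dependent(direct_dependents, rng=random):
    """Pick one directly dependent project to build, or None if there are none."""
    candidates = sorted(direct_dependents)
    return rng.choice(candidates) if candidates else None
```

Over many commits every dependent project gets built sooner or later, but any single commit can still slip an API break past the check.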

--Vincent Massol, December 30, 2004 04:24 PM

Very interesting to read indeed. But I think it is mainly suitable for a large project, with the downside of long build times.

--Muhammad Mansoor, December 31, 2004 06:24 AM

Peter Miller's Aegis system (which has been around since ~1992 I think) has the notion of a forced pre-commit hook. The intent/hope is that it's "hooked up" to an automatic build+smoke test that ensures that the codeline builds *and* passes regression testing before committing.

Other systems that are language-aware have attempted to ensure consistency + integrity by doing semantic analysis in an attempt to guarantee that a checkin won't break a build (but without actually doing the build - since it may be prohibitively long for a large system).

These days, it's fairly standard to see a shop use a pre-commit hook to ensure that some kind of build+smoke and regression testing have been run (tho possibly incremental rather than full).

What you are proposing seems to be a deployment architecture for enforcing and automating such a protocol, and grappling with issues of codebase size and build-time and distribution.

Another possibility ... rather than try to be "perfect" up front and take such pain to ensure there will be no build failures ... what if we could let the build failures happen (albeit with a culture that strongly discourages that, rather than handcuffs and a guaranteed extra overhead) AND what if we had a way of detecting *AND* correcting the build failures REALLY REALLY QUICKLY!

Suppose we attempted this kind of self-healing (autonomic) build-system. Thus when the frequent+regular integration build took place (be it daily/nightly, or more often), then only in the case of failure, there was an application with some "smarts" built-in to it that could attempt to analyze the result, and attempt to diagnose and then remedy (or at least do this to the greatest extent possible, and give that "useful" information in a timely manner via an automated notification)?

--Brad Appleton, December 31, 2004 06:46 AM

Interesting design. I can see two non-trivial problems with it, first of all: it's extremely hard to get a build verification process down to 5-10 minutes for larger projects, especially if it's going to be effective in finding problems as well. I think the key is a high level of modularisation and good dependency management, but that is something that is very hard to introduce to an existing codebase.

The other problem is also interesting: on a larger project (40+ developers) a lot of commits are going on at the same time. If the build bot runs the build verification process for each changeset serially then it may never catch up with the changesets as developers commit them. The build system needs some way of merging multiple changesets into one changeset and then running the build on them all together.

This leads me to one of my current favourite research areas: change set oriented version control systems. Most of our current popular version control systems are file or repository based (such as CVS, Perforce, Subversion), where the repository is simply a set of directories and files, and each versioned object has a set of versions. A change oriented version control system is organized along a different axis: a repository is a sequence of changes or changesets, and the current state of the repository can be determined by applying all the changes in the correct sequence. Version control systems that operate in this way are for example Arch (http://www.gnu.org/software/gnu-arch/) and BitKeeper. Subversion also has something similar crafted on top of its file/version oriented repository.

Why is this related? With a change oriented version control system a developer produces a "change" instead of committing directly to the repository. A change is a file like anything else. This change file is then sent in some way or other to the central repository (the sequence of "approved" changes), through email, the file system or something like that. To implement the above design one could imagine a developer creating a change file and placing it in a directory. The build system polls the directory and picks up the changes, applies them to the current state of the repository, runs the build verification process and, if successful, adds the changes to the approved sequence of changes (the central repository). In the meantime other developers have submitted new change files to the directory, the build system picks them up, and so on. In case of a failure all the changes are rejected and the developers are notified via email or similar.
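
A minimal Python sketch of that loop body, assuming changes are opaque items held in plain lists and that run_build is a hypothetical callable which builds the state produced by a sequence of changes and returns True on success:

```python
def verify_batch(pending, approved, run_build):
    """Build all pending changes together on top of the approved sequence.

    On success the whole batch is appended to `approved` and an empty list
    is returned. On failure `approved` is left untouched and the whole
    batch is returned as rejected.
    """
    batch = list(pending)
    pending.clear()              # new submissions can now queue up separately
    if not batch:
        return []
    candidate = approved + batch  # repository state = approved changes in order
    if run_build(candidate):
        approved.extend(batch)
        return []
    return batch
```

Building the batch as a unit is what lets the build system keep up with many committers, at the price that one bad change rejects its neighbours too.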

--Jon Tirsen, December 31, 2004 07:45 AM

Maybe I've worked at the wrong places, but all my commercial experience has been with source control systems that don't allow multiple updates to the same file. A developer has to explicitly lock a file before they can check in changes (so if someone else has worked on the same file, they have to merge their changes locally). With this kind of source control system, the proposed suggestion sounds very workable (and works around some of the objections raised)

--Chris, December 31, 2004 08:55 AM

I'm new to continuous integration and only now studying CruiseControl. Are there already existing tools that do the system you propose? My main problem would be finding someone on the team with the time to set up such a system.

--Calen Legaspi, December 31, 2004 09:28 AM

Aegis is exactly what you are proposing but it is not very successful in practice.

Have you heard about optimistic locking?

It is a database technique in which you never lock anything, but version it. Versioning a row means generating a new version of the row, while other transactions can still see the old version. The good thing is that databases get a lot faster, because all readers see the old data, while writers simply write the new versions. When 2 processes conflict, one succeeds; the other fails, rolls back and then redoes the transaction immediately.

The same happens with CVS and continuous integration, but at the source level ;-)

What you are proposing is going backwards at least 10 years... I can mention at least 10 reasons why it wouldn't work. Build breaks are not a big problem, at least not for companies that do continuous integration using available tools like LuntBuild and CruiseControl.

If it ain't broke, don't fix it.

--Ted Bates, December 31, 2004 01:05 PM

Jon, Subversion is a tree based system, so the changeset paradigm can exist well enough there; certainly the file paradigm of CVS/RCS, makes no sense under Subversion.

As for the idea of unbreakable builds. I think I prefer applying the policy that the mainline can be unstable. The idea then is to capture as many stable builds off the trunk as possible. To avoid dealing with multiple writers you capture stable builds asynchronously by running tests, i.e. via cruise or damagecontrol. From stable builds you can determine stable releases. Brad Appleton could probably say more about codeline policies, but I think the basic CI approach plus policies is sufficient here. The main upsides to that approach are reduced stress on people checking in and avoiding code freezes. The unbreakable build approach strikes me as micro-freezing the repository.

The unbreakable approach strikes me not as having a scale problem, as Cedric seems to think, but as having a social one - not enough slack.

--Bill de hÓra, January 3, 2005 06:49 PM

To: extremeprogramming@yahoogroups.com
Subject: A fast, rapidly repairing, only breaks rarely build. Was "Unbreakable builds"

On Thu, 30 Dec 2004, bernard_notarianni wrote:

> I just read the new post on Vincent Massol's blog.
>
> http://blogs.codehaus.org/people/vmassol/archives/000937_unbreakable_builds.html
>
> Has anybody seen an implementation of a similar solution?
> What were the products used? What were the issues?


I have been running a working variant of this for over a year....

The environment is embedded C hosted under Linux running under eCos on
the synthetic and real target. CVS is the SCM tool.

I am building 6 variants of the same product off a shared code base.

The full compile, splint, link, run-tests cycle for all variants takes
over two hours on a Pentium 4 2.4 GHz machine.

Clearly that is an unacceptable burden on the 20+ developers.

Thus our development process is thus....

Prior to commit to the SCM the developers run a "build --for-checkin"
which compiles, links for two variants on the synthetic and real
target, and runs all unit tests on the synthetic (desktop) target.

If they add the "--ccache" option my build script uses
http://ccache.samba.org to return cached compiles, else calls in the
compiler.

If they use the "--distcc" option, my build script uses
http://distcc.samba.org/ to distribute the compile across all
available developer machines.

I do not require them to "splint" http://www.splint.org/ as part of
the pre-checkin build as it takes two thirds of the total
(unaccelerated) compile/splint/link/test time.

The idea is to balance stability of the main line versus developer
time cost.

The project has a http://www.usemod.com/cgi-bin/wiki.pl wiki, one page of
which is designated the "CheckInList", where people can queue for
checkins. Once they rise to the top of the list, they have write access
to the mainline.

They are expected to tag the mainline with a tag of the form
SOMETHING_MEANINGFUL_N_INITIALS_precheckin.

They can then catch up to the mainline, resolve conflicts, do "build
--for-checkin", perform a smoketest on the target and commit their
changes.

At some point, which point is debatable, they are expected to
get a second person to review their changes. If they are pair
programming, that requirement falls away.

They then tag the mainline with
SOMETHING_MEANINGFUL_N_INITIALS_checkin and add that tag to the
CheckedInList wiki page with a short description, particularly noting
any nasty surprises like database version changes etc.

They then edit the CheckInList wiki page, remove themselves and shift
the next guy to the top and give him a shout to go ahead.

A background ruby script on my machine polls CVS every 2 minutes
looking for tags of the form BLAH_checkin.

On seeing one, it sleeps 10 seconds and then pulls out a complete copy
of the source at that tag, and compiles, splints, links and tests all
variants for all platforms. It uses ccache and distcc to accelerate
the process leaving "splint" as the main time consumer. (I could use
checkin hooks instead of polling, but many other projects use the same
repository)
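
The tag filter at the heart of such a poller might look like this
Python sketch. It is not the Ruby script described here: the function
name and the example tags are made up, and fetching the tag list from
CVS is left out entirely.

```python
def new_checkin_tags(all_tags, already_built):
    """Return tags ending in "_checkin" that have not been built yet.

    Tags ending in "_precheckin" are skipped automatically, because
    "..._precheckin" does not end with the suffix "_checkin".
    """
    return [
        tag
        for tag in all_tags
        if tag.endswith("_checkin") and tag not in already_built
    ]
```

Each tag returned would then trigger one full build of the source
checked out at that tag.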

The result is posted on the project web site (a few lines of ruby
using the 'cgi' module.)

If it fails, the script emails me and I gently lean on the guilty
party. Usually the correction is submitted as part of the next
checkin.

Developers just wishing to catch up with the current state, rather
than checking in, are encouraged to update to a tag that has been
shown to have built/splinted/linked and run successfully.

An important design choice made early was to drop the use of
Make/ANT/scons/cook and use Ruby instead. http://www.ruby-lang.org

Experience on creating/using previous build systems had shown me that
in a mature project, the portion given to you for free by "make"
versus the code in sundry "glue scripts" becomes small. What is more
the Makefiles end up as a ghastly tangle of shell scripting, awk, sed, and
their sundry quoting conventions.

Thus I elected to choose the best of breed scripting language, Ruby,
and write two small reusable modules to do the part that
"Make/ANT/scons/cook" would give me and then be able to write the glue
in a decent glue language. The "tsort" module that is part of standard
ruby distribution did the hard part for free.
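
For readers who have not met topological sorting, here is the same
idea in Python, whose standard library (3.9 and later) offers
graphlib.TopologicalSorter as a rough counterpart to Ruby's tsort.
The target names are invented and this is not the build script
described above.

```python
from graphlib import TopologicalSorter

# Hypothetical build targets: target -> the targets it depends on.
TARGETS = {
    "app.elf": ["main.o", "uart.o"],
    "main.o": ["main.c"],
    "uart.o": ["uart.c"],
    "main.c": [],
    "uart.c": [],
}


def build_order(targets):
    """Return targets so that every dependency precedes what depends on it."""
    return list(TopologicalSorter(targets).static_order())
```

Walking the returned list and rebuilding whatever is out of date is
the part that Make-like tools otherwise provide.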

This turned out to be an excellent choice, enabling a richer, easier,
faster, more maintainable build environment than I, or any other of the
developers on the project, have ever seen.


--John Carter, January 5, 2005 01:09 AM

I think it is a good idea. The quicker the feedback loop the better.

* On projects that build fast, it would be better to block the developer, since any work she gets done will be wasted if the checkin fails.

* Naturally running all the unit tests after the build is a great idea too.

Just getting the basics above to work would be a big step forward. Then the next step would be to address build scalability as you mentioned.
Cruise Control is getting there (http://cruisecontrol.sourceforge.net/). Today it isn't distributed, but there is a submission in the works from a guy at SolutionIQ (Jeff Ramsdale) whom I work with. He is submitting the feature today; at that point it is up to the keepers of the source.

Discussion of this patch is at :
http://jira.public.thoughtworks.org/browse/CC-137

--Lance Kind, January 7, 2005 06:24 PM

The notion of "Unbreakable" seems to me kind of similar to the Mozilla tinderbox procedures (http://www.mozilla.org/hacking/working-with-seamonkey.html).

I think some automation along these lines can do the trick without too much overhead. Here is how one can do it:

1. Break the work into commit/build/test[/fix] units and serialize them -- only one person commits at a time. Organize "waiting list" for committers.
2. Remember the point in time (revision number, tag, date) of the last "good" commit. Provide an easy way for developers to update to the last-known-good revision by default.

Kind of like an "80% of the result at 20% of the price" solution. I think it can be done fairly easily with cooperating pre-commit hooks, a continuous build process and some lightweight scripting on the client side. Some of it might be available already?
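
Point 2 could be as small as the following Python sketch, assuming the build process keeps a history of (revision, passed) pairs, oldest first. The revision labels in any usage are hypothetical.

```python
def last_known_good(build_history):
    """Return the newest revision whose build passed, or None if none did.

    build_history is a list of (revision, passed) pairs, oldest first.
    """
    for revision, passed in reversed(build_history):
        if passed:
            return revision
    return None
```

A client-side wrapper could then update to that revision by default instead of to the tip.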

--Shurik O, January 10, 2005 05:50 AM

Hi Vincent,

Why do you need developer workstations to build on? It should not build on developers' machines, to save time, no?

Thanks

--Jit, February 4, 2005 09:25 AM

Besides the obvious technical difficulties I think that this mechanism is, in every which way you impose it, fundamentally bypassing the real goals of continuous integration.

The goal of continuous integration is not to have as many successful builds as possible but to raise your team's quality awareness and to detect bugs as early as possible.

So a failing build has a function: it makes people aware of their mistakes right away and it allows you to detect problems asap.

The way to avoid failing builds is to put up a build monitor so that everyone can see right away when a build fails and to have a big lollypop to pass around. That, and indeed a person who can evangelize the importance of working precisely and accurately and the importance of quality.

Making a build unbreakable does exactly the opposite: people will always have a safety net to catch their errors and quality awareness will go down. And if you lose quality awareness, you'll lose it on other things as well, not only code quality.

--Dylan van Iersel, February 11, 2005 02:05 PM

I was reading your book JUnit in Action. In it you wrote the first test case in Chapter 1; however, you did not have an import statement for Calculator, so you did not need a Calculator class. You first wrote the test case, and then when you got red, you built a class and ran the test case again, so this time it was green. Is this the ideology of JUnit? To me it seems redundant.

--Chakri, April 26, 2005 10:55 PM

Calavista's devEDGE product (http://www.calavista.com) has had guaranteed unbreakable builds for several years now. We have been using it and have noticed a tremendous improvement in productivity due to this and other features.

The broken builds kill you when you have development teams working around the clock.

--Don Jackson, August 24, 2005 10:13 PM

We use Parabuild (http://www.viewtier.com). It provides stable aka unbreakable daily and nightly builds for us.

TT

--Tang Tong, August 31, 2005 10:21 PM

Clearcase SCM (UCM) is often used for precisely this reason (i.e. unbreakable builds).

It has the concept of a Stream for each developer.

A developer can check out and check in from his stream in isolation (other developers don't see the changes until they are delivered to a central integration stream). This has the advantage that a developer's changes are 'backed up' if he chooses to check in to his own stream.

The central integration stream holds the unbroken set of code (i.e. only 'working changes' get delivered)

Periodically developers rebase (get the latest changes from the approved integration stream), and these get merged into their own stream. They can then deliver a set of changes to the Integration Stream. An integration PC would perform a full build to approve the changes before completing the delivery. This makes the set of changes publicly available.

So it is like having a two-level SCM system.

The downside (I think) is it can be much slower to get someone else's changes, and slower to deliver (release your changes). Often only one developer can deliver to the integration stream at once (incurs a full build during delivery), leading to a queue of developers wishing to deliver.

I looked at how to integrate CruiseControl into Clearcase UCM, as quite a few companies have done this.

To do this I believe you have to let multiple developers deliver to the integration (approved) stream at once. Then CC kicks in and does build as normal. So this is rather like using CC with a normal single level SCM like CVS. i.e. bad changes can get into the SCM and break the build. Yes they are detected quickly using CC but they can still happen.

So if CC is used with Clearcase the integration stream has fewer advantages. There is still an advantage to each developer working in their own stream (i.e. recovery facilities, can work on multiple machines and still see their set of in-progress changes).

--Pete , October 24, 2005 09:30 AM

The idea of unbreakable builds is a good one.
The reasons are numerous, but my personal favorite is the elimination of risk when submitting code. In your scenario, if the "submission" breaks, it is only a problem for the submitter. The rest of the team continues. This means that, as a developer, you can check in code and go to lunch. No one is looking for your head if the submission fails. Microsoft implemented this, codenamed "Gauntlet", for internal development. I coded a primitive version of this using VSS and a promotional model.

--jpl, December 16, 2006 03:50 AM