Unbreakable builds
[
vmassol
]
Let's create Unbreakable Builds

Of my last two development projects, one had a strong sense of quality and excellence in general, and continuous build failures were the exception (about 3-4 per week for a team of 30 developers). The other was quite the opposite: everyone was surprised when the continuous build passed (there were about 5 build breaks a day on average for a team of 40 developers). I'm sure this is also pretty common on other projects.

Obviously the best approach is to build (pun intended) build awareness in the team. However, this requires strong evangelists, who may not always be available, and other circumstances may make it difficult.

A thought struck me about a year back: what if we were able to prevent the continuous build from failing by design? There's a French saying that goes something like "it's better to prevent than to cure". I think this is definitely a good idea to apply to continuous build failures. Why not make a continuous build system that cannot fail?

At that time I thought it was a nice idea (I had meant to blog about it but I forgot), but I could not see very well how it could work. Now, a year later, I really think it's a nice idea and I'd like to explore it.

The architecture

A potential basic architecture is shown in figure 1.
The general principle is to catch the commit data before they get committed to the SCM, to perform a build and to perform the actual commit only if the build is successful. Here are the detailed steps:
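The general principle (build first, perform the actual commit only if the build succeeds) can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the architecture from figure 1: the `Repository` class, `gated_commit`, and `toy_build` are hypothetical names, the "SCM" is an in-memory dictionary, and the "build" is a trivial check standing in for a real compile-and-test run.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Repository:
    """Toy stand-in for the SCM: path -> content, plus a commit history."""
    files: Dict[str, str] = field(default_factory=dict)
    history: List[str] = field(default_factory=list)


def gated_commit(
    repo: Repository,
    changes: Dict[str, str],
    message: str,
    build: Callable[[Dict[str, str]], bool],
) -> bool:
    """Build the proposed state first; commit only if the build succeeds."""
    # Scratch workspace: the repository itself is not touched yet.
    candidate = dict(repo.files)
    candidate.update(changes)

    # Hypothetical build step: returns True on success, False on failure.
    if not build(candidate):
        return False  # rejected: nothing reaches the repository

    # The build passed, so the actual commit is performed.
    repo.files = candidate
    repo.history.append(message)
    return True


def toy_build(files: Dict[str, str]) -> bool:
    """Hypothetical 'build': fails if any file contains 'SYNTAX ERROR'."""
    return all("SYNTAX ERROR" not in content for content in files.values())
```

With this sketch, a call such as `gated_commit(repo, {"A.java": "SYNTAX ERROR"}, "break A", toy_build)` returns `False` and leaves `repo` unchanged, which is the property the proposal is after.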
Advantages

Here are the advantages of such a system:
Questions/Issues

I'm sure you're now burning with tons of remarks/questions showing why it wouldn't work :-) Here's what I've thought about so far. If you have any opinions or other questions, I'd love to hear them.

Q: What happens if someone else also commits a change to the same file?

It works in the same way as usual. The build kicker will try to "merge" the changes after having done a workspace update and, if it cannot, the user will get an error explaining that the merge failed. The user will then need to perform an update on his local machine and resolve the conflict.

Q: Imagine I perform a commit and I start working on a new feature. Then my commit is rejected because of a failure. How do I fix this without losing my current changes?

Answer 1: This is actually relatively similar to what you're currently doing. Imagine you're committing something. Then you start working on something new and the continuous build tells you 2 hours later that your change has broken something. The difference is that, in that case, your changes have already been committed, so you can easily create a new workspace and fix it there. We could do the same here by having the pre-commit hook make your changes available as a patch through a URL (sent in the commit answer), so that it is easy for you to apply it to a fresh checkout.

Answer 2: You wait until the build is finished on the server. You can perform other activities like documenting, reading, thinking, designing, writing new classes, new tests, etc. Basically you work on stuff that does not conflict with the past changes. Actually this is probably what you're currently doing when your build is running, as it is eating all your CPU...

Q: Doesn't it take too long to build?

You need to ensure your build takes as little time as possible. I think 5-10 minutes should be OK. The best way to achieve this is probably to use binary dependencies instead of rebuilding dependent projects (a la Maven), except maybe direct dependencies. You'll still need a continuous build running continuously to produce fresh binary dependencies. I guess it's also best to use an SCM client that can do asynchronous commits in order to let you continue working while the commit is in progress.

Q: What if I want to modify an API but want each project to modify its own files?

Several options:
The interesting point here is that you *want* API breakage to be detected by default, rather than the opposite.

Conclusion

It seems to me this would be particularly useful on big projects with lots of developers. It should also be useful for introducing continuous integration on an existing project, as it lowers the discipline required of everyone. Obviously this is just an idea that I haven't tested yet. I'm very keen to see this in action. If any of you have any experience with it, please share it. I'm planning to spend some time trying to implement it. If you're interested in helping out, let me know too.

There used to be an SCM product, I think named Aegis, that was supposed to be another solution to this problem. If I remember correctly, you could define conditions (like regression tests) that had to pass before the repository was updated. I never used it, so I have no idea how well it worked. --Craig Cottingham, December 29, 2004 09:01 PM
That would be better. Forcing a run of all unit tests before a check-in is better than reacting later to a broken build. Unit tests should run in less than a minute with sufficient use of mock objects. Of course acceptance/functional tests could still break, but it's less likely if unit tests are at 100%. --Michael Slattery, December 29, 2004 10:58 PM
How about properly training your staff on software development, and increasing team communication? That would be much simpler, and solve the problem without creating a big fucking mess of a build system like you have proposed here. This sort of thing treats developers like idiots. Either your developers ARE idiots (in which case, it's your own dumb fault for hiring them/working there), or they are capable of learning proper development techniques. --Dave, December 30, 2004 03:20 PM
My big concern is the "Doesn't it take too long to build" one. I agree with your assessment that 5-10 minutes is a reasonable maximum. But builds on reasonably sized projects take much longer. Do we really have to solve the (much harder!) problem of incremental builds (i.e., re-compiling only those things that depend on the changed files -- even going as far as re-running only those tests that touch changed code) to implement unbreakable builds? --Michael Chermside, December 30, 2004 04:10 PM
Hi Michael, I should probably have explained this part more. I am also working on a big project and our build takes at least 2 hours. But that's the *full* build. Our project is divided into smaller projects and each subproject can be built standalone. In order to do this we are using binary dependencies of the other dependent projects. The main continuous build is in charge of continuously producing fresh binary dependencies. A single project build should not take more than 5-10 minutes. In any case, it should not take longer than what the developer is currently building on his local machine (it's the same build!). That said, it would be great if we could also build directly dependent projects, as this would prevent unplanned API breakages. I'm still unsure how far we could go with this idea, as it may take longer. Maybe randomly choosing a single directly dependent project would do the trick (even if not perfect and not guaranteeing the "unbreakable" part). --Vincent Massol, December 30, 2004 04:24 PM
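The "randomly choose a single directly dependent project" idea from the reply above could look something like the following. This is only a sketch under assumptions: the dependency map `DEPENDS_ON`, the project names, and the helper functions are all hypothetical, and a real setup would read the dependencies from the build tool's own metadata.

```python
import random
from typing import Dict, List, Optional, Set

# Hypothetical dependency map: project -> projects it directly depends on.
DEPENDS_ON: Dict[str, Set[str]] = {
    "core": set(),
    "web": {"core"},
    "batch": {"core"},
    "reports": {"web"},
}


def direct_dependents(project: str, depends_on: Dict[str, Set[str]]) -> List[str]:
    """Projects that list `project` among their direct dependencies."""
    return sorted(name for name, deps in depends_on.items() if project in deps)


def builds_for_commit(
    project: str,
    depends_on: Dict[str, Set[str]],
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Build the changed project, plus one randomly chosen direct dependent."""
    if rng is None:
        rng = random.Random()
    plan = [project]
    dependents = direct_dependents(project, depends_on)
    if dependents:
        # Only one dependent is built, so API breakage in the others
        # can still slip through: this trades coverage for build time.
        plan.append(rng.choice(dependents))
    return plan
```

For a commit to `core`, the plan is `core` followed by either `batch` or `web`; for `reports`, which nothing depends on, the plan is just `reports`.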
Very interesting to read indeed. But I think it is mainly suitable for a large project, with the drawback of long build times. --Muhammad Mansoor, December 31, 2004 06:24 AM
Peter Miller's Aegis system (which has been around since ~1992 I think) has the notion of a forced pre-commit hook. The intent/hope is that it's "hooked up" to an automatic build+smoke test that ensures that the codeline builds *and* passes regression testing before committing. Other systems that are language-aware have attempted to ensure consistency + integrity by doing semantic analysis in an attempt to guarantee that a checkin won't break a build (but without actually doing the build, since it may be prohibitively long for a large system). These days, it's fairly standard to see a shop use a pre-commit hook to ensure that some kind of build+smoke and regression testing has been run (though possibly incremental rather than full). What you are proposing seems to be a deployment architecture for enforcing and automating such a protocol, and grappling with issues of codebase size and build time and distribution. Another possibility ... rather than try to be "perfect" up front and take such pains to ensure there will be no build failures ... what if we could let the build failures happen (albeit with a culture that strongly discourages that, rather than handcuffs and a guaranteed extra overhead) AND what if we had a way of detecting *AND* correcting the build failures REALLY REALLY QUICKLY! Suppose we attempted this kind of self-healing (autonomic) build system. Thus when the frequent+regular integration build took place (be it daily/nightly, or more often), then only in the case of failure, there would be an application with some "smarts" built into it that could attempt to analyze the result, and attempt to diagnose and then remedy it (or at least do this to the greatest extent possible, and give that "useful" information in a timely manner via an automated notification)? --Brad Appleton, December 31, 2004 06:46 AM
Interesting design. I can see two non-trivial problems with it. First of all, it's extremely hard to get a build verification process down to 5-10 minutes for larger projects, especially if it's going to be effective in finding problems as well. I think the key is a high level of modularisation and good dependency management, but that is something that is very hard to introduce to an existing codebase. The other problem is also interesting: on a larger project (40+ developers) a lot of commits are going on at the same time. If the build bot runs the build verification process for each changeset serially then it may never catch up with the changesets as developers commit them. The build system needs some way of merging multiple changesets into one changeset and then running the build on them all together. This leads me to one of my current favourite research areas: changeset-oriented version control systems. Most of our current popular version control systems are file or repository based (such as CVS, Perforce, Subversion), where the repository is simply a set of directories and files, and each versioned object has a set of versions. A change-oriented version control system is organized along a different axis: a repository is a sequence of changes or changesets, and the current state of the repository can be determined by applying all the changes in the correct sequence. Version control systems that operate in this way are, for example, Arch (http://www.gnu.org/software/gnu-arch/) and BitKeeper. Subversion also has something similar crafted on top of its file/version oriented repository. Why is this related? With a change-oriented version control system a developer produces a "change" instead of committing directly to the repository. A change is a file like anything else. This change file is then sent in some way or other to the central repository (the sequence of "approved" changes), through email, the file system or something like that.
To implement the above design one could imagine a developer creating a change file and placing it in a directory. The build system polls the directory and picks up the changes, applies them to the current state of the repository, runs the build verification process and, if successful, adds the changes to the approved sequence of changes (the central repository). In the meantime other developers have submitted new change files to the directory, the build system picks them up, and so on. In case of a failure all the changes are rejected and the developers are notified via email or similar. --Jon Tirsen, December 31, 2004 07:45 AM
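The change-oriented design described in this comment can be sketched in a few lines. This is a minimal illustration under assumptions, not how Arch or BitKeeper work internally: a "change" is modelled as a plain dictionary of path to new content, the polled directory is replaced by an in-memory list of pending changes, and `replay` and `process_batch` are hypothetical names.

```python
from typing import Callable, Dict, List

Change = Dict[str, str]  # toy change format: path -> new content
State = Dict[str, str]   # repository state: path -> content


def replay(approved: List[Change]) -> State:
    """The current state is obtained by applying all changes in sequence."""
    state: State = {}
    for change in approved:
        state.update(change)
    return state


def process_batch(
    approved: List[Change],
    pending: List[Change],
    verify: Callable[[State], bool],
) -> bool:
    """Verify all pending changes together; approve all of them or none."""
    # Apply the pending changes on top of the approved sequence,
    # without modifying the approved sequence itself.
    candidate = replay(approved + pending)

    # Hypothetical build verification process: True means the build passed.
    if verify(candidate):
        approved.extend(pending)  # changes join the central repository
        return True

    # On failure the whole batch is rejected (developers would be notified).
    return False
```

Batching keeps the build system from falling behind when many developers submit at once, at the price that one bad change causes the good ones in the same batch to be rejected with it.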
Maybe I've worked at the wrong places, but all my commercial experience has been with source control systems that don't allow multiple updates to the same file. A developer has to explicitly lock a file before they can check in changes (so if someone else has worked on the same file, they have to merge their changes locally). With this kind of source control system, the proposed suggestion sounds very workable (and works around some of the objections raised) --Chris, December 31, 2004 08:55 AM
I'm new to continuous integration and only now studying CruiseControl. Are there already existing tools that do the system you propose? My main problem would be finding someone on the team with the time to set up such a system. --Calen Legaspi, December 31, 2004 09:28 AM
Aegis is exactly what you are proposing, but it is not very successful in practice. Have you heard about optimistic locking? It is a database technique in which you never lock anything, but version it. Versioning a row means generating a new version of the row, while other transactions can still see the old version. The good thing is that databases get a lot faster, because all readers see the old data, while writers simply write the new versions. When 2 processes have conflicts, one succeeds; the other fails, rolls back and then redoes the transaction immediately. The same happens with CVS and continuous integration, but at the source level ;-) What you are proposing is going backwards at least 10 years... I can mention at least 10 reasons why it wouldn't work. Build breaks are not a big problem, at least not for companies that do continuous integration using available tools like LuntBuild and CruiseControl. If it ain't broke, don't fix it. --Ted Bates, December 31, 2004 01:05 PM
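For readers unfamiliar with the optimistic locking technique mentioned in this comment, here is a small sketch of the idea. It is a simplified, single-process illustration with hypothetical names (`VersionedCell`, `update_with_retry`), not a description of any particular database: a writer records the version it read, and its write only succeeds if nobody else has written in the meantime; otherwise it re-reads and retries.

```python
from typing import Callable, Tuple


class VersionedCell:
    """A value with a version counter; nothing is ever locked."""

    def __init__(self, value: int) -> None:
        self.value = value
        self.version = 0

    def read(self) -> Tuple[int, int]:
        """Return the current value together with its version."""
        return self.value, self.version

    def write_if_unchanged(self, new_value: int, expected_version: int) -> bool:
        """Write only if no one else has written since `expected_version`."""
        if self.version != expected_version:
            return False  # conflict: another writer got there first
        self.value = new_value
        self.version += 1
        return True


def update_with_retry(
    cell: VersionedCell,
    transform: Callable[[int], int],
    max_attempts: int = 5,
) -> int:
    """Redo the read-modify-write until it succeeds; return the attempt count."""
    for attempt in range(1, max_attempts + 1):
        value, version = cell.read()
        if cell.write_if_unchanged(transform(value), version):
            return attempt
    raise RuntimeError("gave up after repeated conflicts")
```

The analogy drawn in the comment is that a developer's update, merge, and commit cycle plays the role of the retry loop, with the continuous build detecting the conflicts that slip through.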
Jon, Subversion is a tree-based system, so the changeset paradigm can exist well enough there; certainly the file paradigm of CVS/RCS makes no sense under Subversion. As for the idea of unbreakable builds, I think I prefer applying the policy that the mainline can be unstable. The idea then is to capture as many stable builds off the trunk as possible. To avoid dealing with multiple writers you capture stable builds asynchronously by running tests, i.e. via cruise or damagecontrol. From stable builds you can determine stable releases. Brad Appleton could probably say more about codeline policies, but I think the basic CI approach plus policies is sufficient here. The main upsides to that approach are reduced stress on people checking in and avoiding code freezes. The unbreakable build approach strikes me as micro-freezing the repository. The unbreakable approach strikes me not as having a scale problem, as Cedric seems to think, but as having a social one: not enough slack. --Bill de hÓra, January 3, 2005 06:49 PM
To: extremeprogramming@yahoogroups.com
On Thu, 30 Dec 2004, bernard_notarianni wrote:
> I just read the new post on Vincent Massol's blog.

The environment is embedded C hosted under Linux running under eCos on
I am building 6 variants of the same product off a shared code base.
The full compile, splint, link run tests of all variants cycle takes
Clearly that is an unacceptable burden on the 20+ developers.
Thus our development process is thus....
Prior to commit to the SCM the developers run a "build --for-checkin"
If they add the "--ccache" option my build script uses
If they use the "--distcc" option, my build script uses
I do not require them to "splint" http://www.splint.org/ as part of
The idea is to balance stability of the main line versus developer
The project has a http://www.usemod.com/cgi-bin/wiki.pl wiki, one page of
They are expected to tag the mainline with tag of the form
They can then catch up to the mainline, resolve conflicts, do "build
At some point, which point is debatable, they are expected to
They then tag the mainline with
They then edit the CheckInList wiki page, remove themselves and shift
A background ruby script on my machine polls CVS every 2 minutes
On seeing one, it sleeps 10 seconds and then pulls out a complete copy
The result is posted on the project web site (a few lines of ruby
If it fails, the script emails me and I gently lean on the guilty
Developers just wishing to catch up with the current state, rather
An important design choice made early was to drop the use of
Experience on creating/using previous build systems had shown me that
Thus I elected to choose the best of breed scripting language, Ruby,
This turned out to be an excellent choice enabling a richer, easier
--John Carter, January 5, 2005 01:09 AM
I think it is a good idea. The quicker the feedback loop the better.
* On projects that build fast, it would be better to block the developer, since any work she gets done will be a waste if the checkin fails.
* Naturally, running all the unit tests after the build is a great idea too.
Just getting the basics above to work would be a big step forward. Then the next step would be to address build scalability as you mentioned. Discussion of this patch is at : --Lance Kind, January 7, 2005 06:24 PM
The notion of "Unbreakable" seems to me kind of similar to the Mozilla tinderbox procedures (http://www.mozilla.org/hacking/working-with-seamonkey.html). I think some automation along these lines can do the trick without too much overhead. Here is how one can do it: 1. Break the work into commit/build/test[/fix] units and serialize them -- only one person commits at a time. Organize a "waiting list" for committers. It is kind of an "80% of the result at 20% of the price" solution. I think it can be done fairly easily with cooperating pre-commit hooks, a continuous build process, and some lightweight scripting on the client side. Some of it might be available already? --Shurik O, January 10, 2005 05:50 AM
Hi Vincent, Why do u need developer work-station to build on? It should not do build on developers machine to save time no? Thanks --Jit, February 4, 2005 09:25 AM
Besides the obvious technical difficulties, I think that this mechanism, whichever way you impose it, fundamentally bypasses the real goals of continuous integration. The goal of continuous integration is not to have as many successful builds as possible but to raise your team's quality awareness and to detect bugs as early as possible. So a failing build has a function: it makes people aware of their mistakes right away and it allows you to detect problems asap. The way to avoid failing builds is to put up a build monitor so that everyone can see right away when a build fails, and to have a big lollypop to pass around. That, and indeed a person who can evangelize the importance of working precisely and accurately and the importance of quality. Making a build unbreakable does exactly the opposite: people will always have a safety net to catch their errors and quality awareness will go down. And if you lose quality awareness, you'll lose it on other things as well, not only code quality. --Dylan van Iersel, February 11, 2005 02:05 PM
I was reading your book JUnit in Action. In it you wrote the first test case in Chapter 1; however, you did not have an import statement for Calculator, so you did not need a Calculator class. You first wrote the test case and then, when you got red, you built a class and ran the test case again, so this time it was green. Is this the ideology of JUnit? To me it seems redundant. --Chakri, April 26, 2005 10:55 PM
Calavista's devEDGE product (http://www.calavista.com) has had guaranteed unbreakable builds for several years now. We have been using it and have noticed a tremendous improvement in productivity due to this and other features. Broken builds kill you when you have development teams working around the clock. --Don Jackson, August 24, 2005 10:13 PM
We use Parabuild (http://www.viewtier.com). It provides stable, aka unbreakable, daily and nightly builds for us. TT --Tang Tong, August 31, 2005 10:21 PM
Clearcase SCM (UCM) is often used for precisely this reason (i.e. unbreakable builds). It has the concept of a Stream for each developer. A developer can check out and check in from his stream in isolation (other developers don't see the changes until they are delivered to a central integration stream). This has the advantage that a developer's changes are 'backed up' if he chooses to check in to his own stream. The central integration stream holds the unbroken set of code (i.e. only 'working changes' get delivered). Periodically developers rebase (get the latest changes from the approved integration stream), and these get merged into their own stream. They can then deliver a set of changes to the Integration Stream. An integration PC would perform a full build to approve the changes before completing the delivery. This makes the set of changes publicly available. So it is like having a two-level SCM system. The downside (I think) is that it can be much slower to get someone else's changes, and slower to deliver (release your changes). Often only one developer can deliver to the integration stream at once (which incurs a full build during delivery), leading to a queue of developers wishing to deliver. I looked at how to integrate CruiseControl into Clearcase UCM, as quite a few companies have done this. To do this I believe you have to let multiple developers deliver to the integration (approved) stream at once. Then CC kicks in and does the build as normal. So this is rather like using CC with a normal single-level SCM like CVS, i.e. bad changes can get into the SCM and break the build. Yes, they are detected quickly using CC, but they can still happen. So if CC is used with Clearcase the integration stream has fewer advantages. There is still an advantage in each developer working in their own stream (i.e. recovery facilities, and being able to work on multiple machines and still see their set of in-progress changes). --Pete, October 24, 2005 09:30 AM
The idea of unbreakable builds is a good one. --jpl, December 16, 2006 03:50 AM