Responding to build failures and minimising wasted developer effort

There are times when a dev notices a build failure, creates a ticket for the issue causing it, and starts working on it immediately, only to realize later that the fix is no longer necessary because the dev who broke the build made a follow-up commit to fix it before they could open their pull request. I for one feel bad for these devs because their effort goes to waste, and I think it can be avoided going forward. My suggestion, and it’s what we recommend even though I don’t think it’s documented anywhere, is that when you notice a build failure, instead of creating and working on a ticket to fix it immediately, you can do the following:

  • Try to reach out via IRC or email to the dev who made the commit that broke the build. Each Bamboo build normally has details about the committer(s) and link(s) to the commit(s) or the parent Bamboo plan that triggered the build.
  • Alternatively, you can bring it up on Talk.

My view is that we shouldn’t be creating tickets for build failures right away. Usually the dev who broke the build is already addressing the failure, and even though that’s not always the case, I believe the discussions that result from trying the options above will tell you either that someone else is addressing the issue or that it needs to be ticketed. I think this would minimize the chances of devs concurrently working on the same issue and wasting their time duplicating work.

Another thing: when you create a ticket, it’s good to allow some time for others to give their views before you actually work on it; in theory a ticket needs to be assessed before it’s made ready for work. Otherwise this too could waste a dev’s time, because the issue being reported could already be fixed, be non-reproducible, or be specific to a single installation, perhaps due to misconfiguration or misuse.

What do others think?


I agree. If we were fully utilizing Bamboo’s features, we’d generally be assigning/claiming responsibility for build failures as described here:

https://confluence.atlassian.com/bamboo/assigning-responsibility-for-build-failures-289276918.html

I also agree. The JIRA workflow is designed to support this – do we need to restrict the roles of the people who can do that issue assessment so people can’t “skip” that part of the workflow?

I think JIRA restrictions do prevent one from claiming a ticket that is not ready for work, but one can still work on a ticket even if it isn’t ready for work or claimed; we just need to make it clear to devs never to start work on tickets that aren’t ready for work.


I agree! For a number of days, the legacyui module has been broken. I suspect that is what led some devs to get their hands dirty trying to fix things, after seeing it unfixed for a while. Given this, when a module is going to be broken for a number of days, could we adopt a method of sending out a message to alert people? :smile:

Hi! I have been in touch with @wyclif and @dkayiwa for quite some time now, and I recently created LUI-57 to report the build failure of the master branch of Legacy UI, and also created a PR to fix it temporarily :slightly_smiling:

In my opinion, the master branch should NEVER fail tests. It is the branch on which everyone relies, so:

  1. I think code should be heavily tested before merging to master.
  2. If there is a failure, I believe it should be resolved immediately (at least temporarily). Meanwhile, the dev who broke it can create a branch off the broken master, fix it, and then commit. I have created a pull request to resolve this ASAP, as I have mentioned (https://github.com/openmrs/openmrs-module-legacyui/pull/29).

Of course, these are only my views, and since I am a newbie of sorts, my thinking may not be right. However, I would like some input from @dkayiwa, @wyclif and others :slight_smile:

Several points here:

  1. If you break a build and you’re not imminently going to fix it (i.e. within minutes), you should go to the relevant broken build in Bamboo, claim responsibility, and leave a comment there.
  2. But it’s not okay to leave builds broken for a long time (e.g. the legacyui build was broken for 9 days, until being fixed now). If this has to happen, then you especially need to comment publicly so people know to expect it.
  • One option here would be to do a pinned post in the dev category any time there are known broken builds.
  • Another is to script something against the Bamboo REST API so we can regularly publish a summary of all the broken builds (pulling forward any comments someone made while claiming a broken build).
  3. We should update our “how to be a developer” documentation to point out that if something fails to build for you, you should check its status on ci.openmrs.org and check whether someone has commented about it being broken.
  • Or else check some other central place (per the prior point).
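The REST-API idea could be sketched roughly as below (Python). This is only a sketch: the base URL is real but the endpoint path, query parameters, and JSON field names are assumptions to be checked against the Bamboo REST documentation for your server version.

```python
# Sketch: summarize broken Bamboo builds for a regular post on Talk/IRC.
# ASSUMPTIONS: the /rest/api/latest/result.json endpoint, its "results.result"
# JSON shape, and the "buildState"/"plan"/"buildNumber" field names all need
# verifying against your Bamboo version's REST API docs.
import json
import urllib.request

BAMBOO_URL = "https://ci.openmrs.org"  # assumed base URL


def fetch_failed_results(base_url=BAMBOO_URL):
    """Fetch recent build results and keep only the failed ones (network call)."""
    url = base_url + "/rest/api/latest/result.json?expand=results.result"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    results = data["results"]["result"]
    return [r for r in results if r.get("buildState") == "Failed"]


def summarize_broken(failed_results):
    """Format a one-line-per-plan summary suitable for a pinned post or bot message."""
    lines = ["There are currently %d broken builds" % len(failed_results)]
    for r in failed_results:
        lines.append(" - %s #%s" % (r["plan"]["key"], r["buildNumber"]))
    return "\n".join(lines)
```

The formatting half is independent of the network call, so the same `summarize_broken` output could feed a pinned Talk post, an IRC bot, or a scrum reminder.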

+1 to this.

At present, the OpenMRS Core Master build is broken: https://ci.openmrs.org/browse/TRUNK-MASTER-1181 (@wyclif, it has your name on it, merging a PR from @kristopherschmidt)

Either this should be reverted, or, if this is some sort of sporadic error in building openmrs-core, it should always be someone’s top priority to fix.

Thanks @darius! I had created a PR to fix it temporarily as soon as I saw it :slightly_smiling:
Here is the link:

https://github.com/openmrs/openmrs-module-legacyui/pull/29

This was for the legacyUI module. Just an example to show that if it is broken, then at least add @Ignore to the failing tests.
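The OpenMRS tests in question are JUnit, where the escape hatch is `@Ignore("reason")` on the flaky test. As a self-contained illustration of the same idea, Python’s `unittest` offers `@unittest.skip`; the test class and reason below are made up for the example, not taken from the legacyUI code.

```python
# Illustration of temporarily disabling a flaky test with a visible reason,
# analogous to JUnit's @Ignore("..."). Class name and ticket reference are
# hypothetical examples only.
import unittest


class LegacyUiServiceTest(unittest.TestCase):  # hypothetical test class
    @unittest.skip("flaky on CI, see LUI-57; re-enable once fixed")
    def test_flaky_behaviour(self):
        self.fail("would fail intermittently on CI")

    def test_stable_behaviour(self):
        # the rest of the suite keeps running and protecting the build
        self.assertEqual(1 + 1, 2)
```

The key point either way is that the skip carries a reason pointing at a ticket, so the disabled test is tracked rather than forgotten.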


FYI – I labeled CI plans with platform and refapp (I’m sure not 100% correctly, but at least 80% correctly ;)). This allows one to easily filter CI on these labels or view a wallboard based on the labels. I also added RSS widgets for builds of each to the Community Development Status wiki page currently serving as our dashboard on development status.

@wyclif – if whatever is blocking you from getting master working is still unresolved, please revert the changes to get the openmrs-core build green again today.


The build failures have nothing to do with the commit – the commit was a change to .gitignore files only, and the build passes locally. I’m sure wyclif would have also checked that the Travis CI build passed before merging?

Did the build fail once or multiple times? Can it simply be rerun manually to see if it passes?


I think we all agree CI builds should never be red without being attended to, but sometimes things are not as straightforward as you might want to think. As @kristopherschmidt said, we also need to realize that the latest commit is not necessarily the one that broke the build, even if CI reports may suggest so; normally CI will indicate whether the failing tests are new or existing. Lately we’ve also had randomly failing tests that can fail on any commit, which can deceptively suggest that the latest change broke things; this has been the case for the master and 2.0.x branches of core. Another extreme scenario is a commit that fixes one plan but breaks another for a dependent project, where you can’t simply revert both projects to their last safe revisions because there’s a lot of development going on in both, which would lead to serious merge conflicts when trying to reapply the changes; this was kind of the case for legacy UI and core. I will add @Ignore to the flaky test in core and comment on the related ticket to get it fixed. With that said, much as we want CI to stay green, I still don’t think one should be so quick to create a ticket and work on it just because the CI build failed.

Thanks @wyclif. There should be no such thing as an unpredictable (“flaky”) test. If a test cannot produce reliable results, then it’s not a test (it’s a roll of the dice) and should not be running on CI.

Agreed, though I don’t think I would make the decision of making a ticket solely based on timing. I’d propose a convention like:

If you see an issue identified by a broken CI build and want to help:

  • If you can see someone is actively working to address it, then reach out to them and ask if you can help.
  • Reach out to the responsible person(s) named in CI to see how you can help.
  • If you don’t get a timely response from the responsible person(s) and others in the community cannot help you get their attention, then:
    • Make a ticket for the issue if it doesn’t already exist
    • Add yourself (or get someone to add you) as a responsible person for the broken build in CI
    • Post a message on Talk addressed to the responsible person(s) named in CI, linking to the issue in JIRA and letting them know that you’re helping address the broken build.

@Burke I don’t think anyone ever writes a test hoping that it will randomly pass or fail, just like no one writes buggy code intentionally. You never realize a test is flaky until it has run multiple times on CI, and that’s when you realize it needs to be taken down or addressed.

I would look at this in a few ways:

  1. If you notice anything red in CI then you should follow a process like what Burke describes. (And it would be worth putting this on a wiki page.)

  2. If you are a “regular” OpenMRS developer, and especially if you’re a community dev lead, then you should be actively following CI so you’re among the first to notice these things, and push to get them addressed. There are currently 6 red CI plans. This is an OpenMRS failure.

  • To be clear, as a regular dev you may know why some tests are broken, and why “it’s okay to leave this red for a few days”. But everyone else (including Burke and me) doesn’t know these things, and it makes CI completely untrustworthy. A red build in CI should mean: something just broke, and everyone needs to be thinking about it.
  3. If you are the one responsible for fixing a broken build (either because you broke it, you found it, or it fell to you) then:
  • You need to “claim” the investigation work. I suggest (1) saying something on IRC, and (2) commenting on the broken build that you’re looking into it.
  • Ideally, you should make the “correct” fix, if it’s quick. But if this takes more than an hour, then you need to instead take more drastic measures of (a) reverting commits, and (b) @Ignoring tests.
    • If you do this, then you own getting the correct fix done as a top priority, or else creating a ticket and explicitly handing it off to someone else who will address it as a top priority.

PS- I created a helpdesk ticket about the broken “Bamboo Backup” plan, and I commented on its latest broken build.

Well said, @wyclif. I guess I’m saying “let’s treat a flaky test like a bug that keeps the program from running at all” – i.e., once discovered, it’s not something we can live with. If it can’t be fixed directly, then we need a workaround (like @Ignore) until it can be fixed.

Totally agree. In fact, it’s déjà vu all over again. :stuck_out_tongue:

Doing my part to help, I have programmed my irc bot to chatter noisily at the start of the scrum about any broken CI builds.

Currently it is saying this, which makes me sad:

There are currently 5 broken builds

We can do better.