CI Bamboo hung?

ci.pih-emr.org appears to be hung… jobs are queued but not running, and looking at the agents log I see:

There are currently no online remote agents configured on this Bamboo instance.
Mar 3, 2016 4:17:26 PM Remote agent 'gw107.iu.xsede.org (2)' was unresponsive and has gone offline.
Mar 3, 2016 4:17:55 PM Remote agent 'gw108.iu.xsede.org (1)' was unresponsive and has gone offline.
Mar 3, 2016 4:17:55 PM Remote agent 'gw108.iu.xsede.org (2)' was unresponsive and has gone offline.
Mar 3, 2016 5:11:31 PM A remote agent is loading on gw108.iu.xsede.org (127.0.0.1).
Mar 3, 2016 5:11:37 PM A remote agent is loading on gw108.iu.xsede.org (127.0.0.1).
Mar 3, 2016 5:11:42 PM A remote agent is loading on gw107.iu.xsede.org (127.0.0.1).
Mar 3, 2016 5:11:46 PM Remote agent [gw108.iu.xsede.org (4) (2)] came back after a period of inactivity.
Mar 3, 2016 5:11:48 PM A remote agent is loading on gw107.iu.xsede.org (127.0.0.1).
Mar 3, 2016 5:11:50 PM Remote agent [gw108.iu.xsede.org (3) (2)] came back after a period of inactivity.
Mar 3, 2016 5:12:00 PM Remote agent [gw107.iu.xsede.org] came back after a period of inactivity.
Mar 3, 2016 5:12:06 PM Remote agent [gw107.iu.xsede.org (2)] came back after a period of inactivity.
Mar 3, 2016 5:22:25 PM Remote agent 'gw108.iu.xsede.org (1)' was unresponsive and has gone offline.
Mar 3, 2016 5:22:26 PM Remote agent 'gw108.iu.xsede.org (2)' was unresponsive and has gone offline.
Mar 3, 2016 5:22:55 PM Remote agent 'gw107.iu.xsede.org (1)' was unresponsive and has gone offline.

I assume that if someone reboots Bamboo and/or the CI machine itself, we should be back in business again? Thanks!

@helpdesk

@ryan has just been looking at this in the last hour or so … something is causing the agents to disconnect from the host every few minutes.

Cool, thanks for the update @michael!

Mark

Taking a look at the logs; I'll keep an eye on it.

@michael and @ryan

It’s not really disconnecting every few minutes. It’s reproducible: the agent starts, it picks up the first build, predator (the plugin) starts, and the JVM hangs there. After 10 minutes, the Bamboo server decides it’s taking too long and kills it.

The log line shows it happened around the time the plugin was calculating the free disk space.

The weirdest thing ever: running df / hangs my terminal. I cannot Ctrl+C or anything. Why, why on earth can’t it discover how much free disk we have???
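For what it’s worth, this is consistent with the storage under the VM being unresponsive: any free-space check ends up in a filesystem call, and if the underlying device never answers, the calling thread (the plugin’s, or df’s) just blocks. Here is a minimal, hypothetical Java sketch (not the actual Predator plugin code) of how such a check could be guarded with a timeout so the agent at least doesn’t hang forever:

```java
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: run the free-space check on a worker thread so a hung
// filesystem cannot block the caller (e.g. the build agent) indefinitely.
public class DiskSpaceCheck {
    public static void main(String[] args) throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        // File.getUsableSpace() ends in a filesystem call; if the storage
        // behind "/" is unresponsive, this never returns -- the same symptom
        // as `df /` hanging the terminal.
        Future<Long> freeBytes = executor.submit(() -> new File("/").getUsableSpace());
        try {
            long bytes = freeBytes.get(30, TimeUnit.SECONDS);
            System.out.println("Free space on /: " + bytes + " bytes");
        } catch (TimeoutException e) {
            System.err.println("Disk check timed out; filesystem looks unresponsive.");
            // Best effort only: a thread stuck in a hung native I/O call
            // won't actually be interrupted, but the caller gets an answer.
            freeBytes.cancel(true);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a guard like this the check fails fast instead of hanging until Bamboo kills the agent after 10 minutes.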

I’m trying to disable predator for the moment, but it causes unstable snapshot builds, and we need to fix the underlying cause.

As an update for everyone, we’re trying to get the infra provider to do a power reset (well, the equivalent of that for virtual machines) on these two agent hosts, in the hope that it might clear up some of the weirdness. So far, still waiting.

Predator plugin is disabled.

We discovered that we are out of disk (https://ci.openmrs.org/browse/TRUNK-STAND2-589), even after cleaning the Maven cache. I disabled the agent anyway so it won’t simply fail every queued build.

Without being able to run the df command, I cannot really tell what is using so much disk (:D). We should hopefully have news tomorrow US time.


Thanks for the update! My guess is that there are some Bamboo build artifacts lying around and piling up?

Mark

Good news, the machines are back! I tried to run some of the builds, and they look OK.

Apparently the problem wasn’t on the VMs (they are now back and have GBs of free disk). Bamboo artefacts are saved on the server, not the slaves, and they are deleted after a couple of builds. I tried to delete some big folders on the slaves when the problem manifested itself, but no luck.

It was some problem with the VMs’ hypervisor (the ‘host’ which runs those VMs); Ryan and Michael tried to contact support about it. I suppose someone fixed it.


Thanks @cintiadr! Annoying that you all spent so much time on something that, if I understand correctly, was on the hosting company’s side?

Take care, Mark