Bamboo JVM appears hung while building

FYI @dkayiwa, @cintiadr, @michael, @ryan

Daniel tried to release webservices.rest from Bamboo today. (This has previously worked successfully.) On his first attempt the release actually seems to have worked correctly (all the artifacts are in mavenrepo.openmrs.org, and the changes were committed to git), but Bamboo reported a failure.

He then did the “Re-run failed/incomplete jobs” thing from Bamboo, and that build failed (because the release had already happened). It seems to me that the logs from those two runs are concatenated here.

The last message from that first run is:

error	17-Mar-2015 07:34:53	Agent 'gw108.iu.xsede.org (1)' went offline while building RESTWS-RESTWS-RTM-393. The results of that build will not be available. 

Why did this happen? Somehow, even though the agent went offline, the Maven release actually worked!

So, I think we’re basically fine here, except that we need to reset the variable on the plan so that Bamboo knows to release as 2.12 next time.

I have done this.

[And can we please remove the stupid behavior that messages have a minimum length?]

I guess your investigation is right.

So, during the first attempt to run the release, I can see in the agent’s log that it got the job and started executing “release-scripts/release-prepare-perform.sh -r 2.11 -d” (2015-03-17 02:14:19,510, agent’s time).
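For context, this is not the actual content of release-scripts/release-prepare-perform.sh, just a sketch of what it presumably drives, assuming it wraps the Maven release plugin (which would also explain why artifacts reached mavenrepo.openmrs.org even after Bamboo lost track of the build):

# Hypothetical sketch only -- the real script may differ; "-d" is not guessed at here.
RELEASE_VERSION=2.11                       # what "-r 2.11" appears to request
mvn --batch-mode release:prepare \
    -DreleaseVersion="$RELEASE_VERSION" \
    -DautoVersionSubmodules=true           # tags and bumps the POMs, committing to the branch
mvn release:perform                        # checks out the tag, builds, and deploys the artifacts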

After that, we’ve got a few new additions to the queue (nothing important while the agent is busy; last message 2015-03-17 02:29:21,893), and the very next thing (2015-03-17 02:33:53,237) is the Bamboo remote agent starting up (?).

Apparently the release script subshell eventually finished (successfully, I’d expect, but we have no logs from that point), but it didn’t report back to Bamboo, as it wasn’t registered as a running build with the newly restarted agent. I wonder what caused the restart.

The rerun of the release stage failed simply because the release plugin needs to commit to the HEAD of the branch; the previous execution of the release plugin had already added commits to master, so that failure was expected.

Still, I couldn’t find a clue as to why it was restarted; no out-of-memory error or anything.

Well, I’ve found the answer, but I’m not sure there’s anything we can do:

INFO   | jvm 1    | 2015/03/17 02:29:21 | 2015-03-17 02:29:21,893 INFO [AtlassianEvent::0-BAM::EVENTS:pool-1-thread-9] [BuildAgentControllerImpl] A new executable may be available on the server.
ERROR  | wrapper  | 2015/03/17 02:32:32 | JVM appears hung: Timed out waiting for signal from JVM.
ERROR  | wrapper  | 2015/03/17 02:32:45 | JVM did not exit on request, terminated
STATUS | wrapper  | 2015/03/17 02:32:45 | JVM exited in response to signal SIGKILL (9).
INFO   | wrapper  | 2015/03/17 02:32:45 | JVM exited on its own while waiting to kill the application.
INFO   | wrapper  | 2015/03/17 02:32:45 | Wrapper Process has not received any CPU time for 12 seconds.  Extending timeouts.
STATUS | wrapper  | 2015/03/17 02:32:49 | Launching a JVM...
INFO   | jvm 2    | 2015/03/17 02:32:51 | Wrapper (Version 3.2.3-atlassian-1) http://wrapper.tanukisoftware.org
INFO   | jvm 2    | 2015/03/17 02:32:51 |   Copyright 1999-2006 Tanuki Software, Inc.  All Rights Reserved.
INFO   | jvm 2    | 2015/03/17 02:32:51 |

(From /home/bamboo-agent-1/bamboo-agent/atlassian-bamboo-agent.log.) It’s a fairly new Java version, but it’s hard to tell why the JVM was hung.
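Next time an agent looks hung, it might be worth grabbing a thread dump before the wrapper kills the JVM; something along these lines (the pgrep pattern is just a guess at how the agent process shows up, and it should be run as the bamboo-agent user):

AGENT_PID=$(pgrep -f 'java.*bamboo-agent' | head -1)            # guess at the agent JVM process
jstack -l "$AGENT_PID" > /tmp/agent-threaddump-$(date +%s).txt  # full thread dump with lock info
# or, if jstack isn't available on the box:
kill -3 "$AGENT_PID"                                            # dumps threads to the JVM's stdout / wrapper log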

@cintiadr This just happened again (luckily not on a release build this time…)

https://ci.openmrs.org/browse/EBOLA-EEM-JOB1-844/log

 19-Mar-2015 11:28:44	Agent 'gw108.iu.xsede.org (1)' went offline while building EBOLA-EEM-JOB1-844. The results of that build will not be available.

From the Bamboo host (not agent logs):

Mar 19, 2015 11:28:36 AM Remote agent [gw108.iu.xsede.org (4) (2)] marked as inactive. A new one came in place.
Mar 19, 2015 11:28:36 AM Remote agent [gw108.iu.xsede.org (4) (2)] came back after a period of inactivity.
Mar 19, 2015 11:28:51 AM Remote agent [gw108.iu.xsede.org (3) (2)] marked as inactive. A new one came in place.
Mar 19, 2015 11:28:51 AM Remote agent [gw108.iu.xsede.org (3) (2)] came back after a period of inactivity.

Oh, of course it did.

ERROR  | wrapper  | 2015/03/19 06:27:26 | JVM appears hung: Timed out waiting for signal from JVM.
ERROR  | wrapper  | 2015/03/19 06:27:26 | JVM did not exit on request, terminated
STATUS | wrapper  | 2015/03/19 06:27:26 | JVM exited in response to signal SIGKILL (9).
INFO   | wrapper  | 2015/03/19 06:27:26 | JVM exited on its own while waiting to kill the application.
STATUS | wrapper  | 2015/03/19 06:27:32 | Launching a JVM...
INFO   | jvm 3    | 2015/03/19 06:27:32 | Wrapper (Version 3.2.3-atlassian-1) http://wrapper.tanukisoftware.org
INFO   | jvm 3    | 2015/03/19 06:27:32 |   Copyright 1999-2006 Tanuki Software, Inc.  All Rights Reserved.
INFO   | jvm 3    | 2015/03/19 06:27:32 |

I don’t remember seeing it before, but now it looks suspiciously common in both remote agents: https://ci.openmrs.org/admin/systemErrors.action

Could it be a heartbeat timeout issue?

I’m not sure. It’s clearly not the heartbeat between agent and server, but it could be the ping between the wrapper and the JVM.
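If it is the wrapper-to-JVM ping, one thing we could try (untested) is giving the JVM more slack in the agent’s wrapper config. wrapper.ping.interval and wrapper.ping.timeout are standard Tanuki wrapper properties; the file path below is an assumption based on our agent layout:

grep -E 'wrapper\.ping' /home/bamboo-agent-1/bamboo-agent/conf/wrapper.conf
# and then raise the timeout, e.g.:
#   wrapper.ping.interval=5
#   wrapper.ping.timeout=120    # default is 30s; slower hang detection, fewer false kills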

But seriously, 30 seconds is already a lot of time in computer terms. Do we have graphs of resource usage (CPU/memory/IO) for the VMs, as well as for the hosts they run on?
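In the absence of graphs, even an ad-hoc capture on the agent hosts would help us line usage up with the hang timestamps, for example (assuming the sysstat package is installed):

nohup sar -u -r -b 30 >> /tmp/$(hostname)-usage.log 2>&1 &   # CPU, memory and IO samples every 30s, timestamped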

I wonder if it’s a symptom of something more dangerous.

Also, the UI tests are even flakier this week due to timeouts in the Chrome driver.

Since these problems started at the same time, I suspect they’re related.


We just had two more inexplicably failed Ebola builds:

https://ci.openmrs.org/browse/EBOLA-EEM-893/log

...
26-Mar-2015 12:54:16	Agent 'gw108.iu.xsede.org (2)' went offline while building EBOLA-EEM-JOB1-893. The results of that build will not be available. 

https://ci.openmrs.org/browse/EBOLA-EEM-894/log

No logs were found for this job

The next build passed (with no intervention by me).

This one might have been caused by the crash of the Bamboo server at that time:

http://status.openmrs.org/pages/incident/534decd528328a247100004b/55140a76319b7b9805000009

I’m seeing more of these going offline now:

(And now it seems like there’s only one agent alive…)

I see there’s now a Bamboo plugin that exposes more of the REST API for remote agents; are we monitoring it somewhere now? It’s a surprisingly complicated plugin, and I’m still not sure what it’s trying to achieve.

Anyway, there were plenty of occurrences:

gw108:/home/bamboo-agent-1/bamboo-agent$ fgrep "hung" atlassian-bamboo-agent.log*

atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/04 15:15:58 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/05 08:46:03 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/05 09:35:35 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/11 12:52:28 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/17 02:32:32 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/18 18:52:31 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 06:27:26 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 10:00:38 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 22:51:25 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 23:12:48 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 23:15:57 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/21 00:01:37 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/21 23:05:03 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/22 00:03:19 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/24 22:01:43 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:03:54 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:05:49 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:13:53 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/28 00:02:45 | JVM appears hung: Timed out waiting for signal from JVM.

It’s getting way too common. We’ve got logs since 2015/02/28 (when I upgraded Bamboo), but it looks like it only started happening on 2015/03/04. Still investigating.

gw107 had 5 zombie processes (only one on gw108) that had been running since before the upgrade. I don’t think they were related to this problem, but I’ve cleaned up both hosts anyway. I stopped and restarted a couple of builds to get everything clean.

gw107:~$ ps aux | grep 'Z'
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
bamboo-+  9659  0.0  0.0      0     0 ?        Z    Mar26   0:00 [java] <defunct>
bamboo-+  9686  0.0  0.0      0     0 ?        Z    Mar26   1:28 [java] <defunct>
bamboo-+ 12188  0.0  0.0      0     0 ?        Z    Mar26   0:00 [wrapper] <defunct>
bamboo-+ 13709  0.0  0.0      0     0 ?        Z    Mar26   0:30 [java] <defunct>
cintiadr 20824  0.0  0.0  11744   932 pts/1    S+   23:04   0:00 grep --color=auto Z
bamboo-+ 29184  0.1  0.0      0     0 ?        Z    Mar26   6:24 [java] <defunct>
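For the record, a slightly more useful variant also shows the parent PIDs, since zombies can’t be killed directly; it’s the parent that has to reap them (or be restarted):

ps -eo pid,ppid,stat,user,comm | awk '$3 ~ /^Z/'   # zombie entries plus the PPID we'd need to restart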

So far, our timeline is:

2015/02/28: Bamboo upgraded, agents restarted
2015/03/04: Problem started

Notes: I didn’t run Puppet on the agents, so the only thing on the agent side that actually changed was the remote agent itself, which is updated automatically by Bamboo. The Java version wasn’t touched. Don’t read too much into the hours, as the remote agents were in the Hawaii timezone until 20 March.

For the sake of completeness, here are all the occurrences:

gw108:bamboo-agent-1

atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/04 15:15:58 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/05 08:46:03 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/05 09:35:35 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/11 12:52:28 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/17 02:32:32 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/18 18:52:31 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 06:27:26 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 10:00:38 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 22:51:25 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 23:12:48 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 23:15:57 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/21 00:01:37 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/21 23:05:03 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/22 00:03:19 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/24 22:01:43 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:03:54 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:05:49 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:13:53 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/28 00:02:45 | JVM appears hung: Timed out waiting for signal from JVM.

gw108:bamboo-agent-2

atlassian-bamboo-agent.log.4:ERROR  | wrapper  | 2015/03/05 09:35:34 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.4:ERROR  | wrapper  | 2015/03/05 11:20:41 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.4:ERROR  | wrapper  | 2015/03/06 21:10:59 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.4:ERROR  | wrapper  | 2015/03/06 21:13:36 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/11 04:54:52 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/18 06:08:30 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/18 18:34:26 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/18 18:51:47 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/19 06:27:09 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/19 09:17:21 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 10:00:38 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 10:08:51 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 10:41:37 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 22:51:10 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 23:12:48 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 23:15:57 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/20 23:20:47 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/21 00:07:06 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/21 01:03:43 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/22 00:03:21 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/22 01:04:54 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 12:45:10 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:03:54 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:05:50 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 23:13:53 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/28 00:02:45 | JVM appears hung: Timed out waiting for signal from JVM.

gw107:bamboo-agent-1

atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/11 12:51:28 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/11 19:01:14 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/12 08:40:54 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/18 18:51:29 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 13:05:33 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/21 22:14:34 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 18:35:14 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 20:11:04 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 20:44:27 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 21:51:14 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 22:35:27 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/27 23:43:07 | JVM appears hung: Timed out waiting for signal from JVM.

gw107:bamboo-agent-2

atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/11 12:49:06 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.3:ERROR  | wrapper  | 2015/03/11 20:30:37 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.2:ERROR  | wrapper  | 2015/03/12 08:40:59 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/18 18:51:29 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 09:15:08 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log.1:ERROR  | wrapper  | 2015/03/19 13:04:56 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 20:11:02 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/26 21:56:42 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/27 23:43:07 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/28 00:02:41 | JVM appears hung: Timed out waiting for signal from JVM.
atlassian-bamboo-agent.log:ERROR  | wrapper  | 2015/03/28 00:45:14 | JVM appears hung: Timed out waiting for signal from JVM.

Apparently gw107 suffers from the problem less, and it started later there. It looks more common on Wednesdays/Thursdays/Fridays?
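To check that day-of-week hunch, something like this could tally the occurrences (assumes GNU date and the wrapper log format above):

fgrep -h "JVM appears hung" atlassian-bamboo-agent.log* \
  | awk -F'|' '{print $3}' | awk '{print $1}' \
  | while read -r d; do date -d "${d//\//-}" +%A; done \
  | sort | uniq -c | sort -rn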

Unfortunately, I’ve got no syslogs anymore.

Can we get any data about server usage, including the VM hosts?

I haven’t looked for data about the WebDriver flakiness, but I’m inclined to disable one of the remote agents on gw108 and see how it goes.

We were hoping the API plugin would provide a way to monitor agents, but alas, no. So far we haven’t found a way to hook agent availability into our monitoring system.
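If the stock Bamboo REST agent resource is exposed on our version, a poll like this could be wired into monitoring; this is only a sketch, the field names are guesses, and it assumes jq is installed:

curl -s -u "$BAMBOO_USER:$BAMBOO_PASS" 'https://ci.openmrs.org/rest/api/latest/agent.json' \
  | jq -r '.[] | "\(.name) enabled=\(.enabled) busy=\(.busy)"'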

@ryan was brainstorming about possibly running New Relic on the agent hosts to monitor the Java processes, but with Bamboo’s strange Java wrapper we weren’t sure if it would be compatible; more research is needed.

It is possible to use New Relic; so far I’ve only used it to get a little performance data from the UI. Which monitoring stack do we have? I have some experience connecting Bamboo with Graphite + Opsview, and we have a plugin to send data to Graphite and statsd. But the data we have is far too much for the majority of customers.
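For example, pushing a gauge into statsd is just a UDP line; the host, port, and metric name below are placeholders:

echo "bamboo.agents.online:3|g" | nc -u -w1 statsd.example.org 8125   # plain statsd line protocol ("<metric>:<value>|g")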

I believe it’s possible to run remote agents without the wrapper, but then we’d need to configure some other monitor for the process (a daemon or something similar), and we’d probably need to refactor quite a bit of the current Puppet module.
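The usual wrapper-less way to start a remote agent is roughly the following (the version in the jar name is a placeholder), supervised by whatever we pick (systemd, supervisord, monit, ...):

java -Dbamboo.home=/home/bamboo-agent-1/bamboo-agent \
     -jar atlassian-bamboo-agent-installer-<version>.jar \
     https://ci.openmrs.org/agentServer/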

But knowing whether the agent is connected isn’t the only thing we want; we also need information like the VM’s memory consumption, CPU, and so on. That isn’t tied to the Java process alone, though we could get the JVM to send this data somewhere (while it’s still responsive). What ends up happening is that the JVM stops answering and we get gaps in the graphs, which is expected. If you let me know which monitoring system we’re using, I can look into how to hook the agents and the server into it.