Bamboo agents to be recreated this weekend (1st/2nd August)

Hi peeps!

This weekend, I’m going to be destroying and recreating at least one of our Bamboo agents (the thingies that run our CI builds). I will do one at a time, and hopefully that will help with builds running out of disk space.


That will be awesome!!! Thanks @cintiadr :slight_smile:

The yak agent has been replaced.

Next weekend will be yue. Please let me know if there’s anything weird with those machines.

Thanks @cintiadr!

All bamboo agents were recreated. Please let me know if there’s something weirder than usual.


I will be fixing https://ci.openmrs.org/browse/TRAN-TRAN-1463/log tomorrow

Transifex build is finally fixed.

Thanks for the fix.

Is there any possibility of this being related to it? https://ci.openmrs.org/browse/TRUNK-MASTER-2456/log

And: https://ci.openmrs.org/browse/TRUNK-OC3-54/log

Maybe, @dkayiwa.

Core 2.3.0:

 	[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project openmrs-api: Execution default-test of goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test failed: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?

It seems that the JVMs are just dying.

Could be related to the minor Java update that happened. It was update 252, and it’s now 265.

So @dkayiwa, the problem wasn’t java.

Turns out there’s an OCL build, started a couple of days ago, that is leaving a lot of Docker containers behind. The leftover containers keep eating all the CPU and memory on the agents. The automation that is supposed to help with that doesn’t seem to be working.

Are you able to check with the OCL team and make sure their builds clean up after themselves, so they don’t break other builds?

I don’t know why the automatic cleanup wasn’t working, but we shouldn’t really rely on it.


Thanks @cintiadr for looking into this!

@grace what do you think of this?


Do we know which builds weren’t getting cleaned up? I’m guessing this is from the OCL dev team (e.g., work on OCL API v2) and not the OCL for OpenMRS squad.


I didn’t have time to chase down exactly which build was triggering it, @burke.

But it does seem to be the OCL API that was up, indeed. I could see the API, Celery, and a couple more Docker containers (unless, of course, OCL for OpenMRS starts the whole OCL stack during CI).

I will try to check if there’s any agent suffering from the same problem, and will try to pinpoint the build based on start time.
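For anyone curious, pinpointing the build by start time can be done by listing the leftover containers with their age. This is just a sketch of the kind of command I mean; it assumes shell access to the Bamboo agent with Docker installed:

```shell
# List all containers (including stopped ones) with how long ago each was
# created, so the creation time can be matched against a build's start time.
docker ps -a --format 'table {{.Names}}\t{{.Image}}\t{{.RunningFor}}\t{{.Status}}'
```

Cross-referencing the “running for” column against the Bamboo build history should narrow it down to the offending plan.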

I’m sure it’s the API team, then. I’ll let them know to make sure they bring down containers post-build.

Thanks, @cintiadr!


I believe this is the build: https://ci.openmrs.org/browse/OCL-OCLAPI2

I’ll tag @raff and @sny as they seem to have touched that build.

I’ve added cleanup to https://ci.openmrs.org/build/admin/edit/editBuildTasks.action?buildKey=OCL-OCLAPI2-BO
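For reference, a final cleanup task along these lines is what I have in mind. The `oclapi` name filter is an assumption for illustration; adapt it to however the build actually names its containers:

```shell
# Hedged sketch of a final Bamboo task: force-remove any containers the
# build left behind (the name filter "oclapi" is a placeholder assumption),
# then reclaim disk from dangling images and unused volumes.
docker ps -aq --filter "name=oclapi" | xargs -r docker rm -f
docker system prune -f --volumes
```

Running it as a “final” task means it executes even when the build itself fails, which is exactly when containers tend to get left behind.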

Thanks for tracking it down!


@cintiadr is it in any way related to this? failed; error='Cannot allocate memory' (errno=12) https://ci.openmrs.org/browse/OP-OPM-BS-776/log

I expect it to be related, @dkayiwa.

I will do another cleanup in a few hours; hopefully that will cover it.