Migrating to Jetstream 2!

Thanks so much for the guidance; I will strictly follow each step. Working with you has been a blessing to me.

1 Like

Found one more small thing: apparently nvm.sh was not marked as executable, so the builds depending on nvm failed. Otherwise, xindi seems to have successfully run around 80 builds.
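For reference, a minimal sketch of that kind of fix on an agent, assuming nvm was installed under the build user's home directory (the exact path on our agents may differ):

    # Assumed install location; adjust for the actual build user on the agent.
    NVM_SH="$HOME/.nvm/nvm.sh"

    # Mark the script executable so builds that invoke it directly stop failing.
    chmod +x "$NVM_SH"

    # Sanity check: source it and ask nvm for its version.
    . "$NVM_SH" && nvm --version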

Kindly, I don’t have access to AWS. How do I go about getting it?

Ok, I found another small issue, but it only affects one artefact: coreapps uses PhantomJS 1 to run some browser-based tests. The problem here is that PhantomJS 1 depends on an old, vulnerable version of OpenSSL. I’m going to see if those tests can be changed to work with Chrome or Firefox, since we seem to have headless versions of those available on the Bamboo agents.

2 Likes

I’m sorry for leaving the party without notice. I had to deal with some infection and stayed in bed for the last few days. Thanks @ibacher for stepping in.

I see you fixed the coreapps issue already.

Yeah. Switching to Firefox in headless mode seems to work. The agents have Chromium rather than Chrome, which, apparently, the (very old) version of karma doesn’t recognise.
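For anyone hitting the same problem elsewhere, a rough sketch of the kind of change involved; this is generic karma usage, not necessarily how coreapps wires up its tests:

    # Headless Firefox needs the FirefoxHeadless launcher from karma-firefox-launcher.
    npm install --save-dev karma-firefox-launcher

    # One-off run overriding whatever browsers karma.conf.js lists.
    npx karma start --browsers FirefoxHeadless --single-run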

Let me get the latest update here before I go to sleep and confuse everyone else in the process.

  • All VMs should now be created - full list. I can successfully run ansible on all of them, though the actual docker containers/services/atlassian suite either haven’t been installed yet or are probably broken
  • I was forced to upgrade datadog and change our custom monitoring to get them working again. Seems fine.
  • I’ve cleaned up a bunch of things which I reckon aren’t being used anymore. If I deleted something I shouldn’t have, there’s always git to recover it!
  • Instead of keeping forks of multiple ansible roles, I just added them to the custom_roles folder. It will be simpler for us to maintain
  • I removed Jetstream 1 machines from our ansible inventory, as the changes I’m making are, more likely than not, incompatible. If you need to change something there, do it manually for the time being, and let me know.

I’m not sure exactly which machines I will be migrating this weekend, but I will let you know once I have a step-by-step for each machine, in case someone with jetstream/terraform/ansible access would like to help.

3 Likes

@cintiadr, you are a ROCK STAR!

1 Like

Setup

  • Follow the instructions on the terraform readme file to generate a credentials file with both Jetstream 1 and 2. It only works if you already have terraform permissions
  • Ensure you can run ./build.rb plan <machine> on both a Jetstream 1 and a Jetstream 2 machine (see the sketch after this list). Check the region in our docs to differentiate Jetstream 2 (region: v2) from Jetstream 1
  • Ensure you have access to our backups in AWS S3. Read about how to recover docker backups
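A rough sketch of the checks above; the machine names and the backup bucket are placeholders, so substitute the real ones from our docs:

    # Verify terraform access by planning one machine in each cloud.
    # Use the region column in our docs to tell Jetstream 2 (region: v2)
    # apart from Jetstream 1.
    ./build.rb plan <jetstream1-machine>
    ./build.rb plan <jetstream2-machine>

    # Verify access to the docker backups in S3 (bucket name is an assumption).
    aws s3 ls s3://<backup-bucket>/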

Per machine

  • Change ansible on the machine until you are satisfied with the status
  • Verify what needs to be backed up, and how to extract the backups. For docker compose apps, check the backups. Atlassian apps will have their data either in a database or in a home folder under /data.
  • Create a maintenance notification in our status page
  • Go to terraform, edit the previous DNS records to add -v1, and change the -v2 record to the new one (see the sketch after this list). Apply via terraform plan/apply. Please note that most of our DNS entries have a TTL of 5 minutes.
  • Update the terraform docs via ./build.rb docs && ./build.rb plan docs && ./build.rb apply docs
  • Update ansible variables with the new value. You will need to recreate the letsencrypt cert and nginx config (tags tls and web). Follow the instructions on the README.
  • Please note that the previous server does not have ansible, so if you need to access it, you may need to modify things manually
  • Generate the backups and move them across to the new server. Apply them.
  • Confirm things work as expected
  • End the maintenance in our status page
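A condensed, purely illustrative sketch of the flow above; the playbook, inventory and path names are assumptions, so always follow the actual READMEs:

    # 1. DNS: rename the old record to -v1, point the main/-v2 record at the
    #    new VM, then plan and apply (most records have a 5 minute TTL).
    ./build.rb plan <machine> && ./build.rb apply <machine>

    # 2. Regenerate the terraform docs.
    ./build.rb docs && ./build.rb plan docs && ./build.rb apply docs

    # 3. Re-run ansible against the new host to recreate the letsencrypt cert
    #    and nginx config (playbook/inventory names are placeholders).
    ansible-playbook -i inventory site.yml --limit <machine> --tags tls,web

    # 4. Move the backups across and restore them (paths are placeholders).
    rsync -avz /data/backups/ <new-machine>:/data/backups/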
1 Like

Current state of new and old machines:

Migrated machines

  • bele and bonga are ready from my point of view. I’m not sure which parts of emr-3 need to be migrated; I migrated some. Feel free to edit terraform/ansible to get it back, @ibacher . Also, @raff , feel free to point the OCL dns entries to the new machines. bonga will probably need to be upgraded from quad to medium soon, so go ahead if you need to do it. From my point of view, balaka, dowa, nairobi, nakuru and narok can be powered off as soon as you let me know
  • worabe, our new CI server, took me a while! It was broken in several different ways: our backups weren’t working, storing artefacts in S3 also didn’t seem to be working, and it took me forever to upgrade. lobi can be powered off.
  • adaba, our new ID server. I migrated crowd, ldap and ID there, but I discovered in the process that our SMTP server isn’t working anymore. I had to do some ungodly tricks with symlinks to get ldap to work with TLS. So new sign ups aren’t working. ako, ambam and baragoi can be powered off.
  • mojo, our new jira server, seems fine. maroua can be powered off.
  • mota, our new wiki server, seems ok. menji and salima can be powered off

Pending machines:

  • maji: I’m also struggling to get the new discourse/talk up; it’s complaining about some ruby things I’m clueless about. To be discovered, but I don’t want to migrate before we fix the SMTP issues anyway. Let me know if you’d like to investigate.
  • goba and gode: miscellaneous services, haven’t even started. Will probably do during the week.
  • jinka: website and several redirects. Haven’t even started. I guess I might do it, at least partially, during the week.

If you think you can help me, please pick goba, gode or jinka.

3 Likes

Updated bonga to include oclclient-prd in addition to stg, qa and demo, and pointed the OCL DNS entries to it. @ibacher is oclclient-dev or oclclient-clone still needed? It’s currently not deployed to bonga. I haven’t updated the VM to medium yet.

oclclient-dev would probably be good to keep up (it’s essentially tracking the master branch). oclclient-clone was intended to be short-lived and can be dropped, AFAIK.

1 Like

@raff I had deployed oclclient-prd to bele, not sure if we should delete it from there, then? Also, we seem to be having certificate issues there with https://openmrs.openconceptlab.org/


Remaining machines

Please note I’m keeping the vms docs up to date

  • jinka: redirects and website migrated successfully :tada: . mua and campo can be powered off. I’ll keep an eye on it, but if there are any issues, you can manually change the DNS to the old server and it should automatically work.
  • maji: same as before. Discourse wasn’t starting last time I checked
  • goba and gode: miscellaneous services, I haven’t even started.

Known issues

  • I reckon backups for atlassian jira/wiki/bamboo are probably not working properly
  • We still have the issue of having to restart LDAP every couple of months to pick up new certs. To be honest, I might have added new certificate issues there…
  • I was forced to add a -refresh=false to our terraform plans (see the example after this list) as terraform was attempting to create new data volumes. Not sure what’s happening; maybe it will solve itself on Jetstream
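For context, the workaround just passes terraform’s own flag through on plans, roughly like this (how it’s wired into build.rb is an assumption):

    # Skip refreshing remote state so terraform stops trying to recreate
    # the existing data volumes.
    terraform plan -refresh=false -out=tfplan
    terraform apply tfplan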

My bad, I didn’t notice it. I deleted oclclient-prd from bonga and let it run on bele. Fixed the certificate issue.

Added oclclient-dev to bonga. I’ll leave the oclclient-clone config around, but I won’t deploy it unless someone asks for it.

1 Like

Update of the day:

  • maji has a discourse running. I was forced to move to stable. Will migrate talk over the weekend.
  • gode is down? Not sure what happened, didn’t touch it.

Thanks @cintiadr, @ibacher, and @raff for the migration. Great to see the progress!

I noticed we can’t edit any pages on the OpenMRS wiki (trying to edit any page returns a System Error page). It appears to be caused by: Confluence MySQL database migration causes content_procedure_for_denormalised_permissions does not exist error. The solution is to include --routines in the mysqldump command when backing up, so the stored procedures introduced since Confluence 7.11.0 are included. I see a mysqldump.sh.j2 ansible template. I’m guessing we’d want to add --routines to its OPTIONS, assuming this is what is used to back up our Confluence data. I’m leery of making these changes myself, since I don’t want to break things when we only have 10 days left to complete the migration.

Can we make a new backup from our Jetstream1 Confluence instance using the --routines option? I think we need this before our wiki will work again.
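For illustration, the kind of dump being described; the database name and credentials here are placeholders, and in practice the flag would go into the OPTIONS of mysqldump.sh.j2:

    # Back up Confluence including stored procedures/functions, which
    # Confluence depends on since 7.11.0.
    mysqldump --routines --single-transaction \
        -u "$DB_USER" -p"$DB_PASS" confluence > confluence-backup.sql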

1 Like

In case you haven’t seen, Burke was correct; I copied the routines and it seemed to do the trick.


I come with bad news about talk. I spent hours trying to get the migration going. I wanted to finish it during the weekend, to disrupt you all as little as possible, but it wasn’t successful at all.

Let’s see what they have to say.

1 Like

Today’s update:

Please note I’m keeping the vms docs up to date

  • maji: I’m worried about talk. Hopefully the request we opened will be enough to get help.
  • gode: staging for addons and atlas. Done :tada:
  • goba: migrated addons and atlas. Missing implementation, quizgrader, shields and radarproxy. Should be done this week.

I will continue to delete Jetstream 1 machines as the week progresses.


Known issues

  • I reckon backups for atlassian jira/wiki/bamboo are probably not working properly
  • We still have the issue of having to restart LDAP every couple of months to pick up new certs. To be honest, I might have added new certificate issues there…
  • I was forced to add a -refresh=false to our terraform plans as terraform was attempting to create new data volumes. Not sure what’s happening; maybe it will solve itself on Jetstream

Maybe it’s because our split config only upgrades web by default. So, while our Talk might report itself as, say 2.9.0.beta7, it’s really only that version for the web component and an older version (last manual rebuild of data) for the data component. That could cause havoc for a migration that expects the data to be tests-passed but is getting data from some arbitrary older state.
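If that’s the case, the fix before taking the migration backup would be something along these lines, assuming the standard discourse_docker split-container layout (the container names and paths on our server may differ):

    cd /var/discourse            # assumed launcher checkout location

    # Rebuild the data container first, then the web container, so both end
    # up on the same Discourse version before the backup is taken.
    ./launcher rebuild data
    ./launcher rebuild web_only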

Did you rebuild both web and data on prod before creating the backup for migration?

I’m creating a whole new server from scratch. I did delete all the data and rebuild both containers dozens of times.

So it turns out the problem was the branch we were using to clone the discourse launcher. Somewhere along the line it changed from master to main, but our ansible continued to point to master.
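For reference, the fix amounts to pointing the checkout at main; a minimal sketch, assuming the launcher comes straight from the discourse_docker repo (the ansible variable that actually controls this may be named differently):

    # The launcher repo's default branch moved from master to main;
    # clone it (or update the branch in the ansible git task) accordingly.
    git clone --branch main https://github.com/discourse/discourse_docker.git /var/discourse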

The new talk server is empty, but finally up, with the new version I needed, 2.9.0.beta7. I will schedule the talk migration for a few hours from now, around my lunch time; I think that will be the least disruptive time.


  • maji: New talk is up. I will attempt to migrate it again tomorrow.
  • goba: I migrated all the little things there. I’m not sure if radarproxy and shields are working… they showed an empty screen when accessed from the browser, so I’m not sure if I broke something else (see the quick check at the end of this update).

  • Somehow bonga machine was tainted (marked for full recreation) in terraform. I undid that because I don’t think we need to delete it right now.
  • Previous known issues still apply
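A quick way to check whether radarproxy and shields are serving anything at all (placeholder URLs):

    # Check the HTTP status, then whether any body comes back at all.
    curl -sSI https://<radarproxy-url>/
    curl -sS  https://<shields-url>/ | head -c 200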
3 Likes