Migrating to Jetstream 2!

I’m sorry for leaving the party without notice. I had to deal with some infection and stayed in bed for the last few days. Thanks @ibacher for stepping in.

I see you fixed the coreapps issue already.

Yeah. Switching to Firefox in headless mode seems to work. The agents have Chromium rather than Chrome which, apparently, the (very old) version of karma doesn’t recognise.

Let me get the latest update here before I go to sleep and confuse everyone else in the process.

  • All VMs should now be created (full list). I can successfully run ansible on all of them, but the actual docker containers, services and atlassian suite either haven’t been installed yet or are probably broken.
  • I was forced to upgrade datadog and change our custom monitoring to get them working again. Seems fine.
  • I’ve cleaned up a bunch of things which I reckon aren’t being used anymore. If I deleted something I shouldn’t have, there’s always git to recover it!
  • Instead of keeping all the forks of multiple ansible roles, I just added them to the custom_roles folder. It will be simpler for us to maintain.
  • I removed the Jetstream 1 machines from our ansible inventory, as the changes I’m making are, more likely than not, incompatible with them. If you need to change something there, do it manually for the time being, and let me know.

I’m not sure exactly which machines I will be migrating this weekend, but I will let you know once I have a step-by-step guide for each machine, in case someone with jetstream/terraform/ansible access would like to help.


@cintiadr, you are a ROCK STAR!


Setup

  • Follow the instructions in the terraform README to generate a credentials file with both Jetstream 1 and 2. This only works if you already have terraform permissions.
  • Ensure you can run ./build.rb plan <machine> against both a Jetstream 1 and a Jetstream 2 machine. Check the region in our docs to tell Jetstream 2 (region: v2) apart from Jetstream 1.
  • Ensure you have access to our backups in AWS S3. Read about how to recover docker backups

Per machine

  • Change ansible on the machine until you are satisfied with the status
  • Verify what needs to have a backup, and how to extract backups. For docker compose apps, check the backups. Otherwise, atlassian apps will have their data either in a database or home folder in /data.
  • Create a maintenance notification in our status page
  • In terraform, edit the previous DNS records to add a -v1 suffix, and change the -v2 record to the new one. Apply via terraform plan/apply. Please note that most of our DNS entries have a TTL of 5 minutes.
  • Update the terraform docs via ./build.rb docs && ./build.rb plan docs && ./build.rb apply docs
  • Update ansible variables with the new value. You will need to recreate the letsencrypt cert and nginx config (tags tls and web). Follow the instructions on the README.
  • Please note that the previous server does not have ansible, so if you need to access it, you may need to modify things manually.
  • Generate the backups and move them across to the new server. Apply them.
  • Confirm things work as expected
  • End the maintenance on our status page

Current state of new and old machines:

Migrated machines

  • bele and bonga are ready from my point of view. I’m not sure which parts of emr-3 need to be migrated; I migrated some. Feel free to edit terraform/ansible to get the rest back, @ibacher. Also, @raff, feel free to point the OCL DNS entries to the new machines. bonga will probably need to be upgraded from quad to medium soon, so go ahead if you need to do it. From my point of view, balaka, dowa, nairobi, nakuru and narok can be powered off as soon as you let me know.
  • worabe, our new CI server, took me a while! It was broken in several different ways: our backups weren’t working, storing artefacts in S3 apparently wasn’t working either, and it took me forever to upgrade. lobi can be powered off.
  • adaba, our new ID server. I migrated crowd, ldap and ID there, and I had to do some ungodly tricks with symlinks to get ldap to work with TLS. I also discovered in the process that our SMTP server isn’t working anymore, so new sign ups aren’t working. ako, ambam and baragoi can be powered off.
  • mojo, our new jira server, seems fine. maroua can be powered off.
  • mota, our new wiki server, seems ok. menji and salima can be powered off.

Pending machines:

  • maji: I’m also struggling to get the new discourse/talk up, it’s complaining about some ruby things I’m clueless about. To be discovered, but I don’t want to migrate before we fix the SMTP issues anyway. Let me know if you’d like to investigate.
  • goba and gode: miscellaneous services, haven’t even started. Will probably do during the week.
  • jinka: website and several redirects. Haven’t even started. I guess I might do it, at least partially, during the week.

If you think you can help me, please pick goba, gode or jinka.


Updated bonga to include oclclient-prd in addition to stg, qa, demo and pointed to OCL DNS entries. @ibacher is oclclient-dev or oclclient-clone still needed? It’s currently not deployed to bonga. I haven’t updated the VM to medium yet.

oclclient-dev would probably be good to keep up (it’s essentially tracking the master branch). oclclient-clone was intended to be a short-lived part and can be dropped, AFAIK.


@raff I had deployed oclclient-prd to bele, not sure if we should delete from there, then? Also, we seem to be having certificate issues there with https://openmrs.openconceptlab.org/


Remaining machines

Please note I’m keeping vms docs always updated

  • jinka: redirects and website migrated successfully :tada: . mua and campo can be powered off. I’ll keep an eye on it, but if there are any issues, you can manually change the DNS back to the old server and it should automatically work.
  • maji: same as before. Discourse wasn’t starting last time I checked
  • goba and gode: miscellaneous services, I haven’t even started.

Known issues

  • I reckon backups for atlassian jira/wiki/bamboo are probably not working properly
  • We still have the issue of having to restart LDAP every couple of months to pick up new certs. To be honest, I might have added new certificate issues there…
  • I was forced to add a -refresh=false on our terraform plans as it was attempting to create new data volumes. Not sure what’s happening, maybe it will solve itself on Jetstream

My bad I didn’t notice it. I deleted oclclient-prd from bonga and let it run on bele. Fixed the certificate issue.

Added oclclient-dev to bonga. I’ll leave the oclclient-clone config around, but I won’t deploy it unless someone asks for it.


Update of the day:

  • maji has a discourse running. I was forced to move to stable. Will migrate talk over the weekend.
  • gode is down? Not sure what happened, didn’t touch it.

Thanks @cintiadr, @ibacher, and @raff for the migration. Great to see the progress!

I noticed we can’t edit any pages on the OpenMRS wiki (trying to edit any page returns a System Error page). It appears to be caused by the issue described in “Confluence MySQL database migration causes content_procedure_for_denormalised_permissions does not exist error”. The solution is to include --routines in the mysqldump command when backing up, so that the dump includes the stored procedures that were introduced in Confluence 7.11.0. I see a mysqldump.sh.j2 ansible template. I’m guessing we’d want to add --routines to its OPTIONS, assuming this is what is used to back up our Confluence data. I’m leery of making these changes myself, since I don’t want to break things when we only have 10 days left to complete the migration.

Can we make a new backup from our Jetstream1 Confluence instance using the --routines option? I think we need this before our wiki will work again.
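For reference, a minimal sketch of what such a dump command could look like. The database name `confluence` is a hypothetical placeholder, credentials and host options are left out, and this is not necessarily how mysqldump.sh.j2 builds its command line. The only point is the `--routines` flag, which makes mysqldump include stored procedures and functions.

```shell
#!/bin/sh
# Hypothetical database name, used only for illustration.
DB_NAME="${DB_NAME:-confluence}"

# Print the dump command instead of running it, so it can be reviewed first.
# --routines adds stored procedures and functions to the dump.
dump_command() {
  printf 'mysqldump --routines %s' "$1"
}

# Show the full command line, including the output redirection.
echo "$(dump_command "$DB_NAME") > ${DB_NAME}.sql"
```

The script only prints the command; copy it and add the connection options for the real server before running it.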


In case you haven’t seen it, Burke was correct: I copied the routines and it seemed to do the trick.


I come with bad news about talk. I spent hours trying to get the migration going. I wanted to finish it during the weekend, to disrupt you the least. It wasn’t successful at all.

Let’s see what they have to say.


Today’s update:

Please note I’m keeping vms docs always updated

  • maji: I’m worried about talk. Hopefully the request we opened will get us enough help.
  • gode: staging for addons and atlas. Done :tada:
  • goba: migrated addons and atlas. Missing implementation, quizgrader, shields and radarproxy. Should be done this week.

I will continue to delete Jetstream 1 machines as the week progresses.


Known issues

  • I reckon backups for atlassian jira/wiki/bamboo are probably not working properly
  • We still have the issue of having to restart LDAP every couple of months to pick up new certs. To be honest, I might have added new certificate issues there…
  • I was forced to add a -refresh=false on our terraform plans as it was attempting to create new data volumes. Not sure what’s happening, maybe it will solve itself on Jetstream

Maybe it’s because our split config only upgrades web by default. So, while our Talk might report itself as, say 2.9.0.beta7, it’s really only that version for the web component and an older version (last manual rebuild of data) for the data component. That could cause havoc for a migration that expects the data to be tests-passed but is getting data from some arbitrary older state.

Did you rebuild both web and data on prod before creating the backup for migration?

I’m creating a whole new server from scratch. I did delete all the data and rebuild both containers dozens of times.

So turns out the problem was the branch we were using to clone the discourse launcher. Somewhere along the line it changed from master to main, but our ansible continued to point to master.
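A quick way to catch this kind of mismatch is to ask the remote which branch its HEAD points at and compare that with what ansible has configured. The sketch below parses the output of `git ls-remote --symref`. The repository URL in the usage comment is an assumption about where the launcher is cloned from, and `default_branch` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `git ls-remote --symref <url> HEAD` output on stdin and print the
# name of the branch that the remote HEAD points at.
default_branch() {
  sed -n 's|^ref: refs/heads/\([^[:space:]]*\)[[:space:]]*HEAD$|\1|p'
}

# Usage (needs network access; the URL is an assumption):
#   git ls-remote --symref https://github.com/discourse/discourse_docker.git HEAD | default_branch
```

If the printed name differs from the branch in the ansible variables, the clone is tracking a branch that is no longer the default.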

The new talk server is empty, but finally up, with the new version I needed, 2.9.0.beta7. I will schedule the talk migration for a few hours from now, probably during my lunch time. I think that will be the least disruptive time.


  • maji: New talk is up. I will attempt to migrate it again tomorrow.
  • goba: I migrated all the little things there. I’m not sure if radarproxy and shields are working… they showed an empty screen when accessed from the browser, so I’m not sure if I broke something.

  • Somehow the bonga machine was tainted (marked for full recreation) in terraform. I undid that because I don’t think we need to delete it right now.
  • Previous known issues still apply
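For anyone who runs into a tainted resource again, here is a rough sketch of how to spot and clear it with plain terraform. It assumes the plan output marks tainted resources with the line "is tainted, so must be replaced", which may vary between terraform versions, and that you can run terraform directly rather than through build.rb. `tainted_resources` is a hypothetical helper name.

```shell
#!/bin/sh
# Read `terraform plan -no-color` output on stdin and print the address of
# every resource that terraform wants to replace because it is tainted.
tainted_resources() {
  sed -n 's/^[[:space:]]*# \(.*\) is tainted, so must be replaced.*$/\1/p'
}

# Usage (needs an initialised terraform working directory):
#   terraform plan -no-color | tainted_resources
#   terraform untaint 'ADDRESS_PRINTED_ABOVE'
```

`terraform untaint` only removes the taint mark from the state; it does not change the machine itself.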

And here I thought GitHub was supposed to have some clever redirects to handle that!

Whoops! My fault! I forgot to undo that (when it lost connectivity I was originally just going to try recreating it before I found out I could solve it much more easily…)

Something we should probably do across the board in OpenMRS. I changed my default branch for personal repos from master to main years ago.

Yay! You’re awesome, @cintiadr!

You can test shields with https://shields.openmrs.org/plan/TRUNK/MASTER

You can test radarproxy with https://radarproxy.openmrs.org/openmrs%20radar.json

I added this info to our ITSM wiki, including a new page for radarproxy.

In any case, these both are working fine on goba. Thanks again @cintiadr!


Alright, let’s see if this email lands in my mailbox. Testing testing.

I have a suspicion that it’s not technically viable to do that easily, due to how git works. That said, a warning in the discourse launcher logs would have kept me sane!


Alright, all machines are migrated! :smiley: Took a hot minute, but here we are.

I will slowly delete all the other machines, a few per day. On the weekend, I will delete the old networking as well.


I will probably create tickets for all the follow-up tasks.
