Migrating critical services off IU XSEDE and setting up offsite backups using Amazon Glacier

Given today’s IU XSEDE outage, we should consider moving some of our infrastructure into AWS. I’ll draft a budget soon estimating how much this may cost; if we need money for anything, it’s infrastructure. All of our critical infrastructure lives on IU XSEDE (JIRA, Confluence, Bamboo and all its agents, and Modulus). This is bad: if a fire ever occurs in that data center, we’re in big trouble. The other option is to spread our services between the AMS1 and AMS2 DigitalOcean data centers; I will price that out as well. We currently have two VPSs (Virtual Private Servers) in each of those data centers.

Relying on free is nice, but we get no SLA, which means uptime is hard to guarantee. As of a few minutes ago, I had to go disable the globalnavbar in Discourse because it was making Discourse load slowly.

We definitely need to consider paying for Amazon Glacier (Amazon S3 is also an option) to store nightly database backups. Putting data in is cheap; getting it out is pricey, but we shouldn’t ever need to touch the backups (hopefully). We currently have no backups at all.
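For the sake of discussion, the nightly job could be as simple as something like this sketch (the bucket name, database name, and credentials are placeholders, not anything we’ve actually set up):

```bash
#!/bin/bash
# Sketch of a nightly offsite backup. Assumes the AWS CLI is installed and
# configured; bucket, database, and credentials below are placeholders.
set -euo pipefail

STAMP=$(date +%F)
DUMP="/tmp/jira-backup-${STAMP}.sql"

mysqldump --single-transaction jira > "${DUMP}"   # credentials via ~/.my.cnf
bzip2 "${DUMP}"

# Push offsite; a lifecycle rule on the bucket could move old objects
# into Glacier automatically.
aws s3 cp "${DUMP}.bz2" "s3://openmrs-backups/jira/${STAMP}.sql.bz2"

rm -f "${DUMP}.bz2"
```

Dropped into /etc/cron.daily/ (or a crontab entry), that would give us the nightly cadence.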

3 Likes

+1 to what @r0bby said. I can’t comment on the cost, but the slowness + outages have definitely limited my productivity.

Adding to the services Robby mentioned, our Maven repo is there as well (I assume so, since it was also down yesterday). This is actually one of the biggest blockers, because the SDK relies on it. (That may be an even bigger deal for those in limited-bandwidth settings, but that’s another issue, @raff.)

Mark

And, please, let’s prioritize the funds to get backups in place ASAP!

Mark

@mogoodrich, you can switch to offline mode with -o to continue using the SDK in limited-bandwidth settings, or with no connectivity at all.
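For example (the goal and serverId here are only illustrative):

```bash
# Maven's -o / --offline flag resolves everything from the local repository,
# so nothing is fetched over the network.
mvn -o openmrs-sdk:run -DserverId=myserver
```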

Personally, I do not see much point in maintaining the OpenMRS Maven repo when we can publish to bintray and/or oss.sonatype.org for free, as many OSS projects do. Actually, I’m planning a QA sprint which will add releasing to bintray from travis-ci for as many Reference Application modules as we can fit in our timeline. The migration will be transparent, since the OpenMRS Maven repo proxies Maven Central, so we can keep the OpenMRS repo running and continue to use both for as long as it takes.

Cool, great. I know little about bintray or oss.sonatype.org, but if there’s a free cloud alternative to maintaining our own repo, it’s by all means worth pursuing.

Mark

I think relying on free is a bad idea. We’ve been doing that and it’s not working: when there are no outages, we’re great, but as you saw on Thursday, the community came to a screeching halt because we host vital infrastructure all in one place. That’s bad. We need a reasonable SLA, and we need to ensure that if one data center has a network outage, we don’t lose everything.

I need to dedicate some time on Monday and just pull the trigger on Amazon Glacier, because we need it now and I can’t sit on my hands. Amazon Glacier is surprisingly cheap to put data into, but really expensive to get it out of. What I’m proposing is that we keep the latest DB backup on the servers in case we need to do a restore (to eliminate the need to pull it out of Glacier, which will not be cheap).

I am also considering S3; S3 has support for moving data into Glacier in some cases. Glacier retrieval is expensive, and some of our databases are HUGE (gigabytes!).
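The S3-to-Glacier handoff is just a lifecycle rule on the bucket. A sketch of what that could look like (the bucket name and timings are placeholders, not decisions):

```bash
# Sketch: transition backups to Glacier after 30 days, expire after a year.
aws s3api put-bucket-lifecycle-configuration \
  --bucket openmrs-backups \
  --lifecycle-configuration '{
    "Rules": [{
      "ID": "archive-old-backups",
      "Filter": {"Prefix": ""},
      "Status": "Enabled",
      "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
      "Expiration": {"Days": 365}
    }]
  }'
```

That way recent backups stay in plain S3 (easy to restore from), and only old ones end up behind Glacier’s slow, pricey retrieval.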

@r0bby, what are the rough sizes of our Confluence & JIRA backups (i.e., .tgz of SQL backups & .tgz of instance folder)? Are we talking 10 GB, 100 GB, or 1000 GB in Glacier?

I wouldn’t just back up Confluence/JIRA; I’d back up Bamboo, ID Dashboard, and Crowd, too. It would probably cost us less than $1.00 to put it all in there; getting it out will be another story. I’d actually prefer S3 over Glacier, since it lets us retrieve data more easily.

From peeking at Glacier pricing, it seems like it only costs 9 cents per gigabyte to get data out.

Regardless, Burke’s question is the important one: how much data are we talking about?

I will check in a few.

I’d still like to use Amazon S3: it’s cheaper to get data out of, and S3 can actually use Glacier behind the scenes. I will look into the size of the backups.

JIRA database dump, compressed as a bz2ball (tar+bzip2):

uncompressed: 470M jira-backup.sql
compressed: 98M jira-backup.sql.tbz2

Confluence database dump, compressed as a bz2ball (tar+bzip2):

uncompressed: 2.3G confluence-backup.sql
compressed: 262M confluence-backup.sql.tbz2

So this is actually doable in Amazon Glacier. Also bzip2 > gzip in every possible way! It’s slow but the compression is the best.
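For reference, roughly the kind of commands behind those numbers, plus a way to compare the two compressors yourself (paths and database names are illustrative):

```bash
# Dump and compress, matching the artifacts listed above.
mysqldump --single-transaction jira > jira-backup.sql
tar -cjf jira-backup.sql.tbz2 jira-backup.sql   # -j selects bzip2

# Compare compressors on the same input (-k keeps the original file):
time gzip  -k jira-backup.sql   # faster, larger output
time bzip2 -k jira-backup.sql   # slower, smaller output
```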

Thanks @r0bby. I’m not surprised that the SQL dumps are relatively small. How about the instance folders – i.e., not the location where the Confluence or JIRA runtimes are installed (e.g., /opt/confluence or /opt/jira), but the instance data (e.g., /var/confluence or /var/jira) where configuration settings and attachments are stored? The instance data should be considerably larger than the SQL, even bzipped. A backup would include both files: a bzipped mysql dump plus bzipped instance data.
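E.g., something along these lines (the paths are my examples above, not necessarily where the data lives on our boxes):

```bash
# A complete backup pairs the SQL dump with the instance data.
tar -cjf confluence-db-$(date +%F).tbz2   confluence-backup.sql
tar -cjf confluence-data-$(date +%F).tbz2 /var/confluence
```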

We have to move JIRA/Confluence/Bamboo off XSEDE; the XSEDE machines can be used for less critical applications.

As I understand it, XSEDE offers no real availability guarantees. That’s pretty bad, and we can’t handle it when we rely on them this much: the community came to a screeching halt.

The actual data we care about is in /opt/confluence-data/

I’m doing a backup to see how big it is NOW – it will change.

We need to aggregate logs. For that, we’ll use the ELK stack. There are a lot of things we need to do; I need to write this stuff down somewhere.

May I fire up a DigitalOcean droplet, @burke, to set up an ELK (Elasticsearch, Logstash, Kibana) machine? A $5 machine should do for now.

Okay – learned that the hard way: we ran out of disk space. We need to move the files off the server immediately. The uncompressed SQL dumps are too big, and we don’t have much free space on that box.
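One way to avoid the uncompressed intermediate entirely is to stream the dump straight through the compressor, or even straight offsite (bucket and database names are placeholders again):

```bash
# No uncompressed file ever touches disk: dump straight through bzip2.
mysqldump --single-transaction confluence | bzip2 > confluence-backup.sql.bz2

# Or skip local disk altogether; aws s3 cp accepts "-" for stdin.
mysqldump --single-transaction confluence | bzip2 \
  | aws s3 cp - "s3://openmrs-backups/confluence/$(date +%F).sql.bz2"
```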

I agree we should try to avoid single points of failure for high availability; however, moving services from one VM to another doesn’t overcome the single point of failure. The recent interruption of IU services was caused by road construction cutting through a fiber line (not something that should be happening frequently).

XSEDE is being sunset anyway, so we’ll need to move. Jetstream has some potential:

  • IU continues to graciously donate resources to us (i.e., these are not free resources, they’re donated).
  • With Jetstream, we are given an allocation of resources (much like AWS or DO) and have full control to create our own networks and subnets, assign IP addresses, and spin up VMs within that allocation.
  • Jetstream services are spread across campuses. For example, our current allocation for testing is hosted in Austin. With Jetstream, we will likely be able to spread our services across at least the Indiana University and Austin sites, if not additional ones.

That said, our fundraising efforts are aiming to support infrastructure as well as a sustainable future for OpenMRS development. So, there’s a good chance we’ll have more options going forward.

LOL. Been there. That’s why I always run sudo df -h before creating new backups. :wink:

How big is the disk and how much space is available? Are there old copies or backups taking up space? Have you run sudo du -sh * to see where the space is being used?

1 Like

Did you figure out how big the data directory is?

BTW, thanks for doing all this investigation! :slight_smile:

Confluence’s data directory is 5 GB; I suspect that the bz2ball will be around 3 GB. We do not have that much disk space on that machine right now.

1 Like

This is good. OpenLDAP is a pain in the butt with docker and our current configuration method.

We definitely could use more servers. First things first: let’s try to spin up a Docker ELK stack container. @darius’ PM tool will have to wait (sorry!)
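A minimal sketch of that, assuming we go with one of the popular all-in-one community images (sebp/elk here is an example, not a decision):

```bash
# Kibana on 5601, Elasticsearch on 9200, Beats input on 5044.
docker run -d --name elk \
  -p 5601:5601 -p 9200:9200 -p 5044:5044 \
  sebp/elk
```

On a $5 droplet we’d probably need to add swap before Elasticsearch will stay up, so treat this as a starting point.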

Is there any way to get some idea of how much money we have to play with right now?

We should do this right, ASAP. @r0bby, do you still want to do this, or do you want one of us to do it?

@burke, how do we get involved or help with getting the Jetstream resources?

1 Like