Move SSO to cloud

raff · December 8, 2023, 1:29pm

I’d like us to consider moving our SSO to AWS or some other cloud provider. Jetstream has had many hiccups in the past. We are slowly migrating much of our infrastructure to the cloud, with Atlassian Cloud being the most substantial move.

Still, we rely heavily on SSO, which is deployed on Jetstream and is used for both Atlassian Cloud and our self-hosted services. If SSO is not available, we pretty much lose access to all services, including those in Atlassian Cloud.

It’s about $100 a month to run an ECS cluster with 2 replicas for SSO and LDAP, but it gives us a very reliable service. It may also be possible to get a grant from AWS to cover our costs. The whole setup can be included in our Terraform. Thoughts?

ibacher · December 8, 2023, 3:19pm

So a lot of thoughts:

We have some AWS credits that are available, I think, though I understand from @jennifer that AWS wants us to create a new AWS account to be able to leverage those credits. (We already use AWS, but only for storing backups and the Terraform state).
I strongly dislike the idea of just migrating a single service to another cloud provider. We do not have enough ops support already and adding to the complexity by having things in multiple clouds seems like it’s actually asking for more headaches here.
Ultimately, I’d defer to @burke on this.

burke · December 9, 2023, 5:47am

I agree with @ibacher.

I would not be a fan of an option that requires having multiple AWS accounts, especially when the granted AWS credits that require it are not perpetual.

The benefit of moving SSO to AWS would be 100% uptime. This would be even more compelling if we didn’t manage the service (i.e., if we were using an SSO service and letting someone else manage its infrastructure, like we’re doing with Confluence & JIRA).

When reviewing our uptime of Jetstream over the past year, we’ve had over 99% uptime and over half the downtime was from an incident in the earlier days of the Jetstream2 transition. Most of our service downtime has been related to the services themselves (e.g., OpenMRS ID or Crowd failing and needing to be restarted) rather than the infrastructure itself.

At this point, I’m not sure it’s worth splitting our infrastructure and paying >$1000 USD each year to protect us from the chance that we might have a few hours (or less) of downtime each year. Maybe we’ll feel differently during a future downtime.
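For a rough sense of scale, the arithmetic behind these figures can be sketched as below. This is only a back-of-the-envelope calculation: the $100 a month is @raff's estimate from earlier in the thread, and the function names are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def downtime_hours_per_year(uptime_percent: float) -> float:
    """Hours of downtime per year implied by a given uptime percentage."""
    return HOURS_PER_YEAR * (1 - uptime_percent / 100)


def annual_cost(monthly_cost_usd: float) -> float:
    """Yearly cost of a service billed at a flat monthly rate."""
    return monthly_cost_usd * 12


if __name__ == "__main__":
    # "Over 99% uptime" still leaves room for several days of downtime.
    for uptime in (99.0, 99.9):
        hours = downtime_hours_per_year(uptime)
        print(f"{uptime}% uptime allows up to {hours:.1f} hours of downtime per year")

    # Assumed: the ~$100/month ECS estimate quoted earlier in the thread.
    print(f"Estimated AWS cost: ${annual_cost(100):.0f} per year")
```

So "over 99%" only bounds downtime at roughly 88 hours a year; the actual figure depends on how far above 99% we were.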

raff · December 11, 2023, 9:38am

Thanks! Fair enough. Let’s hope things go better with Jetstream next year.

The downtime felt like more than 80 hours this year, but that may just be my perspective from dealing with some of the issues. We also had one incident with data corruption, and it took at least 2 days to investigate and fix, so even though Jetstream was technically up… we were down. Maybe we need to better measure downtime and the time spent on infra issues. There’s the hidden cost of the time needed to investigate issues and how they hinder the productivity of our community members.

ibacher · December 11, 2023, 5:08pm

I absolutely agree that there are huge improvements we need to make to monitoring and, frankly, to automating our infrastructure maintenance, and if AWS can help with that, it might be worth a look.

The thing is that the number of Jetstream-specific issues I’m aware of is pretty small:

On October 8th there was an issue with data corruption for certain volumes that affected 3 volumes in our instance. I ran fsck on all three of them; only 1 showed any issues (the volume on Maji, which hosts Talk).
On August 13th there was an upstream network issue that caused Jetstream to be unavailable via the Internet, though the servers themselves remained up.
On August 8th there was a power management issue at the Bloomington data center.
On June 29th, heavy rains in Indiana brought down the data center.
On April 20th, Jetstream’s internal DNS server briefly went down, which broke our internal connections to LDAP and the DB.

It’s perfectly possible, though, that things happened I’m unaware of. I’m aware of many other disruptions to various services, but these are basically traceable to how things are configured on our end, I think.

burke · December 11, 2023, 7:41pm

Jetstream is a government-funded, university-led, large OpenStack service focused on providing a high-performance CPU infrastructure alternative for researchers. It will never compete with AWS on reliability. Likewise, AWS will never compete with Jetstream on price.

@raff makes a fair point that reliability may be worth a small price (including a bit more complexity in our infrastructure) for SSO. He also makes the very real observation that it feels like more than 80 hours of downtime over the past year and we had the data corruption.

At this point, I’m betting on the data corruption being a one-time event (given it’s the only time we’ve had such an issue with Jetstream) and considering the uptime data from our more stable services (e.g., Talk & Atlas) that show about 48 hours of downtime over the past year, 47 of which occurred during 4 of the incidents @ibacher outlined.

So, we’re weighing more than a thousand dollars annually, indefinitely, vs. a day or two each year where OpenMRS infrastructure is down. Sure, I’d prefer it be zero downtime, but let’s keep things simple, free, and give Jetstream a chance. If things get worse rather than better over time, then I’ll eat my words and we’ll justify the cost of AWS.
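To put the trade-off in numbers, here is a rough sketch that turns the downtime figures mentioned in this thread into uptime percentages and a cost per hour of downtime avoided. It assumes the ~$100 a month estimate (about $1200 a year) and, optimistically, that a move to AWS would eliminate all of that downtime; the function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours in a non-leap year


def uptime_percent(downtime_hours: float) -> float:
    """Uptime percentage over one year for a given number of down hours."""
    return 100 * (1 - downtime_hours / HOURS_PER_YEAR)


def cost_per_downtime_hour(annual_cost_usd: float, downtime_hours: float) -> float:
    """Dollars spent per hour of downtime avoided.

    Assumes (optimistically) that the spend removes all of the downtime.
    """
    return annual_cost_usd / downtime_hours


if __name__ == "__main__":
    annual_cost_usd = 100 * 12  # assumed from the ~$100/month estimate

    # 48 hours: measured for the more stable services (Talk, Atlas).
    # 80 hours: how the past year felt, per the earlier post.
    for hours in (48, 80):
        print(
            f"{hours} h down -> {uptime_percent(hours):.2f}% uptime, "
            f"${cost_per_downtime_hour(annual_cost_usd, hours):.2f} per hour avoided"
        )

    # Share of the measured downtime that came from the four incidents.
    print(f"Share from 4 incidents: {47 / 48:.1%}")
```

Either way, the comparison is quite sensitive to which downtime number we believe, which is another argument for measuring it better.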

burke · December 15, 2023, 9:33pm

6 posts were split to a new topic: Latency of our infrastructure