RFC: Production Tiers for our infrastructure

cintiadr · December 26, 2017, 5:02am

Hi everyone,

For those who aren’t aware, today we divide our infrastructure into testing, staging and production. The theory is that production outages have more impact than testing outages, and hence we deploy changes to the lower environments first. Outage response is affected by that too, as testing machines are perceived as less important.

The idea is that we want to group machines by the impact of a full or partial outage, allowing us to prioritise the work and calculate deployment risks better.

But IMO the current model is not really helping us, due to the nature of our machines and services. The infrastructure team is here to aid and support the community. All the services we provide exist to enable the community (developers AND non-developers) to create better software.

By that definition, even the so-called testing environments ARE production - they are a service provided to the community.

Also, I cannot believe that atlas outages have more impact than qa-refapp being down. Production machines are not all the same.

Then I thought we could borrow the concept of severity levels to differentiate between the levels of uptime that the wider community expects. Here’s my proposal:

Tier 1:

The most used and relevant services for the community. A large portion of the active community would consider a 1h outage during business hours a bad experience, and it would prevent them from doing their work. These are the most critical systems that affect most of our community, most of the time.

Outages are expected to last less than 1h.

JIRA
Wiki
Talk
OpenMRS ID (and subsystems)
Maven repository (redirect)

Tier 2:

These services can quite possibly be down for a couple of hours before that actually blocks someone. Outages are expected to last less than 4h.

CI (and agents)
qa-refapp environment *
demo environment *
modules-refapp *
yourls / link shortener

Tier 3:

Similar to Tier 2, but a smaller percentage of the community would be affected. Outages are expected to last less than a day.

Addons (prd) *
OpenHIM
OpenConcept Lab (prd) *

Tier 4:

Either affects a very small percentage of the community, or is accessed/used much less frequently. Outages are expected to last less than 2 days.

All other OpenMRS environments *
Sonar
Atlas
Chat bots
mdsbuilder
Addons (stg) *
OpenConcept Lab (stg and qa) *
Quizgrader

* While we don’t take care of the deployed software itself, the servers and OS are provided by us.
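As a rough illustration of how the tiers above could be written down for tooling, here is a minimal Python sketch. The tier numbers, service names and outage limits come from the lists above; the dictionary layout, the function names and the choice to treat any unlisted service as Tier 4 ("All other OpenMRS environments") are assumptions made for this example, not an existing OpenMRS tool.

```python
from datetime import timedelta

# Expected maximum outage per tier, as stated in the proposal.
MAX_OUTAGE = {
    1: timedelta(hours=1),
    2: timedelta(hours=4),
    3: timedelta(days=1),
    4: timedelta(days=2),
}

# Service-to-tier mapping copied from the lists above.
# The dictionary keys are illustrative labels, not real hostnames.
TIER_OF = {
    "JIRA": 1, "Wiki": 1, "Talk": 1, "OpenMRS ID": 1, "Maven repository": 1,
    "CI": 2, "qa-refapp": 2, "demo": 2, "modules-refapp": 2, "yourls": 2,
    "Addons (prd)": 3, "OpenHIM": 3, "OpenConcept Lab (prd)": 3,
    "Sonar": 4, "Atlas": 4, "Chat bots": 4, "mdsbuilder": 4,
    "Addons (stg)": 4, "OpenConcept Lab (stg and qa)": 4, "Quizgrader": 4,
}

# Assumption: anything not listed counts as "All other OpenMRS environments".
DEFAULT_TIER = 4


def tier_of(service: str) -> int:
    """Return the proposed tier for a service, defaulting to Tier 4."""
    return TIER_OF.get(service, DEFAULT_TIER)


def exceeds_expected_outage(service: str, downtime: timedelta) -> bool:
    """True if the downtime has reached or passed the tier's expected limit."""
    return downtime >= MAX_OUTAGE[tier_of(service)]
```

For example, `exceeds_expected_outage("Wiki", timedelta(hours=2))` is true under this sketch, while the same two hours of downtime for Atlas would still be within its Tier 4 expectation.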

Does this ordering make sense? Is there any machine or service whose placement you disagree with and think should be ranked higher or lower?

Keep in mind that even if there’s a machine that you (and only you) use frequently, that doesn’t make the machine highly valuable for most of the community.

Note that I didn’t really add cloud or PaaS services here, because we are not in control of the OS or application. There’s nothing to do in those cases other than raise a ticket or wait for the problem to be fixed.

r0bby · December 26, 2017, 5:26am

Sounds good to me!

dkayiwa · December 26, 2017, 8:14am

Thanks @cintiadr for this division!

Where does demo.openmrs.org belong?

I would shift Addons (which replaced modulus) up a bit

cintiadr · December 26, 2017, 10:28am

I initially put it in Tier 4. The rationale is that while qa-refapp and modules-refapp appear to be actively used for continuous integration and module tests, it appears to me that demo has a lot less impact. When it was down for weeks and weeks, we pretty much didn’t have complaints (and it doesn’t break most development flows). My thought was that if demo is down for 2 or 3 days, it would affect far fewer people than qa-refapp or modules-refapp.

Let me know if you disagree.

What would happen if addons is down for, let’s say, a whole day? Do you feel the same number of people would be affected as with a Tier 2 service?

dkayiwa · December 29, 2017, 8:27am

Could it be that most users of the demo server are newcomers who do not yet have an OpenMRS ID, or are not even aware of the help desk? And hence just go away without reporting?

Ummmm, I think there would be more complaints in Tier 2 than 3

raff · December 29, 2017, 1:25pm

It sounds reasonable to me. It would help to set an expected time for addressing issues for Tier 3 and Tier 4, similar to Tier 1 (within minutes) and Tier 2 (within hours).

I think it’s fine to have the demo server in Tier 4, with issues addressed within 2-3 days.

burke · January 1, 2018, 2:33pm

This is a great idea! Did you consult our Google Analytics data when assigning tiers? If not, I’m happy to help add this info. It might be worth looking at it to check our assumptions.

cintiadr · January 2, 2018, 6:10am

Nope, I don’t think I have access to our Google Analytics. That would be helpful.

mogoodrich · January 2, 2018, 1:21pm

Thanks @cintiadr, this is great.

Re: addons… presumably when it is down end users will be unable to search & download modules? As a developer, this isn’t much of an issue to me as I can build all the modules locally (and usually do), but for the average end user I think I would agree with Daniel and vote for moving this up to Tier 2. I also think that if OCL starts to be adopted by institutions to manage their concepts (as we are hoping it will be) it would need to be moved up to Tier 2. However, from the comments that @darius made in another thread, it sounds like we aren’t actually there yet.

Also, if we get to the point that there is more that should be in Tier 1 / Tier 2 than we think our currently all-volunteer team can support/guarantee, this could be an opportunity to show the gaps.

Take care, Mark

darius · January 2, 2018, 8:33pm

Happy New Year, all.

I would put demo as tier 1 or 2. (Where is OpenMRS.org? This should be high tier also.) Even though these aren’t used much by active community members, they are important outward facing servers that the rest of the ecosystem would judge us on.

Almost everything in addons can be downloaded directly from bintray. Bonus points if we could somehow have the “down” page for addons document this with a link. (I’m okay with addons as tier 3)

Aside about the OCL and OpenHIM servers… I’m open to us providing this service to other projects (especially OCL) but I would like to understand the rationale and process, as well as the impact on scarce resources like our infra team’s time. @burke, what is the history here?

cintiadr · January 3, 2018, 7:18am

Alright, I moved demo to Tier 2

I assume that this list will always be updated.

That instance is not really controlled by us, it’s the provider. I do have plans on moving it to Jetstream later this year, but if it goes down today it’s up to our provider to fix it.

burke · January 7, 2018, 2:28am

Sessions per month

Wiki: ~20,000
Talk: ~17,000
JIRA: ~4,000
Addons: ~2,000
Atlas: ~800

Activity drops off on weekends, but all of these (even Atlas) are getting daily usage.
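One way to check the tier assumptions against these numbers is to look for "inversions", i.e. a service with more sessions sitting in a lower-priority tier than a quieter one. The sketch below is only an illustration: the session counts are the approximate figures above (the Addons figure is for modules.openmrs.org), the tiers are the ones proposed earlier in this thread, and the function name is made up for this example. Traffic is of course only one input, so an inversion would be a prompt for discussion rather than an error.

```python
# Approximate sessions per month, from the figures above.
SESSIONS_PER_MONTH = {
    "Wiki": 20_000,
    "Talk": 17_000,
    "JIRA": 4_000,
    "Addons": 2_000,  # stats are for modules.openmrs.org
    "Atlas": 800,
}

# Tiers as proposed in the first post (Addons taken as the prd instance).
PROPOSED_TIER = {"Wiki": 1, "Talk": 1, "JIRA": 1, "Addons": 3, "Atlas": 4}


def tier_inversions(sessions, tiers):
    """Return (busier, quieter) pairs where the busier service has a
    numerically higher (lower-priority) tier than the quieter one."""
    ranked = sorted(sessions, key=sessions.get, reverse=True)
    inversions = []
    for i, busier in enumerate(ranked):
        for quieter in ranked[i + 1:]:
            if sessions[busier] > sessions[quieter] and tiers[busier] > tiers[quieter]:
                inversions.append((busier, quieter))
    return inversions


print(tier_inversions(SESSIONS_PER_MONTH, PROPOSED_TIER))  # prints []
```

With the five tracked sites the list is empty, so the proposed ordering is at least consistent with the traffic we can currently see. The untracked sites would need data before they can be checked the same way.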

Sites we should be tracking but aren’t (yet)

demo environment - awaiting ITSM-4096
qa-refapp environment - not tracked (could take same approach as ITSM-4096)
modules-refapp - not tracked (could take same approach as ITSM-4096)
yourls / link shortener - awaiting ITSM-4097
Addons - awaiting AO-15 (stats above are for modules.openmrs.org)

@paynejd, are you already using Google Analytics for OCL? If not, we should probably add this.

paynejd · January 7, 2018, 3:28am

Hi all, OCL does not yet use Google Analytics. I will create a ticket for that. Also agree that Tier 2 makes sense as adoption increases.

cintiadr · January 7, 2018, 10:19pm

Do those statistics exclude Pingdom?