For those who aren’t aware, today we divide our infrastructure into testing, staging and production. The theory is that production outages have more impact than testing outages, and hence we deploy changes to the lower environments first. Outage response is affected by that too, as testing machines are perceived as less important.
The idea is that we want to group machines by their impact if there’s a full or partial outage, allowing us to prioritise the work and calculate deployment risks better.
But IMO the current model is not helping us much, due to the nature of our machines and services. The infrastructure team is here to aid and support the community. All the services we provide are there to enable the community (developers AND non-developers) to create better software.
By that definition, even the said testing environments ARE production - they are a service provided to the community.
Also, I cannot believe that atlas outages have more impact than qa-refapp being down. Production machines are not all the same.
Then I thought we could borrow the severity levels concept to differentiate between the levels of uptime expected by the wider community. Here’s my proposal:
Tier 1: The most used and relevant services for the community. A large portion of the active community would consider a 1h-long outage during business hours a bad experience, as it would prevent them from doing their work. These are the most critical systems, affecting most of our community most of the time.
Outages are expected to last less than 1h.
- OpenMRS ID (and subsystems)
- Maven repository (redirect)
Tier 2: These services can quite possibly be down for a couple of hours before that actually blocks someone. Outages are expected to last less than 4h.
- CI (and agents)
- qa-refapp environment *
- demo environment *
- modules-refapp *
- yourls / link shortener
Tier 3: Similar to Tier 2, but a smaller percentage of the community would be affected. Outages are expected to last less than a day.
- Addons (prd) *
- OpenConcept Lab (prd) *
Tier 4: Either affects a very small percentage of the community, or is accessed/used much less frequently. Outages are expected to last less than 2 days.
- All other OpenMRS environments *
- Chat bots
- Addons (stg) *
- OpenConcept Lab (stg and qa) *
* While we don’t take care of the deployed software itself, the servers and OS are provided by us.
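To make the proposal a bit more concrete, here is a rough sketch of how the tiers could be written down as data and used to order outage work. Everything in it (the `TIERS` structure, the function names, the shortened service names) is made up for illustration and is not an existing tool of ours; the durations and groupings are just the ones from the list above.

```python
from datetime import timedelta

# Hypothetical encoding of the proposed tiers. Names and structure are
# illustrative assumptions, not an existing OpenMRS tool.
TIERS = {
    1: {"max_outage": timedelta(hours=1),
        "services": ["OpenMRS ID", "Maven repository"]},
    2: {"max_outage": timedelta(hours=4),
        "services": ["CI", "qa-refapp", "demo", "modules-refapp", "yourls"]},
    3: {"max_outage": timedelta(days=1),
        "services": ["Addons (prd)", "OpenConcept Lab (prd)"]},
    4: {"max_outage": timedelta(days=2),
        "services": ["Chat bots", "Addons (stg)",
                     "OpenConcept Lab (stg)", "OpenConcept Lab (qa)"]},
}

# "All other OpenMRS environments" fall into the lowest tier.
DEFAULT_TIER = 4


def tier_of(service):
    """Return the tier number of a service, or the default tier if unlisted."""
    for tier, info in TIERS.items():
        if service in info["services"]:
            return tier
    return DEFAULT_TIER


def is_overdue(service, outage_duration):
    """True once an outage is no longer shorter than the tier's expected limit."""
    return outage_duration >= TIERS[tier_of(service)]["max_outage"]


def prioritise(down_services):
    """Order services that are down so the most critical tier comes first."""
    return sorted(down_services, key=tier_of)
```

For example, `prioritise(["Chat bots", "CI", "OpenMRS ID"])` would put OpenMRS ID first and the chat bots last, and `is_overdue("CI", timedelta(hours=5))` would flag a CI outage that has gone past the 4h expectation.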
Does this ordering make sense? Is there any machine or service you think should be ranked higher or lower?
Keep in mind that even if there’s a machine that you (and only you) use frequently, that doesn’t make the machine highly valuable for most of the community.
Note that I didn’t really add cloud or PaaS services here, because we are not in control of the OS or application. There’s nothing to do in those cases other than raise a ticket or wait for the problem to be fixed.