Wiki Page Shows Error

tunmi · July 8, 2018, 7:31am

I tried visiting the wiki page today and this is what I got:

Please, how do we go about fixing this?
cc: @dkayiwa

suthagar23 · July 8, 2018, 7:36am

Yes, the OpenMRS Wiki is down. Could someone bring it back up?

CC : @cintiadr

dkayiwa · July 8, 2018, 7:38am

Here is where you can check the status: http://status.openmrs.org/

r0bby · July 9, 2018, 4:05am

They know – there’s just nobody to restart it.

cintiadr · July 9, 2018, 6:23am

@burke @pascal and @jeremy, can you please restart confluence?

As I mentioned before, I’m on holiday. No private keys, no laptop

dkayiwa · July 9, 2018, 6:34am

Oh, how I wish I knew how to restart it!

r0bby · July 9, 2018, 8:04am

The solution is actually a lesson I would have hoped would be learned by now: do not rely on one person to keep the lights on. There should be 24/7/365 coverage for outages like this. This happened when I was running things – outages lasted longer when I decided I needed to step back. OpenMRS SHOULD actually have the money to hire someone to do this.

This outage has now lasted over 24 hours and it’s not good. Do not rely on one person. I’m super glad that @cintiadr did not bring her laptop or her private keys – it’s forcing OpenMRS leadership to realize that they need to fix things.

jwnasambu · July 9, 2018, 8:39am

This is my thinking on this issue: would it be OK to train some people who wish to volunteer, to increase the manpower? I for one am interested, if it is OK with the management.

dkayiwa · July 9, 2018, 9:38am

Whoever did the restart, thank you!

janflowers · July 9, 2018, 4:30pm

Hi @jwnasambu - Great for you to join the infrastructure team. Just work with @cintiadr and @pascal and @burke to get involved. They have some documentation and a telegram channel for members to work with. I’d love to hear about your experience getting on board with the team and what could be improved to make it easier for folks to join.

irenyak1 · July 9, 2018, 6:03pm

Surely this must not happen again. Is there an immediate plan by OpenMRS to train some of us to handle this work, @janflowers? It would be a good idea to have us juniors trained to handle such areas, which I think are also important to all of us as the OpenMRS community. If OpenMRS intends to train people to handle this, I am willing to be trained and to work.

r0bby · July 9, 2018, 7:32pm

The infrastructure team has a Telegram chat (I’m not on the team-- I just kinda lurk) at https://om.rs/infra

cintiadr · July 15, 2018, 5:37am

Sometimes it seems like I can’t turn my back and everything is on fire

So let’s retrace all the steps, and make sure we don’t really fall into availability bias just because it’s a recent outage.

Fact 1: There are 4 people with full root access to all machines, and I’m one of them. Let’s call this group ‘Level 2 support’. We also have 2 people (@dkayiwa and @jwnasambu) in Level 1 support, and they are getting more and more access to answer all the support cases as they get used to all our services.

Fact 2: Our machines are restarted on Sundays fairly often due to security patches. Very rarely (imo, every 4 or 5 weeks) there’s some race condition that prevents one of the services from starting on the first attempt (e.g., the database wasn’t up yet). I never prioritised automating this because it’s not even in my top 10 most painful things to maintain. It’s a minor inconvenience.
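
To make that kind of race condition concrete, here is a minimal sketch (not what runs on our machines) of a start-up step that waits for a dependency to accept TCP connections before the dependent service is launched. The function name `wait_for_port` and its defaults are made up for illustration.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0, interval: float = 1.0) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper: a start-up script could call it before launching a
    service that depends on, for example, a database.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            pass  # not reachable yet (refused, timed out, ...)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)
```

A wrapper could call something like `wait_for_port("localhost", 3306)` (the port is only an example) and start the dependent service only when it returns `True`. An open port does not prove the database is fully ready, so this narrows the race rather than eliminating it.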

Let’s go with the timeline, our best friend in outage postmortems.

On the 1st of July, I wrote a message here in Talk that I’d be on holiday. I mentioned everyone in those two groups. I did not attempt to notify them in any other way.

On Sunday, after the security patch window, I saw the alarm for the wiki being down. I sent a message in Telegram to ask someone to restart it. The topic here was created an hour before I sent the message in Telegram, but I wasn’t looking at my emails/Talk anyway.

Around 24h later, @whiscard was available to check confluence, and restarted it.

So, everyone wants to help, but I think it’s important to put things in perspective.

@whiscard volunteered to automate restarts. As they say, instead of training humans to do repetitive computer work, just automate it. It’s a simple ‘restart if it didn’t work the first time’; of course, that only covers a small subset of the outages we’ll have!

Yeah, there’s a lot of misunderstandings here to unpack.

In my opinion, OPS has two very distinctive sides:

planned work: this is coding. We use ansible and terraform to create and modify infrastructure. This is a skill anyone can learn, at their own pace, with pull requests and formal reviews, and it’s possible to run it locally (albeit it takes a loooooot longer than java development). We do have heaps of work like this, and I’d be so happy to help you learn it. The only thing I need here is your patience and commitment. I promise it’s super cool and trendy stuff!
unplanned work: outages and similar. This is not something you learn at uni; it is more like investigative work. Why doesn’t this service start? Why did those two services stop communicating? Why can’t that person log in? Handling outages is somewhat like looking for clues and attempting different things live. You need and have a really powerful tool (root access) because you know how to not get hurt with it.

How do you learn how to handle unplanned work? Usually with knowledge of the linux OS, which comes after a lot of planned work + pairing with senior colleagues. I would not feel comfortable giving a chainsaw to a child, and for a similar reason I don’t feel comfortable giving root access to a person who is not already used to handling servers, as I cannot be by their side when they are using the server.

If all we wanted was to grant access so that someone could blindly restart the service and never run any other investigation, deploying monit is probably less work and will do it better.
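
For anyone wondering what ‘blindly restart the service’ boils down to, here is a rough sketch of the check-then-restart loop that a tool like monit automates. This is not monit configuration and not something we have deployed; `ensure_running`, its parameters and the two callables are all hypothetical names for illustration.

```python
import time
from typing import Callable


def ensure_running(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int = 3,
    delay: float = 30.0,
) -> bool:
    """Restart a service until its health check passes or attempts run out.

    is_healthy and restart are supplied by the caller (assumptions here),
    e.g. an HTTP probe and a wrapper around the init system.
    Returns True if the service ends up healthy, False otherwise.
    """
    if is_healthy():
        return True  # nothing to do
    for _ in range(max_restarts):
        restart()
        time.sleep(delay)  # give the service time to come up
        if is_healthy():
            return True
    return False  # still down: a human needs to investigate
```

In practice `restart` could wrap something like `subprocess.run(["systemctl", "restart", "confluence"])`, where the unit name is an assumption, and `is_healthy` could be an HTTP request to the wiki. The `False` branch is the important part: that is where the investigative work starts, and no script covers it.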

That said, please please help me with the planned work! Even if you just have basic knowledge in command line linux, if you want to learn infra-as-code, join me. I’m super fun!

If you don’t want to commit to planned work and you are already a sysadmin/ops, and you are happy to not break my infra-as-code, please let me know. If you are happy to check why the disk is full, why we have alarms, why the ssl decided to fail, and will never run an ‘apt-get install’ manually, I’m quite keen to have you help me.

I must admit that even if our infrastructure tech is reasonably cool, the entry bar for any OPS work tends to be higher than for, let’s say, java work. I’ve been trying different approaches, but people tend to give up very early. I haven’t yet found a tradeoff that works to get people on board without sabotaging ourselves.

I try to get people with the planned work, but even setting up the environment takes time (it’s devops after all).

Sometimes I think our own telegram channel acts against volunteers. I really want to get more people involved and retain them, but that’s not working well and I cannot tell exactly what to do about it.

My last question would be: @pascal, @whiscard and @burke, in those situations, what’s the best way to notify you?

dkayiwa · July 15, 2018, 8:12am

This is very insightful! Thank you so much @cintiadr

whiscard · July 17, 2018, 6:19am

@cintiadr, amen. To answer your last question, pingdom and telegram work well for me. I actually saw the alert, but I also didn’t have access to my laptop. One of those rare days. Oh and yeah, onboarding and setting up stuff takes time indeed… there’s a saying: if you want to cut down a tree, spend more time sharpening the axe.

r0bby · July 28, 2018, 3:00am

Well said @cintiadr!

I would just like to point out that this is a lesson that OpenMRS leadership (yes, I am blaming this specific group) hasn’t seemed to learn since I resigned. @cintiadr is going to burn out just like I did if you don’t learn from this…

@Leadership: you need to step up and start putting some of that funding you got to the infrastructure team. There needs to be someone on-call – if you don’t want to use pagerduty, use the pingdom app…but outages need to be handled in a timely manner and you SHOULD NOT rely on one person. Yes, I am overly critical and yes, I do point to the fires that happened as a result of poor management and leadership. I am definitely repeating myself – it’s worth it. The fact that Cintia had to say this – is baffling.

cintiadr · July 28, 2018, 9:27am

I’d be cautious about extrapolating from a single prolonged outage to dictate all priorities.

ssmusoke · July 28, 2018, 6:26pm

I am going to recommend adding an Infrastructure discussion on the weekly Project Management calls so that we can actively work on building out the support function.

@dkayiwa please take note just in case I can’t connect on Monday