How to join the infrastructure team?

raff · August 30, 2016, 11:05am

What does it take to join the infrastructure team?

I’d like to be able to help in situations like with Nexus or login issues in JIRA, yet I have my hands tied as I do not have access to most of the servers and I cannot even do research.

I’d like to apply to join the infrastructure team. I don’t think I have the bandwidth to answer helpdesk on a regular basis, though. I feel I can help with addressing some of the issues.

janflowers · August 30, 2016, 3:20pm

Thanks @raff for bringing this up! This has been a hot topic lately and I think there are several developers and community members who would also like to know this. I know @burke has been working on this with the current infrastructure team. But maybe we can just discuss here as a community how we think the infrastructure team should be structured, and what we would like the process to be for joining it?

It seems like the team should have a Lead person who isn’t necessarily responsible for all of the tasks, but they are responsible for identifying those tasks and coordinating people to do those tasks. I think this should include a rotating calendar for on-call assignments for emergencies that people could sign up for. The calendar should probably be set up for each quarter and people can sign up for the week that they would like to cover that quarter. This way it isn’t all on a single person and it doesn’t overwhelm someone and burn them out. Then it brings up the question - should on-call be the person who solves the problems, or is it good enough to have someone be on-call that can solve some problems and then engage the right people to solve the problems they don’t know how to solve?
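To make the quarterly sign-up idea concrete, here is a minimal sketch of how such a calendar could be generated. It is only an illustration: the function names (`quarter_weeks`, `build_rotation`) and the volunteer names are hypothetical, and it assumes weeks start on Monday and that any week nobody signs up for is filled round-robin from a pool of volunteers.

```python
from datetime import date, timedelta


def quarter_weeks(year, quarter):
    """Return the Monday of every week that starts inside the given quarter."""
    first_month = 3 * (quarter - 1) + 1
    start = date(year, first_month, 1)
    # First day of the following quarter (rolls over to next year after Q4).
    end = date(year + 1, 1, 1) if quarter == 4 else date(year, first_month + 3, 1)
    # Advance to the first Monday on or after the quarter start.
    monday = start + timedelta(days=(7 - start.weekday()) % 7)
    weeks = []
    while monday < end:
        weeks.append(monday)
        monday += timedelta(days=7)
    return weeks


def build_rotation(weeks, signups, volunteers):
    """Honor explicit sign-ups first, then fill the remaining weeks round-robin."""
    if not volunteers:
        raise ValueError("at least one volunteer is required")
    schedule = {}
    next_volunteer = 0
    for week in weeks:
        if week in signups:
            schedule[week] = signups[week]
        else:
            schedule[week] = volunteers[next_volunteer % len(volunteers)]
            next_volunteer += 1
    return schedule


if __name__ == "__main__":
    weeks = quarter_weeks(2016, 4)
    # Hypothetical people: "carol" signed up for one specific week.
    signups = {date(2016, 10, 10): "carol"}
    for week, person in build_rotation(weeks, signups, ["alice", "bob"]).items():
        print(week.isoformat(), person)
```

Whether the person on the schedule fixes the problem or only triages and escalates is a separate policy question; the sketch only decides who gets notified in a given week.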

For infrastructure team assignments - seems like we could set everything up as a list of tasks that need to happen that could then be divided up among the team members. I’m not sure what that set of assignments is, but I’m guessing things like monitoring Nexus storage could be one of them… most people would have multiple assignments, but maybe someone only has time for one. Breaking it down allows people to participate at the level they feel comfortable with in both skillset and time.

Lastly, onboarding into the infrastructure team - I think @burke is helping to organize the documentation and identifying gaps where documentation is needed. This will help a lot! But based on those tasks broken down as assignments, we may be able to let newer folks do some level of support while we aren’t sure of their skill set, and those more seasoned members such as yourself could take on more trusted levels of support.

Thoughts?

r0bby · August 30, 2016, 9:42pm

First, you will need to join Telegram, that’s where we do most of our coordination.

We kinda know what the issues with JIRA are and there’s not much we can do at this moment. It’s running out of memory. As far as Nexus goes, that just required somebody to drill into it… PS: ncdu is awesome.

This is what we actually need help with. On-call rotations and helpdesk handling.

darius · August 30, 2016, 10:53pm

I like where you’re going with this Jan. +1 to all of:

making it possible for people to help out with our infrastructure ops in “different-shaped” ways
making the calendar more visible and better defined, so we can get more people helping out
potentially splitting up receiving the on-call notification from necessarily being the one to actually fix it

Regarding Rafal’s request (and my own request yesterday), I think we need to move towards a higher-trust approach where proven long-term contributors can have access to systems, and help troubleshoot problems. (And, logging into a machine to troubleshoot a full disk should not have to coincide with being part of the on-call rotation.)

(Stepping back, I’d also like to see us adopt more of a devops philosophy, where delivering high-quality systems becomes a shared responsibility, rather than having engineering and infrastructure teams throwing things across the wall to each other. But that’s a bit of a tangent to this specific conversation.)

r0bby · August 30, 2016, 11:05pm

janflowers:

It seems like the team should have a Lead person who isn’t necessarily responsible for all of the tasks, but they are responsible for identifying those tasks and coordinating people to do those tasks. I think this should include a rotating calendar for on-call assignments for emergencies that people could sign up for. The calendar should probably be set up for each quarter and people can sign up for the week that they would like to cover that quarter. This way it isn’t all on a single person and it doesn’t overwhelm someone and burn them out. Then it brings up the question - should on-call be the person who solves the problems, or is it good enough to have someone be on-call that can solve some problems and then engage the right people to solve the problems they don’t know how to solve?

There’s a volunteer role for handling on-call rotations. I’ve been acting as Lead for a while now.

We have monitoring tools in place for this.

janflowers:

Lastly, onboarding into the infrastructure team - I think @burke is helping to organize the documentation and identifying gaps where documentation is needed. This will help a lot! But based on those tasks broken down as assignments, we may be able to let newer folks do some level of support while we aren’t sure of their skill set, and those more seasoned members such as yourself could take on more trusted levels of support.

What I do not like about this is that @burke hasn’t involved the infra team in this AT ALL – I wasn’t even aware of it until I read the Leadership call minutes. There should be a thread on Talk about this, yet this is the only one I see.

Onboarding onto the infra team isn’t hard – we have documents which say where each service is located. I can even share my ssh_config file, which makes it so that I don’t even have to think when I need to ssh into a machine. We have a shared LastPass folder with important passwords shared amongst the infra team.

That’s not what devops is @darius. Devops is what the infra team does. None of the devs need access to the production/staging systems outside of Bamboo. This won’t be happening. I’m in agreement with @michael’s approach. In addition, we had someone from the Bahmni team do some really weird things to the Atlas staging server, which only added to my hesitance to give access. If you’d like to know what happened, I’d be glad to tell you. I reverted whatever it is they did – I’m not even sure what else was done.

I have to know you REALLY know what you’re doing before I give you access to a production system. I consider staging servers similarly. I need to know that you won’t mess up the server, make a configuration change, or do something which will cause a server to become compromised or go down. Trusted developer != Trusted Sysadmin. That level of trust must be built up.

I am VERY uneasy giving people access to production systems, especially if they are not available on a full-time basis. If they mess something up and I have to deal with it at 3am, I’m not gonna be happy. So no, I’m not going to give people access to production/staging servers.

I have to trust people. I trust @raff and @darius are great developers, but I don’t know anything about their sysadmin skills. This isn’t me being a dictator, this is me making sure our systems remain secure.

r0bby · August 31, 2016, 1:50am

The OpenMRS Infrastructure Team operates under the Community Management team. Wanna join? Instructions can be found on the wiki.

r0bby · August 31, 2016, 4:45am

Just adding one more post: if my language and reasoning for why I am against giving developers access to production systems, or why I’m strict with control, came off as personal and dictatorial, it’s not supposed to come off that way at all. It’s just good practice not to give root access to production systems without ensuring people know what they’re doing.

We ARE open to adding people and building up that trust. If you’d like to help out, message me and get on Telegram. I’ll need your phone number, as I need to add you as a contact to add you to our infra chat. I’m willing to work slowly to build up trust =).

We definitely need to develop guidelines for adding new people and for what the infra team expects you to know (or be willing to learn).

@darius, @raff: ^^^

paul · August 31, 2016, 5:25am

Thanks, @r0bby.

It’s probably important to relate some facts here on top of what @r0bby has said.

As of now, four community members have full access to all of our infrastructure machines: @r0bby, @pascal, @mayank, and @ryan.

However, as I’ve discussed with @r0bby, I think that’s insufficient coverage given our community size. We’re a few cycles down over the past few months, and need to restock/reload.

Therefore, I’d like to nominate @burke, @darius, @raff, and @wyclif as potential members to the infra team as well, and I know 3 of the 4 have already expressed interest in this to me, and all four are long term fundamental members of the community.

One of the challenges of group management is coordination. I think it’s super important to move quickly towards a model that ensures that members of this team are acting in coordinated ways, asking for peer support, and avoiding making unilateral changes to infrastructure without group assent.

I trust that @r0bby will move on this quickly.

r0bby · August 31, 2016, 5:40am

Thanks @paul.

You need to join Telegram and monitor it daily; this is where we coordinate our activities. Contact me privately if you are interested.

Do any of you have devops/system administration experience?

raff · August 31, 2016, 1:24pm

@r0bby, I do have experience in addressing software-related issues, especially with software running on the JVM. I do know enough not to mess around outside my areas of expertise.

Telegram works for me. See you there.

maurya · August 31, 2016, 3:58pm

If someone wants to volunteer/get on-boarded, could the Infrastructure team define tasks for a particular skill level, so that the volunteer can demonstrate them to you guys, maybe on a separate temporary server? Will that help?

r0bby · August 31, 2016, 7:52pm

Honestly I’m willing to start out slow and build up trust. What we really need is more people willing to be on-call.

We really need to move away from direct ssh access and move to puppet or ansible…we use ansible to do initial setup on all servers, as well as keep them up to date. We have root access but should rarely be using it.

All deploys should happen in configuration management and never on the server itself manually.

chagara · August 31, 2016, 9:24pm

If you want, I can help with configuring ansible or puppet. I’m currently doing it at work. I do prefer ansible because it uses ssh.

r0bby · August 31, 2016, 9:37pm

Both can work just fine actually – but I do like ansible a lot.

mogoodrich · September 2, 2016, 1:28pm

Thanks @r0bby!

r0bby · September 2, 2016, 11:33pm

The infra chat I created disappeared when I left it. People can feel free to do whatever they choose to do – whoever decides to replace me.