PagerDuty quota notice and thoughts on how we could improve on-call coverage

@pascal & @maany,

I know the recent outage caused a lot of PagerDuty alerts, but scanning the logs, it looks like we may not be using PagerDuty effectively. I know the two of you are currently bearing the full burden of being on-call for the community. What do you guys think?

Some thoughts:

  • On-call should only be notified of things requiring immediate action (e.g., follow-ups on non-emergency help desk tickets and “X is back up” messages don’t need to fire PagerDuty notifications).
  • The logs suggest notifications are sent by email, push (to phone app), phone call, and SMS for each incident. Is this really necessary?
  • I see a lot of OpenMRS ID alerts. Maybe we should focus community efforts on improving/stabilizing OpenMRS ID.
  • We could add a “Being On-Call” section to the ITSM wiki to document on-boarding, requirements (e.g., skills, accounts), and the common tasks required when on call.
  • Improving PagerDuty settings and targeting weak links in our infrastructure could simultaneously make your on-call experience easier and reduce the likelihood of exceeding PagerDuty quotas. Combined with some basic documentation of how to be on call & handle issues, we may have a better chance of recruiting a few more people to take call.
  • Ideally, we’d have at least 8 people capable of taking call, with four in GMT -4 to -8 timezones and four in GMT +4 to +8 timezones, allowing a week of call every 4-6 weeks and/or the ability to reduce/eliminate the need for overnight calls. But we should set a short-term goal of getting at least 1-2 more people participating in the call schedule.

-Burke :burke:

I agree with all of the above. I only look at the phone app notifications. Also, we don’t need notifications when systems come back up IMO.

@mayank, would phone app notifications alone suffice for you as well? Removing the email/phone/SMS notifications would go a long way toward avoiding our monthly notification quota at PagerDuty.

When I get a chance, I’ll try looking into how we can cut down on some of the noise.

Sounds good, @burke! I’ll set up the app on my phone (currently using the tablet for app notifications, which is less accessible).

I totally agree with the other points regarding getting more hands on deck. I can draft a getting-started wiki page for a new recruit. A lot of the information on the tasks that need to be done (restarting services, granting JIRA/Wiki access to folks, whitelisting IPs, etc.) is already in the “How To’s” thanks to @cintiadr’s amazing contributions.

Do we need PagerDuty at all?

What is PagerDuty offering us beyond Pingdom by itself?

My understanding is that PagerDuty adds the ability to consolidate incidents (it integrates with Pingdom, JIRA, etc.), automatic call schedule management with escalation policies, and flexibility in notifications (it can notify via email, phone, SMS, or push). At the moment, I think it’s the consolidation of incidents via Pingdom and the call schedule with escalation policies that are most useful to us.
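For concreteness, the incident consolidation works via deduplication keys in PagerDuty’s Events API (v2): repeated triggers with the same `dedup_key` update a single incident instead of opening new ones, and a resolve event with that key closes it without paging anyone (which is also how “X is back up” can avoid firing a notification). A minimal sketch — the routing key is a placeholder, and in practice the Pingdom integration sends these events for us:

```python
import json

# Events API v2 endpoint (for reference; this sketch only builds payloads)
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"

def build_event(routing_key, action, summary, source, dedup_key):
    """Build a PagerDuty Events API v2 payload.

    Repeated 'trigger' events with the same dedup_key are consolidated
    into one incident; a 'resolve' with that key closes it quietly.
    """
    event = {
        "routing_key": routing_key,   # per-service integration key (placeholder)
        "event_action": action,       # "trigger", "acknowledge", or "resolve"
        "dedup_key": dedup_key,       # the consolidation handle
    }
    if action == "trigger":
        # Only trigger events carry an alert payload
        event["payload"] = {
            "summary": summary,
            "source": source,
            "severity": "critical",
        }
    return event

# An outage opens (or updates) a single incident...
down = build_event("EXAMPLE_ROUTING_KEY", "trigger",
                   "id.openmrs.org is DOWN", "pingdom", "pingdom-openmrs-id")

# ...and recovery resolves it instead of firing an "X is back up" page.
up = build_event("EXAMPLE_ROUTING_KEY", "resolve",
                 "id.openmrs.org is UP", "pingdom", "pingdom-openmrs-id")

print(json.dumps(down, indent=2))
# To actually send an event, POST the JSON body to PAGERDUTY_EVENTS_URL.
```

(The `build_event` helper and the `pingdom-openmrs-id` key are hypothetical names for illustration; the field names match the Events API v2 schema.)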

Yes, PagerDuty is necessary. There’s a lot of moving parts within the Community and it’s pretty much the gold standard for infrastructure monitoring.

@burke, Deploying 2.1 will fix that. ID Dashboard leaks like crazy right now. That’s going to happen tonight or tomorrow.

Also @burke, I have been telling you this for MONTHS before I ultimately just got fed up. We should NOT have been reliant on 3 people shouldering the burden…at one point it was literally @maany and myself. It’s not the work that was the problem…that was automated…it was the responsibility.

Try getting some money to hire infrastructure people.

So let me explain my point on PagerDuty. PagerDuty is an amazing product for handling on-call requests and escalation policies.

The idea behind it is that if everyone is responsible at the same time, no one will actually feel personally responsible for fixing a certain outage at inconvenient hours. That works extremely well for small teams that need to provide 24/7 service while based in roughly the same timezone.

Once you have a distributed team (timezones helping), you can happily give up on-call (at least first-level on-call). You just pass the baton to the next timezone.

There are things alerting on-call that shouldn’t, IMO. Helpdesk tickets should never alert; there’s never anything in there so urgent that it needs to wake someone up. 90% of all ‘critical’ requests wait 10-15 days before the user actually comes back with the answer of what they need exactly. Telegram is by far the fastest way to let us know of anything that needs to be done immediately, by whoever is awake.

I’d advocate we don’t have on-call anymore. Instead, we are all on Telegram and have Pingdom on our phones. As we are all volunteers spread around the world, I think that’s better and saner.


I agree with the general principle that it’s not appropriate to expect a volunteer team to be woken up by a helpdesk request. But I’m not one of the people volunteering, so it’s really up to them…

@darius – sometimes it IS important. I’ve been woken up by help desk cases before…and I chose to ignore them and go back to sleep…but if a system went down…I woke up and/or restarted it from my phone.

HOWEVER, I’m not really in a position where I’m on call. PagerDuty can notify us with alerts from New Relic (system load going through the roof, high disk I/O usage, etc.)…Pingdom only does cursory “is this thing working?” checks…it doesn’t give insight into WHY.

Unless we actually stop and spend the time (and money) to get New Relic working on all/most of our machines, I can’t consider it part of our monitoring/alerting system.

(Though I do prefer Datadog for that.)

Set up SOMETHING – PS: Datadog can integrate with New Relic. There needs to be alerting for these things.