We now have Datadog doing all the fine-grained monitoring on each machine (disk, memory, certificates, Docker, DNS, you name it). The few machines with New Relic are being replaced.
Datadog sends emails to the infra group, so I can take a look within the next day or so.
For urgent stuff (as in, a relevant system is down) we have Pingdom. Pingdom notifies status.io and the configured users (in my case, I receive push notifications on my phone). That's great, because if I’m awake and available, I will see the notification pretty soon.
@pascal / @maany / @burke, in the last year, did you receive any alert from PagerDuty that you actually took action on?
I’d recommend you set up a Telegram bot and have all the outage alerts go to the infra chat. PagerDuty was the bane of my existence when I was on the infra team. I’ve yet to actually implement that kind of alerting again.
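For anyone who wants to try the Telegram route, here is a minimal sketch of posting an outage message to a chat through the Telegram Bot API `sendMessage` method. The bot token, the chat ID, and the helper names (`format_outage`, `build_alert_request`, `send_alert`) are all placeholders made up for illustration; how Pingdom or Datadog would trigger it (a webhook, a cron job, something else) is left open.

```python
import json
import urllib.request

TELEGRAM_API = "https://api.telegram.org"


def format_outage(check_name: str, status: str) -> str:
    """Build a short, scannable alert line (hypothetical format)."""
    return f"[{status.upper()}] {check_name}"


def build_alert_request(bot_token: str, chat_id: str, text: str) -> urllib.request.Request:
    """Prepare the sendMessage call without sending it."""
    url = f"{TELEGRAM_API}/bot{bot_token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(bot_token: str, chat_id: str, text: str, timeout: float = 10.0) -> bool:
    """Send the alert; returns the "ok" flag from Telegram's JSON reply."""
    request = build_alert_request(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return bool(json.load(response).get("ok", False))
```

Calling `send_alert(token, infra_chat_id, format_outage("wiki", "down"))` would post `[DOWN] wiki` to the chat, assuming the bot has been added to the infra group and the token is kept out of version control.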