Metrics and proper alerting for our infrastructure :D :D :D

I’m pleased to announce that we now have Datadog for our infrastructure! We applied to their open source program, and it was recently approved! I’ve been using Datadog for a few months at work, and I’m very happy to have a cloud service for our monitoring and alerting.

The credentials are shared as usual; if you are not part of the infra team and want a user, let me know and I can create one for you.

I configured (almost) all our machines to have the Datadog agent, and I also tagged them based on environment and provider:

I also created some basic alerts (CPU, disk, memory, swap…)

Because of those small things, I discovered that the wiki service was eating all of the CPU. I increased the memory of the JVM, and… magic happened!!

Here’s where I ask for help. We are using the official Ansible role. I’d love some help to set up the following integrations on all our machines:

And even to create awesome dashboards!

The checks are easy – see here – define them in the host_vars files. You’ll have to read the docs for each check to see what needs to be passed.
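For anyone who wants a starting point, here is a rough sketch of what a check in a host_vars file might look like. It assumes the official Datadog Ansible role and its `datadog_checks` variable; the file name and the status URL are made up for illustration, so check the nginx integration docs and our actual nginx config before copying anything.

```yaml
# host_vars/example-host.yml  (hypothetical file name)
# Assumes the official Datadog Ansible role, which reads
# per-host agent checks from the `datadog_checks` variable.
datadog_checks:
  nginx:
    init_config:
    instances:
      # Assumption: nginx serves its stub_status page at this URL on the host.
      - nginx_status_url: http://localhost/nginx_status/
```

Each key under `datadog_checks` should end up as one check config file for the agent, so other integrations would follow the same shape with their own `instances` settings.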

I never bothered to set up MySQL – I THINK I set up nginx… I can’t remember.

These are easy enough to get someone started with our Ansible code (maybe @swathivarkala ? :smiley: )

I just don’t have time to do everything by myself, and I don’t even want to be the only one using datadog.


I published


Great writeup @cintiadr!