Exposing system metrics for monitoring OpenMRS servers


Hi All,

I went through the following GSoC project wiki page and have some doubts about a few of the points mentioned there.

https://wiki.openmrs.org/display/projects/Expose+System+Metrics+For+Monitoring

Abstract

The goal of this project is to create a module that facilitates server monitoring by exposing many of the key server metrics through a monitor-friendly web page or pages.

Description

It is beneficial to improve any aspect of infrastructure monitoring for OpenMRS. Currently, the OpenMRS standard distribution has no native support for this, though PIH has developed a customized solution for the purpose. Therefore, a new module that exposes key system-level metrics would be a welcome addition to our standard distribution.

It was decided that the way to go is to create a new module in the OpenMRS server that exposes metrics to third-party monitoring solutions (e.g. Nagios, OpenNMS, Munin, Cacti).

Cross-Checks

  1. Although the Usage Statistics Module was mentioned and referenced in the project wiki, I believe that module has nothing to do with infrastructure monitoring.

  2. Most of these monitoring tools usually run on a standalone instance and point at the targeted servers. Monitoring is therefore not part of the application; it is an isolated, dedicated server instance whose sole purpose is monitoring the application servers. How can we collect the data that is transferred between the target servers and the Nagios (or other monitoring) instance?

  3. The objectives mention that it is necessary to white-list the users who can access this information, possibly through a dashboard. All of those monitoring solutions come with elegant dashboards that show many metrics in detail, and in those dashboards an administrator can block/allow users as they wish. So my question is: why do we want to implement our own module that does more or less the same as (or less than) what those monitoring solutions already do?

@brucemakallan @ibacher @hamish
Given that, we could look for alternative approaches to enhancing the monitoring in our infrastructure. Feel free to share your thoughts/answers. Your ideas matter. :grinning: :openmrs:

cc: @mozzy @suthagar23

2 Likes

@judeniroshan I agree with you: the Usage Statistics Module is more about data analytics at the platform level, not the server level.

As I understand it, we need to create a new module that exposes the server metrics as a REST endpoint. Nagios and OpenNMS are out-of-the-box solutions that bundle dashboards and everything else in one package (although I am not quite sure whether they provide a REST API that could in turn be exposed; even if they do, building on it feels like rework, which is not good).

But as @judeniroshan mentioned, it's worth thinking about alternative solutions like Prometheus, which can be easily installed. Prometheus also provides a flexible query language that would help us create a module exposing system metrics to the outside world, and it can be used as a data source for modern real-time monitoring dashboards such as Grafana.
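To give a feel for what "exposing system metrics to Prometheus" could look like, here is a minimal sketch that renders one gauge in the Prometheus text exposition format using only the JDK. The class and metric names (`PrometheusFormat`, `jvm_heap_used_bytes`) are made up for illustration; a real module would likely use the official Prometheus Java client instead of hand-rolling the format.

```java
import java.lang.management.ManagementFactory;

public class PrometheusFormat {

    // Renders one gauge metric in the Prometheus text exposition format:
    // a HELP line, a TYPE line, then "name value".
    public static String render(String name, String help, double value) {
        return "# HELP " + name + " " + help + "\n"
             + "# TYPE " + name + " gauge\n"
             + name + " " + value + "\n";
    }

    public static void main(String[] args) {
        long heapUsed = ManagementFactory.getMemoryMXBean()
                .getHeapMemoryUsage().getUsed();
        System.out.print(render("jvm_heap_used_bytes",
                "Used heap in bytes", heapUsed));
    }
}
```

A module would serve this text from an HTTP endpoint and let a Prometheus server scrape it on a schedule.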

cc : @burke @dkayiwa @hamish @ibacher @mozzy @suthagar23

1 Like

So before I start, let me just say that I'm of the opinion that if you're willing to mentor a GSoC project, you should be able to shape it into whatever you want it to be. However, it's probably good to understand the context in which this project arose to understand what its aims are.

First, however, let me just point out that in terms of system level monitoring, we can probably leverage JMX with some simple configuration to be able to monitor some baseline system aspects. Almost any system monitoring tool will be able to extract relevant information from JMX out of the box (system uptime, session count, resource usage, etc.).
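To illustrate the point about JMX: the same baseline figures a JMX-aware monitoring tool pulls remotely (uptime, memory usage, thread counts) are available in-process through the standard platform MXBeans, so a module can read them with zero extra dependencies. A minimal sketch (the class name is made up):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.RuntimeMXBean;
import java.lang.management.ThreadMXBean;

public class JmxMetricsDemo {
    public static void main(String[] args) {
        // Platform MXBeans: the same beans JMX exposes over RMI when
        // remote JMX is enabled via JVM flags.
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();

        System.out.println("uptimeMs=" + runtime.getUptime());
        System.out.println("heapUsedBytes=" + memory.getHeapMemoryUsage().getUsed());
        System.out.println("liveThreads=" + threads.getThreadCount());
    }
}
```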

The original purpose of this project was to collect not just system-level data but also to have a means of exposing things like application usage, to answer not only questions like "is the system up?" but also questions like "do people use these forms we developed?" and "how long does it take to fill out this form on average?". Hence, the original target was not just infrastructure-level data, but also data that lets us gauge the usefulness of the system.

And, yes, there’s no need to develop dashboards as part of this, since any decent system monitoring tool will be able to generate dashboards given the appropriate data.

2 Likes

Hi @ibacher, I'm interested in applying as the student for this project. In that case, I think the project wiki page needs some refinement; currently it only lists the objectives below.

  • Create a new module which can collect the necessary real-time data for the metrics in OpenMRS deployed servers.
  • Present a monitor-friendly page of server metrics, including server uptime, number of active sessions, database connections, and system usage counts.
  • Constrain access to the page in a manner that doesn’t rely on authentication (e.g., by limiting access to a set and/or range of IP addresses).
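For the third objective, a minimal sketch of an IPv4 allow-list check (purely illustrative: the class and method names are made up, and as discussed later in this thread, a real deployment would more likely enforce this in the reverse proxy than in application code):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class IpAllowList {

    // True if the address falls inside the given CIDR block (IPv4 only, for brevity).
    public static boolean inCidr(String address, String cidr) throws UnknownHostException {
        String[] parts = cidr.split("/");
        int prefix = Integer.parseInt(parts[1]);
        int addr = toInt(InetAddress.getByName(address).getAddress());
        int net = toInt(InetAddress.getByName(parts[0]).getAddress());
        int mask = prefix == 0 ? 0 : -1 << (32 - prefix);
        return (addr & mask) == (net & mask);
    }

    // Packs 4 address bytes into a single int for masking.
    private static int toInt(byte[] b) {
        return ((b[0] & 0xff) << 24) | ((b[1] & 0xff) << 16)
             | ((b[2] & 0xff) << 8) | (b[3] & 0xff);
    }

    public static void main(String[] args) throws UnknownHostException {
        System.out.println(inCidr("10.0.0.42", "10.0.0.0/24"));   // true
        System.out.println(inCidr("192.168.1.5", "10.0.0.0/24")); // false
    }
}
```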

Yes, I think building on JMX is a good way forward.

1 Like

Oh, I agree that the wiki page should be updated.

1 Like

Oh, I can give quite some ideas here!

First, for Java monoliths, it's pretty common to use Dropwizard. I know Spring Boot ships with it by default, and developers seem to be able to configure anything I've wanted. Example

On any application, I expect 3 distinct health endpoints:

  • /ping or similar: returns 200 OK, sometimes with the deployed version as well. It does no checks and touches no downstream dependency. I use this check to configure auto-healing (restarting the container/service/machine), and it is expected to be triggered every couple of seconds.
  • /health or similar: checks the connection to the database and any downstream dependency. It returns a JSON overview of the system covering all dependencies, and should return a 50x if any dependency has errors. I use this for alerting and monitoring (not for auto-healing), and it's probably pulled every minute or so. If there's an obvious problem/exception, it should show up here.
  • /metrics or similar: a JSON with a bunch of metrics to be pulled by metrics systems, which can feed alarms as well. Pulled every 30 seconds or a minute.
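To make the three endpoints concrete, here is a minimal sketch using only the JDK's built-in `com.sun.net.httpserver`. Everything here is illustrative rather than part of any OpenMRS module: the class name, the port, and the stubbed `checkDatabase()` dependency check are all assumptions.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class MonitoringEndpoints {

    // Builds a server with the three endpoints; port 0 picks a free port.
    public static HttpServer createServer(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);

        // /ping: liveness only; no downstream checks, cheap to poll every few seconds.
        server.createContext("/ping", ex -> respond(ex, 200, "OK"));

        // /health: would check the database and other downstream dependencies;
        // stubbed here to always report healthy.
        server.createContext("/health", ex -> {
            boolean databaseUp = checkDatabase();
            respond(ex, databaseUp ? 200 : 503,
                    "{\"database\":\"" + (databaseUp ? "up" : "down") + "\"}");
        });

        // /metrics: a JSON snapshot for a metrics system to pull every 30-60 seconds.
        server.createContext("/metrics", ex -> {
            long heapUsed = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
            long uptimeMs = ManagementFactory.getRuntimeMXBean().getUptime();
            respond(ex, 200,
                    "{\"heapUsedBytes\":" + heapUsed + ",\"uptimeMs\":" + uptimeMs + "}");
        });
        return server;
    }

    private static boolean checkDatabase() {
        return true; // stand-in for a real connectivity check
    }

    private static void respond(HttpExchange ex, int status, String body) throws IOException {
        byte[] bytes = body.getBytes(StandardCharsets.UTF_8);
        ex.sendResponseHeaders(status, bytes.length);
        try (OutputStream out = ex.getResponseBody()) {
            out.write(bytes);
        }
    }

    public static void main(String[] args) throws IOException {
        createServer(8081).start();
        System.out.println("monitoring endpoints on :8081");
    }
}
```

In a real module these handlers would live inside the webapp rather than a separate server, but the split between "no checks", "dependency checks", and "metrics snapshot" is the part that matters.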

Now, the metrics I expect are specifically JVM metrics: how much memory each area of the heap is using, and how much time and CPU the garbage collector consumes. The number of database connections would be nice, and you could also keep a counter of errors in the last minute. But overall, my problem with Java is always GC, so that's what I'd ask for.
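Reading those JVM numbers requires no instrumentation at all: the platform MXBeans already track per-collector counts and accumulated time, plus usage for each memory pool. A minimal sketch (the class name is made up):

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class GcMetrics {
    public static void main(String[] args) {
        // Per-collector invocation counts and accumulated collection time:
        // the first numbers to watch when GC is the suspected problem.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.println(gc.getName()
                    + " collections=" + gc.getCollectionCount()
                    + " timeMs=" + gc.getCollectionTime());
        }
        // Usage per memory area (eden, survivor, old gen, metaspace, ...).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getUsage() != null) {
                System.out.println(pool.getName()
                        + " usedBytes=" + pool.getUsage().getUsed());
            }
        }
    }
}
```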

Overall, I'm happy for these endpoints to be 'public' and limited in the reverse proxy (for example, putting all of them under /monitoring/*). If you want to go with authentication, make sure to use a user/password that doesn't have access to anything else in the system, and that basic auth is enough to see the endpoints.

I wouldn't block by IP at the Tomcat level, as it's harder to control and easier to introduce security vulnerabilities.

3 Likes

So metrics endpoints can be exported to any metrics aggregation system, like Graphite, Prometheus, or Datadog. There are usually quite a few bridges providing connectors from Dropwizard, as it's so popular.

Server uptime and the like are usually less useful to me, as I have that information on the process itself, and it's not something you want to import into the metrics system.

Number of active sessions is quite interesting.

2 Likes

Thanks a lot for the great input @cintiadr. I agree with you; I will start working on the proposal for the project.

1 Like