The goal of this project is to create a module that facilitates server monitoring by exposing key server metrics through one or more monitor-friendly web pages.
Description
Improving any aspect of infrastructure monitoring for OpenMRS is beneficial. Currently, the OpenMRS standard distribution has no native support for this, although PIH has developed a customized solution. A new module that exposes some key system-level metrics would therefore be a welcome addition to the standard distribution.
It was decided that the way to go is to create a new module in the OpenMRS server that exposes metrics to third-party monitoring solutions (e.g. Nagios, OpenNMS, Munin, Cacti).
Cross-Checks
Although the Usage Statistics Module is mentioned and referenced in the project wiki, I believe it has nothing to do with infrastructure monitoring.
Most of these monitoring tools usually run on a standalone instance and point at the target server instances. Monitoring is therefore not part of the application; it lives on an isolated, dedicated server instance whose sole purpose is to monitor the servers hosting the application. How can we collect the data that is transferred between the target servers and the Nagios (or any other monitoring) instance?
The objectives mention that it is necessary to whitelist the users who can access this information, possibly through a dashboard. All of those monitoring solutions come with elegant dashboards that show many metrics in detail, and in those dashboards an administrator can block or allow users as they wish. So my question is: why do we want to implement our own module that does more or less the same as, or even less than, what those monitoring solutions already do?
@brucemakallan @ibacher @hamish
Given that, we could look for alternative approaches to enhancing monitoring in our infrastructure. Feel free to share your thoughts and answers on this. Your ideas matter.
@judeniroshan I agree with you: the Usage Statistics Module is geared towards platform-level data analytics, not server-level monitoring.
As I understand it, we need to create a new module that exposes server stats as a REST endpoint. Tools such as Nagios and OpenNMS are all-in-one packages that ship with dashboards out of the box. (I am not sure whether they provide a REST API that could be exposed again, but even if they do, wrapping it would be rework, which is not good.)
But as @judeniroshan mentioned, it's good to think about alternative solutions like Prometheus, which can be easily installed. Prometheus also provides a flexible query language that will help us create a module that exposes system metrics to the outside world, and it can be used as a data source for modern real-time monitoring dashboards like Grafana.
So before I start, let me just say that I'm of the opinion that if you're willing to mentor a GSoC project, you should be able to shape it into whatever you want it to be. However, it's probably good to understand the context in which this project arose in order to understand what its aims are.
First, however, let me just point out that in terms of system level monitoring, we can probably leverage JMX with some simple configuration to be able to monitor some baseline system aspects. Almost any system monitoring tool will be able to extract relevant information from JMX out of the box (system uptime, session count, resource usage, etc.).
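To make the JMX suggestion concrete, here is a minimal sketch that reads a few baseline values from the JVM's standard platform MXBeans, which are the same beans a JMX-capable monitoring tool would query remotely. The class name `JmxBaseline` and the metric key names are made up for illustration and are not part of OpenMRS. Session counts are not shown, since they would come from the servlet container's own MBeans rather than from the JVM.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import java.lang.management.RuntimeMXBean;
import java.lang.management.ThreadMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not an OpenMRS class): reads baseline values
// from the standard platform MXBeans that JMX exposes.
class JmxBaseline {

    // Returns a small, ordered map of metric name -> current value.
    static Map<String, Long> snapshot() {
        RuntimeMXBean runtime = ManagementFactory.getRuntimeMXBean();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();

        Map<String, Long> values = new LinkedHashMap<>();
        values.put("uptime_ms", runtime.getUptime());           // JVM uptime in milliseconds
        values.put("heap_used_bytes", heap.getUsed());          // heap currently in use
        values.put("heap_committed_bytes", heap.getCommitted()); // heap reserved from the OS
        values.put("live_threads", (long) threads.getThreadCount());
        return values;
    }

    public static void main(String[] args) {
        snapshot().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

An external tool would normally read these beans over a remote JMX connection instead of calling them in-process; the in-process calls above only show which data is available without writing any custom collection code.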
The original purpose of this project was to collect not just system-level data but also to have a means of exposing things like application usage, so that we can track not only questions like "is the system up?" but also questions like "do people use these forms we developed?" and "how long does it take to fill out this form on average?". Hence, the original target was not just infrastructure-level data, but also data that lets us gather information on the usefulness of the system.
And, yes, there's no need to develop dashboards as part of this, since any decent system monitoring tool will be able to generate dashboards given the appropriate data.
Hi @ibacher, I'm interested in applying as the student for this project. In that case, I think the project wiki page needs some refinement, as it currently only lists the objectives below.
Create a new module which can collect the necessary real-time data for the metrics in OpenMRS deployed servers.
Present a monitor-friendly page of server metrics, including:
- server uptime
- number of active sessions
- database connections
- system usage counts
Constrain access to the page in a manner that doesn't rely on authentication (e.g., by limiting access to a set and/or range of IP addresses).
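As a rough illustration of the last objective, the sketch below checks a client address against a list of allowed addresses or CIDR ranges. The class `IpAllowList` and its methods are hypothetical names, not an existing OpenMRS or servlet API; in a real module the check would probably sit in something like a servlet filter, and the inputs are assumed to be literal IP addresses rather than host names.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: decides whether a client IP is inside any allowed CIDR range.
class IpAllowList {
    private final List<String> allowed;

    IpAllowList(List<String> allowed) {
        this.allowed = allowed;
    }

    // True if remoteAddr falls inside at least one configured entry.
    boolean allows(String remoteAddr) {
        for (String entry : allowed) {
            if (matches(entry, remoteAddr)) {
                return true;
            }
        }
        return false;
    }

    // Compares the first `prefix` bits of the network and the client address.
    // An entry without "/prefix" is treated as a single address.
    static boolean matches(String cidr, String remoteAddr) {
        try {
            String[] parts = cidr.split("/");
            // Assumes literal IPs; a host name here would trigger a DNS lookup.
            byte[] network = InetAddress.getByName(parts[0]).getAddress();
            byte[] address = InetAddress.getByName(remoteAddr).getAddress();
            if (network.length != address.length) {
                return false; // IPv4 entry vs IPv6 client, or the reverse
            }
            int maxBits = network.length * 8;
            int prefix = parts.length > 1 ? Integer.parseInt(parts[1]) : maxBits;
            if (prefix < 0 || prefix > maxBits) {
                return false; // malformed prefix length
            }
            for (int bit = 0; bit < prefix; bit++) {
                int mask = 0x80 >> (bit % 8);
                if ((network[bit / 8] & mask) != (address[bit / 8] & mask)) {
                    return false;
                }
            }
            return true;
        } catch (UnknownHostException | NumberFormatException e) {
            return false; // unparsable input is never allowed
        }
    }

    public static void main(String[] args) {
        IpAllowList list = new IpAllowList(Arrays.asList("10.0.0.0/16", "192.168.1.10"));
        System.out.println(list.allows("10.0.5.7"));
        System.out.println(list.allows("10.1.0.1"));
    }
}
```

Failing closed on malformed input is a deliberate choice here: a typo in the configuration should lock the page down rather than open it up.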
Yes, I think going with a JMX-based module is a good way forward.
First, for Java monoliths, it's pretty common to use Dropwizard. I know Spring Boot has it by default, and the developers seem to have been able to configure anything I wanted. Example
On any application, I expect 3 distinct health endpoints:
/ping or similar: it returns 200 OK, and sometimes includes the deployed version as well. It doesn't do any checks and doesn't touch any downstream dependency. I use this check to configure auto-healing (restarting the container/service/machine). It is expected to be triggered every couple of seconds.
/health or similar: checks the connection to the database and any downstream dependency. It's a JSON document with an overview of the system, checking all dependencies. It should return 50x if there are any errors on a dependency. I use this for alarming and monitoring (not for auto-healing), and it's probably pulled every minute or so. If there's an obvious problem or exception, it should show up here.
/metrics or similar: a JSON document with a bunch of metrics to be pulled by metrics systems. That can lead to alarms as well. Pulled every 30 seconds or every minute.
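The three endpoints above can be sketched with the JDK's built-in `com.sun.net.httpserver.HttpServer`. This is only an illustration of the split between them, not how an OpenMRS module would actually be wired (that would more likely go through the existing web layer). The class name `MonitoringEndpoints`, the `databaseIsUp` supplier, and the JSON field names are all assumptions made for this example.

```java
import com.sun.net.httpserver.HttpExchange;
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.lang.management.ManagementFactory;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the /ping, /health and /metrics split.
class MonitoringEndpoints {

    // databaseIsUp stands in for a real dependency check (assumption).
    static HttpServer start(int port, BooleanSupplier databaseIsUp) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", port), 0);

        // /ping: no checks at all, just proves the process answers.
        server.createContext("/ping", ex -> reply(ex, 200, "{\"status\":\"ok\"}"));

        // /health: checks dependencies and returns 503 if one is down.
        server.createContext("/health", ex -> {
            boolean up = databaseIsUp.getAsBoolean();
            reply(ex, up ? 200 : 503, "{\"database\":\"" + (up ? "up" : "down") + "\"}");
        });

        // /metrics: raw numbers for a metrics system to pull.
        server.createContext("/metrics", ex -> {
            long heapUsed = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage().getUsed();
            reply(ex, 200, "{\"heap_used_bytes\":" + heapUsed + "}");
        });

        server.start();
        return server;
    }

    // Writes a JSON body with the given status code.
    private static void reply(HttpExchange ex, int status, String json) throws IOException {
        byte[] body = json.getBytes(StandardCharsets.UTF_8);
        ex.getResponseHeaders().set("Content-Type", "application/json");
        ex.sendResponseHeaders(status, body.length);
        try (OutputStream out = ex.getResponseBody()) {
            out.write(body);
        }
    }
}
```

Calling `MonitoringEndpoints.start(8099, () -> true)` would serve all three paths on localhost port 8099 (an arbitrary port chosen for the example). Keeping /ping free of dependency checks is what makes it safe for auto-healing: a database outage then shows up on /health without triggering restarts of an otherwise healthy process.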
Now, the metrics I expect are specifically JVM metrics: how much memory is used in each area of the heap, and how much time and CPU the garbage collector is using. The number of database connections could be nice, and you could also have a counter for errors in the last minute. But overall, my problem with Java is always GC, so I'd ask for that.
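A minimal sketch of collecting those JVM numbers from the standard `java.lang.management` beans is shown below. The class name `JvmGcMetrics` and the metric key format are invented for this example. Note that the standard beans report accumulated collection time per collector, which is elapsed time rather than a direct CPU measurement.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical collector for per-heap-area usage and garbage collector activity.
class JvmGcMetrics {

    static Map<String, Long> collect() {
        Map<String, Long> metrics = new LinkedHashMap<>();

        // One entry per heap memory pool (names depend on the collector in use).
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // may be null if the pool is no longer valid
            if (pool.getType() == MemoryType.HEAP && usage != null) {
                metrics.put("heap." + key(pool.getName()) + ".used_bytes", usage.getUsed());
            }
        }

        // Collection count and accumulated collection time per garbage collector.
        // Both values are -1 when the JVM does not define them.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            metrics.put("gc." + key(gc.getName()) + ".count", gc.getCollectionCount());
            metrics.put("gc." + key(gc.getName()) + ".time_ms", gc.getCollectionTime());
        }
        return metrics;
    }

    // Turns a bean name such as "G1 Old Gen" into a metric-friendly key.
    private static String key(String name) {
        return name.toLowerCase().replaceAll("[^a-z0-9]+", "_");
    }

    public static void main(String[] args) {
        collect().forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Because the count and time values are cumulative, a metrics system would typically compute the rate between two pulls to see how much time is going into garbage collection.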
Overall, I'm happy for these endpoints to be "public" as long as we limit them in the reverse proxy (for example, by putting all of them under /monitoring/*). If you want to go with authentication, make sure to have a user/password that doesn't have access to anything else on the system, and that we can use basic auth to see the endpoints.
I wouldn't block IPs at the Tomcat level, as it's harder to control and makes security vulnerabilities more likely.
So metrics endpoints can be exported to any metrics aggregator system, like Graphite, Prometheus or Datadog. There are usually quite a few bridges and connectors available for Dropwizard, as it's so popular.
Server uptime and the like are usually less useful to me, as I already have that information from the process itself, and it's not something that you want to import into the metrics system.