OMRS Monitoring Tools

corneliouzbett · November 8, 2022, 11:49am

Almost every organization, no matter how small, needs some form of regular system performance monitoring. Monitoring can take a variety of forms, each designed to address different issues with the system and server. Regardless of implementation size, the need for server, network, and infrastructure monitoring with good tooling cannot be overlooked.

I’ve been struggling with unplanned sluggish system performance, discovering system bottlenecks, and system failures for quite some time. Although I am not an infrastructure specialist, I have learned a few things about system performance and infrastructure. With the right monitoring tools, you can detect performance issues, gain insight into what is going on in a server, and uncover the underlying causes. I’ve always found it inconvenient to read and browse through server/application logs; I could only hope to export them to an external platform with advanced filtering and analytical capabilities. Let’s not worry about logs for now and instead look at the key metrics to measure:

OS metrics - CPU utilization and memory usage
JVM metrics - JVM memory, garbage collector (collection count & collection time)
Tomcat metrics - request throughput and latency, thread pool and executors, errors (error count)
Database connection pool metrics

I’m looking for a solution that is adaptable, low-maintenance, built on modern technology, simple to use, and integrable with an alert/notification manager. Prometheus + Grafana is the popular choice for this case. However, I am aware that different implementations use different setups for application and infrastructure monitoring. Please share your monitoring experience, particularly as it relates to an OpenMRS instance. I’ve seen solutions in the community, e.g. emr-monitor and the usage statistics module.

From what I know, OpenMRS doesn’t provide out-of-the-box monitoring tools. I intend to work on a solution that optionally (via a configuration setting) exposes JVM-based metrics, e.g. metrics on classloaders, memory, garbage collection, threads, etc., along with the ability to check application health. I envision a configuration setting such as ENABLE_METRICS: true that exposes JVM metrics through a /metrics endpoint, which can then be visualized using Grafana.
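
To make the idea concrete, here is a minimal sketch of a configuration-gated metrics endpoint. It is written in Python purely for illustration and is not OpenMRS code. Only the ENABLE_METRICS flag and the /metrics path come from the proposal above; the metric names, sample values, and the `handle_request` helper are placeholders. The output follows the Prometheus text exposition format.

```python
import os

# Placeholder samples: (type, help text, value). A real implementation
# would read these from the JVM instead of hard-coding them.
SAMPLE_METRICS = {
    "jvm_threads_current": ("gauge", "Current JVM thread count (placeholder)", 42),
    "jvm_memory_bytes_used": ("gauge", "Used JVM memory in bytes (placeholder)", 123456789),
}


def metrics_enabled(env=None):
    """Return True only when ENABLE_METRICS is set to 'true' (case-insensitive)."""
    env = os.environ if env is None else env
    return env.get("ENABLE_METRICS", "false").strip().lower() == "true"


def render_metrics(metrics):
    """Render metrics in the Prometheus text exposition format."""
    lines = []
    for name, (metric_type, help_text, value) in metrics.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} {metric_type}")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def handle_request(path, env=None):
    """Return (status, body); the endpoint only exists when the flag is on."""
    if path != "/metrics" or not metrics_enabled(env):
        return 404, "Not Found\n"
    return 200, render_metrics(SAMPLE_METRICS)
```

With the flag off (the default here), a request to /metrics gets a 404, so implementations that don’t want metrics exposed don’t have to do anything.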

First approach:

Using the Prometheus JMX Exporter: a collector that can be configured to scrape and expose the MBeans of a JMX target. Since OpenMRS already ships an openmrs-core Docker image with Tomcat as the base, the setup should be as straightforward as adding the Java agent jar file to the container along with the configuration:

CATALINA_OPTS="$CATALINA_OPTS -javaagent:/path/to/<metrics>.jar"
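
One detail worth noting: the JMX Exporter Java agent takes an argument after the jar path, in the form `=<port>:<config file>`, naming the port to serve metrics on and a YAML configuration file. Below is a small sketch of assembling that option (Python used only for illustration). The jar location, port 9404, and config path are placeholders, not values from the openmrs-core image.

```python
def javaagent_option(jar_path, port, config_path):
    """Build the -javaagent option in the form <jar>=<port>:<config file>."""
    return f"-javaagent:{jar_path}={port}:{config_path}"


def append_option(existing_opts, option):
    """Append an option to an existing CATALINA_OPTS value."""
    return f"{existing_opts} {option}".strip()


# Placeholder paths and port, for illustration only.
option = javaagent_option("/opt/jmx/agent.jar", 9404, "/opt/jmx/config.yaml")
catalina_opts = append_option("-Xmx1g", option)
```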

Second approach:

Targeting Tomcat: this requires building a custom Java client agent.

Basically, both approaches expose data to be scraped by Prometheus, stored as time series, and then visualized using Grafana. This is not new; I would like to hear more about how other implementations have achieved system monitoring and a reliable alerting system.
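
As a rough illustration of the scrape step, the sketch below parses a Prometheus text-format payload into (name, labels, value) samples, which is roughly what a scraper does before storing each sample with a timestamp. It is a simplification: it handles only plain sample lines (no escaped characters in label values, no trailing timestamps), and the example payload is made up.

```python
import re

# Matches: metric_name{label="value",...} 1.23   (the label block is optional)
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_exposition(text):
    """Parse simple Prometheus text-format lines into (name, labels, value)."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot handle
        name, raw_labels, raw_value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(raw_value)))
    return samples


# Made-up payload, shaped like what a /metrics endpoint might return.
EXAMPLE_PAYLOAD = """# HELP jvm_threads_current Current thread count
# TYPE jvm_threads_current gauge
jvm_threads_current 42
jvm_memory_bytes_used{area="heap"} 1.5e+08
"""

samples = parse_exposition(EXAMPLE_PAYLOAD)
```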

Thoughts?

@dev5 @dev4 @dev3 @dev2 @dev1 @Platform_Team

burke · November 8, 2022, 2:24pm

@corneliouzbett,

Thank you so much for your thoughtful and deliberate approach toward monitoring. As you’ve noticed, we’ve had several efforts at monitoring over the years with variable success and uptake. I applaud an approach that tries to leverage existing tooling rather than creating bespoke solutions that are harder to maintain.

I’d love to hear what groups like PIH, Mekom, UgandaEMR, Palladium, etc. think of your approach. It would be wonderful to find an approach that two or more groups were excited about using (increasing the likelihood of wider adoption).

Do you envision some metrics always available and the ability to increase metrics output during troubleshooting – i.e., is this ENABLE_EXTENDED_METRICS?

I’m assuming a metrics API would be read-only and little security risk, but I also assume not every implementation will want to make metrics world readable. How do we make metrics readily accessible to common tooling without exposing it to anonymous clients? I don’t know what is considered best practice (e.g., can most tools POST a token to provide a modest level of security)?

/cc @mseaton @mksd @ssmusoke @aojwang

grace · November 18, 2022, 1:14pm

FYI @achachiez

grace · November 18, 2022, 1:26pm

Next steps on this project, discussed w/ Daniel & Bett today:

@dkayiwa @ibacher @raff do you have any feedback on the endpoint security concern?
@corneliouzbett is going to proceed with setting up a prototype with PR support from @dkayiwa - so he’ll get things working with endpoints. Let’s not block his fellowship project progress given he has only <6 weeks left.
I’ll intro Bett to some folks at Jembi who have been significantly focused on monitoring tooling. But this should not be a blocker for Bett to proceed with prototype work.

ibacher · November 18, 2022, 1:59pm

It’s probably worth taking a look at this module that was a GSoC project that didn’t quite get far enough, but which implemented a number of ideas around this, including exporting metrics via JMX.

I’d assume we can create some kind of OMRS service account with w/e permissions are necessary to query the endpoint. Any monitoring tool should be able to handle sending requests with Basic Auth (if there’s nothing built-in, you can usually configure things to include a fixed header in the web request, which is all that’s needed for that to work).
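
For reference, the “fixed header” for Basic Auth is just the base64 encoding of `username:password` prefixed with `Basic `. Here is a minimal sketch of computing it and attaching it to a request. The account name `metrics-reader`, its password, and the URL are hypothetical, not an existing OpenMRS account or endpoint.

```python
import base64
import urllib.request


def basic_auth_header(username, password):
    """Return the value of the Authorization header for HTTP Basic Auth."""
    credentials = f"{username}:{password}".encode("utf-8")
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def build_metrics_request(url, username, password):
    """Build (but do not send) a request carrying the Basic Auth header."""
    request = urllib.request.Request(url)
    request.add_header("Authorization", basic_auth_header(username, password))
    return request


# Hypothetical service account and endpoint, for illustration only.
request = build_metrics_request(
    "https://openmrs.example.org/openmrs/metrics", "metrics-reader", "change-me"
)
```

Since Basic Auth only encodes the credentials rather than encrypting them, this is only reasonable over HTTPS.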

raff · November 18, 2022, 2:10pm

@corneliouzbett please also have a look at google/cadvisor on GitHub (“Analyzes resource usage and performance characteristics of running containers”) for container metrics. It can be nicely integrated with Grafana.

achachiez · November 24, 2022, 7:36am

Thanks @corneliouzbett for bringing this up. I think we have had a plan at @Mekom to have this for a while. I believe we are taking a similar approach to what you are suggesting. You can take a look at this draft PR https://github.com/mekomsolutions/bahmni-docker/pull/54 which implements this in the @Mekom OpenMRS docker image.