Proactively monitoring application logs for issues

In many OpenMRS deployments, the primary mechanism by which developers and implementers learn about problems or bugs in the application, or discover unexpected usage patterns, is feedback from end users. However, there are often barriers to this communication, and easily identifiable, fixable issues commonly persist far longer than they should.

One way to mitigate this is to invest in tools that automatically monitor application log files for errors or particular activity. Something as straightforward as parsing catalina.out for error messages and other patterns, then aggregating these into a store or into email alerts for easy review and tracking over time, would seem like an uncontroversial first step. Yet I have rarely seen this in practice, and I'm not aware of any documentation, tools, or approaches that have been shared or promoted in the OpenMRS community.
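To make this concrete, here is a minimal sketch of the kind of script I have in mind, run periodically from cron: it scans catalina.out for error lines and emails a digest. The log path, addresses, and SMTP host are placeholders to adapt per deployment, not anything standard:

```python
#!/usr/bin/env python3
"""Sketch: scan catalina.out for errors and email a digest."""
import re
import smtplib
from collections import Counter
from email.message import EmailMessage

LOG_FILE = "/var/log/tomcat/catalina.out"   # placeholder: adjust per deployment
ERROR_PATTERN = re.compile(r"\b(ERROR|SEVERE|Exception)\b")

def collect_errors(path):
    """Count distinct error lines so repeated failures stand out."""
    counts = Counter()
    with open(path, errors="replace") as f:
        for line in f:
            if ERROR_PATTERN.search(line):
                counts[line.strip()[:200]] += 1  # truncate long stack frames
    return counts

def send_digest(counts):
    msg = EmailMessage()
    msg["Subject"] = f"OpenMRS log digest: {sum(counts.values())} error lines"
    msg["From"] = "monitor@example.org"        # placeholder address
    msg["To"] = "implementers@example.org"     # placeholder address
    body = "\n".join(f"{n:5d}  {line}" for line, n in counts.most_common(50))
    msg.set_content(body or "No errors found.")
    with smtplib.SMTP("localhost") as smtp:    # assumes a local mail relay
        smtp.send_message(msg)

if __name__ == "__main__":
    send_digest(collect_errors(LOG_FILE))
```

A real version would track the file offset between runs so it only reports new errors, but even this crude form surfaces recurring problems.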

I wanted to start this thread to see whether there are groups already doing this kind of thing successfully and, if so, to learn from their approaches. I suspect that groups like @Mekom and @AMPATH may have been doing this successfully for a long time, and I'm eager to learn more.

If anyone has any experience to share, please do so here - thanks!

@mseaton thanks for starting this thread. It's been on my mind as well. It's dead simple when you deploy OpenMRS to a cloud provider, since they provide tools like AWS CloudWatch, Azure Monitor, etc., but that's not a common deployment model for OpenMRS.

I think an on-premise solution for OpenMRS needs to aggregate logs at least from Tomcat, the database, and nginx (for O3 deployments), plus frontend JS errors. Another thing is to gather metrics on resource consumption, which, combined with the logs, would also greatly help in troubleshooting issues.
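For the frontend JS errors specifically, the browser needs somewhere to send them. Here is a minimal sketch of a collection endpoint; the /frontend-errors path, port, and log file are made up for illustration, as nothing like this ships with O3 as far as I know:

```python
#!/usr/bin/env python3
"""Sketch: a tiny endpoint that records frontend JS errors to a log file."""
import json
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

# One JSON object per line; the filename is a placeholder
logging.basicConfig(filename="frontend-errors.log",
                    format="%(asctime)s %(message)s", level=logging.INFO)

class ErrorCollector(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/frontend-errors":   # hypothetical path, not an O3 API
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        logging.info(json.dumps(payload))     # append the error as one line
        self.send_response(204)               # no response body needed
        self.end_headers()

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8099), ErrorCollector).serve_forever()
```

The frontend side would just POST the payload from a window.onerror or unhandledrejection handler, and writing one JSON object per line keeps the output easy to ship to Loki or ELK later.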

One of the most popular on-premise tools for log aggregation and metrics monitoring is the ELK stack (Elasticsearch, Logstash, and Kibana). Deploying the whole stack is quite resource-intensive, though; in practice it needs at least 4 GB of RAM. That's probably not the best fit for many of our implementations, which tend to be small, but it's extremely powerful for bigger implementations running on beefier hardware.

A very good alternative is the Grafana stack (Loki for logs and Prometheus for metrics). These services need much less RAM; around 500 MB would probably be enough. They can store all data on the filesystem (other storage backends such as S3 are supported as well).
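To show how lightweight Loki's ingestion is, here is a minimal sketch that pushes a single log line through its HTTP push API. It assumes Loki is listening on localhost:3100 (its default port), and the labels are purely illustrative; in a real deployment Promtail, Loki's agent, would tail the log files and do this shipping for you:

```python
#!/usr/bin/env python3
"""Sketch: push one log line to Loki's HTTP push API."""
import json
import time
import urllib.request

LOKI_URL = "http://localhost:3100/loki/api/v1/push"  # Loki's default port

def push_to_loki(line, labels):
    ts_ns = str(time.time_ns())  # Loki expects nanosecond timestamps
    payload = {"streams": [{"stream": labels, "values": [[ts_ns, line]]}]}
    req = urllib.request.Request(
        LOKI_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # 204 on success

if __name__ == "__main__":
    # Labels here are illustrative; pick ones that match your dashboards.
    push_to_loki("ERROR: example error line from catalina.out",
                 {"job": "tomcat", "host": "openmrs-server"})
```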

I think that as a community we could provide a simple Docker setup with Grafana that would suit most of our implementations.

Looking forward to hearing others' experiences as well.