OpenMRS goes down due to 'Java Heap Size - Out of Memory' issue in Bahmni Lite

We are currently running performance tests for Bahmni Lite in a dedicated performance environment. Further details on the setup and test scenarios can be found here.

When we started the performance test, we ran with the default JAVA_OPTS (JVM options) for OpenMRS:

-Dfile.encoding=UTF-8 -server -Xms256m -Xmx768m -XX:PermSize=256m -XX:MaxPermSize=512m

but we encountered the 'Java Heap Size - Out of Memory' issue intermittently as soon as we started the execution. After discussion with the community and consulting various references on optimal JVM settings, we currently use the following JAVA_OPTS value:

-Dfile.encoding=UTF-8 -server -Xms2048m -Xmx2048m -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:MetaspaceSize=768m -XX:MaxMetaspaceSize=768m -XX:InitialCodeCacheSize=64m -XX:ReservedCodeCacheSize=96m -XX:SurvivorRatio=16 -XX:TargetSurvivorRatio=50 -XX:MaxTenuringThreshold=15 -XX:+UseParNewGC -XX:ParallelGCThreads=16 -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:+CMSCompactWhenClearAllSoftRefs -XX:CMSInitiatingOccupancyFraction=85 -XX:+CMSScavengeBeforeRemark
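As a side note, the CMS/ParNew flags above (`-XX:+UseConcMarkSweepGC`, `-XX:+UseParNewGC`) assume a Java 8-era JVM; CMS was deprecated in JDK 9 and removed in JDK 14. A back-of-envelope sketch of the fixed footprint those flags commit (illustrative only; actual RSS also includes thread stacks, GC structures, and native buffers):

```java
// Rough fixed-footprint estimate implied by the JAVA_OPTS above (sketch only;
// real resident memory will be higher due to threads, GC metadata, and
// native/direct buffers).
public class FootprintEstimate {
    public static void main(String[] args) {
        int heapMb = 2048;       // -Xmx2048m
        int metaspaceMb = 768;   // -XX:MaxMetaspaceSize=768m
        int codeCacheMb = 96;    // -XX:ReservedCodeCacheSize=96m
        System.out.println(heapMb + metaspaceMb + codeCacheMb + " MB minimum to budget for the JVM");
    }
}
```

So the container or host should have comfortably more than ~3 GB available for this JVM alone.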

But now we are encountering the same heap issue on longer-duration executions (8-hour simulations) and observed the following:

  • 70 active users, 8 hours - not a single simulation completed successfully
  • 50 active users, 8 hours - the simulation was successful, but after a certain time OpenMRS went down
  • 40 active users, 8 hours - simulation successful; the heap hovered around the max size, but OpenMRS never went down

The observations have been documented and are available in the Bahmni Wiki page.

Source Code: GitHub - Bahmni/performance-test

Please provide your suggestions to optimise the JVM further and prevent this issue!

cc: @n0man @gsluthra @angshuonline

Thanks for sharing! An analysis of a heap dump needs to be done. Most likely there are some memory leaks that need to be addressed.

Could you please share a heap dump after running for a few hours? You can capture one with e.g. jconsole or jvisualvm, or automatically on out-of-memory with:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=<path>

(see https://www.baeldung.com/java-heap-dump-capture)
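Alongside the tools mentioned above, a heap dump can also be triggered programmatically from inside the JVM via the HotSpot diagnostic MXBean. A minimal sketch, assuming a HotSpot JVM (the `HeapDumper` class name and output path are illustrative, not part of OpenMRS):

```java
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

// Sketch: trigger a heap dump programmatically, as an alternative to
// jconsole/jvisualvm or -XX:+HeapDumpOnOutOfMemoryError. Uses the
// HotSpot-specific diagnostic MXBean, so it assumes a HotSpot JVM.
public class HeapDumper {
    public static void dumpHeap(String filePath, boolean liveObjectsOnly) throws IOException {
        HotSpotDiagnosticMXBean mxBean = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        // On recent JDKs the file name must end in ".hprof" and must not already exist.
        mxBean.dumpHeap(filePath, liveObjectsOnly);
    }

    public static void main(String[] args) throws IOException {
        File out = new File("openmrs-heap.hprof");
        if (out.exists()) out.delete();
        dumpHeap(out.getPath(), true); // true = dump only live (reachable) objects
        System.out.println("Wrote " + out.getPath());
    }
}
```

Passing `true` for the second argument forces a full GC first and dumps only reachable objects, which usually gives a smaller, cleaner dump for leak analysis.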

Sure @raff, we will share the heap dump from the next failed run.

@raff we have completed the 70-active-users 8-hour performance test and observed the same issue again. We have captured the heap dump as you suggested; it is attached below for your reference.

HeapDump


Thank you! I’ll get the chance to look into that early next week.

So this is not so much a memory leak as too many things being held in memory, in the FHIR module in particular. What happens is that when you make a search request to the FHIR module, if the returned results are larger than the configured page size, we store the results in a FIFO queue. This is because, as required by the FHIR spec, we return next and previous links as part of the results, and if implementations follow those links, we need to be able to resume the results from where they left off. That queue holding search results is growing to an absurd size (~2 GB) for two reasons:

  1. We are allowing the FIFO queue to contain too many search results.
  2. The search results are stored as UUIDs. Looking through the heap dump, almost everything in the queue has on the order of 8k-12k results, almost entirely UUIDs for patients.

On the FHIR2 side, the quick fix for this (which should be available momentarily in 1.8.0-SNAPSHOT) is just to severely ramp down the number of results we allow to be queued. I’m also going to try to ensure that instead of storing UUIDs, we’re storing integers (i.e., the record ids), which should save quite a bit of space.
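The general shape of that cap can be sketched with a size-bounded, insertion-ordered map. The class name, cap, and value type here are hypothetical illustrations, not the FHIR2 module's actual code:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a FIFO-evicting cache for paged search results:
// once maxEntries searches are held, the oldest is dropped, so memory use
// stays bounded under sustained load. Not the FHIR2 module's actual class.
public class BoundedSearchCache<K> extends LinkedHashMap<K, List<Integer>> {
    private final int maxEntries;

    public BoundedSearchCache(int maxEntries) {
        super(16, 0.75f, false); // accessOrder=false => insertion order (FIFO)
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, List<Integer>> eldest) {
        return size() > maxEntries; // evict the oldest queued search past the cap
    }
}
```

With a cap of, say, 100 queued searches of ~10k integer ids each, the worst case stays in the tens of megabytes rather than growing toward gigabytes.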


Update: Turned out to be easier than I thought to move from UUIDs → Integers, which is done here: FM2-537: SearchQueryBundleProvider should use Integer primary keys · openmrs/openmrs-module-fhir2@4930275 · GitHub.
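For a rough sense of why the UUID → Integer switch helps (illustrative payload sizes only; real retained sizes depend on JVM version, compressed oops, and the String representation): a UUID rendered as a String carries 36 characters of payload plus String and array object headers, while a boxed Integer has a 4-byte payload in a small fixed-size object.

```java
import java.util.UUID;

// Payload-size illustration only; actual per-object retained sizes are
// larger and depend on the JVM's object layout.
public class IdPayload {
    public static void main(String[] args) {
        String uuidId = UUID.randomUUID().toString();
        System.out.println(uuidId.length()); // 36 characters of payload per id
        System.out.println(Integer.BYTES);   // 4 bytes of payload per id
    }
}
```

Multiplied across queues holding 8k-12k results per search, that per-id difference adds up quickly.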

@ramkumar.g Are you able to re-run your tests with the 1.8.0-SNAPSHOT version of the FHIR2 module? If the fixes I’ve pushed solve this issue, I’ll release 1.8.0.

Hi @ibacher, Thanks for the 1.8.0 snapshot. We were able to perform various tests and found that memory consumption by the FHIR services has been reduced to a great extent. Please plan the release of this version. Also, any updates on the search-results queue limitation feature for this module?

Awesome!

Could you please clarify what you’re asking here?

Hi @ibacher You mentioned a quick fix where we limit the number of results that can be queued. Is this fix already available in 1.8.0-SNAPSHOT along with the Integer update?

Yes, that’s already in place.