Enabling Horizontal Scaling in OpenMRS

Thanks @jacob.t.mevorach for your continued work on this! A few comments from me.

Session management can be done with different levels of failure resistance. The easiest option for us might be to put a load balancer in front and have it always route a given client to the same instance (sticky sessions). This approach doesn’t even require any changes to openmrs-core. One drawback is that if a node fails, the session is lost and the user starts a new session on a different node. That’s not such a big issue, as most operations are short-running and in most cases it’s fine for the user to log in again and repeat the action, as long as they see a failure message instructing them to do so. We could go a level higher and introduce session replication and, as you suggest, put session data in Redis. However, it’s not strictly necessary to support OpenMRS replicas; it simply adds to HA. For storing session data we might look for a solution that is Spring-based rather than Tomcat-based (though I think pretty much everyone deploys OpenMRS with Tomcat these days).
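If we do go the session replication route, a Spring-based option would be Spring Session backed by Redis rather than Tomcat’s own session clustering. A minimal sketch of the idea, assuming Spring Session Data Redis and Lettuce are on the classpath; the Redis hostname and port below are placeholders, not anything that exists in openmrs-core today:

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.data.redis.connection.RedisStandaloneConfiguration;
import org.springframework.data.redis.connection.lettuce.LettuceConnectionFactory;
import org.springframework.session.data.redis.config.annotation.web.http.EnableRedisHttpSession;

// Replaces the Tomcat in-memory session store with Redis so that any
// OpenMRS replica can pick up a session created by another replica.
@Configuration
@EnableRedisHttpSession
public class SessionConfig {

    @Bean
    public LettuceConnectionFactory redisConnectionFactory() {
        // "redis" and 6379 are placeholders for the actual deployment values.
        return new LettuceConnectionFactory(new RedisStandaloneConfiguration("redis", 6379));
    }
}
```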

With regard to Shared Database Usage: the database can by design be shared by multiple instances. It’s a different story when your backend is a DB cluster with multiple masters or slaves. In that case I’d put ProxySQL in front to load-balance connections and remove failed connections from the pool. There is generally no need to add complexity to the application to control which DB node is used for writes and reads; ProxySQL handles all of that. I’ll be adding ProxySQL to our helm charts for the Galera cluster option as a proof of concept. In AWS you have RDS Proxy, and Aurora RDS might be able to deal with that on its own. The only remaining thing for HA is retrying queries in case of failures by getting a new connection to a working node, which your code tries to accomplish. It’s nice to have, but not strictly necessary for scaling OpenMRS instances.
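To make the point concrete that the application itself stays simple: the only application-side change is pointing the JDBC URL at ProxySQL instead of at an individual DB node. A rough sketch, using commons-dbcp2 purely for illustration (OpenMRS configures its actual pool via runtime properties, and the host, port and credentials below are placeholders):

```java
import javax.sql.DataSource;
import org.apache.commons.dbcp2.BasicDataSource;

// The application connects to ProxySQL as if it were a single MySQL server;
// ProxySQL decides which Galera node actually serves each connection.
public class ProxySqlDataSourceFactory {

    public static DataSource create() {
        BasicDataSource ds = new BasicDataSource();
        // "proxysql" and 6033 (ProxySQL's default MySQL interface port) are placeholders.
        ds.setUrl("jdbc:mysql://proxysql:6033/openmrs");
        ds.setUsername("openmrs_user"); // placeholder credentials
        ds.setPassword("change-me");
        ds.setDriverClassName("com.mysql.cj.jdbc.Driver");
        return ds;
    }
}
```

Everything about node selection and failover then lives in the ProxySQL configuration, not in application code.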

Now comes the real challenge with running multiple OpenMRS replicas, and it requires a substantial amount of work. I don’t think we need anything like “swarm mode” for an OpenMRS instance. OpenEMR might have a specific use case for making that distinction, but in general it shouldn’t be needed. It doesn’t really matter for OpenMRS which instance is the leader as long as we store data properly. What this means in practice is that we don’t just write data to a shared directory or use instance memory for caching (e.g. in the form of static fields on controllers with HashMaps), but instead use services that ideally can be distributed:

  1. Have a file service that writes to distributed file storage to store Obs, etc.
  2. Have a caching service that writes to a distributed cache, used to store any transient data that currently lives in static fields with HashMaps and the like (see the sketch after this list).
  3. Have Hibernate Search use OpenSearch instead of the built-in Lucene index that each instance writes to its own disk.
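For the 2nd point, a minimal sketch of the direction I have in mind, using Spring’s cache abstraction so that the backing store can later be a distributed cache such as Redis; the class and method names are made up for illustration and don’t exist in openmrs-core:

```java
import org.springframework.cache.annotation.Cacheable;
import org.springframework.stereotype.Component;

// Instead of each controller keeping its own static HashMap, transient data goes
// through a cache abstraction whose backing store can be a distributed cache
// (e.g. Redis) shared by all replicas. The names here are purely illustrative.
@Component
public class ConceptNameCache {

    // Before: private static final Map<Integer, String> CACHE = new HashMap<>();
    // After: the configured CacheManager decides where entries actually live.
    @Cacheable(value = "conceptNames", key = "#conceptId")
    public String getConceptName(Integer conceptId) {
        // Expensive lookup that every replica would otherwise repeat and cache locally.
        return loadConceptNameFromDatabase(conceptId);
    }

    private String loadConceptNameFromDatabase(Integer conceptId) {
        // Placeholder for the actual DAO/service call.
        return "concept-" + conceptId;
    }
}
```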

I’m planning to suggest a GSoC project to go through the code and address the 1st and 2nd points, which is mostly straightforward but tedious. The 3rd might require someone who is familiar with Lucene and OpenSearch to make the adjustments, so it might be hard to find a GSoC student for it.

If you are interested in trying the load-balancer approach, with nginx for example handling sticky sessions, and then maybe adding session data persistence with Redis, I’m happy to work with you. I’m thinking of nginx because we already have the gateway service proxying all requests, which we could use for that specific purpose. Due to limitations in openmrs-core I would test the approach without a shared data file system, having each instance run with its own data store but connected to the same DB. It should mostly work as long as replicas are started after the system is up and the DB has been created.

Augmenting the data source to retry queries would also be a nice contribution. I don’t think it needs to be as complex as AI suggests :wink: See e.g. How to create custom retry logic for Spring Datasource? - Stack Overflow
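Along the lines of that Stack Overflow answer, something as small as a DataSource wrapper that retries getConnection() could already cover the failover case. A rough sketch only, not a drop-in patch for openmrs-core (it retries obtaining a connection; retrying a query that fails mid-transaction is a harder problem):

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;
import org.springframework.jdbc.datasource.DelegatingDataSource;

// Wraps any DataSource and retries getConnection() a few times with a short pause,
// so that a request hitting a failed DB node can get a fresh connection to a
// healthy one (via ProxySQL/RDS Proxy) instead of failing immediately.
public class RetryingDataSource extends DelegatingDataSource {

    private static final int MAX_ATTEMPTS = 3;
    private static final long PAUSE_MILLIS = 500;

    public RetryingDataSource(DataSource target) {
        super(target);
    }

    @Override
    public Connection getConnection() throws SQLException {
        SQLException lastFailure = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return super.getConnection();
            } catch (SQLException e) {
                lastFailure = e;
                try {
                    Thread.sleep(PAUSE_MILLIS);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw e;
                }
            }
        }
        throw lastFailure;
    }
}
```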
