Enabling Horizontal Scaling in OpenMRS

Hello all!

Wanted to type up some initial thoughts on changes to enable horizontal scaling just so I could start a conversation and try to get some feedback.

I’m asking because, although I’ve done my best to study the code, I’m relatively new to this project, while many people in this community are experts.

I would greatly appreciate it if you could offer honest and direct feedback so I can contribute to this project I’ve grown quite fond of.


Session Management:

  1. My thinking is that, no matter what, OpenMRS needs the backend SQL database to function. Because of this I was thinking about submitting a PR to have OpenMRS core use MySQL session persistence for Tomcat by default. All sessions would then be stored in and read from the database, which in and of itself would eliminate the need for sticky sessions.
  2. This should be tested and confirmed at some point, but my hunch is that Redis-based session management would be far more performant and scalable. Using it, however, would add a support burden (i.e. whoever was maintaining an OpenMRS instance run in this manner would need to be familiar with Redis as well). My initial thought would be to offer environment variables that, when set, switch session management over from MySQL to Redis-based Tomcat session management. We should also take care to include support for connecting to Redis over SSL so that everything can be encrypted in transit and at rest. I should note that for the AWS architecture I was able to encrypt all connections between OpenMRS and the database backend by first downloading SSL materials onto the shared EFS storage volume here and then pointing OpenMRS, when it’s started later, to use those SSL materials by constructing a JDBC connection URL here.
  3. Both methods for session management require WARs to be downloaded, so part of this implementation will have to involve changing the container images as well to include these steps (it should be noted we could include the download in “startup.sh”, but I wouldn’t recommend it).
  4. The idea would be that startup.sh detects which environment variables are set and, based on that, configures the Tomcat files (notably “server.xml”, “pom.xml” and “context.xml”) to enable either MySQL- or Redis-based session management. A rough sketch of what the MySQL-backed variant could look like follows this list.
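
Just to make item 1 a bit more concrete, here’s a rough sketch of what MySQL-backed session persistence could look like in Tomcat’s “context.xml”. Treat it as an illustration only: the JDBC URL (including the SSL parameter), credentials and table/column names are placeholders, and the exact attributes should be checked against the Tomcat version we ship. A Redis-backed variant would swap the Manager/Store for a Redis session manager and add the SSL settings mentioned in item 2.

<!-- Hypothetical context.xml fragment: persist Tomcat sessions to MySQL -->
<Context>
    <Manager className="org.apache.catalina.session.PersistentManager"
             maxIdleBackup="0">  <!-- back sessions up to the store almost immediately -->
        <Store className="org.apache.catalina.session.JDBCStore"
               driverName="com.mysql.cj.jdbc.Driver"
               connectionURL="jdbc:mysql://db-host:3306/openmrs_sessions?sslMode=REQUIRED"
               connectionName="openmrs"
               connectionPassword="change-me"
               sessionTable="tomcat_sessions"
               sessionIdCol="session_id"
               sessionDataCol="session_data"
               sessionValidCol="valid_session"
               sessionMaxInactiveCol="max_inactive"
               sessionLastAccessedCol="last_access"
               sessionAppCol="app_name"/>
    </Manager>
</Context>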

Editing the Startup Script to Allow OpenMRS to Boot Safely as a Replica:

  1. OpenEMR is another open-source medical record system that can boot safely with replicas. OpenEMR’s equivalent of OpenMRS’s “startup.sh” script is “openemr.sh”. The latest production version (currently v7.0.2) of this script, which runs every time an OpenEMR container starts, can be found here. I’ll be referring to specific lines in that script later in this section to explain how OpenEMR handles its “leader election” process, how this could be implemented for OpenMRS, and how this allows OpenEMR to safely boot with replicas.
  2. For the purposes of this conversation the important part of the “openemr.sh” script starts at line 74. When you boot an OpenEMR container with the “SWARM_MODE” environment variable set to “yes”, OpenEMR begins its leader election process (for reference, I have to set this in Host OpenEMR on AWS Fargate to make it work here). The leader election itself happens pretty concisely on line 77: the container attempts to create an empty file called “docker-leader”. If it can, it becomes the leader; if it can’t, because another container already created the file first, it waits for the leader container to perform any one-time setup that should only be performed once.
  3. For the OpenMRS startup.sh case, the way this would work is that at line 9 we would insert a three-line leader election process similar to the OpenEMR openemr.sh example (see the sketch after this list). Everything from lines 10 to 47 would then be done by a single OpenMRS container, and once that was done the leader would create an empty file like “openmrs-setup-complete”. The other containers would wait until that file existed, and then all the containers would start Tomcat at the same time, as is done today at line 49.
  4. In this case we would have one shared volume for /usr/local/tomcat so that the configuration generated by the leader container could be read by the other containers.
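
To illustrate what that three-line election plus the wait loop could look like, here’s a rough shell sketch for startup.sh. The marker names (“openmrs-leader”, “openmrs-setup-complete”) and the shared directory are hypothetical, and it relies on the directory living on the volume shared by all replicas (mkdir is used because directory creation is atomic):

# Hypothetical leader election sketch for startup.sh (names and paths are placeholders)
SHARED_DIR=/usr/local/tomcat/.cluster
mkdir -p "$SHARED_DIR"

if mkdir "$SHARED_DIR/openmrs-leader" 2>/dev/null; then
    echo "This replica won the election; performing one-time setup..."
    # ...the existing one-time setup steps (today's lines 10-47) would run here...
    touch "$SHARED_DIR/openmrs-setup-complete"
else
    echo "Another replica is the leader; waiting for setup to complete..."
    until [ -f "$SHARED_DIR/openmrs-setup-complete" ]; do
        sleep 5
    done
fi

# All replicas continue from here and start Tomcat as is done today (line 49)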

In Regard to General Shared Storage:

  1. One of the things I did in OpenMRS Contrib Cluster AWS ECS is mount an encrypted EFS volume to be shared between the containers at “/openmrs/data” (this can be found on line 1293 here). This was so that, in theory, the containers would be able to write and read data from a shared location. In practice, though, startup-init.sh writes a lot of things to this directory that shouldn’t be written more than once. A good example is the OpenMRS properties file, which, if we boot a bunch of container replicas today, just ends up being appended to ad nauseam.
  2. A simple fix here could be to mount the shared volume at “/openmrs/data/complex_obs” instead of “/openmrs/data”. All uploaded files become complex observations anyway, and the intention with this volume is to provide shared file storage so that, no matter which container is serving you, you can still access all the files.

In Regard to Shared Database Usage:

  1. Probably the hardest challenge that would have to be solved to allow OpenMRS to run as horizontal replicas is modifying the application so that multiple replicas can safely use the database at the same time. To accomplish this we’ll need changes to both the server configuration as it exists in the backend docker image today and the OpenMRS application code.
  2. Truth be told, I haven’t finished identifying everything that would need to change to accomplish this, but I do think I have a good starting point for us.
  3. In short, Tomcat offers pretty nice support for using a distributed connection pool to the database which, when paired with a transaction manager, should allow us to safely run transactions against the database when running replicas horizontally.
  4. I generated a code example with an LLM that I’ll post below showing 1/ Java code examples for interacting with a database and 2/ a sample Tomcat configuration that aims for horizontally safe connections. For sure there are bugs, and this is definitely not production code that anyone should copy and run anywhere as is (i.e. I’m pretty sure a lot of the packages referenced in it aren’t even maintained anymore), but I think it does a good job illustrating the concepts.
// 1. XA DataSource Configuration Class with horizontal scaling considerations
package com.example.config;

import com.mysql.cj.jdbc.MysqlXADataSource;
import org.apache.commons.dbcp2.managed.BasicManagedDataSource;
import javax.sql.XADataSource;
import javax.transaction.TransactionManager;
import org.apache.geronimo.transaction.manager.GeronimoTransactionManager;
import javax.naming.Context;
import javax.naming.InitialContext;
import javax.naming.NamingException;
import java.util.Properties;

public class XADataSourceConfig {
    
    // Use environment variables for configuration to support multiple instances
    private static final String DB_URL = System.getenv("DB_URL");
    private static final String DB_USER = System.getenv("DB_USER");
    private static final String DB_PASSWORD = System.getenv("DB_PASSWORD");
    
    // Instance identifier for logging and monitoring
    private static final String INSTANCE_ID = System.getenv("INSTANCE_ID");
    
    public void configureDataSource() throws Exception {
        // Create and configure the XA DataSource
        MysqlXADataSource xaDataSource = new MysqlXADataSource();
        xaDataSource.setUrl(DB_URL);
        xaDataSource.setUser(DB_USER);
        xaDataSource.setPassword(DB_PASSWORD);
        
        // Critical for XA transaction consistency across replicas
        xaDataSource.setPinGlobalTxToPhysicalConnection(true);
        
        // Enable replication-aware settings
        xaDataSource.setRewriteBatchedStatements(true);
        xaDataSource.setAllowMultiQueries(false); // Security consideration for distributed systems
        xaDataSource.setInteractiveClient(false);
        
        // Create the transaction manager with unique instance name
        Properties txProps = new Properties();
        txProps.setProperty("InstanceName", INSTANCE_ID);
        TransactionManager transactionManager = new GeronimoTransactionManager(txProps);
        
        // Create the managed data source with connection pooling
        BasicManagedDataSource managedDataSource = new BasicManagedDataSource();
        managedDataSource.setXaDataSourceInstance(xaDataSource);
        managedDataSource.setTransactionManager(transactionManager);
        
        // Configure pool sizes based on instance capacity
        int maxConnections = calculateMaxConnections();
        managedDataSource.setInitialSize(maxConnections / 4);
        managedDataSource.setMaxTotal(maxConnections);
        managedDataSource.setMaxIdle(maxConnections / 2);
        managedDataSource.setMinIdle(maxConnections / 4);
        
        // Shorter wait times for distributed environment
        managedDataSource.setMaxWaitMillis(5000);
        
        // Aggressive connection validation for distributed systems
        managedDataSource.setTestOnBorrow(true);
        managedDataSource.setTestWhileIdle(true);
        managedDataSource.setValidationQuery("SELECT 1");
        managedDataSource.setValidationQueryTimeout(2);
        
        // More aggressive abandoned connection handling
        managedDataSource.setRemoveAbandonedOnBorrow(true);
        managedDataSource.setRemoveAbandonedTimeout(30); // 30 seconds
        managedDataSource.setLogAbandoned(true);
        
        // Add JMX monitoring for distributed environment
        managedDataSource.setJmxName("pool.xa." + INSTANCE_ID);
        
        // Bind to JNDI with instance-specific name
        bindToJNDI(managedDataSource);
    }
    
    private int calculateMaxConnections() {
        // Calculate based on available system resources
        int availableProcessors = Runtime.getRuntime().availableProcessors();
        long maxMemory = Runtime.getRuntime().maxMemory() / (1024 * 1024); // MB
        
        // Conservative connection count based on resources
        return Math.min(availableProcessors * 10, (int)(maxMemory / 100));
    }
    
    private void bindToJNDI(BasicManagedDataSource dataSource) throws NamingException {
        Context ctx = new InitialContext();
        // Include instance ID in JNDI name for isolation
        ctx.bind("java:comp/env/jdbc/XADataSource/" + INSTANCE_ID, dataSource);
    }
}

// 2. Transaction Service with retry logic for distributed environment
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;
import javax.annotation.Resource;
import javax.sql.DataSource;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

@Service
public class TransactionService {
    private static final int MAX_RETRIES = 3;
    private static final Logger logger = LoggerFactory.getLogger(TransactionService.class);
    
    @Resource(name="java:comp/env/jdbc/XADataSource")
    private DataSource dataSource;
    
    @Transactional
    public void performDistributedTransaction() throws SQLException {
        int attempts = 0;
        while (attempts < MAX_RETRIES) {
            try {
                executeTransaction();
                break;
            } catch (SQLException e) {
                attempts++;
                if (attempts == MAX_RETRIES) {
                    throw e;
                }
                // Exponential backoff
                try {
                    Thread.sleep((long) Math.pow(2, attempts) * 100);
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new SQLException("Transaction interrupted", ie);
                }
            }
        }
    }
    
    private void executeTransaction() throws SQLException {
        try (Connection conn = dataSource.getConnection()) {
            try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO my_table (column, instance_id) VALUES (?, ?)",
                PreparedStatement.RETURN_GENERATED_KEYS)) {
                ps.setString(1, "value");
                ps.setString(2, System.getenv("INSTANCE_ID"));
                ps.executeUpdate();
            }
        }
    }
}

// 3. Configuration in context.xml for Tomcat
<?xml version="1.0" encoding="UTF-8"?>
<Context>
    <Resource name="jdbc/XADataSource/${INSTANCE_ID}"
              auth="Container"
              type="javax.sql.DataSource"
              factory="org.apache.commons.dbcp2.managed.BasicManagedDataSourceFactory"
              transactionManager="org.apache.geronimo.transaction.manager.GeronimoTransactionManager"
              xaDataSource="com.mysql.cj.jdbc.MysqlXADataSource"
              
              # Database Connection Settings
              url="${DB_URL}"
              user="${DB_USER}"
              password="${DB_PASSWORD}"
              
              # XA and Transaction Settings
              pinGlobalTxToPhysicalConnection="true"
              rewriteBatchedStatements="true"
              allowMultiQueries="false"
              interactiveClient="false"
              
              # Instance Configuration
              instanceId="${INSTANCE_ID}"
              jmxName="pool.xa.${INSTANCE_ID}"
              
              # Connection Pool Settings - Dynamic based on instance size
              initialSize="#{calculateMaxConnections()/4}"
              maxTotal="#{calculateMaxConnections()}"
              maxIdle="#{calculateMaxConnections()/2}"
              minIdle="#{calculateMaxConnections()/4}"
              maxWaitMillis="5000"
              
              # Connection Validation
              testOnBorrow="true"
              testWhileIdle="true"
              validationQuery="SELECT 1"
              validationQueryTimeout="2"
              
              # Abandoned Connection Handling
              removeAbandonedOnBorrow="true"
              removeAbandonedTimeout="30"
              logAbandoned="true"
              
              # Performance Optimizations
              timeBetweenEvictionRunsMillis="5000"
              minEvictableIdleTimeMillis="60000"
              
              # Transaction Manager Properties
              transactionManagerProperties="InstanceName=${INSTANCE_ID}"
              
              # Additional Safety Settings
              fastFailValidation="true"
              disconnectionSqlCodes="08S01,40001"
              
              # Monitoring and Metrics
              jmxEnabled="true"
              enableMetrics="true"/>
              
    <!-- Transaction Manager Configuration -->
    <Transaction 
        factory="org.apache.geronimo.transaction.manager.GeronimoTransactionManagerFactory"
        instanceName="${INSTANCE_ID}"
        defaultTransactionTimeout="300"/>
</Context>

Thanks @jacob.t.mevorach for sharing all this progress here :pray:

I think it would be best for @raff to skim through it and chime in :slight_smile:

Thanks @jacob.t.mevorach for your continued work on this! A few comments from me.

Session management can be done with different levels of failure resistance. The easiest for us might be to just put a load balancer in front and have it always connect a client to the same instance. This approach doesn’t even require any changes to openmrs-core. One drawback is that if a node fails, the session is lost and the user starts a new session on a different node. It’s not such a big issue as most operations are short-running and in most cases it’s fine for the user to log in again and repeat, as long as the user sees a failure message instructing them to do that. We could go a level higher and introduce session replication and, as you suggest, put session data in Redis. However, it’s not strictly necessary to support OpenMRS replicas; it simply adds to HA. For storing session data we might look for a solution that is Spring-based instead of Tomcat-based (though I think pretty much everyone deploys OpenMRS with Tomcat these days).

In regard to Shared Database Usage: the database can be shared by multiple instances by design. A different situation is when you have a DB cluster as your backend with multiple masters or slaves. Then I’d put ProxySQL in front to load-balance connections and remove failed connections from the pool. It is generally not needed to add complexity to the application to control which DB node to use for writes and reads; ProxySQL can handle all that. I’ll be adding ProxySQL to our helm charts when choosing Galera cluster as a DB to have a proof of concept. In AWS you have RDS Proxy. Aurora RDS might be able to deal with that on its own. The only remaining thing for HA is retrying queries in case of failures by getting a new connection to a working node, which your code tries to accomplish. It’s nice to have, but not strictly necessary for scaling OpenMRS instances.

Now comes the real challenge with running multiple OpenMRS replicas, which requires a substantial amount of work. I don’t think we need anything like “swarm mode” for an OpenMRS instance. OpenEMR might have a specific use case for making that distinction, but in general it shouldn’t be needed. It doesn’t really matter for OpenMRS which instance is the leader as long as we store data properly. What this means in practice is that we don’t just write data to a shared directory or use instance memory for caching (e.g. in the form of static fields on controllers with HashMaps), but instead use services that ideally can be distributed:

  1. Have a file service that writes to distributed file storage to store Obs, etc.
  2. Have a caching service that writes to a distributed cache, which is used to store any transient data that is currently kept in static fields in code with HashMaps and the like.
  3. Have Hibernate Search use OpenSearch instead of the built-in Lucene index that is written to disk by each instance.

I’m planning to suggest a GSoC project to go through the code and address the 1st and 2nd points, which is mostly straightforward but tedious (a rough sketch of what the caching change could look like is below). The 3rd might require someone who is familiar with Lucene and OpenSearch to make adjustments, so it might be hard to find such a GSoC student.
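
For illustration, something along these lines (class, field and cache names are made up, and it assumes Spring’s cache abstraction backed by a distributed store such as Redis/Valkey):

// Hypothetical example: replacing a static HashMap cache with Spring's cache abstraction
// Before (per-instance, breaks with replicas):
//     private static final Map<String, Object> CACHE = new HashMap<>();

import org.springframework.cache.Cache;
import org.springframework.cache.CacheManager;
import org.springframework.stereotype.Component;

@Component
public class SharedCacheService {

    private final CacheManager cacheManager; // backed by Redis/Valkey in a clustered deployment

    public SharedCacheService(CacheManager cacheManager) {
        this.cacheManager = cacheManager;
    }

    public void put(String cacheName, String key, Object value) {
        Cache cache = cacheManager.getCache(cacheName);
        if (cache != null) {
            cache.put(key, value);
        }
    }

    public <T> T get(String cacheName, String key, Class<T> type) {
        Cache cache = cacheManager.getCache(cacheName);
        return cache != null ? cache.get(key, type) : null;
    }
}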

If you are interested in trying the load-balancer approach, with nginx for example, to handle sticky sessions and then maybe adding session data persistence with Redis, I’m happy to work with you. I’m thinking of nginx because we already have the gateway service that is proxying all requests, which we could use for that specific purpose (a rough sketch of the nginx idea is below). Due to limitations in openmrs-core I would test the approach without using a shared data file system and have each instance run with its own data store, but connected to the same DB. It should mostly work as long as replicas are started after the system is running and the DB is created.
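
For illustration only, a minimal nginx sketch of the sticky-session idea (the upstream name, hostnames and ports are placeholders). Open-source nginx gives us IP-hash based stickiness out of the box; cookie-based stickiness would need nginx Plus or a different proxy:

# Hypothetical nginx fragment: pin each client to one OpenMRS replica
upstream openmrs_backend {
    ip_hash;                      # route a given client IP to the same replica
    server openmrs-1:8080;
    server openmrs-2:8080;
}

server {
    listen 80;

    location /openmrs/ {
        proxy_pass http://openmrs_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}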

Augmenting the data source to retry queries would also be a nice contribution. I don’t think it needs to be as complex as the AI suggests :wink: (see the small sketch below). See e.g. How to create custom retry logic for Spring Datasource? - Stack Overflow
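
Something as small as a delegating data source that retries getConnection() might be enough (a made-up sketch, not a final design):

// Hypothetical example: a DataSource wrapper that retries obtaining a connection
import javax.sql.DataSource;
import org.springframework.jdbc.datasource.DelegatingDataSource;
import java.sql.Connection;
import java.sql.SQLException;

public class RetryingDataSource extends DelegatingDataSource {

    private static final int MAX_ATTEMPTS = 3;

    public RetryingDataSource(DataSource target) {
        super(target);
    }

    @Override
    public Connection getConnection() throws SQLException {
        SQLException last = null;
        for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
            try {
                return super.getConnection();
            } catch (SQLException e) {
                last = e; // e.g. the node behind the pool just failed; try again
            }
        }
        throw last;
    }
}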


Well, another thought is that adjusting the gateway service might not make much sense. Kubernetes has its own load balancer setup that we could use to enable sticky sessions, so it’s even simpler and easier to handle with auto-scaling.

Hello!

Thanks @raff and @mksd; it’s been really fun to get involved and dig into things and everyone has been really kind and supportive so far.

Regarding sticky sessions:

  1. OpenMRS Contrib Cluster AWS ECS implements sticky sessions today; I implement it on line 1379. It’s enabled by default, can’t be turned off, and there’s a parameter in cdk.json called “sticky_session_cookie_length_in_minutes” you can use to alter how long the cookie that enables sticky sessions lasts. The default sticky session cookie length is currently set to 60 minutes in cdk.json on line 30.
  2. This means that when you first navigate to OpenMRS in the browser, the application load balancer issues the user a cookie that gets stored in the browser, and as long as there’s some activity every 60 minutes that cookie will continue to exist and be used to route requests from that user to the same instance. This functionality worked for me in testing and I was able to repeatedly access the same instance I had logged into.
  3. In the “Is This Production Ready?” (tldr; “No”) section of the documentation I wrote some of my thoughts on sticky sessions but to summarize I would strongly advise against the use of sticky sessions when there’s another option available.
  4. When making highly scalable web applications you want the replicas that are being spun up and down to be stateless and for all state to be offloaded to shared database, shared file storage and shared caching (i.e. Redis/Valkey). The common adage said for this is that replicas should be “cattle not pets”.
  5. While it’s true that if you deprovision a node a user is on while using sticky sessions they can just sign into the next node they get routed to, this can make for a poor user experience and hurts the scalability of the application.
  6. What if I had an installation with six replicas running and a user gets evicted from replica #6 only to be routed to #5, which is suddenly also deprovisioned, only for the user to be routed to #4, which is suddenly also deprovisioned, and so on and so on?
  7. While logging in and out once may not be terribly inconvenient doing so multiple times in short succession can be frustrating. That’s also not to mention that data that wasn’t saved to shared storage will be lost every time a node is deprovisioned.
  8. You can mitigate some of this by offering users a notice that they’re going to be evicted from a node in “N” minutes. Having said that, whatever “N” is now becomes the fastest you can deprovision a node during a scale-in event. This makes your cloud installation (and this is true for all cloud providers) less cost effective, because you’d ideally like to be able to deprovision unnecessary nodes quickly in response to a drop in traffic. If you’re going to be giving users a few minutes’ notice to log out, then over time many minutes’ worth of compute will go to replicas doing little useful work beyond letting users sign out, and those costs will add up.

Regarding shared database usage:

  1. If the database can be written to by multiple replicas at once that’s great news and will save us a lot of time!
  2. So the Aurora RDS Serverless v2 database we use is a cluster, but it abstracts the cluster topology and provides us with a single endpoint we can use to do reads and writes. If we used RDS Proxy we’d still end up with a single endpoint.
  3. By using RDS Proxy we’d have a shared connection pool that replicas could draw from, which would prevent the need to spin up more connections to the database. Spinning up database connections is a resource-intensive process, so this would reduce resource consumption on the database. However, going through the proxy adds latency to every call to the database. This means that we should either make RDS Proxy something that can be toggled on/off (something we can do in CDK and create a parameter for in cdk.json) or do testing to see whether the additional latency is worth the reduced resource consumption on the database.
  4. My hunch is that the additional latency is not going to be worth the reduced resource consumption on the database. Typically RDS Proxy is good for cases where you want to squeeze more performance out of a fixed amount of database resources. In our case the database is set to autoscale (and scale to zero ACUs when not being used), so we shouldn’t really have too many problems spinning up a lot of connections quickly should we need to, and it makes sense to just take the better latency (note that I could still be proven wrong in testing).

On Everything Else:

  1. That sounds great! I’d be happy and excited to work with you on implementing points 1 + 2.
  2. I agree with you on the gateway service; I also think it would be easier to implement that at the Kubernetes level. It should be noted that the ECS setup already has a load balancer too.

I’ll be away for a few days but will be available again on Tuesday.

All sounds right! I agree that, with auto-scaling enabled, session replication is a must. It’s always a question what kind of scale we are talking about.

I think that most OpenMRS installations will rely mostly on scaling DB. When it comes to OpenMRS instance replicas, it might be more about high availability and no downtime upgrades thus having maybe 2-3 replicas running at all times rather than auto-scaling to dozens of instances.

Based on what you wrote we won’t be recommending RDS Proxy for Aurora, since it already provides a single endpoint for connections and distributes traffic, but we do need ProxySQL for Galera deployments in Kubernetes and for regular RDS instances. Is that right? Do we still need code to remove broken connections from the connection pool in case an Aurora node fails, or is that handled by Aurora?

Let’s touch base on points 1 and 2 next week. I said it could be a GSoC project, but if you have spare cycles and interest in diving into Java code to help us out then great. I’ll start a thread on the design of the new storage service to abstract away the underlying implementation (writing to a local dir, S3 or whatever). We could probably discuss that on a platform call next week. It would be great if you could join us.


Hello @raff ! Hope this message finds you well and that you’re having a good day.

The primary use case for RDS Proxy is when you have multiple apps using a single database. If I have three different apps each spinning up their own connections to a single database, bad connection-handling logic in one of them could overwhelm the database’s resources even if I have good application logic, monitoring and observability in place for the other two. By outsourcing how apps handle and create connections to RDS Proxy you get nice, predictable connection handling, and it works for multiple apps at once. It’s basically a managed service for the role usually served by a connection handler.

RDS Proxy does other things:

  • Handles the connection pool for one or more applications that are accessing shared database resources.
  • Generally helps you get more performance out of a database than you could by implementing a connection pool on your own from scratch.
  • Can abstract away certain aspects of cluster topology from apps not equipped to handle it by providing a single endpoint.
  • Performs other useful functions like maintaining a set of active connections and recycling them, autoscaling, etc.

There are also some drawbacks:

Whether RDS Proxy makes sense to use depends on the architecture you’re using and how your application is written. If it improves performance enough that it significantly lowers resource strain on the database, it might in some cases be cheaper to use RDS Proxy than not.

Having said that, for our use case where we’re using Aurora Serverless v2, I would say it doesn’t make sense to use RDS Proxy. We already get a single endpoint, and the version of the MySQL-compatible Aurora engine we’re using (for reference it’s version 3.08, which gets set as a variable first on line 65 and set as the engine for our database on line 734) is pretty advanced, as is Aurora Serverless v2. It shouldn’t be that taxing on the database to spin up a bunch of connections quickly (it’s built for scale), so if we used RDS Proxy my prediction would be that we’d spend more money and get added latency on database calls (although I could be proved wrong in testing, I’d be willing to bet money I’m right on this).

In general, I’d recommend using Aurora Serverless v2 because 1/ it works for OpenMRS, 2/ it’s performant and autoscales automatically, and 3/ most importantly, regarding cost, anything using engine v3.07+ (we use v3.08) allows for scale to zero.

What this means is that if no one is using the database for more than 5 minutes (this can be set as a parameter if you’d prefer a different interval) the compute for the database will scale to zero and you’ll only pay for the data storage costs of the data written. The compute starts back up in seconds when someone goes to make a request after it’s been idle and scaled to zero.

I don’t think it’s possible to be more cost effective for OpenMRS while using RDS because any of the other options you can use will have some amount of fixed compute cost you’ll have to pay even when you’re not using the database. We don’t have that so if no one is using the database at 3 A.M. we won’t pay compute costs; they’ll be $0.

I also set up the database to have a reader serverless instance that scales along with the writer. In the event the primary serverless writer fails, the serverless reader (which is already scaled up to where the writer was before it failed) will be promoted to become the new writer, a new reader instance will be provisioned and scaled up automatically to match the new writer, and if the writer failed again the process would repeat. For reference this is implemented on line 746. The user shouldn’t notice a node failure. If the reader fails it’ll be replaced automatically as well; no need for manual intervention.

We also use multiple availability zones for the database. This means that there are multiple datacenters powering our database within a specific region, so even if something happens to one datacenter we will be resilient to that failure (the same goes for compute with Fargate and file storage with EFS).

The other cool thing about Aurora Serverless is that it automates and manages database engine upgrades for you. You can set it to do automatic minor version upgrades, and even major upgrades are pretty easy to do from the console. There’s documentation that outlines what the process looks like here: Upgrading the major version of an Amazon Aurora MySQL DB cluster - Amazon Aurora.

In general, I’ll try my best to make it to the platform calls. I’ll add it to my calendar. This week I won’t be able to make it but next week I should be able to.

It’s going to be hard for me to commit to a regular amount of coding time. However, I’ll do my best to produce code examples and answer questions asynchronously upon request.

In general (and this is true for anyone reading this thread as well), please don’t hesitate to reach out if you think I could be helpful with anything. I’m always happy to help :grinning:

Thanks Jacob. Let’s reschedule the discussion for next week’s platform call.