Optimizing the Backend: Examples of slow responses in O3

burke · April 26, 2024, 4:21pm

I agree that any work on performance is best done in discrete tickets with clear goals (e.g., “Improve performance of XYZ API call to respond within 100 ms”) and a predictable enough test environment/scenario to be able to reliably measure the improvement. Ideally, we’d have unit tests to catch any regressions; however, we have to be careful not to create tests that fail randomly 10% of the time (e.g., when a CI environment is under unusual load).
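One common way to make a timing check less sensitive to a noisy CI machine is to take several measurements and compare the median against the target with some slack. The following is only a minimal sketch of that idea in Python; the helper names, the number of runs and the 20% tolerance are all assumptions for illustration, not anything agreed in this thread.

```python
import statistics
import time
from typing import Callable


def median_duration_ms(action: Callable[[], object], runs: int = 7, warmups: int = 2) -> float:
    """Return the median wall-clock time of `action` in milliseconds."""
    for _ in range(warmups):
        action()  # warm-up runs are executed but not measured
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        action()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def within_budget(measured_ms: float, budget_ms: float, tolerance: float = 0.2) -> bool:
    """True if the measurement is within the budget plus a relative tolerance."""
    return measured_ms <= budget_ms * (1.0 + tolerance)
```

A single slow run caused by unusual load shifts the median far less than it shifts the mean or a one-off measurement, which is the point of the design. It does not remove flakiness entirely, so the tolerance would still need tuning per environment.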

On the other hand, I also think we would benefit from a strategic approach for prioritizing performance improvements and to help define targets. I would expect a strategic approach to address issues that might not be considered on a ticket-by-ticket basis. For example:

Should we spend effort fixing a method that takes 2000 ms to respond but is only called once while deferring improving a method that responds in 100 ms but is called hundreds of times?
Can we use tools like Lighthouse to identify and prioritize targets? It seems like a combination of proactively identifying worst offenders + user-identified pain points would be best.
How do we avoid focusing on technical performance and losing sight of the fact that perceived performance is more important than actual performance?
When should we put effort into improving performance of a FHIR end-point vs bypassing FHIR and using our custom REST API? If our goal is to be increasingly FHIR-compliant and to reduce the OpenMRS-specific learning curve over time, then any such change is a trade-off between performance and technical debt.
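The first question in the list above comes down to simple arithmetic: time per call multiplied by the number of calls. Here is a rough sketch of ranking candidates that way. The figure of 300 calls per session is a made-up stand-in for "hundreds of times", and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointStats:
    name: str
    latency_ms: float       # typical time for one call
    calls_per_session: int  # how often one user session triggers the call

    @property
    def total_ms(self) -> float:
        """Cumulative time spent in this call during one session."""
        return self.latency_ms * self.calls_per_session


def rank_by_total_time(stats):
    """Sort endpoints so the largest cumulative cost comes first."""
    return sorted(stats, key=lambda s: s.total_ms, reverse=True)


# Hypothetical numbers, loosely based on the question above.
candidates = [
    EndpointStats("slow but rare", latency_ms=2000, calls_per_session=1),
    EndpointStats("fast but chatty", latency_ms=100, calls_per_session=300),
]

for s in rank_by_total_time(candidates):
    print(f"{s.name}: {s.total_ms:.0f} ms per session")
# fast but chatty: 30000 ms per session
# slow but rare: 2000 ms per session
```

Cumulative time is only one input. A single 2000 ms call that blocks the screen may still hurt perceived performance more than many short calls that run in the background.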

We probably need to think about performance top-to-bottom (i.e., not just as a backend or frontend issue… or for backend & frontend “teams” to consider separately).

I think we have a big enough bandwidth problem (chattiness) in OpenMRS 3 that we should address it separately from performance. For example, there are OpenMRS 3 distributions using 30x the bandwidth of OpenMRS 2.x over a month. Perhaps this deserves a separate thread.

dkayiwa · April 29, 2024, 6:11pm

ibacher:

The Visits view in the O3 frontend as it currently exists is unworkable.

To power the Visits view (which is very cool), we currently use a call like the following:

/ws/rest/v1/visit?patient=<patientUuid>&v=custom:(uuid,encounters:(uuid,diagnoses:(uuid,display,rank,diagnosis),form:(uuid,display),encounterDatetime,orders:full,obs:full,encounterType:(uuid,display,viewPrivilege,editPrivilege),encounterProviders:(uuid,display,encounterRole:(uuid,display),provider:(uuid,person:(uuid,display)))),visitType:(uuid,name,display),startDatetime,stopDatetime,patient,attributes:(attributeType:ref,display,uuid,value))&limit=5

For one of the demo patients (Betty Williams - 7521943e-dd1a-4e27-9b29-bb4241c52bef) each page of 5 results takes 20 seconds to return to my computer (and it’s well established that the dev3 → me connection is faster than it is for most people). I think the issue here is that the query requires a large number of joins, which, in turn, results in a large number of individual queries being run on the backend. Hopefully, we could create an API endpoint customised to getting just the subset of data we need for this that would perform better than that (we’ve had reports of this taking over 60 seconds with real-world data, i.e., this vaguely-worded ticket). The data returned is also on the order of 2.5MB, which seems excessive. (This query also doesn’t properly implement pagination, so it just keeps requesting the same 5 visits again and again and again.)
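On the pagination remark in the quoted post: the OpenMRS REST API pages results with `limit` and `startIndex`, so requesting successive pages means advancing `startIndex` on each request. The sketch below only builds the URLs. The base URL, the function name and the trimmed-down `custom:` representation are placeholders for illustration, not what the O3 frontend actually sends.

```python
from urllib.parse import urlencode


def visit_page_url(base_url: str, patient_uuid: str, page: int, page_size: int = 5,
                   representation: str = "custom:(uuid,startDatetime,stopDatetime)") -> str:
    """Build the URL for one page of a patient's visits.

    `startIndex` advances with the page number, so each request
    asks for the next slice instead of repeating the first one.
    """
    if page < 0:
        raise ValueError("page must be >= 0")
    params = {
        "patient": patient_uuid,
        "v": representation,
        "limit": page_size,
        "startIndex": page * page_size,
    }
    # urlencode percent-encodes the parentheses and colons in the representation
    return f"{base_url.rstrip('/')}/ws/rest/v1/visit?{urlencode(params)}"
```

For example, page 0 yields `startIndex=0` and page 2 yields `startIndex=10` with the default page size of 5.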

More than 80 percent of the performance degradation for this particular call was caused by unnecessary repeated database calls for loading the layout.name.template global property. For just one such REST call, I counted more than 500 repeated calls just to load this global property. This has been fixed, and dev3 has been updated with openmrs-core version 2.6.5-SNAPSHOT, which has the fix. Of course we can continue to explore how we can make this run even faster.

ibacher · April 29, 2024, 6:41pm

Awesome! I had some suspicion that this was a relatively new regression in performance, but hadn’t had time to look into it in depth.

dkayiwa · April 29, 2024, 7:31pm

The same layout.name.template global property calls turned out to be slowing this one too. On my machine, removing these unnecessary calls improved the performance from 4 seconds to 800 ms.

ibacher · April 29, 2024, 8:25pm

That’s excellent! I’m also a little concerned that we’re running the same query repeatedly, but that’s more of a frontend than backend issue.

janflowers · May 1, 2024, 5:02pm

Where’s our alarmed face emoji instead of the heart emoji for this comment?!

ibacher · May 1, 2024, 5:07pm

We have this one:

raff · May 7, 2024, 12:02pm

It’s extremely valuable work so far in finding and addressing all the poorly performing queries!

Perceived performance is worth considering once you have passed a certain point and addressed all the serious issues that we seem to have right now.

I think we need to focus on both the REST API and the FHIR API. FHIR is not a go-to spec for web frontend and mobile since it’s limited to what is in the spec and you cannot implement targeted endpoints for certain workflows to improve performance and data efficiency. I don’t believe it can completely replace our REST API. However, it’s an important interoperability standard and we must handle it well.

There are 2 things that I think we need in place to be more strategic:

Have a test instance with the demo data volume that we expect O3 to handle well. We all know that optimising e.g. SQL queries for a table with 10k rows is a completely different story than for a table with 1 million rows. Let’s get our test targets right.
Since the REST API and FHIR API are our main gateways to the application, let’s have a very simple test doing HTTP calls against the test instance for any poorly performing query that has been identified and fixed so far, using e.g. JMH (https://openjdk.org/projects/code-tools/jmh) for microbenchmarking so we quickly discover regressions. Tests can be implemented in JUnit with JMH annotations and using REST Assured. We need to be able to configure the REST base URL via an environment variable to point it to any test instance. JMH would provide us simple metrics (run times), but if we want to go fancy we could even use the very same tests with JMeter orchestration and JUnit Sampler to do load testing and analyse other performance metrics. These tests can also serve the purpose of integration tests to some extent and test the business logic, so we get 3 benefits from writing one test.
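To make the second point more concrete, here is a rough sketch of the "base URL from an environment variable, then time an HTTP call" part. It uses Python's standard library rather than the JMH, JUnit and REST Assured stack proposed above, so treat it only as an illustration of the shape. `OPENMRS_BASE_URL` is a made-up variable name, the default URL is a placeholder, and authentication is left out.

```python
import os
import time
import urllib.request
from typing import Callable

# OPENMRS_BASE_URL is a hypothetical variable name; the default is a placeholder.
BASE_URL = os.environ.get("OPENMRS_BASE_URL", "http://localhost:8080/openmrs")


def http_get(url: str) -> int:
    """Perform a GET and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=60) as response:
        response.read()  # read the body so transfer time is included in the timing
        return response.status


def timed_get(path: str, fetch: Callable[[str], int] = http_get, base_url: str = BASE_URL):
    """Call base_url + path and return (status, elapsed milliseconds)."""
    url = base_url.rstrip("/") + "/" + path.lstrip("/")
    start = time.perf_counter()
    status = fetch(url)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return status, elapsed_ms
```

The `fetch` parameter is there so the timing logic can be exercised with a fake function, without a running server. Against a real instance, a test would call `timed_get` with the path of a previously fixed query and compare the elapsed time with that query's target.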

We also need to put best practices in place for implementations to employ this kind of testing in their development; thus the test framework needs to be easy to run locally and in CI.

If you agree with the proposed approach, I’d be happy to set up the test framework including CI build.

I’d only need help with determining what our target O3 size is, i.e. how many patients, how many encounters, obs, visits, etc. We used to have some tool to generate demo data. Is that still available?

ibacher · May 7, 2024, 1:46pm

Hmmm… FHIR is actually intended to offer a REST API, and you certainly can implement targeted endpoints for any workflow via custom operations. It’s got a very different model from HL7 V2, which is a pure messaging standard.

There are reasons that FHIR cannot completely replace the REST API, because there are certain parts of a standard application that FHIR explicitly disclaims having a standardized way of supporting (user management, configuration management, etc.). There are also certain requirements which make the FHIR API “chattier” than a REST API needs to be.

Yes, and we’ve made quite a few enhancements to it, but it’s still probably not as good a model of a realistic data load as, well, an anonymised result would be.

I know @jayasanka has been spiking on some work around performance tests for OpenMRS, I believe using the Gatling framework. Definitely having some implementation-usable performance tests would be a huge win in terms of tracking down issues and errors.

raff · May 8, 2024, 9:46am

I didn’t know that. I was only aware of resource extensions. Is it implemented anywhere in the FHIR module? I’m reading that custom resources can also be created so theoretically it could be extended to support every need if we don’t mind verbosity.

ibacher · May 8, 2024, 12:50pm

So, we don’t currently have support for OperationDefinition, which is how you would do that in pure FHIR, but we do have a couple of custom operations defined anyways.

FHIR Operations themselves usually accept and return Parameters objects, which are basically just arbitrary key-value pairs. It’s also possible (though discouraged) to create completely new resources.
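For anyone unfamiliar with the shape of a Parameters resource, here is a small sketch that wraps plain key-value pairs in one. The `resourceType`, `parameter`, `name` and `value[x]` fields follow the FHIR specification. The helper name and the input keys are hypothetical and do not correspond to any operation in the OpenMRS FHIR module.

```python
import json


def to_fhir_parameters(values: dict) -> dict:
    """Wrap simple key-value pairs in a FHIR Parameters resource."""
    parameters = []
    for name, value in values.items():
        if isinstance(value, bool):  # check bool before int: bool is a subclass of int
            parameters.append({"name": name, "valueBoolean": value})
        elif isinstance(value, int):
            parameters.append({"name": name, "valueInteger": value})
        else:
            parameters.append({"name": name, "valueString": str(value)})
    return {"resourceType": "Parameters", "parameter": parameters}


# Hypothetical input for an imagined custom operation.
body = to_fhir_parameters({"visitType": "Facility Visit", "count": 5, "includeObs": False})
print(json.dumps(body, indent=2))
```

Only three value types are handled here. The specification allows many more, including nested parts and whole resources, which is part of why the format gets verbose.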