Optimizing the Backend: Examples of slow responses in O3

I agree that any work on performance is best done in discrete tickets with clear goals (e.g., “Improve performance of XYZ API call to respond within 100 ms”) and a predictable enough test environment/scenario to be able to reliably measure the improvement. Ideally, we’d have unit tests to catch any regressions; however, we have to be careful not to create tests that fail randomly 10% of the time (e.g., when a CI environment is under unusual load).

On the other hand, I also think we would benefit from a strategic approach to prioritizing performance improvements and defining targets. I would expect a strategic approach to address issues that might not be considered on a ticket-by-ticket basis. For example:

  • Should we spend effort fixing a method that takes 2000 ms to respond but is only called once, while deferring improvements to a method that responds in 100 ms but is called hundreds of times?
  • Can we use tools like Lighthouse to identify and prioritize targets? It seems like a combination of proactively identifying worst offenders + user-identified pain points would be best.
  • How do we avoid focusing on technical performance and losing sight of the fact that perceived performance is more important than actual performance?
  • When should we put effort into improving performance of a FHIR end-point vs bypassing FHIR and using our custom REST API? If our goal is to be increasingly FHIR-compliant and to reduce the OpenMRS-specific learning curve over time, then any such change is a trade-off between performance and technical debt.
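
To make the first question a bit more concrete, one rough way to compare candidates is by total time spent, i.e. latency multiplied by call count. The sketch below uses the two hypothetical methods from that bullet and assumes "hundreds of times" means 300 calls; the names and the call count are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str          # hypothetical label, not a real OpenMRS endpoint
    latency_ms: float  # typical response time of a single call
    calls: int         # how often it is called in the workflow being measured

    @property
    def total_ms(self) -> float:
        # Total time attributable to this endpoint across all calls
        return self.latency_ms * self.calls


def rank_by_total_cost(endpoints):
    """Sort endpoints so the largest total time cost comes first."""
    return sorted(endpoints, key=lambda e: e.total_ms, reverse=True)


candidates = [
    Endpoint("slow_but_rare", latency_ms=2000, calls=1),
    Endpoint("fast_but_chatty", latency_ms=100, calls=300),  # assumed call count
]

for e in rank_by_total_cost(candidates):
    print(f"{e.name}: {e.total_ms:.0f} ms total")
```

Under these assumptions the "fast" method costs 30000 ms in total against 2000 ms for the slow one. This ignores perceived performance (one 2000 ms call that blocks the user may hurt more than many background calls), so it is only one input into prioritization.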

We probably need to think about performance top-to-bottom (i.e., not just as a backend or frontend issue… or for backend & frontend “teams” to consider separately).

I think we have a big enough bandwidth problem (chattiness) in OpenMRS 3 that we should address it separately from performance. For instance, we have seen OpenMRS 3 distributions using 30x the bandwidth of OpenMRS 2.x over a month. Perhaps this deserves a separate thread.

More than 80 percent of the performance degradation for this particular call was caused by unnecessary repeated database calls to load the layout.name.template global property. For just one such REST call, I counted more than 500 repeated calls that did nothing but load this global property. This has been fixed, and dev3 has been updated with openmrs-core version 2.6.5-SNAPSHOT, which has the fix. Of course, we can continue to explore how to make this run even faster.

Awesome! I had some suspicion that this was a relatively new regression in performance, but hadn’t had time to look into it in depth.

The same layout.name.template global property calls turned out to be slowing this one too. On my machine, removing these unnecessary calls improved the response time from 4 seconds to 800 ms.

That’s excellent! I’m also a little concerned that we’re running the same query repeatedly, but that’s more of a frontend than backend issue.

Where’s our alarmed face emoji instead of the heart emoji for this comment?! :disguised_face:

We have this one: :is_fine:

This has been extremely valuable work so far in finding and addressing all the badly performing queries!

Perceived performance is worth considering once you have passed a certain point and addressed all the serious issues that we seem to have right now.

I think we need to focus on both the REST API and the FHIR API. FHIR is not a go-to spec for web frontend and mobile, since it’s limited to what is in the spec and you cannot implement targeted endpoints for certain workflows to improve performance and data efficiency. I don’t believe it can completely replace our REST API. However, it’s an important interoperability standard and we must handle it well.

There are 2 things that I think we need in place to be more strategic:

  1. Have a test instance with the demo data volume that we expect O3 to handle well. We all know that optimising e.g. SQL queries for a table with 10k rows is a completely different story than for a table with 1 million rows. Let’s get our test targets right.
  2. Since the REST API and FHIR API are our main gateways to the application, let’s have a very simple test doing HTTP calls against the test instance for every badly performing query that has been identified and fixed so far, using e.g. JMH (https://openjdk.org/projects/code-tools/jmh) for microbenchmarking, so we quickly discover regressions. Tests can be implemented in JUnit with JMH annotations and REST Assured. We need to be able to configure the REST base URL via an environment variable so the tests can point at any test instance. JMH would give us simple metrics (run times), but if we want to go fancy we could even use the very same tests with JMeter orchestration and the JUnit Sampler to do load testing and analyse other performance metrics. These tests can also serve as integration tests to some extent and test the business logic, so we get 3 benefits from writing one test :wink:
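
As a rough, language-neutral illustration of item 2 (this is not JMH or REST Assured, just the same idea in plain Python), the sketch below times repeated HTTP GETs against a base URL taken from an environment variable and compares the median with a budget. The variable name `O3_BASE_URL`, the default URL, and any path or budget passed in are assumptions. Taking the median of several runs after a warm-up is one way to make the check less sensitive to a single slow run on a loaded CI machine.

```python
import os
import statistics
import time
import urllib.request


def base_url() -> str:
    # O3_BASE_URL is a hypothetical variable name; the default is a placeholder.
    return os.environ.get("O3_BASE_URL", "http://localhost:8080/openmrs").rstrip("/")


def time_call_ms(call, runs: int = 10, warmup: int = 2) -> float:
    """Return the median wall-clock time of call() in milliseconds."""
    for _ in range(warmup):
        call()  # warm-up runs are executed but not measured
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call()
        samples.append((time.perf_counter() - start) * 1000)
    return statistics.median(samples)


def get(path: str) -> bytes:
    """Plain unauthenticated GET; a real test would add credentials."""
    with urllib.request.urlopen(base_url() + path, timeout=30) as response:
        return response.read()


def check_budget(path: str, budget_ms: float) -> bool:
    """Time GET requests to path and report whether the median is within budget."""
    median_ms = time_call_ms(lambda: get(path))
    print(f"{path}: median {median_ms:.1f} ms (budget {budget_ms} ms)")
    return median_ms <= budget_ms
```

A regression check for one previously fixed query would then be a single call such as `check_budget("/some/endpoint", budget_ms=100)`, where the path and the budget are whatever was agreed for that query.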

We also need to put best practices in place so that implementations can employ this kind of testing in their own development; the test framework therefore needs to be easy to run locally and in CI.

If you agree with the proposed approach, I’d be happy to set up the test framework, including the CI build.

I’d only need help with determining what our target O3 size is, i.e. how many patients, encounters, obs, visits, etc. We used to have some tool to generate demo data. Is that still available?

Hmmm… FHIR is actually intended to offer a REST API, and you certainly can implement targeted endpoints for any workflow via custom operations. It’s got a very different model from HL7 V2, which is a pure messaging standard.

There are reasons that FHIR cannot completely replace the REST API, because there are certain parts of a standard application that FHIR explicitly disclaims having a standardized way of supporting (user management, configuration management, etc.). There are also certain requirements which make the FHIR API “chattier” than a REST API needs to be.

Yes, and we’ve made quite a few enhancements to it, but it’s still probably not as good a model of a realistic data load as, well, an anonymised result would be.

I know @jayasanka has been spiking on some work around performance tests for OpenMRS, I believe using the Gatling framework. Definitely having some implementation-usable performance tests would be a huge win in terms of tracking down issues and errors.

I didn’t know that. I was only aware of resource extensions. Is it implemented anywhere in the FHIR module? I’m reading that custom resources can also be created so theoretically it could be extended to support every need if we don’t mind verbosity.

So, we don’t currently have support for OperationDefinition, which is how you would do that in pure FHIR, but we do have a couple of custom operations defined anyways.

FHIR Operations themselves usually accept and return Parameters objects, which are basically just arbitrary key-value pairs. It’s also possible (though discouraged) to create completely new resources.
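
As a rough illustration of that key-value shape, a FHIR `Parameters` resource in JSON is a list of named entries, each carrying a `value[x]` field such as `valueString` or `valueInteger`. The sketch below builds one in Python; the parameter names and values are made up and do not correspond to any operation in the OpenMRS FHIR module.

```python
import json


def make_parameters(**values) -> dict:
    """Build a FHIR Parameters resource from simple Python values."""

    def entry(name, value):
        # Map a few Python types onto FHIR value[x] fields.
        # bool is checked before int because bool is a subclass of int.
        if isinstance(value, bool):
            return {"name": name, "valueBoolean": value}
        if isinstance(value, int):
            return {"name": name, "valueInteger": value}
        return {"name": name, "valueString": str(value)}

    return {
        "resourceType": "Parameters",
        "parameter": [entry(name, value) for name, value in values.items()],
    }


# Hypothetical input for a hypothetical custom operation.
body = make_parameters(patientName="example", limit=10, includeVoided=False)
print(json.dumps(body, indent=2))
```

A body like this would typically be sent with a POST to the operation's URL, which in FHIR has the form `[base]/[ResourceType]/$[operation-name]`.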