The promise of OpenMRS ETL: Advanced Reporting/Analytics for OpenMRS

And throwing this need out there, is there a way to have the equivalent of Lucene indexing the data as it is entered/updated within OpenMRS to simplify reporting?

Probably providing a way to build custom data sets etc, especially for smaller sites


Thanks @bashir and @burke for refining it further. I think we can lean heavily on streaming, since it requires less processing and can run alongside an OpenMRS deployment. I’ll be testing the streaming mode in resource-constrained environments, and I’m happy to share the results.


I just wanted to share what I’ve been working on and the changes we intend to make to an existing project that involves extracting data out of OpenMRS for integration with various other systems.

I believe some of you have heard about the DB sync project, which was written by a developer who has since moved on to other things; I have spent the past couple of weeks making improvements to it.

Several DB systems provide replication tools that can do the sync job, but that’s all they do, and in some cases they can be pricey. This project was built to provide OpenMRS DB-to-DB sync of selected tables while providing an integration layer that is as agnostic as possible to the underlying DBMS.

At the heart of the project is Apache Camel, which allows it to route messages between the various applications in the ecosystem. Initially, the project ran a scheduler that periodically polled the configured OpenMRS tables to find rows to sync. It kept track of the last sync date and, on subsequent runs, compared that date to each row’s date created and date changed to determine what had to be synced; with this approach it was easy to miss some changes between runs. Currently, it has been updated to use after-insert, after-update, and after-delete DB triggers that write to a sync record journal in the remote OpenMRS instance, which the tool reads to determine the rows that need to be synced to the HQ instance.
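To make the difference between the two strategies concrete, here is an illustrative Python sketch (not the actual dbsync code, which is Java/Camel); all table and field names are invented for the example:

```python
# --- Timestamp polling: the original approach ----------------------------
def poll(rows, last_sync, now):
    """Return rows whose date_changed falls after the last watermark."""
    changed = [r for r in rows if last_sync < r["date_changed"] <= now]
    return changed, now  # new watermark = poll time

rows = [{"id": 1, "date_changed": 10}]
synced, watermark = poll(rows, last_sync=0, now=20)
assert [r["id"] for r in synced] == [1]

# A concurrent write commits with date_changed=15 after the poll read its
# snapshot but before the watermark advanced to 20, so the next poll
# (last_sync=20) never sees it: the change is silently lost.
rows.append({"id": 2, "date_changed": 15})
missed, _ = poll(rows, last_sync=watermark, now=30)
assert missed == []

# --- Trigger-fed journal: the current approach ---------------------------
# AFTER INSERT/UPDATE/DELETE triggers append a record to a journal table;
# the sync tool consumes and clears the journal, so late commits still show up.
journal = []

def on_change(table, row_id):  # conceptually, what the DB trigger does
    journal.append({"table": table, "row_id": row_id})

on_change("patient", 2)
to_sync, journal[:] = list(journal), []
assert to_sync == [{"table": "patient", "row_id": 2}]
```

The journal variant never consults the clock, which is why it cannot miss a change the way the watermark comparison can.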

The trigger approach works fine, but we’re considering switching to something like Debezium to replace the triggers, the sync record journal, and the existing mechanism that broadcasts DB events to all listening routes. Of course, we are concerned about what it means to run a full-blown Debezium instance with ZooKeeper and Kafka, which can demand more resources at a site whose resources are already constrained; it’s possible we might run an embedded instance of Debezium by default, with an option to use an external one. We would then listen for the DB change events emitted by Debezium. These are typically very raw messages, similar to a SQL result set, so we want to enrich each message at the application level with a more loaded version of the affected entity, along with its flattened associations, before publishing it to a configured message broker of choice. The enriched message can then be consumed by registered listeners, which could be other Camel routes that feed other systems.
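A hypothetical sketch of that “enrich before publishing” step: a raw Debezium-style change event (essentially a row image) is expanded into a fuller, flattened message. The lookup tables, field names, and concept data here are all invented for illustration; in practice the lookups would go through the application layer.

```python
# A raw CDC event, shaped loosely like Debezium's row-level messages.
raw_event = {
    "source": {"table": "obs"},
    "op": "c",  # create
    "after": {"obs_id": 7, "person_id": 3, "concept_id": 5089, "value_numeric": 72.0},
}

# Stand-ins for application-level lookups (names and values are made up).
PERSONS = {3: {"uuid": "p-uuid-3", "gender": "M"}}
CONCEPTS = {5089: {"name": "WEIGHT (KG)"}}

def enrich(event):
    """Turn a raw row-level event into a flattened, self-describing message."""
    row = event["after"]
    return {
        "table": event["source"]["table"],
        "op": event["op"],
        "obs_id": row["obs_id"],
        "value": row["value_numeric"],
        "concept_name": CONCEPTS[row["concept_id"]]["name"],
        "person_uuid": PERSONS[row["person_id"]]["uuid"],
    }

enriched = enrich(raw_event)
assert enriched["concept_name"] == "WEIGHT (KG)"
assert enriched["person_uuid"] == "p-uuid-3"
# `enriched` is the message that would be published to the broker.
```

Consumers downstream then get entities they can use directly, instead of bare foreign keys they would have to resolve themselves.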

We’re still in the initial phase of figuring out what tools to use and how everything will come together. We strongly believe our choices could be influenced by the product owner based on the stakeholder’s resources, and we will share what we have put together in the near future.




Thanks, @wyclif, for providing possible tools that we can use. We will definitely need to do a lot of benchmarking and testing on these tools while considering the ability to scale up and/or down depending on different server specs. I am a huge proponent of Debezium, mainly because it is an out-of-the-box solution that has minimal impact on the production DB. Additionally, Debezium/Kafka tends to be rather light on CPU and RAM. A Dockerized ZooKeeper-Kafka-connector stack uses negligible JVM heap; thus, for small deployments, 500 MB to 1 GB of RAM should be enough for each container. During runtime, Kafka consumers buffer at least 2 MB of RAM per consumer, while producers consume up to 64 MB, so the resources needed depend heavily on the number of records/messages per second. When it comes to batch mode, we will definitely need modern processors with multiple cores, again highly dependent on the amount of data in the DB. For fewer than 1,000 patients with approximately 1 million clinical observations, 8 GB of RAM and 4 cores should be able to power the entire stack, including the production instance.

A number of these solutions (e.g., Debezium) seem to rely on reading the MySQL binlog. I don’t believe that at PIH we have typically, or by default, configured our MySQL instances beyond the defaults in this area, so I’m interested in how we might start thinking about this and whether we need to plan for adapting our existing and future MySQL instances to prepare for these tools. Particularly for some large-scale rollouts, I’m concerned that updating hundreds of existing installations to use a new MySQL binlog format might be a major challenge for some groups.

Does anyone have experience or best practices in this area that they can share?

If the above is correct, we might want to start issuing some recommendations around MySQL configuration that accounts for this.

Thanks, Mike


Thanks @wyclif for this nice summary. As I alluded above, I think the solution for the sync problem can play a role for the streaming part of an ETL/Analytics solution. One thing that I asked here but still am not clear about is what were your reasons for not building on top of Sync 2.0 and instead creating dbsync? Because at least part of what you are describing is similar to how Sync 2.0 uses the atomfeed module to keep a log of DB changes.

FWIW, for streaming ETL, I think besides approaches based on Debezium or DB logs, another way is to rely directly on OpenMRS to keep the change logs, like an extended version of the atomfeed module; or even to write directly into the data warehouse once a change happens in OpenMRS. I’ll try to talk through some of these ideas in this meeting.


Would this be in order to make the solution lighter?

Yes, potentially a more lightweight solution and also easier to deploy, e.g., you just add another module and configure it.

If it were my infrastructure, I would rather go through the pain to Dockerise my MySQL instances progressively and ensure that bin logging is turned on… rather than trying to install the Atom Feed module (and whatever else will be needed with it) 100+ times on instances that might still run ancient versions of OpenMRS.

I like the non-intrusive pattern enforced by Debezium, for me this is a big winner.


Thanks @mseaton for bringing this up! When I started working with debezium I had the same concerns.

Practically speaking, deploying the Debezium CDC is quite easy; it would take the same effort as installing a new module. It entails adding two lines to the MySQL config and then powering up the required container stack. We could write a bash script to automate this process. The settings of interest are:

log_bin           = mysql-bin
binlog_format     = row

Switching the binlog format is the easiest part; this can be done at runtime by running this query:

SET GLOBAL binlog_format = 'ROW';

Note, though, that this cannot be switched while a replication channel thread is running, so we will need to make the switch after 5pm, when no one is performing data entry.

We should encourage the community to use the ROW-based binlog format instead of statement-based.


@wyclif, thanks for the write up.

Do I understand correctly that you hit the performance limit that makes you consider the switch from DB triggers to Debezium? What do you use for queuing right now? Could you please share some stats? We’d have to have more data on the setup to recommend the path forward.

Debezium may indeed be an overkill for smaller setups. Can you leave a gate open to switch between DB triggers or Debezium depending on the scale of operation?

If we use this approach, it will be essential to upgrade atomfeed or create a new module that caters for the CDC processing guarantees downstream in the pipeline. Most importantly, can we guarantee data consistency between OpenMRS (the monolith) and the intermediate store of interest (lake/warehouse)? Remember, it is very easy to lose updates downstream in the analytical pipeline, something Debezium is designed to prevent.

Operationally, Debezium/Kafka shields us from missing any updates downstream in the analytical pipeline, even beyond the “E” of ELT, and even in the case of power failures or internal server/application faults. Unlike Debezium/Kafka, most CDC systems fail to guarantee exactly-once processing semantics.

The question is: do we want to go down this path, or instead use a modern, out-of-the-box solution for our CDC requirement?

Hi @raff, I’ll leave @wyclif to provide further details, but the short answer is no. We are just testing tech stacks that enforce best and reusable practices. And it looked like Debezium was doing a lot of what we needed off the shelf, and could cover the whole ‘E’ process for multiple use cases: ETL for data analytics, pure low-level sync, data-out streams for EIP (with Camel, for instance), etc.

It looks like @wyclif’s approach with DB triggers is working perfectly fine. However, those triggers require maintenance and a setup that is more invasive than Debezium. They also require keeping sync record journal tables within the OpenMRS database.

Again @wyclif to confirm but it appears that Debezium is reducing the overhead in terms of maintenance.


I see. So you are still exploring your options and you haven’t rolled out a solution yet.

In my opinion, Debezium is the way to go in such a case. It comes at the price of having beefier machines to run the whole stack, though that could be mitigated by using the embedded version, as @wyclif mentioned. Have a look at the component that nicely couples the Debezium embedded engine with Apache Camel. It may be a great fit, since you are already using Camel.


Thanks @raff for sharing this. This will solve most of the concerns we have been having. Apart from eliminating Kafka and ZooKeeper, I’m glad this component provides the ability to recover from faults. This will definitely work for small deployments as well as large-scale ones.


We have options with all that we have tried so far (basic management based on the last sync date stored in the DB, then more recently the DB triggers evolution), so at worst we have a plan B. But Debezium is starting to look like plan A :slight_smile:

Thanks for your input again @raff.


@akimaina, well, “large-scale” is pretty vague, but it definitely gives you a considerably easy upgrade path to Kafka and clustering to handle truly large scale if you start hitting performance limits.


I am afraid I do not fully understand the point here. I understand that in a truly distributed environment exactly-once semantics are a difficult problem, but I thought we established earlier that this is not the case here. Sure, OpenMRS is a monolith, but so can be the change stream processor that updates the warehouse, no? For example, in the atomfeed case, the state of both the feed producer and the consumer is tracked in the DB, and it is not clear to me how they can miss an update or process it twice (though ordering can be another story).
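The argument above can be sketched in a few lines: if the consumer’s read position is committed atomically with the warehouse write (both living in the same DB, as in the atomfeed design), a crash or restart can neither skip nor double-apply an update. All names here are illustrative, not from any of the projects discussed.

```python
feed = [{"offset": 1, "change": "a"}, {"offset": 2, "change": "b"}]
state = {"last_offset": 0, "warehouse": []}  # persisted together in one DB

def consume_one(state, feed):
    """Apply the next unprocessed feed entry; watermark moves in the same step."""
    pending = [e for e in feed if e["offset"] > state["last_offset"]]
    if not pending:
        return False
    event = pending[0]
    # Conceptually one transaction: the write and the watermark commit together.
    state["warehouse"].append(event["change"])
    state["last_offset"] = event["offset"]
    return True

while consume_one(state, feed):
    pass
assert state["warehouse"] == ["a", "b"]

# Restarting the consumer is a no-op: everything at or below the stored
# watermark is skipped, so nothing is processed twice.
assert consume_one(state, feed) is False
assert state["warehouse"] == ["a", "b"]
```

The hard part in a real system is only step one: making the warehouse write and the offset update genuinely atomic, which is easy when both live in the same database and hard when they do not.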

BTW, I mentioned atomfeed mostly as an example of another potential approach; relying on Debezium sounds like a great option too, as you mentioned. What I am not sure about is whether complicating the infrastructure with something like Kafka, which is mainly designed for a truly distributed system with many producers/consumers, is warranted for this problem.

One of the advantages that I can see of Debezium over our approaches that use the Hibernate interceptor is the ability to also capture database changes that happen outside of the OpenMRS application, for instance when one makes changes directly to data in the database or via some other application.

@raff I think DB triggers work fine if you want to monitor a few tables and don’t plan to keep adding new tables or continuously changing the trigger logic; to be honest, triggers do the job pretty well. The primary reason we are considering Debezium is that it solves three things that had to be manually implemented in the code base: monitoring DB events, publishing them, and gluing everything together, for which Apache Camel already has built-in components. The main issue we foresee with Debezium is how to avoid going in circles, i.e., replaying a change back to its origin: you don’t want to update a patient on server A, have the event consumed by server B, and then, during sync, have the change written to the binlog of server B and sent back to server A. This is not an issue with one-way sync but would need to be addressed for two-way sync.
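One common way to break that replay loop (a sketch of the general pattern, not the project’s actual design) is to stamp every event with its originating server and preserve that stamp when the change is re-published, so each server drops its own echoes. The field names here are invented:

```python
def should_apply(event, my_server_id):
    """Drop events that this server itself originated."""
    return event["origin"] != my_server_id

change = {"origin": "server-A", "table": "patient", "row_id": 42}

# Server B applies A's change...
assert should_apply(change, "server-B")

# ...and when B's binlog re-emits it, the ORIGINAL origin stamp is kept, so
# the copy that arrives back at server A is filtered out, not re-applied.
echoed = dict(change)
assert should_apply(echoed, "server-A") is False
```

The subtlety in practice is carrying the origin stamp through the binlog round-trip, e.g. by writing it into the synced rows or into a side table that the CDC pipeline can join against.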

@bashir, as @dkayiwa pointed out, Sync 2.0, just like any other tool based on Hibernate interceptors, has some holes, the most obvious being that you miss any change that takes place outside of a Hibernate transaction. The other big flaw with Sync 2.0 is that it uses the REST API to extract and load sync entities; the REST API is heavily based on the Java API with all the entity relationships, and it also excludes a lot of required fields that a REST client doesn’t need to know about but which nonetheless need to be synced. I think REST is great for integration but not for DB sync.


