The promise of OpenMRS ETL: Advanced Reporting/Analytics for OpenMRS

wyclif · May 15, 2020, 4:19am

Hi,

I just wanted to share what I’ve been working on and the changes we intend to make to an existing project that involves extracting data out of OpenMRS for integration with various other systems.

I believe some of you have heard about the DB sync project which was written by a developer that moved on to other things, I have spent the past couple of weeks trying to make improvements to the project.

Several DB systems provide replication tools that can do the sync job but that’s all to them and in some cases can be pricey, so this project was built to provide OpenMRS DB to DB sync of selected tables while providing an integration layer in a way that is as agnostic as possible to the underlying DBMS.

At the heart of the project is Apache Camel which allows it to route messages between the various applications in the ecosystem, initially the project was running a scheduler that periodically polled the configured OpenMRS tables to find rows to sync, it would keep track of the last sync date to determine what had to be synced for subsequent runs by comparing the last sync date to date created and changed of each row, with this approach it was easy to miss some changes between runs. Currently, it’s updated to use after insert, update and delete DB triggers that write to a sync record journal in the remote OpenMRS instance that the tool reads to determine rows that need to be synced to the HQ instance.

The trigger approach works fine but we’re considering switching to something like Debezium to replace the triggers, sync record journal and the existing mechanism that broadcasts DB events to all listening routes. Of course we are concerned about what it means to run a full blown Debezium instance with ZooKeeper and Kafka which can demand more resources for a site with already constrained resources, it’s possible we might run an embedded instance of Debezium by default with an option to use an external one. We then listen for DB change events emitted by Debezium, these are typically very raw messages similar to a SQL resultset, we want to be able to enrich the message with a more loaded version of the affected entity along with it’s flattened associations at the application level before publishing it to a configured message broker of choice, the enriched message can be consumed by registered listeners which could be other camel routes that feed other systems.

We’re still in the initial phase of figuring out what tools to use and how everything will come together, we strongly believe our choices could be influenced by the product owner based on the resources of the stakeholder, we will share what we have put together in the near future.

Regards,

Wyclif