The promise of OpenMRS ETL: Advanced Reporting/Analytics for OpenMRS

burke · April 24, 2020, 4:27am

Nearly everyone who has implemented OpenMRS has eventually needed to generate reports from or perform analytics on their data. One of the key opportunities for greater collaboration within the OpenMRS community is around reporting through ETL (Extract, Transform, and Load). This was underscored during discussions at OMRS19. Despite getting a lot out of the existing OpenMRS reporting framework, it’s difficult to do modern analytics on OpenMRS data. The manual process of designing and developing ETL queries is time-consuming, complex, and is being tackled one implementation at a time.

Where we’ve been

OpenMRS began largely as a tool to capture data from HIV encounter forms for clinical care and research. Most data analysis in those days was through brute force SQL queries to extract data into analytic tools. It wasn’t long before people started referring to OpenMRS as a “data bucket” – data go in, but they don’t come out. There have been numerous attempts to address this over the years:

Reporting Module
BIRT Module integrated with Eclipse BIRT
Integrations with Pentaho
Jasper Reports
DHIS2 Reporting Module
ADX and mADX efforts
Tools like SQLeo
Flattening tables (AMPATH, Bahmni, others)
… many more

One of the key insights from our prior efforts with traditional reporting tools was the realization the typical star schema used by data marts works well when you know your data model and the questions you want to ask of your data. However, in medicine our data model is constantly changing (new knowledge about disease, new tests, new treatments) and the questions being asked of the data are always changing. Fortunately, tools like Elasticsearch, Apache Spark, and Apache Kafka can help address this issue.

Where we are today

People continue to get a lot out of the Reporting framework (Reporting Module)
Implementations often run reports overnight to reduce the impact on performance
We see many bespoke reporting solutions
Several sites are having local successes exporting data into flattened tables
Sites with a reporting solution are often doing full loads (instead of incremental updates)
The community is still struggling to meet reporting needs efficiently
We’re missing opportunities to collaborate on solving shared reporting challenges

Assumptions

There are some reporting needs shared across many implementations and there are some reporting needs unique to each implementation.
Reporting needs will change over time.
In general, reporting is easier from flattened tables.
There’s a set of flattened tables that would be widely useful as a starting point to meet any implementation’s reporting needs, especially if the process to extend the solution to create additional tables was straightforward.
Not every implementation has a team to set up and maintain a complex ETL solution.
While we want a solution that can scale into the cloud, we don’t want a solution that depends on cloud services.
We don’t want a solution that must run on the same server as OpenMRS.
A scalable ETL solution could be achieved leveraging existing open source tools (e.g., Elastic Search, Apache Spark, Apache Kafka, etc.).
Aggregate data is not useful for reporting/analytics/research, but can be useful when brought back to the EMR (e.g., showing quality indicators within the EMR, using aggregate data to provide decision support in the EMR, etc.).

Where do we go from here

We’ve seen convincing examples of how ETL of OpenMRS data into flattened tables can vastly improve the ability to generate reports and rapid analytics. What we need is an approach that can be used by more than one organization – i.e., that we can share within the community – without trying to solve every reporting need up front. Focusing on a handful of practical tables and creating a solution that can easily be extended to meet local needs would lay a foundation for increasing opportunities for collaboration within reporting. At OMRS19, we discussed targeting a specific PEPFAR report that many sites need to generate and to build a Docker-based solution that would integrate with OpenMRS and generate the tables needed for the report in a way that could be extended to other (shared or local) reporting needs.

Requirements

Batch or stream. Should provide ability to extract and load data out of OpenMRS in both batch and streaming modes, with the least amount of burden to OpenMRS production line.
Aggregate or patient-level. Support both aggregates/counts and patient-level analysis/reports
Extensible. Not everyone needs the same reports and the reports we need tomorrow won’t be the same reports we need today.
Independent. Be able to run independent from OpenMRS. Be able to update reports without requiring downtime of OpenMRS
Integrated. OpenMRS should be able to use reporting data.
Practical. Features like flattened core tables (e.g., encounter, obs, orders) and auto-build tables for forms to make it immediately useful to implementations.
Customizable. The configuration of “core” tables or initial reports should serve as examples for implementations to extend the solution to meet local needs.
Leverage existing tools. Configure existing tools together so we can focus on the domain-specific needs instead of wring our own infrastructure.
Kind. Low barrier of entry. You shouldn’t have to be an expert in ETL tooling to get the basic functionality and begin customizing it for local needs.
Scalable in both directions. A solution that can run locally, but also scale to the cloud.
Compatible. Support for different versions of OpenMRS can be added as needed.
Shareable. Reports can be exchanged between implementations.

Next steps

Imagine if a few organizations (like Mekom Solutions, PIH, The Palladium Group, and AMPATH possibly joined with groups from Nigeria, Uganda, Mozambique, and Kenya) decided to work together on a shared reporting solution for OpenMRS to address a shared problem (e.g., generating a PEPFAR report they all have to do), reviewed reporting efforts being done by themselves and in the community, and ended up weaving Apache Kafka, ElasticSearch, etc. into a Docker Compose stack that could be run alongside OpenMRS without bringing OpenMRS to its knees. In its simplest form, this would be a “pipeline” extracting data (via batch and/or stream) into a warehouse using commodity open-source tooling to generate some widely useful views (e.g., flattened encounters) and extending that approach for a real-world use case (e.g., a PEPFAR report).

Imagine if we could create a re-usable data pipeline for OpenMRS using modern, open-source tools that allow data from the production database to feed through a ready- and easy-to-use analytics environment supporting NoSQL (loosely structured) and flattened schemas. The included configuration for generating flattened encounters and tables for a PEPFAR report could serve as examples that implementations could extend to meet their local reporting needs. Then OpenMRS would have the foundation for a reporting solution that could be enhanced over time and the community could focus on the OpenMRS-specific parts of transforming & analyzing data (i.e., the clinical domain) rather than spending energy on building another reporting infrastructure best left to Apache and other open source communities. Imagine if countries & implementing organizations could share their reporting & analytic configurations.

Some implementations & organizations have already begun working on next steps locally. For example, some exciting ideas have been shared by @aojwang (see this post), @akimaina (see this post), and @bashir. What we need next is to find at least 2-3 organizations committed to addressing their shared reporting needs by collaborating on a shared & extensible reporting solution with help from the community (an “ETL Squad” … or perhaps an “ELT Squad”). Our biggest opportunity is through collaboration. We can walk further if we walk together. Collaboration is hard; it takes more energy than going your own way… but the returns can be amazing. Collaboration is how OpenMRS was born and it’s our best way forward.

aojwang · April 28, 2020, 7:36am

Thanks, @burke for moving this forward. I am a bit worried about implementations with just one computer which may not be so powerful to power the many ETL platforms. Are we looking at a one-fits-all strategy or a mix of solutions? Also, I would appreciate an approach that will be usable with the existing reporting framework.

akimaina · April 30, 2020, 9:38am

Just to highlight some of the comments @aojwang and @Ryan mentioned over slack:

The proposed solution should be able to scale out in large data-set as well as scale down in small datasets. For medium/small dataset, it should not require upgrading of hardware on existing infrastructure
The solution should support different deployment strategies for OpenMRS as well as for the pipeline, i.e.
- on-premise centralized deployments
- on-premise silo deployments
- cloud deployments

something to think about!

paulamendola · April 30, 2020, 3:18pm

Hi team - The VecnaCares dev team have developed and operationalized an OpenMRS data warehouse that is currently in product. A high-level slide deck is attached, and we can schedule a demo and provide more information. It’s been in use for the past couple of months. There are emails at the end of the slides that can be contacted for more technical information.

We’re really happy to collaborate - please just let us know. Data Warehouse VecnaCares.pdf (1.3 MB)

bashir · May 1, 2020, 7:49pm

Just to follow up from Slack: I am curious how distributed deployments of OpenMRS work around issues like synchronization. Since I could not find a clear answer in talk/wiki, I posted this thread.

This is relevant to the topic of ETL/ELT/Analytics because if what we care about is mostly single machine deployments, then processing incremental updates to the OpenMRS database (i.e., a streaming ETL/ELT solution) can potentially work on the same machine or another [single] machine working off of a slave copy of the DB. I always assumed that the ETL solution we are talking about here needs to be scalable to many nodes but am now questioning whether this is the right assumption.

@burke when you talk about the scalability of the ETL solution, where is a good place to get some stats on the data scale we are aiming for? Where are some places that have a distributed deployment of OpenMRS on a large number of nodes?

burke · May 2, 2020, 2:13am

I wouldn’t constrain us to a “one-size-fits-all or nothing” approach. But an elegant cloud solution that can’t be deployed on a local server will be near useless to the majority of implementations either because of connectivity and/or the politics of placing health data into the cloud. So, I’m picturing that we aim at a docker-compose that can function with single instances of each service, can run on a local server, can run much slower on a workstation, and can be expanded (e.g., via docker swarm, kubernetes, etc.) in the cloud.

I would recommend that we not conflate ETL with synchronization. While it’s easy to relate the two, synchronization is a massive undertaking in its own right and would introduce complexity and scope to this discussion that could kill any momentum. For now, let’s assume implementations can either get their data into a single instance or can leverage ETL to flatten data into a form that’s easier to merge.

I believe AMPATH has one of the largest implementations.

bashir · May 3, 2020, 5:14am

Okay, if we limit the problem to a single OpenMRS instance then I think for the streaming part of the ETL/ELT solution, we can simplify the design and avoid any distributed stream processing technology, which obviously simplifies design/deployment significantly.

Basically the assumption is that the ETL/ELT processing cost of a data manipulation event E is proportional to the processing that OpenMRS does for E. In that case, the streaming part of ETL/ELT can be done on the same machine that runs OpenMRS (or another single machine/process that works off of a replication of OpenMRS’s DB or even be implemented as yet another module for OpenMRS). I am obviously making other assumptions about the intermediate “warehouse/lake”.

burke · May 4, 2020, 1:03am

The ideal first step, IMHO, would target sites that can run their ETL engine on a secondary “reporting” machine, but work (with compromises on performance) on the same server as OpenMRS and be designed up front for horizontal scalability. So, sites with a single machine could get some functionality and those with extended capabilities could leverage multiple machines and/or the cloud for higher performance.

I’m assuming/hoping we can provide configuration to generate a core set of flattened tables along with a report using the same approach implementations could extend to meet their local needs.

mksd · May 7, 2020, 6:08am

About DbSync vs ETL. @frederic.deniger just pointed that we could look at this:

CDC approaches based on Camel and Debezium.

@wyclif will look into it to figure out whether the approach suggested on this article works better/is more suitable than what we had envisioned so far. If yes, then openmrs:extract from DbSync might just end up doing exactly what @akimaina and @bashir seem to be after. I.e. DbSync might be able to cover the ‘E’ part.

hamish · May 7, 2020, 12:10pm

Thanks Burke for a really good summary of this critical need and a practical way to create a shared solution. As you note we have had a diversity of partial solutions over the years and we need a common approach that also works on small, single server implementations of which there are thousands at present only some of which have good internet.

akimaina · May 7, 2020, 12:42pm

Thanks, @mksd for pointing this out. Both camel and spark integrate really well with debezium. @wyclif please consider batch-mode scenario when you are looking into this, I suspect camel does support batch consumption. The main data elements to consider emanates from encounter, obs, person, and orders. Lastly, consider a friendly to analytics output dataframe from the openmrs:extract process. I’m interested to know your findings on this, Please keep me posted.

akimaina · May 7, 2020, 1:03pm

Thank you, @paulamendola , for outlining this. I think we can borrow some ideas from this pipeline. I’m interested to know how the pipeline handles incremental updates. Also hope to see you demo during the ETL/ELT stand up: https://doodle.com/poll/9kefyg4ryenex5m7

mseaton · May 7, 2020, 2:40pm

This would be very cool to see. Although Sync and ETL/Reporting are definitely different use cases, I’ve always hoped that we could share an underlying core foundation of CDC (Change Data Capture) that both leverage, and which other downstream tools could also build upon, rather than inventing different solutions for both. In each case, the problem is efficiently extracting data out of OpenMRS (in batch and/or incrementally) and making that extracted data reliably available for downstream processing. It’s great to have those working on DbSync and those working on ETL ensuring that their efforts are complementary.

ssmusoke · May 7, 2020, 4:13pm

And throwing this need out there, is there a way to have the equivalent of Lucene indexing the data as it is entered/updated within OpenMRS to simplify reporting?

Probably providing a way to build custom data sets etc, especially for smaller sites

akimaina · May 11, 2020, 5:53am

Thanks @bashir and @burke for refining it further. I think we can leverage a lot on streaming since it requires less processing and can run alongside openmrs deployment. I’ll be testing the streaming-mode in resource-constrained environments, I’m happy to share the results.

wyclif · May 15, 2020, 4:19am

Hi,

I just wanted to share what I’ve been working on and the changes we intend to make to an existing project that involves extracting data out of OpenMRS for integration with various other systems.

I believe some of you have heard about the DB sync project which was written by a developer that moved on to other things, I have spent the past couple of weeks trying to make improvements to the project.

Several DB systems provide replication tools that can do the sync job but that’s all to them and in some cases can be pricey, so this project was built to provide OpenMRS DB to DB sync of selected tables while providing an integration layer in a way that is as agnostic as possible to the underlying DBMS.

At the heart of the project is Apache Camel which allows it to route messages between the various applications in the ecosystem, initially the project was running a scheduler that periodically polled the configured OpenMRS tables to find rows to sync, it would keep track of the last sync date to determine what had to be synced for subsequent runs by comparing the last sync date to date created and changed of each row, with this approach it was easy to miss some changes between runs. Currently, it’s updated to use after insert, update and delete DB triggers that write to a sync record journal in the remote OpenMRS instance that the tool reads to determine rows that need to be synced to the HQ instance.

The trigger approach works fine but we’re considering switching to something like Debezium to replace the triggers, sync record journal and the existing mechanism that broadcasts DB events to all listening routes. Of course we are concerned about what it means to run a full blown Debezium instance with ZooKeeper and Kafka which can demand more resources for a site with already constrained resources, it’s possible we might run an embedded instance of Debezium by default with an option to use an external one. We then listen for DB change events emitted by Debezium, these are typically very raw messages similar to a SQL resultset, we want to be able to enrich the message with a more loaded version of the affected entity along with it’s flattened associations at the application level before publishing it to a configured message broker of choice, the enriched message can be consumed by registered listeners which could be other camel routes that feed other systems.

We’re still in the initial phase of figuring out what tools to use and how everything will come together, we strongly believe our choices could be influenced by the product owner based on the resources of the stakeholder, we will share what we have put together in the near future.

Regards,

Wyclif

akimaina · May 18, 2020, 2:05pm

Thanks, @wyclif for providing possible tools that we can use. We will definitely need to do a lot of benchmark and testing on these tools while considering the ability to scale up and/or down, depending on diff server specs. I am a huge proponent for debezium mainly because it is an out-of-the-box solution that has minimal impact on production DB. Additionally, debezium/Kafka tends to be rather light on the CPU and RAM. Dockerized zookeeper-Kafka-connector, uses negligible JVM heap; Thus for small deployments, 500mb to 1 GB RAM should be enough for each container. During runtime, Kafka consumers buffers at least 2MB of RAM per consumer; while producers consume up to 64 MB. Therefore, the resource it needs highly depends on # of records/messages per sec. When it comes to batch mode, we will definitely need modern processors with multiple cores - also highly dependent on the amount of data in the DB. for < 1000 patients with less approximately 1 million clinical observations, 8gb ram, and 4 cores should be able to power the entire stack including production line.

mseaton · May 18, 2020, 5:35pm

For a number of these solutions (eg. Debezium) it seems that they likely rely on reading the MySQL bin log. I don’t believe that at PIH we have typically/by default configured our MySQL instances beyond the defaults in this area, and so I’m interested in how we might start thinking about this and whether we need to planning for how we might adapt our existing and future MySQL instances to prepare for these tools. Particularly for some large-scale rollouts - I’m concerned that updating hundreds of existing installations to use a new MySQL binlog format might be a major challenge for some groups.

Does anyone have experience or best practices in this area that they can share?

If the above is correct, we might want to start issuing some recommendations around MySQL configuration that accounts for this.

Thanks, Mike

bashir · May 18, 2020, 9:41pm

Thanks @wyclif for this nice summary. As I alluded above, I think the solution for the sync problem can play a role for the streaming part of an ETL/Analytics solution. One thing that I asked here but still am not clear about is what were your reasons for not building on top of Sync 2.0 and instead creating dbsync? Because at least part of what you are describing is similar to how Sync 2.0 uses the atomfeed module to keep a log of DB changes.

FWIW, for streaming ETL, I think beside approaches based on debezium or DB logs, another way is to rely directly on OpenMRS to keep the change logs, like an extended version of atomfeed module; or even directly write into the data warehouse once a change happens in OpenMRS. I’ll try to talk through some of these ideas in this meeting.

dkayiwa · May 18, 2020, 9:52pm

Would this be in order to make the solution lighter?