Nearly everyone who has implemented OpenMRS has eventually needed to generate reports from or perform analytics on their data. One of the key opportunities for greater collaboration within the OpenMRS community is around reporting through ETL (Extract, Transform, and Load). This was underscored during discussions at OMRS19. Despite getting a lot out of the existing OpenMRS reporting framework, it’s difficult to do modern analytics on OpenMRS data. The manual process of designing and developing ETL queries is time-consuming, complex, and is being tackled one implementation at a time.
Where we’ve been
OpenMRS began largely as a tool to capture data from HIV encounter forms for clinical care and research. Most data analysis in those days was through brute force SQL queries to extract data into analytic tools. It wasn’t long before people started referring to OpenMRS as a “data bucket” – data go in, but they don’t come out. There have been numerous attempts to address this over the years:
- Reporting Module
- BIRT Module integrated with Eclipse BIRT
- Integrations with Pentaho
- Jasper Reports
- DHIS2 Reporting Module
- ADX and mADX efforts
- Tools like SQLeo
- Flattening tables (AMPATH, Bahmni, others)
- … many more
One of the key insights from our prior efforts with traditional reporting tools was the realization the typical star schema used by data marts works well when you know your data model and the questions you want to ask of your data. However, in medicine our data model is constantly changing (new knowledge about disease, new tests, new treatments) and the questions being asked of the data are always changing. Fortunately, tools like Elasticsearch, Apache Spark, and Apache Kafka can help address this issue.
Where we are today
- People continue to get a lot out of the Reporting framework (Reporting Module)
- Implementations often run reports overnight to reduce the impact on performance
- We see many bespoke reporting solutions
- Several sites are having local successes exporting data into flattened tables
- Sites with a reporting solution are often doing full loads (instead of incremental updates)
- The community is still struggling to meet reporting needs efficiently
- We’re missing opportunities to collaborate on solving shared reporting challenges
- There are some reporting needs shared across many implementations and there are some reporting needs unique to each implementation.
- Reporting needs will change over time.
- In general, reporting is easier from flattened tables.
- There’s a set of flattened tables that would be widely useful as a starting point to meet any implementation’s reporting needs, especially if the process to extend the solution to create additional tables was straightforward.
- Not every implementation has a team to set up and maintain a complex ETL solution.
- While we want a solution that can scale into the cloud, we don’t want a solution that depends on cloud services.
- We don’t want a solution that must run on the same server as OpenMRS.
- A scalable ETL solution could be achieved leveraging existing open source tools (e.g., Elastic Search, Apache Spark, Apache Kafka, etc.).
- Aggregate data is not useful for reporting/analytics/research, but can be useful when brought back to the EMR (e.g., showing quality indicators within the EMR, using aggregate data to provide decision support in the EMR, etc.).
Where do we go from here
We’ve seen convincing examples of how ETL of OpenMRS data into flattened tables can vastly improve the ability to generate reports and rapid analytics. What we need is an approach that can be used by more than one organization – i.e., that we can share within the community – without trying to solve every reporting need up front. Focusing on a handful of practical tables and creating a solution that can easily be extended to meet local needs would lay a foundation for increasing opportunities for collaboration within reporting. At OMRS19, we discussed targeting a specific PEPFAR report that many sites need to generate and to build a Docker-based solution that would integrate with OpenMRS and generate the tables needed for the report in a way that could be extended to other (shared or local) reporting needs.
- Batch or stream. Should provide ability to extract and load data out of OpenMRS in both batch and streaming modes, with the least amount of burden to OpenMRS production line.
- Aggregate or patient-level. Support both aggregates/counts and patient-level analysis/reports
- Extensible. Not everyone needs the same reports and the reports we need tomorrow won’t be the same reports we need today.
- Independent. Be able to run independent from OpenMRS. Be able to update reports without requiring downtime of OpenMRS
- Integrated. OpenMRS should be able to use reporting data.
- Practical. Features like flattened core tables (e.g., encounter, obs, orders) and auto-build tables for forms to make it immediately useful to implementations.
- Customizable. The configuration of “core” tables or initial reports should serve as examples for implementations to extend the solution to meet local needs.
- Leverage existing tools. Configure existing tools together so we can focus on the domain-specific needs instead of wring our own infrastructure.
- Kind. Low barrier of entry. You shouldn’t have to be an expert in ETL tooling to get the basic functionality and begin customizing it for local needs.
- Scalable in both directions. A solution that can run locally, but also scale to the cloud.
- Compatible. Support for different versions of OpenMRS can be added as needed.
- Shareable. Reports can be exchanged between implementations.
Imagine if a few organizations (like Mekom Solutions, PIH, The Palladium Group, and AMPATH possibly joined with groups from Nigeria, Uganda, Mozambique, and Kenya) decided to work together on a shared reporting solution for OpenMRS to address a shared problem (e.g., generating a PEPFAR report they all have to do), reviewed reporting efforts being done by themselves and in the community, and ended up weaving Apache Kafka, ElasticSearch, etc. into a Docker Compose stack that could be run alongside OpenMRS without bringing OpenMRS to its knees. In its simplest form, this would be a “pipeline” extracting data (via batch and/or stream) into a warehouse using commodity open-source tooling to generate some widely useful views (e.g., flattened encounters) and extending that approach for a real-world use case (e.g., a PEPFAR report).
Imagine if we could create a re-usable data pipeline for OpenMRS using modern, open-source tools that allow data from the production database to feed through a ready- and easy-to-use analytics environment supporting NoSQL (loosely structured) and flattened schemas. The included configuration for generating flattened encounters and tables for a PEPFAR report could serve as examples that implementations could extend to meet their local reporting needs. Then OpenMRS would have the foundation for a reporting solution that could be enhanced over time and the community could focus on the OpenMRS-specific parts of transforming & analyzing data (i.e., the clinical domain) rather than spending energy on building another reporting infrastructure best left to Apache and other open source communities. Imagine if countries & implementing organizations could share their reporting & analytic configurations.
Some implementations & organizations have already begun working on next steps locally. For example, some exciting ideas have been shared by @aojwang (see this post), @akimaina (see this post), and @bashir. What we need next is to find at least 2-3 organizations committed to addressing their shared reporting needs by collaborating on a shared & extensible reporting solution with help from the community (an “ETL Squad” … or perhaps an “ELT Squad”). Our biggest opportunity is through collaboration. We can walk further if we walk together. Collaboration is hard; it takes more energy than going your own way… but the returns can be amazing. Collaboration is how OpenMRS was born and it’s our best way forward.