Real-Time ETL for OpenMRS Using Apache Flink: Proof of Concept & Feedback Request

Hi everyone,

As part of my university research on real-time ETL for EAV-modeled clinical data, I created a proof-of-concept Apache Flink application that lets users create Flink ETL jobs for OpenMRS. This was developed under the guidance of @ibacher.

GitHub repo: wikumChamith/openmrs-contrib-flink

It supports:

  • Streaming changes from an OpenMRS MySQL database in real time
  • Pivoting EAV observations into flat columns automatically
  • Writing to a target database with upsert support
  • No coding required; jobs are defined in a YAML config file

The app works as a standalone service where ETL jobs are defined using YAML config files. When you upload a YAML file, the application automatically generates all the required Flink SQL for source tables, lookups, sinks, and inserts. Since M&E teams usually don’t want to write Flink SQL by hand, they can define simple mappings instead, including concept mappings and lookup fields.

I’ve added a few sample YAML files in the repository under openmrs-contrib-flink/src/main/resources/sample.

Here’s an example of what concept mapping looks like:
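As a rough, hypothetical sketch only (the key names below are illustrative, not the tool’s actual config schema; the sample files in the repository define the real format), a concept-mapping job might express something like:

```yaml
# Hypothetical sketch of a concept-mapping config.
# The real keys and structure are defined by the sample files in
# src/main/resources/sample of the repository.
jobName: vitals_flattening
source:
  table: obs            # OpenMRS EAV observations table
sink:
  table: flat_vitals    # target flat table, written with upserts
conceptMappings:
  # each mapping pivots one concept's obs rows into a flat column
  - column: weight_kg
    conceptId: 5089     # illustrative concept id
  - column: height_cm
    conceptId: 5090     # illustrative concept id
```

The idea is that an M&E analyst edits only this mapping; the application generates the corresponding Flink SQL source, lookup, and sink statements from it.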

The README and sample files in the repository should give you a good idea of how to use the application.

This is still a proof of concept and not a production-ready tool. The key areas that still need work are:

  1. Authentication & Authorization: No user login or access control yet; anyone with access can upload and delete jobs.
  2. Secure Credential Handling: Database credentials are currently stored in plaintext in YAML files.
  3. Real Job Status Monitoring: All jobs show as “Running” regardless of their actual state.
  4. Job Logs & Error Visibility: There is currently no way to view logs or error messages from the UI.
  5. YAML Validation in UI: There is no built-in editor or validator; errors are only caught after upload.

Please give it a try and share any feedback or suggestions for features you’d like to see. I’m planning to showcase this in this week’s platform call. I look forward to hearing your thoughts!


Thanks @wikumc for sharing this. How does this compare with MambaETL? In other words, as an implementer, when would I use this over the other? Some sort of pros and cons would also be great.

I think this approach has a few benefits compared to MambaETL:

  1. Real-time ETL using Change Data Capture. MambaETL runs periodically and regenerates flat tables. Real-time flattening would be helpful when we need optimized real-time queries, for example for something like the Patient Flags module.
  2. Runtime-configurable jobs. Jobs can be created or deleted at runtime through a REST API without redeployment. With MambaETL, configuration changes usually require recompilation and redeployment of the module.
  3. Standalone application. This runs as a standalone application, while MambaETL is an OpenMRS module. This means ETL processing does not add extra load to the OpenMRS server and can be managed independently.
  4. Supports multi-table CDC joins. For example, see src/main/resources/sample/vitals-with-patient-info.yaml in the repository. MambaETL is primarily focused on flattening the obs table, while this approach allows joins across multiple tables.
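To make the multi-table point concrete, here is a hypothetical sketch of what such a mapping could express (the key names are illustrative only; the actual schema is the one used in the repository’s vitals-with-patient-info.yaml sample):

```yaml
# Hypothetical sketch of a multi-table CDC join config.
# See src/main/resources/sample/vitals-with-patient-info.yaml in the
# repository for the real format.
jobName: vitals_with_patient_info
source:
  table: obs                                   # CDC stream of observations
lookups:
  # enrich each obs row with demographic fields from the person table
  - table: person
    joinOn: obs.person_id = person.person_id
    columns: [gender, birthdate]
sink:
  table: flat_vitals_with_patient              # flat table with joined columns
```

From a config like this, the application would generate the Flink SQL that joins the CDC streams and keeps the flat sink table up to date as either source table changes.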

Data streaming with configurable ETL leveraging Apache Flink is what Ozone Analytics already provides as an open-source stack. I would definitely not duplicate that work.

Note that within Ozone HIS, Ozone Analytics provides this for OpenMRS as well as for other apps.

@Mekom any docs that can be shared with @wikumc?


Although this shares a common orchestration library in Flink, I don’t think the work is duplicative at all. Rather, there’s somewhat of a different focus: Ozone Analytics mostly has code around, e.g., triggering ETL jobs on a streaming basis from Kafka messages or on a batch basis, with the actual transformations described in SQL and published to some kind of sink. The work here is much more about a descriptive layer for describing how to create flat tables from OpenMRS database structures. Which is to say, we could adapt a lot of the Ozone Analytics wrappings and triggers into this project, or use this project to ultimately describe the SQL executed by Ozone Analytics.


I read about your ETL pipeline. I am a data engineer; if you need me any time, I will be there to contribute to your ETL pipeline. It would be a great honour to work under your mentorship.

I think we’re very open to any contributions that improve things. Over the years, we’ve had several ETL-focused projects but they’ve never quite gotten to a phase where we’ve started to see them deployed, so anything we can do to make the pipelines any of:

  • Correct
  • Easy to deploy
  • Easy to work with the transformed data

especially for things like standard healthcare indicators, would be very useful!


I am new to the field of open-source contribution, so here is a suggestion from my side: what if we create a production-ready ETL pipeline, referencing those ETL-focused projects that have already been made? What do you think?

We could form a single ETL pipeline using Kafka as a real-time queue, and for better analytics we could use ClickHouse or any BI tools. It would be an external tool that hospitals can use alongside their existing management systems to get the best analysis and make the best decisions.

I am thinking of a single architecture of this kind that covers everything, deployed as an external tool.

What do you think?

I just want to know: if I try to create the pipeline I described in my previous reply, would it be welcomed by the OpenMRS maintainers? Would they notice our work and, after testing our pipeline, attach it to the officially usable software?

@itz_ashar_ali I’m planning to integrate my work into the OpenMRS codebase soon. If you’d like, you can start contributing to that project after that. Otherwise, I think we’ll be duplicating our work.

OK, that sounds good to me!