Real-Time ETL for OpenMRS Using Apache Flink: Proof of Concept & Feedback Request

Hi everyone,

As part of my university research on real-time ETL for EAV-modeled clinical data, I created a proof-of-concept Apache Flink application that allows users to create Flink ETL jobs for OpenMRS. This was developed under the guidance of @ibacher

Github Repo: GitHub - wikumChamith/openmrs-contrib-flink · GitHub

It supports:

  • Streaming changes from an OpenMRS MySQL database in real-time
  • Pivoting EAV observations into flat columns automatically
  • Writing to a target database with upsert support
  • No coding required, just write a YAML config file

The app works as a standalone service where ETL jobs are defined using YAML config files. When you upload a YAML file, the application automatically generates all the required Flink SQL for source tables, lookups, sinks, and inserts. Since M&E teams usually don’t want to write Flink SQL by hand, they can define simple mappings instead, including concept mappings and lookup fields.

I’ve added a few sample YAML files here: openmrs-contrib-flink/src/main/resources/sample at main · wikumChamith/openmrs-contrib-flink · GitHub

Here’s an example of what concept mapping looks like:

The README and sample files in the repository should give you a good idea of how to use the application.

This is still a proof of concept and not a production-ready tool. The key areas that still need work are:

  1. Authentication & Authorization: No user login or access control yet; anyone with access can upload and delete jobs.
  2. Secure Credential Handling: Database credentials are currently stored in plaintext in YAML files.
  3. Real Job Status Monitoring: All jobs show as “Running” regardless of their actual state.
  4. Job Logs & Error Visibility: There is currently no way to view logs or error messages from the UI.
  5. YAML Validation in UI: There is no built-in editor or validator; errors are only caught after upload

Please give it a try and share any feedback or suggestions for features you’d like to see. I’m planning to showcase this in this week’s platform call. I look forward to hearing your thoughts!

4 Likes

Thanks @wikumc for sharing this. How does this compare with mambaetl? In other words, as an implementer, when would i use this over the other? Some sort of pros and cons would also be great.

I think this approach has a few benefits compared to MambaETL:

  1. Real-time ETL using Change Data Capture. MambaETL runs periodically and regenerates flat tables. Real-time flattening would be helpful when we need optimized real-time queries, for example for something like the Patient Flags module.
  2. Runtime-configurable jobs. Jobs can be created or deleted at runtime through a REST API without redeployment. With MambaETL, configuration changes usually require recompilation and redeployment of the module.
  3. Standalone application. This runs as a standalone application, while MambaETL is an OpenMRS module. This means ETL processing does not add extra load to the OpenMRS server and can be managed independently.
  4. Supports multi-table CDC joins. For example: openmrs-contrib-flink/src/main/resources/sample/vitals-with-patient-info.yaml at 25b899564974da29a9e0bd6866411a1b31d8d72f · wikumChamith/openmrs-contrib-flink · GitHub MambaETL is primarily focused on flattening the obs table, while this approach allows joins across multiple tables.
2 Likes

Data streaming with configurable ETL leveraging Apache Flink is what Ozone Analytics already provides as an open-source stack. I would definitely not duplicate that work.

Note that within Ozone HIS, Ozone Analytics provides this for OpenMRS as well as for other apps.

@Mekom any docs that can be shared with @wikumc?

1 Like

Although this shares a common orchestration library in Flink, I don’t think the work is duplicative at all. Rather there’s somewhat of a different focus; Ozone Analytics mostly has code around, e.g., triggering ETL jobs on a streaming basis from Kafka messages or a batch basis with the actual transformations described in SQL and published to some kind of sink. The work here is much more about a descriptive layer for describing how to create flat tables from OpenMRS database structures. Which is to say, we could adapt a lot of Ozone Analytics wrappings and triggers into this project or use this project to ultimately describe the SQL executed by Ozone Analytics.

2 Likes