As part of my university research on real-time ETL for EAV-modeled clinical data, I created a proof-of-concept Apache Flink application that lets users create Flink ETL jobs for OpenMRS. It was developed under the guidance of @ibacher. Key capabilities:

- Streaming changes from an OpenMRS MySQL database in real time
- Pivoting EAV observations into flat columns automatically
- Writing to a target database with upsert support
- No coding required: just write a YAML config file
The app works as a standalone service where ETL jobs are defined using YAML config files. When you upload a YAML file, the application
automatically generates all the required Flink SQL for source tables, lookups, sinks, and inserts. Since M&E teams usually don’t want to write
Flink SQL by hand, they can define simple mappings instead, including concept mappings and lookup fields.
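To make the generated output concrete, here is a rough sketch of the kind of Flink SQL the application could emit for a sink and insert. The table names, column names, and concept IDs below are illustrative assumptions of mine, not the application's actual generated output:

```sql
-- Illustrative upsert sink on the target database (Flink JDBC connector).
-- Declaring a PRIMARY KEY makes the JDBC sink operate in upsert mode.
CREATE TABLE obs_flat (
  encounter_id BIGINT,
  patient_id   BIGINT,
  weight_kg    DOUBLE,
  temperature  DOUBLE,
  PRIMARY KEY (encounter_id) NOT ENFORCED
) WITH (
  'connector'  = 'jdbc',
  'url'        = 'jdbc:mysql://target-db:3306/analytics',
  'table-name' = 'obs_flat'
);

-- Illustrative EAV pivot: one row per encounter, with each mapped
-- concept becoming a flat column (concept IDs here are placeholders).
INSERT INTO obs_flat
SELECT
  o.encounter_id,
  o.person_id,
  MAX(CASE WHEN o.concept_id = 5089 THEN o.value_numeric END) AS weight_kg,
  MAX(CASE WHEN o.concept_id = 5088 THEN o.value_numeric END) AS temperature
FROM obs_source AS o
GROUP BY o.encounter_id, o.person_id;
```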
Here’s an example of what concept mapping looks like:
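Purely as an illustration (the field names here are my assumptions; the real schema is in the repository's README and sample files), a concept mapping in the YAML config might look something like:

```yaml
# Hypothetical sketch only -- consult the repository's sample YAML
# files for the actual configuration schema.
job:
  name: vitals-flattening
  target-table: obs_flat
  concept-mappings:
    - column: weight_kg       # flat column in the target table
      concept-id: 5089        # OpenMRS concept to pivot
      value-field: value_numeric
    - column: temperature
      concept-id: 5088
      value-field: value_numeric
```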
The README and sample files in the repository should give you a good idea of how to use the application.
This is still a proof of concept and not a production-ready tool. The key areas that still need work are:
- **Authentication & Authorization:** No user login or access control yet; anyone with access can upload and delete jobs.
- **Secure Credential Handling:** Database credentials are currently stored in plaintext in YAML files.
- **Real Job Status Monitoring:** All jobs show as “Running” regardless of their actual state.
- **Job Logs & Error Visibility:** There is currently no way to view logs or error messages from the UI.
- **YAML Validation in UI:** There is no built-in editor or validator; errors are only caught after upload.
Please give it a try and share any feedback or suggestions for features you’d like to see. I’m planning to showcase this in this week’s platform call. I look forward to hearing your thoughts!
Thanks @wikumc for sharing this. How does this compare with MambaETL? In other words, as an implementer, when would I use this over the other? Some sort of pros and cons would also be great.
I think this approach has a few benefits compared to MambaETL:

- **Real-time ETL using Change Data Capture.** MambaETL runs periodically and regenerates flat tables. Real-time flattening would be helpful when we need optimized real-time queries, for example for something like the Patient Flags module.
- **Runtime-configurable jobs.** Jobs can be created or deleted at runtime through a REST API without redeployment. With MambaETL, configuration changes usually require recompilation and redeployment of the module.
- **Standalone application.** This runs as a standalone application, while MambaETL is an OpenMRS module. This means ETL processing does not add extra load to the OpenMRS server and can be managed independently.
Data streaming with configurable ETL leveraging Apache Flink is what Ozone Analytics already provides as an open-source stack. I would definitely not duplicate that work.
Note that within Ozone HIS, Ozone Analytics provides this for OpenMRS as well as for other apps.
Although this shares Flink as a common orchestration layer, I don’t think the work is duplicative at all; the focus is somewhat different. Ozone Analytics mostly has code around, e.g., triggering ETL jobs on a streaming basis from Kafka messages or on a batch basis, with the actual transformations described in SQL and published to some kind of sink. The work here is much more a descriptive layer for specifying how to create flat tables from OpenMRS database structures. Which is to say, we could adapt a lot of the Ozone Analytics wrappings and triggers into this project, or use this project to describe the SQL that Ozone Analytics ultimately executes.