The @OHRI/@UCSF team has been working on an approach to flatten the OMRS EAV data model into queryable relational DB tables (if I understand correctly).
@dkayiwa, @amugume and I thought others would like to hear about this work. We are also very curious to hear folks’ feedback. Personally, I think our community needs a generic or referenceable EAV-to-relational-DB ETL offering to help small orgs get started with more difficult queries. Could this work provide OMRS implementers with such a starting place?
A big thank you especially to @amugume, both for his permission to share the video recording of his presentation, and for being very gracious in answering my questions.
Credit and thanks also to @aojwang (who Arthur also credits in the recording) and members of the PIH-PETL team like @mseaton, as I believe their work substantially contributed to the envisioning of this ETL solution for OHRI.
I know @bashir and @akimaina have been doing a lot of work in the Analytics engine squad to prepare for settings that need a performant solution that can scale with very large volumes of data. So performance with large data sets was my first question. @amugume and @ssmusoke are going to test this with a large sample dataset from UgandaEMR and will get back to us.
@reagan I’ll create a spike ticket internally for you to have a look at those ETL scripts. This could come as additions to the data extracts that we have come up with for Ozone.
Another interesting thing to note is that there is kind of a business interpretation layer that happens immediately after the OHRI ETL, but as part of the pipeline:
Question: Why handle this here if you have a BI Analytics or Data Viz Tool at the end of your pipeline?
Answer: Because many implementers have found that BI tools can easily be set up incorrectly, so that an M&E team thinks the charts are fetching the same data when in fact the labelling/chart set-up hasn’t been fully understood. The idea of this interpretation layer is to add a bit more order to the chaos, so that people using the BI tools further down the line are less likely to make set-up mistakes, while still having the freedom to ask the questions / build the reports they need.
Example: CSV logic for how patients of a specific age should be aggregated, depending on the report (e.g. DATIM 2022, MER 2022 vs MER 2021, etc.). For instance, last year PEPFAR updated their MER report requirements so that for some indicators (not all) someone aged 66 should go into a 65+ group, while in others they would be counted in the 55+ age bracket.
This CSV allows someone like a BA to update the logic when updated guidelines come out (and you can add unlimited columns to still support historical aggregate definitions).
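For illustration only, here is a hedged sketch of how such a CSV could be used once loaded into the database. The table and column names (`dim_age_group`, `derived_patient`, `age_at_reporting`) are hypothetical, not the OHRI pipeline’s actual schema:

```sql
-- Hypothetical dimension table populated from the CSV: one row per single
-- year of age, one column per reporting definition. New guideline years are
-- added as new columns, so historical definitions stay queryable.
CREATE TABLE dim_age_group (
  age                   INT PRIMARY KEY,
  datim_2022_age_group  VARCHAR(20),
  mer_2021_age_group    VARCHAR(20),
  mer_2022_age_group    VARCHAR(20)
);

-- A report then joins on the column that matches its reporting definition,
-- e.g. MER 2022 bands, against a hypothetical flattened patient table.
SELECT d.mer_2022_age_group,
       COUNT(*) AS patients
FROM derived_patient p
JOIN dim_age_group d ON d.age = p.age_at_reporting
GROUP BY d.mer_2022_age_group;
```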
In Mamba ETL we consider two distinct layers of ETL:
Base: this is where tables are laid out as much as possible in a dimensional model, but without any transformations or algorithmic computations (i.e. anything where a value depends on more than one column). An age computation doesn’t belong in base. You can expand F and M into Female and Male, and you can correct basic data problems, such as WHK into Windhoek, but not much else. So base is just that, a base layer without any computations. Base is also not dependent on context (see the sketch below).
Derived: this is where computations, groupings and other algorithms are applied. Age groups can happen here, as can determinations of viral load status, interpretations of test results, etc. Derived is context specific, meaning the definitions depend on the context. For example, you can have an MOH derived layer and a PEPFAR derived layer. Both can have different age groups, different ways to determine viral suppression, different aggregations, etc. This allows for full flexibility without getting in each other’s way.
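To make the two layers concrete, here is a hedged sketch in plain SQL; all table and column names (`base_person`, `base_person_address`, `derived_pepfar_person`, `derived_moh_person`) are hypothetical and this is not MambaETL’s actual generated code. The base step only recodes single columns, while the derived step computes age and applies context-specific groupings:

```sql
-- Base layer (hypothetical tables): single-column recodes only,
-- no computations and nothing context specific.
UPDATE base_person
SET gender = CASE gender
               WHEN 'F' THEN 'Female'
               WHEN 'M' THEN 'Male'
               ELSE gender
             END;

UPDATE base_person_address
SET city_village = 'Windhoek'
WHERE city_village = 'WHK';

-- Derived layer (hypothetical tables): age is computed here, and each
-- context applies its own grouping rules to the same base data.
CREATE TABLE derived_pepfar_person AS
SELECT person_id,
       CASE
         WHEN TIMESTAMPDIFF(YEAR, birthdate, CURDATE()) >= 65 THEN '65+'
         WHEN TIMESTAMPDIFF(YEAR, birthdate, CURDATE()) >= 50 THEN '50-64'
         WHEN TIMESTAMPDIFF(YEAR, birthdate, CURDATE()) >= 15 THEN '15-49'
         ELSE '<15'
       END AS age_group
FROM base_person;

CREATE TABLE derived_moh_person AS
SELECT person_id,
       CASE
         WHEN TIMESTAMPDIFF(YEAR, birthdate, CURDATE()) >= 55 THEN '55+'
         WHEN TIMESTAMPDIFF(YEAR, birthdate, CURDATE()) >= 25 THEN '25-54'
         ELSE '<25'
       END AS age_group
FROM base_person;
```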
For OHRI, the base layer flattens encounters. We have one encounter type per form, so base essentially flattens a form into a wide format, i.e. one row per submission. For now, you need to specify which columns you want to include, by incorporating their concept IDs (and the target column name). We can also consider flattening an entire encounter though (i.e. all the concepts in a form).
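As a hedged illustration only (the encounter type UUID, concept IDs and column names below are placeholders, and MambaETL generates its own procedures rather than this exact SQL), flattening one encounter type from the standard OpenMRS encounter/obs tables into a wide, one-row-per-submission table could look like this:

```sql
-- Hypothetical wide table for one form / encounter type:
-- one row per submitted encounter, one column per configured concept.
CREATE TABLE base_hts_form AS
SELECT e.encounter_id,
       e.patient_id,
       e.encounter_datetime,
       MAX(CASE WHEN o.concept_id = 111111 THEN o.value_coded   END) AS hts_result,   -- placeholder concept ID
       MAX(CASE WHEN o.concept_id = 222222 THEN o.value_text    END) AS entry_point,  -- placeholder concept ID
       MAX(CASE WHEN o.concept_id = 333333 THEN o.value_numeric END) AS weight_kg     -- placeholder concept ID
FROM encounter e
JOIN obs o
  ON o.encounter_id = e.encounter_id
 AND o.voided = 0
WHERE e.voided = 0
  AND e.encounter_type = (SELECT encounter_type_id
                          FROM encounter_type
                          WHERE uuid = 'hts-form-encounter-type-uuid')  -- placeholder UUID
GROUP BY e.encounter_id, e.patient_id, e.encounter_datetime;
```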
Thank you very much Fitti for this very clear explanation!
The whole concept of separating the “base” from the derivations seems especially helpful for preventing the ETL from being fragile amidst changes to business rules.
We just had another presentation on this same topic, and one of the questions that came up quickly was how to deploy this for use cases where the system is distributed across multiple servers / health facilities. I know @PIH has done some work around this and we will be looking at it, but if you have any ideas, please share.
@eudson or @amugume can you also clarify: is this approach for a SINGLE database, or would an implementation also have to set up a separate analytics database, to use this mamba-etl solution?
Thank you for sharing your work again. Some days back we had yet another call with an update on the solution. Here are a few of my thoughts on some design aspects.
I think it would be better to run it as a separate service and not an OpenMRS module. It is easier to roll out upgrades in a way that does not affect a running OpenMRS instance, and to do so without any OpenMRS downtime. Moreover, any issues with the ETL service would not affect OpenMRS users. It is also very common to set up a separate replica DB to take the load off the main DB for the ETL tasks. If you are running within the OpenMRS instance, you are by default using the main DB connection, or else you bypass the OpenMRS services and open a direct connection to a different DB.
I find SQL procedures hard to maintain, and since they are DB specific we cannot support different DB engines with the same code. SQL procedures are also hard to test. What’s your approach to testing?
If I understand the solution correctly, you still need to run a separate service to take data out of the DB. You do provide REST endpoints, but REST is not ideal, as it forces more data transformations than necessary: data always needs to be transformed to JSON and then back into the other data source’s format.
Finally, I’d suggest looking at Camel for building the ETL solution. You can bind Camel with lots of data sources, and it even has support for Debezium, which allows near real-time data extraction for real-time analysis, if needed. An example of using Camel and Debezium with OpenMRS is the openmrs/openmrs-eip project on GitHub (an EIP toolbox for OpenMRS: the OpenMRS Camel component and a Spring Boot app to launch it). It is an example of a real-time data syncing mechanism, but it gives some inspiration on how things could be done for the ETL solution.
The part that I wish we had for reuse in Camel integrations is an OpenMRS Camel component, which would provide converters to the OpenMRS model and make it easy to query the OpenMRS DB and work with the OpenMRS data model as POJOs. It would be extremely easy to build integrations with such a component.
I also wanted to throw in yet another (open-source) idea for the ETL workflows besides Camel: NiFi.
It’s for building data flows with little or no code through a nice UI, and it can handle lots of data thanks to built-in clustering.
It was first released by the NSA around 2006 and open-sourced in 2014, according to a Google search. It might be a good alternative, but I don’t have any experience with it, nor do I know anyone using it. There are lots of good-quality videos on YouTube.
@raff thank you for your comments; allow me to shed some light on why we decided to move forward with MambaETL and its current approach.
When we envisioned MambaETL we had the following mission in mind: have a solution that resolves the performance issues with the current facility-level reporting and makes implementers’ lives easier (an engine that does the heavy lifting of writing the flattening scripts for them).
We had a couple of discussions around this topic two years ago and there was no concrete solution to the problem (each implementer was approaching it differently). During those discussions some of the solutions you mentioned came up, and if I am not mistaken Kafka was also part of the mix. But one thing we all recognised is that the problem happens at sites (health facilities) with constrained resources (staffing and hardware), so adding a separate service would have an impact that most projects are not willing (or able) to risk right now (and especially two years ago).
So the solution we proposed and built adds no additional services and requires no additional knowledge from implementers, especially the data managers who are in charge of deploying OpenMRS. MambaETL is part of the OpenMRS ecosystem; in fact, that was a requirement when we pitched the idea to the @METS team. I remember the question: “How can we deploy your solution as part of our distribution?”
MambaETL doesn’t require implementers to maintain SQL procedures; the engine generates the procedures. The only thing an implementer needs to do is configure how and when they want the flattening to be done. @amugume can add more here.
To comment on your third point and @grace’s question, which escaped me a year ago:
We have two approaches for where Mamba will run: (1) the same database and (2) a separate analysis database. We initially envisioned running on an analysis database only, but again, requirements from @METS pushed us towards first supporting running the ETL on the same database (the OpenMRS transactional database). Since the ETL runs on the same OpenMRS database, the data can be directly accessed by reports (we are currently doing this in Rwanda).
But we are working to support both approaches.
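For context, when the ETL output lives in the same OpenMRS database, a report can read the flattened tables directly with plain SQL and no intermediate service; a hedged sketch with hypothetical table and column names (`derived_pepfar_hts`, `hts_result`, `age_group`):

```sql
-- Hypothetical report query against a derived (flattened) table,
-- run directly on the OpenMRS database.
SELECT age_group,
       COUNT(*) AS clients_tested_positive
FROM derived_pepfar_hts
WHERE hts_result = 'Positive'
  AND encounter_datetime >= '2023-01-01'
GROUP BY age_group;
```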
The APIs we are building are intended to support cases where one doesn’t want to make native queries to access the analysis data, and we are actually using them to support some of our summary dashboard indicators in O3 (@OHRI).
I’m happy to discuss more ways to improve the ETL, and I’m hoping it can be adopted by more community members soon.
Thanks @eudson for further clarifications and your work on this!
Given the requirements I do understand some of the choices you made.
I hope implementers progressively transition to Docker deployments, which simplify running and upgrading different services.
Are extraction and transformation run entirely on the DB side, or do you run some logic on the Java side as well?
What versions of OpenMRS does your module support? I see it uses openmrsPlatformVersion set to 2.3.1. Is there a particular reason it can’t be run on older versions?
It was my assumption that the maintenance of the SQL procedures is on you, i.e. the developers of the engine. You need a decent amount of testing to make sure the procedures will not fail on various edge cases. Thus I was curious about your testing approach and how you troubleshoot when things go wrong. Do you expose logs for the running procedures in OpenMRS, or does one need to look at the DB logs directly? How do we test the scripts against different OpenMRS versions?