ETL: Sharing OHRI's Mamba-ETL solution

The @OHRI/@UCSF team has been working on an approach to flatten the OMRS EAV data model into queryable relational DB tables (if I understand correctly).

@dkayiwa, @amugume and I thought others would like to hear about this work. We are also very curious to hear folks’ feedback. Personally, I think our community needs a generic or reference-able EAV -> relational DB ETL offering to help small orgs get started with more difficult queries. Could this work provide OMRS implementers with such a starting place?

A big thank you especially to @amugume, both for his permission to share the video recording of his presentation, and for being very gracious in answering my questions :slight_smile:

Credit and thanks also to @aojwang (whom Arthur also credits in the recording) and members of the PIH PETL team like @mseaton, as I believe their work substantially contributed to the design of this ETL solution for OHRI.

I know @bashir and @akimaina have been doing a lot of work in the Analytics engine squad to prepare for settings that need a performant solution that can scale with very large volumes of data. So performance with large data sets was my first question. @amugume and @ssmusoke are going to test this with a large sample dataset from UgandaEMR and will get back to us.

CC @burke @ibacher @jdick & welcome others…


@reagan I’ll create a spike ticket internally for you to have a look at those ETL scripts. This could come as additions to the data extracts that we have come up with for Ozone.

Cc @achachiez


@grace @amugume thanks for bringing this to everyone’s attention. Could you share with me a link to the video you referred to?

Will do asap @reagan - sorry for the delay; the internet here is so slow that the video has been uploading since last night. It’s 70% uploaded right now…

Another interesting thing to note is that there is a kind of business-interpretation layer that runs immediately after the OHRI ETL, still as part of the pipeline:

Question: Why handle this here if you have a BI Analytics or Data Viz Tool at the end of your pipeline?
Answer: Because many implementers have found that BI tools can accidentally be set up incorrectly: an M&E team may think the charts are fetching the same data when in fact the labelling/chart set-up has been misunderstood. The idea of this interpretation layer is to add a bit more order to the chaos, so that people using the BI tools further down the line are less likely to make set-up mistakes, while still having the freedom to ask the questions and build the reports they need.

Example: CSV logic for how patients of a specific age should be aggregated, depending on the report (e.g. DATIM 2022, MER 2022 vs MER 2021, etc.). For instance, last year PEPFAR updated their MER report requirements so that for some indicators (not all) someone aged 66 should go in a 65+ group; in others, they would be counted in the 55+ age bracket.
This CSV allows someone like a BA to update the logic when updated guidelines come out (and you can add unlimited columns to keep supporting historical aggregate definitions).
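
To make the idea concrete, here is a minimal Python sketch of how such a CSV-driven lookup might work. The column names (min_age, max_age, mer_2021, mer_2022) and the bucket boundaries are assumptions made up for this example, not taken from the OHRI code:

```python
import csv
import io

# Hypothetical age-group definition CSV: one column per reporting standard,
# so a BA can add a new column when guidelines change while keeping the
# historical definitions intact.
AGE_GROUP_CSV = """\
min_age,max_age,mer_2021,mer_2022
50,54,50-54,50-54
55,64,55+,55-64
65,120,55+,65+
"""

def age_group(age: int, report: str) -> str:
    """Look up the aggregation bucket for an age under a given reporting standard."""
    for row in csv.DictReader(io.StringIO(AGE_GROUP_CSV)):
        if int(row["min_age"]) <= age <= int(row["max_age"]):
            return row[report]
    raise ValueError(f"No age group defined for age {age}")

# A 66-year-old lands in '55+' under MER 2021 but '65+' under MER 2022.
print(age_group(66, "mer_2021"))  # 55+
print(age_group(66, "mer_2022"))  # 65+
```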

@amugume did I represent this fairly? Is there a specific place in your code you can point folks to if they want to better explore this layer?

In Mamba ETL we consider two distinct layers of ETL:

  • Base: tables are laid out as much as possible in a dimensional model, but without any transformations or algorithmic computations, i.e. nothing where a value depends on more than one column. An age computation doesn’t belong in base. You can expand F and M into Female and Male, and you can correct basic data problems, such as WHK into Windhoek, but not much else. So base is a foundation layer without any computations, and it is not dependent on context.
  • Derived: this is where computations, groupings and other algorithms are applied. Age groups can happen here, as can determinations of viral load status, interpretations of test results, etc. Derived is context-specific, meaning its definitions depend on the context. For example, you can have an MOH derived layer and a PEPFAR derived layer. Both can have different age groups, different ways to determine viral suppression, different aggregations, etc. This allows for full flexibility without getting in each other’s way. (A small sketch contrasting the two layers follows this list.)
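
To illustrate the distinction, here is a small Python sketch contrasting the two layers. The recode maps and the MOH/PEPFAR age bands are invented for the example; the actual definitions belong to each implementation’s derived layer:

```python
# Base layer: cell-level clean-up only. Every output value depends on a single
# input column, and the rules are the same regardless of reporting context.
BASE_RECODE = {
    "sex": {"F": "Female", "M": "Male"},
    "city": {"WHK": "Windhoek"},
}

def base_transform(row: dict) -> dict:
    out = dict(row)
    for column, mapping in BASE_RECODE.items():
        if column in out:
            out[column] = mapping.get(out[column], out[column])
    return out

# Derived layer: context-specific computations, e.g. different age bands
# for an MOH report versus a PEPFAR report (the bands below are illustrative).
AGE_BANDS = {
    "MOH": [(0, 14, "0-14"), (15, 49, "15-49"), (50, 200, "50+")],
    "PEPFAR": [(0, 14, "<15"), (15, 64, "15-64"), (65, 200, "65+")],
}

def derived_age_group(age: int, context: str) -> str:
    for low, high, label in AGE_BANDS[context]:
        if low <= age <= high:
            return label
    raise ValueError(f"No band for age {age} in context {context}")

row = base_transform({"sex": "F", "city": "WHK", "age": 66})
print(row)                                      # codes expanded, no computations
print(derived_age_group(row["age"], "MOH"))     # 50+
print(derived_age_group(row["age"], "PEPFAR"))  # 65+
```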

For OHRI, the base layer flattens encounters. We have one encounter type per form, so base essentially flattens a form into a wide format, i.e. one row per submission. For now, you need to specify which columns you want to include by listing their concept IDs (and the target column names). We could also consider flattening an entire encounter, though (i.e. all the concepts in a form).
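
The flattening itself happens inside the ETL pipeline against the database; the Python sketch below only illustrates the shape of the transformation, using made-up concept IDs, column names and obs values:

```python
from collections import defaultdict

# Implementer-provided mapping of concept IDs to target column names
# (both the IDs and the names here are invented for illustration).
COLUMN_MAP = {
    "1000AAAA": "hiv_test_result",
    "1001AAAA": "weight_kg",
}

# EAV-style obs rows: (encounter_id, concept_id, value)
obs = [
    ("enc-1", "1000AAAA", "Positive"),
    ("enc-1", "1001AAAA", "72"),
    ("enc-2", "1000AAAA", "Negative"),
]

def flatten(obs_rows):
    """Pivot obs into one wide row per encounter (i.e. per form submission)."""
    wide = defaultdict(dict)
    for encounter_id, concept_id, value in obs_rows:
        if concept_id in COLUMN_MAP:  # only mapped concepts become columns
            wide[encounter_id][COLUMN_MAP[concept_id]] = value
    return [{"encounter_id": enc, **cols} for enc, cols in wide.items()]

for row in flatten(obs):
    print(row)
# {'encounter_id': 'enc-1', 'hiv_test_result': 'Positive', 'weight_kg': '72'}
# {'encounter_id': 'enc-2', 'hiv_test_result': 'Negative'}
```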


Thank you very much Fitti for this very clear explanation! The whole concept of separating the “base” from the derivations seems especially helpful for preventing the ETL from being fragile amidst changes to business rules.

@reagan here is the video (and i’ve added it to the original post above as well):

@mayanja off-topic: no way on earth would I have figured that Mayanja is your other name. It’s quite a name to mean “great lakes”.


We just had another presentation around this same topic, and one of the questions that came up quickly was how we deploy this for use cases where the system is distributed across multiple servers / health facilities. I know @PIH has done some work around this and we will be looking at it, but if you have any ideas please share.

cc: @dagimm @alaboso @amugume

@eudson or @amugume, can you also clarify: is this approach for a SINGLE database, or would an implementation also have to set up a separate analytics database to use this Mamba-ETL solution?