Thoughts on designing a data/metadata driven out of the box ETL solution for OpenMRS

aojwang · January 16, 2020, 9:05am

Background

Many OpenMRS distributions are now resorting to the use of flat tables for reporting needs. These tables serve the needs for both standard and ad-hoc reporting and are in most cases manually created from raw SQL queries designed and executed by developers/database specialists. The process involves writing queries to create and populate tables and incrementally update the tables with changes in the transactional database, openmrs.

Different distributions, however, design the flat tables differently.

Method 1: Report(s) driven tables with pre-calculated variables - This approach produces tables tailor-made to feed specific reports and will in most cases contain a number of pre-calculated variables. A table is designed to contain variables collected from multiple data entry forms.

This approach has proved to be faster to design and execute and additionally produce ready values for the target report(s). The downside is that it doesn’t readily support ad-hoc report requests since not all data variables make it to the tables.

Method 2: One-table-per-form - This approach produces a table for every configured form in the system with a table column pointing to a data element in a data entry form. A table is, therefore, designed to store raw information from data entry forms.

This approach readily supports ad-hoc reporting since the tables are designed to store all data variables in their raw forms as recorded in the data entry forms. The one-table-per-form also simplifies the architecture of the reporting tables to end-users for any ad-hoc data extraction needs. Flattening of all forms is, however, slow compared to method 1 above.

Method 3: A hybrid of methods 1 and 2 - This will result in a set of tables generated based on method 1 and another set based on method 2.

All the approaches simplify the OpenMRS data model to the end-user and make it easier and faster to extract data and reports.

Exploring an automated design for an out-of-the-box ETL solution for OpenMRS

The manual process of designing and developing ETL queries is time-consuming, tedious, complex, and not scalable, and as a community, there is a need to design an out-of-the-box ETL solution that every distribution can adapt and configure as required. In my thoughts, there are two possible ways through this we can achieve this and they include:

Exploring OpenMRS’ form technology i.e. HTML form entry and JSON form generators, and use the generated markup to inform the design and development of ETL tables.
Decompose the obs table into row-per-encounter/row-per-form through the support of the concept dictionary. This is ideal for distributions that don’t use the standard form technologies for data entry

These are not refined conclusions and still requires a lot and focussed deliberations for a good product design. I did a small prototype on HTML forms in KenyaEMR based on method 1 and the write-up can be accessed here.

Please note that there were deliberations on this topic during the conference in Mozambique and the feedback was amazing.

What do others think about this subject? @burke @ssmusoke @ningosi @jdick @mseaton @mogoodrich @nkimaina @wanyee @patryllus @dkayiwa

wanyee · January 16, 2020, 11:53am

Glad to see you’ve posted this Antony - you and I have discussed this a lot and you know what I think. Waiting to roll up sleeves and get this going!

mksd · January 16, 2020, 1:55pm

Thanks @aojwang.

Have you seen what’s been done with Bahmni Mart?

I’m confused about this focus on forms. Forms are just a medium to record data. In a framework like HFE they just generate a nested structure of one encounter and multiple obs. And that’s what’s needed, and hence that’s what’s flattened for analysis.

mseaton · January 16, 2020, 2:46pm

Thanks @mksd. I think the form focus is likely because they are the closest thing we have to an actual data model that describes a particular implementation - i.e they do have a schema as to what data is allowed, how it is structured, what options are available and what labels are used, etc. What is in the actual encounter and obs tables may not match (eg. some questions may never have been answered, or some questions that were previously asked may no longer be relevant).

Can you describe how Bahmni Mart has approached this? Is there any reason why this is Bahmni specific as opposed to having been billed as a generic OpenMRS reporting tool?

aojwang · January 17, 2020, 6:43am

@mksd I haven’t had a keen look into the Bahmni mart but I know they have something similar. Maybe @angshuonline can help shed more light on this. I may be wrong but I think Bahmni uses some sort of metadata in the form of templates and these are used to generate the tables.

mksd · January 17, 2020, 7:53am

In a nutshell there is

A microservice that queries OpenMRS every so often to extract flattened data, that’s the ETL process basically, externalised as that microservice.
This flattened data is fed to a reporting analytics platform, Metabase in this case. I guess it assumes that the reporting platform is “nearby”, so running on the same server locally.

@angshuonline and others (@mksrom) to confirm and expand. With Bahmni Mart the configuration as to how the data is flattened is made in 1. Right now the out-of-the-box config for the original requester of this feature (MSF).

P.S. There is no reason for it to be Bahmni-specific and actually I don’t think it is set aside the branding, but again, others to confirm.

At ICRC we are coming up with a different approach for DHIS2. We use the good old Reporting module to do the ETL. As such it is just a handful of SQL dataset definitions that are used every so often to generate the CSV dump that DHIS2 expects. The wit was in the way we used special concepts and their attributes to parameterise the SQL queries and keep them very generic and configurable to target DHIS2 indicators en masse.This piece of work will be packaged as a reusable module when out of its alpha stage.

@amine to provide more details on this approach.

P.S. Makes me think that this is yet another case of using a special secondary source of concepts for a very specific purpose, cfr this.

owais.hussain · January 17, 2020, 12:22pm

Hi @aojwang, thanks for bringing this up. My two cents on each method:

Method 1: This approach works great in settings where you are reusing metadata from some standard package. Had there existed program-specific modules, this one made more sense. However, from the perspective that OpenMRS is a platform, the real question is do we know what “facts” will the users be interested in knowing?

Method 2: If I remember correctly, almost every group that attended the ETL unconference session in Maputo told that they use table flattening per form, and then build Ad-hoc reports on top of these tables. This approach works for most, and seems to be the right fit. But there’s a pitfall that if we periodically drop-create flat tables, the performance is badly hurt on scale. On the other hand, create-once and incrementally update approach leaves little flexibility for changes in the forms.

Method 3: A hybrid approach requires input from various groups, and even then we’d be executing some aggregate queries which may not be used largely. This is because every installation has different objectives and contexts, determined by their KPI’s. A single-user implementation would be least interested in the user/provider stats; a multi-user, single-facility implementation may need some level of user/provider stats; multi-user, multi-facility will be primarily focusing on facility stats, and drill down on user level; a facility + community based program may have a different set of KPI’s.

Also, apart from flattening form data, we can also create denormalized tables for patients and facilities with their attributes.

dkayiwa · January 17, 2020, 12:48pm

How about dropping and recreating only when the form schema changes?

aojwang · January 17, 2020, 2:31pm

Thanks, @mksd. This was helpful

burke · January 17, 2020, 9:29pm

Great conversation! Thanks for this post, @aojwang.

I favor method 3 (hybrid approach). More specifically, I wouldn’t try to design an ETL solution that meets every need out of the box; rather, I’d suggest something like:

Create a docker-based solution
Create a process for defining a transform that the solution will generate & update. Preferably, configuring existing Apache tools.
Use that process to define some useful transforms (like flattened encounters).
Add an engine to automatically use this process to auto-flatten forms.
As a proof of concept, use this approach to generate the transforms needed for PEPFAR report(s) that everyone could use.

Then, implementations could adapt the solution to meet their needs by adding transforms the same way we defined the transforms that are included out of the box and we are relieved of the burden of needing to define every transform needed in the community.

nkimaina · January 20, 2020, 6:41am

Thanks Antony for following up on this and posting it! I agree with Burke. I would favor approach 3, being a hybrid approach. We could have some ‘base’ tables, where encounters, obs and other frequently used models are flattened (done preferably using a schema-less approach). It’s actually possible to store your data without schema, and apply schema only when querying. It’s worth looking into this. We can then create the more specific datasets like the PEPFAR datasets. These will have schemas. It’s important to consider tools built for these things, especially if the solution is to scale. The tools selected should allow incremental updates as well as complete rebuilding of these sets. The tools/solution should allow one to scale horizontally, as this is important for large implementations.

dkayiwa · January 20, 2020, 12:16pm

Do you have the time to share about why you prefered this to the bahmni approach for DHIS2 integration?

mksd · January 20, 2020, 12:21pm

We have 500 indicators and counting, we needed an (very) efficient way to produce updated indicators at least once every 24h.

dkayiwa · January 20, 2020, 12:21pm

Do you mind sharing some bit of details about this approach?

mozzy · January 20, 2020, 6:01pm

Am also seeing some sort of related discussion here

mozzy · January 20, 2020, 6:05pm

some work alredy done at Ampath

just FYI @aojwang

kuang · January 29, 2020, 7:48pm

Hi all,

Folks at AMPATH (JJ and Allan) and Antara are going to tackle something along these lines in the next 90 days. We had the following thoughts:

We will design a friendly-to-analytics schema for the data that is NOT the schema of the forms (1 option in the thread), but something more intentional for analytics, but not specific to an implementation. 2. In our brainstorm, a sparse 4D data frame that is roughly Person x Concepts x Time x Location — as a strawman starting point
We’d love your input to design such a data frame
There are existing and purpose built mechanisms for populating the dataframe, whether via openmrs-rest-api, or database query instrumented with Airflow, or one of the stream engines like Kafka (flink or nifi or ?)
Maybe if there’s enough interest, we can create a contest from various members of community, each taking a different POC approach, and we run it on a large data set (i.e. AMPATHs database?) and decide by looking at the performance characteristics
On the other side of this data frame definition, we’d like to create a set of convenience functions for aggregating the Obs-dataframe over time, locations, cohorts of patients and groups of concepts — maybe this is presented as a Jupyter/Spark notebook

Curious your thoughts.

aojwang · January 30, 2020, 4:17am

Thanks, @kuang for this information. I know most implementations use the OpenMRS reporting framework to build reports in the system. Does your approach support this? Could you also share a list of the tools that you will use for this so that we can research and understand how they work?

Thanks