Design Forum 2018-04-23: Data Warehousing and Analytics

Since filling out the Doodle poll, it turns out that we scheduled a Bahmni call in this Wednesday slot that Pramida and I need to be at (since it’s about our upcoming release).

(I’m not sure the best way around that.)

It looks like we could do an hour earlier (1-2pm UTC).

If you were unable to join our Data Warehousing and Analytics brainstorming discussion, you can view it here:

Notes here

@burke were there any next steps?

Unfortunately, I couldn’t attend this yesterday. I wanted to share our experience with a country-wide implementation in Pakistan, which may help in designing the DW function.

Just like everyone else, we’re flattening the tables too. There’s a program that runs overnight and does the following:

  1. Update dimension tables (dim_patient, dim_concept, dim_datetime, dim_location, dim_encounter, etc.). These tables are not recreated, but rather updated. Important dimensions are:
  • Date and Time
  • Patient
  • Location (for facility based implementations)
  • User (for community based implementations)

We can later build cubes to answer queries like:

  • No. of Patients in a specific Time duration, being treated in a specific Facility
  • No. of Patients in a specific Time duration, screened by a specific Health worker
  2. Flattening Forms (usually Encounters and Obs). These tables are reset each time, because there may be changes in the forms that need to be reflected.

  3. Facts. Fact tables include keys from dimensions and aggregated numbers; since every implementation has different needs, it is very difficult to come up with a generic set of facts. (See the sketch right after this list.)
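
To make this concrete, here’s a minimal sketch of the kind of dimension/fact layout described above, plus the aggregate that answers the “patients per facility per period” question. I’m using Python with an in-memory SQLite database purely for illustration; all table and column names here are simplified placeholders, not the ones any real implementation runs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: updated incrementally each night, never dropped.
cur.executescript("""
CREATE TABLE dim_datetime (datetime_key INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);
CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, facility_name TEXT);
CREATE TABLE dim_patient  (patient_key  INTEGER PRIMARY KEY, patient_identifier TEXT);

-- Fact table: dimension keys only in this sketch; real facts would also carry measures.
CREATE TABLE fact_visit (
    datetime_key INTEGER REFERENCES dim_datetime (datetime_key),
    location_key INTEGER REFERENCES dim_location (location_key),
    patient_key  INTEGER REFERENCES dim_patient (patient_key)
);

INSERT INTO dim_datetime VALUES (20180401, '2018-04-01', 2018, 4), (20180415, '2018-04-15', 2018, 4);
INSERT INTO dim_location VALUES (1, 'District Hospital A'), (2, 'Rural Health Center B');
INSERT INTO dim_patient  VALUES (10, 'PAT-001'), (11, 'PAT-002');
INSERT INTO fact_visit   VALUES (20180401, 1, 10), (20180415, 1, 11), (20180415, 2, 10);
""")

# "No. of patients in a specific time duration, being treated at a specific facility."
cur.execute("""
SELECT l.facility_name, d.year, d.month, COUNT(DISTINCT f.patient_key) AS patients
FROM fact_visit f
JOIN dim_datetime d ON d.datetime_key = f.datetime_key
JOIN dim_location l ON l.location_key = f.location_key
WHERE d.full_date BETWEEN '2018-04-01' AND '2018-04-30'
GROUP BY l.facility_name, d.year, d.month
""")
for row in cur.fetchall():
    print(row)  # e.g. ('District Hospital A', 2018, 4, 2)
```

The fact table here only captures visit-level grain; in practice each implementation adds its own measures, which is exactly why a generic set of facts is so hard to define.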

What I learned from the discussion log (thanks for that, @burke) and the last implementer conference is that almost everyone is facing one or more of the issues listed below:

  1. DW being reset daily wastes time and server resources.

  2. Handling updates is more challenging than new records.

  3. The hierarchical structure of parent-child records adds complexity to flattening, as demonstrated by @pramidat.

  4. Attributes are saved in a repetitive fashion, so a person attribute “National ID” can store multiple copies, e.g. “PK-12345” and “PK-98765”. Usually we’re interested in the latest value only, so when flattening attributes we pick only the latest value (see the first sketch after this list); but what about the previous values?

  5. There’s usually some Java job running as a daemon, or a Bash script called from a cron job, to control the DW jobs. These jobs should be controlled via OpenMRS (like the reporting module does).

  6. The data is analyzed via Pentaho, Power BI, etc.; the reporting module should also provide an interface to generate reports from the DW.

  7. Importing data from multiple installations opens up a whole new set of challenges. For example, should a separate copy of metadata be maintained for each installation, or do we assume that the installations are using the same concept dictionary, encounter types, etc.? What about locations and users? A hospital’s use case is different from a medical camp’s.

  8. Obs groups are troublesome. For example, a multi-select question can be stored as an obs group, but when it comes to flattening, we have to live with workarounds like storing multiple selections in a single cell of the flattened table, or generating a column for each possible answer (taking a leaf from text mining); see the second sketch after this list.
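
Since issue 4 comes up in every implementation, here’s a rough sketch of how “latest value wins” flattening for person attributes could look. Again Python + SQLite purely for illustration; the person_attribute layout below is a simplified stand-in, not the real OpenMRS table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Simplified stand-in: a person can accumulate several rows for the same
# attribute type over time (e.g. two "National ID" values).
cur.executescript("""
CREATE TABLE person_attribute (
    person_id INTEGER, attribute_type TEXT, value TEXT, date_created TEXT
);
INSERT INTO person_attribute VALUES
    (1, 'National ID', 'PK-12345', '2017-01-10'),
    (1, 'National ID', 'PK-98765', '2018-03-02');
""")

# Flattening keeps only the newest value per (person, attribute type);
# the older values are simply not carried into the flat table.
cur.execute("""
SELECT pa.person_id, pa.attribute_type, pa.value
FROM person_attribute pa
JOIN (
    SELECT person_id, attribute_type, MAX(date_created) AS latest
    FROM person_attribute
    GROUP BY person_id, attribute_type
) last
  ON last.person_id = pa.person_id
 AND last.attribute_type = pa.attribute_type
 AND last.latest = pa.date_created
""")
print(cur.fetchall())  # [(1, 'National ID', 'PK-98765')]
```

And for issue 8, the two workarounds (squeezing the selections into one cell versus generating one column per possible answer) could look roughly like this; all names are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One multi-select question stored as an obs group (group 100) with two selected answers.
cur.executescript("""
CREATE TABLE obs (obs_id INTEGER, obs_group_id INTEGER, concept TEXT, value TEXT);
INSERT INTO obs VALUES
    (1, 100, 'Symptoms', 'Cough'),
    (2, 100, 'Symptoms', 'Fever');
""")

# Workaround A: collapse the whole group into a single delimited cell.
cur.execute("""
SELECT obs_group_id, GROUP_CONCAT(value, '|') AS symptoms
FROM obs GROUP BY obs_group_id
""")
print(cur.fetchall())  # e.g. [(100, 'Cough|Fever')] (concatenation order not guaranteed)

# Workaround B: one indicator column per possible answer.
cur.execute("""
SELECT obs_group_id,
       MAX(CASE WHEN value = 'Cough' THEN 1 ELSE 0 END) AS symptom_cough,
       MAX(CASE WHEN value = 'Fever' THEN 1 ELSE 0 END) AS symptom_fever
FROM obs GROUP BY obs_group_id
""")
print(cur.fetchall())  # [(100, 1, 1)]
```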

A quick answer to the query from @darius about Spring Batch vs. Pentaho: Spring Batch simply provides parallelism; PDI (Pentaho Data Integration), on the other hand, provides a plethora of ETL tools. You can even execute Java programs as Kettle jobs. I would suggest going with PDI; there’s some learning curve, but it’s a proper big-data-ready DW tool.
