Creating a Bahmni Mart database for analytics

Hi,

The database used in Bahmni/OpenMRS has an EAV data model with hierarchical data (for example, the obs table). This makes it difficult to plug the database into analytical tools (Tableau, Stata, etc.), whose support for unstructured hierarchical data is limited, and it multiplies the existing complexity. It is very difficult for implementation engineers to extract data from existing production databases without fully understanding the OpenMRS data model. For an implementation to generate indicators or reports, a lot of development work has to be done writing SQL to produce the data from which the indicators can be extracted. One of the solutions we have thought through is to flatten the hierarchical database. This flattened database can be connected to analytical tools (Tableau, Stata, etc.) for further analysis of the data. The generation of the analytical database will be common across all implementations. We rely on the OpenMRS data model to keep it as generic as possible. More details about the approach and implementation are available at
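To make the flattening idea concrete, here is a minimal sketch (not Bahmni Mart code; the table, concept and column names are invented) of pivoting an EAV-style obs table into one wide row per encounter:

```python
import sqlite3

# Toy EAV obs table: one row per (encounter, concept) pair,
# a greatly simplified stand-in for the OpenMRS obs table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE obs (encounter_id INT, concept_name TEXT, value TEXT);
INSERT INTO obs VALUES
  (1, 'Weight', '70'),
  (1, 'Height', '172'),
  (2, 'Weight', '65');
""")

# Flatten: pivot each concept into its own column, one row per encounter.
# Analytical tools (Tableau, Stata, ...) can consume this table directly.
conn.execute("""
CREATE TABLE vitals_flat AS
SELECT encounter_id,
       MAX(CASE WHEN concept_name = 'Weight' THEN value END) AS weight,
       MAX(CASE WHEN concept_name = 'Height' THEN value END) AS height
FROM obs
GROUP BY encounter_id
""")

rows = conn.execute("SELECT * FROM vitals_flat ORDER BY encounter_id").fetchall()
print(rows)  # [(1, '70', '172'), (2, '65', None)]
```

Real forms have many more concepts per encounter, which is exactly why generating these tables from metadata, rather than hand-writing the SQL per form, matters.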

Please go through it and let us know your comments.


Sounds very interesting @pramidat!

Just for context, is this something that your team is planning to build for a client soon, and you’d like to get feedback up front on the design and approach, so that what you build is more generally applicable across Bahmni implementations?

Do you have a rough timeline in which feedback is especially valuable?

Yes, Darius. We are building it for a client and would like to get feedback on the design and approach. It would be helpful if we could get the feedback within a week.

We tried this for Kenya by splitting the obs table into multiple sub-tables per encounter type for reporting purposes, using the reporting framework, with no intent of exporting it to statistical packages for analysis. In Mozambique we have used Pentaho to create specific tables for reporting purposes. It would be great to have a generic way that can be plugged in by anyone with little effort.

@ningosi Good to hear. Can you please go through the document and see if the flattened database suits your needs?

@pramidat, thanks for creating this discussion. I am very interested to see where this leads, and agree with those who indicate that this is something that would be great to design out for any “generic” OpenMRS implementation, not just one based on Bahmni.

At PIH, we have started taking a similar approach, though in our current iteration we use the Pentaho Kettle libraries as the basis for our data processing pipelines. These were chosen because they enable a graphical interface for defining jobs and are accessible to a broader range of users than our software development team. Although we have not yet fully implemented a set of flattened tables per form, where each Obs represents a column value, we have started down this path with other parts of the data model, namely flattening much of person/patient/person_name/person_address/patient_identifier/person_attribute into a single “patient” table.

We have a lightweight Spring Boot application that we have authored that runs our jobs (and could ultimately provide some level of a web application on top of it, though this is not done yet). That code can be seen here. This then runs one or more jobs defined in our collection of Pentaho jobs and transforms, which are available here.

The documentation is probably not super up-to-date or comprehensive, but this is all available for sharing. In theory, one should be able to execute the load-from-openmrs job on a generic OpenMRS installation, after configuring the appropriate settings in their config file.

As @darius has mentioned in some of his commentary, we are looking now at how we might evolve into a solution that does more progressive updates over time, as currently this runs nightly (or weekly) on a given database. Whether or not we stick with Pentaho for our framework long-term is up for debate. Like you, we really just needed something more accessible to people for whom the OpenMRS data model was too complicated to query against, and this was an attempt to accomplish that.

I would definitely like to stay involved in learning about and/or brainstorming and/or contributing solutions that can generally solve these kind of problems within OpenMRS.

Best, Mike


I understand that this is an effort primarily led by a team working with Bahmni and that the primary concern is therefore to end up with an adequate solution for Bahmni, which we will be keen to benefit from.

However, as I attempted to say over the last PAT call, let’s not close the door on having this be used with OpenMRS in general. Let’s hear what the senior devs and OpenMRS architects have to say about the database ‘flattening’ that will be undertaken here. It will only cost an hour of a developers forum to hear their thoughts about it.


Let me give you one real-world experience that we don’t want to endure twice as implementers. We have one non-Bahmni implementation that aims to transition to Bahmni… when the appropriate resources become available to do so, if ever. In the meantime they would like to use appointment scheduling, which is available outside of Bahmni. However, we are reluctant to enable this feature, as it would not be compatible if/when they transition to Bahmni, leading to a data migration headache bigger than the one they already face. This is a “low-resource deadlock” that arose because, when it came down to it, the developers behind the appointment scheduling feature of Bahmni did not really consider recycling/extending the existing appointment scheduling module.

I want at least to raise the alarm when it comes to important and impactful features; the people involved in this thread will converge on whatever decision they deem fit.

This is a good point Dimitri.

I propose that we discuss this on Monday’s OpenMRS Design Forum (4pm UTC, 9:30pm IST, see more) so that we can provide at least some opportunity for OpenMRS folks to give input, while respecting the team’s desire to move fast. @pramidat would you be able to join this call? (If not I still think it’s worthwhile to do a group OpenMRS review of the document.)

(@jthomas, @burke FYI. This would preempt the OCL topic that I currently have scheduled to discuss on Monday.)

+1000 to trying to make core features for Bahmni compatible with other distributions.

@jdick and @nkimaina have put a lot of thought into this area (example), so it might be good to see if they can join.

Proposing design time topics via om.rs/designtime also increases the chance others will see it.

Sure @darius I can join the call. Please let me know the details.

There are a few approaches we have seen and/or tried to deal with this generic problem of how to report performantly from schemaless data without repeatedly writing long SQL queries which are hard to maintain. There are three, each with its own tradeoffs:

  1. ETL to a schemaful model - which has been talked about in this thread, so I am not going into it. Here the complexity resides in the transform part of ETL. DHIS 2 does something like this.

Performance for the user is best in this one. Operational complexity is high, because any change in a form means regeneration of the tables and data affected by it. And if it is an 80-20 scenario, you would start wishing :slight_smile: that the forms which hold 80% of the data don’t change. Performance of creating the new database could be slow, but it usually works out fine. That said, I have also seen systems where such jobs take hours or days when the DB grows large. One can keep going down the rabbit-hole of trying to improve the performance of the ETL by making it smarter and smarter. One may imagine that this is fine because users can use the old database while the new one is cooking. But what is not always understood is that the extract process loads the production database, affecting the main users. Doing ETL from a replicated database will solve this.

  2. We tried this in Bahmni by introducing a level of indirection via small configuration support (in Bahmni it is in JSON). Basically, this configuration drives how the SQL is generated. So hopefully you write one type of SQL once, as a sort of template, and then generate different SQLs by configuring them. In this case, I would imagine the configuration would be keyed on form name. I would like to verify the performance of CASE WHEN-style SQL against the EAV model; I don’t remember this clearly from Bahmni.

  3. We have been trying this on another project, though the data model is not EAV but JSONB (Postgres). Define database views per form. The views are schemaful here. Since the data for each instance of a form is filled into one column, one can use a JSON expression to get the data out. I think this idea could be applied to the OpenMRS EAV model too, but one should check the performance of CASE WHEN first, because the data is in multiple rows. The jury is out on it, as we will know more over time, but I feel confident this idea has enough legs to be at least a valid approach. I threw this one out here so that we can think of different approaches.
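A minimal sketch of the per-form view idea, using sqlite’s json_extract as a stand-in for Postgres JSONB operators (the table and column names are invented, not from the project described above):

```python
import sqlite3

# Each form submission stored as one JSON document (a stand-in for a
# Postgres JSONB column; sqlite's json_extract plays the role of ->>).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE form_data (encounter_id INT, payload TEXT);
INSERT INTO form_data VALUES
  (1, '{"weight": 70, "height": 172}'),
  (2, '{"weight": 65}');

-- A schemaful per-form view over the schemaless JSON column.
CREATE VIEW vitals_view AS
SELECT encounter_id,
       json_extract(payload, '$.weight') AS weight,
       json_extract(payload, '$.height') AS height
FROM form_data;
""")

rows = conn.execute("SELECT * FROM vitals_view ORDER BY encounter_id").fetchall()
print(rows)  # [(1, 70, 172), (2, 65, None)]
```

The attraction is that nothing is materialized: the view always reflects the live data, at the cost of doing the JSON extraction at query time.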

In terms of tooling, I would recommend looking at Metabase. It is much more lightweight and modern, and it is improving fast.

A few things to watch out for, as they are not immediately obvious.

  • How do you model observations which have multiple coded answers? Essentially you may need to create extra tables for them in approach 1. Although I found that the number of observations which have multiple coded answers is quite small, so you can hold your nose and just create a few more tables anyway. It will not explode the number of tables for you. Or you can use comma-separated values etc. too - suboptimal for all scenarios.

Lastly, if I had a lot of money and time, I would choose 1. 2 & 3 are a bit cheaper, but test for performance first.
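For what it’s worth, the configuration-driven SQL generation idea described above can be sketched as a small template expander; this config format is invented for illustration and is not Bahmni’s actual JSON:

```python
# Hypothetical per-form config; Bahmni's real JSON format differs.
form_config = {
    "form": "vitals",
    "columns": {"weight": "Weight", "height": "Height"},
}

def flattening_sql(config):
    """Generate a CASE WHEN pivot query from a per-form configuration."""
    cases = ",\n  ".join(
        f"MAX(CASE WHEN concept_name = '{concept}' THEN value END) AS {col}"
        for col, concept in config["columns"].items()
    )
    return (
        f"SELECT encounter_id,\n  {cases}\n"
        f"FROM obs GROUP BY encounter_id"
    )

print(flattening_sql(form_config))
```

The point is that the SQL template is written once, and each new or changed form only needs a config entry, not a hand-maintained query.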

Call is right now at https://www.uberconference.com/openmrs

@pramidat sorry that I wasn’t able to share call details earlier, I didn’t read your message until we were about to start.

Discussion notes are here: http://notes.openmrs.org/2018-03-19-Design-Forum

Specific things to highlight:

  1. Don’t make it Bahmni-specific unnecessarily (E.g. don’t require CentOS for some reason)
  2. Most things (except obs) could have a common model in the data mart (e.g. patient)
    • don’t need to define all of these up front but as you do come across them while doing the work
    • Look at existing work (PIH to share their flattened model, we will request that AMPATH share any flattened examples)
  3. Prioritize extensibility, e.g. anyone should be allowed to add their own job, or disable the standard jobs. (We think this is already included in your design, but just want to make sure)
  4. Prioritize incremental handling (of at least the obs table). It is harder to justify this as a core part of the Bahmni product if it’s only suitable for smaller implementations who do analytics infrequently.

Sorry @darius, could not attend the call.

  1. Don’t make it Bahmni-specific unnecessarily (E.g. don’t require CentOS for some reason) -> We are dockerizing the application so that it can be run anywhere. Even while flattening we are considering non-Bahmni use cases
  2. Most things (except obs) could have a common model in the data mart (e.g. patient)
    • don’t need to define all of these up front but as you do come across them while doing the work -> Yes, we are joining few tables to have a common model. Apart from that we are also trying to come up with some generic views
    • Look at existing work (PIH to share their flattened model, we will request that AMPATH share any flattened examples) -> It would definiltely help. Can I get some point of contact.
  3. Prioritize extensibility, e.g. anyone should be allowed to add their own job, or disable the standard jobs. (We think this is already included in your design, but just want to make sure) -> Yes. This is included in our design
  4. Prioritize incremental handling (of at least the obs table). It is harder to justify this as a core part of the Bahmni product if it’s only suitable for smaller implementations who do analytics infrequently. -> we are looking into several possibilities for incremental updates. Will update once we find something

@mseaton, @mogoodrich, @toddandersonpih, @jdick, @nkimaina one of the points that came out of this call is that even if we don’t all standardize on a toolset for ETL to an analytics DB, maybe we can standardize on what a flattened data model looks like for common OpenMRS things.

Can you share the table structure (or nosql equivalent) of some of the tables in your analytics DBs?

@darius, all of our flattened tables are available at https://github.com/ampath/etl in either the flat_tables or calculated_tables directories.

For AMPATH, in our initial approach, we first run a process to create a flat_obs table of key=value pairs, where the key is the question concept_id and the value is a stringified version of the value_x column in the obs table. We then use this table as the basis to create other transformations (all in SQL). The problem with this approach is that any information within a nested structure allowed for by obs_groups is lost (not ideal). But this makes it much faster to do additional transformations AND we don’t hit the obs table again when transforming (we are using the same database for both OpenMRS and our ETL tables). Some examples of our “calculated” (i.e. transformed) tables can be seen in the calculated directory.
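The flat_obs idea of collapsing the typed value_* columns into a single stringified value could be sketched like this (a simplified toy schema, not AMPATH’s actual ETL code):

```python
import sqlite3

# Toy obs table with the typed value_* columns of the OpenMRS data model
# (greatly simplified); exactly one value column is set per row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE obs (obs_id INT, concept_id INT,
                  value_numeric REAL, value_text TEXT, value_coded INT);
INSERT INTO obs VALUES
  (1, 5089, 70.0, NULL, NULL),
  (2, 1234, NULL, 'free text', NULL),
  (3, 3774, NULL, NULL, 1107);
""")

# flat_obs: one stringified value per row, regardless of original type.
conn.execute("""
CREATE TABLE flat_obs AS
SELECT obs_id, concept_id,
       COALESCE(CAST(value_numeric AS TEXT), value_text,
                CAST(value_coded AS TEXT)) AS value
FROM obs
""")

rows = conn.execute("SELECT * FROM flat_obs ORDER BY obs_id").fetchall()
print(rows)  # [(1, 5089, '70.0'), (2, 1234, 'free text'), (3, 3774, '1107')]
```

This is what makes later transformations uniform (every value is a string keyed by concept), at the cost of losing type information and, as noted above, obs_group nesting.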

A possible place to start for the community as a whole might be a representation of an encounter object which could be used for further transformations. We are currently experimenting with an approach of mysql --> debezium (Kafka adapter for MySQL) --> kafka --> spark.

We are just in the beginning of coming up with representations for the encounters (@fali, perhaps you could share basic structure you’ve been working on).

@darius, I remember you raised this very question yourself a while back (though I don’t remember the thread) and suggested possibly using the REST API representation. We trialed this, but initial tests showed that hitting the REST API made the process slow (probably surmountable), so we’ve been trying to come up with a function that directly handles the obs data coming in from the binlog.

We have spiked on an approach using Pentaho. Our model is not particularly sophisticated, but the initial aim has been to simplify the OpenMRS data model such that someone coming at the data would be able to make sense of things relatively easily, and get up to speed quickly in order to do their own analyses. Unlike the approach that @jdick describes for AMPATH, it was not designed with performance as its initial goal, but rather accessibility and approachability of the data (eg. demystifying OpenMRS).

Our first stage transforms into a generic flattened structure:

The main things this aims to accomplish are:

  • Flattening person, patient, person_name, person_address, person_attribute, preferred patient_identifier into a single “patient” table.
  • Flattening metadata into varchar fields (eg. encounter.type = “Adult Initial”, rather than a foreign key into the encounter_type table)
  • Moving Obs Groups into a separate table from Obs
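The metadata-flattening step in the second bullet amounts to joining in the name and storing it denormalized, so readers never need to chase foreign keys. A toy sketch (not PIH’s actual pipeline; names simplified):

```python
import sqlite3

# Simplified OpenMRS-style metadata: encounters reference encounter_type
# by id, which is opaque to an analyst browsing the data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE encounter_type (encounter_type_id INT, name TEXT);
CREATE TABLE encounter (encounter_id INT, encounter_type INT);
INSERT INTO encounter_type VALUES (1, 'Adult Initial'), (2, 'Adult Followup');
INSERT INTO encounter VALUES (100, 1), (101, 2);
""")

# Flattened view: the human-readable type name is stored inline.
rows = conn.execute("""
SELECT e.encounter_id, et.name AS type
FROM encounter e
JOIN encounter_type et ON e.encounter_type = et.encounter_type_id
ORDER BY e.encounter_id
""").fetchall()
print(rows)  # [(100, 'Adult Initial'), (101, 'Adult Followup')]
```

The denormalized varchar is redundant from a modeling standpoint, but it is exactly what makes the mart approachable to people who don’t know the source schema.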

From here, we (optionally) add further pipeline stages that transform data into implementation-specific tables. Some examples of these for our Malawi implementation can be seen here:

Mike

Hi all,

Here is the JSON representation of an encounter that tries to group all obs under an obs group:

{
  "encounter_id": "63",
  "obsGroups": [
    {
      "obsGroupId": "2354728",
      "obs": [
        {
          "obs_id": "236",
          "person_id": "81",
          "concept_id": "3774",
          "valueCoded": "1107"
        },
        {
          "obs_id": "232539",
          "person_id": "81",
          "concept_id": "62438",
          "valueCoded": "11475"
        }
      ]
    },
    {
      "obsGroupId": "2354729",
      "obs": [
        {
          "obs_id": "236790583",
          "person_id": "81",
          "concept_id": "12268",
          "valueCoded": "1107"
        }
      ]
    },
    {
      "obsGroupId": null,
      "obs": [
        {
          "obs_id": "2360579",
          "person_id": "81",
          "concept_id": "6578",
          "valueCoded": "1175"
        },
        {
          "obs_id": "2360584",
          "person_id": "81",
          "concept_id": "9610",
          "valueCoded": "1175"
        },
        {
          "obs_id": "2360586",
          "person_id": "81",
          "concept_id": "6174",
          "valueCoded": "1107"
        }
      ]
    }
  ]
}

These are the schemas we have in the mart database. This sheet has all the details. Please go through it and feel free to comment on it.

@mahitha thanks for sharing this. There are a couple of comments that Angshu and I made by voice last week, that I’ll mention here too, since I don’t see them incorporated…

The end user should not have to know what is in “person_details” versus “person_information” versus “person_address” versus “patient_identifier”. I would expect a single flattened table combining all single-value-per-patient things. (This includes things that you currently have split across person_details, person_address, patient_identifier, patient_allergy_status, and person_information.)

Similarly for other parts of the domain model, I would expect that anything that can be flattened is flattened. E.g. encounter_provider_and_role can probably be combined into patient_encounter_details (unless we’re supporting >1 provider for a given role in the encounter).

And still looking there, I see you have a provider table (and encounter_provider_and_role refers to it). I would expect you to flatten the provider id and name (e.g. provider.person.*_name) into the encounter-related table. (For things like this it probably makes sense to have both the provider_id and also the flattened version.)

For “attributes” (person, visit, location, provider), you should treat them like obs on forms. I.e. in most cases they would be single-valued so you can just flatten them into a single column in their parent table; in the multi-valued case you could have a subsidiary table just like is done for obs/forms.