To schema or not to schema... deciding on a foundation for our analytics engine

bashir · September 9, 2020, 4:43pm

Hi all, I am back from vacation and great to see this conversation going forward. I will try to summarize/respond a few points in one [long] post. First to simplify, I use these three terms: “updater” reads from OpenMRS and pushes new data to “data-warehouse” accordingly; the “reporter” is anything that works off of the data in the data-warehouse.

Re. schema and flexibility of no schema: Sure, having no schema perhaps may simplify updater logic and data-warehouse changes for custom data/modules but it has the down-side of making reporter more complex for reasons I mentioned before. So if we have an imaginary module X with custom data, just pushing X’s data into data-warehouse is usually not enough and we need to make assumptions about its structure to use it in reporter. I prefer those assumptions to be clearly defined as an schema in the data-warehouse rather than hard-to-find pieces of code logic.
Re. a simple flattened schema that @dkayiwa suggested: Yes AFAIK there have been multiple attempts for creating such simplified schema for analytics. @aojwang’s work you mentioned is one example. There are at least two other major efforts by AMPATH and PIH that I am aware of (and I am sure there are more examples). The key point is adoption by everyone. I feel spending more time on agreeing on a shared schema is not wise (it would not be easy to find a schema that meets everyone’e needs). So instead of that, I am proposing to pick a standard that is already there (SQL-on-FHIR) and build a prototype for PEPFAR indicator calculation on top of that. The benefit of relying on standards is that we ignore the time-consuming task of agreeing on a schema and can move forward faster. Plus, it potentially helps us using tools on top of the standards, for example there are tools for FHIR to OMOP which means with not-too-much effort we can do OpenMRS->OMOP. We can revisit this decision when we have some prototype.
For custom data that is either not yet converted to FHIR or will never be, there is always the option of adding new tables in the data-warehouse. I don’t feel that it is a huge challenge because there is obviously a schema in the source (i.e., OpenMRS). That said, for anything FHIR convertible, the long-term plan should be to implement the FHIR translation for that custom/module data too. This will make updater logic simpler too.