To schema or not to schema... deciding on a foundation for our analytics engine

burke · August 21, 2020, 1:06pm

@bashir shared a vision of Analytics on FHIR in a previous post. Since then, we’ve had some ongoing discussions in Analytics Engine Squad meetings regarding the schema to use.

Do we need a schema for our data warehouse?

We have had several efforts to use a traditional data warehouse approach (e.g., star schema) over the past decade or more, using Birt, Pentaho, and other/custom approaches. This traditional data warehouse approach works well when the data model and the questions to answer are known. We have neither of these. Our data model for analytics within OpenMRS isn’t the relational database model, but the data model of clinical data represented by our concept dictionary. Both this model and the questions to be answered are in constant motion. For this reason, tools that don’t depend on a fixed schema – e.g., Apache Lucene-based tools for indexing documents like Elasticsearch – are appealing.

@bashir, does your approach depend on coercing content into a FHIR SQL schema? Or is it possible to use documents (e.g., JSON) at the heart of the engine, focusing on coercing data into FHIR format, but not requiring a specific relational database schema to describe our universe? I assume a FHIR schema would not be adequate for data used by OpenMRS that is out of FHIR’s scope and would force us to come up with an alternative approach for non-FHIR and implementation-specific data.

ssmusoke · August 21, 2020, 1:36pm

How about a middle ground that provides a path to both approaches

A defined schema for OpenMRS, with the process to generate that schema, which allows users to generate their own custom schemas or FHIR

It is important to be able to generate this at each installation site, and to aggregate the data across multiple sites

This can also provide a simplified interface for both reporting and clinical decision support (patient flags + data validation rules)

burke · August 21, 2020, 2:29pm

What I’m questioning is whether “a defined schema for OpenMRS” exists. Surely, you can point to our SQL schema, but that has two problems: (1) it doesn’t cover all modules including implementation-specific modules and (2) it’s a bespoke representation of the world instead of leveraging standards like FHIR.

I’m hoping that having a first stop in the analytics engine that doesn’t require a specific schema can give us that path – i.e., don’t enforce a schema to receive data into the engine and then implement a transform to a specific initial use case in a way that can be mimicked for other use cases.

ssmusoke · August 21, 2020, 2:43pm

@burke Actually this schema should not be focused on modules and SQL that is too low level, rather the existing business language within the implementation

Examples

HIV - he domain is 90-90-90 cascade, simplest is tested -> enrolled in care -> virally suppressed. Confirmed Positive, Enrolled into care, baseline CD4/Viral Load, Baseline Regimen, regimen changes etc
MCH - antenatal visits -> maternity -> baby (vaccinations, growth measures)
TB - first infection or relapse, treatment through intensive -> regular, MDRTB, drugs, lab tests etc

The schema can be designed to map to FHIR or any other standard

bashir · August 26, 2020, 5:09am

Thanks @burke for starting this thread. A few points:

Before getting into the schema discussion, I think it is necessary to agree whether OpenMRS data is structured data or unstructured. I always assumed it is structured. But my understanding of OpenMRS is limited and I am not sure if I understand the following:

especially when you talk about Lucene and Elasticsearch in the same context. Do we have a lot of unstructured text data in OpenMRS that we need to deal with? My understanding is that OpenMRS data fits a relational schema anyways. Sure, the concept dictionary and modules data are flexible but still this does not make the data unstructured.

If we agree that we are dealing with structured data, then having some schema is natural. I totally agree that most previous attempts for analytics schema have been custom and not adopted by multiple implementers. But I feel that can be addressed by using a standard schema, e.g., SQL-on-FHIR, OMOP, etc. If we impose no schema and even go from the source structured data into an unstructured index based approach, I think we will be doing a disservice to downstream query engines and report generations. Even if use just JSON, still we should make some assumptions on the structure of the records (and I would argue even with raw text that is the case), otherwise we should stop there and cannot build anything else downstream (e.g., to calculate PEPFAR metrics or other indicators) which can be adopted by all implementations.
Using some schema like SQL-on-FHIR or OMOP, does not mean that everything in OpenMRS should be coerced into those representations. Any analytics engine that we pick should support joining custom data as well (e.g., joining two tables). But I definitely want SQL to be one of the APIs to query data, even for the extra custom data (this obviously does not mean that the underlying data warehouse is a relational DB).
I am curious to know how much of the OpenMRS data that is needed fro analytics, can never be represented as FHIR resources; I have been told that it exists but I love to know concrete examples and how common they are. Custom module data can still extend OpenMRS FHIR default implementation (fhir2 module) by extending the FHIR ResourceProviders and we don’t need to change the analytics engine schema for those.

P.S. sorry for the delay as I am currently on vacation.

bashir · August 26, 2020, 5:16am

Our current proposal is to have tools to extract OpenMRS data as FHIR resources and make the analytics schema mostly dependent on the FHIR representation (whether it is SQL-on-FHIR or something else). Maybe at some point we should think of generic tools to bring non-FHIR data as well but first I like to see examples where extending the FHIR implementation of OpenMRS is not possible to address custom needs. As I mentioned in my first reply above, the data warehouse engine will definitely support bringing custom data into the data warehouse.

Yes, that is a great point. For the MVP, we are targeting the single site case but hopefully we will support aggregation across multiple sites in the future too.

dkayiwa · August 27, 2020, 9:44pm

We use Lucene for searching concepts and patients. The Lucene Index is simply a read-only copy of the data in relational tables. We update this search index whenever the data changes in the relational database tables.

https://issues.openmrs.org/browse/TRUNK-425

https://issues.openmrs.org/browse/TRUNK-2999

burke · August 27, 2020, 11:51pm

While much of the data in OpenMRS is structured, there can be unstructured data as well (e.g., narrative from notes or other text collected in a form). But my desire for a non-relational database as the first stop for data in the warehouse is not to accommodate unstructured data; rather, to handle data with arbitrary structures (i.e., json where the structure depends on the data type being sent and new datatypes can be added without restructuring the warehouse).

Agreed. I was expecting something along the lines of:

{
  "timestamp": "...",              // ISO-8601 UTC timestamp
  "patient": { ... },              // reference to patient
  "documentType": "...",           // specifies schema for content
  "content": { ... },              // value, note text, FHIR resource, etc.
  "additionalIndexContent": "...", // any additional text to aid indexing
  "ref": "..."                     // URI to value within OpenMRS
}

I’m assuming that if the first target for incoming data is a relational database it would mean accepting new/arbitrary data would require altering the relational database (i.e., adding tables). On the other hand, if the first target is non-relational and can accept content with any schema as long as it has some basic structure (bare minimum would be a discriminator, reference to source, and content). Arbitrary new data could be accepted and reach the first stage without any changes. To make sure of the new data, you’d plug in new transform scripts associated with the new content type. Any other forms of data we want to support (SQL-on-FHIR, OMOP, etc.) would be populated by transforming data from the first stage.

The issue isn’t with data that can never be represented as FHIR resources; rather, it’s data that isn’t currently represented as FHIR resources. Part of this reflects unfinished work on covering all aspects of the OpenMRS model and the other (likely larger and longer tail) of data from custom modules created by orgs & implementations that may not have the expertise or resources to coerce their data into FHIR resources.

LOL. No worries. I hope you’re enjoying your vacation and not reading this reply

dkayiwa · August 30, 2020, 10:25pm

I get tempted to feel that we are over designing this and making it more complicated than our current needs. Having a shared solution which simply has flattened tables for our complex EAV model, and ways of incrementally updating data in these flattened tables, would be enough for the first version of our analytics engine. In this first version, i would not even be bothered with potential data from implementation specific modules. Not even anything to do with FHIR. All these are for now, nice to haves, for successive iterations of the analytics product. Having such a quick, simple, and immediately usable solution rolled out to implementations, will help us get real practical feedback to guide our subsequent iterations, instead of taking long to start, because of spending time on imaginary, theoretical, or possibly future requirements.

@aojwang you did something along those lines for KenyaEMR. Only that it was not a shared solution and it lacked features like how to incrementally get changes. But it worked, to the extent that it served better for ad-hock reporting than the default openmrs data model.

With the above background, i answer by saying that yes, we need a schema (simply flattened versions of tables like obs, and a few others) for the first release!

ssmusoke · August 31, 2020, 5:42pm

Right on the money!!

The ability to leverage reporting and querying on these flattened tables would be great addition to reduce the pain of reporting

bashir · September 9, 2020, 4:43pm

Hi all, I am back from vacation and great to see this conversation going forward. I will try to summarize/respond a few points in one [long] post. First to simplify, I use these three terms: “updater” reads from OpenMRS and pushes new data to “data-warehouse” accordingly; the “reporter” is anything that works off of the data in the data-warehouse.

Re. schema and flexibility of no schema: Sure, having no schema perhaps may simplify updater logic and data-warehouse changes for custom data/modules but it has the down-side of making reporter more complex for reasons I mentioned before. So if we have an imaginary module X with custom data, just pushing X’s data into data-warehouse is usually not enough and we need to make assumptions about its structure to use it in reporter. I prefer those assumptions to be clearly defined as an schema in the data-warehouse rather than hard-to-find pieces of code logic.
Re. a simple flattened schema that @dkayiwa suggested: Yes AFAIK there have been multiple attempts for creating such simplified schema for analytics. @aojwang’s work you mentioned is one example. There are at least two other major efforts by AMPATH and PIH that I am aware of (and I am sure there are more examples). The key point is adoption by everyone. I feel spending more time on agreeing on a shared schema is not wise (it would not be easy to find a schema that meets everyone’e needs). So instead of that, I am proposing to pick a standard that is already there (SQL-on-FHIR) and build a prototype for PEPFAR indicator calculation on top of that. The benefit of relying on standards is that we ignore the time-consuming task of agreeing on a schema and can move forward faster. Plus, it potentially helps us using tools on top of the standards, for example there are tools for FHIR to OMOP which means with not-too-much effort we can do OpenMRS->OMOP. We can revisit this decision when we have some prototype.
For custom data that is either not yet converted to FHIR or will never be, there is always the option of adding new tables in the data-warehouse. I don’t feel that it is a huge challenge because there is obviously a schema in the source (i.e., OpenMRS). That said, for anything FHIR convertible, the long-term plan should be to implement the FHIR translation for that custom/module data too. This will make updater logic simpler too.

bashir · September 9, 2020, 4:44pm

Thanks for the pointers @dkayiwa; they were helpful.

ssmusoke · October 6, 2020, 7:34pm

@bashir I am just following through on whether there is any movement on this

ssmusoke · October 6, 2020, 7:38pm

@dkayiwa To pick your brain - how can one compute custom data to add to the Lucene index and update it when the data model changes, can you point me to some code or documentation

Also how does querying the Lucene index work (Code or docs would be great)

Additional thoughts and questions - potential ideas too

Does one need to write custom extractors, if so what is the approach?
Can the extractors be queued so that there is not a spike in the application load?
Can the lucene index be dumped into the database? Then one can use SQL to populate and update the data models? MySQL 5.7 has JSON column support

bashir · October 7, 2020, 6:33am

Yes @ssmusoke, for the time being we are creating a FHIR based schema for the data warehouse. This does not mean that in future we won’t try other options like OMOP by doing a FHIR->OMOP conversion in the pipeline code.

BTW, the github repo for this work is here and there is a weekly Analytics Squad meeting on the OpenMRS calendar on Thursdays; please feel free to join.

ssmusoke · October 7, 2020, 10:24am

@bashir is there anything for local usage without pushing the data to Google Cloud

bashir · October 7, 2020, 1:53pm

Yes, local (or on-premise) support is our current focus. The idea is that OpenMRS FHIR resources will be pushed to local Parquet files and they can be used through query processing frameworks like Spark. That is the plan, how it performs in practice is yet to be seen.

ibacher · October 7, 2020, 2:14pm

Not to respond for @dkayiwa, but we use Hibernate Search to manage the coordination between Lucene and the database. So custom entities could easily be added to the Lucene index with just a few annotations. See the details here. Querying is likewise fairly simple and integrated.

Probably the most robust usage of Lucene queries in OpenMRS Core is the queries related to concepts, which have been around for years. For instance, the query to find concept names.

No, or at least, it’s a really bad idea to try to do this. Lucene indexes aren’t intended to be human-readable. Instead it uses a binary format and storing large binary files that are accessed at random in a database is usually a bad idea.

It’s also important to realise that Lucene is really only an index; it doesn’t store documents per se, just the fields that you instruct it to index and a document reference you give it. That document reference could be to a database entry (which is effectively how Hibernate Search works) or something else that makes sense to the system querying the index.

A somewhat more robust way to take advantage of Lucene indexing can be done through ElasticSearch (which wraps Lucene indexes, but supports indexing any kind of data in JSON format).

ssmusoke · October 7, 2020, 7:25pm

@ibacher Thanks for the pointers, and it really helps understand how Lucene works, now I seem to get it

Looking at DHIS2, there are analytics tables that are created for overnight runs to aid reporting, can we do the same for OpenMRS to:

Generate a baseline of summarized/transformed data in clearly distinct tables (plus a manual option to do this)
Update the baseline when changes are made to the patient record
Provide Reporting, Patient Flags, Data Validation Flags, Dashboard Widgets etc to use this as an option to speed up their own runs (configurable at runtime)

Just ideas to enable unfurling the EAV model that is great for data collection but painful for reporting - and ability to control extraction of “important” data on the patient

What are the quick wins and low hanging fruit, I am still scratching my head