The promise of OpenMRS ETL: Advanced Reporting/Analytics for OpenMRS

raff · May 21, 2020, 10:03am

@wyclif, bidirectional sync presents the same challenge of recognizing the origin and not reapplying the same change, regardless of whether it is done with API interceptors, DB triggers or Debezium.

One way to approach it is to include the unique id of the origin server for each changed row within the message, so that when you get a DB event you can check the origin server and skip the change if it originated from this particular instance. The receiving end should save the origin id when it gets an update message and applies the update, and should discard it as soon as it produces an update message for that particular row with the preserved origin included. Of course, the receiving end could also choose not to propagate the update message at all. It depends on how you want to structure your sync group.
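To make the idea concrete, here is a minimal in-memory sketch in Python. It is only an illustration under my own assumptions: `Server`, `ChangeMessage`, `receive` and `on_db_event` are hypothetical names, not part of OpenMRS, Debezium or any sync module, and a plain dict stands in for the database.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeMessage:
    row_id: str
    value: str
    origin: str  # id of the server where the change was first made


@dataclass
class Server:
    server_id: str
    rows: dict = field(default_factory=dict)
    # Origin ids remembered for rows updated by an incoming message.
    preserved_origin: dict = field(default_factory=dict)

    def local_write(self, row_id, value):
        """A user edits a row on this server; the DB then emits an event."""
        self.rows[row_id] = value
        return self.on_db_event(row_id)

    def on_db_event(self, row_id):
        """Build the outgoing message for a changed row.

        A preserved origin is reused and discarded here; otherwise the
        change is local and this server is the origin.
        """
        origin = self.preserved_origin.pop(row_id, self.server_id)
        return ChangeMessage(row_id, self.rows[row_id], origin)

    def receive(self, msg):
        """Apply an incoming change unless it originated on this server."""
        if msg.origin == self.server_id:
            return None  # our own change coming back: do not reapply
        self.preserved_origin[msg.row_id] = msg.origin
        self.rows[msg.row_id] = msg.value
        return self.on_db_event(msg.row_id)  # propagate with original origin


a = Server("A")
b = Server("B")
msg = a.local_write("patient-1", "Jane")  # change made on A
echo = b.receive(msg)                     # B applies it and re-emits
dropped = a.receive(echo)                 # A recognizes its own change
```

In this sketch `echo.origin` is still `"A"`, so `a.receive(echo)` returns `None` and the loop stops there. Returning `None` from `receive` after applying the update would model the "do not propagate at all" variant.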

Anyway, it probably deserves a separate thread.

raff · May 21, 2020, 10:08am

@burke, thanks for posting the recording. Thanks to @AMPATH in particular for sharing your approach. Very interesting stats!

wyclif · May 21, 2020, 3:23pm

I actually implemented one with DB triggers where the trigger logic is skipped when in sync mode: we just set a session variable on the connection, which the triggers detect when syncing. Therefore, when updating a record in response to an update from another DB, no new sync record is generated on the receiver DB, so you avoid replaying the change back to the origin.
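A rough sketch of that trigger guard, using Python's built-in `sqlite3` so it can run anywhere. This is not the actual implementation: SQLite has no session variables, so a single-row `sync_mode` table stands in for the per-connection variable (unlike a real session variable, it is visible to every connection), and the table and trigger names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (uuid TEXT PRIMARY KEY, name TEXT);
CREATE TABLE sync_record (
    id INTEGER PRIMARY KEY AUTOINCREMENT, uuid TEXT, name TEXT
);
-- Stand-in for the session variable: 1 while applying remote changes.
CREATE TABLE sync_mode (enabled INTEGER NOT NULL);
INSERT INTO sync_mode (enabled) VALUES (0);

-- Only queue a sync record when the change was NOT made in sync mode.
CREATE TRIGGER patient_after_update AFTER UPDATE ON patient
WHEN (SELECT enabled FROM sync_mode) = 0
BEGIN
    INSERT INTO sync_record (uuid, name) VALUES (NEW.uuid, NEW.name);
END;
""")


def set_sync_mode(conn, enabled):
    conn.execute("UPDATE sync_mode SET enabled = ?", (1 if enabled else 0,))


def pending(conn):
    return conn.execute("SELECT COUNT(*) FROM sync_record").fetchone()[0]


conn.execute("INSERT INTO patient VALUES ('p1', 'Jane')")

# A local edit: the trigger fires and queues a sync record.
conn.execute("UPDATE patient SET name = 'Jane Doe' WHERE uuid = 'p1'")
local_count = pending(conn)

# An edit applied on behalf of another DB: the trigger is skipped.
set_sync_mode(conn, True)
conn.execute("UPDATE patient SET name = 'Jane D.' WHERE uuid = 'p1'")
set_sync_mode(conn, False)
after_sync_count = pending(conn)
```

Both counts come out the same here, because the update applied in sync mode does not add a row to `sync_record`.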

bashir · May 22, 2020, 9:49am

Correct; my impression is that these kinds of requirements are considered in the Atomfeed module's server/client implementation. After all, that module is the basis of the Sync 2.0 module, which is basically a version of CDC (for a filtered set of changes). For example, keeping track of the offset is taken care of in the Atomfeed client, AFAIU.

bashir · May 22, 2020, 10:01am

Can you elaborate a little more on this point? I am sure you have thought more on these issues than myself and I am afraid I am missing something here.

Let me clarify my understanding: I understand that an Atomfeed-based approach is not complete CDC/replication, and that is actually not the goal. My goal is to pretend that OpenMRS is a FHIR store, then choose a set of resources, say Patient, Encounter, Observation, and replicate all changes relevant to those resources into a target FHIR store. So if I listen to the Atom feed, I can capture all FHIR resources that have been updated and then replicate those FHIR resources in a target store. Also, I understand that if my feed client (subscriber) falls behind the feed producer (the source OpenMRS), I may miss some changes, e.g., when a resource is updated twice quickly. However, with the “newest is the best” approach, that is probably not a big deal (or at least I understand that limitation). Anything else?
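To illustrate what I mean by “newest is the best”, here is a small Python sketch. It does not use the real Atomfeed client or any FHIR library; `feed`, `source`, `update_source` and `FeedClient` are hypothetical stand-ins, with the feed modelled as an append-only list of (resource type, id) entries and the client fetching the current state of each resource when it reads an entry.

```python
source = {}  # current state in the source: (type, id) -> resource
feed = []    # append-only feed of (type, id) change notifications


def update_source(rtype, rid, resource):
    """Change a resource in the source and publish a feed entry."""
    source[(rtype, rid)] = resource
    feed.append((rtype, rid))


class FeedClient:
    def __init__(self, wanted_types):
        self.wanted = set(wanted_types)
        self.offset = 0   # position of the next unread feed entry
        self.target = {}  # replicated resources in the target store

    def poll(self):
        """Read unread entries and copy the latest state of each resource."""
        entries = feed[self.offset:]
        for rtype, rid in entries:
            if rtype in self.wanted:
                # Fetch the state as it is NOW; intermediate versions
                # written before this poll are never seen.
                self.target[(rtype, rid)] = dict(source[(rtype, rid)])
        self.offset += len(entries)
        return len(entries)


client = FeedClient(["Patient", "Encounter", "Observation"])

update_source("Patient", "p1", {"name": "Jane"})
update_source("Patient", "p1", {"name": "Jane Doe"})  # second quick update
update_source("Location", "l1", {"name": "Clinic"})   # not a wanted type

processed = client.poll()
```

After the poll, the target holds only the latest Patient version: the first update was overwritten before the client caught up, which is the limitation described above.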

wyclif · May 27, 2020, 3:27pm

@bashir it really comes down to the fact that the REST and FHIR APIs intentionally don’t expose some fields when retrieving a resource and don’t accept some fields when creating/updating one. That means you can never sync these fields, yet depending on an implementation’s use cases they might need to be synced. This is just one of them; I recall running into other gotchas when I tested Sync 2.0.

So it kind of depends on what you’re trying to achieve, since different implementations have somewhat differing goals. If you just want a patient record to be replicated in another OpenMRS instance, you can possibly use REST/FHIR. But if you need more, like capturing all details about a record including all fields as represented in the source DB, e.g. for reporting purposes, then you’re limited.

lluismf · April 27, 2021, 3:32pm

I agree, not all the DB changes are performed by Hibernate. In fact, Hibernate is not recommended at all for batch processing in terms of memory. Pure JDBC is the way to go if performance is needed.

dkayiwa · April 27, 2021, 3:45pm

@lluismf great to see you again man!!! Happy new 3 years.

gcliff · April 27, 2021, 3:47pm

Sure, thanks @lluismf. We currently do have a JDBC processing mode in the analytics batch-processing pipeline.

lluismf · April 27, 2021, 4:40pm

I’m probably dev/1 now. If there’s some development going on using Spark or Kafka, I would be interested. BTW, did you win an Oscar yesterday?

dkayiwa · April 27, 2021, 5:03pm

Hahahaha! A /dev/5 is always one.

Oh yes we have work going on with the Analytics Squad that uses these technologies: Analytics Engine (including ETL and reporting improvement) - Projects - OpenMRS Wiki

Not at all.

lluismf · April 28, 2021, 9:57am

I’ll definitely take a look, thanks!

fruether · April 29, 2021, 8:19pm

Just wondering: is the current source code available to look at? I could not find a GitHub link for Phase 1: A Working Proof of Concept ETL for a Large Site

ibacher · April 30, 2021, 12:43pm

@fruether You can find the current WIP here: GitHub - GoogleCloudPlatform/openmrs-fhir-analytics: A collection of tools for extracting OpenMRS data as FHIR resources and analytics services on top of that data. [NOTE: WIP/not production ready]. We also have a PoC project that uses that for the “E” part of ETL here: GitHub - openmrs/openmrs-plir-dockerized-setup: Dockerised PLIR setup.

fruether · June 6, 2021, 1:53pm

Are there any tasks or issues in this context that a community member could help with, because they are not time critical or too complex/fundamental?

ibacher · June 7, 2021, 12:02pm

You could have a look through the issue list. I’m not overly familiar, but issues tagged P2 or P3 are probably at an appropriate level of priority.

fruether · June 7, 2021, 9:20pm

Thanks @ibacher for the support. I am going to have a look at them over the weekend and work out whether I can tackle one!

bashir · June 8, 2021, 6:42pm

Hi @fruether; from the list that Ian pointed out, here are a few issues that meet the criteria you mentioned (i.e., not time critical and hopefully not too complex). The order is roughly from easy to hard:

The following two are not simple but not time critical (at least for the next month or so):

fruether · June 8, 2021, 8:34pm

Thank you @bashir for the elaboration. I will have a look at the linked issues over the next few days and get back to you! Appreciate the nice list!

fruether · August 1, 2021, 7:16pm

Just a short question @bashir

When running “mvn clean install” on the master branch, as described in the readme, I am getting the following errors:

[ERROR] FetchResourcesTest.testPatientIdsFromBundle:95 » NoClassDefFound Could not ini…
[ERROR] JdbcFetchUtilTest.testCreateSearchSegmentDescriptor:141 » NoClassDefFound Coul…
[ERROR] JdbcFetchUtilTest.testFetchAllUuidUtilonEmptyTable:173 » ExceptionInInitializer

I am using adoptopenjdk-13.0.1, which seems not to be supported (unsupported Java version: 13). Java 1.8 is supported (that is what my editor uses). What version is planned to be supported? Should I add the Java version for mvn to the pom file, e.g. maven.compiler.target?