The promise of OpenMRS ETL: Advanced Reporting/Analytics for OpenMRS

Tags: #<Tag:0x00007fa3f5bd88f0> #<Tag:0x00007fa3f5bd86c0>

Hi @raff, @wyclif to provide further details but the short answer is no. We are just testing tech stacks that enforce best and reusable practises. And it looked like Debezium was doing off the shelf a lot of what we needed and could cover the whole ‘E’ process for multiple use cases: ETL for data analytics, pure low level sync, data out streams for EIP (with Camel for instance)… etc.

It looks like that @wyclif’s approach with DB triggers is working perfectly fine. However, those triggers require maintenance and a setup that is more invasive than Debezium. They also imply to keep sync records journal tables within the OpenMRS database.

Again @wyclif to confirm but it appears that Debezium is reducing the overhead in terms of maintenance.

2 Likes

I see. So you are still exploring your options and you haven’t rolled out a solution yet.

In my opinion the Debezium is the way to go in such a case. It comes at the price of having beefier machines to run the whole stack, though it could be mitigated by using the embedded version as @wyclif mentioned. Have a look at https://camel.apache.org/components/latest/debezium-mysql-component.html which nicely couples the Debezium embedded engine with Apache Camel. It may be a great fit, since you are already using Camel.

3 Likes

Thanks @raff for sharing this. This will solve most of the concerns we have been having. Apart from eliminating Kafka and zookeeper, I’m glad this component provides the ability to recover from faults. This will definitely work for small deployments as well as large-scale

We have options with all that we have tried so far (basic management DB based on the last sync date, then more recently the DB triggers evolution), so at worst we have plans B. But it looks like Debezium starts looking like plan A :slight_smile:

Thanks for your input again @raff.

@akimaina, well large-scale is pretty vague, but it definitely gives you a considerably easy upgrade path to Kafka and clustering to handle a truly large-scale if you start hitting performance limits.

I am afraid I do not fully understand the point here. I understand that in a truly distributed environment exactly-once semantic is a difficult problem but I thought earlier we established that is not the case here. Sure, OpenMRS is a monolith but so can be the change stream processor that updates the warehouse, no? For example, in the atomfeed example, the state of both the feed producer and consumer are tracked in the DB and it is not clear to me how they can miss an update or process it twice (the order issue can be another story though).

BTW, I mentioned atomfeed mostly as an example of another potential approach; relying on Debezium sounds like a great option too, as you mentioned. What I am not sure about is whether complicating the infrastructure with something like Kafka, which is mainly designed for a truly distributed system with many producers/consumers, is warranted for this problem.

One of the advantages that i can see of Debezium over our approaches that use the hibernate interceptor, is ability to also capture database changes that happen outside of the openmrs application. For instance when one directly makes changes to data in the database or via some other application.

@raff I think DB triggers work fine if you want to monitor a few tables and don’t plan to be adding new tables or change the trigger logic continuously, to be honest triggers do the job pretty well. The primary reason why we are considering debezium is that it solves 3 things that had to be manually implemented in the code base i.e. monitoring db events, publishing events and Apache camel already has built-in components to bring the glue together. The main issue we’re foreseeing with debezium is how to avoid going in circles i.e. replay a change back to the origin, you don’t want to update a patient on server A, the event is consumed by server B, during sync the change is again written to the binlog of server B and then gets sent back to server A, this is not an issue with a one-way sync but would need to be addressed for a 2-way sync.

@bashir as @dkayiwa pointed out sync 2.0 just like any other tool based on the hibernate interceptors has some holes, the most obvious being you miss any change that takes place outside of a hibernate transaction. The other big flaw with sync 2.0 is that it uses the rest API to extract and load sync entities, the rest API is heavily based on the java api with all the entity relationships, it also excludes a lot of required fields that a rest client doesn’t need to know but on the other hand need to be synced. I think REST is great for integration but not db sync.

Regards,

Wyclif

That is a good point but is this kind of DB changes (from outside OpenMRS) common (beyond restoring snapshots)? If that is the case, then this might be a deal breaker for some of the design options I mentioned above.

Thanks for pointing this out. I have not looked at the rest API much. From atomfeed/Sync 2.0, I have been mostly looking at using the FHIR API, which is provided beside the rest API, AFAICT.

Sorry for not making myself clear. Exactly-once processing/delivery applies to any CDC component, including nondistributed systems. In this context, I’m referring to the propagation of changes between openmrs CDC component and the warehouse especially during T of EL. If we are going to create a custom CDC, we will need to make this propagation process tolerant to disconnections/faults, such as server restarts, e.g., keeping track of offsets.

On a lighter note, If we need to avoid Kafka entirely, we can go for debezium-camel suggested by @raff and @mksd which guarantees exactly-once semantic at normal operation. However, there are some possibilities of duplicate delivery after an ungraceful shutdown.

I just want to throw my endorsement and full support behind this thread. It is very exciting for me to see this conversation happening like this, and especially to see so many people articulating the problems and pros/cons of various solutions that I have also felt and expressed but have not articulated as clearly as this thread is doing. I’m really encouraged by where this is heading!

4 Likes

We just had a great OpenMRS ETL/ELT Stand-Up! Over 20 participants from the OpenMRS community joined together to share their work in the space of ETL/ELT solutions.

  • VecnaCares presented their data warehouse work
  • AMPATH presented their ETL architecture as well as ideas for the next generation architecture
  • @bashir (from Google) presented a portion of his ideas & questions for the community

We definitely need to continue this conversation. There were several more people who had ideas & experience to share and we had little to no time for discussion. So, stay tuned for round 2!

For those who couldn’t make today’s call, you can review it here:

2020-05-20 OpenMRS ETL/ELT Stand-Up

6 Likes

This whole conversation is one of the highlights of my month!

And yes, we still have ideas and experiences to share - and much to discuss.

We can consider holding Round 2 at the same time on Wednesday (4pm Nairobi | 1pm UTC | 9am EDT | 6am PDT), so that option is included in the poll. Please respond by 3pm UTC on Friday so I can announce the next round by the weekend.

@bashir the issues with using REST for loading/extracting entities in the context of doing dbsync are the same even if you used FHIR, there is several OpenMRS internal logic that is hidden from REST/FHIR APIs which impact values stored in the database, you would miss all these values during a sync.

@raff sorry I didn’t respond to one of your questions, no we didn’t hit any dead end with DB triggers and we haven’t run any stats against them, we just didn’t want to sort of re-invent the wheel if the binlog is already doing something similar. The sync record lives in a table within each OpenMRS DB with column values being the affected entity’s identifier which is usually a uuid, operation (insert/update/delete), table name and a time stamp of when it happened, the after insert/update/delete triggers populate this table when these events occur.

@wyclif, bidirectional sync presents the same challenge of recognizing the origin and not reapplying the same change regardless if it is API interceptors, DB triggers or Debezium.

A way to approach it is to include unique id of the origin server for each changed row within the message so when you get a DB event you can check the origin server and decide whether to apply that change or not as it originates from this particular instance. The origin id should be saved by the receiving end upon getting an update message/applying the update and should be discarded as soon as an update message for that particular row is produced and the preserved origin is included in the message. Of course you could also choose not to propagate the update message at all by the receiving end. It depends on how you want to structure your sync group.

Anyway, it probably deserves a separate thread.

@burke, thanks for posting the recording. Thanks to @AMPATH in particular for sharing your approach. Very interesting stats!

I actually implemented one with DB triggers where the triggers logic is skipped if in sync mode, we just set a session variable on the connection which the triggers detect when syncing. Therefore, when updating a record in response to an update from another DB, no new sync record is generated on the receiver DB so you avoid replaying back to the origin.

Correct; and my impression is that these kinds of requirements are considered in Atomfeed module server/client implementation; after all that module is the basis of Sync 2.0 module which is basically a version of CDC (for a filtered set of changes). For example, keeping track of the offset is taken care of in Atomfeed client, AFAIU.

Can you elaborate a little more on this point? I am sure you have thought more on these issues than myself and I am afraid I am missing something here.

Let me clarify my understanding: I understand that an Atomfeed based approach, is not a complete CDC/replication and that is actually not the goal. The goal that I have is to pretend that OpenMRS is a FHIR store, then choose a set of resources, say Patient, Encounter, Observation, and replicate all changes relevant to those resources into a target FHIR store. So if I listen to the Atom feed, I can capture all FHIR resources that have been updated and then can replicate those FHIR resources in a target store. Also, I understand that if my feed client (subscriber) falls behind the feed producer (the source OpenMRS) I may miss some changes, e.g., when a resource is updated twice quickly. However, with the “newest is the best” approach, that is probably not a big deal (or at least I understand that limitation). Anything else?