Thanks, @mksd for pointing this out. Both camel and spark integrate really well with debezium. @wyclif please consider batch-mode scenario when you are looking into this, I suspect camel does support batch consumption. The main data elements to consider emanates from encounter, obs, person, and orders. Lastly, consider a friendly to analytics output dataframe from the openmrs:extract process. I’m interested to know your findings on this, Please keep me posted.
Thank you, @paulamendola , for outlining this. I think we can borrow some ideas from this pipeline. I’m interested to know how the pipeline handles incremental updates. Also hope to see you demo during the ETL/ELT stand up: https://doodle.com/poll/9kefyg4ryenex5m7
This would be very cool to see. Although Sync and ETL/Reporting are definitely different use cases, I’ve always hoped that we could share an underlying core foundation of CDC (Change Data Capture) that both leverage, and which other downstream tools could also build upon, rather than inventing different solutions for both. In each case, the problem is efficiently extracting data out of OpenMRS (in batch and/or incrementally) and making that extracted data reliably available for downstream processing. It’s great to have those working on DbSync and those working on ETL ensuring that their efforts are complementary.
And throwing this need out there, is there a way to have the equivalent of Lucene indexing the data as it is entered/updated within OpenMRS to simplify reporting?
Probably providing a way to build custom data sets etc, especially for smaller sites
Thanks @bashir and @burke for refining it further. I think we can leverage a lot on streaming since it requires less processing and can run alongside openmrs deployment. I’ll be testing the streaming-mode in resource-constrained environments, I’m happy to share the results.
I just wanted to share what I’ve been working on and the changes we intend to make to an existing project that involves extracting data out of OpenMRS for integration with various other systems.
I believe some of you have heard about the DB sync project which was written by a developer that moved on to other things, I have spent the past couple of weeks trying to make improvements to the project.
Several DB systems provide replication tools that can do the sync job but that’s all to them and in some cases can be pricey, so this project was built to provide OpenMRS DB to DB sync of selected tables while providing an integration layer in a way that is as agnostic as possible to the underlying DBMS.
At the heart of the project is Apache Camel which allows it to route messages between the various applications in the ecosystem, initially the project was running a scheduler that periodically polled the configured OpenMRS tables to find rows to sync, it would keep track of the last sync date to determine what had to be synced for subsequent runs by comparing the last sync date to date created and changed of each row, with this approach it was easy to miss some changes between runs. Currently, it’s updated to use after insert, update and delete DB triggers that write to a sync record journal in the remote OpenMRS instance that the tool reads to determine rows that need to be synced to the HQ instance.
The trigger approach works fine but we’re considering switching to something like Debezium to replace the triggers, sync record journal and the existing mechanism that broadcasts DB events to all listening routes. Of course we are concerned about what it means to run a full blown Debezium instance with ZooKeeper and Kafka which can demand more resources for a site with already constrained resources, it’s possible we might run an embedded instance of Debezium by default with an option to use an external one. We then listen for DB change events emitted by Debezium, these are typically very raw messages similar to a SQL resultset, we want to be able to enrich the message with a more loaded version of the affected entity along with it’s flattened associations at the application level before publishing it to a configured message broker of choice, the enriched message can be consumed by registered listeners which could be other camel routes that feed other systems.
We’re still in the initial phase of figuring out what tools to use and how everything will come together, we strongly believe our choices could be influenced by the product owner based on the resources of the stakeholder, we will share what we have put together in the near future.
Thanks, @wyclif for providing possible tools that we can use. We will definitely need to do a lot of benchmark and testing on these tools while considering the ability to scale up and/or down, depending on diff server specs. I am a huge proponent for debezium mainly because it is an out-of-the-box solution that has minimal impact on production DB. Additionally, debezium/Kafka tends to be rather light on the CPU and RAM. Dockerized zookeeper-Kafka-connector, uses negligible JVM heap; Thus for small deployments, 500mb to 1 GB RAM should be enough for each container. During runtime, Kafka consumers buffers at least 2MB of RAM per consumer; while producers consume up to 64 MB. Therefore, the resource it needs highly depends on # of records/messages per sec. When it comes to batch mode, we will definitely need modern processors with multiple cores - also highly dependent on the amount of data in the DB. for < 1000 patients with less approximately 1 million clinical observations, 8gb ram, and 4 cores should be able to power the entire stack including production line.
For a number of these solutions (eg. Debezium) it seems that they likely rely on reading the MySQL bin log. I don’t believe that at PIH we have typically/by default configured our MySQL instances beyond the defaults in this area, and so I’m interested in how we might start thinking about this and whether we need to planning for how we might adapt our existing and future MySQL instances to prepare for these tools. Particularly for some large-scale rollouts - I’m concerned that updating hundreds of existing installations to use a new MySQL binlog format might be a major challenge for some groups.
Does anyone have experience or best practices in this area that they can share?
If the above is correct, we might want to start issuing some recommendations around MySQL configuration that accounts for this.
Thanks @wyclif for this nice summary. As I alluded above, I think the solution for the sync problem can play a role for the streaming part of an ETL/Analytics solution. One thing that I asked here but still am not clear about is what were your reasons for not building on top of Sync 2.0 and instead creating dbsync? Because at least part of what you are describing is similar to how Sync 2.0 uses the atomfeed module to keep a log of DB changes.
FWIW, for streaming ETL, I think beside approaches based on debezium or DB logs, another way is to rely directly on OpenMRS to keep the change logs, like an extended version of atomfeed module; or even directly write into the data warehouse once a change happens in OpenMRS. I’ll try to talk through some of these ideas in this meeting.
Would this be in order to make the solution lighter?
Yes, potentially a more lightweight solution and also easier to deploy, e.g., you just add another module and configure it.
If it were my infrastructure, I would rather go through the pain to Dockerise my MySQL instances progressively and ensure that bin logging is turned on… rather than trying to install the Atom Feed module (and whatever else will be needed with it) 100+ times on instances that might still run ancient versions of OpenMRS.
I like the non-intrusive pattern enforced by Debezium, for me this is a big winner.
Thanks @mseaton for bringing this up! When I started working with debezium I had the same concerns.
Practically speaking, deploying the debezium CDC is quite easy; would take the same effort as installing a new module. It entails adding two lines to the MySQL config then powering up the required container stack. We could write a bash script to automate this process. The config of interest are:
log_bin = mysql-bin binlog_format = row
Switching binlog format is the easiest part, this can be done at runtime by running this query :
SET GLOBAL binlog_format = ‘ROW’;
Although this can not be switched when the replication channel thread is currently running, thus we will need to switch after 5pm, when no one is performing data entry.
We should encourage the community to use ROW based binlog format instead of statement base.
@wyclif, thanks for the write up.
Do I understand correctly that you hit the performance limit that makes you consider the switch from DB triggers to Debezium? What do you use for queuing right now? Could you please share some stats? We’d have to have more data on the setup to recommend the path forward.
Debezium may indeed be an overkill for smaller setups. Can you leave a gate open to switch between DB triggers or Debezium depending on the scale of operation?
If we use this approach, it will be essential to upgrade atom or create a new module that caters for the CDC processing guarantees downstream the pipeline. Most importantly, can we guarantee data consistency between Openmrs (monolith) and the intermediate store of interest (lake/warehouse)? Remember, it is very easy to lose updates downstream the analytical pipeline. Something which debezium is designed to prevent
Operationally Debezium/kafka shields us from missing any updates downstream the analytical pipeline even beyond the E of ELT, even in case of power failure or internal server/application faults. Unlike debezium/kafka, Most CDC systems fail to guarantee the exactly-once processing semantics
The question is do we want to go down this path or instead us a modern and out-of-the-box solution for our CDC requirement?
Hi @raff, @wyclif to provide further details but the short answer is no. We are just testing tech stacks that enforce best and reusable practises. And it looked like Debezium was doing off the shelf a lot of what we needed and could cover the whole ‘E’ process for multiple use cases: ETL for data analytics, pure low level sync, data out streams for EIP (with Camel for instance)… etc.
It looks like that @wyclif’s approach with DB triggers is working perfectly fine. However, those triggers require maintenance and a setup that is more invasive than Debezium. They also imply to keep sync records journal tables within the OpenMRS database.
Again @wyclif to confirm but it appears that Debezium is reducing the overhead in terms of maintenance.
I see. So you are still exploring your options and you haven’t rolled out a solution yet.
In my opinion the Debezium is the way to go in such a case. It comes at the price of having beefier machines to run the whole stack, though it could be mitigated by using the embedded version as @wyclif mentioned. Have a look at https://camel.apache.org/components/latest/debezium-mysql-component.html which nicely couples the Debezium embedded engine with Apache Camel. It may be a great fit, since you are already using Camel.
Thanks @raff for sharing this. This will solve most of the concerns we have been having. Apart from eliminating Kafka and zookeeper, I’m glad this component provides the ability to recover from faults. This will definitely work for small deployments as well as large-scale
We have options with all that we have tried so far (basic management DB based on the last sync date, then more recently the DB triggers evolution), so at worst we have plans B. But it looks like Debezium starts looking like plan A
Thanks for your input again @raff.
@akimaina, well large-scale is pretty vague, but it definitely gives you a considerably easy upgrade path to Kafka and clustering to handle a truly large-scale if you start hitting performance limits.