Thanks @akimaina for this great post. A few questions:
-
Is the code available on GitHub for everyone to take a look at?
-
How sensitive is this implementation to changes in the input DB schema? For example, is the same pipeline applicable to different versions of OpenMRS where the schemas of some of the input tables differ?
-
I see that you have tried to prevent code duplication by sharing common code. Did you consider implementing the pipeline with Apache Beam, so that the same pipeline could be used in both streaming and batch modes? A rough sketch of what I have in mind is below.
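Just to clarify what I mean, here is a minimal sketch of that kind of unified Beam pipeline; the transforms, field names, and the in-memory source are purely hypothetical placeholders, not anything from your actual implementation:

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions

def apply_shared_transforms(rows):
    """The transform chain both modes would share; the parse logic is made up."""
    return (
        rows
        | 'Parse' >> beam.Map(lambda r: {'obs_id': r[0], 'value': r[1]})
        | 'DropEmpty' >> beam.Filter(lambda obs: obs['value'] is not None)
    )

def run(streaming=False):
    options = PipelineOptions()
    # The main switch between the two modes; sources/sinks would differ too.
    options.view_as(StandardOptions).streaming = streaming
    with beam.Pipeline(options=options) as p:
        # Batch mode could read the MySQL tables (e.g. via JDBC), while
        # streaming mode could consume a Kafka/Debezium source. An in-memory
        # source is used here just to keep the sketch runnable.
        rows = p | 'Read' >> beam.Create([(1, 120), (2, None)])
        apply_shared_transforms(rows) | 'Print' >> beam.Map(print)

if __name__ == '__main__':
    run(streaming=False)
```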
-
Re. the output Spark dataframes, did you consider the alternative of storing the intermediate output in a columnar file format like Parquet? In general, if someone wants to avoid PySpark and run SQL-like queries on the intermediate storage, is that possible? (A sketch of the kind of workflow I mean follows.) Sorry, I don't know much about Delta Lake; maybe it already does what I am asking about.
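For example, something along these lines is what I am thinking of; the columns and paths are made up, and while the SQL here still goes through a SparkSession, plain Parquet files could equally be queried by any other engine that reads Parquet:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('intermediate-parquet').getOrCreate()

# Stand-in for an intermediate dataframe the pipeline would produce;
# these columns are hypothetical, not the actual OpenMRS/pipeline schema.
df = spark.createDataFrame(
    [(1, 'weight', 72.5), (2, 'weight', 80.1)],
    ['person_id', 'concept', 'value'],
)

# Persist the intermediate result as plain Parquet files ...
df.write.mode('overwrite').parquet('/tmp/intermediate/obs')

# ... which can then be queried with plain SQL, without any pipeline code.
spark.read.parquet('/tmp/intermediate/obs').createOrReplaceTempView('obs')
spark.sql('SELECT concept, AVG(value) AS avg_value FROM obs GROUP BY concept').show()
```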
-
Re. the stats that you reported for the streaming case, how many nodes/cores were used for the streaming pipeline? And how big was the input DB, in terms of the approximate number of patients, observations, etc.?
Thanks again.