Module for ETL & Predictive Modeling across Multiple Platform (DW) Midterm Presentation

vineetv2821993 · June 19, 2014, 6:12am

Greetings Everyone,

I know you are loving my weekly blog posts and now its good to see all of you waiting for this mid session progress. At long last here is my Midterm Presentation. Enjoy

https://www.youtube.com/watch?v=jQNmWkxr4kk&feature=youtu.be

Please feel free to give your comments and review about my progress and this presentation

Coming together is a beginning, keeping together is progress, working together is success.

Short description: The amount of data generated is getting increased day by day and so as the appetite for finding the information from data as well. Growing appetite for data analysis can’t be achieved by transactional databases. The intention of this project is to have a ETL module to interact with multiple DW compliance over which predictive modeling code could run. So, that healthcare provider can check upon the predictive modeling result based on historical data they are having/loading.

burke · June 24, 2014, 2:38am

@vineetv2821993, thank you for your presentation. Your project is very close to an ETL Module we planned to build for AMPATH. Some of the additional factors we were considering:

Allow raw SQL as one method for collecting data for transforms, since some of our queries for ETL are complex.
Make it easy to design transforms that process only new/changed data.
Consider mechanisms to allow transforms to be “streamed” (i.e., allow max memory or CPU usage to be controlled in exchange for taking longer to perform transform)
Allow transforms to be scheduled (e.g., to run nightly).
As you’ve started, allow for a variety of targets to be added over time (e.g., HIVE, MySQL, file output, SOLR, etc.)

Have you considered these issues in your project?

vineetv2821993 · June 24, 2014, 3:56am

Hi Burke,

Thank You for your comments and suggestion about my ongoing project progress .

Allow raw SQL as one method for collecting data for transforms, since some of our queries for ETL are complex.
- Yeah I considered it. I will add that option . But I am actually trying to avoid it was not looking safe because may be it is a security concern and an attacker can inject its query easily .
As you’ve started, allow for a variety of targets to be added over time (e.g., HIVE, MySQL, file output, SOLR, etc.)
Yes I considerd those issues like late night scheduling and various target output. For maximum memory or CPU Usage that didn’t come in mind because I am dealing with small data, but I will also consider it now . I will implement them soon

burke · June 24, 2014, 4:30am

This can be addressed by requiring an additional privilege for creating/editing raw SQL transforms that could be given only to people who are trusted or already have full access to the system (e.g., administrators, developers, etc.).

vineetv2821993 · June 26, 2014, 4:23pm

@burke So that means the person who log in to the webapp, there should be a list of user that describes, who can submit a raw sql query? I am right?

burke · June 26, 2014, 4:32pm

I mean, for exampe, that users who have the “MySQL ETL Raw SQL” privilege will see the option to add or edit existing transforms that are raw SQL. Users with privileges to use the module but without this “Raw SQL” privilege may be able to see raw SQL transforms and schedule or run them, but they would not be able to edit or create them.

vineetv2821993 · July 8, 2014, 7:07am

Hi @burke

I am working on your suggestion of scheduled Interval ETL. So what i think about the implementation is :-

Create a HTML form take information about a scheduled interval, like columns to export, source database, target database, user credentials etc.
Put information of source db and target db and credentials in another database in mysql localy (where the main db exist of openmrs) with the timestamp.
Then if anyone want to get those ETL information, they can retrieve it from this mysql data easily.

What You suggest?

burke · July 14, 2014, 10:25pm

@vineetv2821993,

I would suggest creating an OpenMRS Task – i.e., extending AbstractTask as seen in these examples. Your task can have additional properties about the source and destination and your module can even provide a simple form to create & schedule these tasks, but admins will also be able to see them and administrate them through the existing task scheduler (admin/Admin123) within OpenMRS. You could put credentials into your TaskDefinition by using a password field when collecting the password and using a reversible hash to obfuscate the credentials stored in the scheduler_task_config_property table.