I know this is a question that always comes up, but I'm pinging the community again to see what anyone might currently be doing re: data warehousing OpenMRS data.
Reporting and analytics have been an ever-present challenge for us here at PIH, and we are planning to put some additional time into them over the next few weeks.
What I’m looking to do is set up a data warehouse that contains “flattened” OpenMRS data, removing the complexity of the OpenMRS data model for use by data analysts without familiarity with OpenMRS. (More detailed tools could then be built on top of this by the data analysts for use by other consumers.)
Right now we have a simplified version of that in the “row-per-encounter” data exports we generate as CSV files. One route would be to give the reporting module (which generates these exports) a renderer that can insert this flattened data directly into a SQL database.
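To make “flattened” concrete, here’s a minimal sketch of what one such row-per-encounter table might look like (SQLite just to keep it runnable; the table, columns, and sample row are all invented for illustration, not our actual export schema):

```python
import sqlite3

# Hypothetical flattened "row-per-encounter" table: one row per encounter,
# one column per obs concept of interest.
conn = sqlite3.connect("warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS vitals_encounter (
        encounter_id       INTEGER PRIMARY KEY,
        patient_id         INTEGER,
        encounter_datetime TEXT,
        location           TEXT,
        weight_kg          REAL,
        height_cm          REAL
    )
""")
# An illustrative row, as a renderer might insert it.
conn.execute(
    "INSERT INTO vitals_encounter VALUES (?, ?, ?, ?, ?, ?)",
    (12345, 678, "2014-06-01 09:30:00", "Mirebalais", 62.5, 158.0),
)
conn.commit()
```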
But the route I am currently spiking on is inserting flattened OpenMRS data directly into a NoSQL database. (Perhaps to start, each document would be a single encounter.) Just wondering if anyone has had any experience with this, or with NoSQL databases in general. I’ve played around with CouchDB in the past, but it’s been a couple of years.
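To make the document-per-encounter idea concrete, here’s a rough sketch assuming a local CouchDB instance and the CouchDB-Python client (the field names are invented for illustration, not an actual export format):

```python
import couchdb  # the CouchDB-Python package

server = couchdb.Server("http://localhost:5984/")
db = server["encounters"] if "encounters" in server else server.create("encounters")

# One OpenMRS encounter flattened into a single document: the obs live
# inside the encounter rather than in a separate table.
doc = {
    "_id": "encounter-uuid-goes-here",
    "patient_uuid": "patient-uuid-goes-here",
    "encounter_type": "Vitals",
    "encounter_datetime": "2014-06-01T09:30:00",
    "location": "Mirebalais",
    "obs": {
        "weight_kg": 62.5,
        "height_cm": 158.0,
    },
}
db.save(doc)
```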
A couple of years back I played around with pushing encounter REST representations into Elasticsearch by watching an Atom feed. It worked well enough tech-wise, though I had to convert the schema slightly because ES didn’t like having the obs.value field represent multiple datatypes.
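Roughly, the conversion looked something like this (a from-memory sketch with the Python client, not the actual spike code; the index name and document shape are illustrative):

```python
from elasticsearch import Elasticsearch  # elasticsearch-py, 8.x-style API

es = Elasticsearch("http://localhost:9200")

# Each obs carries one typed value field (value_numeric / value_coded /
# value_text) instead of a single polymorphic obs.value, which ES's
# per-field mappings won't accept.
doc = {
    "encounter_uuid": "encounter-uuid-goes-here",
    "encounter_type": "Adult Visit",
    "obs": [
        {"concept": "WEIGHT (KG)", "value_numeric": 62.5},
        {"concept": "REFERRED TO", "value_coded": "OUTPATIENT CLINIC"},
        {"concept": "CLINICAL NOTES", "value_text": "Stable on current regimen."},
    ],
}
es.index(index="encounters", id=doc["encounter_uuid"], document=doc)
```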
Who is the target user for whom NoSQL is the more convenient format?
For the data consumers I imagine you have, I think your idea of using reporting DSDs (DataSetDefinitions) to populate an analyst-friendly SQL database would provide more value for less effort (assuming the DB can be wiped and rebuilt overnight).
@darius we are considering both SQL and NoSQL solutions; I’m just pursuing the NoSQL one right now. The target consumer would generally be a technical user, like someone in MEQ. The idea is that this would be a “middle” format designed primarily to let data analysts with strong technical skills work on the data (and refine it for consumption by other users further down the pipeline) without requiring them to know the complexities of the OpenMRS data model or to code in Java.
The advantage of NoSQL would be that in a SQL setting, a flattened encounter & obs model requires either a table per encounter type or a single table with thousands of columns. (How much of an advantage this really provides is still up for debate, which is why we are playing with multiple approaches.)
Is your Elasticsearch spike up on GitHub anywhere?
Our approach is relatively straightforward. We flatten all new encounter/obs data every few minutes and use those datasets as the backbone for doing reporting queries. For the most part, our ETL processes are done in SQL and run on a cron job.
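For a flavor of what those jobs do, here’s a simplified sketch (not our production code) that drops and rebuilds a flat table using conditional aggregation to pivot obs rows into encounter columns. Concept ids 5089/5090 are the standard weight/height concepts, but the ids and table names will vary by implementation:

```python
import mysql.connector  # MySQL Connector/Python

# Scheduled via cron, e.g. an entry like:
#   */10 * * * * python flatten_obs.py
conn = mysql.connector.connect(
    host="localhost", user="openmrs", password="openmrs", database="openmrs"
)
cur = conn.cursor()

# Wipe and rebuild the flat table on each run.
cur.execute("DROP TABLE IF EXISTS flat_vitals")

# MAX(CASE ...) pivots one-row-per-obs into one-column-per-concept,
# producing one row per encounter.
cur.execute("""
    CREATE TABLE flat_vitals AS
    SELECT e.encounter_id,
           e.patient_id,
           e.encounter_datetime,
           MAX(CASE WHEN o.concept_id = 5089 THEN o.value_numeric END) AS weight_kg,
           MAX(CASE WHEN o.concept_id = 5090 THEN o.value_numeric END) AS height_cm
    FROM encounter e
    JOIN obs o ON o.encounter_id = e.encounter_id AND o.voided = 0
    WHERE e.voided = 0
    GROUP BY e.encounter_id, e.patient_id, e.encounter_datetime
""")
conn.commit()
```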