I know this is a question that always comes up, but I'm pinging the community again to see what anyone might currently be doing re: data warehousing OpenMRS data.
Reporting and analytics have been an ever-present challenge for us here at PIH, and we are planning to put some additional time into them over the next few weeks.
What I’m looking to do is set up a data warehouse that contains “flattened” OpenMRS data, removing the complexity of the OpenMRS data model for use by data analysts without familiarity with OpenMRS. (More detailed tools could then be built on top of this by the data analysts for use by other consumers.)
Right now we have a simplified version of that in the “row-per-encounter” data exports we generate as CSV files. One route would be to give the reporting module (which generates these exports) a renderer that can insert this flattened data directly into a SQL database.
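To make “flattened” concrete, here’s a minimal sketch of what one such row-per-encounter table might look like (SQLite just to keep it runnable; the table, columns, and sample row are all invented for illustration, not our actual export schema):

```python
import sqlite3

# Hypothetical flattened "row-per-encounter" table: one row per encounter,
# one column per obs concept of interest.
conn = sqlite3.connect("warehouse.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS vitals_encounter (
        encounter_id       INTEGER PRIMARY KEY,
        patient_id         INTEGER,
        encounter_datetime TEXT,
        location           TEXT,
        weight_kg          REAL,
        height_cm          REAL
    )
""")
# An illustrative row, as a renderer might insert it.
conn.execute(
    "INSERT INTO vitals_encounter VALUES (?, ?, ?, ?, ?, ?)",
    (12345, 678, "2014-06-01 09:30:00", "Mirebalais", 62.5, 158.0),
)
conn.commit()
```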
But the route I am currently spiking on is inserting flattened OpenMRS data directly into a NoSQL database. (Perhaps to start, each document would be a single encounter.) Just wondering if anyone has had any experience with this, or with NoSQL databases in general. I’ve played around with CouchDB in the past, but it’s been a couple of years.
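To make the document-per-encounter idea concrete, here’s a rough sketch assuming a local CouchDB instance and the CouchDB-Python client (the field names are invented for illustration, not an actual export format):

```python
import couchdb  # the CouchDB-Python package

server = couchdb.Server("http://localhost:5984/")
db = server["encounters"] if "encounters" in server else server.create("encounters")

# One OpenMRS encounter flattened into a single document: the obs live
# inside the encounter rather than in a separate table.
doc = {
    "_id": "encounter-uuid-goes-here",
    "patient_uuid": "patient-uuid-goes-here",
    "encounter_type": "Vitals",
    "encounter_datetime": "2014-06-01T09:30:00",
    "location": "Mirebalais",
    "obs": {
        "weight_kg": 62.5,
        "height_cm": 158.0,
    },
}
db.save(doc)
```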
A couple of years back I played around with pushing encounter REST representations into Elasticsearch by watching an Atom feed. It worked well enough tech-wise, though I had to convert the schema slightly because ES didn’t like having the obs.value field represent multiple datatypes.
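Roughly, the conversion looked something like this (a from-memory sketch with the Python client, not the actual spike code; the index name and document shape are illustrative):

```python
from elasticsearch import Elasticsearch  # elasticsearch-py, 8.x-style API

es = Elasticsearch("http://localhost:9200")

# Each obs carries one typed value field (value_numeric / value_coded /
# value_text) instead of a single polymorphic obs.value, which ES's
# per-field mappings won't accept.
doc = {
    "encounter_uuid": "encounter-uuid-goes-here",
    "encounter_type": "Adult Visit",
    "obs": [
        {"concept": "WEIGHT (KG)", "value_numeric": 62.5},
        {"concept": "REFERRED TO", "value_coded": "OUTPATIENT CLINIC"},
        {"concept": "CLINICAL NOTES", "value_text": "Stable on current regimen."},
    ],
}
es.index(index="encounters", id=doc["encounter_uuid"], document=doc)
```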
Who is the target user for whom NoSQL is the more convenient format?
For the data consumers I imagine you have, I think your idea of using reporting DSDs (DataSetDefinitions) to populate an analyst-friendly SQL database would provide more value for less effort (assuming the DB can be wiped and rebuilt overnight).
@darius we are considering both SQL and NoSQL solutions; I’m just pursuing the NoSQL one right now. The target consumer would generally be a technical user, like someone in MEQ. The idea is that this would be a “middle” format designed primarily to let data analysts with strong technical skills work on the data (and refine it for consumption by other users further down the pipeline) without requiring them to know the complexities of the OpenMRS data model or to code in Java.
The advantage of NoSQL would be that in a SQL setting, a flattened encounter & obs model requires either a table per encounter type or a single table with thousands of columns. (How much of an advantage this really provides is still up for debate, which is why we are playing with multiple approaches.)
Is your Elasticsearch spike up on GitHub anywhere?
Our approach is relatively straightforward. We flatten all new encounter/obs data every few minutes and use those datasets as the backbone for doing reporting queries. For the most part, our ETL processes are done in SQL and run on a cron job.
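For a flavor of what those jobs do, here’s a simplified sketch (not our production code) that drops and rebuilds a flat table using conditional aggregation to pivot obs rows into encounter columns. Concept ids 5089/5090 are the standard weight/height concepts, but the ids and table names will vary by implementation:

```python
import mysql.connector  # MySQL Connector/Python

# Scheduled via cron, e.g. an entry like:
#   */10 * * * * python flatten_obs.py
conn = mysql.connector.connect(
    host="localhost", user="openmrs", password="openmrs", database="openmrs"
)
cur = conn.cursor()

# Wipe and rebuild the flat table on each run.
cur.execute("DROP TABLE IF EXISTS flat_vitals")

# MAX(CASE ...) pivots one-row-per-obs into one-column-per-concept,
# producing one row per encounter.
cur.execute("""
    CREATE TABLE flat_vitals AS
    SELECT e.encounter_id,
           e.patient_id,
           e.encounter_datetime,
           MAX(CASE WHEN o.concept_id = 5089 THEN o.value_numeric END) AS weight_kg,
           MAX(CASE WHEN o.concept_id = 5090 THEN o.value_numeric END) AS height_cm
    FROM encounter e
    JOIN obs o ON o.encounter_id = e.encounter_id AND o.voided = 0
    WHERE e.voided = 0
    GROUP BY e.encounter_id, e.patient_id, e.encounter_datetime
""")
conn.commit()
```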