This topic addresses the need to implement Big Data solutions as an additional tool to crunch the rapidly growing historical medical data sets we now have, and to gain new, meaningful insights from them.
Data warehousing: Data warehousing is a set of techniques and software that enable the collection of data from operational systems, the integration and harmonization of that data into a centralized database, and then the analysis, visualization, and tracking of key performance indicators on a dashboard.
- Hadoop is a complete, open-source ecosystem for capturing, organizing, storing, searching, sharing, analyzing, visualizing, and otherwise processing disparate data sources (structured, semi-structured, and unstructured) in a cluster of commodity computers. This architecture gives Hadoop clusters incremental and virtually unlimited scalability – from a few to a few thousand servers, each offering local storage and computation.
- Hadoop’s ability to store and analyze large data sets in parallel on a large cluster of computers yields exceptional performance, while the use of commodity hardware results in a remarkably low cost. In fact, Hadoop clusters often cost 50 to 100 times less on a per-terabyte basis than today’s typical data warehouse. A key difference between data warehousing and Hadoop is that a data warehouse is typically implemented in a single relational database that serves as the central store. In contrast, Hadoop and the Hadoop File System are designed to span multiple machines and handle huge volumes of data that surpass the capability of any single machine.
- The Hadoop ecosystem includes a data warehousing layer/service built on top of the Hadoop core. Those services on top of Hadoop include SQL (Presto), SQL-like (Hive), and NoSQL (HBase) types of data stores. In contrast, over the last decade, large data warehouses shifted to custom multiprocessor appliances, such as those from Netezza (bought by IBM) and Teradata, to scale to large volumes.
- Data warehousing techniques, including Extract-Transform-Load (ETL), dimensional modeling, and business intelligence, will be adapted to the new Hadoop/NoSQL environments. Furthermore, those technologies will also morph to support more hybrid environments. The key principle seems to be that not all data is equal, so IT managers should choose the data storage and access mechanism that best suits how the data is used. Hybrid environments could include key-value stores, relational databases, graph stores, document stores, columnar stores, XML databases, metadata catalogs, and others.
Hadoop will not replace relational databases or traditional data warehouse platforms, but its superior price/performance ratio can help organizations lower costs while maintaining their existing applications and reporting infrastructure.
Is Hadoop a viable/affordable solution for small, resource-constrained implementations? We should make the distinction that, while any OpenMRS installation may qualify as “Big Data”, Hadoop is intended for large enterprises with the resources for hardware.
Using Hadoop requires a paradigm shift to map-reduce thinking in order to take advantage of its processing power. We could definitely use some domain knowledge in this area, if someone is willing to teach a few developers how to harness Hadoop with the OpenMRS data model.
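To make the "map-reduce thinking" concrete, here is a minimal sketch in plain Python that simulates the three phases of a Hadoop job (map, shuffle, reduce) over a handful of hypothetical observation records. The record layout and concept names are made up for illustration; a real job would run the same logic as Hadoop mapper/reducer classes over the full obs table.

```python
from collections import defaultdict

# Hypothetical observation records: (patient_id, concept, value).
observations = [
    (1, "WEIGHT (KG)", 62.0),
    (1, "CD4 COUNT", 350),
    (2, "WEIGHT (KG)", 71.5),
    (2, "CD4 COUNT", 410),
    (3, "CD4 COUNT", 290),
]

def map_phase(records):
    """Emit one (key, value) pair per record: concept -> 1."""
    for _patient_id, concept, _value in records:
        yield concept, 1

def shuffle(pairs):
    """Group emitted values by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the counts for each concept."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(observations)))
print(counts)  # {'WEIGHT (KG)': 2, 'CD4 COUNT': 3}
```

The shift is that you never write "loop over all rows and update a running total in one place"; you express the computation as independent per-record emissions plus an associative combine step, which is what lets Hadoop spread the work across a cluster.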
Also, many large implementations still have to perform per-patient reporting, and I wonder if Hadoop is an adequate tool for that kind of information gathering. In my experience with Ampath, I found building derived tables to be extremely helpful in speeding up the per-patient reporting process, but look forward to other Big Data techniques that could be used instead.
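As a rough illustration of the derived-table approach mentioned above, the sketch below uses an in-memory SQLite database (standing in for MySQL) with a simplified, hypothetical obs schema. The idea is the same one used at scale: pre-aggregate one row per patient so per-patient reports read a small summary table instead of scanning millions of observations.

```python
import sqlite3

# In-memory stand-in for the OpenMRS obs table; columns are simplified.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE obs (person_id INTEGER, concept TEXT, value REAL, obs_date TEXT)"
)
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?, ?)",
    [
        (1, "CD4 COUNT", 350, "2013-01-10"),
        (1, "CD4 COUNT", 410, "2013-06-02"),
        (2, "CD4 COUNT", 290, "2013-03-15"),
    ],
)

# The derived table pre-aggregates one row per patient, so per-patient
# reports can read it directly instead of scanning the full obs table.
conn.execute("""
    CREATE TABLE derived_patient_summary AS
    SELECT person_id,
           COUNT(*)      AS obs_count,
           MAX(obs_date) AS last_obs_date
    FROM obs
    GROUP BY person_id
""")

rows = conn.execute(
    "SELECT person_id, obs_count, last_obs_date "
    "FROM derived_patient_summary ORDER BY person_id"
).fetchall()
print(rows)  # [(1, 2, '2013-06-02'), (2, 1, '2013-03-15')]
```

In production the derived tables would be refreshed on a schedule (or incrementally), trading some staleness for much faster report queries.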
Big Data solutions don’t mean just Hadoop; there is an ecosystem of cool technologies built around it. I know people fear the additional cost incurred by adopting this, but my point is that we have not switched from MySQL yet, and developing the OpenMRS software to work with it and tuning it may take some time. It’s not that I am advocating abandoning MySQL. What I am advocating is giving people options: modify our existing code so people can go for MySQL, or HBase, or for that matter any database they like - this, I think, is the way to go. Let people decide what they want; we give them options. Come on people…show some interest. And please comment to give your view. And for the record, I am always ready to invest my time in it.
Since people are showing less interest in this endeavour, can anybody please tell me the maximum data size of the biggest OpenMRS installation out there? Because if the size is not really that much, there is really no need to switch loyalty from the existing solutions…
Ampath is one of the largest implementations, with over 500,000 patients and 170+ million observations. I worked with them over the last four years, developing strategies for performance improvements in reporting and data integrity. We run MySQL, and the database files comprise over 158GB of disk space at this time.
OpenMRS is built with the understanding that some developers may want to change the data access layer to use a different database or ORM. All you have to do is write new DAO implementations, although that understates how much work it would require. The best and fastest way to shoehorn a database other than MySQL into OpenMRS, if you want to, is to write or use a driver that works with Hibernate and can interpret SQL.
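To illustrate why new DAO implementations are the swap point, here is a language-agnostic sketch of the pattern in Python (the real OpenMRS DAOs are Java interfaces backed by Hibernate; the class and method names below are hypothetical). The service layer depends only on the interface, so supporting a new store means supplying one new implementation, not rewriting callers.

```python
from abc import ABC, abstractmethod

class PatientDAO(ABC):
    """Interface the service layer codes against."""
    @abstractmethod
    def get_patient(self, patient_id: int) -> dict: ...

class MySQLPatientDAO(PatientDAO):
    def get_patient(self, patient_id: int) -> dict:
        # A real implementation would issue SQL via Hibernate/JDBC; stubbed here.
        return {"id": patient_id, "source": "mysql"}

class HBasePatientDAO(PatientDAO):
    def get_patient(self, patient_id: int) -> dict:
        # A real implementation would do an HBase row lookup; stubbed here.
        return {"id": patient_id, "source": "hbase"}

def report_on_patient(dao: PatientDAO, patient_id: int) -> str:
    """Service-layer code: unaware of which backend is plugged in."""
    patient = dao.get_patient(patient_id)
    return f"patient {patient['id']} via {patient['source']}"

print(report_on_patient(MySQLPatientDAO(), 42))  # patient 42 via mysql
print(report_on_patient(HBasePatientDAO(), 42))  # patient 42 via hbase
```

The catch, as noted above, is that "write new DAO implementations" hides a lot of work: every query, transaction boundary, and Hibernate mapping assumption in the existing DAOs has to be rethought for the new store.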