Do we really need Logstash too? I would think that E+K is enough.
You don’t “need” Docker, but it’s the easiest way to let anyone interested in playing around replicate this.
I think this is a really neat idea. But I think we should be a bit more directed in how we approach this.
First, let’s make one more attempt to reach out to implementations on the Implementing list and see if someone wants to provide some real deidentified data, in exchange for getting some analyses coded up. (Randomly generated data is not going to produce pretty charts.)
Also, I think that @jdick and AMPATH may already be putting OpenMRS data into ElasticSearch for analysis, and maybe we could use something they’ve done as a starting point. Also, @mogoodrich and @mseaton were recently experimenting with something along these lines, but I think they went another way.
ElasticSearch + Kibana are a great combination, but the underlying model is to search through indexed documents. It’s quite flexible, and you can aggregate these results and replicate some of what you’d otherwise do with OLAP cubes, but you don’t get the same performance and memory usage characteristics. (Specifically, if you were to index all the observations into ES, it’s going to get huge and require more resources than OpenMRS itself. That seems impractical for a typical implementation, so I would avoid this from the beginning.)
If I were doing this I would look into this workflow:
1. triggered by an atom feed, schedule, or some other event mechanism… [implement this part last]
2. fetch key events in a simplified view (not the pure OpenMRS data model) [ideally produce these via Reporting and Reporting REST, but you could hack it in the first pass]
3. thus we generate a quick spike at an analytic data model, indexed in ElasticSearch
4. build Kibana graphs on top of this
5. then see if you can embed the Kibana graphs in the RA UI via an OWA (or alternately, copy the ES queries that Kibana is doing into an OWA + use some graphing library)
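To make the "simplified view, indexed in ElasticSearch" part concrete, here is a rough Python sketch: it flattens a nested record into a flat event document and renders it in the newline-delimited JSON format that the Elasticsearch `_bulk` endpoint accepts. The input shape, the field names, and the `openmrs-events` index name are all invented for illustration; they are not the OpenMRS data model or anything Reporting REST actually returns.

```python
import json
from datetime import date


def to_event_doc(encounter):
    """Flatten a nested encounter record into one flat, analytics-friendly document.

    The input structure is hypothetical, not the real OpenMRS model.
    """
    return {
        "event_type": encounter["type"],
        "event_date": encounter["date"].isoformat(),
        "location": encounter["location"]["name"],
        "patient_gender": encounter["patient"]["gender"],
        "patient_age_years": encounter["patient"]["age"],
    }


def bulk_lines(docs, index_name):
    """Render documents as a _bulk request body: one action line, then one source line."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The bulk body must end with a newline.
    return "\n".join(lines) + "\n"


# Made-up sample record, standing in for whatever the fetch step returns.
sample = {
    "type": "Admission",
    "date": date(2017, 3, 1),
    "location": {"name": "Outpatient Clinic"},
    "patient": {"gender": "F", "age": 34},
}

body = bulk_lines([to_event_doc(sample)], "openmrs-events")
print(body)
```

The resulting body could be POSTed to `_bulk` with the `application/x-ndjson` content type. Keeping the documents flat and free of identifying fields is a design choice here, not a requirement of Elasticsearch.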
Well, you need something to receive the logs and process them using grok. You can try Graylog, since it lets you send directly to it and it still uses Elasticsearch for the indexes. I suggest you use the Elastic *beats to collect the logs, metrics, or any other information.
According to a colleague, Logstash is the easiest way to import rows from a relational database into Elastic. Maybe it’s killing flies with guns; I haven’t tried it yet. The alternative is to build an ETL process to do the same, but with Logstash you don’t need to code.
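For comparison, the hand-coded ETL alternative does not have to be large either. A rough Python sketch of the extract step follows, assuming a made-up `encounter` table in an in-memory SQLite database as a stand-in for the real database; the table, its columns, and the `fetch_documents` helper are hypothetical.

```python
import sqlite3


def fetch_documents(conn, query):
    """Run a SQL query and yield each row as a plain dict, ready to be indexed."""
    conn.row_factory = sqlite3.Row  # rows become addressable by column name
    for row in conn.execute(query):
        yield dict(row)


# Stand-in database with an invented schema (not the OpenMRS schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE encounter ("
    "encounter_id INTEGER PRIMARY KEY, encounter_type TEXT, encounter_date TEXT)"
)
conn.executemany(
    "INSERT INTO encounter VALUES (?, ?, ?)",
    [(1, "Vitals", "2017-03-01"), (2, "Admission", "2017-03-02")],
)

docs = list(
    fetch_documents(
        conn,
        "SELECT encounter_id, encounter_type, encounter_date "
        "FROM encounter ORDER BY encounter_id",
    )
)
for doc in docs:
    print(doc)
```

The trade-off is the one already described: this is code someone has to maintain, whereas the Logstash route expresses the same query in configuration.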
I can’t recall us ever being able to get deidentified data from an implementation (yes, let’s make one more attempt!), but we may get them to try out what we accomplish here and have an incentive to collaborate further on that setup. That is why I think it is important to do it in Docker, so it can easily be run by others on real data.
There are official Docker images for the whole stack, with documentation, under the elastic organization on GitHub, so it shouldn’t be hard.
I’m very excited to hear your stories!
Has anyone ever evaluated how big it gets in practice? I figure you can always limit it to a year of observations and still find it useful.
@lluismf, how about @adamg works on creating the Docker setup and you take it from there… @adamg, would you be interested in creating a docker-compose with RA 2.6 and ELK (full stack)?
@adamg, yes, it would be the most straightforward. You can even write yet another docker-compose file for Kibana to be combined with RA instead of modifying the generated one. Thanks!
Call me stubborn, but I prefer to work natively instead of in a virtual machine. I will create a Logstash .conf file with the queries, and since Kibana can export its artifacts to JSON files, it’ll be easy to import all of them into the Docker image built by @adamg.
Any database used for business intelligence is bigger than the online one, because of the de-normalization.
So yes, it’ll be even bigger unless, instead of replicating raw obs, we index just aggregated (grouped) data. There are many options.
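As one illustration of the aggregated option, here is a small Python sketch that collapses raw observations into one count per concept per month before anything is indexed. The `(concept, date)` tuple shape and the concept names are invented for the example.

```python
from collections import Counter


def aggregate_obs(obs_rows):
    """Collapse (concept, ISO date) pairs into one count per concept per month."""
    # obs_date[:7] keeps only the "YYYY-MM" part of an ISO date string.
    counts = Counter((concept, obs_date[:7]) for concept, obs_date in obs_rows)
    return [
        {"concept": concept, "month": month, "obs_count": n}
        for (concept, month), n in sorted(counts.items())
    ]


# Made-up raw observations: (concept name, observation date).
raw_obs = [
    ("WEIGHT", "2017-01-05"),
    ("WEIGHT", "2017-01-20"),
    ("WEIGHT", "2017-02-03"),
    ("CD4", "2017-01-11"),
]

summary = aggregate_obs(raw_obs)
for row in summary:
    print(row)
```

Four raw rows become three summary documents here; on a real obs table the reduction would be far larger, at the cost of losing the ability to drill down to individual observations.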
If your approach is “index all the patients/visits/encounters/obs in ElasticSearch” then you’ll be able to do some cool stuff with ES, but it will probably explode when you try it for an implementation with millions of obs.
But every single one of the metrics @raff mentioned in the first post can be calculated without indexing the entire obs table.
(Basically what I’m saying is don’t index the entire obs table for analytics/reporting/visualization; everything else is of manageable size, but not obs.)
@raff I’ve created a repo with the current config for RA and ELK. This is my first time working with ELK, so please let me know if there is anything that I should change.