RFC: A separate, read-optimized projection of OpenMRS clinical data

We’ve been drafting a design for openmrs-module-querystore — a separate, denormalized, read-side projection of OpenMRS clinical data, intended as shared infrastructure for any consumer that needs query patterns core’s transactional schema doesn’t serve well. The visible use cases today are AI/ML pipelines, semantic + keyword chart search, cross-patient cohort queries, and analytics, but the design isn’t tied to them — anything that needs a read-optimized view of clinical data is in scope.

This is early and very much a work in progress — no implementation yet, on purpose. Nothing in the design is locked in: every decision in the ADR is open to revision based on community input. The questions below are where we’d most like feedback first, but they’re not the only ones up for discussion — push back on anything.

Repo (design docs only at this stage): :point_right: https://github.com/openmrs/openmrs-module-querystore — CQRS read-side projection of clinical data for AI, analytics, and reporting. Architectural decisions: docs/adr.md

The current proposal in one paragraph

Apply CQRS: core remains the source of truth; a second store maintains an eventually-consistent read-side projection optimized for queries. The current candidate for the backing store is Elasticsearch (hybrid BM25 + dense-vector kNN + structured filtering in one system), but the choice is open — if something fits OpenMRS’s deployment realities better, the design should accommodate it. Data lives in per-type indices under an openmrs_* namespace. Each document has three parts: a plain-text serialization of the record, a vector embedding generated from that text, and structured metadata for filtering. Sync is events first, with AOP only as a scoped gap-filler. The v1 consumer surface is a Java service using OpenMRS’s standard @Authorized privilege annotations; REST and FHIR layers are deferred from v1 but explicitly additive — they layer on top of the same service when we get to them.
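
To make the v1 consumer surface concrete, here is a minimal sketch of what the Java service and document shape might look like. All names here (QueryStoreService, QueryStoreDocument, the search signature) are illustrative assumptions rather than settled API; only the @Authorized guard and the three-part document structure come from the draft:

```java
// Hypothetical sketch of the v1 consumer surface; names are illustrative.
import java.util.List;
import java.util.Map;

import org.openmrs.annotation.Authorized;
import org.openmrs.util.PrivilegeConstants;

/** One denormalized document in a per-type openmrs_* index. */
class QueryStoreDocument {
    String recordUuid;            // UUID of the source record in core
    String text;                  // plain-text serialization of the record
    float[] embedding;            // dense vector generated from `text`
    Map<String, Object> metadata; // structured fields for filtering
}

/** The v1 Java consumer surface, guarded by standard OpenMRS privileges. */
interface QueryStoreService {

    /** Hybrid keyword + semantic search over one document type. */
    @Authorized(PrivilegeConstants.GET_PATIENTS)
    List<QueryStoreDocument> search(String documentType, String query,
            Map<String, Object> filters, int limit);
}
```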

The most concrete near-term consumer is the existing chartsearchai module, which today maintains its own ES pipeline. A migration analysis identifies what would need to change on each side.

What we’d most like feedback on

  1. Module or core? The current draft sits as a module on the grounds that not every deployment needs it, the backing-store choice shouldn’t bind core, and search/analytics dependencies don’t belong in every deployment. But the counter-argument is real — if this becomes the standard read surface across consumers, a module makes it optional infrastructure each deployment must add. Should this live in core instead, or stay a module?
  2. Is CQRS the right framing at all? It adds a second system and eventual consistency. Worth it, or should we push harder on the transactional database?
  3. What should the backing store be? Elasticsearch is the current candidate (mature, hybrid keyword + vector + structured filtering in one system, already in use by some deployments), but the ~1–2 GB memory floor is non-trivial in low-resource settings. Alternatives worth considering: PostgreSQL + pgvector, OpenSearch, a dedicated vector DB paired with a separate keyword index, or something else entirely. What would fit your deployments better?
  4. Plain-text serialization over FHIR. Chosen for token efficiency and embedding quality. Right call, or should we offer a FHIR projection too?
  5. Coarse GET_PATIENTS authorization in v1. Cross-patient results may include patients the caller couldn’t read individually via core — dataFilter and location-based ACLs are not honored. Acceptable starting point, or deal-breaker?
  6. Sync reliability. Durable subscription, dead-letter handling, reconciliation against drift — anyone who’s built event-driven sync on top of the OpenMRS Event module: what bit you? (A sketch of the failure-handling shape we have in mind follows this list.)
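
To make item 6 concrete, here is a minimal sketch of the kind of retry and dead-letter handling we mean. MessageListener is standard JMS (the Event module is ActiveMQ-based), but DeadLetterStore, Projector, and the retry policy are illustrative assumptions, not part of the draft:

```java
// Minimal sketch of retry + dead-letter handling for projection sync.
// DeadLetterStore, Projector, and the retry policy are hypothetical.
import javax.jms.MapMessage;
import javax.jms.Message;
import javax.jms.MessageListener;

class ProjectionSyncListener implements MessageListener {

    private static final int MAX_ATTEMPTS = 3;

    private final DeadLetterStore deadLetters; // hypothetical durable failure store
    private final Projector projector;         // writes the read-side document

    ProjectionSyncListener(DeadLetterStore deadLetters, Projector projector) {
        this.deadLetters = deadLetters;
        this.projector = projector;
    }

    @Override
    public void onMessage(Message message) {
        try {
            String uuid = ((MapMessage) message).getString("uuid");
            for (int attempt = 1; ; attempt++) {
                try {
                    projector.reproject(uuid); // idempotent upsert keyed by UUID
                    return;
                } catch (Exception e) {
                    if (attempt >= MAX_ATTEMPTS) {
                        // Park the failure; a reconciliation job replays these
                        // and periodically diffs the store against core for drift.
                        deadLetters.record(uuid, e);
                        return;
                    }
                }
            }
        } catch (Exception e) {
            deadLetters.record("<unparseable message>", e);
        }
    }

    interface DeadLetterStore { void record(String uuid, Exception cause); }
    interface Projector { void reproject(String uuid); }
}
```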

The ADR’s Open Questions section lists items we haven’t formed a view on yet (patient merge, bootstrap, long-text chunking, embedding model versioning, complex obs, PII scopes, concept-set queries, time-zone convention, Person vs Patient model). Input on any of them is welcome — and accepted decisions in the ADR are equally open to challenge.

Concerns about the premise (“do we even want this?”) are explicitly welcome too.


For me, I would want this to remain a module for Platform 2 compatibility and consider moving to core in Platform 3, though I am ever mindful of this brilliant post. At the very least, I think we need more rapid development of the query store functionality than is likely tolerable in core.

This is definitely an interesting question and I’d really value the input of people not named Daniel Kayiwa or Ian Bacher on this one. CQRS is generally a pattern applied to systems built on event sourcing, and while we’ve been playing around with improvements to events in OpenMRS, the core data model itself definitely is not event-sourced. Eventual consistency has some real tradeoffs, not only in processing overhead (as the write representation is translated to the query representation) but also in the (probably noticeable in real-world scenarios!) gap between a write and the query store being updated.

That said, I think there’s a very strong argument to be made that some kind of query store can enable a lot of functionality that the community has asked for. One way to think about this is that the query store effectively stores “flattened” records of OMRS transactional data and this module gives us the ability to “plug-in” different generated representations. This won’t solve everything we want flattened data for (i.e., it’s probably not the right representation for large-scale analytic queries), but it does seem a valuable representation to back things like patient flags, CDSS rules, calculated obs, etc.

I think Postgres only makes sense in this role if we also once again work on supporting Postgres as the transactional data store.

The other obvious storage to consider is Infinispan, which is clusterable, supports vector stores, and has been part of core since TRUNK-6302 / 2.8.0.

MariaDB also supports vector storage (as of 11.8, which landed last year), but I’m not sure we’ve tested it with OMRS, and it isn’t as widely used as pgvector. As near as I can tell, MySQL’s vector storage landed only in their paid products.

I favour purpose-driven, custom representations. Maybe it makes sense for the FHIR module to serialize FHIR representations (which could be one way to speed up the FHIR2 module), but I don’t think FHIR is always the right representation for everything, especially for AI RAG use cases. Basically, if we want to provide FHIR serialization, I’d vote for this to live in the FHIR2 module.

Granular privilege management is something that we as a product have historically done quite poorly, and I do think this is something we need to address to get to an MVP, but I assume for now that queries are largely patient-scoped?

I won’t be able to add much to the technical discussion/questions, but I am interested in how the knowledge base gets combined with the instance data for use. Whether in FHIR (where ostensibly coding is done) or something else, how we bring in locations, concepts, drugs, references, etc. should probably also be part of the discussion, no?

This is great @dkayiwa - very excited about this initiative.

I think you are aware of these issues, but a few initial thoughts:

  1. Handling metadata changes. Let’s say after the system has been running for some time, the implementation team decides to change the name of one of their concepts or locations or encounter types. What happens to the denormalized data at that point? Is this something the design would be able to handle?
  2. Similar to #1, let’s say a bug has been found in a transformation. Or a new column has been added to a table in the core data model. Or some other changes are decided upon after the representations have been built. What would be the process for deciding what needs to be rebuilt, and what are the implications if rebuilding takes a long time while production-level decision support rules and such are relying upon it?
  3. Handling direct database changes. The event module is getting better, and the work that @raff is involved with and which is getting discussed in this thread and this thread seem like they will continue to make events more reliable. But these are still only as comprehensive as the mechanisms that detect changes and fire events - generally via Hibernate interceptors. There are still plenty of situations where the database is updated directly, most notably when liquibase is used to perform database migrations or setup initial metadata. How will these situations be handled such that the query store maintains consistency?
  4. Would you expect modules that do analytics ETL like mamba to migrate to using this mechanism?

Thanks! Mike

Short answer: no - I’d expect mamba and the query store to coexist rather than one replacing the other.

They’re both read-side projections of core, so they share the CQRS philosophy, but they optimize for different workloads. Mamba is analytics ETL - flat/relational shapes suited to SQL aggregations, indicator reports, and BI dashboards. The query store is search/retrieval-shaped - per-record Elasticsearch documents with plain-text serializations and embeddings, optimized for full-text and semantic queries on individual records, not GROUP BY across populations.
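
As a purely illustrative contrast, the two workloads ask differently shaped questions. Both examples below are hypothetical: the flat table and its columns are made up, and the second call reuses the QueryStoreService sketched earlier in the thread.

```java
import java.util.List;
import java.util.Map;

// Illustrative contrast only; the flat table, field names, and service call
// are hypothetical, reusing the QueryStoreService sketch from earlier.
class WorkloadContrast {

    // Mamba-shaped question: SQL aggregation over a flattened population table.
    static final String INDICATOR_SQL =
        "SELECT encounter_location, COUNT(*) "
      + "FROM flat_hiv_visits "
      + "WHERE visit_date >= '2025-01-01' "
      + "GROUP BY encounter_location";

    // Query-store-shaped question: retrieve individual records by meaning.
    static List<QueryStoreDocument> recentDyspnea(QueryStoreService store, String patientUuid) {
        return store.search(
            "obs",                                        // per-type index
            "worsening shortness of breath on exertion",  // hybrid query text
            Map.of("patientUuid", patientUuid),           // structured filter
            10);
    }
}
```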

A deployment running indicator reports alongside AI-assisted chart search would reasonably want both. Over time there’s room to share plumbing - event sync, reconciliation, the contribution SPI - but I wouldn’t frame this as a migration target for analytics ETL.

Will @raff 's provision of a CDC mechanism with Debezium satisfy this requirement? Jira

In my view Debezium events should be the main event source for CQRS to guarantee eventual consistency and low latency with our relational (and possibly clustered) DBs. Debezium is a part of the integration middleware project that I’m working on right now as @dkayiwa mentioned.
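
For illustration, Debezium’s embedded engine makes that event source fairly direct to consume from a module. This is a minimal sketch against the Debezium 2.x embedded API; the connection properties, table list, file paths, and projection callback are all placeholders:

```java
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

class CdcSync {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty("name", "openmrs-querystore");
        props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
        props.setProperty("topic.prefix", "openmrs");
        props.setProperty("database.hostname", "localhost");
        props.setProperty("database.port", "3306");
        props.setProperty("database.user", "openmrs");
        props.setProperty("database.password", "changeme");
        props.setProperty("table.include.list", "openmrs.obs,openmrs.encounter"); // placeholder
        // Durable offsets and schema history so the subscription resumes where it left off.
        props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
        props.setProperty("offset.storage.file.filename", "/var/lib/querystore/offsets.dat");
        props.setProperty("schema.history.internal", "io.debezium.storage.file.history.FileSchemaHistory");
        props.setProperty("schema.history.internal.file.filename", "/var/lib/querystore/schema-history.dat");

        DebeziumEngine<ChangeEvent<String, String>> engine = DebeziumEngine.create(Json.class)
            .using(props)
            .notifying(change -> {
                // Each event carries the row after the write; reproject it into
                // the read-side store here (idempotent upsert keyed by UUID).
                System.out.println(change.value());
            })
            .build();

        ExecutorService executor = Executors.newSingleThreadExecutor();
        executor.execute(engine); // engine.close() + executor.shutdown() on app stop
    }
}
```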

I’d vote for the Elasticsearch store for CQRS. It’s already the search engine of choice in our clustered deployments for full-text searches.

Another alternative for low-resource settings (if needed) could be the embedded Lucene index that we use for the concept and patient search. We use Hibernate Search as an abstraction to make it easy to switch between embedded Lucene and ES for full-text searches. The abstraction could possibly be used to some extent to support CQRS, with targeted raw queries for Lucene and ES.
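
To show what that abstraction buys us, here is a sketch against the Hibernate Search 6 API (OpenMRS core currently ships an older Hibernate Search, and the indexed field name on Obs is an assumption for illustration): the same query runs unchanged on embedded Lucene or Elasticsearch, selected purely by configuration.

```java
// Sketch of the backend-switching idea using the Hibernate Search 6 API.
// The field name "valueText" is assumed for illustration.
import java.util.List;

import org.hibernate.Session;
import org.hibernate.search.mapper.orm.Search;
import org.hibernate.search.mapper.orm.session.SearchSession;
import org.openmrs.Obs;

class FullTextLookup {
    // The backend is chosen by configuration, not code, e.g.:
    //   hibernate.search.backend.type = lucene          (embedded, low-resource)
    //   hibernate.search.backend.type = elasticsearch   (clustered)
    static List<Obs> matchObsText(Session session, String text) {
        SearchSession search = Search.session(session);
        return search.search(Obs.class)
            .where(f -> f.match().field("valueText").matching(text))
            .fetchHits(20);
    }
}
```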

However, I would build the solution on ES right now. For me, CQRS implies high-resource settings these days, but in the AI era we will hopefully see what’s now considered costly infrastructure and complex deployment become a commodity in the not-too-distant future. See how easy it is to run and maintain a K8s cluster these days vs. what it took a few years back.

Building the read-only data source would be a great opportunity to introduce more granular access control to patient data, based on the approach I started to outline in this thread. Really looking forward to applying that in the new service.

The point that @mseaton raised about metadata changes is very important and it’s a complex one to address at scale with different scenarios to consider. It probably deserves a separate topic…

@dkayiwa I needed to drop off yesterday, so I just wanted to expand on my MCP suggestion.

I see it as a way to give an LLM a universal way to query for data instead of us having to assemble a context based on the user’s query. In a first pass, I think including the whole patient history in the context is a good approach. If the history is too big for a specific patient, we might divide it into chunks that fit in the context, ask the LLM for a summary of each chunk, and then use all the summaries to answer the original query. Summaries could be persisted as well for future queries.
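
A minimal sketch of that chunk-then-summarize fallback; the LlmClient interface and the character-based budget are hypothetical placeholders:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the chunk-then-summarize fallback described above.
// LlmClient and the character-based sizing are hypothetical placeholders.
class PatientHistorySummarizer {

    interface LlmClient { String complete(String prompt); }

    private static final int MAX_CHUNK_CHARS = 12_000; // crude stand-in for a token budget

    static String answer(LlmClient llm, String patientHistory, String question) {
        if (patientHistory.length() <= MAX_CHUNK_CHARS) {
            return llm.complete("Context:\n" + patientHistory + "\n\nQuestion: " + question);
        }
        // Map: summarize each chunk independently (summaries could be persisted).
        List<String> summaries = new ArrayList<>();
        for (int i = 0; i < patientHistory.length(); i += MAX_CHUNK_CHARS) {
            String chunk = patientHistory.substring(i,
                Math.min(i + MAX_CHUNK_CHARS, patientHistory.length()));
            summaries.add(llm.complete("Summarize this part of a patient chart:\n" + chunk));
        }
        // Reduce: answer the original question over the combined summaries.
        return llm.complete("Context (chart summaries):\n"
            + String.join("\n---\n", summaries) + "\n\nQuestion: " + question);
    }
}
```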

However, it’s probably much more efficient and performant to give the LLM a way to query for just the data it needs based on the prompt. Often a query already implies that you are only interested in, e.g., the most recent lab result, so there’s no point in giving the LLM the whole history; rather, let it call a method to get what it needs. Here MCP comes in handy, as many LLMs know how to use it.

Once we start to think about providing AI support not only for an individual patient but for a cohort, MCP becomes even more important.