Hi everyone, I’ve been studying the Archiving Voided Data project idea and wanted to share an initial architectural exploration as a starting point for discussion, before getting into implementation specifics. I’d really appreciate feedback from mentors on it.
- Understanding the problem space
Over time, high-volume tables such as obs, encounter, and orders accumulate a very large proportion of voided records.
Even though most queries filter using voided = 0, databases still maintain indexes and scan large datasets, which gradually impacts performance.
So the goal here is not simply removing data from tables, but managing the lifecycle of historical clinical data while preserving OpenMRS behavior and compatibility.
My understanding of this comes partly from working in validation and domain logic areas (Obs Validator, Patient Validator, and parts of the Concept and Order services, among others), where even small data-handling changes can indirectly affect multiple services and modules.
- Archival boundaries should follow clinical structure
OpenMRS data is strongly relational and clinically contextual:
Patient → Encounter → Observations → Orders → Related records
From exploring components like ConceptService, ConceptServiceImpl, OrderServiceImpl, and reference range handling (ConceptReferenceRangeContext), it becomes clear that records are rarely meaningful in isolation.
Because of this, archiving should operate on a meaningful clinical boundary rather than isolated rows — typically an encounter together with its dependent clinical data.
This helps preserve relationships and avoids unexpected behavior when historical data is accessed later.
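To make this concrete, here is a minimal Java sketch of what such a clinical unit might look like. `ArchiveUnit` is a name I’m introducing purely for illustration; the `Encounter` accessors (`getAllObs`, `getOrders`) are existing core API, to the best of my knowledge:

```java
import java.util.Set;

import org.openmrs.Encounter;
import org.openmrs.Obs;
import org.openmrs.Order;

/**
 * Hypothetical value object grouping an encounter with its dependent
 * clinical data so the whole unit is archived (or restored) atomically.
 */
public class ArchiveUnit {

    private final Encounter encounter;
    private final Set<Obs> observations;
    private final Set<Order> orders;

    public ArchiveUnit(Encounter encounter) {
        this.encounter = encounter;
        // getAllObs(true) includes voided obs, which are exactly the
        // records this project targets.
        this.observations = encounter.getAllObs(true);
        this.orders = encounter.getOrders();
    }

    public Encounter getEncounter() {
        return encounter;
    }

    public Set<Obs> getObservations() {
        return observations;
    }

    public Set<Order> getOrders() {
        return orders;
    }
}
```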
- Proposed lifecycle model
Instead of thinking only in terms of copying rows to archive tables, I’m considering a staged lifecycle:
Stage 1 — Active data
Normal operation (current behavior)
Stage 2 — Cold archived data
Old voided records are moved out of primary query paths but remain retrievable through backend services when explicitly requested.
Stage 3 — Long-term retention
After a configurable retention period, data may optionally be permanently purged, depending on implementation policy.
This keeps frequently queried tables small while maintaining recoverability.
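As a rough illustration of the staged model, a classification like the one below could decide which stage a voided record falls into. The stage names and both retention windows are assumptions for this sketch, not existing OpenMRS settings:

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Minimal sketch of the staged lifecycle. Both retention windows
 * would come from configuration in a real implementation.
 */
public enum LifecycleStage {

    ACTIVE, COLD_ARCHIVED, PURGE_ELIGIBLE;

    /** e.g. voided records older than 1 year move to cold storage */
    private static final Duration ARCHIVE_AFTER = Duration.ofDays(365);

    /** e.g. archived records older than 7 years become purge-eligible */
    private static final Duration PURGE_AFTER = Duration.ofDays(7 * 365);

    public static LifecycleStage classify(Instant dateVoided, Instant now) {
        if (dateVoided == null) {
            return ACTIVE; // not voided at all
        }
        Duration age = Duration.between(dateVoided, now);
        if (age.compareTo(PURGE_AFTER) >= 0) {
            return PURGE_ELIGIBLE;
        }
        if (age.compareTo(ARCHIVE_AFTER) >= 0) {
            return COLD_ARCHIVED;
        }
        return ACTIVE;
    }
}
```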
- Archiving process
A scheduled archival task would periodically:
- Select eligible clinical units older than the configured retention period
- Process them in ordered batches
- Move dependent data together
- Record archive metadata
- Commit safely with checkpoints
The job should be resumable so interruptions do not risk partial archival.
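Here is a rough sketch of how the resumable loop could fit together, assuming a hypothetical `ArchiveDao` boundary (none of these method names exist in core; they just show the shape of the job):

```java
import java.time.Instant;
import java.util.List;

/**
 * Sketch of a resumable archival task. ArchiveDao is a hypothetical
 * persistence boundary; real code would sit behind OpenMRS service
 * and transaction layers.
 */
public class ArchivalTask {

    public interface ArchiveDao {
        /** Encounter ids eligible for archiving, after the checkpointed id. */
        List<Integer> findEligibleEncounterIds(Instant cutoff, int afterId, int batchSize);
        /** Moves the encounter and its dependent rows in one transaction. */
        void archiveEncounterWithDependents(int encounterId);
        /** Records what was archived and when. */
        void writeArchiveMetadata(int encounterId, Instant archivedAt);
        /** Durable checkpoint so an interrupted run can resume. */
        void saveCheckpoint(int lastProcessedId);
        int loadCheckpoint();
    }

    private final ArchiveDao dao;
    private final int batchSize;

    public ArchivalTask(ArchiveDao dao, int batchSize) {
        this.dao = dao;
        this.batchSize = batchSize;
    }

    public void run(Instant retentionCutoff) {
        int lastId = dao.loadCheckpoint(); // resume where a prior run stopped
        while (true) {
            List<Integer> batch = dao.findEligibleEncounterIds(retentionCutoff, lastId, batchSize);
            if (batch.isEmpty()) {
                break; // nothing left to archive
            }
            for (int encounterId : batch) {
                // Each unit commits atomically: encounter + obs + orders together
                dao.archiveEncounterWithDependents(encounterId);
                dao.writeArchiveMetadata(encounterId, Instant.now());
                lastId = encounterId;
            }
            dao.saveCheckpoint(lastId); // safe to interrupt after this point
        }
    }
}
```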
- Retrieval & restoration
Archived data should remain historically traceable.
Rather than silently modifying past history, restoration would reintroduce records through a controlled process that preserves provenance (who restored it and from which archived state). This maintains audit integrity and avoids ambiguity in clinical records.
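As a sketch, the provenance captured on restore could look like the following; all field names here are assumptions for illustration:

```java
import java.time.Instant;

/**
 * Hypothetical provenance row written whenever archived data is restored,
 * so the audit trail shows who restored it and from which archived state.
 */
public class RestorationRecord {

    private final int archiveId;       // which archived state was restored
    private final String restoredBy;   // user who performed the restore
    private final Instant restoredAt;  // when the restore happened
    private final String reason;       // free-text justification for auditors

    public RestorationRecord(int archiveId, String restoredBy, Instant restoredAt, String reason) {
        this.archiveId = archiveId;
        this.restoredBy = restoredBy;
        this.restoredAt = restoredAt;
        this.reason = reason;
    }
}
```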
- Compatibility goals
The solution should aim for:
- No changes to existing REST responses
- No module breakage
- No required frontend modifications
- Explicit access only when archived data is requested
This keeps the feature transparent for implementations that don’t need it while enabling performance improvements for large datasets.
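One way the “explicit access only” goal could surface in code is a separate, opt-in service, sketched below. The interface and method names are hypothetical; existing service methods and REST responses would stay untouched:

```java
import java.util.Optional;

import org.openmrs.Encounter;

/**
 * Hypothetical opt-in surface for archived data. Callers only see
 * archived records when they explicitly go through this path.
 */
public interface ArchivedDataService {

    /** Looks up an encounter in cold storage; empty if never archived. */
    Optional<Encounter> getArchivedEncounter(Integer encounterId);

    /** Reintroduces an archived encounter via the controlled restore flow. */
    Encounter restoreEncounter(Integer encounterId, String reason);
}
```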
I’d appreciate guidance and feedback on this approach to handling long-term historical data, especially around retrieval patterns and retention policies, along with any other ideas or areas this project should cover.
cc @dkayiwa