Archiving Voided Data — Exploring a Lifecycle-Based Archival Approach

Hi everyone, I’ve been studying the Archiving Voided Data project idea and wanted to share an initial architectural exploration as part of the project discussion, before getting into implementation specifics. I’d also welcome feedback from the mentors.

  1. Understanding the problem space

Over time, high-volume tables such as obs, encounter, and orders accumulate a very large proportion of voided records. Even though most queries filter with voided = 0, the database still maintains indexes over those rows and scans them, which gradually degrades performance.

So the goal here is not simply to remove data from tables, but to manage the lifecycle of historical clinical data while preserving OpenMRS behavior and compatibility.

My understanding of this comes partly from working in validation and domain logic areas (ObsValidator, PatientValidator, and parts of the Concept and Order services, among others), where even small data-handling changes can indirectly affect multiple services and modules.

  2. Archival boundaries should follow clinical structure

OpenMRS data is strongly relational and clinically contextual:

Patient → Encounter → Observations → Orders → Related records

From exploring components like ConceptService, ConceptServiceImpl, OrderServiceImpl, and reference range handling (ConceptReferenceRangeContext), it becomes clear that records are rarely meaningful in isolation. Because of this, archiving should operate on a meaningful clinical boundary rather than isolated rows — typically an encounter together with its dependent clinical data.

This helps preserve relationships and avoids unexpected behavior when historical data is accessed later.
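
To make that boundary concrete, here is a minimal sketch of what an encounter-level archival unit could look like. ArchivalUnit and its fields are hypothetical names for this discussion, not existing OpenMRS classes:

```java
// Minimal sketch of an encounter-level archival unit (hypothetical names).
// The idea: an encounter is archived together with its dependent clinical
// records, never as isolated rows.
import java.util.List;

public class ArchivalUnit {

    private final Integer encounterId;    // root of the clinical boundary
    private final List<Integer> obsIds;   // observations in the encounter
    private final List<Integer> orderIds; // orders placed in the encounter

    public ArchivalUnit(Integer encounterId, List<Integer> obsIds, List<Integer> orderIds) {
        this.encounterId = encounterId;
        this.obsIds = obsIds;
        this.orderIds = orderIds;
    }

    /** An archival unit should move atomically: either all rows or none. */
    public boolean isEmpty() {
        return obsIds.isEmpty() && orderIds.isEmpty();
    }
}
```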

  3. Proposed lifecycle model

Instead of thinking only in terms of copying rows to archive tables, I’m considering a staged lifecycle:

Stage 1 — Active data

Normal operation (current behavior)

Stage 2 — Cold archived data

Old voided records are moved out of primary query paths but remain retrievable through backend services when explicitly requested.

Stage 3 — Long-term retention

After a configurable retention period, data may optionally be permanently purged, depending on implementation policy.

This keeps frequently queried tables small while maintaining recoverability.
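
As a rough illustration of the staged model, the sketch below encodes the three stages and a retention check. ArchiveStage and the retention constants are assumptions for discussion; real retention periods would come from configuration rather than hard-coded values:

```java
// Hedged sketch of the staged lifecycle as an enum plus an age-based check.
// Names and durations are illustrative, not existing OpenMRS API.
import java.time.Duration;
import java.time.Instant;

public enum ArchiveStage {
    ACTIVE,        // Stage 1: normal operation
    COLD_ARCHIVED, // Stage 2: out of primary query paths, still retrievable
    PURGED;        // Stage 3: permanently removed per retention policy

    // Example policy: archive voided data older than 2 years,
    // purge after 7 years (both configurable in practice).
    static final Duration ARCHIVE_AFTER = Duration.ofDays(2 * 365);
    static final Duration PURGE_AFTER = Duration.ofDays(7 * 365);

    static ArchiveStage stageFor(Instant voidedDate, Instant now) {
        Duration age = Duration.between(voidedDate, now);
        if (age.compareTo(PURGE_AFTER) > 0) return PURGED;
        if (age.compareTo(ARCHIVE_AFTER) > 0) return COLD_ARCHIVED;
        return ACTIVE;
    }
}
```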

  4. Archiving process

A scheduled archival task would periodically:

  1. Select eligible clinical units older than the configured retention period

  2. Process them in ordered batches

  3. Move dependent data together

  4. Record archive metadata

  5. Commit safely with checkpoints

The job should be resumable so interruptions do not risk partial archival.
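
A hedged sketch of how such a resumable, batched pass might be structured, reusing the hypothetical ArchivalUnit from earlier. ArchiveDao and all method names are assumptions; a real implementation would run under the OpenMRS scheduler with one transaction per batch:

```java
// Rough sketch of a resumable, batched archival pass (hypothetical names).
import java.util.List;

public class ArchivalTask {

    private final ArchiveDao dao;      // hypothetical data-access layer
    private final int batchSize = 100; // units archived per transaction

    public ArchivalTask(ArchiveDao dao) {
        this.dao = dao;
    }

    public void run() {
        // Resume from the last recorded checkpoint so an interrupted run
        // never leaves a unit half-archived.
        long checkpoint = dao.lastCheckpoint();
        List<ArchivalUnit> batch;
        while (!(batch = dao.nextEligibleUnits(checkpoint, batchSize)).isEmpty()) {
            for (ArchivalUnit unit : batch) {
                dao.moveToArchive(unit);  // dependents move together
                dao.recordMetadata(unit); // who/when/what was archived
            }
            checkpoint = dao.commitCheckpoint(batch); // durable progress marker
        }
    }

    /** Hypothetical DAO contract for the sketch above. */
    public interface ArchiveDao {
        long lastCheckpoint();
        List<ArchivalUnit> nextEligibleUnits(long afterCheckpoint, int limit);
        void moveToArchive(ArchivalUnit unit);
        void recordMetadata(ArchivalUnit unit);
        long commitCheckpoint(List<ArchivalUnit> batch);
    }
}
```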

  1. Retrieval & restoration

Archived data should remain historically traceable.

Rather than silently modifying past history, restoration would reintroduce records through a controlled process that preserves provenance (who restored it and from which archived state). This maintains audit integrity and avoids ambiguity in clinical records.
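
For illustration, a small sketch of a restore operation that records provenance instead of mutating history. RestoreService and RestoreRecord are hypothetical names, not existing OpenMRS classes:

```java
// Illustrative sketch: restoration preserves provenance rather than
// silently rewriting history. All names are hypothetical.
import java.time.Instant;

public class RestoreService {

    /** Provenance record linking a restore to the archived state it came from. */
    public static class RestoreRecord {
        final int archivedUnitId; // which archived state was restored
        final String restoredBy;  // user who performed the restore
        final Instant restoredAt; // when the restore happened

        RestoreRecord(int archivedUnitId, String restoredBy, Instant restoredAt) {
            this.archivedUnitId = archivedUnitId;
            this.restoredBy = restoredBy;
            this.restoredAt = restoredAt;
        }
    }

    public RestoreRecord restore(int archivedUnitId, String user) {
        // 1. Read the archived unit verbatim; the archive copy is never mutated.
        // 2. Reinsert its rows into the primary tables in a single transaction.
        // 3. Persist a provenance record tying the restore to the archived state.
        return new RestoreRecord(archivedUnitId, user, Instant.now());
    }
}
```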

  6. Compatibility goals

The solution should aim for:

• No changes to existing REST responses
• No module breakage
• No required frontend modifications
• Explicit access only when archived data is requested

This keeps the feature transparent for implementations that don’t need it while enabling performance improvements for large datasets.
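
One way the "explicit access only" goal could look at the service level, sketched with hypothetical names (the includeArchived parameter is an assumption, not part of any current OpenMRS API):

```java
// Sketch of default-invisible archived data: existing callers keep their
// behavior, and archived records are returned only on explicit request.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class ObsLookup {

    public List<String> getObsForPatient(int patientId, boolean includeArchived) {
        List<String> results = new ArrayList<>(queryPrimaryTables(patientId));
        if (includeArchived) {
            results.addAll(queryArchiveTables(patientId)); // explicit opt-in path
        }
        return results;
    }

    private List<String> queryPrimaryTables(int patientId) {
        return Collections.emptyList(); // placeholder for the normal query path
    }

    private List<String> queryArchiveTables(int patientId) {
        return Collections.emptyList(); // placeholder for the archive query path
    }
}
```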

I’d appreciate guidance and feedback on this approach to handling long-term historical data, especially regarding retrieval patterns and retention policies, as well as suggestions for other ideas or areas this project should cover.

cc @dkayiwa

The accumulation of Encounters, etc. is just not as fast as for Observations, and this is because Observations are essentially immutable. That is, voided Observations accumulate as a natural part of the system being run. Voided encounters (and other things like conditions, etc.) do not.

Thanks for this clarification; that makes the picture much clearer.

The main pressure is coming from the natural lifecycle of Observations, since corrections create new Obs rather than updating existing ones, while other entities don’t grow the same way.

So it sounds like the archival strategy should focus primarily on high-volume, immutable clinical history rather than treating all voided records uniformly, while adding a mechanism to archive voided data with the possibility of restoring it.

To check my understanding of the project: should archiving mainly target historical Obs while leaving the related encounter context in the main tables, or should the archive always preserve full encounter-level consistency?