Archiving Voided Data — Exploring a Lifecycle-Based Archival Approach

Hi everyone, I’ve been studying the Archiving Voided Data project idea and wanted to share an initial architectural exploration as part of the project discussion, before getting into implementation specifics. I’d also welcome feedback on it from mentors.

  1. Understanding the problem space

Over time, high-volume tables such as obs, encounter, and orders accumulate a very large proportion of voided records. Even though most queries filter using voided = 0, databases still maintain indexes and scan large datasets, which gradually impacts performance.

So the goal here is not simply removing data from tables, but managing the lifecycle of historical clinical data while preserving OpenMRS behavior and compatibility.

My understanding of this comes partly from working in validation and domain logic areas (Obs Validator, Patient Validator, and parts of the Concept and Order services, among others), where even small data-handling changes can affect multiple services and modules indirectly.

  2. Archival boundaries should follow clinical structure

OpenMRS data is strongly relational and clinically contextual:

Patient → Encounter → Observations → Orders → Related records

From exploring components like ConceptService, ConceptServiceImpl, OrderServiceImpl, and reference range handling (ConceptReferenceRangeContext), it becomes clear that records are rarely meaningful in isolation. Because of this, archiving should operate on a meaningful clinical boundary rather than isolated rows — typically an encounter together with its dependent clinical data.

This helps preserve relationships and avoids unexpected behavior when historical data is accessed later.

  3. Proposed lifecycle model

Instead of thinking only in terms of copying rows to archive tables, I’m considering a staged lifecycle:

Stage 1 — Active data

Normal operation (current behavior)

Stage 2 — Cold archived data

Old voided records are moved out of primary query paths but remain retrievable through backend services when explicitly requested.

Stage 3 — Long-term retention

After a configurable retention period, data may optionally be purged permanently, depending on implementation policy.

This keeps frequently queried tables small while maintaining recoverability.
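As a rough illustration of the staged model, the stage of a voided record could be derived from its age and two configurable retention thresholds. All names and thresholds here are hypothetical sketches, not existing OpenMRS API:

```java
// Hypothetical sketch of the staged lifecycle; this class does not exist in
// OpenMRS core and the thresholds are purely illustrative.
public class ArchivalLifecycle {

    public enum Stage { ACTIVE, COLD_ARCHIVED, PURGE_ELIGIBLE }

    /**
     * Classifies a record by how long ago it was voided, against two
     * configurable retention thresholds (Stages 2 and 3 above).
     */
    public static Stage stageFor(long daysSinceVoided, long archiveAfterDays, long purgeAfterDays) {
        if (daysSinceVoided >= purgeAfterDays) {
            return Stage.PURGE_ELIGIBLE;   // Stage 3: may be permanently purged
        }
        if (daysSinceVoided >= archiveAfterDays) {
            return Stage.COLD_ARCHIVED;    // Stage 2: out of primary query paths
        }
        return Stage.ACTIVE;               // Stage 1: normal operation
    }
}
```

Making both thresholds configurable (e.g., via global properties) would let each implementation tune retention to its own policy.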

  4. Archiving process

A scheduled archival task would periodically:

  1. Select eligible clinical units older than the configured retention period

  2. Process them in ordered batches

  3. Move dependent data together

  4. Record archive metadata

  5. Commit safely with checkpoints

The job should be resumable so interruptions do not risk partial archival.
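The batching and checkpointing described above could be sketched roughly as follows. The ArchiveStore interface and every name in it are assumptions for illustration, not existing OpenMRS services; the in-memory store exists only to demonstrate the control flow:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a resumable archival pass.
public class ArchivalTask {

    /** Minimal persistence abstraction assumed for this sketch. */
    public interface ArchiveStore {
        List<Long> findEligibleUnitIds(long afterCheckpoint, int batchSize); // step 1, ordered by id
        void moveUnitWithDependents(long unitId);   // steps 2-3: unit + dependent data together
        void recordArchiveMetadata(long unitId);    // step 4
        void saveCheckpoint(long lastProcessedId);  // step 5: commit point for resumability
    }

    /** Processes eligible units in ordered batches, checkpointing after each batch. */
    public static long run(ArchiveStore store, long checkpoint, int batchSize) {
        List<Long> batch;
        while (!(batch = store.findEligibleUnitIds(checkpoint, batchSize)).isEmpty()) {
            for (long id : batch) {
                store.moveUnitWithDependents(id);
                store.recordArchiveMetadata(id);
            }
            checkpoint = batch.get(batch.size() - 1);
            store.saveCheckpoint(checkpoint); // an interruption after this point loses no work
        }
        return checkpoint;
    }

    /** In-memory store used only to demonstrate the control flow (ids must be ascending). */
    public static class InMemoryStore implements ArchiveStore {
        private final List<Long> ids;
        public final List<Long> archived = new ArrayList<>();
        public long checkpoint;

        public InMemoryStore(List<Long> ids) { this.ids = ids; }

        @Override
        public List<Long> findEligibleUnitIds(long after, int batchSize) {
            List<Long> out = new ArrayList<>();
            for (long id : ids) {
                if (id > after && out.size() < batchSize) out.add(id);
            }
            return out;
        }
        @Override public void moveUnitWithDependents(long id) { archived.add(id); }
        @Override public void recordArchiveMetadata(long id) { }
        @Override public void saveCheckpoint(long last) { checkpoint = last; }
    }
}
```

Because the checkpoint is the last committed unit id, restarting the task with the saved checkpoint naturally resumes where the previous run stopped.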

  5. Retrieval & restoration

Archived data should remain historically traceable.

Rather than silently modifying past history, restoration would reintroduce records through a controlled process that preserves provenance (who restored it and from which archived state). This maintains audit integrity and avoids ambiguity in clinical records.

  6. Compatibility goals

The solution should aim for:

  • No changes to existing REST responses

  • No module breakage

  • No required frontend modifications

  • Explicit access only when archived data is requested

This keeps the feature transparent for implementations that don’t need it while enabling performance improvements for large datasets.

I’d appreciate guidance and feedback on this approach to handling long-term historical data, especially regarding retrieval patterns and retention policies, along with suggestions for additional ideas or areas this project should cover.

cc @dkayiwa

The accumulation of Encounters, etc. is just not as fast as for Observations and this is because Observations are essentially immutable. That is, voided Observations accumulate as a natural part of the system being run. Voided encounters (and other things like conditions, etc.) do not.

Thanks for this suggestion; it makes the picture much clearer.

The main pressure is coming from the natural lifecycle of Observations, since corrections create new Obs rather than updating existing ones, while other entities don’t grow the same way.

So it sounds like the archival strategy should focus primarily on high-volume immutable clinical history rather than all voided records uniformly, and provide a mechanism to archive voided data with the possibility of restoring it.

To check my understanding of the project: should archiving mainly target historical Obs while leaving related encounter context in the main tables, or should the archive always preserve full encounter-level consistency?

I mean, presumably encounters themselves and even visits that have been voided for a while (which would vary depending on the implementation), could be archived.

So Obs would likely be the primary archival pressure due to their lifecycle, while fully voided higher-level clinical records (like encounters or visits) become good secondary candidates since they no longer participate in active workflows.

I’m not sure, but it sounds like the archiving policy may need to be configurable per implementation based on real-world usage patterns, rather than fixed per entity type.

Would it be reasonable to think of this as a tiered strategy — high-volume historical Obs handled for performance, and completely voided clinical containers archived opportunistically when safe?

Hi everyone,

I’ve been following the discussion around the accumulation of voided Observations and the idea of a tiered archival strategy. The clarification that Obs accumulate naturally due to their immutable correction lifecycle was particularly helpful in understanding why they become the primary archival pressure point.

While reviewing some of the related core classes (Obs, ObsService, and parts of the validation logic), I started thinking about what would practically define “safe” archival at the Obs level.

Obs records appear to participate in several relational structures, for example:

  • Version chains through previousVersion

  • Group hierarchies via obsGroup and groupMembers

  • References to other clinical entities such as Encounters or Orders

Because of these relationships, it seems that archival eligibility might need to ensure that a voided Obs is not still referenced within an active object graph.

For example:

  • A voided Obs that is still referenced as previousVersion by an active Obs might not be safe to archive.

  • Partial archival of grouped Obs structures could introduce inconsistencies in traversal logic.

  • Order references might also introduce additional constraints.
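These checks could be expressed as a simple eligibility predicate. Everything below is hypothetical (the class, the method, and the precomputed sets are assumptions for illustration), though the concepts it tests mirror the openmrs-core Obs model (voided, previousVersion, obsGroup):

```java
import java.util.Set;

// Hypothetical eligibility rules for archiving a voided Obs; not existing API.
public class ObsArchivalEligibility {

    /**
     * @param voided                     whether the candidate Obs is voided
     * @param obsId                      id of the candidate Obs
     * @param groupId                    id of the candidate's obsGroup, or null if ungrouped
     * @param activePreviousVersionRefs  ids referenced as previousVersion by some active Obs
     * @param groupsWithActiveMembers    group ids that still contain active member Obs
     */
    public static boolean isEligible(boolean voided, long obsId, Long groupId,
                                     Set<Long> activePreviousVersionRefs,
                                     Set<Long> groupsWithActiveMembers) {
        if (!voided) {
            return false; // only voided Obs are candidates at all
        }
        if (activePreviousVersionRefs.contains(obsId)) {
            return false; // rule 1: still part of an active correction chain
        }
        if (groupId != null && groupsWithActiveMembers.contains(groupId)) {
            return false; // rule 2: avoid partial archival of a group with active members
        }
        return true;
    }
}
```

The two reference sets would presumably be derived by queries over the obs table, so the predicate stays cheap per candidate; order-related constraints would add further rules of the same shape.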

So I’m wondering whether defining clear isolation or eligibility rules for Obs-level archival should be considered a prerequisite before formalizing broader lifecycle strategies or retention policies.

I’d appreciate any thoughts on whether focusing on these relational constraints first would align with the intended direction for this project. Also, if there are any specific design documents or resources related to this area that I should review, I would be grateful for the guidance. cc @ibacher, @dkayiwa

Thanks.

That was a helpful breakdown of the relational structures around Obs. The examples you mentioned — version chains through previousVersion, group hierarchies, and links through encounters or orders — point to a central part of the problem.

It does seem that archival eligibility at the Obs level would need to account for these structural relationships to avoid breaking the clinical graph. For example, if an active Obs still references a voided Obs through previousVersion, archiving the older record prematurely could make reconstruction of the correction chain difficult. Similarly, partial archival of grouped observations might introduce inconsistencies if traversal logic expects the full group structure to remain intact.

Building on that point, I’ve also been wondering whether archival eligibility might need to consider how Obs participate in the broader lifecycle of clinical data. Some possible aspects that come to mind:

Version lineage integrity – ensuring archival does not break correction history when an Obs has successors in a version chain

Group atomicity – determining whether grouped observations should be treated as a unit during archival

Encounter-level context – since Obs are often retrieved through encounters, independent archival might affect historical encounter reconstruction

Order or workflow references – some Obs are generated through orders or workflows, which might introduce additional constraints

Alongside these structural considerations, another angle that seems relevant is how archived records interact with the query and service layers. Since the motivation for this project is partly to reduce pressure on high-volume tables like obs, archived records would likely need to be excluded from routine DAO queries while still remaining accessible for historical reconstruction or auditing.

So I’m wondering whether the archival design might ultimately involve two complementary concerns:

Structural eligibility rules – determining when an Obs is safely isolated from active clinical graphs

Query-path separation – ensuring archived records no longer participate in normal clinical queries while remaining retrievable when needed
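To make the second concern concrete, here is a minimal sketch of keeping routine and historical query paths separate. The Row type and both methods are illustrative assumptions, not existing OpenMRS API; in practice this separation would live in the DAO layer:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of query-path separation: routine reads never see
// archived rows, while an explicit historical path can still reach them.
public class ObsQueryPaths {

    public static class Row {
        final long id;
        final boolean archived;
        public Row(long id, boolean archived) { this.id = id; this.archived = archived; }
    }

    /** Normal clinical query path: archived rows are excluded unconditionally. */
    public static List<Long> findActiveIds(List<Row> table) {
        List<Long> out = new ArrayList<>();
        for (Row r : table) {
            if (!r.archived) out.add(r.id);
        }
        return out;
    }

    /** Explicit historical path, used only for reconstruction or auditing. */
    public static List<Long> findArchivedIds(List<Row> table) {
        List<Long> out = new ArrayList<>();
        for (Row r : table) {
            if (r.archived) out.add(r.id);
        }
        return out;
    }
}
```

Whether "archived" is a flag on the main table or a separate archive table changes the mechanics, but the contract stays the same: the default path never returns archived data.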

I’d be interested to hear how others think about balancing these two aspects — defining safe structural boundaries for archival while also ensuring archived records are separated from normal query paths. Any thoughts? cc @ibacher, @dkayiwa