There has been ongoing discussion regarding the best way to implement archiving logic for voided data within the OpenMRS ecosystem. While implementing this directly in OpenMRS Core is the preferred path for long-term maintainability and performance, it raises a significant architectural question regarding implementation adoption.
I would like to open a discussion on how we should handle the version dependencies this creates:
The “Upgrade Barrier”: If archiving logic is built into a new Core release (e.g., 2.7.0+), implementations running on older, stable versions (such as 2.3.x or 2.5.x) would be required to perform a full Core upgrade to utilize the feature. For many large-scale implementations, this is a major undertaking that may delay adoption of archiving.
Strategic Support: Should our goal be to keep this feature strictly “forward-only” in the latest Core, or is there a community preference for a backporting strategy to older maintenance branches?
Core Logic vs. Module Accessibility: Is there interest in an architectural approach where the core archiving engine is developed as a library that can be bundled into a module for legacy versions, while still being natively integrated into the latest Core?
I’d love to hear thoughts from the community on the best path forward for this specific feature.
Given the scope constraints, especially for something like GSoC, I think it makes more sense to start with a Core-only implementation targeting the latest version of OpenMRS Core.
Starting with Core gives us a chance to implement this cleanly and take advantage of deeper integrations where needed, without having to prematurely design abstractions that we’re not even sure about yet.
That said, I don’t think we should completely rule out backward compatibility. We could structure the implementation in a way that keeps the archiving logic reasonably modular and behind clear service interfaces, so that if there’s real demand from implementations on older versions of OpenMRS Platform, we can extract that logic later into a reusable library and build a module around it.
Of course, that module might not achieve the same level of integration as Core, but it would still provide a practical path for adoption without forcing upgrades immediately.
This way we avoid over-engineering upfront, but still leave room to support older versions if and when it actually becomes necessary.
In my view, the Core-only approach is the right one for the initial implementation. The “upgrade barrier” concern is real, but it can be significantly reduced by making the feature completely opt-in – deployments that upgrade Core but are not ready for archival remain entirely unaffected, with zero behavioral change. The feature only activates when explicitly configured, so the upgrade itself carries no risk.
Placing the logic in Core is also important for correctness – archiving voided obs requires direct, trusted access to the service and DAO layer to safely handle version chains and transaction boundaries. A module-based approach could work but would add complexity to guarantee the same level of data safety.
Starting Core-only and keeping the service interfaces clean is, in my view, the right way forward. If backporting becomes a real demand from the community later, the clean service boundary makes extraction feasible – but it should be driven by actual need, not upfront assumption.
I agree that starting with a Core-only implementation makes sense, especially to avoid over-engineering and to ensure tighter integration with the service and DAO layers.
One aspect I’ve been thinking about is how to handle referential integrity during archiving, particularly for related entities like Encounter → Obs. If archiving is done in batches, there’s a risk of partial movement where parent and child records get separated.
Would it make sense to design the archiving process to be dependency-aware (for example, archiving related entities together), rather than handling each table independently?
Also curious how failures within a batch should be handled — should we aim for fully transactional batches, or allow partial success with retry mechanisms?
Thanks @varshithreddy for raising this concern. To start with, and to keep things simple, we are only archiving the obs table, because it is huge and can contain a very large number of voided obs.
Obs Groups (obs_group_id): This is the “Parent-Child” link. A single “Parent” Obs (like a Lab Set) groups multiple “Child” Obs (like Hemoglobin, WBC).
Version Chains (previous_version): OpenMRS tracks edits by voiding the old version and pointing the new one to it. This creates a historical “breadcrumb trail” of data.
Reference Ranges (reference_range): This stores the clinical context (like “normal” thresholds) at the time the observation was made. We migrate this string directly to the archive to ensure the data remains medically meaningful even years later.
Encounter/Patient Links (encounter_id, person_id): These are the external anchors. The Obs “belongs” to a Patient and was recorded during a specific Encounter.
Order Links (order_id): Links the observation to the specific medical order that triggered it.
We can’t just pick rows at random. If we archive a “Parent” but leave the “Children” in the active table, the children will point to a non-existent ID, and the database will lose its mind (Foreign Key violations). Therefore, we have to use a “Bottom-Up” (Leaf-to-Root) approach:
Isolate the Leaves: We first identify voided Obs that have no children and are not the previous_version of an active Obs.
Move the Leaves: We migrate these “leaf” records (including their reference ranges and values) to the archive first.
Climb the Tree: Once the children are gone, their parents become “leaves.” We then archive the parents.
Preserve the Anchors: We never archive the Encounter or Patient records. The archived Obs will still hold the original encounter_id and person_id, so the data remains clinically valid even in the archive.
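The pass computation above can be sketched in plain Java. Everything here is illustrative, not OpenMRS API – the `Obs` record is an in-memory stand-in for a row of the obs table, and a real implementation would express the same conditions as SQL predicates on `obs_group_id` and `previous_version`:

```java
import java.util.*;

// Illustrative sketch of the bottom-up ("leaf-to-root") pass computation.
// Names are hypothetical; the real logic would run as SQL against the obs table.
class ObsArchivePlanner {

    // Minimal stand-in for a row in the obs table.
    record Obs(int id, boolean voided, Integer obsGroupId, Integer previousVersion) {}

    // Returns the ids of voided obs grouped into passes: each pass contains only
    // rows that nothing still in the active table points to, so it can be moved
    // safely. Rows that stay referenced (e.g. the previous_version of an active
    // obs) never appear in any pass and remain in the active table.
    static List<List<Integer>> planPasses(List<Obs> all) {
        Set<Integer> remaining = new HashSet<>();
        for (Obs o : all) {
            if (o.voided()) remaining.add(o.id());
        }
        List<List<Integer>> passes = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<Integer> pass = new ArrayList<>();
            for (Obs o : all) {
                if (!remaining.contains(o.id())) continue;   // active, or already planned
                boolean referenced = false;
                for (Obs other : all) {
                    // "present" = active, or voided but not yet scheduled for a pass
                    boolean present = !other.voided() || remaining.contains(other.id());
                    if (present && other.id() != o.id()
                            && (Objects.equals(other.obsGroupId(), o.id())
                                || Objects.equals(other.previousVersion(), o.id()))) {
                        referenced = true;  // still a group parent / prior version of a live row
                        break;
                    }
                }
                if (!referenced) pass.add(o.id());           // a "leaf": safe to move
            }
            if (pass.isEmpty()) break;  // whatever is left is blocked by active rows
            remaining.removeAll(pass);
            passes.add(pass);
        }
        return passes;
    }
}
```

Running this on a voided parent with two voided children yields two passes – the children first, then the parent – while a voided obs that is still the `previous_version` of an active obs is left untouched.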
Restoration is the exact opposite. If a site needs to bring data back:
Restore the Parent first: This creates the “landing spot” in the active table.
Restore the Children: Now, when the children come back, their obs_group_id finds its parent already waiting for it.
Integrity Check: Before moving anything, we’ll run a quick check to make sure the Patient, Encounter, or Concept hasn’t been purged from the active system. If the “anchor” is gone, we block the restoration to prevent corrupted records.
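That pre-restoration check boils down to a simple guard. This is a hypothetical sketch – the `Set`s stand in for existence queries against the active patient, encounter, and concept tables:

```java
import java.util.*;

// Hypothetical sketch of the pre-restoration integrity check: before moving an
// archived row back, confirm its "anchor" records still exist. The Sets stand
// in for existence queries against the active patient/encounter/concept tables.
class RestoreGuard {

    // Minimal stand-in for a row in the archive.
    record ArchivedObs(int id, int personId, int encounterId, int conceptId) {}

    // Restoration is blocked unless every anchor is still present.
    static boolean canRestore(ArchivedObs obs,
                              Set<Integer> persons,
                              Set<Integer> encounters,
                              Set<Integer> concepts) {
        return persons.contains(obs.personId())
            && encounters.contains(obs.encounterId())
            && concepts.contains(obs.conceptId());
    }
}
```

If any anchor has been purged, the guard returns false and the row stays in the archive rather than coming back as a corrupted record.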
Regarding the batch failure strategy, that’s a really important distinction to make.
For the initial move, I’m leaning toward fully transactional batches (e.g., 100 rows at a time). Since we’ll be using bulk INSERT INTO ... SELECT and DELETE operations for speed, the database naturally treats that entire command as one unit. If one row fails—say due to a deadlock or a rare foreign key violation—the whole batch rolls back. This is actually the safest default because it prevents “half-moved” data where a record is created in the archive but fails to delete from the active table.
However, to avoid letting one “toxic” row block the entire process, I’m planning a Retry & Isolate mechanism:
The Fast Path: We attempt the move in large batches (e.g., 100+ rows) for maximum efficiency.
The Fallback: If a batch fails, the system doesn’t just stop. It automatically retries that specific set of rows in smaller “micro-batches” (even down to 1 row at a time).
Isolation: This allows the 99 “healthy” rows to be archived successfully in the second pass, while the 1 problematic row is isolated, left in the active table, and logged for the admin to review.
This way, we get the efficiency of bulk processing without the fragility of a “one-fail-stops-all” system. It keeps the archival moving through millions of records while gracefully handling the edge cases.
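The fallback is essentially a recursive bisection. Here is a minimal sketch under stated assumptions: the `Consumer` stands in for the real all-or-nothing transactional move (`INSERT INTO ... SELECT` plus `DELETE`), an exception means the whole batch rolled back, and all names are illustrative:

```java
import java.util.*;
import java.util.function.Consumer;

// Illustrative sketch of the "retry & isolate" fallback. `moveBatch` stands in
// for the real transactional move; if it throws, the entire batch is assumed
// to have rolled back, so it is safe to retry the same rows in smaller pieces.
class RetryAndIsolate {

    static void archive(List<Integer> batch,
                        Consumer<List<Integer>> moveBatch,
                        List<Integer> archived,
                        List<Integer> quarantined) {
        try {
            moveBatch.accept(batch);        // fast path: whole batch, one transaction
            archived.addAll(batch);
        } catch (RuntimeException e) {
            if (batch.size() == 1) {        // cannot split further: isolate the bad row,
                quarantined.addAll(batch);  // leave it in the active table, log for review
                return;
            }
            int mid = batch.size() / 2;     // bisect into micro-batches and retry each half
            archive(batch.subList(0, mid), moveBatch, archived, quarantined);
            archive(batch.subList(mid, batch.size()), moveBatch, archived, quarantined);
        }
    }
}
```

With a 10-row batch in which one row always fails, the nine healthy rows end up archived on the retry passes and the single problematic row is quarantined for the admin to review.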
Does that balance of speed and safety make sense to you, or do you think we should look into a different retry pattern?
Thanks for the detailed explanation, this really helps clarify the approach.
Looking at the Obs structure, the leaf-to-root strategy makes a lot of sense. Since Obs is not just a simple hierarchy but also maintains version chains through previousVersion, along with parent-child relationships via obsGroup and groupMembers, handling leaves first helps avoid breaking both the grouping structure and the version lineage.
The transactional batch approach combined with retry and isolate also feels like a solid balance between safety and performance. It ensures we do not end up with partially moved data while still allowing progress when a few problematic rows exist.
One thing I was thinking about is safe retries during the fallback phase. In cases where a batch is retried in smaller units, would it be useful to consider idempotency so that already archived rows are not processed again?
Also, for long running archival jobs, would it make sense to track progress, such as a checkpoint or last processed obs id, so the process can resume efficiently without scanning large portions of the table again?
Overall, the design looks robust, especially in how it preserves integrity across both hierarchical and versioned relationships while still handling scale.