Archiving Voided Data - Project Discussion & Feedback Request

Hi everyone, I’m exploring the “Archiving Voided Data” project idea for GSoC this year and would love feedback from mentors and the community on the proposed approach.

About the Project:

OpenMRS uses soft-delete (voiding) to mark data as invalid while preserving the ability to restore it. Over time, especially for immutable entities like Observations, voided data accumulates significantly, often exceeding active data in production tables. This project aims to create a mechanism to archive old voided data to separate tables, improving database performance while maintaining the ability to restore archived data when needed.

Proposed Solution:

1. Archive Table Layer

We will create separate archive tables that mirror the main tables, like obs_archive, encounter_archive, order_archive, patient_archive, and so on. When data meets the archive condition (for example, voided for more than X days), it will be moved from the main table (e.g., obs) to its archive table (obs_archive).

This keeps the production tables clean and fast, while still keeping all voided data safe for compliance or future recovery.

The archive tables will have the same structure as the main tables, plus a few extra fields like archived_date, archived_by, and archived_reason.

2. Scheduled Job Automation

We will move the data on a fixed, configurable period using the scheduler. The job will:

a. Run on a configurable schedule (e.g., weekly on Sunday, or at month end)
b. Call the archiving service to execute the archive run
c. Process data in batches (e.g., 100 patients at a time)
d. Optionally log progress and statistics reports
e. Optionally send notifications/alerts on completion or failure
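As a rough sketch of step (a), the job needs to compute when the next run should fire. The class and method names below are my own placeholders (the real implementation would register a task with the OpenMRS scheduler rather than compute dates by hand):

```java
import java.time.DayOfWeek;
import java.time.LocalDate;
import java.time.temporal.TemporalAdjusters;

public class SchedulePlanner {

    // Next run date for a "weekly on Sunday" policy. If today is already
    // a Sunday, TemporalAdjusters.next() returns the following Sunday.
    static LocalDate nextSundayRun(LocalDate today) {
        return today.with(TemporalAdjusters.next(DayOfWeek.SUNDAY));
    }

    public static void main(String[] args) {
        // 2025-04-10 is a Thursday, so the next Sunday is 2025-04-13.
        System.out.println(nextSundayRun(LocalDate.of(2025, 4, 10)));
    }
}
```

The "weekly on Sunday" policy here is just one example; the actual schedule would come from a global property or task configuration.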

Why a scheduled approach? Archiving shouldn’t happen immediately when data is voided; there needs to be a retention period during which users can unvoid mistakes. The scheduled approach allows configurable retention policies and ensures archiving runs at configured times (for example, during off-peak hours).

We can follow a two-phase approach here:

a. One-time migration: during upgrade, archive all existing old voided data
b. Ongoing archiving (scheduled task): continuously archive data as it ages past the retention threshold
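Both phases select records against the same cutoff date. A minimal sketch of that calculation (class and method names are hypothetical, not existing OpenMRS APIs; the 90-day retention is an assumed example):

```java
import java.time.LocalDate;

public class RetentionPolicy {

    // Records voided on or before this date are eligible for archiving;
    // anything newer is still inside the retention window and can be unvoided.
    static LocalDate archiveCutoff(LocalDate today, int retentionDays) {
        return today.minusDays(retentionDays);
    }

    public static void main(String[] args) {
        // Assumed example: 90-day retention, evaluated on 2025-04-10.
        LocalDate cutoff = archiveCutoff(LocalDate.of(2025, 4, 10), 90);
        System.out.println(cutoff); // 2025-01-10
    }
}
```

The one-time migration would use the same cutoff but scan the whole table, while the scheduled task only looks at records voided since its last run.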

3. Service Layer for Archiving Operations

The service layer will handle the core business logic for archiving and restoring data, and will be triggered by the scheduled jobs.

a. The Archiving Process: The service first finds all records that have been voided longer than the retention period, for example records with voided = 1 and date_voided older than the configured threshold. Instead of processing one record at a time, it archives data in logical groups, such as a patient with all related data, or an encounter with its Obs, Orders, and other linked records. This keeps relationships intact and improves performance. The data is processed in small batches to avoid long-running transactions, and database JOINs are used to fetch all related records in a single query rather than thousands of separate calls.

Each batch is archived in one transaction: records are copied to their matching archive tables, audit logs are written with checksums for integrity, and only then is the data removed from the main tables. If anything fails, the transaction rolls back so no data is lost or duplicated.
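The batching above can be sketched as a simple partitioning step; each resulting batch would then be archived in its own transaction, so a failure only rolls back that batch. This is a generic sketch with hypothetical names, not the final service code:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchPartitioner {

    // Split the candidate record IDs into fixed-size batches. The archiving
    // service would open one transaction per batch: copy to archive tables,
    // write the audit log, then delete from the main table.
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            batches.add(ids.subList(i, Math.min(i + batchSize, ids.size())));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 250; i++) ids.add(i);

        List<List<Integer>> batches = partition(ids, 100);
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).size()); // 50 (final partial batch)
    }
}
```

A batch size of 100 is just the example from the proposal; in practice it would be configurable and tuned against transaction-log pressure.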

b. Restoration Process: The service first looks up the archived record in the archive table and performs validation checks: it verifies that any referenced entities (like the parent encounter) still exist or are also in the archive, checks for potential ID conflicts in case a new record was created with the same ID, and validates data integrity by comparing the stored checksum from the audit log with a freshly calculated one. If validation passes, the service copies the data back into the main obs table, sets voided = 0 to mark it as active, clears all void metadata (voidedBy, dateVoided, voidReason), and creates an audit log entry recording who restored the data and when. This entire restoration happens within a transaction to ensure consistency: if any step fails, nothing is changed.

4. Archive Audit Trail:

We will create an archive_event_log table to record all archiving operations as an immutable audit trail. This is separate from Hibernate Envers because they serve different purposes: Envers tracks entity lifecycle changes (when an Obs was created, updated, or voided), while the archive audit log tracks data movement operations (when data was archived, restored, or purged from the system). I believe this separation is also important for compliance with healthcare regulations.

Each record in the archive event log captures the complete context of an archiving operation: what entity was affected (entity_type and entity_id), what operation occurred (ARCHIVE, RESTORE, or PURGE), when it happened (timestamp), who performed it (user_id), and critically, a cryptographic checksum (SHA-256 hash) of the data at the time of archiving. Additionally, metadata stored as JSON provides supplementary information like the batch identifier, retention policy applied, and operation reason. When data is archived, we calculate the checksum of its complete state and store it in the audit log. Later, if we need to verify integrity or restore the data, we recalculate the checksum and compare; if they don’t match, we know the data was corrupted or tampered with, giving us verifiable evidence about data integrity.
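The checksum step can be sketched with the JDK alone (the class name and the pipe-delimited serialization are my own assumptions; the real implementation would serialize the record's canonical state):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class ChecksumSketch {

    // SHA-256 of a record's serialized state, stored in the audit log at
    // archive time and recomputed at restore time to detect corruption.
    static String sha256(String serializedRecord) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] hash = md.digest(serializedRecord.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e); // never on a standard JVM
        }
    }

    public static void main(String[] args) {
        // Hypothetical serialized Obs state; any field change alters the hash.
        String archived = sha256("obs_id=42|value=120|voided=1");
        String restored = sha256("obs_id=42|value=120|voided=1");
        String tampered = sha256("obs_id=42|value=999|voided=1");

        System.out.println(archived.equals(restored)); // true
        System.out.println(archived.equals(tampered)); // false
    }
}
```

One design point worth settling early: the serialization must be canonical (stable field order, stable formatting), otherwise an unchanged record could recompute to a different hash.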

The audit log is designed as append-only and immutable: records can only be inserted, never updated or deleted. This immutability is crucial for maintaining a tamper-proof audit trail that can withstand regulatory scrutiny and legal discovery.

Excited to hear any feedback, suggestions, or ideas that could improve this rough proposal.

cc @dkayiwa