GSoC 2026: Archiving Voided Data - Community Feedback

shivvanir · May 13, 2026, 7:52am

Hello everyone,

This is the proposed implementation for my GSoC 2026 project - Archiving voided data. I am posting this here for community feedback, as suggested by my mentors @dkayiwa and @jonathan. I’ll be taking the obs table as an example.

This approach creates an obs_archive table identical to obs (without inbound foreign keys to prevent locking). It moves voided data to obs_archive the moment the voided bit is flipped. So here, our archive table is synonymous to the void column.

Implementation Plan:

One-Time Legacy Migration: Before activating the trigger, a migration script moves all existing rows with voided = 1 from obs to obs_archive and (in the end) hard-deletion will be done from the primary table to clear historical bloat.
Change Foreign Key Constraints:
1. Drop the existing FK on obs.previous_version → obs.obs_id.
2. Add new FK: obs.previous_version → obs_archive.obs_id (active obs’s previous version is pointing to voided/archived). And for the archived obs, the previous version would reference the archive table.
Move-and-Delete Mechanism: Implement via a database AFTER UPDATE trigger or a Hibernate onFlushDirty interceptor. When the voided bit is flipped from 0 to 1, the system executes an atomic transaction:
1. Copy: Insert the row into obs_archive.
2. Hard Delete: Immediately physically delete the row from the primary obs table.
  
  Note: Archiving a parent obs would archive all the child obs too (just like the current implementation of voiding).
Restore Strategy: Restoration is handled by a Spring service that performs a reversal - pulling the record from obs_archive back into obs and resetting the voided bit to 0. Also re-establishing the FK reference to the active obs table (for previous_version)
Future proofing - Purge: A Date-time partitioning could be done on the archive table so the future implementation for dropping the records (say 10 or 15 year old records) - if required, is easy.

Questions for the community:

Should the data only be archived after ‘x’ days or months? As opposed to doing it instantaneously as proposed?

If yes,

We could use the existing JobRunr infrastructure to implement a recurring job that queries for voided = 1 within date range adjusted to the grace period and invoke the Archival service.
We could wire the grace period, job’s schedule interval and batch size to Global Properties for admin configurability.

Any feedback on the implementation is greatly appreciated. Thank you.

dkayiwa · May 13, 2026, 7:13pm

Given the size of the obs table, what do you think about archiving during off peak hours instead of the moment the voided bit is flipped? Something that runs at those periods and archives all voided records of a configured age (e.g older than 5 years, etc)

shivvanir · May 15, 2026, 12:42pm

I was considering this approach too but is there a reason why we shouldn’t be moving the voided records instantaneously? Because this way the obs table would be lean at all times.
If the concern is the unvoiding- the unvoid strategy proposed should work when needed.
Wouldn’t a nightly archiving job require a constant monitoring to screen for failures?
The only con I see with the proposed approach is the Extended transaction lock time every time a user voids but I don’t think that should be a problem as MySQL 8.0 on 2–4 CPU cores, 8–16 GB RAM, standard SSD can handle 3,000 to 7,000 QPS (mixed reads+writes). We could still perform load testing to see if this would affect our deployments.

dkayiwa · May 17, 2026, 9:39pm

How would the unvoiding happen in the case where a user accidentally voids and then wants to immediately unvoid the record?
Have you thought about the implications of this on the transactional system, given that we want to minimise the effects on it, especially during peek hours?

shivvanir · May 23, 2026, 4:55pm

Apologies for the delay.

As far as I can see, for the users the current UI does not support undoing after deletion or update. Only the Admin can undo a deleted encounter. Attached screenshots for reference. So this case of immediate unvoiding could be unlikely.

The proposed implementation would only add an extra query of the void table during unvoiding and a write during the voiding.

Suppose, the current qps is 10/sec in the peak hours and if all of them are voiding and unvoiding at the same time, after our implementation the qps would spike to 20 and the database will be able to handle that. Of course the qps is just an assumption. If we could get the actual highest qps in our implementations, it would be helpful.

If needed, i’ll also share the exact code path of the current execution of voiding and unvoiding and how the proposed implementation would change it.