Handling large database changes in OpenMRS

Continuing the discussion from Proposing a sprint to help people upgrade to Platform 2.x (Mozambique tried to upgrade to Platform 2.x, but was blocked for various reasons):

This is great information @mogoodrich. I’m sure @jdick and others with large databases would appreciate sharing best practices for handling big database changesets.

I found an Execute Database Updates Module (Design Page), but that's from 2011 (I moved it to archived projects).

We should probably update our Database Update Conventions to include the point about changes to obs. In general, any changes to potentially large core tables like obs should be vetted for alternative approaches and, when deemed necessary, coordinated so that we minimize how often we touch those tables.
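For example (an illustrative sketch only, not an existing OpenMRS convention; the procedure name, batch size, and the no-op placeholder update are hypothetical), a heavy change to obs can often be applied in bounded obs_id ranges from a stored procedure or script, so the table is never tied up by one multi-hour statement:

-- Sketch, assuming MySQL and the standard obs primary key (obs_id).
DELIMITER //
CREATE PROCEDURE batch_update_obs()
BEGIN
  DECLARE max_id INT;
  DECLARE batch_size INT DEFAULT 10000;
  DECLARE start_id INT DEFAULT 0;
  SELECT MAX(obs_id) INTO max_id FROM obs;
  WHILE start_id <= max_id DO
    -- apply the real changeset logic here, limited to one obs_id range
    UPDATE obs
       SET comments = comments   -- placeholder for the actual change
     WHERE obs_id >= start_id AND obs_id < start_id + batch_size;
    SET start_id = start_id + batch_size;
  END WHILE;
END //
DELIMITER ;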

I would expect to find info about this under the Implementation Guide on the wiki and/or in an upgrading section of the OpenMRS Guide.


I can work with others to pull this into a better place, but for immediate reference I wrote up more about what I did here:

Very large tables should be partitioned. Maybe that would make schema changes more tolerable.
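As a rough illustration (a sketch only, not something the obs table supports out of the box: MySQL requires the partitioning column to be part of every unique key, and partitioned tables can't have foreign keys, so the real schema would need changes first), horizontal range partitioning might look like this, with the column list trimmed for brevity:

-- Illustrative only: range-partitioning an obs-like table by its primary key.
CREATE TABLE obs_part (
  obs_id INT NOT NULL AUTO_INCREMENT,
  person_id INT NOT NULL,
  concept_id INT NOT NULL,
  obs_datetime DATETIME NOT NULL,
  value_coded INT,
  value_numeric DOUBLE,
  value_text TEXT,
  PRIMARY KEY (obs_id)
)
PARTITION BY RANGE (obs_id) (
  PARTITION p0   VALUES LESS THAN (5000000),
  PARTITION p1   VALUES LESS THAN (10000000),
  PARTITION p2   VALUES LESS THAN (15000000),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);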

Are there any plans to partition the obs table vertically? I mean, if value_coded, value_numeric, … could be stored in separate tables, that would really ease the burden on the obs table.

@owais.hussain but that would make querying them via SQL a nightmare

@ssmusoke writing queries will be a bit more complicated, but the end user won't have to wait endless hours to pull reports, especially the ones that use grouping.

I’ll see if I can cook up some numbers to support this. First of all, is the distribution between coded/numeric/text values even enough to entertain the idea?

Oh yes, joining to the obs table is already very difficult today; breaking it down further adds up to 4 additional joins per pass through the obs table.

@ssmusoke I performed a primitive test on our largest implementation. First, the distribution of the obs table (16M rows) by data type (I'm skipping complex and drug values, which are negligible):

[image: count of obs rows by value type]
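(For reference, a breakdown like this can be computed with something along these lines; a sketch against the standard value_* columns of obs:)

-- Sketch: count obs rows by which value_* column is populated.
SELECT
  SUM(value_coded    IS NOT NULL) AS coded,
  SUM(value_numeric  IS NOT NULL) AS numeric_vals,
  SUM(value_text     IS NOT NULL) AS text_vals,
  SUM(value_datetime IS NOT NULL) AS datetime_vals,
  COUNT(*)                        AS total
FROM obs;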

Here’s a query that counts patients on TB treatment who have a positive GeneXpert test but don’t have a CAD4TB chest X-ray score above 50. It has no real use; it's just a query using both coded and numeric obs.

-- TB treatment encounters (type 29) for patients with a positive GeneXpert
-- result (encounter type 172, concept 162202 = value_coded 1301), excluding
-- patients with a CAD4TB score (encounter type 186) above 50.
select e.encounter_id, e.encounter_datetime, e.location_id as treatment_site, count(*) as total from encounter as e 
inner join encounter as gxp on gxp.patient_id = e.patient_id and gxp.encounter_type = 172 
inner join obs as gxp_result on gxp_result.encounter_id = gxp.encounter_id and gxp_result.concept_id = 162202 
where e.encounter_type = 29 
and gxp_result.value_coded = 1301 
and not exists (
	select * from encounter as e2 
	inner join obs as o on o.encounter_id = e2.encounter_id
	where e2.patient_id = e.patient_id and e2.encounter_type = 186 and o.value_numeric > 50) 
group by e.encounter_id, e.encounter_datetime, e.location_id;

-- Vertical split: copy obs into one table per value type (CREATE TABLE ... LIKE
-- also copies the indices), then re-run the same query against the split tables.
create table obs_datetime like obs;
insert into obs_datetime select * from obs where value_datetime is not null;

create table obs_numeric like obs;
insert into obs_numeric select * from obs where value_numeric is not null;

create table obs_text like obs;
insert into obs_text select * from obs where value_text is not null;

create table obs_coded like obs;
insert into obs_coded select * from obs where value_coded is not null;

-- clear the query cache so the second run isn't served from it
reset query cache;

select e.encounter_id, e.encounter_datetime, e.location_id as treatment_site, count(*) as total from encounter as e 
inner join encounter as gxp on gxp.patient_id = e.patient_id and gxp.encounter_type = 172 
inner join obs_coded as gxp_result on gxp_result.encounter_id = gxp.encounter_id and gxp_result.concept_id = 162202 
where e.encounter_type = 29 
and gxp_result.value_coded = 1301 
and not exists (
	select * from encounter as e2 
	inner join obs_numeric as o on o.encounter_id = e2.encounter_id
	where e2.patient_id = e.patient_id and e2.encounter_type = 186 and o.value_numeric > 50) 
group by e.encounter_id, e.encounter_datetime, e.location_id;

The first query took 1.781 sec, vs the one run after partitioning, which took 1.046 sec (not including fetch time).

The tables were created as copies, so all indices and keys were copied as is.

Would it be possible to collect more data on this from several implementations, especially ones using a DBMS other than MySQL?