Handling large database changes in OpenMRS

Continuing the discussion from Proposing a sprint to help people upgrade to Platform 2.x (Mozambique tried to upgrade to Platform 2.x, but was blocked for various reasons):

This is great information @mogoodrich. I’m sure @jdick and others with large databases would appreciate sharing best practices for handling big database changesets.

I found an Execute Database Updates Module (Design Page), but that's from 2011 (I moved it to archived projects).

We should probably update our Database Update Conventions to include the point about changes to obs. In general, any changes to potentially large core tables like obs should be vetted for alternative approaches and, when deemed necessary, coordinated so that we minimize how often we touch those tables.
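For example (an illustrative sketch only, not an existing OpenMRS convention; the procedure name, batch size, and the no-op placeholder update are hypothetical), a heavy change to obs can often be applied in bounded obs_id ranges from a stored procedure or script, so the table is never tied up by one multi-hour statement:

-- Sketch, assuming MySQL and the standard obs primary key (obs_id).
DELIMITER //
CREATE PROCEDURE batch_update_obs()
BEGIN
  DECLARE max_id INT;
  DECLARE batch_size INT DEFAULT 10000;
  DECLARE start_id INT DEFAULT 0;
  SELECT MAX(obs_id) INTO max_id FROM obs;
  WHILE start_id <= max_id DO
    -- apply the real changeset logic here, limited to one obs_id range
    UPDATE obs
       SET comments = comments   -- placeholder for the actual change
     WHERE obs_id >= start_id AND obs_id < start_id + batch_size;
    SET start_id = start_id + batch_size;
  END WHILE;
END //
DELIMITER ;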

I would expect to find info about this under the Implementation Guide on the wiki and/or in an upgrading section of the OpenMRS Guide.


I can work with others to pull this into a better place, but for immediate reference I wrote up more about what I did here:

Very large tables should be partitioned. Maybe that would make schema changes more tolerable.
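As a rough illustration (a sketch only, not something the obs table supports out of the box: MySQL requires the partitioning column to be part of every unique key, and partitioned tables can't have foreign keys, so the real schema would need changes first), horizontal range partitioning might look like this, with the column list trimmed for brevity:

-- Illustrative only: range-partitioning an obs-like table by its primary key.
CREATE TABLE obs_part (
  obs_id INT NOT NULL AUTO_INCREMENT,
  person_id INT NOT NULL,
  concept_id INT NOT NULL,
  obs_datetime DATETIME NOT NULL,
  value_coded INT,
  value_numeric DOUBLE,
  value_text TEXT,
  PRIMARY KEY (obs_id)
)
PARTITION BY RANGE (obs_id) (
  PARTITION p0   VALUES LESS THAN (5000000),
  PARTITION p1   VALUES LESS THAN (10000000),
  PARTITION p2   VALUES LESS THAN (15000000),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);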

Are there any plans to partition the obs table vertically? I mean, if value_coded, value_numeric, … could be stored in separate tables, that would really ease the burden on the obs table.

@owais.hussain but that would make querying them via SQL a nightmare

@ssmusoke writing queries will be a bit more complicated, but the end user won't have to wait endless hours to pull reports, especially the ones that use grouping.

I’ll see if I can cook up some numbers to support this. First of all, is the distribution between coded/numeric/text values even enough to entertain the idea?

Oh yes, joining to the obs table is already very difficult today; breaking it down further adds up to 4 additional joins per pass through the obs table.

@ssmusoke I performed a primitive test on our largest implementation. First, the distribution of the obs table (16M rows) by data type (I'm skipping complex and drug values, which are negligible):

[image: count of obs rows by value type]
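(For reference, a breakdown like this can be computed with something along these lines; a sketch against the standard value_* columns of obs:)

-- Sketch: count obs rows by which value_* column is populated.
SELECT
  SUM(value_coded    IS NOT NULL) AS coded,
  SUM(value_numeric  IS NOT NULL) AS numeric_vals,
  SUM(value_text     IS NOT NULL) AS text_vals,
  SUM(value_datetime IS NOT NULL) AS datetime_vals,
  COUNT(*)                        AS total
FROM obs;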

Here’s a query that counts patients on TB treatment who have a positive GeneXpert test but don’t have a CAD4TB chest X-ray score above 50. It has no real use; it's just a query using both coded and numeric obs.

-- TB treatment encounters (type 29) for patients with a positive GeneXpert
-- result (encounter type 172, concept 162202 = value_coded 1301), excluding
-- patients with a CAD4TB score (encounter type 186) above 50.
select e.encounter_id, e.encounter_datetime, e.location_id as treatment_site, count(*) as total from encounter as e 
inner join encounter as gxp on gxp.patient_id = e.patient_id and gxp.encounter_type = 172 
inner join obs as gxp_result on gxp_result.encounter_id = gxp.encounter_id and gxp_result.concept_id = 162202 
where e.encounter_type = 29 
and gxp_result.value_coded = 1301 
and not exists (
	select * from encounter as e2 
	inner join obs as o on o.encounter_id = e2.encounter_id
	where e2.patient_id = e.patient_id and e2.encounter_type = 186 and o.value_numeric > 50) 
group by e.encounter_id, e.encounter_datetime, e.location_id;

-- Vertical split: copy obs into one table per value type (CREATE TABLE ... LIKE
-- also copies the indices), then re-run the same query against the split tables.
create table obs_datetime like obs;
insert into obs_datetime select * from obs where value_datetime is not null;

create table obs_numeric like obs;
insert into obs_numeric select * from obs where value_numeric is not null;

create table obs_text like obs;
insert into obs_text select * from obs where value_text is not null;

create table obs_coded like obs;
insert into obs_coded select * from obs where value_coded is not null;

-- clear the query cache so the second run isn't served from it
reset query cache;

select e.encounter_id, e.encounter_datetime, e.location_id as treatment_site, count(*) as total from encounter as e 
inner join encounter as gxp on gxp.patient_id = e.patient_id and gxp.encounter_type = 172 
inner join obs_coded as gxp_result on gxp_result.encounter_id = gxp.encounter_id and gxp_result.concept_id = 162202 
where e.encounter_type = 29 
and gxp_result.value_coded = 1301 
and not exists (
	select * from encounter as e2 
	inner join obs_numeric as o on o.encounter_id = e2.encounter_id
	where e2.patient_id = e.patient_id and e2.encounter_type = 186 and o.value_numeric > 50) 
group by e.encounter_id, e.encounter_datetime, e.location_id;

The first query took 1.781 sec, vs the one run after partitioning, which took 1.046 sec (not including fetch time).

The tables were created as copies, so all indices and keys were copied as is.

Would it be possible to collect more data on this from several implementations, especially ones using a DBMS other than MySQL?