Initializer to be extended to handle data (not just metadata)

bistenes · November 9, 2018, 3:52pm

@dsurrao pointed me to it. I’m interested in using it to migrate data from a preexisting EMR into OpenMRS. I’d use it to write CSV loaders for Patients, Encounters, etc. Then I’d export my existing data to CSV and massage it a little to meet the loader’s expectations. Does that seem like a reasonable way to use it?

@mogoodrich has also mentioned that it might be worth looking into as a way to manage our metadata.

mksd · November 9, 2018, 5:14pm

Initializer’s original and current goal is to help manage metadata. It was intended to be the backend-equivalent of what Bahmni config is to the front-end.

I don’t know whether using it to load data is a good approach. This would require some thinking and design.

mksd · November 9, 2018, 5:16pm

To clarify: Initializer is not meant to be used with Bahmni only, but its intent is inspired by a pattern that Bahmni uses for its front-end.

bistenes · November 9, 2018, 10:58pm

Right, yes. But it seems like the work of adapting Initializer to this use case would be simpler than writing a new module for it – Initializer already provides all the machinery needed to load CSVs into the database.

Another possibility might be to fork Initializer and make a module that is intended for migration, with Loaders for Patients, Encounters, Observations, etc.

mksd · November 10, 2018, 9:48am

Option 1 is a possibility. I would definitely start from Initializer (‘Iniz’): add the logic to load data and the unit tests that go with it, then UAT it to see if it can handle lots of data. Importing patients and obs will mean importing a very large number of entities, and that’s where the challenge lies (Initializer or not).

For a recent implementation in the Middle East we needed to load 6,000 diagnoses. That meant splitting them into 10 CSV files and would take 30 min on a normal dev machine.

However, the thing with concepts is that Iniz looks up concepts before adding/editing what’s provided on a CSV line. I realise that this becomes slower when many concepts are already saved. The patients and obs use case is definitely different.

Option 2 is a bad idea; at worst you would depend on/require Iniz in a new module. But as I said, you should start from it and see where it leads you.

We can discuss this in a design call if you want.

bistenes · November 12, 2018, 10:06pm

Oh interesting! Is there a way for client code to disable looking up before adding/editing? We have on the order of 500k Encounters to load, I don’t know about Observations.

Yeah, discussing this would be great. Do you mean a one-on-one or during an OpenMRS design call?

mksd · November 13, 2018, 8:28am

Sorry I stalked you a little and saw this:

I’m here working with Compañeros en Salud (PIH Mexico) to replace their old MS Access -based EMR with OpenMRS.

Is it in this context that you will end up exporting data out of a legacy database prior to re-importing it in OpenMRS? Out of curiosity, is it a one-off process or should it be streamlined and re-run with further implementations?

mseaton · November 13, 2018, 2:56pm

@mksd, I’d be interested in exploring this. I’ve heard a number of threads recently about how difficult / memory-intensive / time-consuming people are finding it to import data into OpenMRS. I’ve had similar experiences in the past, but there are some tricks that can help in massive ways. In particular, making sure the Hibernate session is getting flushed/cleared every X records imported. I can try to find some code I’ve written in the past, but it’s basically just a matter of putting some logic in place to call Context.flushSession(); Context.clearSession(); every X rows of your import CSV. Maybe you are already doing this, and there are other reasons for the slowness, but importing 6000 diagnoses should not take 30 minutes or require splitting them up into 10 files of 600 diagnoses.
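A minimal sketch of the flush/clear-every-X-rows idea, not code from this thread. The `Session` interface and the `BatchedImport` class are hypothetical stand-ins so the example runs on its own; in an OpenMRS module the two callbacks would presumably delegate to Context.flushSession() and Context.clearSession().

```java
import java.util.List;
import java.util.function.Consumer;

class BatchedImport {

    /** Hypothetical stand-in for the Hibernate session operations. */
    interface Session {
        void flush();
        void clear();
    }

    /**
     * Saves every row, flushing and clearing the session after each full batch
     * and once more at the end if a partial batch remains.
     * Returns the number of flush/clear cycles performed.
     */
    static <T> int importAll(List<T> rows, Consumer<T> saver, Session session, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int cycles = 0;
        int pending = 0;
        for (T row : rows) {
            saver.accept(row);
            pending++;
            if (pending == batchSize) {
                session.flush();   // push pending SQL to the database
                session.clear();   // drop saved entities from the session cache
                cycles++;
                pending = 0;
            }
        }
        if (pending > 0) {         // flush the last, partial batch
            session.flush();
            session.clear();
            cycles++;
        }
        return cycles;
    }
}
```

The batch size is a parameter because the right value has to be found by measurement for each entity type.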

To speak for @bistenes, I believe this is a one-off migration/import, and we are just looking for some existing code that we can build off of so as not to re-invent the wheel (and perhaps provide something that another group could use for a similar need down the road).

Best, Mike

mksd · November 13, 2018, 4:41pm

Thanks @mseaton, this makes a lot of sense, and I believe that Iniz’ code base will fairly easily allow flushing + clearing every so often (based on a configuration). We definitely need to try that.

What other improvements can you recall in the context of loading large pieces of data into OpenMRS?

Actually splitting the files didn’t improve the load time at all. I was hoping it might, but that’s not the primary reason why we did this. It’s rather because of the checksums footprint. Splitting minimizes the re-processing impact of adding a new concept or editing an existing concept. Basically only the one CSV file where the concept is referenced will be processed; all the others will be skipped.
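To illustrate the checksum idea, here is a minimal sketch (not Iniz’ actual implementation). The `ChecksumSkipper` class, the in-memory map and the choice of SHA-256 are all assumptions made for the example; a real loader would persist the checksums and would record a new one only after the file was processed successfully.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

class ChecksumSkipper {

    // file name -> checksum of the content seen last time (hypothetical in-memory store)
    private final Map<String, String> knownChecksums = new HashMap<>();

    /** Returns the SHA-256 digest of the content as a lowercase hex string. */
    static String checksum(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 must be available on every Java platform
            throw new IllegalStateException(e);
        }
    }

    /** Returns true if the file is new or has changed, and records its checksum. */
    boolean needsProcessing(String fileName, String content) {
        String current = checksum(content);
        String previous = knownChecksums.put(fileName, current);
        return !current.equals(previous);
    }
}
```

With ten files, editing one concept changes the checksum of one file only, so the other nine are skipped on the next run.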

Cc @zouchine @mksrom

mksd · November 13, 2018, 4:53pm

@mseaton in regards to loading encounters and obs, do you think that a line by line CSV approach might work? I find it unlikely but again one has to try it out.

mogoodrich · November 13, 2018, 5:36pm

For what it’s worth, in address hierarchy, when importing hierarchy entries from a CSV, we persist them in batches… we didn’t do an in-depth analysis, but batches of 10 seemed to work well.

github.com/openmrs/openmrs-module-addresshierarchy/blob/master/api/src/main/java/org/openmrs/module/addresshierarchy/util/AddressHierarchyImportUtil.java#L32

import java.util.Map;
import java.util.Stack;

public class AddressHierarchyImportUtil {

	protected static final Log log = LogFactory.getLog(AddressHierarchyImportUtil.class);

	// number of entries to save at one time
	// we want to save in batches to improve performance, but if try to save ALL at once we can run into memory issues
	protected static final int ENTRY_BATCH_SIZE = 10;

	/**
	 * Takes a file of delimited addresses and creates and address hierarchy out of it
	 * Starting level determines what level of the hierarchy to start at when doing the input
	 */
	public static final void importAddressHierarchyFile(InputStream stream, String delimiter, String userGeneratedIdDelimiter, AddressHierarchyLevel startingLevel) {

		AddressHierarchyService ahService = Context.getService(AddressHierarchyService.class);

		String line;

mksd · November 13, 2018, 6:01pm

Interesting @mogoodrich…

@bistenes as you can see, it’s a matter of trying this out. With obs you’re entering a space where you’ll have to load millions of entries. If we can fine tune a process that works, that will be quite a killer feature.

Perhaps you could spend some time on a test module that just loops on many obs to see if you’re hitting some obvious performance limits? No CSV yet here, just saving many many obs at once in a loop to get a sense of where the limits are encountered. I guess that the hope is that if we do things well, this would just be at worst a linear process.

bistenes · November 13, 2018, 9:04pm

Sure, I could do that. It might be a while though, I’m leaving for some traveling (and OpenMRS conf) this weekend.

I’m interested in how preexisting concepts are being searched for. It looks like Iniz only interacts with the database by saying service.saveConcept(concept). I didn’t look at the code for ConceptService, but for ObsService, it only does an upsert if an ID is provided. If no ID is provided, and I’m reading the code right, it should just insert the new Obs straightaway. Is the behavior different for Concepts, or is your code creating concepts with IDs, or am I missing something else?

mksd · November 15, 2018, 7:24am

@bistenes this happens here:

Concept concept = service.getConceptByUuid(uuid);

if (StringUtils.isEmpty(uuid) && concept == null) {
  Locale currentLocale = Context.getLocale();
  LocalizedHeader lh = getLocalizedHeader(HEADER_FSNAME);
  for (Locale nameLocale : lh.getLocales()) {
    String name = line.get(lh.getI18nHeader(nameLocale));
    if (!StringUtils.isEmpty(name)) {
      Context.setLocale(nameLocale);
      concept = service.getConceptByName(name);
      if (concept != null) {
        break;
      }
    }
  }
  Context.setLocale(currentLocale);
}

It does a fetch attempt by UUID first, then tries by name (in all possible locales).

bistenes · January 11, 2019, 11:12pm

Ok, I just did some benchmarking of ObsService.saveObs. I seem to be getting about 200 obs/s when doing a flush-clear every 25 obs, which seems to be the sweet spot.

That code you pointed to, Dimitri, is in the concept line processor, which I don’t think I’ll be using. So we’re safe.

I’m writing a Patient loader now, against a cloned copy of Iniz.

bistenes · January 12, 2019, 12:16am

Has anyone written code that is client to Iniz before? I just realized that ConfigDirUtil.loadCsvFiles obtains a CsvParser from a factory, which doesn’t allow me to inject my new Parser.

Should I add a function to ConfigDirUtil with the following signature?

public static <T extends CsvParser> void loadCsvFiles(String configDirPath, String checksumDirPath, T parserClass)

(other designs welcome… I’m more of a Python guy)
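One other possible design, sketched under assumptions: rather than passing a parser instance, the loader could accept a factory function so the caller decides which parser gets built. Everything below (`ParserInjectionSketch`, `LineParser`, `loadCsv`) is hypothetical and simplified for illustration; it is not Iniz’ ConfigDirUtil or CsvParser, and the comma split does not handle quoted fields.

```java
import java.util.function.Function;

class ParserInjectionSketch {

    /** Hypothetical, simplified stand-in for a CSV parser. */
    interface LineParser {
        void processLine(String[] fields);
    }

    /**
     * Splits the CSV text into a header and data lines, then hands each data line
     * to a parser built by the caller-supplied factory rather than by a fixed one.
     * Returns the number of data lines processed.
     */
    static int loadCsv(String csvText, Function<String[], LineParser> parserFactory) {
        String[] lines = csvText.split("\\R");
        if (lines.length == 0 || lines[0].trim().isEmpty()) {
            return 0; // no header, nothing to load
        }
        String[] header = lines[0].split(",");
        LineParser parser = parserFactory.apply(header);
        int processed = 0;
        for (int i = 1; i < lines.length; i++) {
            if (lines[i].trim().isEmpty()) {
                continue; // skip blank lines
            }
            parser.processLine(lines[i].split(",", -1)); // -1 keeps trailing empty fields
            processed++;
        }
        return processed;
    }
}
```

A caller could then pass a lambda such as `header -> fields -> savePatient(header, fields)`, where `savePatient` is whatever the new domain needs.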

mksd · January 12, 2019, 11:11am

It’s because you asked:

Every time a new domain is added the bootstrap method must be overloaded, you saw above the example for the Concept domain.

No, I don’t think so. You mean code written in a new module that would depend on Iniz? While Iniz can certainly be refactored to allow that, I would stress again that the philosophy behind it would rather be to have Iniz be as rich as possible in itself. Why don’t you just try adding a new Patient domain to Iniz, and PR your work?

I would be more than happy to support Iniz’ first extension into data through new Patient and Obs domains.

ssmusoke · January 12, 2019, 1:12pm

@bistenes Oh wow!!! Now this is something that we have been battling with over the years, and this provides a great foundation for helping with data migrations and probably merges of data. I am so keeping an eye on this since it has been a blocker for me for over a year.

Another strange and interesting use case would be to use this as an integration point for other systems that can provide formatted CSV data. I know everyone will say use the REST API, but currently it’s easier to just get a CSV file and drop it into a directory.

A couple of questions:

How are you managing the mapping of patients by identifiers?
How are you managing the transformation of columns (in CSV) to obs data? Any chance you can share a sample CSV file and how you are approaching this? I would be happy to collaborate and test it out.

mksd · January 12, 2019, 4:16pm

While the Patient domain should come first (and presumably also Visit and Encounter), there is no particular blocker here to create a new Obs domain in Iniz. The only reason why we haven’t done it yet is because our focus was rather on metadata management. But code wise the effort is reasonable.

I invite either of you @bistenes or @ssmusoke to give it a go and provide a tentative PR for -say- the Patient domain.

What mapping are you referring to here?

I would imagine that there would be a column ‘Identifiers’ where all the patient’s identifiers are provided as a ;-separated list (as usual with Iniz and lists of values). However, I guess that each identifier will have to be prefixed with its patient identifier type. Something like:

…	Identifiers	…
…	OpenMRS ID:U78912 ; National ID:000-59019033-01	…

Where the prefixes ‘OpenMRS ID’ and ‘National ID’ are the patient identifier type names.
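A minimal sketch of how such a cell could be parsed, assuming the `<type>:<identifier>` format above with `;` between entries. The `IdentifierColumnParser` class is hypothetical; resolving each prefix to an actual patient identifier type is left out.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class IdentifierColumnParser {

    /**
     * Parses a cell such as "OpenMRS ID:U78912 ; National ID:000-59019033-01"
     * into an ordered map of identifier type prefix -> identifier value.
     */
    static Map<String, String> parse(String cell) {
        Map<String, String> result = new LinkedHashMap<>();
        if (cell == null || cell.trim().isEmpty()) {
            return result;
        }
        for (String entry : cell.split(";")) {
            String trimmed = entry.trim();
            if (trimmed.isEmpty()) {
                continue; // tolerate a trailing ';'
            }
            int sep = trimmed.indexOf(':'); // the first ':' separates prefix from value
            if (sep <= 0 || sep == trimmed.length() - 1) {
                throw new IllegalArgumentException(
                    "Expected '<type>:<identifier>' but got: " + trimmed);
            }
            String type = trimmed.substring(0, sep).trim();
            String value = trimmed.substring(sep + 1).trim();
            result.put(type, value);
        }
        return result;
    }
}
```

Because the split is on the first colon only, the same parsing would work if the prefix were an identifier type UUID instead of a name.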

ssmusoke · January 12, 2019, 6:03pm

Actually, rather than the identifier type name, I would go for the identifier type UUID, since as metadata that does not seem to change across implementations.