Migrating PIH dictionary to OCL

@mogoodrich / @mseaton / @ball / @akanter / @paynejd,

Let’s use this Talk topic for any general discussion around getting PIH into OCL.

FYI for the community…

We’ve been working on getting the Partners In Health (PIH) concept dictionary into Open Concept Lab (OCL), with the goal of getting PIH to a place where they can manage their main (“gold”) dictionary within OCL. We expect that getting PIH to this point will pave the way for other implementations to use OCL for dictionary management.

Currently, the process for migrating a dictionary from OpenMRS into OCL is:

  1. Create SQL dump of concept tables
  2. Run the ocl_omrs conversion script to convert the SQL dump into OCL JSON for import (the first pass can take some iterations to discover & clean up local dictionary issues)
  3. Use OCL’s bulk import tool to import into a source with OpenMRS custom validation schema.
  4. Create a collection containing all content (concepts & mappings) in the source.
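
A minimal sketch of what step 2 produces for a single concept, assuming the usual shape of an OCL import resource (a JSON object per line with a "type" of "Concept"). The concept_row_to_ocl helper and the sample row are hypothetical and only illustrate the idea; the actual ocl_omrs script handles far more (multiple names and locales, descriptions, mappings, retired flags).

```python
import json


def concept_row_to_ocl(row, owner="PIH", source="PIH"):
    """Convert one concept row (a plain dict) into an OCL-style concept resource.

    Hypothetical helper: the keys of `row` are assumptions, not the
    actual column names used by ocl_omrs.
    """
    return {
        "type": "Concept",
        "owner": owner,
        "owner_type": "Organization",
        "source": source,
        "id": str(row["concept_id"]),
        "external_id": row["uuid"],
        "concept_class": row["concept_class"],
        "datatype": row["datatype"],
        "names": [
            {
                "name": row["name"],
                "locale": row.get("locale", "en"),
                "locale_preferred": True,
                "name_type": "FULLY_SPECIFIED",
            }
        ],
    }


# Placeholder values, not taken from the PIH dictionary.
example_row = {
    "concept_id": 1,
    "uuid": "00000000-0000-0000-0000-000000000001",
    "concept_class": "Misc",
    "datatype": "N/A",
    "name": "Example concept",
}

# One JSON object per line is what gets handed to the bulk import in step 3.
line = json.dumps(concept_row_to_ocl(example_row), sort_keys=True)
print(line)
```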

This work is being tracked in a few places (now, with this post, one more):

Location | Link | Purpose
GitHub | ocl_issues #45 | For tracking technical work done.
Wiki | Migrating to OCL: PIH Use Case | “Journal” of high level steps/challenges
Talk | (you are here) | Discussion around getting PIH using OCL
Slack | #ocl | General OCL-related chat

This process of getting PIH migrated into OCL has been ongoing for several months and has already knocked out several bugs and taught us a lot about what is needed to support implementations with their dictionary management.

Some general issues (not just PIH-specific) have been collected on the Using OCL with existing OpenMRS dictionaries wiki page.

I’ve created this post to serve as a location for us to use for asynchronous discussion around migrating PIH’s dictionary to OCL.


Thanks @burke (both for starting this thread and working on this overall!)

One topic I wanted to check on here is how Concept Sources should be handled and matched across OCL and OpenMRS.

In our existing system, we have concept sources that exist in our database that look like this:

When I import from OCL into an empty database, it sets up concept sources that look like this:

And if I were to import the same OCL collection into the existing database shown in the first image, I would end up with the superset of all of these, matching on “Name”.

So I’d have both a “SNOMED CT” source in my dictionary and a “SNOMED-CT” source (and all of the mappings I thought I was updating on SNOMED CT were in fact added to a new, duplicate source I didn’t know had been added). And information that I might expect to be present (e.g., HL7 Code and a useful description) isn’t added from OCL, though I’m not sure how important this is.

Some questions:

  • Can OCL handle concept sources with spaces in the names? Where are the dashes getting added - is this something happening in the import tool @burke ?

  • Assuming we can’t have spaces, how should we go about matching a source that comes from OCL to a source that exists in an OpenMRS instance? Name seems to be an insufficient and unreliable identifier, but maybe some sort of normalization where space, underscore, and dash are all interchangeable in terms of matching on name (SNOMED CT = SNOMED_CT = SNOMED-CT)?

  • Should the OCL subscription module continue to do what it does now, and create any sources that it doesn’t find during the import process? Or would it be better for the subscription module to fail and report on the concept sources that it was unable to find, so that these can be manually added prior to import and there are no silent surprises like I am currently finding?
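
A rough sketch of the normalization idea from the second question, combined with the "report instead of silently create" option from the third. The function names are hypothetical and this is not how the subscription module currently behaves; it only treats space, underscore, and dash as interchangeable when comparing names.

```python
import re


def normalize_source_name(name):
    """Collapse runs of spaces, underscores, and dashes into a single dash."""
    return re.sub(r"[\s_\-]+", "-", name.strip())


def match_sources(incoming_names, existing_names):
    """Match incoming (OCL) source names against existing (OpenMRS) ones.

    Returns (matched, unmatched): `matched` maps each incoming name to the
    existing name it corresponds to; `unmatched` lists incoming names with
    no counterpart, which an importer could report rather than create.
    """
    existing = {normalize_source_name(n): n for n in existing_names}
    matched, unmatched = {}, []
    for name in incoming_names:
        key = normalize_source_name(name)
        if key in existing:
            matched[name] = existing[key]
        else:
            unmatched.append(name)
    return matched, unmatched


matched, unmatched = match_sources(
    ["SNOMED-CT", "PIH-Malawi", "ICD-10"],   # names as they arrive from OCL
    ["SNOMED CT", "PIH Malawi", "CIEL"],     # names already in the database
)
print(matched)    # {'SNOMED-CT': 'SNOMED CT', 'PIH-Malawi': 'PIH Malawi'}
print(unmatched)  # ['ICD-10']
```

Whether comparison should also ignore case is a separate decision that this sketch leaves out.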

Although I can see solutions that would involve updating the names of our existing Concept Sources to align with what is in OCL, this is potentially non-trivial as mappings are often used to represent concepts on forms and elsewhere in the “SOURCE:CODE” form. So we would need to ensure that nothing in our implementations was depending on “SNOMED CT:XYZ” rather than “SNOMED-CT:XYZ”. So although we may try to update our sources to match OCL, I don’t think we should force this on everyone.

Interested in thoughts - @burke and all.


So, the dashes are getting added based on a lookup table here. The limitation here is that OCL doesn’t currently support spaces in identifiers. (IIRC there was an issue to get OCL to support spaces in at least concept identifiers to be able to match FHIR’s code type).

However, this is somewhat different: while the OCL identifier field can’t handle spaces, the OCL name field (for sources, concepts, etc.) certainly can. And in the openconceptlab module, we actually use the source’s name rather than its identifier to try and find the source or create it if necessary. That means that fixing this should be as simple as: updating the name for SNOMED-CT to SNOMED CT in OCL (and for any other concept sources) and then creating a new release of the PIH dictionary (a new release is necessary because release exports are generated only once, the first time they are requested).

I think there are two workflows to bear in mind here:

  1. In the first case, the concept dictionary is empty and is populated by the subscription module. In this case, having the subscription module raise an error because a concept source doesn’t exist seems less helpful.
  2. In the second case, we’re layering concepts on top of an existing dictionary. In this case it might or might not make sense to raise an error. For example, from your screen shots, I see you have mappings for SNOMED CT, AMPATH, CIEL, etc., but not, e.g., the mappings for ICD-10. What would be the best way to handle that case?

I see from this that we still have SNOMED-NP which is probably something we should stop copying as my understanding is that that was something @akanter added to handle more complex SNOMED mappings. I don’t know whether this should be fixed on the subscription module side or the OCL side.

Thanks @ibacher that’s very helpful. A few comments:

Looking in OCL, it looks like the names of sources are populated correctly by the import tool when they create them. For example, I see this in OCL staging for PIH:

But you are right that SNOMED-CT has “SNOMED-CT” for both name and identifier in the organization where this is defined: https://app.staging.openconceptlab.org/#/orgs/IHTSDO

I’m not sure if the right answer here is to update the name at this point, as I don’t know if that will have the same impact on others who are now relying on the name being “SNOMED-CT”, but I’m interested in thoughts around this. In general, I’m not sure we can assume that everyone will name their sources the same across the OpenMRS ecosystem that they intend to use to represent the same entities in OCL.

I only shared a snippet of my screen, with a subset of my sources in order to illustrate the problem. So let’s not worry about this.

  • Mike

Also @ibacher , this does expose a bug in the OCL export. The export zip that I download from OCL to represent the entire PIH collection above has this in it:

"to_source_url": "/orgs/PIH/sources/PIH-Malawi/", 
"to_source_name": "PIH-Malawi"

As you can see from the screenshot image in my previous post, this “to_source_name” should be “PIH Malawi” (no dash), as this is the actual name of the source in OCL rather than the identifier of the source.
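
One way to spot this in an export, sketched under the assumption that the mappings are available as a list of dicts with the two keys shown above, and that the real source names are known from elsewhere (here a hand-built, hypothetical lookup keyed by source URL):

```python
def find_identifier_as_name(mappings, source_names_by_url):
    """Return mappings whose to_source_name differs from the known source name."""
    suspect = []
    for mapping in mappings:
        expected = source_names_by_url.get(mapping["to_source_url"])
        if expected is not None and mapping["to_source_name"] != expected:
            suspect.append(mapping)
    return suspect


# The fragment from the export, plus the name the source actually has in OCL.
export_mappings = [
    {
        "to_source_url": "/orgs/PIH/sources/PIH-Malawi/",
        "to_source_name": "PIH-Malawi",
    }
]
known_names = {"/orgs/PIH/sources/PIH-Malawi/": "PIH Malawi"}

suspects = find_identifier_as_name(export_mappings, known_names)
print(len(suspects))  # 1
```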

(Editing my last post, as Discourse won’t allow me to post again separately).

I did a little more testing of the above issue, and a bit beyond, today, and put together several tickets to reflect what I encountered. To summarize:

Issue 1: The issue described above. Improve creation of concept sources during import:

  • Ticket in OCL to include full concept source information in export json
  • Ticket in openconceptlab module to use the newly added concept source information to create more complete sources in OpenMRS

Issue 2: OCL imports no longer import allow_decimal information for numeric concepts

Issue 3: OCL bulk import and openconceptlab import need to support Concept Complex

There is one more issue around concept “version” that I did not ticket, but in the interest of not cluttering this more than I already have, I’ll create a separate talk post for that.

Interested in any and all feedback on the above issues. @burke / @ibacher FYI

Yeah! That’s disappointing…

Thanks @mseaton! This is extremely useful feedback. I think all of the issues you’ve highlighted are things that would be good to address.

Thanks @mseaton for summarizing the issues very cleanly… including creating the needed tickets!

Sources

As for concept sources, while I agree it’s important for these to be sent between OCL & OpenMRS (not only adding them to OCL’s export, but also adding them in the import instead of our current manual alignment step), the critical information for sources is:

Critical metadata for sources:

Canonical URL: A canonical reference to the source. While this could be accomplished with a UUID, OID, or URN, I favor a canonical URL, since it combines namespacing (that we don’t have to manage), familiarity, and human friendliness – e.g., http://loinc.org or http://pih.org.

Source name: A short reference. While we can’t enforce the format & uniqueness universally, we could encourage use of standard names for dictionaries like CIEL, LOINC, and SNOMED-CT. Since these could be part of OCL addresses and appear within mapping references (in the form PIH:123), it’s probably better to stick to characters that don’t have to be encoded in URLs and to avoid spaces.

It would be nice if OCL helped resolve these (e.g., look up sources by name or canonical URL) and could help normalize them.

Allow Decimal

Thanks for creating the ticket. It would likely help to continue to support “precise” for backwards compatibility (maybe logging a deprecation warning) as we work to use “allow_decimal” across the board, since “precise” will probably be showing up for a while.
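
A small sketch of that backwards-compatible read, assuming both keys arrive as JSON booleans in some dict of concept attributes. Where exactly they live in the export is not specified here, so the `attributes` argument and the function name are hypothetical.

```python
import warnings


def read_allow_decimal(attributes):
    """Prefer "allow_decimal"; fall back to the legacy "precise" key.

    Returns True/False, or None when neither key is present.
    """
    if "allow_decimal" in attributes:
        return bool(attributes["allow_decimal"])
    if "precise" in attributes:
        warnings.warn(
            '"precise" is deprecated; use "allow_decimal" instead',
            DeprecationWarning,
            stacklevel=2,
        )
        return bool(attributes["precise"])
    return None


print(read_allow_decimal({"allow_decimal": True}))  # True
print(read_allow_decimal({}))                       # None
```

When both keys are present, the newer one wins and no warning is raised.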

Concept Complex

Again, thanks for the tickets. I claimed #1114 to add the handler to extras.

Discourse will let you have multiple drafts across different topics (I just learned this… you can find drafts under your user profile), but prevents multiple drafts on the same topic (probably a good thing). You should be able to reply as many times as you like (via Reply button) on the same topic.

Thanks @burke for those responses, sounds good.

When I tried to post, Discourse told me that I wasn’t allowed, as I had made the previous 3 posts in a row on the topic, and I needed to wait to post a 4th time until after someone else had a chance to weigh in :slight_smile: So I just went back and edited my previous post.

LOL. It’s almost like they’re trying to promote… discourse. :wink:

@burke (and @mogoodrich and @ball) :

The testing I went through yesterday essentially involved:

a) Setting up a DB with (what I understand to be) the starter PIH concepts that were provided to @burke to import into OCL. I used a file that was provided to me named “pih-concepts-db-20210908.sql.gz”.

b) Setting up an empty DB and then using the OCL subscription module to populate this with an export from OCL staging containing (what I understand to be) the collection of concepts imported into OCL from (a).

The issues I posted above are from my analysis of diffs between just the concept table of these 2 instances. Analysis of reference terms, names, etc. will follow.

Beyond the issues I described, I found that there are 17 concepts that exist in database (a) and don’t exist in database (b). These are:

  • 160593AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,Patient’s family history list
  • 123501AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,Urinary Cast, Hyaline
  • 160159AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,HIV resulting in infectious or parasitic disease
  • 3ccc9340-26fe-102b-80cb-0017a47871b2,Pregnancy
  • 3f551a17-eefd-4806-89e1-25cddbb0b75e,Torsion of Ovary
  • 70057AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,Abacavir sulfate
  • 1d549146-e477-4dcc-9716-11fe4d1cad68,Triage green
  • 34d06871-15cc-4423-bf79-15168789104a,Sodium
  • 3cd28732-26fe-102b-80cb-0017a47871b2,Negative
  • 70763694-61c5-447f-abc3-91f144bfcc0b,Yellow
  • 762ecf40-3065-47aa-93c3-15372d98d393,Triage red
  • 80f4496d-6116-4a11-b6c4-f692c19e15b1,Protein
  • b9014bc8-5f14-4e71-87aa-a63a6a26b72e,Case status
  • 3ccf43b0-26fe-102b-80cb-0017a47871b2,Urinalysis
  • 1298AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,Very low density lipoprotein (mmol/L)
  • 163594AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA,Glucose tolerance test
  • 3cd68fa8-26fe-102b-80cb-0017a47871b2,Low-density Lipoprotein Cholesterol (mmol/L)
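
The comparison itself can be sketched as a set difference on UUID, assuming each database's concepts have been exported as (uuid, name) pairs. The function name is hypothetical, and the third row below is a placeholder rather than a real concept.

```python
def missing_concepts(original, imported):
    """Return (uuid, name) pairs present in `original` but absent from `imported`."""
    imported_uuids = {uuid for uuid, _name in imported}
    return sorted(
        (uuid, name) for uuid, name in original if uuid not in imported_uuids
    )


db_a = [
    ("34d06871-15cc-4423-bf79-15168789104a", "Sodium"),
    ("80f4496d-6116-4a11-b6c4-f692c19e15b1", "Protein"),
    ("00000000-0000-0000-0000-000000000001", "Placeholder concept"),
]
db_b = [
    ("00000000-0000-0000-0000-000000000001", "Placeholder concept"),
]

for uuid, name in missing_concepts(db_a, db_b):
    print(f"{uuid},{name}")
```

Matching on UUID rather than name or concept ID avoids false differences when names were edited or IDs were reassigned during import.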

Now, it’s possible that I am using the wrong starter database SQL. And it’s possible that there were issues importing this into OCL that I’m not aware of. And it’s possible that I made some mistakes in my analysis. Would you please:

  • Confirm that the initial database that I’m comparing against is the right one?
  • Check to see if anything might have failed to correctly import with the import tools that would explain this?

Looking in OCL staging, I see a “Negative” concept for the “PIH” owner/source but not for PIH-temp (though I may not be searching OCL right). In the OpenMRS dictionary manager, I don’t see a “Negative” concept in PIH-temp. This indicates to me that maybe there was an issue importing some of the concepts into OCL.

@burke - would you mind having a look?


@mseaton That’s the last file I created from the PIH EMR concept dictionary. I can’t find the email/attachment/Slack message I must have sent to Burke with the 8 Sept SQL dump file. It was created by me on my local SDK instance and I have the original.

It sounds like you’re using the right database dump. The md5 of the pih-concepts-db-20210908.sql (not gzipped) I got from Ellen is e0ce217886546a8d31fe5d3618deb554.
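
For anyone wanting to make the same check, a short sketch using Python's standard hashlib; the file path in the usage note is whatever local path the unzipped dump was saved to.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling md5_of_file("pih-concepts-db-20210908.sql") on the unzipped file and comparing the result with e0ce217886546a8d31fe5d3618deb554 confirms whether two people hold the same dump.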

I noticed a slight discrepancy in the counts when I loaded the data from Ellen into OCL, but hadn’t had the time to dig into the cause. I didn’t worry about it, since my priority was getting a dictionary into OCL against which Mark could start exploring the subscription module. This was PIH-temp after all, and I figured we could sort these differences out later. :slight_smile:

Looking at the first one, a clue comes from searching the PIH dictionary for concept ID #43. I get two results: Pneumonia and Patient’s family history list.

The same problem appears for concept #28… it’s both HIV resulting in infectious or parasitic disease and Hepatitis C virus infection.

For Ellen’s data dumps, I use a SAME-AS mapping to a PIH code (whole number) to signify the official “gold” ID for a concept within OCL. If there isn’t a gold mapping, then I just use the concept ID. It’s possible that these are cases where there was one concept without a gold mapping and another with a gold mapping both pointing to the same concept ID.
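
To make the suspected collision concrete, here is a sketch of that rule and of how two concepts can end up with the same OCL ID. The data structures and function names are hypothetical, and which of the two concepts actually lacks the gold mapping is an assumption made only for the example.

```python
from collections import defaultdict


def resolve_ocl_id(concept):
    """Use the SAME-AS PIH code if present, else fall back to the concept ID."""
    for mapping in concept.get("mappings", []):
        if (
            mapping["map_type"] == "SAME-AS"
            and mapping["source"] == "PIH"
            and mapping["code"].isdigit()
        ):
            return mapping["code"]
    return str(concept["concept_id"])


def find_collisions(concepts):
    """Group concept names by resolved ID and keep IDs claimed more than once."""
    names_by_id = defaultdict(list)
    for concept in concepts:
        names_by_id[resolve_ocl_id(concept)].append(concept["name"])
    return {ocl_id: names for ocl_id, names in names_by_id.items() if len(names) > 1}


concepts = [
    # Gold mapping says 43 (the local concept_id here is made up).
    {"concept_id": 900, "name": "Pneumonia",
     "mappings": [{"map_type": "SAME-AS", "source": "PIH", "code": "43"}]},
    # No gold mapping, so the local concept_id 43 is used instead.
    {"concept_id": 43, "name": "Patient's family history list", "mappings": []},
]

print(find_collisions(concepts))
# {'43': ['Pneumonia', "Patient's family history list"]}
```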

I thought that might be the case for all of these; however, I could find all of these in OCL and it looked like there was only one instance of each in the PIH dictionary:

But then I thought I’d check to see if these concepts were in the PIH-temp source, but not in the collection on OCL. I couldn’t find LDL as #1008 in the collection. But it is there as #12992. What? I thought this might be a problem with the collection creation, but then I noticed that LDL exists both as #1008 (uuid 3cd68fa8-26fe-102b-80cb-0017a47871b2) and #12992 (uuid ec10a67f-913f-4a62-a0ed-43fb335ff5af) within the PIH data I got from Ellen.

That’s as far as I’ve gotten.

Thanks @burke .

This matches what I have, so we can rule out a different starter database as the issue.

Makes sense. I think later is now.

You can never fall back to using the concept ID. You should just fail the import, and tell us to fix our mappings. Or tell us to give you a DB that has correct concept IDs in it. Using concept IDs from what we’ve given you will never be correct, as the DB that @ball is giving you is not an extract from our gold dictionary DB, but an extract from a throwaway SDK DB that has concepts imported into it from MDS packages generated on the gold dictionary. I’d much rather the import fail than have incorrect concept mappings.
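
A sketch of the stricter behaviour being asked for here: resolve IDs only from the numeric SAME-AS PIH mapping, and fail with a list of offending concepts otherwise. The exception class, function name, and data shapes are hypothetical.

```python
class MissingGoldMapping(Exception):
    """Raised when concepts lack exactly one numeric SAME-AS PIH mapping."""


def resolve_gold_ids(concepts):
    """Map concept UUID -> gold PIH code, never falling back to concept_id."""
    resolved, problems = {}, []
    for concept in concepts:
        codes = [
            m["code"]
            for m in concept.get("mappings", [])
            if m["map_type"] == "SAME-AS"
            and m["source"] == "PIH"
            and m["code"].isdigit()
        ]
        if len(codes) == 1:
            resolved[concept["uuid"]] = codes[0]
        else:
            problems.append(concept["uuid"])
    if problems:
        raise MissingGoldMapping(
            "fix mappings before import: " + ", ".join(problems)
        )
    return resolved


ok = [{"uuid": "u-1", "mappings": [
    {"map_type": "SAME-AS", "source": "PIH", "code": "43"}]}]
print(resolve_gold_ids(ok))  # {'u-1': '43'}
```

Failing up front trades convenience for correctness: nothing is imported until every concept carries an unambiguous gold ID.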

Yes, both of those concepts exist, but they also exist in CIEL. See: #166045 and #1008. Regardless, even if we may feel like both of these 2 concepts shouldn’t exist, they both do exist, and both should be in the source and in the collection.

From my analysis above, it seems most of the issues are that the concepts are imported but are not correctly added to the Collection for some reason. There are likely other issues as well, but I think this covers the majority.

It would be really nice to be able to start from scratch with a proper “PIH” source and see how things go. How straightforward would it be for me to set up a local instance of OCL, with all of the existing sources (CIEL, etc) in order to test the import tool myself? How would I get the starter data I would need for that? Is that something straightforward to do using Docker?

Thanks, Mike

Okay. This will force me to make the import behave differently for PIH vs. CIEL, since CIEL is a “gold” database (i.e., concept IDs are the actual concept codes).

Ah. I see. I think this is a red herring. They are different concepts for different units: one for mg/dL and the other for mmol/L.

Agreed. I’ll try making another collection of the PIH-temp source and do a comparison to see if I can figure out a pattern of what’s missing.

While you can fire up oclapi2 with docker compose up (the default user is ocladmin with authentication token 891b4b17feab99f3ff7e5b5d04ccc5da7aa96da6), it’s empty and I’m not sure how easily you could use it locally. I tried and was able to add a PIH org, but got an error trying to create a PIH source (see payload below) and again when trying to upload the PIH source, so I gave up.

We’re very close to having #661 resolved, so we’ll be able to delete the PIH source and use that instead of PIH-temp.

PIH source payload
{
    "id": "PIH",
    "custom_validation_schema": "OpenMRS",
    "default_locale": "en",
    "description": "Partners In Health Dictionary",
    "full_name": "Partners In Health",
    "name": "PIH",
    "owner": "PIH",
    "owner_type": "Organization",
    "owner_url": "/orgs/PIH/",
    "public_access": "None",
    "short_code": "PIH",
    "source_type": "Interface Terminology",
    "supported_locales": [
        "en",
        "am",
        "bn",
        "es",
        "fr",
        "ht",
        "id",
        "it",
        "nl",
        "pt",
        "ru",
        "rw",
        "sw",
        "ti",
        "ur",
        "vi"
    ],
    "type": "Source"
}
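
For reference, a sketch of how a payload like the one above might be sent to a local instance. The base URL (http://localhost:8000), the /orgs/PIH/sources/ endpoint, and the "Token" authorization scheme are assumptions about a default oclapi2 setup rather than something confirmed in this thread, and the function name is hypothetical. The block only builds the request; nothing is sent.

```python
import json
import urllib.request


def build_create_source_request(payload, token, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request that would create a source."""
    url = f"{base_url}/orgs/{payload['owner']}/sources/"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Abbreviated version of the payload above.
payload = {
    "id": "PIH",
    "name": "PIH",
    "owner": "PIH",
    "owner_type": "Organization",
    "custom_validation_schema": "OpenMRS",
    "default_locale": "en",
    "source_type": "Interface Terminology",
}

request = build_create_source_request(
    payload, token="891b4b17feab99f3ff7e5b5d04ccc5da7aa96da6"
)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would actually send it; the response body on failure is where the error mentioned above would show up.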

@burke I think the next step should be for us to generate a new concept export for you to use and to try the import again with a source that has a complete set of PIH numeric mappings. I’ll work with @ball on this.

The PIH concept dictionary snapshot from Sept 2021 had numeric mappings on all concepts. Or should have had them… Is there a concept where that is missing?

@mseaton ?

Yes, by my analysis there are 58 concepts that are missing numeric “PIH” mappings in that provided database snapshot. I’ve detailed this out in another message internally @ball . We can hash it out there.