Building conventions for Concept IDs in OCL

burke · October 6, 2021, 1:55pm

Based in part on this conversation, we need a strategy & consensus on how we identify concepts within OCL and as we pull them into OpenMRS. Up until the PIH dictionary import, we’ve been using the OpenMRS Concept ID when importing into OCL and storing concept UUIDs as an “external identifier” within OCL.

I tried to list some of the key questions/problems we’re facing and potential solutions below. I would love to hear others’ opinions on these issues, or on issues that I’ve overlooked.

For the sake of discussion, here are some definitions:

| Term | Description |
| --- | --- |
| `Concept ID` | Internal identifiers of concepts in OpenMRS. Have also been used as `Code`, but that gets tricky when using multiple servers, since these internal identifiers can vary across multiple servers. |
| `UUID` | True Version 4 (random) UUIDs allow creation of universally unique identifiers without a single authority. They are difficult for humans to use (directly editing form or report definitions with UUIDs is painful). |
| `Code` | The “official” identifier of a concept within a terminology or dictionary (sometimes called a “Gold” Concept ID) – i.e., the code you would use if mapping to the concept like LOINC’s `14682-9`, CIEL’s `790`, or PIH’s `790`. In OpenMRS, we have been using `Concept ID` as a `Code`; however, this becomes harder when you grow beyond a single server. Some implementations (e.g., PIH) have used SAME-AS mappings to declare a `Code` for their concepts. |

The requirements

Each concept in a dictionary within OCL needs a Code so it can be mapped or referenced (from a form, report, module, or from other dictionaries).
Implementers need to be able to clone or import content from other dictionaries into their own.
Implementers need to be able to use CIEL concepts within their local dictionary.

How should we declare the `Code` for concepts? Burke has long advocated for adding `Code` directly to concepts (e.g., `concept.code` in the database) as a unique `Code` to be separated from `Concept ID` (the database’s internal id).

Existing workarounds for managing an official Code for concepts when implementations grow beyond a single server:

Use a “Gold” OpenMRS server to manage the official dictionary, where the Concept ID in this server is considered the official Code.
Create a SAME-AS mapping to your own dictionary – e.g., a “Gold” mapping – to declare the Code for a concept by effectively mapping it to itself.
Maintain a list of official concepts separately (e.g., in a spreadsheet) and use custom scripting or the Initializer module to update concepts.

Proposal: Add concept.code to our data model and refactor any code checking mappings to treat concept.code as an implicit “gold” mapping (equivalent to having SAME-AS mapping to implementation’s dictionary). In the meantime, use gold mappings (i.e., only allow one SAME-AS mapping to the implementation’s dictionary and treat this as the “official” Code for the concept).

If you import concepts from different sources into your dictionary, what should happen if two or more concepts being imported have the same code, given that codes in a dictionary must be unique?

One could argue that concept IDs don’t matter and we can just use UUIDs to reference concepts; however, UUIDs are not human-friendly. There’s a reason standard terminologies use human-friendly codes instead of UUIDs to identify their terms. When we first introduced UUIDs in OpenMRS, implementations tried using them everywhere and found mappings (using source + Code) were far preferable in situations where humans needed to work with them.

Proposal: if you are importing concepts with a conflicting Code, you must provide a new, unique Code for your dictionary before the concept can be imported.
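
To make the rule concrete, here is a minimal Python sketch, assuming a dictionary can be modelled as a plain mapping from Code to concept name. The function `import_concepts` and its `overrides` parameter are hypothetical names for illustration; they are not part of OCL or OpenMRS.

```python
def import_concepts(dictionary, incoming, overrides=None):
    """Import (code, name) pairs into `dictionary`, a dict keyed by Code.

    A concept whose Code is already taken is imported only when `overrides`
    supplies a replacement Code that is itself unused. Anything else is
    returned as a conflict and left out of the dictionary.
    """
    overrides = overrides or {}
    conflicts = []
    for code, name in incoming:
        target = code
        if target in dictionary:
            # Conflict: the importer must have supplied a new, unique Code.
            target = overrides.get(code)
        if target is None or target in dictionary:
            conflicts.append((code, name))
            continue
        dictionary[target] = name
    return conflicts


# Hypothetical data: one local concept already uses Code "790".
local = {"790": "local concept"}
pending = import_concepts(local, [("790", "imported concept")])
# `pending` now lists the concept that still needs a new Code.
resolved = import_concepts(local, pending, overrides={"790": "790-A"})
```

The second call succeeds because the importer supplied an unused Code, which is the step the proposal would make mandatory.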

What should the UUID be when a concept is cloned? For example, if PIH is using a CIEL concept with a few non-breaking changes, can they still refer to it via the same UUID or should it get a new UUID?

Technically speaking, all concepts should have universally unique UUIDs. In practice, when implementations use a CIEL concept, they have copied the concept into their system and may continue to refer to it by CIEL’s UUID even if they make some non-breaking adjustments to the concept locally.

Proposal: All concepts should have a unique UUID. If an implementation is using a CIEL concept, they can use its UUID; however, if they are going to make any change to the concept, then that altered concept (even if only non-breaking changes) should have its own UUID. Any references that want to refer to either the CIEL concept or a locally modified version of that concept should use a mapping (not UUID).

Do codes need to be numeric? Concept IDs are integers, but OCL doesn’t require a code to be a number and many terminologies (e.g., LOINC, ICD, etc.) use codes that aren’t just whole numbers.

In general, Code does not need to be numeric. Both OCL IDs and OpenMRS mapping codes allow for non-numeric values. The only constraint is in cases where Concept ID is being used as the Code (which is the default practice for implementations with a single server).

Proposal: Add concept.code to our data model and refactor any code checking mappings to treat concept.code as an implicit “gold” mapping (equivalent to having SAME-AS mapping to implementation’s dictionary). In the meantime, use gold mappings (i.e., only allow one SAME-AS mapping to the implementation’s dictionary and treat this as the “official” Code for the concept).

ibacher · October 6, 2021, 3:06pm

So, I’m happy with most of this, but I would question why we need to add a new field to the concept table to handle this. Could we not simply handle this using a specific concept mapping type (e.g. CANONICAL-REFERENCE or something)?

I’m happy with treating the SAME-AS mapping as a way of referencing the concept, but restricting each concept to a single SAME-AS mapping seems like an unnecessary breaking change, e.g., for purposes of interfacing with an external system, I might want to say that not only is my concept “CIEL:119481” for the purposes of this server, but it also is “SNOMED:73211009” and “ICD-10:E14.9”.

I’m not quite persuaded by this. I think my hang-up is that we end up with metadata both asserting that the concept is and isn’t the same as the concept it was cloned from and I’m just unclear on what the relevant distinction is.

More practically, my concern here is that while some form technologies (e.g., HTMLFormEntry) allow us to refer to concepts in multiple possible ways, others like AMPATH forms and Bahmni Forms 2 seem to refer to concepts by UUID only. That means that if we end up with “the same concept” under two different UUIDs, forms aren’t as easily sharable as they could be.

darius · October 6, 2021, 5:42pm

Perhaps we should go further and say that concept.code must actually have a namespace (or authority) and a code.

This allows a server that’s not the authoritative concept server for the implementation to directly import a CIEL concept, without needing to resolve an authoritative implementation code.

Is there some reason that concept.code must be in the implementation’s namespace?

(This maps somewhat to Google API guidance for “sometimes it is necessary for services to refer to resources in an arbitrary API [vs used in contexts where the owning API is clear]”: AIP-122: Resource names)

burke · October 6, 2021, 6:37pm

An attribute on the concept table would enforce uniqueness and only one per concept. Using a mapping works, but would require additional work to enforce uniqueness and avoid multiples for this special case anywhere mappings can be created.

To be clear, I didn’t mean to suggest there could only be one SAME-AS mapping; rather, there can only be one gold mapping (i.e., a SAME-AS mapping to the dictionary itself representing the Code for the concept).

The UUID of a resource should always be the universally unique ID to that specific resource. While it might be possible to get a decorated copy of a resource (e.g., the same resource but with changes layered on it), if you need to refer to that thing as a resource (i.e., the “adapted” concept), then it needs its own UUID.

Interesting point. At Regenstrief, they started in the 1970s with a single dictionary (everything assumed to be in a single namespace) and ended up doing exactly what you suggest by making the concept table essentially id (internal database id), source (authority), and code where source = 1 was used for the local concepts. The original dictionary then became just one reference table (the “local dictionary” with all the concepts for source = 1) and other terminologies (LOINC, SNOMED, etc.) could be loaded into their own tables and similarly referenced from the new concept table. As you suggest, this allowed observations (pointing to the new concept table) to reference concepts from non-local dictionaries without having to create a local concept.

The primary reason would be to avoid forcing all terminologies into the same model/schema. If we want to support other authorities, I’d favor Regenstrief’s approach, where the metadata for all terminologies isn’t forced into one table. But the primary impetus for introducing concept.code isn’t to solve this problem; rather, to stop conflating our Code with the database’s internal ID.

mseaton · October 7, 2021, 12:42pm

If I am understanding what you mean, you are saying that if in the PIH dictionary we have a “PIH” source that we use for our “gold mappings”, we would only be permitted to have a single SAME-AS mapping to this source for a given Concept? Right now we often have 2 SAME-AS mappings to the PIH source on a given Concept - one that is populated with the concept id, and the other with a human-friendly name (e.g., “PIH:832” and “PIH:WEIGHT LOSS”). You’re saying this would not be allowed, and we’d need to stick with just “PIH:832”? We could probably make this adjustment on our end with some refactoring, but it would take some effort. Can we be clear on the reasoning and need for this?

I like this idea if we had done this from the start (and I’d love to see this exist on all of our metadata as a human-manageable alternative to how we currently use uuid and something that could serve as a nicer “key” for message code lookups), but I also tend to agree with @ibacher that this doesn’t really gain us much beyond what we already have with same-as reference terms. Especially if we are talking about adding a namespace/authority as @darius suggests - isn’t this the equivalent of a SAME-AS term in a given Source? My gut feeling is to build on concept reference terms in the short/medium term, and look at adding code longer term if this solution is insufficient, as there is a lot of dependent code and tooling out there right now that is already (successfully) using mappings for this purpose.

Once we establish something on the Concept as its identifier, then I would expect that we would treat it that way like we do today with “uuid” and “same-as” mappings to known sources - i.e., importing this concept would treat it as an update to the existing concept in your dictionary that matches this code rather than a new concept. Right now in tools like Initializer and MetadataSharing, we typically look at these to determine how to match existing Concepts in a dictionary with incoming Concepts in an import.

I don’t really see how this is practical to be honest. One quite often doesn’t know at the time of importing a concept from CIEL/OCL that they will need to make changes to it. Let’s say someone imports a Concept, and uses it happily for some time, recording Obs and such. Then later on they decide they need to make a subtle adjustment to the concept name, or add additional names or mappings, or add answers or set members. We wouldn’t expect them to create a new concept for this. Am I misunderstanding the idea here?

Very interested in further discussions on this.

ibacher · October 7, 2021, 1:10pm

I think what Burke is thinking of is specifically

Which is a feature we’re looking at adding to the dictionary manager, i.e., to clone a concept from one source to another, rather than something applied at the OpenMRS level specifically. That said, it does have implications for how we manage the relationships between UUIDs and customised concepts.

darius · October 8, 2021, 6:09am

In the long run concept.code (or unique_code or uid) is a good thing. I.e. an opaque string, implementation-managed, that uniquely defines a piece of metadata in whatever its implied scope is. (And having an explicit namespace/authority in the data model isn’t actually required.)

I agree with @mseaton that the “gold mapping” already solves the exact problem, and tooling supports it, and will continue to.

If that approach is too difficult/tedious for most implementations, then maybe improve the dictionary management tools to automate it?

burke · October 8, 2021, 3:34pm

You’re right and I wouldn’t want to prevent the use of human-friendly SAME-AS mappings. For the PIH dictionary import into OCL, I detected gold mappings as SAME-AS mappings with a numeric code; however, it feels like a hack. So, to clarify, we need a way to uniquely & reliably identify the Code for each concept.

I’m suggesting we try to find a way to avoid re-using the same UUID to represent two different things. As long as an implementation is using a CIEL concept, the UUID can remain the same; however, I think we would all agree that breaking changes at any point should either require making a new concept or – at a minimum – changing the UUID. I was being provocative with the case of non-breaking changes, suggesting that, if we want to be able to refer to a resource independently from the original, then it deserves a distinct UUID.

I’m approaching this from the context of implementations managing their content in OCL and exporting/subscribing within OpenMRS. If you clone a CIEL concept, should we assign a new UUID to the clone as it’s created? Change the UUID only when changes are made to it? Or change the UUID only if breaking changes are made to it? If someone fetches PIH’s version of a CIEL concept, should they get the same UUID as if they went to CIEL directly? This might warrant a separate discussion thread.

Brainstorming here… within OCL the Code is easy: it’s the concept ID. The trick is how do we represent this within OpenMRS both for import & export. My initial thought was the Dictionary Manager could create and enforce that every concept has a SAME-AS mapping to its OCL ID (either hiding it from users or helping users understand why it’s there and uneditable). But, given there can be multiple SAME-AS mappings to itself (i.e., human-friendly mapping codes as @mseaton described), how do we identify the Code to use for the OCL Concept ID when importing?

| Approach | Comments |
| --- | --- |
| Introduce a new `CODE` map type | Tooling within OpenMRS would need to be refactored to treat this as a SAME-AS mapping. |
| Treat any numeric SAME-AS mapping to same source as a `Code` | This could work immediately (it’s what I’m doing for the PIH import); however, I worry that we’ll run into cases where we want to support `Code`s that aren’t simply whole numbers (sooner than we expect), which would break this approach. |
| Other approaches? | |
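
The two approaches listed above can be compared in a short Python sketch. It assumes a concept's mappings are available as simple (map type, source, code) records; the `Mapping` class and the `gold_code` function are hypothetical names for illustration, not the actual import script.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mapping:
    map_type: str  # e.g. "SAME-AS", or a new "CODE" map type
    source: str    # e.g. "PIH"
    code: str      # e.g. "832" or "WEIGHT LOSS"


def gold_code(mappings, local_source, map_type="SAME-AS", numeric_only=True):
    """Return the single gold Code among a concept's mappings.

    Defaults model the numeric SAME-AS heuristic. Passing map_type="CODE"
    and numeric_only=False models a dedicated map type instead.
    """
    candidates = [
        m.code
        for m in mappings
        if m.map_type == map_type
        and m.source == local_source
        and (not numeric_only or (m.code.isascii() and m.code.isdigit()))
    ]
    if len(candidates) != 1:
        raise ValueError(
            f"expected exactly one gold mapping, found {len(candidates)}"
        )
    return candidates[0]


mappings = [
    Mapping("SAME-AS", "PIH", "832"),
    Mapping("SAME-AS", "PIH", "WEIGHT LOSS"),
]
code = gold_code(mappings, "PIH")  # picks the numeric mapping
```

The numeric filter is what lets the human-friendly `PIH:WEIGHT LOSS` mapping coexist with the gold mapping, and it is also the part that stops working once a `Code` is not a whole number.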

As for evolving toward a concept table like concept_id (internal) + source + code, the most pragmatic & backwards-compatible approach might be to introduce source & code into our existing concept table (as @darius suggested) and slowly start pulling out parts of the model specific to the local dictionary into new table(s).

mseaton · October 8, 2021, 4:23pm

It definitely seems sensible to say, if a Concept is cloned into a new Concept, that it gets a new, distinct UUID. If this Concept needs to indicate its relationship to the original Concept, it can do so via mappings (typically by having a SAME-AS CIEL mapping). So a Concept with the same UUID should be considered the exact same Concept, and one with a different UUID but a Mapping should be considered a reference to the Concept. In the past, we’ve largely tried to preserve CIEL UUIDs on our Concepts in our gold PIH dictionary (at much hassle and difficulty at times) largely because of a concern that if we did not maintain the UUID that various tools and processes may not work as easily as intended.
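
As a rough illustration of that rule, the Python sketch below clones a concept, gives the clone a fresh random UUID, and records the link back to the original as a SAME-AS mapping. The `Concept` class and `clone_concept` function are hypothetical and much simpler than the real OCL or OpenMRS models.

```python
from dataclasses import dataclass, field
from uuid import uuid4


@dataclass
class Concept:
    uuid: str
    source: str
    code: str
    name: str
    # Each mapping is a (map_type, source, code) tuple.
    mappings: list = field(default_factory=list)


def clone_concept(original, target_source, target_code):
    """Copy `original` into another source under a new UUID.

    The clone keeps the original's mappings and gains a SAME-AS mapping
    pointing at the concept it was cloned from.
    """
    return Concept(
        uuid=str(uuid4()),  # the clone never reuses the original UUID
        source=target_source,
        code=target_code,
        name=original.name,
        mappings=list(original.mappings)
        + [("SAME-AS", original.source, original.code)],
    )


# Hypothetical codes, for illustration only.
ciel = Concept(uuid=str(uuid4()), source="CIEL", code="123", name="Example")
local = clone_concept(ciel, "PIH", "456")
```

Anything that needs to match either the original or the clone would then look up the `SAME-AS CIEL:123` mapping rather than a UUID.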

This makes sense to me, and honestly if we were to have this I could see us simply updating all of our numeric PIH mappings to instead use the “CODE” rather than the “SAME-AS” map type. I honestly don’t think most of the code that looks up Concepts by mapping even checks the map type, but those would be easy enough to change to incorporate a new map type regardless.

I’d think it might also make sense to add a new boolean property to ConceptMapType to indicate whether validation should prevent adding multiple mappings between a given source and concept for that map type. But if we’d rather hard-code against a specific, known, system-created map type that uniquely requires this, that’s likely fine too.
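
A small Python sketch of what such a flag could look like. The `unique_per_source` property and the `validate_new_mapping` function are invented here for illustration; the existing ConceptMapType in OpenMRS has no such property today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapType:
    name: str
    # Hypothetical flag: allow at most one mapping of this type
    # per (concept, source) pair.
    unique_per_source: bool = False


def validate_new_mapping(existing, map_type, source):
    """Raise ValueError if adding the mapping would break uniqueness.

    `existing` is the list of (map_type_name, source) pairs that the
    concept already carries.
    """
    if map_type.unique_per_source and (map_type.name, source) in existing:
        raise ValueError(
            f"concept already has a {map_type.name} mapping to {source}"
        )


SAME_AS = MapType("SAME-AS")
CODE = MapType("CODE", unique_per_source=True)

existing = [("SAME-AS", "PIH"), ("CODE", "PIH")]
validate_new_mapping(existing, SAME_AS, "PIH")  # allowed: not unique
```

With this shape, a second `CODE` mapping to PIH would be rejected, while additional SAME-AS mappings stay legal.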

darius · October 8, 2021, 4:39pm

Logically, every OpenMRS instance’s concept dictionary has some authoritative source managing it. A one-server implementation is its own authority. PIH has a central concept server which is the authority. Once the tooling is there, the authority could be OCL. But this is all implicit, and AFAIK no OpenMRS code is actually built around this idea.

Suggestion:

Introduce a system setting in OpenMRS which indicates which is the authoritative concept source. If unset, then the implied behavior is “local dictionary”. For PIH this is the “PIH” source.
Add first-class support for this in OpenMRS UI tooling, and eventually in OpenMRS core.
- there should be a single API method, and a simple UI gesture, for getting a concept given a code in the authoritative source, or setting that code on a concept.
- mappings against this source could have extra validation (e.g. they only support SAME-AS mappings, and must be unique)
Eventually you could introduce concept.code and it takes over this behavior.
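
To sketch the suggestion in Python: a single service reads the authoritative source from a setting and exposes one method to get a concept by code and one to set the code. The class `ConceptCodeService`, its method names, and the fallback name `"LOCAL"` are assumptions for illustration, not existing OpenMRS APIs.

```python
class ConceptCodeService:
    """Get or set a concept's code in the authoritative concept source."""

    def __init__(self, authoritative_source=None):
        # An unset setting implies "local dictionary"; "LOCAL" is a
        # placeholder name for that implied source.
        self.authoritative_source = authoritative_source or "LOCAL"
        self._concept_by_code = {}
        self._code_by_concept = {}

    def set_code(self, concept_id, code):
        """Assign `code` to the concept, replacing any previous code."""
        owner = self._concept_by_code.get(code)
        if owner is not None and owner != concept_id:
            raise ValueError(
                f"{self.authoritative_source}:{code} is already in use"
            )
        previous = self._code_by_concept.pop(concept_id, None)
        if previous is not None:
            del self._concept_by_code[previous]
        self._concept_by_code[code] = concept_id
        self._code_by_concept[concept_id] = code

    def get_concept(self, code):
        """Return the concept id for `code`, or None if it is unknown."""
        return self._concept_by_code.get(code)


service = ConceptCodeService(authoritative_source="PIH")
service.set_code(concept_id=10, code="832")
```

Because callers only see `get_concept` and `set_code`, the storage behind them could later move from mappings to a `concept.code` column without changing calling code.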

burke · October 9, 2021, 12:57am

I was thinking we could introduce concept.code and then make code that checks for mappings to treat this like a SAME-AS mapping, but agree it would likely work as-is if we switched SAME-AS mappings to CODE mappings.

It would be nice if we didn’t have to introduce these to OCL (just use OCL’s concept ID as it’s designed). When importing into OCL, we’d use the CODE mapping as the concept ID and when importing from OCL into OpenMRS, we wouldn’t worry about OpenMRS concept ID assignment and would create a CODE mapping from OCL’s concept ID. Wouldn’t the same work for iniz CSV files?

Within Regenstrief’s system, the source = 1 was the local dictionary. We could avoid assuming a magic value for the local source ID, but I wouldn’t want to introduce an additional join for every concept-related query. A global property containing the concept_source_id of the local source could be read into memory and used like a constant within local concept queries. By default we could bump SNOMED-CT further down the list and ship with concept source 1 as “Local dictionary” with the HL7 code L for local dictionary. Implementations could edit this concept source to match their local authority or change the global property.

darius · October 12, 2021, 4:48pm

I think you interpreted something stronger than I intended to suggest. This is orthogonal to having a dictionary with multiple naming authorities.

What I meant to suggest is:

Having a single “code” is good, but there’s a lot of concept tooling and process out there already, so change management is hard.
Therefore: introduce APIs for “code” first, before having support for it in the data model, and push tooling and processes to use these new APIs.
Implement via the strategy pattern:
- Initially, implement the new API using the gold mapping approach.
- Then, introduce a code field in the data model, and migrate the implementation to use this.
- Users who care a lot can control the timing of switching between implementations.
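
The strategy-pattern idea above could look roughly like this in Python. Concepts are modelled as plain dicts, and every class name here (`CodeStrategy`, `GoldMappingStrategy`, `CodeFieldStrategy`, `ConceptCodeApi`) is hypothetical; this is a sketch of the pattern, not a proposed OpenMRS API.

```python
from abc import ABC, abstractmethod


class CodeStrategy(ABC):
    @abstractmethod
    def get_code(self, concept):
        """Return the concept's Code, or None if it has none."""


class GoldMappingStrategy(CodeStrategy):
    """First implementation: read the Code from a gold mapping."""

    def __init__(self, local_source):
        self.local_source = local_source

    def get_code(self, concept):
        codes = [
            code
            for map_type, source, code in concept.get("mappings", [])
            if map_type == "SAME-AS"
            and source == self.local_source
            and code.isdigit()
        ]
        return codes[0] if len(codes) == 1 else None


class CodeFieldStrategy(CodeStrategy):
    """Later implementation: read the Code from a field on the concept."""

    def get_code(self, concept):
        return concept.get("code")


class ConceptCodeApi:
    """The API that tooling calls; the strategy can be swapped later."""

    def __init__(self, strategy):
        self.strategy = strategy

    def get_code(self, concept):
        return self.strategy.get_code(concept)


concept = {"code": "832", "mappings": [("SAME-AS", "PIH", "832")]}
api = ConceptCodeApi(GoldMappingStrategy("PIH"))
api.strategy = CodeFieldStrategy()  # switch when the implementation is ready
```

Tooling written against `ConceptCodeApi` would not need to change when an implementation decides to switch strategies.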

burke · November 17, 2021, 3:29pm

So, while introducing a new CODE map type might be a more robust way to identify gold mappings (allowing gold mappings to any code, not just numeric codes), the reality is gold mappings are always for OpenMRS concept IDs which are, by definition, always whole numbers. So, treating any self-referencing SAME-AS mapping as a gold mapping can work for now and avoids the work of introducing a new map type for gold mappings.

The way forward:

Use self-referencing SAME-AS mapping to a numeric code for gold mappings – i.e., within the PIH dictionary, a concept with a mapping SAME-AS PIH:123 means it is the official concept 123 for PIH. We adapted the import script to either use concept ID (e.g., for CIEL imports) or to use these gold mappings and, when using gold mappings, each concept will be required to have exactly one gold mapping with a unique code.
Migrate SNOMED CT concept source to a different internal id within our default data and reserve concept_source_id = 1 for a new “LOCAL” concept source.
Introduce source (FK to concept source) and code (varchar 255) to the concept table with a unique constraint on source + code.
Over time, we migrate away from using concept ID alone to reference concepts and toward using source + code. Where it’s needed and for backwards compatibility, the internal concept ID can be used, but eventually all concept references would expect source + code and only assume concepts from the local source are OpenMRS concepts.
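
For anyone who wants to see the source + code constraint in action, here is a minimal sketch using Python's built-in `sqlite3` module. The tables are heavily simplified stand-ins for the real OpenMRS schema, and the helper names `make_db`, `add_concept`, and `find_concept` are hypothetical.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a simplified concept table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE concept_source (
            concept_source_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL
        );
        CREATE TABLE concept (
            concept_id INTEGER PRIMARY KEY,  -- internal id only
            source INTEGER NOT NULL
                REFERENCES concept_source (concept_source_id),
            code VARCHAR(255) NOT NULL,
            UNIQUE (source, code)
        );
        -- concept_source_id = 1 is reserved for the local source.
        INSERT INTO concept_source VALUES (1, 'LOCAL');
        INSERT INTO concept_source VALUES (2, 'CIEL');
        """
    )
    return conn


def add_concept(conn, source, code):
    """Insert a concept and return its internal concept_id."""
    cur = conn.execute(
        "INSERT INTO concept (source, code) VALUES (?, ?)", (source, code)
    )
    return cur.lastrowid


def find_concept(conn, source, code):
    """Look a concept up by source + code instead of internal id."""
    row = conn.execute(
        "SELECT concept_id FROM concept WHERE source = ? AND code = ?",
        (source, code),
    ).fetchone()
    return row[0] if row else None


conn = make_db()
local_id = add_concept(conn, 1, "123")
ciel_id = add_concept(conn, 2, "123")  # same code, different source
```

A second insert of `(1, "123")` raises `sqlite3.IntegrityError`, which is the behaviour the unique constraint on source + code is meant to guarantee.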