Using patient matching module

wyclif · July 15, 2020, 5:46am

Hi all,

I’m attempting to use the patient matching module and it doesn’t seem to be reporting any duplicates in the generated reports even when I have matching records in the DB based on my matching strategy.

Could the issue be the settings in my patient matching configuration XML file? I’m using the default configuration that comes with the module and a very simple strategy that uses DOB and gender as the blocking fields (must match), while family and given name are used to decide the actual matches (should match).
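For reference, the strategy described above amounts to roughly the following. This is only a minimal Python sketch of the idea, not the module's code: the function names (`block_key`, `form_candidate_pairs`, `names_agree`) and the record layout are hypothetical, and the name comparison here is a plain exact match rather than the module's probabilistic scoring.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical patient records; field names are assumptions for illustration.
patients = [
    {"id": 1, "given_name": "John", "family_name": "Doe", "birthdate": "1980-01-01", "gender": "M"},
    {"id": 2, "given_name": "john", "family_name": "doe", "birthdate": "1980-01-01", "gender": "M"},
    {"id": 3, "given_name": "Jane", "family_name": "Doe", "birthdate": "1980-01-01", "gender": "F"},
    {"id": 4, "given_name": "Jon", "family_name": "Doe", "birthdate": "1980-01-01", "gender": "M"},
]


def block_key(patient):
    """Blocking fields (must match): date of birth and gender."""
    return (patient["birthdate"], patient["gender"])


def form_candidate_pairs(records):
    """Group records by blocking key and pair up records within each block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[block_key(record)].append(record)
    for members in blocks.values():
        yield from combinations(members, 2)


def names_agree(a, b):
    """'Should match' fields: case-insensitive comparison of both names."""
    return all(
        a[field].strip().lower() == b[field].strip().lower()
        for field in ("given_name", "family_name")
    )


candidates = list(form_candidate_pairs(patients))
duplicates = [(a["id"], b["id"]) for a, b in candidates if names_agree(a, b)]
```

Only records that share a blocking key are ever compared, so a typo in DOB or gender would keep two records from being paired at all.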

Those familiar with the module, what do you think I could be missing?

Thanks and regards, Wyclif

wyclif · July 19, 2020, 6:24pm

Anybody read this post?

mozzy · July 20, 2020, 7:31am

sure i did

cc @dkayiwa @mksd @burke

burke · July 22, 2020, 5:17pm

I assume you are talking about openmrs-module-patientmatching.

Based on the repo and comments on the module’s wiki page, it doesn’t look like the module has been used in many years… which is probably a big reason why you aren’t getting many responses.

The theory & under-the-hood approach to matching are derived from @sgrannis’s work, so it is undoubtedly sound; however, I don’t know if the module is being actively used by anyone.

burke · July 24, 2020, 2:57am

I went looking for when James Egg left Regenstrief (the developer who worked on the module). Here’s a conversation I found from 2016 of someone asking Shaun about the patient matching module:

When I am looking at the code, the object method and code utilized when the “Run Linkage” button is pressed seems to be fairly well separated from the GUI code. It seems to me that it could be possible to encapsulate that functionality into a separate program which could be invoked as a batch process, given the configuration file, but perhaps with some additional information written into it, or perhaps other files would need to be written to carry information from the GUI program to the batch program. The reason for doing this would be to simplify the development cycle if we were to attempt to use MPI to accelerate the run linkage process. Is there anything you are aware of which you believe would prevent us from being able to do this?

sgrannis:

The GUI is well separated from the underlying matching function calls, so they can be separated as you describe. However, I want to reiterate that it is my belief that the “Run Linkage” process typically does not invoke the sort function because the M/U values must be calculated before invoking “Run Linkage”, thus the “calculate M/U values” functions (which requires sorted data) must be run first. Consequently, I’m not clear why separating the “Run Linkage” functionality will improve performance?

Are there any people who would be good to invite to the meeting tomorrow?

sgrannis:

My relationship to the code: I designed and coded the initial core matching software functions, including scorePair, formPairs, etc. I have directed all software development and developers who have contributed code to the RecMatch code base. My primary developer for this software was a person by the name of James Egg, a Regenstrief employee who left our organization in early 2015.

The sorting function dominates the total time, even after Run Linkage is performed. When these u-value Calculation and the m/u-value Calculation operations are performed, is that data stored in any file?

sgrannis:

If one clicks the “apply” button, those results are copied to the configuration file. The configuration file must be then saved to ensure that the values are persisted in the configuration file. Once the parameters are saved, the m/u values do not need to be recalculated for that data set.

I didn’t see any reference to it in the configuration file, even when I saved the configuration file after performing those calculations. Are they stored in the .sorted files?

sgrannis:

Per above, they are stored in the configuration file.

I am wondering if I calculate the values, save and close the configuration file, then open it back up, do I have to recalculate the u and m/u values before I can Run Linkage?

sgrannis:

Per above, the m/u values are saved in the config file and can be reloaded.

I tried Run Linkage after saving, quitting the program, then reopening, and it seems to generate the same files without having to recalculate.

In a related question, does the program then find the .sorted files that were created for a configuration file during the u and u/m calculations when a configuration is read back in?

sgrannis:

Yes. If the config file and data file remains unchanged, the program will re-use the existing sorted files without re-sorting.

I did not see a reference to the filenames within the configuration file, so how does the program find these sorted files even though their names are not stored?

sgrannis:

The names of the sorted files are derived from the original unsorted file names (which are listed in the config file). A hash created from the config file and data file is added to the sorted file name and used to determine if a new sort file is needed.
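The naming scheme described in that last answer can be sketched as follows. This is an illustration only: the choice of SHA-256, the 12-character truncation, the `.sorted` suffix placement, and the function names are all assumptions, not the actual RecMatch implementation.

```python
import hashlib
import tempfile
from pathlib import Path


def sorted_file_path(data_file: Path, config_file: Path) -> Path:
    """Derive the sorted file's name from the data file name plus a content hash.

    The hash covers both the config file and the data file, so a change to
    either one yields a different name (assumed scheme, for illustration).
    """
    digest = hashlib.sha256()
    digest.update(config_file.read_bytes())
    digest.update(data_file.read_bytes())
    return data_file.with_name(f"{data_file.name}.{digest.hexdigest()[:12]}.sorted")


def needs_sort(data_file: Path, config_file: Path) -> bool:
    """A new sort is needed only if no sorted file exists for this hash."""
    return not sorted_file_path(data_file, config_file).exists()


# Small demonstration with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    data = Path(tmp) / "patients.txt"
    config = Path(tmp) / "config.xml"
    data.write_text("3|Doe|Jane\n1|Doe|John\n")
    config.write_text("<config/>")

    first = sorted_file_path(data, config)
    unchanged = sorted_file_path(data, config)
    sort_needed_initially = needs_sort(data, config)

    # Editing the config changes the hash, so the old sorted file no longer applies.
    config.write_text("<config changed='true'/>")
    after_change = sorted_file_path(data, config)
```

Because the name is recomputed from the file contents each time, nothing about the sorted file has to be stored in the configuration file itself.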

That conversation was from 4 years ago. I don’t think you’re going to find any developers still familiar with the module. If you can make use of the code available, then go for it. If you need advice on the theory behind the module beyond what you can infer from the code and the discussion above, I’m sure Shaun would be happy to help.

wyclif · July 28, 2020, 1:54am

Thanks @burke for your response. I was hoping that someone had set up the module in the past with a basic configuration and would know what I could be missing. I did set up what I think is a basic configuration that the module should be able to work with to determine that two records are duplicates, but it can’t.

burke · August 6, 2020, 5:10pm

@wyclif, have you made any progress? If you’ve had success, how did you solve your problems? Did you happen to learn what “transposable” fields are along the way?

wyclif · August 11, 2020, 3:10am

The issue was that I was testing with only 2 records. You need at least 3 records, and preferably at least one matching and one mismatching record, because the matching logic needs to be trained with some records before it starts working properly.

I noticed the following issues with the module (not the GUI application):

1. It doesn’t work for 2.x and above, I fixed this though.
2. It doesn’t go by the matching configuration from the specified config file and always performs an exact match.
3. The matching algorithm does not ignore voided records.
4. The code in MatchingReportUtils.InitScratchTable selects only one strategy so combination strategies don’t work.
5. The module’s UI for creating matching strategies is very limited, all strategies built from it only perform exact matches, it doesn’t provide an option to select a different algorithm or set a different threshold and other configurations that you would specify from the config file.

I plan to create tickets to address items 2-5.

Regards,

Wyclif

burke · September 2, 2020, 12:36pm

I’d like to try to surface some useful information being shared in off-list conversations about the patient matching module…

- When merging matched patients, the “losing” patient gets voided (merged), but on subsequent runs it gets matched again with the winning patient. Instead, we would expect the module to exclude voided patients when finding duplicates.
- The state column in the patientmatching_matchingset table remains as PENDING even after merging patients. We’re not sure this is the correct behavior.
- “Transposable” fields can be swapped with one another; e.g., given name & family name can be treated essentially as one field. If you define fields as transposable with one another, we believe the algorithm will take that into account.
- The patient matching module currently implements a probabilistic algorithm, but does not support deterministic methods.
  - A probabilistic matching algorithm is useful for identifying likely matches and possible matches between two relatively large sets of patients (at least hundreds, thousands, or more). This can be useful for discovering the best data for matches, getting the probability of a match for near matches, and automatically adjusting matching based on available data.
  - A deterministic matching algorithm is a pre-defined algorithm applying a set of known & reproducible rules for determining whether patients match. This is what most sites implement on their own (e.g., consider patients to match if the patients have the same identifier OR gender + names + date of birth OR …). This can work on any number of patients, but is less likely to provide a probability of match (for non-perfect matches) and does not adjust to the data provided.
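To make the distinction concrete, here is a minimal Python sketch of both approaches. It is not the module's code: the function names, the record layout, and especially the m/u values are made up for illustration. The probabilistic part uses the standard Fellegi-Sunter agreement weights, where m is the probability that a field agrees for a true match and u is the probability that it agrees for a non-match; in practice those values would be estimated from the data.

```python
import math

# Illustrative (m, u) values per field; these numbers are assumptions,
# not values produced by the module.
M_U = {
    "given_name": (0.90, 0.05),
    "family_name": (0.90, 0.02),
    "birthdate": (0.95, 0.01),
    "gender": (0.98, 0.50),
}


def deterministic_match(a, b):
    """Fixed rule: same identifier OR gender + names + date of birth all equal."""
    if a["identifier"] and a["identifier"] == b["identifier"]:
        return True
    return all(a[field] == b[field] for field in M_U)


def probabilistic_score(a, b):
    """Sum of Fellegi-Sunter log-weights; a higher score means a more likely match."""
    score = 0.0
    for field, (m, u) in M_U.items():
        if a[field] == b[field]:
            score += math.log2(m / u)  # agreement weight
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement weight
    return score


john = {"identifier": "A1", "given_name": "John", "family_name": "Doe",
        "birthdate": "1980-01-01", "gender": "M"}
jon = {"identifier": "B7", "given_name": "Jon", "family_name": "Doe",
       "birthdate": "1980-01-01", "gender": "M"}
mary = {"identifier": "C3", "given_name": "Mary", "family_name": "Smith",
        "birthdate": "1975-06-30", "gender": "F"}

near_match_rule = deterministic_match(john, jon)    # the rule sees a mismatch
near_match_score = probabilistic_score(john, jon)   # the score still leans toward a match
non_match_score = probabilistic_score(john, mary)
```

The deterministic rule rejects the near match outright because one name differs, while the score stays positive and could be compared against a threshold to flag the pair for review.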

mksd · September 2, 2020, 2:47pm

Thanks a lot @burke.

@wyclif to confirm if the first point is still relevant.

wyclif · September 2, 2020, 5:26pm

Thanks @Burke for sharing the conversation here.

I did finally verify that the patient matching actually filters out voided records, so this was a false alarm on my side.

@mksd yes, as Burke mentioned, the patient matching module is designed for probabilistic matching. We are trying to find out from Shaun and Andrew if it’s possible to tweak it to behave in a more deterministic way based on a config setting.

Also, the module doesn’t mark merged records in the patientmatching_matchingset table as merged. We might want to fix that, but it’s not a blocker for the matching algorithm.

wyclif · October 7, 2020, 5:53am

Hi,

We have created the tickets below to address some issues in patient matching:

PTM-92 Use configured patient matching configuration
PTM-95 Fix transposable fields logic to work as expected
PTM-96 Add support for deterministic patient matching

Anybody with objections? Specifically regarding https://issues.openmrs.org/browse/PTM-96?

Regards,

Wyclif