Soundex search in LuceneQueries

Looks like a good plan! Thanks again @fruether for your great contribution! :slight_smile:

@dkayiwa I tuned the lucene query a bit but the result did not have a great effect. There are still some that are ranked higher as expected (see Soundex search in LuceneQueries - #45 by fruether)

Anyways I prepared a PR as we decided the additional results are fine. Thanks for the review and feedback in advance!

N Names Queries While the previous Pull request for TRUNK-5974 is under review I need an additional decision regarding the design of TRUNK-5973

Previous logic The previous logic took the search parameter length into account namely the names that have been searched for. If I search for the names n1, n2, n3, n4 I would expect a match of 4*0,75 = 3 names. This means at least three names I was searching for have to match for patient so that it is part of the result. This number increases slowly with the amount of names that are part of the query.

Proposed logic in Lucene

The easiest way and most simplest to maintain would be to just search for all the names and then let Lucene rank in according to the number of matches. This would be a query would be designed like e.g.

(familyNameSoundex:‘Darius’ OR familyName2Soundex:‘Darius’ OR middleNameSoundex:‘Darius’ OR givenNameSoundex:‘Darius’ OR familyNameSoundex:‘Graham’ OR familyName2Soundex:‘Graham’ OR middleNameSoundex:‘Graham’ OR givenNameSoundex:‘Graham’ OR familyNameSoundex:‘Jazayeri’ OR familyName2Soundex:‘Jazayeri’ OR middleNameSoundex:‘Jazayeri’ OR givenNameSoundex:‘Jazayeri’ OR familyNameSoundex:‘Junior’ OR familyName2Soundex:‘Junior’ OR middleNameSoundex:‘Junior’ OR givenNameSoundex:‘Junior’ )

This would add patients to the result of the name search, e. g. names with only one matching name of n1 to n4, but the query itself would be simple to understand. Another alternative would be a complex matching of names as a combination of “AND” that would be harder to reason about but more limiting in terms of search results . @dkayiwa @ibacher if you could be so kind to validate the logic or if to many results are harmful in this scenario: Trunk 5973 - Review for Talk discussion on Lucene query on N names quersy by fruether · Pull Request #3739 · openmrs/openmrs-core · GitHub

Would this return less results than we currently do?

  1. The proposed logic returns the same plus additional rows and is implemented in the PR PR part

  2. I would have to design the complex matching logic. It would probably be a conjunction of n-1 terms where n = name.lenght * 0,75. So if a 3 name query appears of the 4 name party and 4 fields I would generate conjunctions with 3 terms that are combined by an OR statement e. g

  • (givenName:n1 and familyName:n2 and familyName3:n3) OR
  • (firstName:n1 and familyName:n2 and familyName3:n3) OR
  • (givenName:n4 and familyName:n2 and familyName3:n3) OR
  • (firstName:n1 and familyName:n2 and familyName3:n3) OR … The result of this query should be that only persons that match on three names are returned. I think two nested for loops would have be create to do this. I am not sure what the best approach would be but it would be quite compley compare to the just or combination of the proposed variant

@fruether are you done responding to all comments on the pull request?

@dkayiwa

TRUNK-5974 - Use of lucene score for two name queries by fruether · Pull Request #3738 · openmrs/openmrs-core · GitHub → There are no open comments left as far as I can see

Trunk 5973 - Review for Talk discussion on Lucene query on N names quersy by fruether · Pull Request #3739 · openmrs/openmrs-core · GitHub → There I need comments in terms of business logic before I start to apply the code changes. Because it makes no sense to invest time in making code look good that will never merged because test case results are not acceptable.

@fruether did you see the comment that i added two days ago for pull request number 3738?

Yes @dkayiwa I replied to it. However, I think I am not before Thursday evening (London Time) able to adapt it accordingly.

@dkayiwa I adapted the code according to your review!

@dkayiwa regarding your question in the PR

Can you give an example of a search query and the return name values?

The Search query would be: “Darius Graham Jazayeri Junior” (for gender male and 1997) and reflects the test case

Results: (first three were expected results in test case with previous logic)

  • 1006–>[Jazayeri Darius Graham]
  • 1007–>[Darius Graham Jazayeri]*
  • 1009–>[Darius2 Darius2]
  • 1005–>[Graham Darius]
  • 1004–>[Graham Darius]
  • 1003–>[Darius Graham]
  • 1013–>[Graham2]
  • 1011–>[Graham2 Graham2]
  • 1012–>[Darius2 Darius2]
  • 1002–>[Darius]
  • 1001–>[Darius]
  • 1000–>[Darius]
  • 1008–>[Darius With SomeOtherName]
  • 1010–>[Graham3]

Generated Query:

(familyNameSoundex:‘Darius’ OR familyName2Soundex:‘Darius’ OR middleNameSoundex:‘Darius’ OR givenNameSoundex:‘Darius’ OR familyNameSoundex:‘Graham’ OR familyName2Soundex:‘Graham’ OR middleNameSoundex:‘Graham’ OR givenNameSoundex:‘Graham’ OR familyNameSoundex:‘Jazayeri’ OR familyName2Soundex:‘Jazayeri’ OR middleNameSoundex:‘Jazayeri’ OR givenNameSoundex:‘Jazayeri’ OR familyNameSoundex:‘Junior’ OR familyName2Soundex:‘Junior’ OR middleNameSoundex:‘Junior’ OR givenNameSoundex:‘Junior’ )

My impression The logic is fine. It is not as restrictive as the previous one but a lot easier to read/maintain and udnerstand

I completely agree. The first pull request has been merged. Waiting for you to polish up the other one.

I created a new PR to start after the design decision now fresh. It can be found here: Trunk-5973: Refactor getSimilarPeople to use Lucene for a search on m… by fruether · Pull Request #3770 · openmrs/openmrs-core · GitHub

Read to work on your review.

@ibacher @dkayiwa @bistenes I think the path to follow on the Lucene track would then be to tackle: TRUNK-5669 Could someone give me a heads up what would have to be done next regarding this issue or to follow the refactoring path? Or is it it overall done now in this person search context and I can move to another topic? If so any suggestion for a topic?

Do you mean another topic related to lucene? Or anything else?

Both! Related to Lucene if there is anything that is urgent and relevant now , like e. g. TRUNK-5669 (I do not understand yet what it is about), or another topic if that would be more urgent/relevant.

Reviewed, merged and eventually closed https://issues.openmrs.org/browse/TRUNK-5680

It has been more than a year since you started working on this ticket. So, thank you so much @fruether for the dedication and determination to take it to the finishing line. It is not so common for us to get volunteers like you, who work on complicated tickets! :slight_smile:

@fruether did you get a chance to look at the links in the description of this ticket? If yes, do they help you understand what it is all about?

1 Like

@dkayiwa it was a pleasure working with you and @ibacher on getting this issue done. It was a long time but I think it was totally worth the effort. So I am looking forward to continue the work on other issues.

In term of that new ticket I think I totally understand the need. We would like to configure in a more flexible way what filters should be used. E. g by giving a string as global properties like this one:

“org.apache.lucene.analysis.core.LowerCaseFilterFactory, org.apache.lucene.analysis.standard.ClassicFilterFactory, org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory”

I am just not sure where the gap in the approach of @bistenes lies as he already developed code and rather seemed in failing to make it being overwritten by a module. I am not sure how to work with an open-mrs module. Do you got an idea where I can read about it? And where the global properties should be overwritten in a module context? I assume there exists a standard approach.

@dkigen mentioned to me that accents are no longer normalized in patient search. Is that because of this work? If so, that’s a major regression.

@bistenes there have been multiple merges for 1,2,3 and 4 names in the patient search. The last two merges have been for 2 and 4 name queries.

Does the accents normalization during the search does not work for 2 and 4 name queries? And does it use the getSimilar() function of core? If so it could have an affect (but I would assume 1 & 3 name queries still to work as they were merged month back). Otherwise I would highly doubt an affect

@dkayiwa thinking about this ticket: I am not really sure if I can add any value to it as the approach already described in the talk post (Making patient search Lucene analyzers configurable - #12 by bistenes) looks fine to me and I am not really so much into modules that I think I can help " in overriding it from a modul". Therefore, do you got another idea for an issue in open-mrs core?

@ibacher in terms of the OpenMrs ETL is there any issues that can be worked on within the community as it is not critical and more long term: Making patient search Lucene analyzers configurable - #12 by bistenes

Thanks for your suppoert!