Soundex search in LuceneQueries

dkayiwa · April 5, 2021, 6:43pm

Would this return less results than we currently do?

fruether · April 5, 2021, 8:03pm

The proposed logic returns the same plus additional rows and is implemented in the PR PR part
I would have to design the complex matching logic. It would probably be a conjunction of n-1 terms where n = name.lenght * 0,75. So if a 3 name query appears of the 4 name party and 4 fields I would generate conjunctions with 3 terms that are combined by an OR statement e. g

(givenName:n1 and familyName:n2 and familyName3:n3) OR
(firstName:n1 and familyName:n2 and familyName3:n3) OR
(givenName:n4 and familyName:n2 and familyName3:n3) OR
(firstName:n1 and familyName:n2 and familyName3:n3) OR … The result of this query should be that only persons that match on three names are returned. I think two nested for loops would have be create to do this. I am not sure what the best approach would be but it would be quite compley compare to the just or combination of the proposed variant

dkayiwa · April 18, 2021, 6:15pm

@fruether are you done responding to all comments on the pull request?

fruether · April 18, 2021, 6:27pm

@dkayiwa

TRUNK-5974 - Use of lucene score for two name queries by fruether · Pull Request #3738 · openmrs/openmrs-core · GitHub → There are no open comments left as far as I can see

Trunk 5973 - Review for Talk discussion on Lucene query on N names quersy by fruether · Pull Request #3739 · openmrs/openmrs-core · GitHub → There I need comments in terms of business logic before I start to apply the code changes. Because it makes no sense to invest time in making code look good that will never merged because test case results are not acceptable.

dkayiwa · April 20, 2021, 9:49pm

@fruether did you see the comment that i added two days ago for pull request number 3738?

fruether · April 20, 2021, 9:58pm

Yes @dkayiwa I replied to it. However, I think I am not before Thursday evening (London Time) able to adapt it accordingly.

fruether · April 21, 2021, 9:34pm

@dkayiwa I adapted the code according to your review!

fruether · April 27, 2021, 9:46pm

@dkayiwa regarding your question in the PR

Can you give an example of a search query and the return name values?

The Search query would be: “Darius Graham Jazayeri Junior” (for gender male and 1997) and reflects the test case

Results: (first three were expected results in test case with previous logic)

1006–>[Jazayeri Darius Graham]
1007–>[Darius Graham Jazayeri]*
1009–>[Darius2 Darius2]
1005–>[Graham Darius]
1004–>[Graham Darius]
1003–>[Darius Graham]
1013–>[Graham2]
1011–>[Graham2 Graham2]
1012–>[Darius2 Darius2]
1002–>[Darius]
1001–>[Darius]
1000–>[Darius]
1008–>[Darius With SomeOtherName]
1010–>[Graham3]

Generated Query:

(familyNameSoundex:‘Darius’ OR familyName2Soundex:‘Darius’ OR middleNameSoundex:‘Darius’ OR givenNameSoundex:‘Darius’ OR familyNameSoundex:‘Graham’ OR familyName2Soundex:‘Graham’ OR middleNameSoundex:‘Graham’ OR givenNameSoundex:‘Graham’ OR familyNameSoundex:‘Jazayeri’ OR familyName2Soundex:‘Jazayeri’ OR middleNameSoundex:‘Jazayeri’ OR givenNameSoundex:‘Jazayeri’ OR familyNameSoundex:‘Junior’ OR familyName2Soundex:‘Junior’ OR middleNameSoundex:‘Junior’ OR givenNameSoundex:‘Junior’ )

My impression The logic is fine. It is not as restrictive as the previous one but a lot easier to read/maintain and udnerstand

dkayiwa · May 4, 2021, 8:49am

I completely agree. The first pull request has been merged. Waiting for you to polish up the other one.

fruether · May 8, 2021, 9:01pm

I created a new PR to start after the design decision now fresh. It can be found here: Trunk-5973: Refactor getSimilarPeople to use Lucene for a search on m… by fruether · Pull Request #3770 · openmrs/openmrs-core · GitHub

Read to work on your review.

@ibacher @dkayiwa @bistenes I think the path to follow on the Lucene track would then be to tackle: TRUNK-5669 Could someone give me a heads up what would have to be done next regarding this issue or to follow the refactoring path? Or is it it overall done now in this person search context and I can move to another topic? If so any suggestion for a topic?

dkayiwa · May 9, 2021, 9:24am

Do you mean another topic related to lucene? Or anything else?

fruether · May 9, 2021, 11:13am

Both! Related to Lucene if there is anything that is urgent and relevant now , like e. g. TRUNK-5669 (I do not understand yet what it is about), or another topic if that would be more urgent/relevant.

dkayiwa · May 12, 2021, 12:22pm

Reviewed, merged and eventually closed https://issues.openmrs.org/browse/TRUNK-5680

It has been more than a year since you started working on this ticket. So, thank you so much @fruether for the dedication and determination to take it to the finishing line. It is not so common for us to get volunteers like you, who work on complicated tickets!

@fruether did you get a chance to look at the links in the description of this ticket? If yes, do they help you understand what it is all about?

fruether · May 13, 2021, 2:37pm

@dkayiwa it was a pleasure working with you and @ibacher on getting this issue done. It was a long time but I think it was totally worth the effort. So I am looking forward to continue the work on other issues.

In term of that new ticket I think I totally understand the need. We would like to configure in a more flexible way what filters should be used. E. g by giving a string as global properties like this one:

“org.apache.lucene.analysis.core.LowerCaseFilterFactory, org.apache.lucene.analysis.standard.ClassicFilterFactory, org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilterFactory”

I am just not sure where the gap in the approach of @bistenes lies as he already developed code and rather seemed in failing to make it being overwritten by a module. I am not sure how to work with an open-mrs module. Do you got an idea where I can read about it? And where the global properties should be overwritten in a module context? I assume there exists a standard approach.

bistenes · May 13, 2021, 9:38pm

@dennis mentioned to me that accents are no longer normalized in patient search. Is that because of this work? If so, that’s a major regression.

fruether · May 13, 2021, 10:47pm

@bistenes there have been multiple merges for 1,2,3 and 4 names in the patient search. The last two merges have been for 2 and 4 name queries.

Does the accents normalization during the search does not work for 2 and 4 name queries? And does it use the getSimilar() function of core? If so it could have an affect (but I would assume 1 & 3 name queries still to work as they were merged month back). Otherwise I would highly doubt an affect

fruether · June 3, 2021, 7:58pm

@dkayiwa thinking about this ticket: I am not really sure if I can add any value to it as the approach already described in the talk post (Making patient search Lucene analyzers configurable - #12 by bistenes) looks fine to me and I am not really so much into modules that I think I can help " in overriding it from a modul". Therefore, do you got another idea for an issue in open-mrs core?

@ibacher in terms of the OpenMrs ETL is there any issues that can be worked on within the community as it is not critical and more long term: Making patient search Lucene analyzers configurable - #12 by bistenes

Thanks for your suppoert!