Soundex search in LuceneQueries

I would think the correct thing to do is to either transform the list returned into a LinkedHashSet (which maintains insertion order) or use the streaming API to do List.stream().distinct().collect(Collectors.toList()). That said, if there is a way to guarantee that a Lucene search only returns each “document” (i.e., patient) once, then we can simply return the list returned from Lucene.

A TreeSet is the wrong way to go, as it ends up re-sorting the set in Java code, which I think we want to avoid.

One thought: would it make sense to make the boosts a bit higher, something like:

(givenName:n1^6 OR   givenName:n2^3 OR   givenName:n1) 
OR
(middleName:n1^3 OR   middleName:n2^6OR   middleName:n1) 
OR
(familyName:n1 OR   familyName:n2^3 OR   familyName:n1^6) 
OR
(familyName2:n1 OR   familyName2:n2^3 OR   familyName2:n1^6)

I.e., to somewhat mitigate the effects of more occurrences receiving a higher score by boosting the matches in the “appropriate” field to count more than that (these numbers may need to be adjusted).

I’m sorry I didn’t have time to follow this thread, but there’s a way to sort of achieve that. I’ve implemented skipSame for this specific purpose: https://github.com/openmrs/openmrs-core/blob/master/api/src/main/java/org/openmrs/api/db/hibernate/search/LuceneQuery.java#L283-L285

1 Like
  1. Adapted Boost:

One thought: would it make sense to make the boosts a bit higher, something like:

Doing a boost like that makes the result a lot more as expected. However, it acts more like I would expect it to behave with a boost of 3 and not of 6:

1007 --> 3x3x3x0 --> 9
1003 --> 3x3x0x0 --> 6
1006 --> 1 x 2 x 2 x 0 --> 5
1004 --> 2x3x0 --> 5
1011 --> 2x3 --> 5
1005 --> 2 x 1 --> 3 ---> WHY
1009 --> 2 x 1 x 1 --> 4 --> WHY - Lower due to weak middle_name=Darius I assume
1012 --> 3x2 --> 5
1013 --> 2 X 2 --> 4  
1000 --> 3 --> 3
1008 --> 3
1010 --> 3 --> 3
1001 --> 2  --> 2
1002 --> 1 --> 1
  1. SkipSame Method

I’ve implemented skipSame for this specific purpose

I am using skipSame now.

  1. LinkedHashSet

I would think the correct thing to do is to either transform the list returned into a LinkedHashSet (which maintains insertion order) or use the streaming API to do

I am using a LinkedHashSet. So we the order will be preserved. @ibacher: I am wondering if changing the interface to use List instead of Set would be to much of an issue since a lot of services probably have to be adapted. What are your thoughs?

I don’t love the idea of changing a public interface without very good reason, since we’d invariably break something. The choice of having the method return a Set is maybe a bit unfortunate, but I think we’re stuck with it.

@ibacher I agree to your point.

In context of the test data set I made an analysis to show how the actual result differs compared to the result (and order) that would have been provided by the original statement based on the addition. The result look something like that:

  • 1007 --> 0
  • 1003 --> 0
  • 1006 --> 0
  • 1011 --> -1 --> Graham is set multiple times. 2 + 3 = 5 is the additive score of the original score
  • 1012 --> -3 --> 3+2=5, middle_name=Darius=9 occurs often. So probably counted as a 4 instead of 5
  • 1004 --> +2 --> 3+2 = 5 Two seperate names set, middle_name=Darius appears to often
  • 1009 --> 0
  • 1013 --> 0
  • 1000 --> 0
  • 1008 --> 0
  • 1010 --> 0
  • 1001 --> 0
  • 1002 --> 0

PR of the code can be found here: https://github.com/openmrs/openmrs-core/pull/3241 @ibacher it probably makes sense to use this PR for testing since it contains more than the same changes as the previous PR including the 3 name search functionality.

1 Like

As discussed with @ibacher I am currently working on the simplification of the 2 name query search based on using Lucene scoring. The business logic remains like described in: Soundex search in LuceneQueries

Let me lay out the results:

Test Case Only N2 is matching

The result of this test case now is that we find 8 and not 3 matching cases. The test cases searches for “D Graham” and now finds in addition the following results:

  1. For 1006 the family name is Graham. This is a match but since the familyName match would only give 4 points this one would be filtered out in previous logic
  2. For 1003 the middle name is Graham. No other match so the score would be below 6 and filtered out
  3. For 1007 the middle name is Graham. No other match so the score would be below 6 and filtered out
  4. For 1004 the given Name is Graham. That would score to 3 points which leads to a filtering out
  5. For 1005 the given Name is Graham. That would score to 3 points which leads to a filtering out
  • No element that was previous expected is now missing

Test Case two names are matching anywhere The result of this test case now is that we find 14 and not 11 matching cases. The test cases searches for “Darius Graham” and now finds in addition the following results:

  1. For 1002 the familyName is Darius. That would in previous logic translate to 3 which would be a 6 together with the 3 empty names which is less (<6)
  2. For 1008 the first Name is Darius & familyName is empty which translates to a 5 (which is < 6)
  3. For 1001 the middleName is Darius which becomes a 3 (+1 for empty familyName2) which would be less to 6

The 1002 is at rank 4, 1008 and 1001 are only better than 1010 which has only full score at middle name.

First arguments search The result of this test case now is that we find 11 and not 3 matching cases. The test cases searches for “Darius G” and now finds in addition the following results:

  1. For 1002, 1005 family name is Darius which would be a 3
  2. For 1003, 1007, 1008 Darius is the givenName but only this is matching (4 + empties) which is below 6
  3. For 1001, 1004, 1006 middleName is Darius which is only matching and hence to low in score (4 + empty names < 6)

1002 and 1005 match before the expected values in in the ranking.

Summary: In the new logic only one match would be enough to get into the result bucket. The previous needed two matches. The ordering of the result is as well different in the sense that the most new finds but not all behind the previous.

@ibacher @dkayiwa: I think this business logic is even better since it as well takes one match into account for the result. What do you think? May the boost hast to be adjusted for the ranks.

@ibacher @dkayiwa did you had the chance to validate that the described change in the behaviour of the business logic would be valid and ok? Because if so I would refactor and create a PR. Otherwise I would stop the work in this ticket.

Around how many of the existing tests, would fail with your proposed implementation?

@dkayiwa just the tree test cases described above are failing in openmrs-api aligned with my explanation.

  • [ ERROR ] PersonServiceTest.getSimilarPeople_shouldMatchN1InThreeNamesSearch:516 expected: <3> but was: <11>
  • [ ERROR ] PersonServiceTest.getSimilarPeople_shouldMatchN2InOneLastNameAndEmptyNames:549 expected: <3> but was: <8>
  • [ ERROR ] PersonServiceTest.getSimilarPeople_shouldMatchTwoWordSearchToAnyNamePart:492 expected: <11> but was: <14>

–>Tests run: 4349, Failures: 3, Errors: 0, Skipped: 36

Do you have some sort of draft pull request that i can try out to examine the extra records in order to see if they make sense?

In the above post I documented the discrepancy and the reasons for failing of the tests: Soundex search in LuceneQueries

I was using the code in the following PR to produce the outcome: https://github.com/openmrs/openmrs-core/pull/3663

It is not cleaned up yet but rather it focuses on the research. Feel free to check the output of the test cases:

@fruether In the abstract, I think this sounds alright. I’d have to play around with it in practice to have a real opinion. I think it’s less concerning to be including new matches, but changes in the order of matches are something we need to look at carefully.

Cool. I can understand that. Feel free to use the PR for playing around: https://github.com/openmrs/openmrs-core/pull/3663

Once I got the go I will delete the not-necessary lines of code and make the code more high quality.

I updated the PR so that the not used code is removed. So once we made a decision regarding the usability of this approach we can work on making the PR mergeable.

@ibacher @dkayiwa do you had the time to have a look to the PR and give an indication if that failing tests and adapted list would be ok?

Meanwhile I would work on the > three name search

Any update when you may got some time to review the code so that we can decide if the approach is good or not?

Thanks for the support!

1 Like

@fruether i have closely looked at the extra results returned, and they look correct. So i believe that we can tolerate the less strict search algorithm that returns more rather then less values, but in an order.

1 Like

Thank you very much for validating it @dkayiwa

Then my next steps will be:

  1. I adapt the test cases accordingly
  2. I am trying to tune the weights so that the expected result is matched more closely.

I assume I can deal with that during the next 2-3 weeks and will come back to you!

Looks like a good plan! Thanks again @fruether for your great contribution! :slight_smile:

@dkayiwa I tuned the lucene query a bit but the result did not have a great effect. There are still some that are ranked higher as expected (see Soundex search in LuceneQueries - #45 by fruether)

Anyways I prepared a PR as we decided the additional results are fine. Thanks for the review and feedback in advance!