I would think the correct thing to do is to either transform the list returned into a LinkedHashSet (which maintains insertion order) or use the streaming API to do List.stream().distinct().collect(Collectors.toList()). That said, if there is a way to guarantee that a Lucene search only returns each “document” (i.e., patient) once, then we can simply return the list returned from Lucene.
A TreeSet is the wrong way to go, as it ends up re-sorting the set in Java code, which I think we want to avoid.
One thought: would it make sense to make the boosts a bit higher, something like:
(givenName:n1^6 OR givenName:n2^3 OR givenName:n1)
OR
(middleName:n1^3 OR middleName:n2^6OR middleName:n1)
OR
(familyName:n1 OR familyName:n2^3 OR familyName:n1^6)
OR
(familyName2:n1 OR familyName2:n2^3 OR familyName2:n1^6)
I.e., to somewhat mitigate the effects of more occurrences receiving a higher score by boosting the matches in the “appropriate” field to count more than that (these numbers may need to be adjusted).
One thought: would it make sense to make the boosts a bit higher, something like:
Doing a boost like that makes the result a lot more as expected. However, it acts more like I would expect it to behave with a boost of 3 and not of 6:
1007 --> 3x3x3x0 --> 9
1003 --> 3x3x0x0 --> 6
1006 --> 1 x 2 x 2 x 0 --> 5
1004 --> 2x3x0 --> 5
1011 --> 2x3 --> 5
1005 --> 2 x 1 --> 3 ---> WHY
1009 --> 2 x 1 x 1 --> 4 --> WHY - Lower due to weak middle_name=Darius I assume
1012 --> 3x2 --> 5
1013 --> 2 X 2 --> 4
1000 --> 3 --> 3
1008 --> 3
1010 --> 3 --> 3
1001 --> 2 --> 2
1002 --> 1 --> 1
SkipSame Method
I’ve implemented skipSame for this specific purpose
I am using skipSame now.
LinkedHashSet
I would think the correct thing to do is to either transform the list returned into a LinkedHashSet (which maintains insertion order) or use the streaming API to do
I am using a LinkedHashSet. So we the order will be preserved.
@ibacher: I am wondering if changing the interface to use List instead of Set would be to much of an issue since a lot of services probably have to be adapted. What are your thoughs?
I don’t love the idea of changing a public interface without very good reason, since we’d invariably break something. The choice of having the method return a Set is maybe a bit unfortunate, but I think we’re stuck with it.
In context of the test data set I made an analysis to show how the actual result differs compared to the result (and order) that would have been provided by the original statement based on the addition. The result look something like that:
1007 → 0
1003 → 0
1006 → 0
1011 → -1 → Graham is set multiple times. 2 + 3 = 5 is the additive score of the original score
1012 → -3 → 3+2=5, middle_name=Darius=9 occurs often. So probably counted as a 4 instead of 5
1004 → +2 → 3+2 = 5 Two seperate names set, middle_name=Darius appears to often
1009 → 0
1013 → 0
1000 → 0
1008 → 0
1010 → 0
1001 → 0
1002 → 0
PR of the code can be found here: https://github.com/openmrs/openmrs-core/pull/3241@ibacher it probably makes sense to use this PR for testing since it contains more than the same changes as the previous PR including the 3 name search functionality.
As discussed with @ibacher I am currently working on the simplification of the 2 name query search based on using Lucene scoring. The business logic remains like described in: Soundex search in LuceneQueries
Let me lay out the results:
Test Case Only N2 is matching
The result of this test case now is that we find 8 and not 3 matching cases. The test cases searches for “D Graham” and now finds in addition the following results:
For 1006 the family name is Graham. This is a match but since the familyName match would only give 4 points this one would be filtered out in previous logic
For 1003 the middle name is Graham. No other match so the score would be below 6 and filtered out
For 1007 the middle name is Graham. No other match so the score would be below 6 and filtered out
For 1004 the given Name is Graham. That would score to 3 points which leads to a filtering out
For 1005 the given Name is Graham. That would score to 3 points which leads to a filtering out
No element that was previous expected is now missing
Test Case two names are matching anywhere
The result of this test case now is that we find 14 and not 11 matching cases. The test cases searches for “Darius Graham” and now finds in addition the following results:
For 1002 the familyName is Darius. That would in previous logic translate to 3 which would be a 6 together with the 3 empty names which is less (<6)
For 1008 the first Name is Darius & familyName is empty which translates to a 5 (which is < 6)
For 1001 the middleName is Darius which becomes a 3 (+1 for empty familyName2) which would be less to 6
The 1002 is at rank 4, 1008 and 1001 are only better than 1010 which has only full score at middle name.
First arguments search
The result of this test case now is that we find 11 and not 3 matching cases. The test cases searches for “Darius G” and now finds in addition the following results:
For 1002, 1005 family name is Darius which would be a 3
For 1003, 1007, 1008 Darius is the givenName but only this is matching (4 + empties) which is below 6
For 1001, 1004, 1006 middleName is Darius which is only matching and hence to low in score (4 + empty names < 6)
1002 and 1005 match before the expected values in in the ranking.
Summary: In the new logic only one match would be enough to get into the result bucket. The previous needed two matches. The ordering of the result is as well different in the sense that the most new finds but not all behind the previous.
@ibacher@dkayiwa: I think this business logic is even better since it as well takes one match into account for the result. What do you think? May the boost hast to be adjusted for the ranks.
@ibacher@dkayiwa did you had the chance to validate that the described change in the behaviour of the business logic would be valid and ok?
Because if so I would refactor and create a PR. Otherwise I would stop the work in this ticket.
@fruether In the abstract, I think this sounds alright. I’d have to play around with it in practice to have a real opinion. I think it’s less concerning to be including new matches, but changes in the order of matches are something we need to look at carefully.
I updated the PR so that the not used code is removed. So once we made a decision regarding the usability of this approach we can work on making the PR mergeable.
@fruether i have closely looked at the extra results returned, and they look correct. So i believe that we can tolerate the less strict search algorithm that returns more rather then less values, but in an order.