Soundex search in LuceneQueries

fruether · December 6, 2020, 3:49pm

As discussed with @ibacher, I am currently working on simplifying the two-name query search by basing it on Lucene scoring. The business logic remains as described in: Soundex search in LuceneQueries

Let me lay out the results:

Test Case Only N2 is matching

The result of this test case is now that we find 8 instead of 3 matching cases. The test case searches for “D Graham” and now additionally finds the following results:

For 1006 the family name is Graham. This is a match, but since a familyName match would only give 4 points, this one would be filtered out by the previous logic
For 1003 the middle name is Graham. There is no other match, so the score would be below 6 and the entry would be filtered out
For 1007 the middle name is Graham. There is no other match, so the score would be below 6 and the entry would be filtered out
For 1004 the given name is Graham. That would score 3 points, which leads to it being filtered out
For 1005 the given name is Graham. That would score 3 points, which leads to it being filtered out

No element that was previously expected is now missing.

Test Case two names are matching anywhere

The result of this test case is now that we find 14 instead of 11 matching cases. The test case searches for “Darius Graham” and now additionally finds the following results:

For 1002 the familyName is Darius. In the previous logic that would translate to a 3, which together with the 3 empty names would be a 6, which is too low (<6)
For 1008 the first name is Darius and the familyName is empty, which translates to a 5 (which is < 6)
For 1001 the middleName is Darius, which becomes a 3 (+1 for the empty familyName2), which would be less than 6

1002 is at rank 4; 1008 and 1001 rank only above 1010, which has a full score only on the middle name.

First arguments search

The result of this test case is now that we find 11 instead of 3 matching cases. The test case searches for “Darius G” and now additionally finds the following results:

For 1002 and 1005 the family name is Darius, which would be a 3
For 1003, 1007 and 1008 Darius is the givenName, but this is the only match (4 + empties), which is below 6
For 1001, 1004 and 1006 the middleName is Darius, which is the only match and hence too low in score (4 + empty names < 6)

1002 and 1005 appear before the expected values in the ranking.

Summary: In the new logic a single match is enough to get into the result bucket. The previous logic needed two matches. The ordering of the results differs as well: most, but not all, of the new finds rank behind the previously expected results.
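To make the difference concrete, here is a rough sketch in plain Java. It is not the actual LuceneQueries code: the class and method names are made up for illustration, and the cut-off rule (drop anything scoring below 6) is an assumption taken from the numbers quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class NameMatchSketch {

    // Assumption from the discussion above: scores below 6 were filtered out.
    static final int OLD_THRESHOLD = 6;

    /** Previous behaviour (simplified): keep only candidates that reach the threshold. */
    static List<Integer> filterByThreshold(Map<Integer, Integer> scoreByPersonId) {
        List<Integer> kept = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : scoreByPersonId.entrySet()) {
            if (entry.getValue() >= OLD_THRESHOLD) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    /** New behaviour (simplified): keep every candidate with any match, best score first. */
    static List<Integer> rankAllMatches(Map<Integer, Integer> scoreByPersonId) {
        List<Integer> ranked = new ArrayList<>();
        for (Map.Entry<Integer, Integer> entry : scoreByPersonId.entrySet()) {
            if (entry.getValue() > 0) {
                ranked.add(entry.getKey());
            }
        }
        // Sort descending by score; List.sort is stable, so ties keep their input order.
        ranked.sort((a, b) -> Integer.compare(scoreByPersonId.get(b), scoreByPersonId.get(a)));
        return ranked;
    }
}
```

With 1006 at 4 points and 1004 at 3 points (the values from the first test case) plus a hypothetical two-match candidate at 8 points, `filterByThreshold` keeps only the 8-point candidate, while `rankAllMatches` returns all three with the 8-point candidate first.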

@ibacher @dkayiwa: I think this business logic is even better, since it also takes a single match into account for the result. What do you think? Maybe the boosts have to be adjusted for the ranking.
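As a rough illustration of what adjusting the boosts could mean for the ranking, the sketch below sums one weight per matched field. This is deliberately not Lucene's real scoring formula, and the class, enum and weights are hypothetical; it only shows that changing a per-field weight can flip the order of two single-match candidates.

```java
import java.util.Map;
import java.util.Set;

class BoostSketch {

    enum Field { GIVEN_NAME, MIDDLE_NAME, FAMILY_NAME }

    /** Sums the (hypothetical) boost of every field in which the search term matched. */
    static double score(Set<Field> matchedFields, Map<Field, Double> boosts) {
        double total = 0.0;
        for (Field field : matchedFields) {
            // Fields without an explicit boost count with a neutral weight of 1.0.
            total += boosts.getOrDefault(field, 1.0);
        }
        return total;
    }
}
```

With a family-name boost of 2.0, a candidate matching only on the family name outranks one matching only on the middle name (2.0 versus 1.0). Lowering that boost to 0.5 reverses the order.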