Soundex search in LuceneQueries

Similar to the previous thread I would like to document the business logic in this thread to make the debugging and review easier:

Logic in the SQL statement:

case.  when pname.givenName is null then 0.
       when soundex(pname.givenName) = soundex(:n1) then 3.
       when soundex(pname.givenName) = soundex(:n2) then 2.
       when soundex(pname.givenName) = soundex(:n3) then 1.  else 0 . end. 
+ 

case. when pname.middleName is null then 0.
      when soundex(pname.middleName) = soundex(:n1) then 2.
      when soundex(pname.middleName) = soundex(:n2) then 3.
      when soundex(pname.middleName) = soundex(:n3) then 1.  else 0. end.
        + 
case. when pname.familyName is null then 0.
      when soundex(pname.familyName) = soundex(:n1) then 1.
      when soundex(pname.familyName) = soundex(:n2) then 2.
      when soundex(pname.familyName) = soundex(:n3) then 3.  else 0. end.
      +
case.  when pname.familyName2 is null then 0.
      when soundex(pname.familyName2) = soundex(:n1) then 1.
      when soundex(pname.familyName2) = soundex(:n2) then 2.
      when soundex(pname.familyName2) = soundex(:n3) then 3.  else 0. end.
 >= 5;

Logic 1: One variable (e. g. n1) matches multiple names so that count goes over 5. This happens with an expected match (n1 = givenName) and mutiple 1 and 2 add upp:

  • givenName=n1 and ((familyName= n1 and familyName2=n1) or middleName=n1)
  • middleName=n2 and (n2= familyName or familyName2=n2)
  • familyName=n3 and givenName=n3 and middleName=n3
  • familyName2=n3 and givenName=n3 and middleName=n3

Logic 2 There are two matches with a 3:

  • There exists two true returns in GivenName(N1), MiddleName(n2), FamilyName(N3), FamilyName2(N3)

Logic 3 There exist a match in the form of a 2 and 3 hit

  • givenName = n1 and familyName=n2
  • givenName = n1 and familyName=n3
  • familyName = n3 and middleName = n2
  • familyName2 = n3 and middleName = n2

Logic 4: There exist one almost correct attribute match worth of 2 points and three matches worth one point

  • givenName=n2, middleName=n3, familyName=n1, familyName2=n1
  • middleName=n3, givenName=n3, familyName=n1, familyName2=n1
  • familyName=n2, middleName=n1, givenName=n3, familyName2=n1
  • familyName2=n3, middleName=n1, givenName=n3, familyName=n1

Logic 5 - One match worth of 3 points and two matches worth of 1 point

  • givenName=n1, middleName=n3, (familyName=n1 or familyName2=n1)
  • middleName=n2, givenName=n3, (familyName=n1 or familyName2=n1)

Potential Approaches for a Lucene impelementation

Approach 1 - Point based solution Same Logic implementation in multiple SQL statements. For each variable=attribute statement a function is implemented that returns the matched patiend_ids plus the assigned matching points (1,2,3). In the end, id total points are added up and a filter is applied. Disadvantage is that 12 SQL statements are executed.

HashMap<int, int> person_id_to_points;
person_id_to_points += select_familyName(n1, 1);
person_id_to_points += select_familyName(n2, 2);
match= filter_persons_id_with_more_5_points(person_id_to_points)

Approach 2 Each logic block is executed in a seperate function. A lot of “AND” blocks have to be added to the Lucene query. The advantage is that less SQL statements are executed but each SQL statement becomes a lot more complex.

List person_id;
person_id += get_persons_matching_one_attribute_multiple_times(n1,n2,n3,n4);
person_id += get_persons_matching_a_three_and_two(n1,n2,n3,n4);

Questions:

  1. Is there an error in my interpretation of the business logic?
  2. What approach would be the best?