Soundex search in LuceneQueries

fruether · March 7, 2020, 12:13pm

@dkayiwa @bistenes @raff good news. Thanks to your help I think I found the cause of the error. It seems to be case sensetive and hence if I add the LowerCaseFilterFactory before using Phonetic filter it seems to work. Or rather at least this first if statement seems to work as expected. Lets see if it works for the others:

		mapping.analyzerDef(LuceneAnalyzers.SOUNDEX_ANALYZER, StandardTokenizerFactory.class)
		.filter(ClassicFilterFactory.class)
		.filter(LowerCaseFilterFactory.class)
		.filter(PhoneticFilterFactory.class)
			.param("encoder", "Soundex")
			.param("inject", "true");

My issue where I may need the confirmation of a more experienced business analyst. The orginal query looked like that: soundex(pname.givenName) = soundex(:n1)

Do I change the intended behavior if I filter for lower cases?

Link: https://lucene.apache.org/solr/guide/7_2/filter-descriptions.html

fruether · March 9, 2020, 9:32pm

I got a new question regarding the business logic which I would like to validate before I pull the code into the PR. The reason for this is that it does not seem to be completely covered by PersonServiceTest.

The query that is executed in the getSimilarName query for two names n1 and n2 is part of the second if statement of HibernatePersonDAO.java and looks like the following:

     case  when pname.givenName is null then 1  
        when pname.givenName = '' then 1  
        when soundex(pname.givenName) = soundex(:n1) then 4
        when soundex(pname.givenName) = soundex(:n2) then 3  
        else 0  end
  +  
  case  when pname.middleName is null then 1
        when pname.middleName = '' then 1
        when soundex(pname.middleName) = soundex(:n1) then 3
        when soundex(pname.middleName) = soundex(:n2) then 4  else 0  end
  +  
  case  when pname.familyName is null then 1
        when pname.familyName = '' then 1
        when soundex(pname.familyName) = soundex(:n1) then 3
        when soundex(pname.familyName) = soundex(:n2) then 4  else 0  end
  + 
  case  when pname.familyName2 is null then 1
        when pname.familyName2 = '' then 1
        when soundex(pname.familyName2) = soundex(:n1) then 3
        when soundex(pname.familyName2) = soundex(:n2) then 4  else 0  end")
    .) 
  > 6");

I had a look to the logic and concluded that the following rules are implemented:

People who have a matching name n1 in at least 3 name elements namely middleName, familyName, familyName2 OR givenNamen matches n1 and one other name element.
People who have a matching name n2 in at least 2 names elements namely givenNamen, middleName, familyName, familyName2
People who have at least one name matching with n1 and one name matching with n2
People, who have only givenName set and it is matching n1
People who have one name element, that is not not givenName, matching n2 and the other names elements are empty

*name element: Either givenName, familyName, familyName2 or middleName

I have the following two points I would like to validate/discuss before creating the pull request with new tests and according changes:

Does someone see more or other business logic implemented in the query compared to my list?
I am not sure if 1. and 2. are really valid in terms of business logic. Sounds a bit odd.

dkayiwa · March 10, 2020, 9:14am

Before even thinking of creating new tests, do all existing tests pass with your changes?

fruether · March 10, 2020, 9:38am

Before I create a final PR I will make sure that all tests work. However, it’s a huge refactoring so I doubt it should be done without adding further tests. Especially since it does not look like the intended behavior is documented in the code comment.

Right now I work incrementally implementing first identified logic 1 and then in the end 6. The tests I created for 1 and 2 success with new and old code. I will probably have by Sunday the 4 missing logic parts implemented. Then I can answer your question finally.

fruether · March 14, 2020, 9:51pm

Logic 3 is not correct as I defined it above:

People who have at least one name matching with n1 and one name matching with n2

It should rather be:

3a. People whose first name matches n1 and second part of the name (givenName, familyName, familyName2) matches at least once n2

3b. People who have second part of the name(givenName, familyName, familyName2) matching n2 and n1 (at least once)

3c. People who have given_name matching n2, second name matching n1, and another part of second name matching “”

@dkayiwa @bistenes Seeing how complicated the logic is becoming even without representing the fix in logic 3, which btw is not reflected by the existing tests in PersonServiceTest, I am wondering if refactoring to Lucene makes the query really more readable PersonLuceneQuery.java

bistenes · March 16, 2020, 3:49pm

Awesome, glad to see such progress on this! And thanks for the additional investigation around the business logic.

Regarding the question of whether refactoring to Lucene makes the query more readable… it very well might not, but that’s not so much the point. Lucene provides much faster and more powerful text querying than MySQL possibly can. Hopefully as our code that uses it matures, we will get better at writing it more cleanly, and refactor it to be easier to read.

fruether · March 17, 2020, 10:26pm

@bistenes Thank you!

Unfortunately there still seems to be an issue with the PhoneticFilterFactory. The search logic is working but it does not seem like soundex() is really executed during the search. I think I know why that is the case.

@raff you were heavily involved in coding LuceneQuery so I hope I can get some advice from you. The issue lies in LuceneQuery.java. More precisely it seems like PersonName only uses the LuceneAnalyzers.EXACT_ANALYZER and does never touch the SoundexFilter I was defining.

Is there a specific reason why it is only using ExactAnalyzer. If not can I change it

What would be the prefered way to change it?

 protected MultiFieldQueryParser newMultipleFieldQueryParser(Collection<String> fields) {
 Analyzer analyzer;
 if (getType().isAssignableFrom(PatientIdentifier.class) || getType().isAssignableFrom(PersonName.class) || getType().isAssignableFrom(PersonAttribute.class)) {
 	analyzer = getFullTextSession().getSearchFactory().getAnalyzer(LuceneAnalyzers.EXACT_ANALYZER);
 } else {
 	analyzer = getFullTextSession().getSearchFactory().getAnalyzer(getType());
 }
 MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields.toArray(new String[fields.size()]), analyzer);

 setDefaultOperator(queryParser);
 return queryParser;

}

Edit: I have to admit that I am maximized confused why the change of the Soundex_Analyzer namely setting it to “.filter(LowerCaseFilterFactory.class)” as it was suggested on the internet actually has had an affect. I can only assume that the fields, e. g. familyNameSoundex, still got the filter executed. However, it does not seem like PhoneticFilter did change their values…

fruether · March 18, 2020, 10:38pm

It is straightforward to actually make it work. Since the following part of the if-condition has to be dropped.

|| getType().isAssignableFrom(PersonName.class)

Howerver that leads to make the following test fail:

getPeople_shouldMatchSearchToFamilyName2()

In fact, I do not know why it fails it detail after the change. However, the difference is that due to the fact that the ExactAnalyzer is not used it also matches the beginning of Jonny

Options I see:

Drop that patient from the xml file or giving it a different name. So the test cases would never have to change
Accept that the search has been inaccurate in the past and change all the test cases accordingly
Adapt the code so that it will use the LuceneAnalyzers.EXACT_ANALYZER by default for search queries that look for a PersonName on a not soundex base.

Who can answer this?

standardTestDataset

Of cause that leads to a lot more failed tests:

Failed tests: PatientServiceTest.getPatients_shouldSearchFamilyName2WithName:762 expected:<3> but was:<4> PatientDAOTest.getCountOfPatients_shouldObeyAttributeMatchMode:1983 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldGetCloseIdentifiersWithCorrectStartPhrase:2259 expected:<2> but was:<4> PatientDAOTest.getPatients_shouldGetPatientByFamily2Name_SignatureNo2:1218 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldGetPatientsWithMatchModeAnywhere_SignatureNo1:1020 expected:<2> but was:<5> PatientDAOTest.getPatients_shouldGetPatientsWithMatchModeAnywhere_SignatureNo2:1744 expected:<2> but was:<5> PatientDAOTest.getPatients_shouldGetVoidedPersonWhenVoidedTrueIsPassed:2001 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldNotGetCloseIdentifiersWithWrongStartPhrase:2291 expected:<1> but was:<3> PatientDAOTest.getPatients_shouldNotGetExcessPatientsOnIdentifierAndNameMatch:1644 expected:<3> but was:<4> PatientDAOTest.getPatients_shouldNotGetPatientByNonexistingSingleName_SignatureNo1:822 expected:<0> but was:<1> PatientDAOTest.getPatients_shouldNotGetPatientByNonexistingSingleName_SignatureNo2:1239 expected:<0> but was:<1> PatientDAOTest.getPatients_shouldNotMatchVoidedPatients:674 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldNotMatchVoidedPatients_SignatureNo1:719 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldReturnDistinctPatientList_SignatureNo2:1697 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldReturnOnlyPatients:2120 Expected: is an empty collection but: <[Patient#46, Patient#78]> HibernatePersonDAOTest.getPeople_shouldGetMultiplePeopleByFamilyName:481 expected:<2> but was:<3> HibernatePersonDAOTest.getPeople_shouldGetMultiplePeopleByMultipleNameParts:533 expected:<5> but was:<7> HibernatePersonDAOTest.getPeople_shouldGetMultiplePeopleByNameAndAttribute:403 expected:<2> but was:<8> HibernatePersonDAOTest.getPeople_shouldGetNoOneByNonexistingNameAndNonsearchableAttribute:351 expected:<0> but was:<1> HibernatePersonDAOTest.getPeople_shouldGetNoOneByNonexistingNameAndVoidedAttribute:363 expected:<0> but was:<1> HibernatePersonDAOTest.getPeople_shouldGetNoOneByNonsearchableAttribute:161 expected:<0> but was:<2> HibernatePersonDAOTest.getPeople_shouldGetNoOneByVoidedAttribute:174 expected:<0> but was:<1> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByAttribute:189 expected:<1> but was:<5> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByGivenName:417 expected:<1> but was:<2> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByNameAndAttribute:375 expected:<1> but was:<5> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByNameAndVoidedAttribute:388 expected:<1> but was:<2> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByRandomCaseAttribute:205 expected:<1> but was:<5> HibernatePersonDAOTest.getPeople_shouldGetSingleDeadPerson:588 expected:<1> but was:<2> HibernatePersonDAOTest.getPeople_shouldGetVoidedByVoidedNameWhenVoidedIsTrue:555 expected:<1> but was:<2> HibernatePersonDAOTest.getPeople_shouldNotGetDeadPerson:577 expected:<0> but was:<1>

fruether · March 22, 2020, 7:12pm

@dkayiwa @bistenes PR for one and two names search can be found here: https://github.com/openmrs/openmrs-core/pull/3152

Subtask is here: https://issues.openmrs.org/browse/TRUNK-5724

fruether · March 30, 2020, 8:09pm

@dkayiwa it seems like @ibacher and I need a bit more Lucene support to validate if my PR actually is following a valid approch. Do you perhaps have a suggestion who can help answering the questions and choices: https://github.com/openmrs/openmrs-core/pull/3152/commits/4a54a1953af632db5bb21b754bebe02c1aea6349#diff-8418a7dc796909a947781080e9b24f83R88

The question is related to what was discussed earlier in the topic

dkayiwa · March 30, 2020, 10:23pm

@fruether i have followed the discussions you have had on github with @ibacher and it does not look like you have any un answered question. Do you?

fruether · March 31, 2020, 7:45pm

We found a particular solution for the addressed point. However, for me at least a certain kind of uncertainty remains touching a core functionality of the project

dkayiwa · March 31, 2020, 7:59pm

@ibacher invests a significant amount of time in this, and he is a /dev/5. So when you get a go ahead from him, you should feel safe!

fruether · April 19, 2020, 2:23pm

Similar to the previous thread I would like to document the business logic in this thread to make the debugging and review easier:

Logic in the SQL statement:

case.  when pname.givenName is null then 0.
       when soundex(pname.givenName) = soundex(:n1) then 3.
       when soundex(pname.givenName) = soundex(:n2) then 2.
       when soundex(pname.givenName) = soundex(:n3) then 1.  else 0 . end. 
+ 

case. when pname.middleName is null then 0.
      when soundex(pname.middleName) = soundex(:n1) then 2.
      when soundex(pname.middleName) = soundex(:n2) then 3.
      when soundex(pname.middleName) = soundex(:n3) then 1.  else 0. end.
        + 
case. when pname.familyName is null then 0.
      when soundex(pname.familyName) = soundex(:n1) then 1.
      when soundex(pname.familyName) = soundex(:n2) then 2.
      when soundex(pname.familyName) = soundex(:n3) then 3.  else 0. end.
      +
case.  when pname.familyName2 is null then 0.
      when soundex(pname.familyName2) = soundex(:n1) then 1.
      when soundex(pname.familyName2) = soundex(:n2) then 2.
      when soundex(pname.familyName2) = soundex(:n3) then 3.  else 0. end.
 >= 5;

Logic 1: One variable (e. g. n1) matches multiple names so that count goes over 5. This happens with an expected match (n1 = givenName) and mutiple 1 and 2 add upp:

givenName=n1 and ((familyName= n1 and familyName2=n1) or middleName=n1)
middleName=n2 and (n2= familyName or familyName2=n2)
familyName=n3 and givenName=n3 and middleName=n3
familyName2=n3 and givenName=n3 and middleName=n3

Logic 2 There are two matches with a 3:

There exists two true returns in GivenName(N1), MiddleName(n2), FamilyName(N3), FamilyName2(N3)

Logic 3 There exist a match in the form of a 2 and 3 hit

givenName = n1 and familyName=n2
givenName = n1 and familyName=n3
familyName = n3 and middleName = n2
familyName2 = n3 and middleName = n2

Logic 4: There exist one almost correct attribute match worth of 2 points and three matches worth one point

givenName=n2, middleName=n3, familyName=n1, familyName2=n1
middleName=n3, givenName=n3, familyName=n1, familyName2=n1
familyName=n2, middleName=n1, givenName=n3, familyName2=n1
familyName2=n3, middleName=n1, givenName=n3, familyName=n1

Logic 5 - One match worth of 3 points and two matches worth of 1 point

givenName=n1, middleName=n3, (familyName=n1 or familyName2=n1)
middleName=n2, givenName=n3, (familyName=n1 or familyName2=n1)

Potential Approaches for a Lucene impelementation

Approach 1 - Point based solution Same Logic implementation in multiple SQL statements. For each variable=attribute statement a function is implemented that returns the matched patiend_ids plus the assigned matching points (1,2,3). In the end, id total points are added up and a filter is applied. Disadvantage is that 12 SQL statements are executed.

HashMap<int, int> person_id_to_points;
person_id_to_points += select_familyName(n1, 1);
person_id_to_points += select_familyName(n2, 2);
match= filter_persons_id_with_more_5_points(person_id_to_points)

Approach 2 Each logic block is executed in a seperate function. A lot of “AND” blocks have to be added to the Lucene query. The advantage is that less SQL statements are executed but each SQL statement becomes a lot more complex.

List person_id;
person_id += get_persons_matching_one_attribute_multiple_times(n1,n2,n3,n4);
person_id += get_persons_matching_a_three_and_two(n1,n2,n3,n4);

Questions:

Is there an error in my interpretation of the business logic?
What approach would be the best?

ibacher · April 20, 2020, 11:16pm

It’s unclear to me from your outline if this falls under approach 2, but my preference would be to utilise Lucene’s scoring to boost matching the first name part to the first name, the second name part to the second name part to the second name, etc. and then favouring exact matches for any field or partial matches for any field.

That may not exactly duplicate the logic, but at the point where we’re doing fuzzy matching, I would prioritise getting a reasonable set of fuzzy matches as efficiently as possible over exactly duplicating current system behaviour. Otherwise, we lose a lot of the advantages of moving to Lucene. At the same time, though, we don’t want an algorithm completely unrelated to the current implementation, since it seems to work well enough for most cases.

fruether · April 22, 2020, 9:27pm

Thanks for the details and approach.

I would have to think and investigate how to implement that kind of score with Lucene. In fact, I am not sure how I could set a boost for e.g. givenName=n1, middleName=n2, … on query condition level

So I can’t say for now if that would be possible easily and what consequences it has. I have to dig deeper in to this idea…

It seems like it can be done as part of the Query (How to boost Lucene queries with field values? - Operations - Neo4j Online Community). I guess that would rather be a way of implementation in approach 2. I have to look at it more in depth

fruether · April 26, 2020, 10:25am

My research indicated that no minimum score can be set in Lucene so a logic like >5 is not possible.

If we are fine with the fact that all documents/persons appear in the result who have a match of one argument in the list of names then the following line of code should work and roughly bring the patient in an order depending on how good they match.

(givenName:n1^3 OR   givenName:n2^2 OR   givenName:n1) 
OR
(middleName:n1^2 OR   middleName:n2^3 OR   middleName:n1) 
OR
(familyName:n1 OR   familyName:n2^2 OR   familyName:n1^3) 
OR
(familyName2:n1 OR   familyName2:n2^2 OR   familyName2:n1^3)

This will lead to the fact that a lot more persons are found. In fact, everything after position 4 would not normally appear as a result in the old code. The ordering would be the following:

Exact matches (boost 3) in the fields and argument combination (firstName=n1, middlenName=n2, …)
Exact matches (boost 3) and close matches (boost 2) in combination
Several close matches of boost 2 (more than 1)
Elements that have only matches in least favored attribute (boost 1) (4 are required)
Element that has one exact match (boost 3)
Three elements match in least favored attribute (boost 1)
One close match of boost 2
One match of element

@ibacher Any suggestions or remarks? Otherwise I could try to implement it like this for the 3 argument search

ibacher · April 26, 2020, 8:09pm

I like this! And I’m perfectly happy with us adding results that aren’t found. One question: is there any way we can de-prioritize results for familyName2 a bit relative to the other results?

fruether · May 24, 2020, 1:30pm

I am trying to get a grip on the Lucene boost calculations. What I understood so far is:

Documents (so I guess patients in our case) that have more occurrences of a given term receive a higher score. → This should always be one since a term is sth like given_name:test which just could be set once
Rarer terms give higher contribution to the total score.
Typically, a document (patients) that contains more of the query’s terms will receive a higher score than another document with fewer query terms

Source

I have the following two questions:

The return type of the function is of type Set. This collection does not have any order guarantee in terms of equal to insertion oder? I assume @ibacher @dkayiwa that the return type should either be a TreeSet<> or a List<> if we want to use boost to have an ordered the result. If you agree I will create a new issue on jira to adapt this.
I implemented the boost similar to what we defined in the previous post. I would get the following result (as List) which is just partly as expected (see below). How much of an investigation to understand the ordering should be performed by me? Is the retrieved order sufficient:

Order of result:

Person.id → My boost expectation → Result of the added boost → My concern

1007 → 3x3x3x0 → 9

1006 → 1 x 2 x 2 x 0 → 5 → WHY → Because Jazayeri appears just a few time and we got three matches

1003 → 3x3x0x0 → 6

1004 → 2x3x0 → 5

1005 → 2 x 1 → 3 —> WHY

1009 → 2 x 1 x 1 → 4 → WHY - Lower due to weak middle_name=Darius I assume

1013 → 2 X 2 → 4 → WHY

1011 → 2x3 → 5

1012 → 3x2 → 5

1000 → 3 → 3

1008 → 3

1001 → 2 → 2

1010 → 3 → 3

Term Frequency

1005: family_name=“Darius” =3, given_name=“Graham” = 3

1009: middle_name=“Darius%” = 9,

1013: family_name="Graham = 2, family_name2=“Graham%” = 1

Test Data

<id="1003" given_name="Darius" middle_name="Graham" family_name="" />
<id="1007" given_name="Darius" middle_name="Graham" family_name="Jazayeri">
<id="1004" given_name="Graham" middle_name="Darius" family_name="">
<id="1006" given_name="Jazayeri" middle_name="Darius" family_name="Graham"/>
<id="1011" given_name="Graham2" middle_name="Graham2" family_name="" family_name2=""/>
<id="1012" given_name="Darius2" middle_name="Darius2" family_name=""  family_name2="">
<id="1000" given_name="Darius" middle_name="" family_name="">
<id="1008" given_name="Darius" middle_name="With" family_name="SomeOtherName">
<id="1005" given_name="Graham" middle_name="" family_name="Darius">
<id="1001" given_name="" middle_name="Darius" family_name="">
<id="1010" given_name="" middle_name="Graham3" family_name="" family_name2="" >
<id="1009" given_name="" middle_name="Darius2" family_name="Darius2"  family_name2="Darius2"/>
<id="1013" given_name="" middle_name="" family_name="Graham2" family_name2="Graham2">
<id="1002" given_name="" middle_name="" family_name="Darius"/>

ibacher · May 26, 2020, 2:00pm

I would think the correct thing to do is to either transform the list returned into a LinkedHashSet (which maintains insertion order) or use the streaming API to do List.stream().distinct().collect(Collectors.toList()). That said, if there is a way to guarantee that a Lucene search only returns each “document” (i.e., patient) once, then we can simply return the list returned from Lucene.

A TreeSet is the wrong way to go, as it ends up re-sorting the set in Java code, which I think we want to avoid.

One thought: would it make sense to make the boosts a bit higher, something like:

(givenName:n1^6 OR   givenName:n2^3 OR   givenName:n1) 
OR
(middleName:n1^3 OR   middleName:n2^6OR   middleName:n1) 
OR
(familyName:n1 OR   familyName:n2^3 OR   familyName:n1^6) 
OR
(familyName2:n1 OR   familyName2:n2^3 OR   familyName2:n1^6)

I.e., to somewhat mitigate the effects of more occurrences receiving a higher score by boosting the matches in the “appropriate” field to count more than that (these numbers may need to be adjusted).