Soundex search in LuceneQueries

Are you simply looking for the cause of the QuerySyntax in the two failing tests?

Oh yeah – also make sure you’re reindexing the Lucene search index when manually testing on a server. This is needed each time you make a change to an analyzer/mapping.

reindexing the Lucene search index

I am reindexing in the test getSimilarPeople_shouldMatchSearchToFamilyName2() by using the following method. So I think this should not be the cause or am I missing an important step of indexing?


In short: The reason why the tests are failing @dkayiwa

Long: The two test are failing because the following method does return an empty set.

This is the case because the Soundex() filter, specified as field familyName2Soundex., does not work which is executed in getSoundexPersonNameQuery()

The same method is working when I use familyName2Exact filters. So my conclusion is that the LuceneAnalyzer for soundex does not seem to work. But I do not know why?

My expectations would be that these two test are valid when familyNameSoundex acts like soundex() in the SQL as it was previous to the change

@dkayiwa @bistenes @raff good news. Thanks to your help I think I found the cause of the error. It seems to be case sensetive and hence if I add the LowerCaseFilterFactory before using Phonetic filter it seems to work. Or rather at least this first if statement seems to work as expected. Lets see if it works for the others:

		mapping.analyzerDef(LuceneAnalyzers.SOUNDEX_ANALYZER, StandardTokenizerFactory.class)
			.param("encoder", "Soundex")
			.param("inject", "true");

My issue where I may need the confirmation of a more experienced business analyst. The orginal query looked like that: soundex(pname.givenName) = soundex(:n1)

Do I change the intended behavior if I filter for lower cases?


I got a new question regarding the business logic which I would like to validate before I pull the code into the PR. The reason for this is that it does not seem to be completely covered by PersonServiceTest.

The query that is executed in the getSimilarName query for two names n1 and n2 is part of the second if statement of and looks like the following:

     case  when pname.givenName is null then 1  
        when pname.givenName = '' then 1  
        when soundex(pname.givenName) = soundex(:n1) then 4
        when soundex(pname.givenName) = soundex(:n2) then 3  
        else 0  end
  case  when pname.middleName is null then 1
        when pname.middleName = '' then 1
        when soundex(pname.middleName) = soundex(:n1) then 3
        when soundex(pname.middleName) = soundex(:n2) then 4  else 0  end
  case  when pname.familyName is null then 1
        when pname.familyName = '' then 1
        when soundex(pname.familyName) = soundex(:n1) then 3
        when soundex(pname.familyName) = soundex(:n2) then 4  else 0  end
  case  when pname.familyName2 is null then 1
        when pname.familyName2 = '' then 1
        when soundex(pname.familyName2) = soundex(:n1) then 3
        when soundex(pname.familyName2) = soundex(:n2) then 4  else 0  end")
  > 6");

I had a look to the logic and concluded that the following rules are implemented:

  1. People who have a matching name n1 in at least 3 name elements namely middleName, familyName, familyName2 OR givenNamen matches n1 and one other name element.
  2. People who have a matching name n2 in at least 2 names elements namely givenNamen, middleName, familyName, familyName2
  3. People who have at least one name matching with n1 and one name matching with n2
  4. People, who have only givenName set and it is matching n1
  5. People who have one name element, that is not not givenName, matching n2 and the other names elements are empty

*name element: Either givenName, familyName, familyName2 or middleName

I have the following two points I would like to validate/discuss before creating the pull request with new tests and according changes:

  1. Does someone see more or other business logic implemented in the query compared to my list?
  2. I am not sure if 1. and 2. are really valid in terms of business logic. Sounds a bit odd.

Before even thinking of creating new tests, do all existing tests pass with your changes?

Before I create a final PR I will make sure that all tests work. However, it’s a huge refactoring so I doubt it should be done without adding further tests. Especially since it does not look like the intended behavior is documented in the code comment.

Right now I work incrementally implementing first identified logic 1 and then in the end 6. The tests I created for 1 and 2 success with new and old code. I will probably have by Sunday the 4 missing logic parts implemented. Then I can answer your question finally.

Logic 3 is not correct as I defined it above:

  1. People who have at least one name matching with n1 and one name matching with n2

It should rather be:

3a. People whose first name matches n1 and second part of the name (givenName, familyName, familyName2) matches at least once n2

3b. People who have second part of the name(givenName, familyName, familyName2) matching n2 and n1 (at least once)

3c. People who have given_name matching n2, second name matching n1, and another part of second name matching “”

@dkayiwa @bistenes Seeing how complicated the logic is becoming even without representing the fix in logic 3, which btw is not reflected by the existing tests in PersonServiceTest, I am wondering if refactoring to Lucene makes the query really more readable

Awesome, glad to see such progress on this! And thanks for the additional investigation around the business logic.

Regarding the question of whether refactoring to Lucene makes the query more readable… it very well might not, but that’s not so much the point. Lucene provides much faster and more powerful text querying than MySQL possibly can. Hopefully as our code that uses it matures, we will get better at writing it more cleanly, and refactor it to be easier to read.

1 Like

@bistenes Thank you!

Unfortunately there still seems to be an issue with the PhoneticFilterFactory. The search logic is working but it does not seem like soundex() is really executed during the search. I think I know why that is the case.

@raff you were heavily involved in coding LuceneQuery so I hope I can get some advice from you. The issue lies in More precisely it seems like PersonName only uses the LuceneAnalyzers.EXACT_ANALYZER and does never touch the SoundexFilter I was defining.

  1. Is there a specific reason why it is only using ExactAnalyzer. If not can I change it

  2. What would be the prefered way to change it?

     protected MultiFieldQueryParser newMultipleFieldQueryParser(Collection<String> fields) {
     Analyzer analyzer;
     if (getType().isAssignableFrom(PatientIdentifier.class) || getType().isAssignableFrom(PersonName.class) || getType().isAssignableFrom(PersonAttribute.class)) {
     	analyzer = getFullTextSession().getSearchFactory().getAnalyzer(LuceneAnalyzers.EXACT_ANALYZER);
     } else {
     	analyzer = getFullTextSession().getSearchFactory().getAnalyzer(getType());
     MultiFieldQueryParser queryParser = new MultiFieldQueryParser(fields.toArray(new String[fields.size()]), analyzer);
     return queryParser;


Edit: I have to admit that I am maximized confused why the change of the Soundex_Analyzer namely setting it to “.filter(LowerCaseFilterFactory.class)” as it was suggested on the internet actually has had an affect. I can only assume that the fields, e. g. familyNameSoundex, still got the filter executed. However, it does not seem like PhoneticFilter did change their values…

It is straightforward to actually make it work. Since the following part of the if-condition has to be dropped.

|| getType().isAssignableFrom(PersonName.class)

Howerver that leads to make the following test fail:


In fact, I do not know why it fails it detail after the change. However, the difference is that due to the fact that the ExactAnalyzer is not used it also matches the beginning of Jonny

Options I see:

  1. Drop that patient from the xml file or giving it a different name. So the test cases would never have to change
  2. Accept that the search has been inaccurate in the past and change all the test cases accordingly
  3. Adapt the code so that it will use the LuceneAnalyzers.EXACT_ANALYZER by default for search queries that look for a PersonName on a not soundex base.

Who can answer this?


Of cause that leads to a lot more failed tests:

Failed tests: PatientServiceTest.getPatients_shouldSearchFamilyName2WithName:762 expected:<3> but was:<4> PatientDAOTest.getCountOfPatients_shouldObeyAttributeMatchMode:1983 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldGetCloseIdentifiersWithCorrectStartPhrase:2259 expected:<2> but was:<4> PatientDAOTest.getPatients_shouldGetPatientByFamily2Name_SignatureNo2:1218 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldGetPatientsWithMatchModeAnywhere_SignatureNo1:1020 expected:<2> but was:<5> PatientDAOTest.getPatients_shouldGetPatientsWithMatchModeAnywhere_SignatureNo2:1744 expected:<2> but was:<5> PatientDAOTest.getPatients_shouldGetVoidedPersonWhenVoidedTrueIsPassed:2001 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldNotGetCloseIdentifiersWithWrongStartPhrase:2291 expected:<1> but was:<3> PatientDAOTest.getPatients_shouldNotGetExcessPatientsOnIdentifierAndNameMatch:1644 expected:<3> but was:<4> PatientDAOTest.getPatients_shouldNotGetPatientByNonexistingSingleName_SignatureNo1:822 expected:<0> but was:<1> PatientDAOTest.getPatients_shouldNotGetPatientByNonexistingSingleName_SignatureNo2:1239 expected:<0> but was:<1> PatientDAOTest.getPatients_shouldNotMatchVoidedPatients:674 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldNotMatchVoidedPatients_SignatureNo1:719 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldReturnDistinctPatientList_SignatureNo2:1697 expected:<1> but was:<2> PatientDAOTest.getPatients_shouldReturnOnlyPatients:2120 Expected: is an empty collection but: <[Patient#46, Patient#78]> HibernatePersonDAOTest.getPeople_shouldGetMultiplePeopleByFamilyName:481 expected:<2> but was:<3> HibernatePersonDAOTest.getPeople_shouldGetMultiplePeopleByMultipleNameParts:533 expected:<5> but was:<7> HibernatePersonDAOTest.getPeople_shouldGetMultiplePeopleByNameAndAttribute:403 expected:<2> but was:<8> HibernatePersonDAOTest.getPeople_shouldGetNoOneByNonexistingNameAndNonsearchableAttribute:351 expected:<0> but was:<1> HibernatePersonDAOTest.getPeople_shouldGetNoOneByNonexistingNameAndVoidedAttribute:363 expected:<0> but was:<1> HibernatePersonDAOTest.getPeople_shouldGetNoOneByNonsearchableAttribute:161 expected:<0> but was:<2> HibernatePersonDAOTest.getPeople_shouldGetNoOneByVoidedAttribute:174 expected:<0> but was:<1> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByAttribute:189 expected:<1> but was:<5> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByGivenName:417 expected:<1> but was:<2> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByNameAndAttribute:375 expected:<1> but was:<5> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByNameAndVoidedAttribute:388 expected:<1> but was:<2> HibernatePersonDAOTest.getPeople_shouldGetOnePersonByRandomCaseAttribute:205 expected:<1> but was:<5> HibernatePersonDAOTest.getPeople_shouldGetSingleDeadPerson:588 expected:<1> but was:<2> HibernatePersonDAOTest.getPeople_shouldGetVoidedByVoidedNameWhenVoidedIsTrue:555 expected:<1> but was:<2> HibernatePersonDAOTest.getPeople_shouldNotGetDeadPerson:577 expected:<0> but was:<1>

1 Like

@dkayiwa @bistenes PR for one and two names search can be found here:

Subtask is here:

1 Like

@dkayiwa it seems like @ibacher and I need a bit more Lucene support to validate if my PR actually is following a valid approch. Do you perhaps have a suggestion who can help answering the questions and choices:

The question is related to what was discussed earlier in the topic

@fruether i have followed the discussions you have had on github with @ibacher and it does not look like you have any un answered question. Do you?

We found a particular solution for the addressed point. However, for me at least a certain kind of uncertainty remains touching a core functionality of the project

@ibacher invests a significant amount of time in this, and he is a /dev/5. So when you get a go ahead from him, you should feel safe!

1 Like

Similar to the previous thread I would like to document the business logic in this thread to make the debugging and review easier:

Logic in the SQL statement:

case.  when pname.givenName is null then 0.
       when soundex(pname.givenName) = soundex(:n1) then 3.
       when soundex(pname.givenName) = soundex(:n2) then 2.
       when soundex(pname.givenName) = soundex(:n3) then 1.  else 0 . end. 

case. when pname.middleName is null then 0.
      when soundex(pname.middleName) = soundex(:n1) then 2.
      when soundex(pname.middleName) = soundex(:n2) then 3.
      when soundex(pname.middleName) = soundex(:n3) then 1.  else 0. end.
case. when pname.familyName is null then 0.
      when soundex(pname.familyName) = soundex(:n1) then 1.
      when soundex(pname.familyName) = soundex(:n2) then 2.
      when soundex(pname.familyName) = soundex(:n3) then 3.  else 0. end.
case.  when pname.familyName2 is null then 0.
      when soundex(pname.familyName2) = soundex(:n1) then 1.
      when soundex(pname.familyName2) = soundex(:n2) then 2.
      when soundex(pname.familyName2) = soundex(:n3) then 3.  else 0. end.
 >= 5;

Logic 1: One variable (e. g. n1) matches multiple names so that count goes over 5. This happens with an expected match (n1 = givenName) and mutiple 1 and 2 add upp:

  • givenName=n1 and ((familyName= n1 and familyName2=n1) or middleName=n1)
  • middleName=n2 and (n2= familyName or familyName2=n2)
  • familyName=n3 and givenName=n3 and middleName=n3
  • familyName2=n3 and givenName=n3 and middleName=n3

Logic 2 There are two matches with a 3:

  • There exists two true returns in GivenName(N1), MiddleName(n2), FamilyName(N3), FamilyName2(N3)

Logic 3 There exist a match in the form of a 2 and 3 hit

  • givenName = n1 and familyName=n2
  • givenName = n1 and familyName=n3
  • familyName = n3 and middleName = n2
  • familyName2 = n3 and middleName = n2

Logic 4: There exist one almost correct attribute match worth of 2 points and three matches worth one point

  • givenName=n2, middleName=n3, familyName=n1, familyName2=n1
  • middleName=n3, givenName=n3, familyName=n1, familyName2=n1
  • familyName=n2, middleName=n1, givenName=n3, familyName2=n1
  • familyName2=n3, middleName=n1, givenName=n3, familyName=n1

Logic 5 - One match worth of 3 points and two matches worth of 1 point

  • givenName=n1, middleName=n3, (familyName=n1 or familyName2=n1)
  • middleName=n2, givenName=n3, (familyName=n1 or familyName2=n1)

Potential Approaches for a Lucene impelementation

Approach 1 - Point based solution Same Logic implementation in multiple SQL statements. For each variable=attribute statement a function is implemented that returns the matched patiend_ids plus the assigned matching points (1,2,3). In the end, id total points are added up and a filter is applied. Disadvantage is that 12 SQL statements are executed.

HashMap<int, int> person_id_to_points;
person_id_to_points += select_familyName(n1, 1);
person_id_to_points += select_familyName(n2, 2);
match= filter_persons_id_with_more_5_points(person_id_to_points)

Approach 2 Each logic block is executed in a seperate function. A lot of “AND” blocks have to be added to the Lucene query. The advantage is that less SQL statements are executed but each SQL statement becomes a lot more complex.

List person_id;
person_id += get_persons_matching_one_attribute_multiple_times(n1,n2,n3,n4);
person_id += get_persons_matching_a_three_and_two(n1,n2,n3,n4);


  1. Is there an error in my interpretation of the business logic?
  2. What approach would be the best?

It’s unclear to me from your outline if this falls under approach 2, but my preference would be to utilise Lucene’s scoring to boost matching the first name part to the first name, the second name part to the second name part to the second name, etc. and then favouring exact matches for any field or partial matches for any field.

That may not exactly duplicate the logic, but at the point where we’re doing fuzzy matching, I would prioritise getting a reasonable set of fuzzy matches as efficiently as possible over exactly duplicating current system behaviour. Otherwise, we lose a lot of the advantages of moving to Lucene. At the same time, though, we don’t want an algorithm completely unrelated to the current implementation, since it seems to work well enough for most cases.

Thanks for the details and approach.

I would have to think and investigate how to implement that kind of score with Lucene. In fact, I am not sure how I could set a boost for e.g. givenName=n1, middleName=n2, … on query condition level

So I can’t say for now if that would be possible easily and what consequences it has. I have to dig deeper in to this idea…

It seems like it can be done as part of the Query ( I guess that would rather be a way of implementation in approach 2. I have to look at it more in depth

1 Like