Soundex search in LuceneQueries

fruether · December 21, 2019, 6:21pm

Hi,

I am currently working on TRUNK-5680. The aim of the ticket is to refactor getSimilarPeople() to use Lucene queries.

The query inside the method relies heavily on soundex() within the SQL query, for example: or soundex(pname.middleName) = soundex(:n1)

Lucene is able to deal with soundex by using a PhoneticFilter.
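
For reference, here is a minimal sketch of the classic American Soundex encoding in plain Java, to illustrate the kind of matching that soundex() provides. It is an illustration only: the class name SoundexSketch is made up for this example, and neither the database's soundex() nor the encoder behind a Lucene phonetic filter is guaranteed to produce exactly these codes for every input.

```java
import java.util.Locale;

/** Hypothetical helper, not part of OpenMRS or Lucene. */
final class SoundexSketch {

    // One entry per letter A..Z: a digit is the Soundex class,
    // '0' marks vowels and Y, '-' marks H and W.
    private static final String CODES = "0123012-02245501262301-202";

    /** Returns a four-character code such as "R163", or "" if the input has no ASCII letters. */
    static String encode(String name) {
        StringBuilder out = new StringBuilder(4);
        char last = 0; // class of the previous letter that was not H or W
        for (char raw : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (raw < 'A' || raw > 'Z') {
                continue; // ignore anything that is not an ASCII letter
            }
            char code = CODES.charAt(raw - 'A');
            if (out.length() == 0) {
                out.append(raw); // the first letter is kept as-is
                last = code;
                continue;
            }
            if (code == '-') {
                continue; // H and W do not separate letters of the same class
            }
            if (code != '0' && code != last) {
                out.append(code);
                if (out.length() == 4) {
                    break;
                }
            }
            last = code; // a vowel resets 'last', so a repeated class is coded again
        }
        if (out.length() == 0) {
            return "";
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }
}
```

With this sketch, encode("McDonald") and encode("MacDonald") both give "M235", which is the sort of match the SQL condition above depends on.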

My current understanding is that the LuceneQueries of OpenMRS do not yet allow a Lucene query to use a phonetic filter, because the PhoneticFilterFactory is not part of the LuceneAnalyzerFactory.

My current assumption is:

There is currently no way to use soundex()-like functionality within the LuceneQueries in OpenMRS.

My approach would be:

Add a new Lucene SearchMapping to the factory class that adds the following filter to the definition: .filter(PhoneticFilterFactory.class)
Use that filter to reproduce the SQL query

Is there any feedback on my approach?

ibacher · December 22, 2019, 6:56pm

@fruether That sounds to me like the correct basic approach.

dkayiwa · December 22, 2019, 10:25pm

@fruether your approach makes sense to me. Just remember to add the appropriate tests.

mogoodrich · January 2, 2020, 5:13pm

fyi @bistenes did some work recently to add Soundex support for some of LuceneQueries… we should coordinate with what he did… @bistenes can you weigh in when you get a chance?

Thanks! Mark

bistenes · January 2, 2020, 5:29pm

Hi Fred, that looks like a good approach.

However, we’ll need to add to this plan if we want to get to parity with the current customizability. Presently there are a couple custom phonetic algorithms, and the active algorithm can be configured using a GP.

I took a crack at making Lucene analyzers configurable via GP a while ago, but couldn’t get past some Spring initialization problems (I think it introduced a cyclical dependency). But making it configurable using modules might be easier.

fruether · January 2, 2020, 9:48pm

Hi @bistenes (cc @mogoodrich), I hope you two had a good start to 2020.

Thanks for your reply. I assume you are referring to the following issue: https://issues.openmrs.org/browse/TRUNK-5669

Do you think the work on TRUNK-5680 should be stopped until 5669 is done and Global Properties are used? I already noticed the similarity and commented on the issue.

My idea was to first make the getSimilarPeople method use a phonetic filter and then, once this is done, think about making everything more modular. Basically, that would mean TRUNK-5680 focuses on adapting the PersonLuceneQuery class, which would be mostly independent of the GP settings?

bistenes · January 3, 2020, 4:43pm

Yes, TRUNK-5669 is exactly the one. Glad you found it, sorry I didn’t link it.

I agree with the iterative approach. I don’t think TRUNK-5680 should be stopped. I’m not even sure 5669 can be done. Ultimately, if it can’t be, I think it should be fine if phonetic filters are only configurable using code (i.e., not GP configurable).

fruether · January 5, 2020, 4:53pm

@raff @dkayiwa @bistenes do you already have a concept for supporting a between statement within the Lucene library?

Current situation is that I have to translate the following SQL statement into a PersonLuceneQuery:

String birthdayMatch = " (year(p.birthdate) between " + (birthyear - 1) + " and " + (birthyear + 1) + " or p.birthdate is null) ";

As far as I can tell, the current implementation only allows adding terms as a query. That is why I think we have the following two options:

1. Create a query string defined as follows: query = name + " AND person.birthdate:[birthyear-1 TO birthyear+1]"
2. Extend the TermsFilterFactory to also apply a TermRangeQuery. That would mean that the current concept has to be extended.

Do you have an opinion on whether 1 or 2 would be preferred, and whether either would work at all?

raff · January 6, 2020, 12:17pm

@fruether, filters and queries have distinct properties, see https://stackoverflow.com/a/3721135

With that in mind, 1) is the preferred way to have a range query on a birthdate. Caching such a dynamic filter does not make much sense. A query may be faster.
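
Option 1 could be sketched as plain string building, roughly as below. This is only an illustration under assumptions: the class and method names are hypothetical, the field name person.birthdate is taken from the post above, and whether bare years match anything depends on how the birthdate field is actually indexed, which the sketch does not address. It also does not cover the "or p.birthdate is null" branch of the SQL.

```java
/** Hypothetical helper for illustration; not part of OpenMRS. */
final class BirthYearRangeQuerySketch {

    /**
     * Appends an inclusive Lucene range clause covering birthYear - 1 to birthYear + 1.
     * The name part is wrapped in parentheses so that AND applies to all of it.
     */
    static String withBirthYearRange(String nameQuery, int birthYear) {
        return "(" + nameQuery + ") AND person.birthdate:["
            + (birthYear - 1) + " TO " + (birthYear + 1) + "]";
    }
}
```

For example, withBirthYearRange("McDonald", 1980) yields (McDonald) AND person.birthdate:[1979 TO 1981]. The square brackets make both bounds inclusive, which mirrors the SQL between.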

fruether · January 6, 2020, 7:07pm

Thank you @raff. Currently, when setting up the query, we do not set the field name explicitly like: familyName:query

Instead, we just parse the query, which is usually the plain search string without a column name: PersonLuceneQuery and then LuceneQuery

So if I want to add the range query, I would have to add the column to the query. Or would something like "McDonald AND person.birthdate:[birthyear-1 TO birthyear+1]" work, given that Lucene automatically associates the string “McDonald” with the specified fields?

raff · January 7, 2020, 10:12am

MultiFieldQueryParser respects explicit fields in a query thus “AND person.birthdate:…” should work. If a field is not specified then it uses fields declared for a parser. Not sure why you ask instead of simply testing it…
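
The behaviour described above can be pictured with a toy model in plain Java. This is not the Lucene parser and does not call it; the class name, the method and the example field names are invented for illustration. It only shows the idea that unqualified terms are spread over the default fields while explicitly qualified clauses are left alone.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of default-field expansion; hypothetical, not Lucene code. */
final class DefaultFieldExpansionSketch {

    static String expand(String query, List<String> defaultFields) {
        if (query.trim().isEmpty()) {
            return "";
        }
        List<String> out = new ArrayList<>();
        boolean inRange = false; // true while inside a [a TO b] range clause
        for (String token : query.trim().split("\\s+")) {
            boolean qualified = token.contains(":");
            boolean operator = token.equals("AND") || token.equals("OR") || token.equals("NOT");
            if (token.contains("[")) {
                inRange = true;
            }
            if (inRange || qualified || operator) {
                out.add(token); // explicit fields, operators and range bounds pass through
            } else {
                List<String> alternatives = new ArrayList<>();
                for (String field : defaultFields) {
                    alternatives.add(field + ":" + token);
                }
                out.add("(" + String.join(" ", alternatives) + ")");
            }
            if (token.contains("]")) {
                inRange = false;
            }
        }
        return String.join(" ", out);
    }
}
```

In this model, the query McDonald AND person.birthdate:[1979 TO 1981] with default fields givenName and familyName becomes (givenName:McDonald familyName:McDonald) AND person.birthdate:[1979 TO 1981].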

fruether · January 12, 2020, 2:44pm

@raff It was more about understanding the library better, and I could not find any details regarding this. But you are right, I should have tried it myself.

However, I am facing an issue with the phonetic filter where I am a bit lost. So let me briefly describe what I have done so far:

I added a Soundex filter to the LuceneFactory. I used the example definition from this documentation:

	mapping.analyzerDef(LuceneAnalyzers.SOUNDEX_ANALYZER, StandardTokenizerFactory.class)
		.filter(PhoneticFilterFactory.class)
			.param("encoder", "Soundex")
			.param("inject", "false");

I annotated the fields in PersonName accordingly. To give an example:

	@Fields({
			@Field(name = "familyName2Exact", analyzer = @Analyzer(definition = LuceneAnalyzers.EXACT_ANALYZER), boost = @Boost(4f)),
			.....
			@Field(name = "familyName2Soundex", analyzer =  @Analyzer(definition = LuceneAnalyzers.SOUNDEX_ANALYZER), boost = @Boost(8f))
	})
	private String familyName2;

As far as I can tell, the setup of the LuceneAnalyzer is complete after these two steps and it can be used within a search. Am I missing an adjustment? Does anything look odd in the code above?
The next step is to implement the actual search function. The code basically looks like the following:

		List<String> fields = new ArrayList<>(
			Arrays.asList("familyNameSoundex", "familyName2Soundex", "middleNameSoundex", "givenNameSoundex"));
		LuceneQuery<PersonName> luceneQuery = LuceneQuery
			.newQuery(PersonName.class, sessionFactory.getCurrentSession(), query, fields).useOrQueryParser();
		luceneQuery.list();

The problem is that this query returns an empty result. That is wrong, since the test case getSimilarPeople_shouldMatchSearchToFamilyName2() of PersonServiceTest expects at least two matches. If I adapt the code to use familyName2Exact, it works. That is why my conclusion is that something is wrong with the analyzer setup. The search routine seems to be valid.

Does someone have an idea how to fix this issue? @bistenes @dkayiwa perhaps

raff · January 13, 2020, 9:33am

Quoting the doc you linked:

inject

(true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact spelling of the target word may not match.

It may explain why you get no results for exact match with this analyzer. Please consider matching your query against all analyzers for a field i.e. familyNameExact, familyNameSoundex, etc. and only adjusting the boost.
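
The difference between the two inject settings quoted above can be pictured with a small plain-Java model. It does not use Lucene: the class name, the method and the encoder passed in are hypothetical stand-ins, and token order and positions are not modelled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** Toy model of the inject parameter; hypothetical, not Lucene code. */
final class InjectSketch {

    /**
     * inject == false: every token is replaced by its phonetic code.
     * inject == true:  the phonetic code is added and the original token is kept as well.
     */
    static List<String> phoneticTokens(List<String> tokens, UnaryOperator<String> encoder, boolean inject) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(encoder.apply(token));
            if (inject) {
                out.add(token);
            }
        }
        return out;
    }
}
```

In this model, with inject set to false only the codes end up in the output, so a search term can only match if it is run through the same encoder first; the exact spelling on its own matches nothing.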

If you need further assistance please create a PR (even not working one) as it’s easier to look at the whole picture rather than snippets.

fruether · January 13, 2020, 8:25pm

Yes, I tried both true and false before. I retried it with inject explicitly set to true. However, it is still not working. You can find my code in the following PR: https://github.com/openmrs/openmrs-core/pull/3116

I want to confirm that the soundex() function in SQL gives the same result as the phonetic filter, to make sure that the search is working properly. That is why I would, for now, try to avoid using familyNameExact as well. Otherwise this filter would be implemented but might never really work properly, which could have side effects.

Once I know PhoneticFilter is working I am happy to also add the other analyzers.

fruether · March 4, 2020, 10:33pm

@raff @bistenes did either of you get an idea over the last few weeks as to why my code, more precisely my Soundex analyzer definition, does not seem to work as expected? I am still a bit lost.

PR: https://github.com/openmrs/openmrs-core/pull/3116

Error description here: Soundex search in LuceneQueries

bistenes · March 5, 2020, 6:43pm

Provided some very generic advice in the PR (I don’t really understand the Hibernate Search stuff deeply)

dkayiwa · March 5, 2020, 10:19pm

Are you simply looking for the cause of the QuerySyntax in the two failing tests?

bistenes · March 6, 2020, 3:27pm

Oh yeah – also make sure you’re reindexing the Lucene search index when manually testing on a server. This is needed each time you make a change to an analyzer/mapping.

fruether · March 7, 2020, 10:11am

reindexing the Lucene search index

I am reindexing in the test getSimilarPeople_shouldMatchSearchToFamilyName2() by using the following method. So I think this should not be the cause, or am I missing an important indexing step?

updateSearchIndex();

fruether · March 7, 2020, 10:22am

In short, @dkayiwa: I am looking for the reason why the tests are failing.

Long: The two tests are failing because the following method in HibernatePersonDAO.java returns an empty set.

This is because the Soundex filter, applied to the field familyName2Soundex and used in getSoundexPersonNameQuery(), does not work.

The same method works when I use the familyName2Exact field. So my conclusion is that the LuceneAnalyzer for Soundex does not seem to work, but I do not know why.

My expectation is that these two tests would pass once familyNameSoundex behaves like soundex() did in the SQL before the change.