Soundex search in LuceneQueries

Hi,

I am currently working on TRUNK-5680. The aim of the ticket is to refactor getSimilarPeople() to use Lucene queries.

The query inside the method relies heavily on soundex() within the SQL query, for example: or soundex(pname.middleName) = soundex(:n1)
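For readers unfamiliar with what soundex() compares, here is a minimal plain-Java sketch of the classic American Soundex rules. It only illustrates why two differently spelled names can be equal under soundex(a) = soundex(b). The class name SoundexSketch is made up for this illustration. It is neither the database's soundex() nor the encoder behind Lucene's PhoneticFilter, and those implementations may differ in details such as output length.

```java
import java.util.Locale;

class SoundexSketch {

    // Digit for each letter A..Z; '0' marks letters that carry no code (vowels, H, W, Y).
    private static final String CODES = "01230120022455012623010202";

    static String soundex(String name) {
        StringBuilder out = new StringBuilder();
        char last = '0';
        for (char c : name.toUpperCase(Locale.ROOT).toCharArray()) {
            if (c < 'A' || c > 'Z') {
                continue; // ignore anything that is not a plain Latin letter
            }
            char code = CODES.charAt(c - 'A');
            if (out.length() == 0) {
                out.append(c); // the first letter is kept as-is
                last = code;
            } else if (c != 'H' && c != 'W') {
                // H and W are skipped without separating equal codes; vowels do separate them
                if (code != '0' && code != last) {
                    out.append(code);
                }
                last = code;
            }
            if (out.length() == 4) {
                break;
            }
        }
        while (out.length() < 4) {
            out.append('0'); // pad short codes with zeros
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(soundex("Robert")); // R163
        System.out.println(soundex("Rupert")); // R163
    }
}
```

Both names map to the same code, which is the kind of equality the SQL condition tests per name field.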

Lucene is able to deal with soundex by using a PhoneticFilter.

My current understanding is that the LuceneQueries of OpenMRS cannot yet use a phonetic filter in a Lucene query, because the PhoneticFilterFactory is not part of the LuceneAnalyzeFactory.

My current assumption is:

  • There is currently no way to use soundex()-like functionality within the LuceneQueries in OpenMRS

My approach would be:

  • Add a new Lucene SearchMapping to the Factory class that adds the following filter to the definition: .filter(PhoneticFilterFactory.class)
  • Use that filter to map the SQL query

Is there any feedback on my approach?

@fruether That sounds to me like the correct basic approach.

@fruether your approach makes sense to me. Just remember to add the appropriate tests.

FYI, @bistenes did some work recently to add Soundex support for some of the LuceneQueries… we should coordinate with what he did… @bistenes can you weigh in when you get a chance?

Thanks! Mark

Hi Fred, that looks like a good approach.

However, we’ll need to add to this plan if we want to get to parity with the current customizability. Presently there are a couple of custom phonetic algorithms, and the active algorithm can be configured using a global property (GP).

I took a crack at making Lucene analyzers configurable via GP a while ago, but couldn’t get past some Spring initialization problems (I think it introduced a cyclical dependency). But making it configurable using modules might be easier.

Hi @bistenes (cc @mogoodrich), I hope you two had a good start to 2020.

Thanks for your reply. I assume you are referring to the following issue: https://issues.openmrs.org/browse/TRUNK-5669

Do you think the work on TRUNK-5680 should be stopped until 5669 is done and Global Properties are used? I already noticed the similarity and commented on the issue.

My idea was to first make getSimilarPeople() use a phonetic filter and then, once that is done, think about making everything more modular. Basically, that would mean TRUNK-5680 focuses on adapting the PersonLuceneQuery class and would be mostly independent of the GP settings?

Yes, TRUNK-5669 is exactly the one. Glad you found it, sorry I didn’t link it.

I agree with the iterative approach. I don’t think TRUNK-5680 should be stopped. I’m not even sure 5669 can be done. Ultimately, if it can’t be, I think it should be fine if phonetic filters are only configurable using code (i.e., not GP configurable).

@raff @dkayiwa @bistenes do you already have a concept for supporting a between statement within the Lucene library?

The current situation is that I have to translate the following SQL statement into a PersonLuceneQuery:

String birthdayMatch = " (year(p.birthdate) between " + (birthyear - 1) + " and " + (birthyear + 1) + " or p.birthdate is null) ";

As far as I can tell, the current implementation only allows adding Terms as a query. That is why I think we have the following two options at hand:

  1. Create a query term that is defined as follows: query = name + " AND person.birthdate:[birthyear-1 TO birthyear+1]"
  2. Extend the TermsFilterFactory to also apply a TermRangeQuery. That would mean that the current concept has to be extended.

Do you have an opinion on whether 1 or 2 would be preferred? And would either one even work?

@fruether, filters and queries have distinct properties, see https://stackoverflow.com/a/3721135

With that in mind, 1) is the preferred way to do a range query on a birthdate. Caching such a dynamic filter does not make much sense. A query may be faster.
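Option 1 could be sketched in plain Java roughly as follows. The field name person.birthdate comes from the posts above. The class and method names are hypothetical, and the yyyyMMdd term format is an assumption that has to match how the birthdate is actually indexed. The "or p.birthdate is null" branch of the SQL is not covered here, and real code should escape the user-supplied name first (for example with Lucene's QueryParser.escape).

```java
import java.util.Locale;

class BirthYearRangeQuerySketch {

    // Field name as used in the thread; adjust to the real index field.
    static final String BIRTHDATE_FIELD = "person.birthdate";

    /**
     * Appends an inclusive range clause covering birthyear - 1 .. birthyear + 1
     * to an already built name query.
     * ASSUMPTION: dates are indexed as yyyyMMdd strings.
     */
    static String withBirthYearRange(String nameQuery, int birthyear) {
        return String.format(Locale.ROOT, "(%s) AND %s:[%d0101 TO %d1231]",
            nameQuery, BIRTHDATE_FIELD, birthyear - 1, birthyear + 1);
    }

    public static void main(String[] args) {
        System.out.println(withBirthYearRange("McDonald", 1980));
        // prints: (McDonald) AND person.birthdate:[19790101 TO 19811231]
    }
}
```

The parentheses around the name part keep the AND from binding only to the last term when the name query itself contains several terms.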

Thank you @raff. Currently, during the setup of the query, we do not set the field name explicitly, like: familyName:query

Instead we just parse the query, which is usually just the string to search for, without a column name: see PersonLuceneQuery and then LuceneQuery

So if I want to add the range query, I would have to add the column to the query. Or would something like "McDonald AND person.birthdate:[birthyear-1 TO birthyear+1]" work, since Lucene automatically associates the string “McDonald” with the specified fields?

MultiFieldQueryParser respects explicit fields in a query, thus “AND person.birthdate:…” should work. If a field is not specified, then it uses the fields declared for the parser. Not sure why you ask instead of simply testing it…

@raff It was more about understanding the library better, and I could not find any details regarding this. But you are right, I should have tried it myself.

However, I am facing an issue with the phonetic filter where I am a bit lost. So let me briefly describe what I have done so far:

  1. I added a Soundex filter to the LuceneFactory. I used the example definition from this documentation:
	mapping.analyzerDef(LuceneAnalyzers.SOUNDEX_ANALYZER, StandardTokenizerFactory.class)
		.filter(PhoneticFilterFactory.class)
			.param("encoder", "Soundex")
			.param("inject", "false");
  2. I annotated the fields in PersonName accordingly. To give an example:
	@Fields({
			@Field(name = "familyName2Exact", analyzer = @Analyzer(definition = LuceneAnalyzers.EXACT_ANALYZER), boost = @Boost(4f)),
			.....
			@Field(name = "familyName2Soundex", analyzer = @Analyzer(definition = LuceneAnalyzers.SOUNDEX_ANALYZER), boost = @Boost(8f))
	})
	private String familyName2;
  3. As far as I can tell, the setup of the Lucene analyzer is complete after these two steps and it can be used within a search. Am I missing an adjustment? Does anything look odd in the code above?

  4. The next step is to implement the actual search function. The code basically looks like the following:

	List<String> fields = new ArrayList<>();
	fields.addAll(Arrays.asList("familyNameSoundex", "familyName2Soundex", "middleNameSoundex", "givenNameSoundex"));
	LuceneQuery<PersonName> luceneQuery = LuceneQuery
		.newQuery(PersonName.class, sessionFactory.getCurrentSession(), query, fields).useOrQueryParser();
	luceneQuery.list();
  5. The problem is that this query results in an empty set. That is wrong, since the test case (getSimilarPeople_shouldMatchSearchToFamilyName2() of PersonServiceTest) expects at least two matches. If I adapt the code to use familyName2Exact, it works. My conclusion is therefore that something is wrong with the analyzer setup, while the search routine seems to be valid.

Does anyone have an idea how to fix this issue? @bistenes @dkayiwa perhaps

Quoting the doc you linked:

inject

(true/false) If true (the default), then new phonetic tokens are added to the stream. Otherwise, tokens are replaced with the phonetic equivalent. Setting this to false will enable phonetic matching, but the exact spelling of the target word may not match.

It may explain why you get no results for an exact match with this analyzer. Please consider matching your query against all analyzers for a field, i.e. familyNameExact, familyNameSoundex, etc., and only adjusting the boost.

If you need further assistance, please create a PR (even a non-working one), as it’s easier to look at the whole picture rather than snippets.
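To make the quoted inject behaviour concrete, here is a small plain-Java simulation. It is not Lucene's PhoneticFilter; the class and method names are hypothetical, the encoder is a stand-in lookup table, and the token order is only illustrative. It shows which tokens would end up in a field for inject=false versus inject=true.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

class InjectSketch {

    /**
     * Simulates the two modes described in the quoted documentation:
     * inject = false replaces each token with its phonetic code,
     * inject = true keeps the original token and adds the code next to it.
     */
    static List<String> phoneticTokens(List<String> tokens, UnaryOperator<String> encoder, boolean inject) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (inject) {
                out.add(token); // original spelling stays searchable
            }
            out.add(encoder.apply(token));
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in encoder: a fixed table instead of a real Soundex implementation.
        Map<String, String> codes = Map.of("smith", "S530", "smyth", "S530");
        UnaryOperator<String> encoder = t -> codes.getOrDefault(t, t);

        System.out.println(phoneticTokens(List.of("smith"), encoder, false)); // [S530]
        System.out.println(phoneticTokens(List.of("smith"), encoder, true));  // [smith, S530]
    }
}
```

With inject=false the field only contains S530, so a query term only matches if it is also reduced to S530, for instance by running the query text through the same analyzer. The literal spelling "smith" would not match in that field.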

Yes, I tried both true and false before. I retried it with inject explicitly set to true. However, it is still not working. You can find my code in the following PR: https://github.com/openmrs/openmrs-core/pull/3116

I want to confirm that the soundex() function in SQL gives the same result as the phonetic filter, to make sure that the search is working properly. That is why I would, for now, try to avoid using familyNameExact as well. Otherwise this filter would be implemented but might never really work properly, which could have side effects.

Once I know the PhoneticFilter is working, I am happy to also add the other analyzers.