Accent flattening in patient search

The first step I’d like to take in making patient search analyzers configurable is to make it very easy, if not the default, to flatten accents in search – i.e. to make it so a search for “Jose” will bring up “José”.

Should patient search ignore accents by default?

Note that this would use Lucene’s ASCIIFoldingFilter; please review the filter’s description before voting. It only folds character blocks that have “reasonable ASCII mappings”, i.e. it leaves Arabic, Hindi, etc. characters alone.

  • Ignore accents by default
  • Accent-sensitive search by default
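
To make the proposed behaviour concrete, here is a rough sketch of what “ignoring accents” means for a match. It is an approximation only: it uses the JDK’s `java.text.Normalizer` rather than ASCIIFoldingFilter itself, and the class and method names (`AccentFoldSketch`, `fold`, `matches`) are made up for illustration. ASCIIFoldingFilter also maps characters that have no canonical decomposition (for example “ø” or “ß”), which this sketch does not attempt.

```java
import java.text.Normalizer;

public class AccentFoldSketch {

    // Hypothetical helper: decompose each character (NFD), then drop only the
    // marks in the Combining Diacritical Marks block (U+0300 to U+036F).
    // Marks from other scripts (Arabic, Devanagari, ...) are not stripped.
    public static String fold(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    // Hypothetical helper: accent- and case-insensitive comparison of a
    // search term against a stored name.
    public static boolean matches(String query, String storedName) {
        return fold(query).equalsIgnoreCase(fold(storedName));
    }

    public static void main(String[] args) {
        // "Jos\u00e9" is "José", written with an escape to avoid
        // source-encoding problems.
        System.out.println(fold("Jos\u00e9"));            // prints "Jose"
        System.out.println(matches("jose", "Jos\u00e9")); // prints "true"
    }
}
```

With accent-sensitive search as the default, the second call would effectively be a plain comparison of “jose” against “José”, which does not match.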

I was doing a little research on this to remind myself how it worked in the past, and it looks like, prior to the Lucene upgrade, search did flatten characters. More specifically, we were using a direct MySQL search, and we use the utf8_general_ci collation in our database:

Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case insensitive and accent insensitive. utf8_general_ci is an example: 'a' , 'A' , 'À' , and 'á' each have different character codes but all have a weight of 0x0041 and compare as equal.

https://dev.mysql.com/doc/refman/8.0/en/charset-collation-implementations.html
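
The “same weight” idea in that quote can be illustrated outside MySQL. The sketch below uses the JDK’s `java.text.Collator` at primary strength, which ignores both case and accent differences; it is an analogy for the behaviour described above, not MySQL’s implementation, and the class and method names (`CollationSketch`, `sameWeight`) are hypothetical.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationSketch {

    // Hypothetical helper: true when two strings differ only by case or
    // accents, i.e. they are equal at PRIMARY collation strength.
    public static boolean sameWeight(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ENGLISH);
        collator.setStrength(Collator.PRIMARY);
        // Decompose precomposed characters so accents are compared as marks.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // \u00c0 is "À" and \u00e1 is "á".
        System.out.println(sameWeight("a", "A"));
        System.out.println(sameWeight("a", "\u00c0"));
        System.out.println(sameWeight("a", "\u00e1"));
        System.out.println(sameWeight("a", "b"));
    }
}
```

The first three comparisons are expected to report equality and the last one a difference, which mirrors why a MySQL `LIKE` search under such a collation would find “José” for “Jose”.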

So this is generally a regression for us. I think utf8_general_ci is our recommended collation for OpenMRS, but I’m not sure (nor am I sure whether many implementations use another collation).

Take care, Mark

This is correct.

Funny, I hadn’t heard of MySQL collations before now, but Stack Overflow doesn’t have nice things to say about utf8_general_ci (in comparison with utf8_unicode_ci).

Anyway, I guess the important thing is that accent sensitivity is a regression that happened when we switched to Lucene, and we should fix it.

Filed https://issues.openmrs.org/browse/TRUNK-5681