Accent flattening in patient search

The first step I’d like to take in making patient search analyzers configurable is to make accent flattening very easy, if not the default, so that a search for “Jose” brings up “José”.

Should patient search ignore accents by default?

This would use Lucene’s ASCIIFoldingFilter, so please review the filter’s description before voting. It only folds character blocks that have “reasonable ASCII mappings”, i.e., it leaves Arabic, Hindi, etc. characters alone.

  • Ignore accents by default
  • Accent-sensitive search by default
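
For anyone who wants a feel for what folding does before voting, here is a rough Python sketch of the idea. It is not ASCIIFoldingFilter itself: it only strips combining marks from characters whose base letter is ASCII, using Unicode normalization, whereas the real filter has its own mapping table and covers more characters. The names `fold_accents` and `matches` are made up for illustration.

```python
import unicodedata


def fold_accents(text: str) -> str:
    """Approximate accent folding: drop combining marks from Latin letters.

    Hypothetical helper, not the Lucene filter. Characters whose base
    letter is not ASCII (Arabic, Devanagari, ...) are returned unchanged.
    """
    out = []
    # Compose first so that "e" + combining acute is handled as one character.
    for ch in unicodedata.normalize("NFC", text):
        decomposed = unicodedata.normalize("NFD", ch)
        base, marks = decomposed[0], decomposed[1:]
        # Fold only when the base is ASCII and everything after it is a combining mark.
        if base.isascii() and all(unicodedata.combining(m) for m in marks):
            out.append(base)
        else:
            out.append(ch)
    return "".join(out)


def matches(query: str, name: str) -> bool:
    """Accent- and case-insensitive equality (illustrative only)."""
    return fold_accents(query).casefold() == fold_accents(name).casefold()
```

With this, `matches("Jose", "José")` is true, while Arabic or Devanagari names pass through `fold_accents` unchanged.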


I was doing a little research on this to remind myself how it worked in the past, and it looks like prior to the Lucene upgrade it did flatten characters… or, more specifically, we were using a direct MySQL search and we use the utf8_general_ci collation in our db:

Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case insensitive and accent insensitive. utf8_general_ci is an example: 'a', 'A', 'À', and 'á' each have different character codes but all have a weight of 0x0041 and compare as equal.

https://dev.mysql.com/doc/refman/8.0/en/charset-collation-implementations.html
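
To make the quoted behaviour concrete, here is a toy Python sketch of comparing strings by per-character weight. The weight table contains only the four characters from the quote; falling back to the code point for everything else is my own simplification, not how MySQL fills in the rest of the collation, and `weight` and `collation_equal` are hypothetical names.

```python
# Toy weight table: only the four characters named in the MySQL docs quote.
WEIGHTS = {
    "a": 0x0041,
    "A": 0x0041,
    "\u00c0": 0x0041,  # À
    "\u00e1": 0x0041,  # á
}


def weight(ch: str) -> int:
    # Assumption for this sketch: unknown characters weigh their own code point.
    return WEIGHTS.get(ch, ord(ch))


def collation_equal(left: str, right: str) -> bool:
    """Compare two strings by their sequences of per-character weights."""
    return [weight(c) for c in left] == [weight(c) for c in right]
```

Under this table `collation_equal("a", "À")` holds, which is the same effect as a search for an unaccented name matching the accented one.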

So this is generally a regression for us… I think utf8_general_ci is our recommended collation for OpenMRS, but I’m not sure (nor am I sure whether many people use another collation).

Take care, Mark

This is correct.

Funny, I hadn’t heard of MySQL collations before now, but Stack Overflow doesn’t have nice things to say about utf8_general_ci (in comparison with utf8_unicode_ci).

Anyway, I guess the important thing is that accent sensitivity is a regression that happened when we switched to Lucene, and we should fix it.


Filed https://issues.openmrs.org/browse/TRUNK-5681