The first step I’d like to take in making patient search analyzers configurable is to make it very easy, if not the default, to flatten accents in search – i.e. to make it so a search for “Jose” will bring up “José”.
Should patient search ignore accents by default?
Note that this would use ASCIIFoldingFilter – please review the filter’s description before voting. Note that it only folds character blocks that have “reasonable ASCII mappings” – i.e., it leaves Arabic, Hindi, etc. characters alone.
- Ignore accents by default
- Accent-sensitive search by default
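For anyone who wants a feel for what "ignoring accents" means in practice, here is a minimal sketch using only the JDK. It is not ASCIIFoldingFilter and not the OpenMRS implementation: it just decomposes characters and drops Latin-style combining diacritics, so "José" becomes "Jose" while scripts such as Arabic pass through untouched. The class and method names are made up for illustration. The real filter covers more cases (for example, it also maps characters like "ø" and "ß", which this sketch does not).

```java
import java.text.Normalizer;

// Hypothetical helper for illustration only; not part of OpenMRS or Lucene.
public class AccentFoldSketch {

    // Decompose to NFD so that an accented letter becomes base letter + combining mark,
    // then remove marks in the Combining Diacritical Marks block (U+0300..U+036F).
    static String foldAccents(String input) {
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
    }

    public static void main(String[] args) {
        // "Jos\u00e9" is "José"; the folded form matches a plain "Jose" query.
        System.out.println(foldAccents("Jos\u00e9").equals("Jose"));

        // An Arabic name has no Latin-style diacritics, so it is left alone.
        String arabic = "\u0645\u062d\u0645\u062f";
        System.out.println(foldAccents(arabic).equals(arabic));
    }
}
```

Applying the same folding to both the indexed names and the search text is what lets a search for "Jose" bring up "José".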
Was doing a little research on this to remind myself how this worked in the past, and it looks like prior to the Lucene upgrade it did flatten characters… or more specifically, we were using a direct MySQL search and we use the utf8_general_ci collation in our db:
Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case insensitive and accent insensitive. utf8_general_ci is an example: 'a', 'A', 'À', and 'á' each have different character codes but all have a weight of 0x0041 and compare as equal.
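To illustrate the quoted behavior (letters that differ only in case or accent comparing as equal), here is a rough Java analogue using `java.text.Collator` at primary strength. This is an assumption-laden sketch: the JDK collator is not MySQL's collation and uses different rules internally, and the class and method names are hypothetical. It only shows the same idea of comparing by base letter.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical helper for illustration only; not MySQL's collation logic.
public class CollationSketch {

    // At PRIMARY strength the collator looks only at base letters,
    // so case and accent differences are ignored.
    static boolean sameBaseLetters(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(Collator.PRIMARY);
        // Decompose accented characters so they are compared as base letter + mark.
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        System.out.println(sameBaseLetters("a", "\u00c0")); // 'a' vs 'À'
        System.out.println(sameBaseLetters("A", "\u00e1")); // 'A' vs 'á'
        System.out.println(sameBaseLetters("a", "b"));      // different base letters
    }
}
```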
So this is generally a regression for us… I think utf8_general_ci is our recommended collation for OpenMRS, but I’m not sure (nor am I sure whether many implementations use another collation).
Funny, I hadn’t heard of MySQL collation before now, but SO doesn’t have nice things to say about utf8_general_ci (in comparison with utf8_unicode_ci).
Anyway, I guess the important thing is that accent sensitivity is a regression that happened when we switched to Lucene, and we should fix it.