The first step I’d like to take in making patient search analyzers configurable is to make it very easy, if not the default, to flatten accents in search – i.e. to make it so a search for “Jose” will bring up “José”.
Should patient search ignore accents by default?
Note that this would use ASCIIFoldingFilter – please review the filter’s description before voting. Note that it only folds character blocks that have “reasonable ASCII mappings” – i.e., it leaves Arabic, Hindi, etc. characters alone.
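For anyone who wants a feel for what "folding" means before voting, here is a rough sketch using only the JDK. It is not the Lucene filter: `AccentFoldSketch` and `foldAccents` are made-up names for illustration, and the approach (decompose, then strip combining marks from the Latin-oriented Combining Diacritical Marks block) only approximates ASCIIFoldingFilter, which uses its own mapping table and also covers cases this sketch does not (e.g. characters with no canonical decomposition).

```java
import java.text.Normalizer;

// Hypothetical illustration only; this is NOT Lucene's ASCIIFoldingFilter.
class AccentFoldSketch {

    // Approximates accent folding: "é" is decomposed into "e" + U+0301,
    // and the combining mark is then removed.
    static String foldAccents(String input) {
        // NFD splits precomposed characters into base letter + combining marks.
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        // Strip only marks in U+0300..U+036F, so combining marks that live in
        // other scripts' own blocks are not touched.
        String stripped = decomposed.replaceAll("\\p{InCombiningDiacriticalMarks}+", "");
        // Recompose whatever is left.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // "José" written with a Unicode escape to avoid source-encoding issues.
        System.out.println(foldAccents("Jos\u00e9"));
        // An Arabic name with no combining diacritical marks is returned as-is.
        System.out.println(foldAccents("\u0645\u062d\u0645\u062f"));
    }
}
```

With folding applied to both the indexed name and the query, "Jose" and "José" end up as the same token, which is the behaviour the poll is asking about.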
Was doing a little research on this to remind myself how this worked in the past, and it looks like prior to the Lucene upgrade it did flatten characters… or more specifically, we were using a direct MySQL search and we use the utf8_general_ci collation in our db:
Non-UCA collations have a one-to-one mapping from character code to weight. In MySQL, such collations are case insensitive and accent insensitive. utf8_general_ci is an example: 'a', 'A', 'À', and 'á' each have different character codes but all have a weight of 0x0041 and compare as equal.
So this is generally a regression for us… I think utf8_general_ci is our recommended collation for OpenMRS, but I’m not sure (nor am I sure how many implementations use another collation).
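To make the "same weight, compares as equal" idea concrete, here is a small JDK-only analogy. It does not touch MySQL or reproduce utf8_general_ci; it just uses `java.text.Collator` at primary strength, which likewise ignores case and accent differences. `CollationSketch` and its method name are made up for the example.

```java
import java.text.Collator;
import java.util.Locale;

// Hypothetical illustration; an analogy to a case- and accent-insensitive
// collation, not a reimplementation of MySQL's utf8_general_ci.
class CollationSketch {

    // Returns true when the two strings differ only by case and/or accents.
    static boolean equalIgnoringCaseAndAccents(String a, String b) {
        Collator collator = Collator.getInstance(Locale.ROOT);
        // PRIMARY strength compares base letters only.
        collator.setStrength(Collator.PRIMARY);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // 'a' vs 'À' (U+00C0) and 'a' vs 'á' (U+00E1)
        System.out.println(equalIgnoringCaseAndAccents("a", "\u00c0"));
        System.out.println(equalIgnoringCaseAndAccents("a", "\u00e1"));
        // "Jose" vs "José"
        System.out.println(equalIgnoringCaseAndAccents("Jose", "Jos\u00e9"));
    }
}
```

This is the behaviour the old direct-SQL search got for free from the database collation, and what the Lucene-based search would need a folding filter to match.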