== Character case and diacritic marks (1), issues with stemming

=== Case and diacritics in Recoll

Recoll versions up to 1.17 almost fully ignore character case and diacritic marks.

All terms are converted to lower case and unaccented before they are written to the index. There are only two exceptions:

* File paths (as used in _dir:_ clauses) are not converted. This might be a bug or a feature, but the main reason is that we don't know how they are encoded.
* It is possible to specify that some characters will keep their diacritic marks, because the entity formed by the character and the diacritic mark is considered to be a different letter, not a modified one. This is highly dependent on the language. For example, in Swedish, +å+ should be preserved, not turned into +a+.

As a necessary consequence, the same transformations are applied to search terms, and it is impossible to search for a specific capitalization of a word (+US+ is looked for as +us+), or a specific accented form (+café+ will be looked for as +cafe+).

However, there are some cases where you would like to be more specific:

* Searching for +US+ or +us+ should probably return different results.
* Diacritics are seldom significant in English, but we can find a few examples anyway: +sake+ and +saké+, +mate+ and +maté+. Of course, there are many more cases in languages which use more diacritics.

On the other hand, accents are often mistyped or forgotten (résumé, résume, resume?), and capitalization is most often insignificant. It is therefore very important to retain the capability to ignore accent and character case differences, and to make the discrimination easy to switch on or off for each search (or even for specific terms).

This text and other pages which will follow will discuss issues in adding character case and diacritics sensitivity to Recoll, under the assumption that the main index will contain the raw source terms instead of case-folded and unaccented ones.
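As an illustration of the index-time transformation described at the top of this section, here is a minimal Python sketch. It is not Recoll's code: the +fold+ and +unaccent+ functions and the +KEEP_ACCENTED+ exception list are names invented for this example, and the Swedish +å+ is the only exception configured.

```python
import unicodedata

# Hypothetical exception list (an assumption for this sketch): letters whose
# diacritic mark is part of the letter itself, such as the Swedish å.
KEEP_ACCENTED = frozenset("å")


def unaccent(term, keep=KEEP_ACCENTED):
    """Remove diacritic marks, except on the characters listed in keep."""
    out = []
    for ch in unicodedata.normalize("NFC", term):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the character and drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def fold(term):
    """Lower-case a term, then unaccent it, as done before indexing."""
    return unaccent(term.lower())


print(fold("US"))     # us
print(fold("café"))   # cafe
print(fold("Århus"))  # århus (å is kept because it is in the exception list)
```

Applying the same +fold+ to the search terms is what makes +US+ and +us+, or +café+ and +cafe+, indistinguishable at query time.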
The following will use the _unaccent_ neologism to mean _remove diacritic marks_ (and not only accents).

English examples are used when possible, but given the limited use of diacritics in English, some French will probably creep in.

=== Diacritics and stemming

Stemming is the process by which we extend a search to terms related by grammatical inflexion, for example singular/plural, verb tenses, etc. For example a search for +floor+ is normally expanded by Recoll to +floors, floored, flooring, ...+

In practice Recoll has a separate data structure that has stemmed terms (stems) as keys pointing to a list of expansion terms
{{{floor -> (floor,floors,floorings,...)}}}

Stemming should be applied to terms before they are stripped of diacritics. Accents may have a grammatical significance, and the accent may change how the term is stemmed. For example, in French the +âmes+ suffix generally marks a past conjugation but +ames+ does not. The standard Xapian French stemmer will turn +évitâmes+ (avoided) into an +évit+ stem, but +évitames+ will be turned into +évitam+ (stripping plural and feminine suffixes).

When the search is set to ignore diacritics, this poses a specific problem: if the user enters the search term without accents (which is correct because the system is supposed to ignore them), there is no guarantee that the term will be correctly expanded by stemming.

The diacritic mismatch breaks the family relationship between the stem siblings, and this is independent of the type of index: it will happen with an index where diacritics are stripped just as with a raw one.

The simpler case where diacritics in the original term only affect diacritics in the stem also necessitates specific processing, but it is easier to work around.

Two examples illustrating these issues follow.
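Before going through the examples, the stem expansion structure described above can be sketched in Python. This is a toy model only: +toy_stem+ is a hypothetical suffix stripper whose rules were chosen to reproduce the stems quoted in the text. It is not the Xapian French stemmer, and the function names are assumptions.

```python
from collections import defaultdict

# Toy suffix rules (an assumption for this sketch), tried in order.
# They only mimic the stems quoted in the text for the "éviter" family.
SUFFIXES = ("âmes", "er", "es", "e", "s")


def toy_stem(term):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def build_stem_expansion(index_terms):
    """Map each stem to the set of index terms that share it."""
    table = defaultdict(set)
    for term in index_terms:
        table[toy_stem(term)].add(term)
    return dict(table)


def expand(term, table):
    """Return the term plus its stem siblings found in the table."""
    return table.get(toy_stem(term), set()) | {term}


table = build_stem_expansion(["éviter", "évite", "évites"])
print(toy_stem("évitâmes"), toy_stem("évitames"))  # évit évitam
print(sorted(expand("évitames", table)))           # ['évitames']
```

With the accent in place, +évitâmes+ reaches the whole +évit+ family. Without it, the stem is +évitam+ and the expansion finds nothing beyond the input term.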

==== The simple case: diacritics in the term only affect diacritics in the stem

Let's imagine that the document set contains the term +éviter+ (infinitive of +to avoid+), but not +évite+ (present). The only term in the actual index is then +éviter+.

The user enters an unaccented +evite+, counting on the diacritics-insensitive search mode to deal with the accents. As +évite+ is not present in the index, we have no way to guess that +evite+ is really +évite+. The stemmer will turn +evite+ into +evit+. There is no way that this can be related to +éviter+, and this legitimate result can't be found.

There is a way around this: we can compute a separate stem expansion dictionary for unaccented terms. This dictionary, to be used with diacritics-insensitive searches only, contains the relationship between +evit+ and +eviter+ (as +éviter+ is in the index). We can then relate +eviter+ and +éviter+ because they differ only by accents, and the search will find the document with +éviter+.

==== The bad case: diacritics in the term change the stem beyond diacritics

Some grammatically significant accents will cause unexpectedly missing search results when using a supposedly diacritics-insensitive search mode.

Let's imagine that the document set contains the term +éviter+ (infinitive of +to avoid+), but not +évitâmes+ (past). So the stemming expansion table has an entry for +évit+ -> +éviter+.

If the user enters an unaccented +evitames+, she would expect to find the documents containing +éviter+ in the results, because the latter term is a stemming sibling of +évitâmes+ and the search is supposedly not influenced by diacritics, so that +evitames+ and +évitâmes+ should be equivalent.

However, our search is now in trouble, because +évitâmes+ is not in any document, so that there is no data in the index which would inform us about how to transform the input term into something that differs only by accents but would yield a correct input for the stemmer.
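Both situations can be reproduced with a small Python model. It is a sketch under stated assumptions, not Recoll's implementation: +toy_stem+ is a hypothetical suffix stripper standing in for the real French stemmer, and the two tables and the +insensitive_search+ function are names invented for this illustration.

```python
import unicodedata
from collections import defaultdict

# Toy suffix rules (an assumption for this sketch), NOT the Xapian stemmer.
SUFFIXES = ("âmes", "er", "es", "e", "s")


def toy_stem(term):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def unaccent(term):
    """Remove all combining marks from the term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def build_tables(index_terms):
    """Build the unaccented stem dictionary and the unaccented-to-raw map."""
    stem_to_unaccented = defaultdict(set)  # evit -> {eviter}
    unaccented_to_raw = defaultdict(set)   # eviter -> {éviter}
    for raw in index_terms:
        plain = unaccent(raw)
        stem_to_unaccented[toy_stem(plain)].add(plain)
        unaccented_to_raw[plain].add(raw)
    return stem_to_unaccented, unaccented_to_raw


def insensitive_search(query, tables):
    """Expand an unaccented query to the raw index terms sharing its stem."""
    stem_to_unaccented, unaccented_to_raw = tables
    stem = toy_stem(unaccent(query))
    hits = set()
    for plain in stem_to_unaccented.get(stem, ()):
        hits |= unaccented_to_raw[plain]
    return hits


tables = build_tables(["éviter"])         # the only term in the index
print(insensitive_search("evite", tables))     # {'éviter'}: simple case works
print(insensitive_search("evitames", tables))  # set(): bad case finds nothing
```

The unaccented dictionary rescues +evite+, because its stem +evit+ is also the stem of +eviter+. It cannot rescue +evitames+, whose toy stem is +evitam+ and matches no entry.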
If we try to feed the raw user input to the stemmer, it will propose an +evitam+ stem, which will not work, because the stem that actually exists is +évit+, and +evitam+ cannot be related to +éviter+.

The only palliative approach I can think of would be a spelling correction of the input, performed independently of the actual index contents, which would notice that +evitames+ is not a French word and propose a change or an expansion to +évitâmes+, which would correctly stem to +évit+ and allow us to find +éviter+.

This issue is not specific to Recoll, nor indeed to whether the index retains accents or not. As far as I can see, it is an intrinsic bad interaction between diacritics insensitivity and stemming.

It is also interesting to note that this case becomes less probable when the data set becomes bigger, because more term inflexions will then be present in the index.

We'll next think about an link:ZDevCaseAndDiacritics2.html[appropriate interface].