== Character case and diacritic marks (1), issues with stemming

=== Case and diacritics in Recoll

Recoll versions up to 1.17 almost fully ignore character case and diacritic
marks. 

All terms are converted to lower case and unaccented before they are
written to the index. There are only two exceptions:

 * File paths (as used in _dir:_ clauses) are not converted. This might
   be a bug or a feature, but the main reason is that we don't know how they
   are encoded.
 * It is possible to specify that some characters will keep their diacritic
   marks, because the entity formed by the character and the diacritic mark
   is considered to be a different letter, not a modified one. This is
   highly dependent on the language. For example, in Swedish, +å+ should
   be preserved, not turned into +a+.

As a necessary consequence, the same transformations are applied to search
terms, and it is impossible to search for a specific capitalization of a
word (+US+ is looked for as +us+), or a specific accented form
(+café+ will be looked for as +cafe+).
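
The folding described above can be sketched in a few lines of Python. This is only an illustration, not Recoll's actual code: the +fold+ and +unaccent+ names and the +KEEP+ exception set are assumptions made for the example.

```python
import unicodedata

# Hypothetical exception list: characters whose diacritic makes them a
# distinct letter (for example Swedish å) and which must not be unaccented.
KEEP = {"å", "Å"}

def unaccent(term, keep=KEEP):
    """Remove diacritic marks, except on characters listed in `keep`."""
    out = []
    # Work on precomposed characters so that `keep` lookups are reliable.
    for ch in unicodedata.normalize("NFC", term):
        if ch in keep:
            out.append(ch)
            continue
        # Decompose the character and drop the combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if not unicodedata.combining(c)))
    return "".join(out)

def fold(term, keep=KEEP):
    """Apply the same transformation to index terms and to search terms."""
    return unaccent(term, keep).lower()

print(fold("US"))     # us
print(fold("café"))   # cafe
print(fold("Skåne"))  # skåne (the å is in the exception list)
```

Because the same +fold+ is applied on both sides, +US+ and +us+ (or +café+ and +cafe+) become indistinguishable at search time.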

However, there are some cases where you would like to be more specific:

 * Searching for +US+ or +us+ should probably return different results.
 * Diacritics are seldom significant in English, but we can find a
   few examples anyway: +sake+ and +saké+, +mate+ and +maté+. Of
   course, there are many more cases in languages which use more diacritics.

On the other hand, accents are often mistyped or forgotten (résumé, résume,
resume?), and capitalization is most often insignificant, so it is
very important to retain the ability to ignore accent and character
case differences, and to make the discrimination easy to switch on or
off for each search (or even for specific terms).

This text and the pages that follow discuss the issues in adding
character case and diacritics sensitivity to Recoll, under the assumption
that the main index will contain the raw source terms instead of
case-folded and unaccented ones.

The following will use the _unaccent_ neologism to mean _remove
diacritic marks_ (and not only accents). 

English examples are used when possible, but given the limited use of
diacritics in English, some French will probably creep in.

=== Diacritics and stemming

Stemming is the process by which we extend a search to terms related by
grammatical inflexion, for example singular/plural, verb tenses, etc. For
example a search for +floor+ is normally expanded by Recoll to +floors,
floored, flooring, ...+

In practice Recoll has a separate data structure that has stemmed terms
(stems) as keys, each pointing to a list of expansion terms:
{{{floor -> (floor,floors,floorings,...)}}}
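
A minimal sketch of such a structure follows. The +toy_stem+ function is a hypothetical stand-in for a real stemmer (it only strips a few English suffixes), and the function names are assumptions made for the example, not Recoll's API.

```python
from collections import defaultdict

def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("ings", "ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def build_stem_expansion(index_terms, stem=toy_stem):
    """Map each stem to the set of index terms which share it."""
    table = defaultdict(set)
    for term in index_terms:
        table[stem(term)].add(term)
    return table

def expand(term, table, stem=toy_stem):
    """Expand a search term to all index terms with the same stem."""
    return sorted(table.get(stem(term), {term}))

table = build_stem_expansion(["floor", "floors", "floored", "flooring", "door"])
print(expand("floors", table))  # ['floor', 'floored', 'flooring', 'floors']
```

The table is computed from the terms actually present in the index, so a search can only be expanded to inflexions which occur in some document.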

Stemming should be applied to terms before they are stripped of
diacritics. Accents may have a grammatical significance, and the accent may
change how the term is stemmed. For example, in French the +âmes+ suffix
generally marks a past conjugation but +ames+ does not. The standard
Xapian French stemmer will turn +évitâmes+ (avoided) into an +évit+ stem,
but +évitames+ will be turned into +évitam+ (stripping
plural and feminine suffixes).
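
The effect of the processing order can be shown with a toy example. The +toy_french_stem+ function below is not the Xapian stemmer: it is a hypothetical two-rule stand-in which only reproduces the two outcomes quoted above.

```python
import unicodedata

def unaccent(term):
    """Remove all diacritic marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_french_stem(term):
    """Hypothetical stand-in for a French stemmer (two rules only)."""
    term = unicodedata.normalize("NFC", term)
    # The accented suffix is tried first: it marks a past conjugation.
    for suffix in ("âmes", "es", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

# Stemming first, then unaccenting: the conjugation suffix is recognized.
print(unaccent(toy_french_stem("évitâmes")))  # evit

# Unaccenting first, then stemming: the suffix is no longer recognized.
print(toy_french_stem(unaccent("évitâmes")))  # evitam
```

The two orders yield different stems for the same word, which is why the accents must still be present when the stemmer runs.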

When the search is set to ignore diacritics, this poses a specific problem:
if the user enters the search term without accents (which is correct
because the system is supposed to ignore them), there is no guarantee that
the term will be correctly expanded by stemming.

The diacritic mismatch breaks the family relationship between the stem
siblings, and this is independent of the type of index: it will happen with
an index where diacritics are stripped just as with a raw one.

The simpler case, where diacritics in the original term only affect
diacritics in the stem, also needs specific processing, but it is
easier to work around.

Two examples illustrating these issues follow.

==== The simple case: diacritics in the term only affect diacritics in the stem

Let's imagine that the document set contains the term +éviter+
(infinitive of +to avoid+), but not +évite+ (present). The only term in
the actual index is then +éviter+.

The user enters an unaccented +evite+, counting on the
diacritics-insensitive search mode to deal with the accents. As +évite+
is not present in the index, we have no way to guess that +evite+ is
really +évite+.

The stemmer will turn +evite+ into +evit+. There is no way that this
can be related to +éviter+, and this legitimate result can't be found.

There is a way around this: we can compute a separate
stem expansion dictionary for unaccented terms. This dictionary, to be used
with diacritics-insensitive searches only, contains the relationship
between +evit+ and +eviter+ (as +éviter+ is in the index). We can
then relate +eviter+ and +éviter+ because they differ only by accents,
and the search will find the document with +éviter+.
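
The workaround can be sketched as two small tables built from the index terms. Everything here is illustrative: +toy_french_stem+ is a hypothetical stand-in for a real stemmer, and the table and function names are assumptions made for the example.

```python
import unicodedata
from collections import defaultdict

def unaccent(term):
    """Remove all diacritic marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_french_stem(term):
    """Hypothetical stand-in for a French stemmer: strips a few suffixes."""
    for suffix in ("er", "es", "e", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

def build_unaccented_tables(index_terms, stem=toy_french_stem):
    """Build the stem expansion dictionary for unaccented terms."""
    stem_to_unac = defaultdict(set)  # evit   -> {eviter}
    unac_to_raw = defaultdict(set)   # eviter -> {éviter}
    for term in index_terms:
        bare = unaccent(term)
        stem_to_unac[stem(bare)].add(bare)
        unac_to_raw[bare].add(term)
    return stem_to_unac, unac_to_raw

def expand_insensitive(user_term, stem_to_unac, unac_to_raw,
                       stem=toy_french_stem):
    """Diacritics-insensitive stem expansion of a user term."""
    bare = unaccent(user_term)
    found = set()
    for sibling in stem_to_unac.get(stem(bare), {bare}):
        # Go back from the unaccented sibling to the real index terms.
        found |= unac_to_raw.get(sibling, set())
    return sorted(found)

tables = build_unaccented_tables(["éviter"])
print(expand_insensitive("evite", *tables))  # ['éviter']
```

The first table plays the role of the separate unaccented stem dictionary, and the second relates each unaccented form to the index terms which differ from it only by accents.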

==== The bad case: diacritics in the term change the stem beyond diacritics

Some grammatically significant accents will cause unexpectedly missing
search results when using a supposedly diacritics-insensitive search mode.

Let's imagine that the document set contains the term +éviter+
(infinitive of +to avoid+), but not +évitâmes+ (past). So the stemming
expansion table has an entry for +évit+ -> +éviter+.

If the user enters an unaccented +evitames+, she would expect to find the
documents containing +éviter+ in the results, because the latter term is
a stemming sibling of +évitâmes+ and the search is supposedly not
influenced by diacritics, so that +evitames+ and +évitâmes+ should be
equivalent.

However, our search is now in trouble, because +évitâmes+ is not in any
document, so that there is no data in the index which would inform us about
how to transform the input term into something that differs only by accents
but would yield a correct input for the stemmer.

If we try to feed the raw user input to the stemmer, it will propose
an +evitam+ stem, which will not work, because the stem that actually
exists is +évit+, and +evitam+ cannot be related to +éviter+.

The only palliative approach I can think of would be a spelling correction
of the input, performed independently of the actual index contents, which
would notice that +évitames+ is not a French word and propose a change or an
expansion to +évitâmes+, which would correctly stem to +évit+ and allow
us to find +éviter+.
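
A rough sketch of this palliative follows, assuming that a word list for the language is available independently of the index. The +LEXICON+ contents, the +toy_french_stem+ stand-in and all function names are hypothetical. Nothing here describes an existing Recoll feature.

```python
import unicodedata
from collections import defaultdict

def unaccent(term):
    """Remove all diacritic marks from a term."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

def toy_french_stem(term):
    """Hypothetical stand-in for a French stemmer: strips a few suffixes."""
    term = unicodedata.normalize("NFC", term)
    for suffix in ("âmes", "er", "es", "e", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term

# Hypothetical language word list, NOT derived from the index contents.
LEXICON = ["évitâmes", "éviter", "évite"]

def build_accent_restorer(lexicon):
    """Map each unaccented form to the correctly accented words."""
    table = defaultdict(set)
    for word in lexicon:
        table[unaccent(word)].add(word)
    return table

def expand_with_correction(user_term, restorer, stem_table,
                           stem=toy_french_stem):
    """Restore the accents on the input, then stem and expand it."""
    candidates = restorer.get(unaccent(user_term), {user_term})
    found = set()
    for candidate in candidates:
        found |= stem_table.get(stem(candidate), set())
    return sorted(found)

# Stem expansion table for an index which only contains "éviter".
stem_table = {"évit": {"éviter"}}
restorer = build_accent_restorer(LEXICON)

print(toy_french_stem("evitames"))  # evitam (no entry in stem_table)
print(expand_with_correction("evitames", restorer, stem_table))  # ['éviter']
```

The quality of the result depends entirely on the coverage of the word list, which is why this is only a palliative.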

This issue is not specific to Recoll, or indeed to whether the index
retains accents or not. As far as I can see, it is an intrinsic bad
interaction between diacritics insensitivity and stemming.

It is also interesting to note that this case becomes less probable when
the data set becomes bigger, because more term inflexions will then be
present in the index.

We'll next think about an link:ZDevCaseAndDiacritics2.html[appropriate
interface].