== Character case and diacritic marks (3), implementation

In previous pages, we discussed link:ZDevCaseAndDiacritics1.html[diacritics
and stemming], and an link:ZDevCaseAndDiacritics2.html[appropriate
interface] for switchable search sensitivity to diacritics and character
case.

So you are in this mood again and you don't want to type accents (maybe you're
stuck with a QWERTY American English keyboard), or conversely you
want to resume looking for your résumé, and you've told Recoll as much,
using the appropriate interface. What happens then?

The second case is easy if the index is raw, and mostly impossible if it is
stripped. So we'll concentrate on the first case: how to achieve case and
diacritics insensitivity on a raw index?

Recoll uses three expansion tables:

* The first table has stripped and lowercased terms as keys and raw terms as
  data: +mate -> (mate, maté, MATE,...)+.

* The second table has lowercased stems as keys and original lowercase terms
  as data (when using multiple languages, there are several such tables):
  +évit -> (éviter, évite, évitâmes, ...)+.

* The third table has stripped and lowercased stems as keys and stripped
  lowercased terms as data:
  +evit -> (eviter, evite, evitons)+ and +evitam -> (evitames, ...)+
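
As a rough illustration of the first table, the sketch below builds it from a
list of raw index terms. It is not Recoll's code: the names
+strip_diacritics+, +fold+ and +build_case_diac_table+ are hypothetical, and
diacritics are removed here with Python's +unicodedata+ module, which only
approximates a real unaccenting step.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(term):
    """Drop combining marks after canonical decomposition (an approximation)."""
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def fold(term):
    """Stripped and lowercased form, used as the key of the first table."""
    return strip_diacritics(term).lower()


def build_case_diac_table(index_terms):
    """Map each stripped/lowercased key to the raw terms that share it."""
    table = defaultdict(set)
    for term in index_terms:
        table[fold(term)].add(term)
    return table


# Hypothetical index content, chosen to match the example in the text.
index_terms = ["mate", "maté", "MATE", "MATÉ", "matin"]
table = build_case_diac_table(index_terms)
```

The second and third tables could be built the same way, with a stemmer
applied to the lowercased or stripped term to compute the key.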

The first table can be used for full case and diacritics expansion or for
only one of those, by post-filtering the results of full expansion: to
expand only diacritics, we strip diacritics from each result term and from
the input, and keep the terms for which the two stripped forms are
identical. For example, if we have +mate -> (mate, maté, MATE, MATÉ)+ in the
table and want to perform only case expansion for an input of +maté+, we
apply case folding to each term of the initial output and keep +maté+ and
+MATÉ+, which both fold to +maté+. The terms +mate+ and +MATE+ fold to
+mate+, which differs from the input, so they are dropped.
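
The post-filtering step can be sketched as follows. This is a minimal
illustration under assumed names (+expand+, +expand_case+ and
+expand_diacritics+ are hypothetical, not Recoll identifiers): the full
expansion is looked up first, and a candidate is kept only if it matches the
input once both have been normalized along the dimensions being expanded.

```python
import unicodedata


def strip_diacritics(term):
    """Drop combining marks after canonical decomposition (an approximation)."""
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def expand(term, table, expand_case=True, expand_diacritics=True):
    """Full expansion through the first table, then optional post-filtering."""

    def key(t):
        # Normalize only along the dimensions we are allowed to vary.
        if expand_diacritics:
            t = strip_diacritics(t)
        if expand_case:
            t = t.lower()
        return t

    candidates = table.get(strip_diacritics(term).lower(), set())
    return {c for c in candidates if key(c) == key(term)}


# Hypothetical first-table content, taken from the example in the text.
table = {"mate": {"mate", "maté", "MATE", "MATÉ"}}

case_only = expand("maté", table, expand_diacritics=False)
diacritics_only = expand("mate", table, expand_case=False)
```

With this table, +case_only+ contains +maté+ and +MATÉ+, and
+diacritics_only+ contains +mate+ and +maté+.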

We only perform stemming expansion when case and diacritics sensitivity is
off. It is performed using the second and third tables, respectively on the
lowercased and on the lowercased/stripped output of the first step, and each
term in the stemming output is then expanded again for case and diacritics
(using the first table).

A full example of the expansion occurring during an insensitive search 
for +resume+ using French stemming on a mixed English/French index
follows. An important thing to remember is that the result of each
expansion is a function of the terms actually present in the index, not
some arbitrary computation (and so, of course, many of the possible but
absent variations are missing).

# The case and diacritics expansion of +resume+ yields +RESUME Resume
  Résumé resumé résume résumé resume+ 

# The stem expansion input list (lowercased) is:
 +resume resumé résume résumé+, and the output is:
 +resum resume resumenes resumer resumes resumé resumée résum résumait
 résumant résume résumer résumerai résumerait résumes résumez résumé résumée
 résumées résumés+ 

# Each of the above terms is then fed to case and diacritics expansion (first
 table), for the final output:
 +resume résumé Résumé résumer résume Resume résumés RESUME resumes
 resumer résumant resúmenes resumé résumait résumes résumée resumee
 résumerait Résumez résumerai RÉSUMÉES Resumée Resumes résumées+.
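
The three steps above can be sketched end to end on a toy index. Everything
here is an assumption made for illustration: +toy_stem+ is a crude suffix
stripper standing in for a real stemmer, the index content is invented, and
the function names are hypothetical rather than Recoll's own.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(term):
    """Drop combining marks after canonical decomposition (an approximation)."""
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def fold(term):
    return strip_diacritics(term).lower()


def toy_stem(term):
    """Hypothetical stand-in for a real stemmer: strips one common suffix."""
    for suffix in ("és", "es", "er", "é", "e", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term


def build_tables(index_terms, stem):
    case_diac = defaultdict(set)      # table 1: folded term -> raw terms
    stem_raw = defaultdict(set)       # table 2: stem -> lowercased terms
    stem_stripped = defaultdict(set)  # table 3: stripped stem -> folded terms
    for term in index_terms:
        lowered, folded = term.lower(), fold(term)
        case_diac[folded].add(term)
        stem_raw[stem(lowered)].add(lowered)
        stem_stripped[stem(folded)].add(folded)
    return case_diac, stem_raw, stem_stripped


def expand_insensitive(word, tables, stem):
    case_diac, stem_raw, stem_stripped = tables
    # Step 1: case and diacritics expansion of the input.
    step1 = case_diac.get(fold(word), set())
    # Step 2: stem expansion of the lowercased and of the stripped terms.
    stemmed = set()
    for term in {t.lower() for t in step1}:
        stemmed |= stem_raw.get(stem(term), set())
        stemmed |= stem_stripped.get(stem(strip_diacritics(term)), set())
    # Step 3: case and diacritics expansion of every stem expansion result.
    final = set()
    for term in stemmed:
        final |= case_diac.get(fold(term), set())
    return final


# Invented index content; "mate" is unrelated and must not be returned.
INDEX_TERMS = ["Résumé", "résumé", "resume", "RESUME",
               "résumés", "résumer", "resumes", "mate"]
tables = build_tables(INDEX_TERMS, toy_stem)
result = expand_insensitive("resume", tables, toy_stem)
```

With this toy data, +result+ holds every index term except +mate+. As in the
real example, only terms actually present in the index can appear in the
output.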

A Xapian OR query is finally constructed from the expanded term list.