Diff of /website/faqsandhowtos/ZDevCaseAndDiacritics3.txt [000000] .. [821fb7]

--- a
+++ b/website/faqsandhowtos/ZDevCaseAndDiacritics3.txt
@@ -0,0 +1,67 @@
+== Character case and diacritic marks (3), implementation
+
+In previous pages, we discussed link:ZDevCaseAndDiacritics1.html[diacritics
+and stemming], and an link:ZDevCaseAndDiacritics2.html[appropriate
+interface] for switchable search sensitivity to diacritics and character
+case.
+
+So you are in this mood again and you don't want to type accents (maybe you're
+stuck with a QWERTY American English keyboard), or conversely you
+want to resume looking for your résumé, and you've told Recoll as much,
+using the appropriate interface. What happens then?
+
+The second case is easy if the index is raw, and mostly impossible if it is
+stripped. So we'll concentrate on the first case: how to achieve case and
+diacritics insensitivity on a raw index?
+
+Recoll uses three expansion tables:
+
+* The first table has stripped and lowercased terms as keys and raw terms as
+  data: +mate -> (mate, maté, MATE,...)+.
+
+* The second table has lowercased stems as keys and original lowercase terms
+  as data (when using multiple languages, there are several such tables):
+  +évit -> (éviter, évite, évitâmes, ...)+.
+
+* The third table has stripped and lowercased stems as keys and stripped
+  lowercased terms as data:
+  +evit -> (eviter, evite, evitons)+ and +evitam -> (evitames, ...)+
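
As a rough illustration of the three tables described above, the following Python sketch builds them from a list of raw index terms. It is not Recoll's actual code: the names `strip_diacritics`, `fold`, `build_tables` and `toy_stem` are hypothetical, the tables are plain in-memory dictionaries, and the stemmer is a toy suffix stripper standing in for a real French stemmer.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(term):
    """Remove combining marks: decompose (NFD), drop marks, recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def fold(term):
    """Stripped and lowercased form, used as key of the first table."""
    return strip_diacritics(term).lower()


def toy_stem(word):
    """Toy stand-in for a real stemmer (assumption, for illustration only)."""
    for suffix in ("er", "ons", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_tables(raw_terms, stem):
    """Return (case_diac, stem_raw, stem_stripped) as dicts of sets."""
    case_diac = defaultdict(set)      # stripped+lowercased term -> raw terms
    stem_raw = defaultdict(set)       # lowercased stem -> lowercased terms
    stem_stripped = defaultdict(set)  # stripped stem -> stripped terms
    for term in raw_terms:
        lowered = term.lower()
        stripped = fold(term)
        case_diac[stripped].add(term)
        stem_raw[stem(lowered)].add(lowered)
        stem_stripped[stem(stripped)].add(stripped)
    return case_diac, stem_raw, stem_stripped
```

With the terms `mate`, `maté` and `MATE` in the index, the first table maps `mate` to all three, which matches the first bullet above.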
+
+The first table can be used for full case and diacritics expansion, or for
+only one of the two, by post-filtering the results of the full expansion.
+If we only want diacritics expansion, we strip the diacritics from each
+result term and keep the term only if the stripped form is identical to the
+input with its diacritics stripped. Conversely, if we have
+the entry +mate -> (mate, maté, MATE, MATÉ)+ in the table and only want to
+perform case expansion for an input of +maté+, we apply case folding
+to each term of the initial output and keep only the terms which fold to
+the input, here +maté+ and +MATÉ+. The other two terms are dropped because
+they fold to +mate+, which differs from the input.
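
The post-filtering described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the first table is modelled as a plain dictionary, and `expand`, `case_sens` and `diac_sens` are hypothetical names, not part of Recoll.

```python
import unicodedata


def strip_diacritics(term):
    """Remove combining marks: decompose (NFD), drop marks, recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def expand(table, term, case_sens=False, diac_sens=False):
    """Expand term through the first table, then post-filter the result.

    table maps stripped+lowercased keys to sets of raw index terms.
    """
    key = strip_diacritics(term).lower()
    result = set()
    for cand in table.get(key, ()):
        if case_sens and diac_sens:
            keep = cand == term
        elif case_sens:
            # Diacritics expansion only: case must match once marks are gone.
            keep = strip_diacritics(cand) == strip_diacritics(term)
        elif diac_sens:
            # Case expansion only: marks must match once case is folded.
            keep = cand.lower() == term.lower()
        else:
            keep = True  # full expansion, no filtering
        if keep:
            result.add(cand)
    return result
```

For the table entry used in the text, `expand(table, "maté", diac_sens=True)` keeps the accented forms only, and `expand(table, "mate", case_sens=True)` keeps the lowercase forms only.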
+
+We only perform stemming expansion when case and diacritics sensitivity is
+off. It is performed using the second and third tables, on both the
+lowercased and the lowercased/stripped output of the first step, and each
+term in the stemming output is then expanded again for case and diacritics
+(using the first table).
+
+A full example of the expansion occurring during an insensitive search 
+for +resume+ using French stemming on a mixed English/French index
+follows. An important thing to remember is that the result of each
+expansion is a function of the terms actually present in the index, not
+some arbitrary computation (and so, of course, many of the possible but
+absent variations are missing).
+
+# The case and diacritics expansion of +resume+ yields +RESUME Resume
+  Résumé resumé résume résumé resume+ 
+
+# The stem expansion input list (lower-cased) is:
+ +resume resumé résume résumé+, and the output is:
+ +resum resume resumenes resumer resumes resumé resumée résum résumait
+ résumant résume résumer résumerai résumerait résumes résumez résumé résumée
+ résumées résumés+ 
+
+# Each of the above terms is then fed to case and diacritics expansion (first
+ table), for the final output:
+ +resume résumé Résumé résumer résume Resume résumés RESUME resumes
+ resumer résumant resúmenes resumé résumait résumes résumée resumee
+ résumerait Résumez résumerai RÉSUMÉES Resumée Resumes résumées+.
+
+A Xapian OR query is finally constructed from the expanded term list.
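
The three expansion steps of the example (everything up to, but not including, the construction of the Xapian query) can be sketched in Python as below. This is a minimal illustration and not Recoll's implementation: the tables are assumed to be plain dictionaries of sets, `insensitive_expand` is a hypothetical name, and `toy_stem` is a toy suffix stripper standing in for a real French stemmer.

```python
import unicodedata


def strip_diacritics(term):
    """Remove combining marks: decompose (NFD), drop marks, recompose (NFC)."""
    decomposed = unicodedata.normalize("NFD", term)
    kept = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unicodedata.normalize("NFC", kept)


def fold(term):
    """Stripped and lowercased form, used as key of the first table."""
    return strip_diacritics(term).lower()


def toy_stem(word):
    """Toy stand-in for a real stemmer (assumption, for illustration only)."""
    for suffix in ("er", "és", "es", "é", "e", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def insensitive_expand(term, case_diac, stem_raw, stem_stripped, stem):
    """Case, diacritics and stem expansion of term; returns a set of terms."""
    # Step 1: case and diacritics expansion through the first table.
    step1 = set(case_diac.get(fold(term), ()))

    # Step 2: stem expansion of the lowercased forms (second table) and of
    # the lowercased/stripped forms (third table).
    stem_out = set()
    for raw in step1:
        stem_out |= set(stem_raw.get(stem(raw.lower()), ()))
        stem_out |= set(stem_stripped.get(stem(fold(raw)), ()))

    # Step 3: each stemming result goes through the first table again.
    final = set()
    for candidate in stem_out:
        final |= set(case_diac.get(fold(candidate), ()))
    return final
```

As in the text, the result only ever contains terms that are present in the tables, so variations that are absent from the index never appear in the final list.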
+