Diff of /website/faqsandhowtos/ZDevCaseAndDiacritics2.txt [000000] .. [821fb7]

--- a
+++ b/website/faqsandhowtos/ZDevCaseAndDiacritics2.txt
@@ -0,0 +1,122 @@
+== Character case and diacritic marks (2), user interface
+
+In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
+of the problems which arise when mixing case/diacritics sensitivity and
+stemming.
+
+As of version 1.18, Recoll can create two types of indexes:
+* _Dumb_ indexes contain terms which are lowercased and stripped of
+  diacritics. Searches using such an index are naturally case- and
+  diacritics-insensitive: search terms are stripped before processing.
+* _Raw_ indexes contain terms exactly as they were found in the
+  source document. Searching such an index is naturally sensitive to case
+  and diacritics, and can be made insensitive by further processing.
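The difference between the two index types can be illustrated with a minimal Python sketch. This is not Recoll's code: the function names `strip_term` and `index_term` are hypothetical, and Unicode NFD decomposition is only one assumed way of removing diacritics.

```python
import unicodedata


def strip_term(term: str) -> str:
    """Lowercase a term and drop its combining diacritical marks."""
    # NFD decomposition splits "é" into "e" + U+0301 (combining acute accent).
    decomposed = unicodedata.normalize("NFD", term)
    # Keep base characters only; combining marks have a non-zero class.
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return without_marks.lower()


def index_term(term: str, strip_chars: bool) -> str:
    """Return the term as stored: stripped (dumb index) or verbatim (raw index)."""
    return strip_term(term) if strip_chars else term


print(index_term("Résumé", strip_chars=True))   # resume
print(index_term("Résumé", strip_chars=False))  # Résumé
```

With a dumb index, the same `strip_term` step would also be applied to the search terms, which is why such a search is insensitive by construction.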
+
+The following explains how users can control these Recoll features.
+
+=== Controlling the type of index we create: stripped or raw
+
+The kind of index that recoll creates is determined by:
+
+ * A build-time *configure* switch: _--enable-stripchars_. If this is
+   set, the code for case and diacritics sensitivity is not compiled in and
+   recoll will work like the previous versions: unaccented and casefolded
+   index, no runtime options for case or diacritics sensitivity.
+
+ * An indexing configuration switch (in recoll.conf): if Recoll was built
+   with _--disable-stripchars_, this will provide a dynamic way to return
+   to the "traditional" index. The case and diacritics code will be present
+   but inactive. Normally, a recoll installation with this switch set
+   should behave exactly like one built with _--enable-stripchars_. When
+   using multiple indexes, this switch MUST be consistent between
+   indexes. There is no support whatsoever for mixing raw and dumb indexes.
+   The option is named _indexStripChars_, and it is not settable from the
+   GUI to avoid errors. This is something that would typically be set once
+   and for all for a given installation. We need to decide what the default
+   value will be for 1.18.
+
+ * A number of query time switches. Using these it is also possible to
+   perform a search insensitive to case and diacritics on a raw index. Note,
+   however, that given the complexity of the issues involved, I give no
+   guarantee at this time that this will yield exactly the same results as
+   searching a dumb index. Details about query time behaviour follow.
+
+
+=== Controlling stem, case and diacritics expansion: user query interface 
+
+Recoll versions up to 1.17 were insensitive to case and diacritics. We only
+needed to give the user a way to control stem expansion. This was done in
+three ways:
+
+ * Globally, by setting a menu option.
+ * Globally, by setting the stemming language value to empty.
+ * On a term-by-term basis, by capitalizing the term or, in query language
+   mode only, by using an 'l' clause modifier (_"term"l_).
+
+After switching to an unstripped index, capable of case and diacritic
+sensitivity, we need ways to control what processing is performed among:
+
+ * Case expansion.
+ * Diacritics expansion.
+ * Stem expansion.
+
+The default mode will be compatible with the previous version, because
+this is most generally what we want to do: ignore case and diacritics,
+expand stems.
+
+There are two easy approaches for controlling the parameters:
+ * Global options set in the GUI menus or as *recollq* command line
+   switches. 
+ * Per-clause options set by modifiers in the query language.
+
+We would like, however, to let what the user types automatically override
+the defaults in a sensible way. For example:
+
+ * If a term is entered with diacritics, diacritic sensitivity is turned on
+   (for this term only).
+ * If a term is entered with upper-case characters, case sensitivity is
+   turned on. In this case, we turn off stem expansion, because it really
+   makes no sense with case sensitivity.
+
+This method leaves us with three problems (only if the global mode is
+set to insensitive, and we're not using the query language):
+
+ * Turning off stemming without turning on case sensitivity.
+ * Searching for an all lower-case term in case-sensitive mode.
+ * Searching for a term without diacritics in diacritic-sensitive mode.
+
+The two latter issues are relatively marginal and can be worked around easily
+by switching to query language mode or using negative clauses in the
+advanced search. 
+
+However, we need to be able to turn stemming off while remaining
+insensitive to case, and we need to stay reasonably compatible with the
+previous versions. This means that a term which has a capital first letter
+but is otherwise lowercase will turn stemming off, but will not turn case
+sensitivity on.
+
+This leaves the question of how to search for such a term in a
+case-sensitive way: for this, you'll have to use global options or the
+query language.
+
+The modified method is:
+
+ * If a term is entered with diacritics, diacritic sensitivity is turned on
+   (for this term only).
+ * If the first letter in a term is upper-case and the rest is lower-case,
+   we turn stem expansion off, but we do not become case-sensitive.
+ * If any letter in a term except the first is upper-case, case sensitivity
+   is turned on. Stem expansion is also turned off (even if the first
+   letter is lower-case), because it really makes no sense with case
+   sensitivity.
+ * To search for an all lower-case or capitalized term in a case-sensitive
+   way, use the query language: _"Capitalized"C_, _"lowercase"C_.
+ * Use the query language and the 'D' modifier to turn on diacritics
+   sensitivity.
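The automatic part of the modified method (the first three rules) can be sketched in Python as follows. This is an illustration only, not the Recoll implementation: `TermOptions`, `has_diacritics` and `guess_options` are hypothetical names, and the sketch assumes the global mode is the default one (insensitive, with stem expansion). It also assumes that diacritics sensitivity switches stem expansion off, since the two are described as incompatible.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TermOptions:
    case_sensitive: bool
    diac_sensitive: bool
    stem_expand: bool


def has_diacritics(term: str) -> bool:
    """True if any character decomposes into a base letter plus combining marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(ch) for ch in decomposed)


def guess_options(term: str) -> TermOptions:
    """Derive per-term options from how the user typed the term."""
    diac_sensitive = has_diacritics(term)
    # Any upper-case letter after the first one turns case sensitivity on.
    case_sensitive = any(ch.isupper() for ch in term[1:])
    # Upper-case first letter, no other upper-case letter: stemming goes
    # off, but the search stays case-insensitive.
    capitalized = term[:1].isupper() and not case_sensitive
    # Assumption: diacritics sensitivity also disables stem expansion.
    stem_expand = not (case_sensitive or capitalized or diac_sensitive)
    return TermOptions(case_sensitive, diac_sensitive, stem_expand)


for term in ("recoll", "Recoll", "reCOLL", "résumé"):
    print(term, guess_options(term))
```

Under these rules `Recoll` only drops stemming, while `reCOLL` becomes case-sensitive. A case-sensitive search for `Recoll` or `recoll` cannot be guessed from the input, which is why it needs the 'C' modifier or a global option.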
+
+Note that some combinations of choices do not make sense and are not
+allowed by Recoll: for example, diacritics or case sensitivity do not make
+sense with stem expansion (which cannot preserve diacritics in any
+meaningful general way).
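As a rough illustration of this constraint, a check on a set of options could look like the Python sketch below. The function name `is_allowed` is hypothetical, and the sketch says nothing about how Recoll itself handles a disallowed combination (rejecting it, or silently dropping stem expansion).

```python
def is_allowed(case_sensitive: bool, diac_sensitive: bool, stem_expand: bool) -> bool:
    """Return False when stem expansion is combined with any sensitivity."""
    # Stem expansion only makes sense for a fully insensitive search.
    return not (stem_expand and (case_sensitive or diac_sensitive))


print(is_allowed(False, False, True))   # True: the default mode
print(is_allowed(True, False, True))    # False: case-sensitive with stemming
```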
+
+The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
+implementation in Recoll 1.18.