--- a
+++ b/website/faqsandhowtos/ZDevCaseAndDiacritics2.txt
@@ -0,0 +1,122 @@
+== Character case and diacritic marks (2), user interface
+
+In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
+of the problems which arise when mixing case/diacritics sensitivity and
+stemming.
+
+As of version 1.18, Recoll can create two types of indexes:
+* _Dumb_ indexes contain terms which are lowercased and stripped of
+ diacritics. Searches using such an index are naturally case- and
+ diacritics-insensitive: search terms are stripped before processing.
+* _Raw_ indexes contain terms exactly as they were found in the
+ source document. Searching such an index is naturally sensitive to case
+ and diacritics, and can be made insensitive by further processing.
+
+The following explains how users can control these Recoll features.
+
+=== Controlling the type of index we create: stripped or raw
+
+The kind of index that recoll creates is determined by:
+
+ * A build-time *configure* switch: _--enable-stripchars_. If this is
+ set, the code for case and diacritics sensitivity is not compiled in and
+ Recoll will work like the previous versions: unaccented and casefolded
+ index, no runtime options for case or diacritics sensitivity.
+
+ * An indexing configuration switch (in recoll.conf): if Recoll was built
+ with _--disable-stripchars_, this will provide a dynamic way to return
+ to the "traditional" index. The case and diacritics code will be present
+ but inactive. Normally, a recoll installation with this switch set
+ should behave exactly like one built with _--enable-stripchars_. When
+ using multiple indexes, this switch MUST be consistent between
+ indexes. There is no support whatsoever for mixing raw and dumb indexes.
+ The option is named _indexStripChars_ and, to avoid errors, it cannot be
+ set from the GUI. This is something that would typically be set once
+ and for all for a given installation. We need to decide what the default
+ value will be for 1.18.
+
+ * A number of query time switches. Using these it is also possible to
+ perform a search insensitive to case and diacritics on a raw index. Note,
+ however, that given the complexity of the issues involved, I give no
+ guarantee at this time that this will yield exactly the same results as
+ searching a dumb index. Details about query time behaviour follow.
+
+
+=== Controlling stem, case and diacritics expansion: user query interface
+
+Recoll versions up to 1.17 were insensitive to case and diacritics. We only
+needed to give the user a way to control stem expansion. This was done in
+three ways:
+
+ * Globally, by setting a menu option.
+ * Globally, by setting the stemming language value to empty.
+ * On a term-by-term basis, by capitalizing the term, or, in query language
+ mode only, by using an 'l' clause modifier (_"term"l_).
+
+After switching to an unstripped index, capable of case and diacritic
+sensitivity, we need ways to control what processing is performed among:
+
+ * Case expansion.
+ * Diacritics expansion.
+ * Stem expansion.
+
+The default mode will be compatible with the previous version, because
+this is most generally what we want to do: ignore case and diacritics,
+expand stems.
+
+There are two easy approaches for controlling the parameters:
+ * Global options set in the GUI menus or as *recollq* command line
+ switches.
+ * Per-clause options set by modifiers in the query language.
+
+We would like, however, to let the user's entry automatically override the
+defaults in a sensible way. For example:
+
+ * If a term is entered with diacritics, diacritic sensitivity is turned on
+ (for this term only).
+ * If a term is entered with upper-case characters, case sensitivity is
+ turned on. In this case, we turn off stem expansion, because it really
+ makes no sense with case sensitivity.
+
+With this method we are stuck with three problems (only if the global mode is
+set to insensitive, and we're not using the query language):
+
+ * Turning off stemming without turning on case sensitivity.
+ * Searching for an all lower-case term in case-sensitive mode.
+ * Searching for a term without diacritics in diacritic-sensitive mode.
+
+The latter two issues are relatively marginal and can be worked around easily
+by switching to query language mode or using negative clauses in the
+advanced search.
+
+However, we need to be able to turn stemming off while remaining
+insensitive to case, and we need to stay reasonably compatible with the
+previous versions. This means that a term which has a capital first letter
+but is otherwise lowercase will turn stemming off, but will not turn case
+sensitivity on.
+
+So we're left with how to search for such a term in a case-sensitive way,
+and for this, you'll have to use global options or the query language.
+
+The modified method is:
+
+ * If a term is entered with diacritics, diacritic sensitivity is turned on
+ (for this term only).
+ * If the first letter in a term is upper-case and the rest is lower-case,
+ we turn stem expansion off, but we do not become case-sensitive.
+ * If any letter in a term except the first is upper-case, case sensitivity
+ is turned on. Stem expansion is also turned off (even if the first
+ letter is lower-case), because it really makes no sense with case
+ sensitivity.
+ * To search for an all lower-case or capitalized term in a case-sensitive
+ way, use the query language: _"Capitalized"C_, _"lowercase"C_.
+ * Use the query language and the "D" modifier to turn on diacritics
+ sensitivity.
+
+Note that some combinations of choices do not make sense and are not
+allowed by Recoll: for example, diacritics or case sensitivity do not
+make sense with stem expansion (which cannot preserve diacritics in
+any meaningful general way).
+
+The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
+implementation in Recoll 1.18.