== Character case and diacritic marks (2), user interface

In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
of the problems which arise when mixing case/diacritics sensitivity and
stemming.

As of version 1.18, Recoll can create two types of indexes:

* _Dumb_ indexes contain terms which are lowercased and stripped of
  diacritics. Searches using such an index are naturally case- and
  diacritics-insensitive: search terms are stripped in the same way before
  processing.
* _Raw_ indexes contain terms which are just like they were found in the
  source document. Searching such an index is naturally sensitive to case
  and diacritics, and can be made insensitive by further processing.
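
The difference between the two index types can be illustrated with a minimal
sketch. This is not Recoll's actual stripping code: the function names
`strip_term` and `index_term` are hypothetical, and the stripping is
approximated with Python's standard Unicode decomposition.

```python
import unicodedata


def strip_term(term: str) -> str:
    """Lowercase a term and remove its combining diacritical marks."""
    # Decompose accented characters into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks, keeping only the base characters.
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", no_marks).lower()


def index_term(term: str, strip_chars: bool) -> str:
    """Return the term as stored: stripped (dumb index) or unchanged (raw index)."""
    return strip_term(term) if strip_chars else term


words = ["Résumé", "resume", "RESUME"]
# A dumb index collapses all three spellings into a single term.
dumb = {index_term(w, strip_chars=True) for w in words}
# A raw index keeps the three spellings as distinct terms.
raw = {index_term(w, strip_chars=False) for w in words}
```

With a dumb index, the three spellings become one term, so a query can only be
insensitive. With a raw index they stay distinct, and insensitivity has to be
obtained by expanding the query term at search time.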

The following explains how users can control these Recoll features.

=== Controlling the type of index we create: stripped or raw

The kind of index that recoll creates is determined by:

 * A build-time *configure* switch: _--enable-stripchars_. If this is
   set, the code for case and diacritics sensitivity is not compiled in and
   recoll will work like the previous versions: the index is unaccented and
   casefolded, and there are no runtime options for case or diacritics
   sensitivity.

 * An indexing configuration switch (in recoll.conf), named
   _indexStripChars_. If Recoll was built with _--disable-stripchars_,
   this provides a dynamic way to return to the "traditional" index: the
   case and diacritics code is present but inactive. Normally, a recoll
   installation with this switch set should behave exactly like one built
   with _--enable-stripchars_. When using multiple indexes, this switch
   MUST be consistent between indexes: there is no support whatsoever for
   mixing raw and dumb indexes. The option is not settable from the GUI,
   to avoid errors, as it would typically be set once and for all for a
   given installation. The default value for 1.18 has yet to be decided.

 * A number of query time switches. Using these, it is also possible to
   perform a search insensitive to case and diacritics on a raw index. Note,
   however, that given the complexity of the issues involved, I give no
   guarantee at this time that this will yield exactly the same results as
   searching a dumb index. Details about query time behaviour follow.


=== Controlling stem, case and diacritics expansion: user query interface 

Recoll versions up to 1.17 were insensitive to case and diacritics. We only
needed to give the user a way to control stem expansion. This was done in
three ways:

 * Globally, by setting a menu option.
 * Globally, by setting the stemming language value to empty.
 * On a term by term basis, by capitalizing the term, or, in query language
   mode only, by using an 'l' clause modifier (_"term"l_).

After switching to an unstripped index, capable of case and diacritic
sensitivity, we need ways to control what processing is performed among:

 * Case expansion.
 * Diacritics expansion.
 * Stem expansion.

The default mode will be compatible with the previous version, because
this is most generally what we want to do: ignore case and diacritics,
expand stems.

There are two easy approaches for controlling the parameters:

 * Global options set in the GUI menus or as *recollq* command line
   switches.
 * Per-clause options set by modifiers in the query language.

We would like, however, to let what the user types automatically override
the defaults in a sensible way. For example:

 * If a term is entered with diacritics, diacritic sensitivity is turned on
   (for this term only).
 * If a term is entered with upper-case characters, case sensitivity is
   turned on. In this case, we turn off stem expansion, because it really
   makes no sense with case sensitivity.

With this method we are stuck with 3 problems (only if the global mode is
set to insensitive, and we're not using the query language):

 * Turning off stemming without turning on case sensitivity.
 * Searching for an all lower-case term in case-sensitive mode.
 * Searching for a term without diacritics in diacritic-sensitive mode.

The two latter issues are relatively marginal and can be worked around easily
by switching to query language mode or using negative clauses in the
advanced search. 

However, we need to be able to turn stemming off while remaining
insensitive to case, and we need to stay reasonably compatible with the
previous versions. This means that a term which has a capital first letter
but is otherwise lowercase will turn stemming off, but not case sensitivity
on. 

This leaves open the question of how to search for such a capitalized term
in a case-sensitive way. For this, you'll have to use global options or the
query language.

The modified method is:

 * If a term is entered with diacritics, diacritic sensitivity is turned on
   (for this term only).
 * If the first letter in a term is upper-case and the rest is lower-case,
   we turn stem expansion off, but we do not become case-sensitive.
 * If any letter in a term except the first is upper-case, case sensitivity
   is turned on. Stem expansion is also turned off (even if the first
   letter is lower-case), because it really makes no sense with case
   sensitivity.
 * To search for an all lower-case or capitalized term in a case-sensitive
   way, use the query language and the 'C' modifier: _"Capitalized"C_,
   _"lowercase"C_.
 * To turn on diacritics sensitivity for a term entered without diacritics,
   use the query language and the 'D' modifier.

Note that some combinations of choices do not make sense and are not
allowed by Recoll. For example, neither diacritics sensitivity nor case
sensitivity makes sense with stem expansion (which cannot preserve
diacritics in any meaningful general way).
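
The modified method, combined with the incompatibility just noted, can be
sketched as a small function which derives per-term options from what the
user typed. This is an illustration only, not the Recoll implementation:
`TermOptions` and `options_for_term` are hypothetical names, the defaults are
assumed to be the insensitive global mode, and turning stemming off whenever
diacritics are present is my reading of the rule that diacritics sensitivity
and stem expansion cannot be combined.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TermOptions:
    case_sensitive: bool
    diac_sensitive: bool
    stem_expand: bool


def has_diacritics(term: str) -> bool:
    """True if any character in the term carries a combining mark."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.combining(ch) for ch in decomposed)


def options_for_term(term: str) -> TermOptions:
    """Derive options for one plain (non query language) search term."""
    # Diacritics in the input turn on diacritics sensitivity for this term.
    diac = has_diacritics(term)
    # A capital first letter alone only turns stemming off.
    capitalized = term[:1].isupper()
    # Any upper-case letter after the first turns on case sensitivity.
    case = any(ch.isupper() for ch in term[1:])
    # Stemming is incompatible with case or diacritics sensitivity.
    stem = not (capitalized or case or diac)
    return TermOptions(case_sensitive=case, diac_sensitive=diac, stem_expand=stem)


for term in ["floor", "Floor", "fLoor", "résumé"]:
    print(term, options_for_term(term))
```

The cases this heuristic cannot express (an all lower-case or capitalized term
searched case-sensitively, or an unaccented term searched
diacritics-sensitively) are exactly the ones which require the 'C' and 'D'
query language modifiers or the global options.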

The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
implementation in Recoll 1.18.