== Character case and diacritic marks (2), user interface

In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
of the problems which arise when mixing case/diacritics sensitivity and
stemming.

As of version 1.18, Recoll can create two types of indexes:
* _Dumb_ indexes contain terms which are lowercased and stripped of
  diacritics. Searches using such an index are naturally case- and
  diacritics-insensitive: search terms are stripped before processing.
* _Raw_ indexes contain terms exactly as they were found in the
  source document. Searching such an index is naturally sensitive to case
  and diacritics, and can be made insensitive by further processing.
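
To make the difference between the two index types concrete, here is a minimal sketch of what "stripping" a term could look like. It is only an illustration: `strip_term` is a hypothetical helper, not Recoll's actual code, and it assumes that removing Unicode combining marks after NFD decomposition is an acceptable approximation of diacritics removal.

```python
import unicodedata


def strip_term(term: str) -> str:
    """Lowercase a term and remove its combining diacritical marks."""
    # NFD decomposition splits an accented letter into a base letter
    # followed by one or more combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks (Unicode category "Mn"), keep the rest.
    unaccented = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    return unaccented.lower()


source_terms = ["Résumé", "resume", "RECOLL"]

# A "dumb" index stores the stripped form, a "raw" index stores terms as found.
dumb_index = {strip_term(t) for t in source_terms}
raw_index = set(source_terms)
```

With this sketch, the dumb index holds only `resume` and `recoll`, while the raw index keeps the three original spellings.
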

The following explains how users can control these Recoll features.

=== Controlling the type of index we create: stripped or raw

The kind of index that recoll creates is determined by:

 * A build-time *configure* switch: _--enable-stripchars_. If this is
   set, the code for case and diacritics sensitivity is not compiled in and
   recoll will work like the previous versions: unaccented and casefolded
   index, no runtime options for case or diacritics sensitivity.

 * An indexing configuration switch (in recoll.conf): if Recoll was built
   with _--disable-stripchars_, this will provide a dynamic way to return
   to the "traditional" index. The case and diacritics code will be present
   but inactive. Normally, a recoll installation with this switch set
   should behave exactly like one built with _--enable-stripchars_. When
   using multiple indexes, this switch MUST be consistent between
   indexes. There is no support whatsoever for mixing raw and dumb indexes.
   The option is named _indexStripChars_, and it cannot be set from the
   GUI, to avoid errors. This is something that would typically be set once
   and for all for a given installation. We need to decide what the default
   value will be for 1.18.

 * A number of query time switches. Using these it is also possible to
   perform a search insensitive to case and diacritics on a raw index. Note,
   however, that given the complexity of the issues involved, I give no
   guarantee at this time that this will yield exactly the same results as
   searching a dumb index. Details about query time behaviour follow.
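
One way to picture an insensitive search on a raw index is term expansion: the query term is stripped, then replaced by every raw index term that strips to the same form. The sketch below is an assumption made for illustration, not a description of Recoll's implementation. The names `strip_term`, `build_expansion_map` and `expand` are hypothetical.

```python
import unicodedata
from collections import defaultdict


def strip_term(term: str) -> str:
    """Lowercase a term and remove its combining diacritical marks."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    ).lower()


def build_expansion_map(raw_terms):
    """Map each stripped form to the set of raw terms that produce it."""
    expansions = defaultdict(set)
    for term in raw_terms:
        expansions[strip_term(term)].add(term)
    return expansions


def expand(query_term, expansions):
    """Return all raw index terms matching the query, ignoring case and diacritics."""
    return set(expansions.get(strip_term(query_term), set()))


raw_terms = ["Resume", "résumé", "resume", "RESUME", "recoll"]
expansions = build_expansion_map(raw_terms)
matches = expand("Résumé", expansions)
```

Here `matches` contains the four spellings of "resume" found in the raw index, and the search would then be run as an OR over them. Whether this gives the same results as a dumb index depends on how closely the stripping at query time matches the stripping used at indexing time.
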

=== Controlling stem, case and diacritics expansion: user query interface

Recoll versions up to 1.17 were insensitive to case and diacritics. We only
needed to give the user a way to control stem expansion. This was done in
three ways:

 * Globally, by setting a menu option.
 * Globally, by setting the stemming language value to empty.
 * On a term by term basis by capitalizing the term, or, in query language
   mode only, by using an 'l' clause modifier (_"term"l_).

After switching to an unstripped index, capable of case and diacritic
sensitivity, we need ways to control which of the following kinds of
processing are performed:

 * Case expansion.
 * Diacritics expansion.
 * Stem expansion.

The default mode will be compatible with the previous version, because
this is most generally what we want to do: ignore case and diacritics,
expand stems.

There are two easy approaches for controlling the parameters:
 * Global options set in the GUI menus or as *recollq* command line
   switches.
 * Per-clause options set by modifiers in the query language.

We would like, however, to let the user's input automatically override the
defaults in a sensible way. For example:

 * If a term is entered with diacritics, diacritic sensitivity is turned on
   (for this term only).
 * If a term is entered with upper-case characters, case sensitivity is
   turned on. In this case, we turn off stem expansion, because it really
   makes no sense with case sensitivity.

With this method we are stuck with 3 problems (only if the global mode is
set to insensitive, and we're not using the query language):

 * Turning off stemming without turning on case sensitivity.
 * Searching for an all lower-case term in case-sensitive mode.
 * Searching for a term without diacritics in diacritic-sensitive mode.

The latter two issues are relatively marginal and can be worked around easily
by switching to query language mode or using negative clauses in the
advanced search.

However, we need to be able to turn stemming off while remaining
insensitive to case, and we need to stay reasonably compatible with the
previous versions. This means that a term which has a capital first letter
but is otherwise lowercase will turn stemming off, but will not turn case
sensitivity on.

So we're left with how to search for such a term in a case-sensitive way,
and for this, you'll have to use global options or the query language.

The modified method is:

 * If a term is entered with diacritics, diacritic sensitivity is turned on
   (for this term only).
 * If the first letter in a term is upper-case and the rest is lower-case,
   we turn stem expansion off, but we do not become case-sensitive.
 * If any letter in a term except the first is upper-case, case sensitivity
   is turned on. Stem expansion is also turned off (even if the first
   letter is lower-case), because it really makes no sense with case
   sensitivity.
 * To search for an all lower-case or capitalized term in a case-sensitive
   way, use the query language: "Capitalized"C, "lowercase"C.
 * Use the query language and the "D" modifier to turn on diacritics
   sensitivity.
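
The automatic part of these rules can be sketched as a small function that looks only at how a term was typed. This is an illustration under stated assumptions, not Recoll's code: `TermOptions` and `infer_options` are hypothetical names, the global defaults are assumed to be "insensitive, stemming on", and the sketch assumes that diacritic sensitivity also disables stem expansion, since this document treats the two as incompatible. Explicit query language modifiers such as `C` and `D` are not modelled.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TermOptions:
    stem_expansion: bool
    case_sensitive: bool
    diac_sensitive: bool


def has_diacritics(term: str) -> bool:
    """True if any character of the term carries a combining mark."""
    decomposed = unicodedata.normalize("NFD", term)
    return any(unicodedata.category(ch) == "Mn" for ch in decomposed)


def infer_options(term: str) -> TermOptions:
    """Derive per-term search options from the way the term was entered."""
    # Rule 1: diacritics in the input turn diacritic sensitivity on.
    diac_sensitive = has_diacritics(term)
    # Rule 3: an upper-case letter anywhere after the first one
    # turns case sensitivity on.
    case_sensitive = any(ch.isupper() for ch in term[1:])
    # Rule 2: a capitalized, otherwise lower-case term only disables stemming.
    capitalized = term[:1].isupper() and not case_sensitive
    # Assumption: any kind of sensitivity disables stem expansion.
    stem_expansion = not (capitalized or case_sensitive or diac_sensitive)
    return TermOptions(stem_expansion, case_sensitive, diac_sensitive)


for word in ["floors", "Floors", "iPhone", "résumé"]:
    print(word, infer_options(word))
```

In this sketch "floors" keeps all defaults, "Floors" only loses stem expansion, "iPhone" becomes case-sensitive, and "résumé" becomes diacritics-sensitive. The remaining cases, such as an all lower-case term searched case-sensitively, still need the query language modifiers listed above.
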

It can be noted that some combinations of choices do not make sense and
are not allowed by Recoll: for example, diacritics or case sensitivity
do not make sense with stem expansion (which cannot preserve diacritics in
any meaningful general way).

The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
implementation in Recoll 1.18.