== Character case and diacritic marks (2), user interface

In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
of the problems which arise when mixing case/diacritics sensitivity and
stemming.

As of version 1.18, Recoll can create two types of indexes:

* _Dumb_ indexes contain terms which are lowercased and stripped of
  diacritics. Searches using such an index are naturally case- and
  diacritics-insensitive: search terms are stripped before processing.
* _Raw_ indexes contain terms which are just like they were found in the
  source document. Searching such an index is naturally sensitive to case
  and diacritics, and can be made insensitive by further processing.
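
To make the difference between the two index types concrete, here is a minimal sketch of what "stripping" a term could look like. It is only an illustration built on the Python standard library, not Recoll's actual code; the function name `strip_term` and the sample terms are made up for this example.

```python
import unicodedata


def strip_term(term: str) -> str:
    """Lowercase a term and remove its diacritical marks (illustration only)."""
    # Decompose accented characters into a base character plus combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks (Unicode category Mn), keep the base characters.
    without_marks = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Recompose what is left and lowercase it.
    return unicodedata.normalize("NFC", without_marks).lower()


# A raw index would store these three terms as found in the documents.
raw_terms = ["Résumé", "resume", "RESUME"]
# A dumb index would store only their stripped form.
dumb_terms = sorted({strip_term(t) for t in raw_terms})
```

With a dumb index the three spellings collapse into a single term, which is why searching it is insensitive by construction. With a raw index they remain distinct, and insensitivity has to be obtained by expanding the query.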

The following explains how users can control these Recoll features.

=== Controlling the type of index we create: stripped or raw

The kind of index that Recoll creates is determined by:

* A build-time *configure* switch: _--enable-stripchars_. If this is
  set, the code for case and diacritics sensitivity is not compiled in and
  Recoll will work like the previous versions: unaccented and casefolded
  index, no runtime options for case or diacritics sensitivity.

* An indexing configuration switch (in recoll.conf): if Recoll was built
  with _--disable-stripchars_, this will provide a dynamic way to return
  to the "traditional" index. The case and diacritics code will be present
  but inactive. Normally, a Recoll installation with this switch set
  should behave exactly like one built with _--enable-stripchars_. When
  using multiple indexes, this switch MUST be consistent between
  indexes. There is no support whatsoever for mixing raw and dumb indexes.
  The option is named _indexStripChars_, and it is not settable from the
  GUI to avoid errors. This is something that would typically be set once
  and for all for a given installation. We need to decide what the default
  value will be for 1.18.

* A number of query-time switches. Using these, it is also possible to
  perform a search insensitive to case and diacritics on a raw index. Note,
  however, that given the complexity of the issues involved, I give no
  guarantee at this time that this will yield exactly the same results as
  searching a dumb index. Details about query-time behaviour follow.
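
Since _indexStripChars_ must be consistent between all the indexes queried together, a small check along the following lines may help before combining indexes. This is a sketch under assumptions: the parser only handles plain `name = value` lines and `#` comments, the helper names are hypothetical, and the `0`/`1` values in the usage lines are assumed. Because the default is still undecided, it is passed in explicitly.

```python
from typing import List, Optional


def read_index_strip_chars(conf_text: str) -> Optional[str]:
    """Return the last indexStripChars value found in the text, or None."""
    value = None
    for line in conf_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, rest = line.partition("=")
        if sep and name.strip() == "indexStripChars":
            value = rest.strip()
    return value


def strip_settings_consistent(conf_texts: List[str], default: str) -> bool:
    """True if every configuration resolves to the same indexStripChars value."""
    resolved = {read_index_strip_chars(text) or default for text in conf_texts}
    return len(resolved) <= 1


# Two hypothetical recoll.conf contents with conflicting settings.
main_conf = "# main index\nindexStripChars = 1\n"
extra_conf = "indexStripChars = 0\n"
mixed_ok = strip_settings_consistent([main_conf, extra_conf], default="1")
```

Here `mixed_ok` is false, which is the situation to avoid: one index would be dumb and the other raw.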

=== Controlling stem, case and diacritics expansion: user query interface

Recoll versions up to 1.17 were insensitive to case and diacritics. We only
needed to give the user a way to control stem expansion. This was done in
three ways:

* Globally, by setting a menu option.
* Globally, by setting the stemming language value to empty.
* On a term-by-term basis, by capitalizing the term or, in query language
  mode only, by using an 'l' clause modifier (_"term"l_).

After switching to an unstripped index, capable of case and diacritic
sensitivity, we need ways to control which of the following kinds of
processing are performed:

* Case expansion.
* Diacritics expansion.
* Stem expansion.

The default mode will be compatible with the previous version, because
this is most generally what we want to do: ignore case and diacritics,
expand stems.

There are two easy approaches for controlling the parameters:

* Global options set in the GUI menus or as *recollq* command line
  switches.
* Per-clause options set by modifiers in the query language.

We would like, however, to let the user entry automatically override the
defaults in a sensible way. For example:

* If a term is entered with diacritics, diacritic sensitivity is turned on
  (for this term only).
* If a term is entered with upper-case characters, case sensitivity is
  turned on. In this case, we turn off stem expansion, because it really
  makes no sense with case sensitivity.

With this method we are stuck with three problems (only if the global mode is
set to insensitive, and we're not using the query language):

* Turning off stemming without turning on case sensitivity.
* Searching for an all lower-case term in case-sensitive mode.
* Searching for a term without diacritics in diacritic-sensitive mode.

The latter two issues are relatively marginal and can be worked around easily
by switching to query language mode or using negative clauses in the
advanced search.

However, we need to be able to turn stemming off while remaining
insensitive to case, and we need to stay reasonably compatible with the
previous versions. This means that a term which has a capital first letter
but is otherwise lowercase will turn stemming off, but will not turn case
sensitivity on.

So we're left with how to search for such a term in a case-sensitive way,
and for this, you'll have to use global options or the query language.

The modified method is:

* If a term is entered with diacritics, diacritic sensitivity is turned on
  (for this term only).
* If the first letter in a term is upper-case and the rest is lower-case,
  we turn stem expansion off, but we do not become case-sensitive.
* If any letter in a term except the first is upper-case, case sensitivity
  is turned on. Stem expansion is also turned off (even if the first
  letter is lower-case), because it really makes no sense with case
  sensitivity.
* To search for an all lower-case or capitalized term in a case-sensitive
  way, use the query language: "Capitalized"C, "lowercase"C.
* To turn on diacritics sensitivity for a term entered without diacritics,
  use the query language and the "D" modifier.

It can be noted that some combinations of choices do not make sense and
are not allowed by Recoll: for example, diacritics or case sensitivity
do not make sense with stem expansion (which cannot preserve diacritics in
any meaningful general way).
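
The automatic override rules of the modified method, combined with the constraint just stated, can be sketched as a small decision function. This is an illustration only, not the Recoll implementation: the names `TermOptions`, `has_diacritics` and `analyze_term` are hypothetical, and turning stem expansion off whenever diacritic sensitivity is on is my reading of the remark above rather than an explicit rule of the list.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class TermOptions:
    case_sensitive: bool
    diac_sensitive: bool
    stem_expansion: bool


def has_diacritics(term: str) -> bool:
    """True if any character decomposes into a base plus a combining mark."""
    return any(
        unicodedata.category(ch) == "Mn"
        for ch in unicodedata.normalize("NFD", term)
    )


def analyze_term(term: str) -> TermOptions:
    """Apply the automatic override rules to one user-entered term."""
    # A term entered with diacritics turns diacritic sensitivity on.
    diac_sensitive = has_diacritics(term)
    # Any upper-case letter after the first one turns case sensitivity on.
    case_sensitive = any(ch.isupper() for ch in term[1:])
    # A capital first letter followed by no other capitals only turns stemming off.
    capitalized = term[:1].isupper() and not case_sensitive
    # Assumption: stem expansion is kept only when no sensitivity is requested.
    stem_expansion = not (capitalized or case_sensitive or diac_sensitive)
    return TermOptions(case_sensitive, diac_sensitive, stem_expansion)
```

For instance, `analyze_term("floor")` keeps the default behaviour, `analyze_term("Floor")` only turns stemming off, and `analyze_term("fLoor")` becomes case-sensitive. The cases that cannot be expressed this way (an all lower-case term searched case-sensitively, for example) are the ones left to the query language modifiers.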

The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
implementation in Recoll 1.18.