== Character case and diacritic marks (2), user interface
In a link:ZDevCaseAndDiacritics1.html[previous document], we discussed some
of the problems which arise when mixing case/diacritics sensitivity and
stemming.
As of version 1.18, Recoll can create two types of indexes:
* _Dumb_ indexes contain terms which are lowercased and stripped of
diacritics. Searches using such an index are naturally case- and
diacritics-insensitive: search terms are stripped before processing.
* _Raw_ indexes contain terms exactly as they were found in the
source document. Searching such an index is naturally sensitive to case
and diacritics, and can be made insensitive by further processing.
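
The difference between the two index types can be illustrated with a minimal Python sketch. The function names `strip_term` and `index_terms` are hypothetical, and the stripping shown here (Unicode decomposition, removal of combining marks, lowercasing) is only an approximation: the code Recoll actually uses may handle more cases.

```python
import unicodedata


def strip_term(term: str) -> str:
    """Lowercase a term and remove its combining diacritical marks."""
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks (Unicode category Mn), keep everything else.
    unaccented = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return unaccented.lower()


def index_terms(words, strip_chars: bool):
    """Return the terms stored for `words` in a dumb (stripped) or raw index."""
    return [strip_term(w) if strip_chars else w for w in words]


words = ["Résumé", "resume", "RESUME"]
dumb = index_terms(words, strip_chars=True)   # all three collapse to one term
raw = index_terms(words, strip_chars=False)   # three distinct terms are kept
```

With a dumb index the three spellings become a single term, so a stripped query term matches all of them. With a raw index they stay distinct, and insensitivity has to be recovered by expanding the query term at search time.
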
The following explains how users can control these Recoll features.
=== Controlling the type of index we create: stripped or raw
The kind of index that recoll creates is determined by:
* A build-time *configure* switch: _--enable-stripchars_. If this is
set, the code for case and diacritics sensitivity is not compiled in and
recoll will work like the previous versions: unaccented and casefolded
index, no runtime options for case or diacritics sensitivity.
* An indexing configuration switch (in recoll.conf): if Recoll was built
with _--disable-stripchars_, this will provide a dynamic way to return
to the "traditional" index. The case and diacritics code will be present
but inactive. Normally, a recoll installation with this switch set
should behave exactly like one built with _--enable-stripchars_. When
using multiple indexes, this switch MUST be consistent between
indexes. There is no support whatsoever for mixing raw and dumb indexes.
The option is named _indexStripChars_, and it is not settable from the
GUI to avoid errors. This is something that would typically be set once
and for all for a given installation. We still need to decide what the
default value will be for 1.18.
* A number of query time switches. Using these it is also possible to
perform a search insensitive to case and diacritics on a raw index. Note,
however, that given the complexity of the issues involved, I give no
guarantee at this time that this will yield exactly the same results as
searching a dumb index. Details about query time behaviour follow.
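
As an illustration, the indexing configuration switch might look like this in recoll.conf. The option name comes from the text above. The `name = value` syntax and the 0/1 values are assumptions and should be checked against the documentation for the installed version.

```
# recoll.conf (sketch)
# Assumed values: 1 = stripped ("dumb") index, 0 = raw index.
# The same value must be used for every index that is queried together.
indexStripChars = 1
```

Changing the value would presumably require rebuilding the index, since the stored terms differ between the two index types.
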
=== Controlling stem, case and diacritics expansion: user query interface
Recoll versions up to 1.17 were insensitive to case and diacritics. We only
needed to give the user a way to control stem expansion. This was done in
three ways:
* Globally, by setting a menu option.
* Globally, by setting the stemming language value to empty.
* On a term by term basis by Capitalizing the term, or, in query language
mode only, by using an 'l' clause modifier (_"term"l_).
After switching to an unstripped index, capable of case and diacritic
sensitivity, we need ways to control what processing is performed among:
* Case expansion.
* Diacritics expansion.
* Stem expansion.
The default mode will be compatible with the previous version, because
this is most generally what we want to do: ignore case and diacritics,
expand stems.
There are two easy approaches for controlling the parameters:
* Global options set in the GUI menus or as *recollq* command line
switches.
* Per-clause options set by modifiers in the query language.
We would like, however, to let the user's input automatically override the
defaults in a sensible way. For example:
* If a term is entered with diacritics, diacritic sensitivity is turned on
(for this term only).
* If a term is entered with upper-case characters, case sensitivity is
turned on. In this case, we turn off stem expansion, because it really
makes no sense with case sensitivity.
With this method we are stuck with three problems (only if the global mode
is set to insensitive, and we're not using the query language):
* Turning off stemming without turning on case sensitivity.
* Searching for an all lower-case term in case-sensitive mode.
* Searching for a term without diacritics in diacritic-sensitive mode.
The two latter issues are relatively marginal and can be worked around easily
by switching to query language mode or using negative clauses in the
advanced search.
However, we need to be able to turn stemming off while remaining
insensitive to case, and we need to stay reasonably compatible with the
previous versions. This means that a term which has a capital first letter
but is otherwise lowercase will turn stemming off, but not case sensitivity
on.
So we're left with how to search for such a term in a case-sensitive way,
and for this, you'll have to use global options or the query language.
The modified method is:
* If a term is entered with diacritics, diacritic sensitivity is turned on
(for this term only).
* If the first letter in a term is upper-case and the rest is lower-case,
we turn stem expansion off, but we do not become case-sensitive.
* If any letter in a term except the first is upper-case, case sensitivity
is turned on. Stem expansion is also turned off (even if the first
letter is lower-case), because it really makes no sense with case
sensitivity.
* To search for an all lower-case or capitalized term in a case-sensitive
way, use the query language: _"Capitalized"C_, _"lowercase"C_.
* Use the query language and the "D" modifier to turn on diacritics
sensitivity.
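
The automatic rules in the list above can be sketched as a small Python function. This is a sketch under stated assumptions, not Recoll's implementation: the names `TermOptions`, `has_diacritics` and `infer_options` are hypothetical, the global defaults are assumed to be insensitive with stemming on, and diacritic sensitivity is assumed to disable stem expansion as well, in line with the remark at the end of this section.

```python
import unicodedata
from dataclasses import dataclass


@dataclass(frozen=True)
class TermOptions:
    case_sensitive: bool
    diacritic_sensitive: bool
    stem_expansion: bool


def has_diacritics(term: str) -> bool:
    """True if any character decomposes into a base letter plus a combining mark."""
    return any(
        unicodedata.category(c) == "Mn"
        for c in unicodedata.normalize("NFD", term)
    )


def infer_options(term: str) -> TermOptions:
    """Derive per-term options from the way the user typed the term."""
    # A term typed with diacritics is searched in a diacritic-sensitive way.
    diacritic_sensitive = has_diacritics(term)
    # Any upper-case letter after the first one turns case sensitivity on.
    case_sensitive = any(c.isupper() for c in term[1:])
    # A plain capitalized term only turns stemming off (compatibility rule).
    capitalized = term[:1].isupper() and not case_sensitive
    # Assumption: stemming is kept only when nothing above was triggered.
    stem_expansion = not (case_sensitive or capitalized or diacritic_sensitive)
    return TermOptions(case_sensitive, diacritic_sensitive, stem_expansion)
```

For example, `infer_options("floor")` keeps all defaults, `infer_options("Floor")` only turns stemming off, and `infer_options("fLoor")` turns case sensitivity on and stemming off. The cases this function cannot express, such as a case-sensitive search for an all lower-case term, are the ones left to the query language modifiers.
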
It can be noted that some combinations of choices do not make sense and
they are not allowed by Recoll: for example, diacritics or case sensitivity
do not make sense with stem expansion (which cannot preserve diacritics in
any meaningful general way).
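
The constraint can be made concrete with a short Python sketch that maps the per-clause modifiers named in this document (_l_, _C_, _D_) to expansion settings. The function name `clause_options` is hypothetical, and resolving a conflict by silently disabling stem expansion is an assumption: Recoll may handle such combinations differently, for instance by rejecting them.

```python
def clause_options(modifiers: str) -> dict:
    """Translate query-language clause modifiers into expansion settings."""
    opts = {
        "case_sensitive": "C" in modifiers,       # "term"C
        "diacritic_sensitive": "D" in modifiers,  # "term"D
        "stem_expansion": "l" not in modifiers,   # "term"l turns stemming off
    }
    # Stem expansion cannot be combined with either kind of sensitivity,
    # so it is dropped here whenever C or D is present (assumed behaviour).
    if opts["case_sensitive"] or opts["diacritic_sensitive"]:
        opts["stem_expansion"] = False
    return opts
```

With this sketch, `clause_options("")` leaves stem expansion on, while `clause_options("C")` and `clause_options("D")` both turn it off.
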
The link:ZDevCaseAndDiacritics3.html[next page] describes the actual
implementation in Recoll 1.18.