= 2014-04-30: Notes about the hyphen-minus character '-':
The ASCII hyphen-minus used to be treated as glue. This stopped around
version 1.18, then was reinstated in 1.20.
Having - as glue avoids generating phrase searches with bad performance.
== Dashes
A variety of Unicode characters (hyphen, en dash, em dash, etc.) are used
mostly interchangeably, and independently of their correct intent, as
dash/minus/hyphen in real-world texts.
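As an illustration only, the following Python sketch prints a few of the
characters in question using the standard unicodedata module. The selection
of code points is ours, not a list taken from the splitter.

```python
import unicodedata

# A few look-alike characters (illustrative selection, not exhaustive).
CANDIDATES = ["\u002d", "\u2010", "\u2013", "\u2014", "\u2212"]

def describe(ch):
    """Return (code point, Unicode name, general category) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))

for ch in CANDIDATES:
    print(describe(ch))
```

Note that the minus sign (U+2212) is a math symbol (category Sm), while the
others are dash punctuation (category Pd).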
The Unicode dashes are properly treated as word-breaking by the splitter,
but this means that there will sometimes be a discrepancy between the
character in the search (usually an ASCII hyphen-minus) and the character
in the text (which could be anything because of misuse).
It does happen (incorrectly) that a dash is used in a text instead of a
hyphen to join a compound word. No span is then constructed at indexing
time, while the hyphen-minus in the query generates a span search, so the
match is missed.
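The mismatch can be sketched with a toy splitter. This is a hypothetical
simplification written for these notes, not the actual splitter code:
split_terms, GLUE and BREAKERS are made-up names, and the set of breaking
dashes is only a sample.

```python
GLUE = "-"  # ASCII hyphen-minus: joins terms into a span
BREAKERS = "\u2012\u2013\u2014\u2015"  # sample Unicode dashes: break words

def split_terms(text):
    """Return (spans, terms) for a piece of text.

    spans: tuples of terms that were joined by the glue character.
    terms: every single term, including those inside spans.
    """
    # Unicode dashes are word-breaking, exactly like white space.
    for ch in BREAKERS:
        text = text.replace(ch, " ")
    spans, terms = [], []
    for chunk in text.lower().split():
        parts = [p for p in chunk.split(GLUE) if p]
        terms.extend(parts)
        if len(parts) > 1:
            spans.append(tuple(parts))
    return spans, terms

# Query typed with a hyphen-minus: a span is generated.
query_spans, _ = split_terms("well-known")
# Text written (incorrectly) with an en dash: no span is built.
text_spans, text_terms = split_terms("well\u2013known")
```

The query asks for the span ("well", "known"), but the indexed text only
yielded the two single terms, so a strict span search misses it.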
One possible solution, changing all dash signs into minus signs at indexing
time, has been dismissed because it would introduce problems with *correct*
uses of dashes (which should be treated as space). This would not be a
major issue though, because a matching search would probably use white
space in this case, and single terms are also generated for the span.
There are auxiliary arguments:
- Treating all dash/hyphen/minus as whitespace (except at end of line)
makes for a smaller index.
- This is especially significant for raw indexes because of
multiplicative effects ("jean francois" "Jean francois" "jean Francois"
...)
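One reading of the multiplicative effect: in a raw index, each term of a
span may occur in several case (or accent) variants, and the number of
distinct spans is the product of the per-term counts. A small sketch, where
span_variants is a hypothetical helper and the variant lists are made up:

```python
from itertools import product

def span_variants(per_term_variants):
    """Return every span obtainable by picking one variant per term."""
    return [" ".join(combo) for combo in product(*per_term_variants)]

variants = span_variants([["jean", "Jean"], ["francois", "Francois"]])
# 2 variants for each of the 2 terms gives 2 * 2 = 4 distinct spans,
# against only 2 + 2 = 4 single terms; the gap widens with longer spans.
print(len(variants), variants)
```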
== Hyphens
Hyphens have several distinct uses which should yield different treatment:
- Use with prefixes and suffixes: co-worker should probably be transformed
into or supplemented by coworker
- Use in compound words: American-football in "American-football player"
should certainly not be collapsed.
If a hyphen-minus is present in the text in the first case, as will be
common in practice, there is no way we can get it right anyway, except by
using a language dictionary.
So, given that even a real hyphen cannot be treated unambiguously, we
don't try, and we just replace a Unicode hyphen (U+2010) with an ASCII
hyphen-minus while indexing. This has the best chance of matching what a
user would type.
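The replacement itself is a one-character mapping. A minimal Python sketch
of the idea (normalize_hyphens is a hypothetical name, and this is not the
indexer's actual code):

```python
# Map the Unicode hyphen (U+2010) to the ASCII hyphen-minus (U+002D).
HYPHEN_TO_ASCII = str.maketrans({"\u2010": "-"})

def normalize_hyphens(text):
    """Replace U+2010 with '-' and leave every other character alone."""
    return text.translate(HYPHEN_TO_ASCII)

print(normalize_hyphens("co\u2010worker"))
```

Other dashes, such as the en dash, are deliberately left untouched here, so
that they can still be treated as word-breaking.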
The current (1.20) recoll is unable to match coworker and co-worker. The
best treatment for this would probably be synonym expansion at search time.
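A rough sketch of what such an expansion could look like, assuming a
hand-written synonym table. Nothing here describes an existing recoll
feature: expand_term and SYNONYMS are hypothetical names.

```python
# Hypothetical synonym table; a real one would come from a dictionary.
SYNONYMS = {"coworker": ["co-worker"]}

def expand_term(term):
    """Return the term followed by the alternatives to search for."""
    alternatives = [term]
    if "-" in term:
        # Hyphenated input: also try the collapsed form.
        alternatives.append(term.replace("-", ""))
    # Collapsed input cannot be split without a table lookup.
    alternatives.extend(SYNONYMS.get(term, []))
    # Remove duplicates while keeping the original order.
    return list(dict.fromkeys(alternatives))

print(expand_term("co-worker"))
print(expand_term("coworker"))
```

Collapsing the hyphen blindly would also rewrite compounds such as
american-football, which is one reason a dictionary-backed table seems
preferable to a purely mechanical rule.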