= 2014-04-30: Notes about the hyphen-minus character '-':
The ASCII hyphen-minus used to be a glue character. It stopped being one
around version 1.18, then was reinstated as glue in 1.20.
Having '-' as glue avoids generating phrase searches, which have bad
performance.
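As an illustration only, here is a toy model of the glue behaviour. This is
not the Recoll splitter: `split_terms` and its `glue` parameter are names
made up for this sketch, which assumes that a glue character joins adjacent
words into a span, and that the span is emitted in addition to its component
words.

```python
import re

def split_terms(text, glue=""):
    """Return the terms of `text`; chars in `glue` join words into spans."""
    # A chunk is a maximal run of word characters and glue characters.
    chunk_re = re.compile(r"[\w" + re.escape(glue) + r"]+")
    terms = []
    for chunk in chunk_re.findall(text.lower()):
        words = re.findall(r"\w+", chunk)
        if len(words) > 1:
            # Several words held together by glue: emit the whole span.
            terms.append(chunk.strip(glue))
        # The component words are always emitted as single terms.
        terms.extend(words)
    return terms

print(split_terms("Jean-Francois Dupont", glue="-"))
# ['jean-francois', 'jean', 'francois', 'dupont']
print(split_terms("Jean-Francois Dupont"))
# ['jean', 'francois', 'dupont']
```

With '-' as glue, jean-francois exists as a single span term that a query
can look up directly. Without it, only the separate words exist, and the
same query has to be run as a phrase search.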
== Dashes
Many different Unicode characters (hyphen, en dash, em dash, etc.) are used
more or less interchangeably as dash/minus/hyphen in real-world texts,
independently of their intended meaning.
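To make the variety concrete, the following sketch lists a few of these
characters using the Python standard `unicodedata` module. The selection in
`CANDIDATES` and the helper name `describe` are choices made for this
example, not an exhaustive list and not anything taken from Recoll.

```python
import unicodedata

# A few characters that all look like a short horizontal stroke.
CANDIDATES = ["-", "\u2010", "\u2011", "\u2012", "\u2013", "\u2014", "\u2212"]

def describe(ch):
    """Return (code point, Unicode name, general category) for `ch`."""
    return ("U+%04X" % ord(ch), unicodedata.name(ch), unicodedata.category(ch))

for ch in CANDIDATES:
    print(*describe(ch))
```

All of these except MINUS SIGN (category Sm) belong to the Unicode general
category Pd (dash punctuation), which includes the ASCII hyphen-minus
itself.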
The Unicode dashes are properly treated as word-breaking by the splitter,
but this means that there will sometimes be a discrepancy between the
character in the search (usually an ASCII hyphen-minus) and the character
in the text (which could be anything because of misuse).
It does happen that a dash is (incorrectly) used in a text instead of a
hyphen to join a compound word, so that no span is constructed at indexing
time. A hyphen-minus in the query then generates a span search, and the
match is missed.
One possible solution, changing all dash signs into minus signs at indexing
time, was dismissed because it would introduce problems with *correct* uses
of dashes (which should be treated as space). This would not be a major
issue though, because a matching search would probably use white space in
this case, and single terms are also generated for the span.
There are auxiliary arguments:
- Treating all dash/hyphen/minus as whitespace (except at eol) makes for a
smaller index.
- This is especially significant for raw indexes because of multiplicative
effects ("jean francois" "Jean francois" "jean Francois" ...)
== Hyphens
Hyphens have several distinct uses which call for different treatments:
- Use with prefixes and suffixes: co-worker should probably be transformed
into or supplemented by coworker
- Use in compound words: American-football in "American-football player"
should certainly not be collapsed.
If a hyphen-minus is present in the text in the first case, as will be
common in practice, there is no way we can get it right anyway, except by
using a language dictionary.
So, given that even a real hyphen is ambiguous, we don't try to interpret
it: we just replace a Unicode hyphen (U+2010) with an ASCII hyphen-minus
while indexing. This has the best chance of matching what a user would
type.
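The substitution itself is a one-character replacement. A minimal sketch,
where `normalize_hyphen` is a hypothetical name and not a Recoll function:

```python
UNICODE_HYPHEN = "\u2010"   # HYPHEN
ASCII_HYPHEN_MINUS = "-"    # HYPHEN-MINUS, U+002D

def normalize_hyphen(text):
    """Replace the Unicode hyphen with the ASCII hyphen-minus.

    Other dash characters (en dash, em dash, ...) are left untouched.
    """
    return text.replace(UNICODE_HYPHEN, ASCII_HYPHEN_MINUS)

print(normalize_hyphen("co\u2010worker"))  # co-worker
```

Only U+2010 is mapped in this sketch. An en dash or em dash in the input
comes out unchanged, so it can still be handled as a word break.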
The current (1.20) Recoll is unable to match coworker with co-worker. The
best treatment for this would probably be synonym expansion at search time.
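A minimal sketch of what such an expansion could look like. Nothing here is
existing Recoll functionality: the synonym table `SYNONYM_GROUPS` and the
function `expand_term` are assumptions made for the example.

```python
# Hypothetical synonym table: each set groups interchangeable spellings.
SYNONYM_GROUPS = [
    {"co-worker", "coworker"},
]

def expand_term(term, groups=SYNONYM_GROUPS):
    """Return the sorted list of terms to search for in place of `term`."""
    term = term.lower()
    expanded = {term}
    for group in groups:
        if term in group:
            # The query term belongs to this group: search all spellings.
            expanded |= group
    return sorted(expanded)

print(expand_term("coworker"))   # ['co-worker', 'coworker']
print(expand_term("player"))     # ['player']
```

The expanded terms would then be OR-ed in the query, so that either
spelling in the text matches either spelling typed by the user.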