--- a
+++ b/src/doc/notes/minus-hyphen-dash.txt
@@ -0,0 +1,58 @@
+= 2014-04-30: Notes about the hyphen-minus character '-'
+
+The ASCII hyphen-minus used to be treated as glue. This stopped around
+version 1.18, and the behaviour was reinstated in 1.20.
+
+Treating - as glue avoids generating phrase searches, which perform badly.
+
+== Dashes
+
+Real-world texts use a variety of Unicode characters (hyphen, en dash, em
+dash, etc.) more or less interchangeably as dash/minus/hyphen, regardless
+of their intended purpose.
+
+The splitter correctly treats Unicode dashes as word-breaking, but this
+means that there will sometimes be a discrepancy between the character in
+the search (usually an ASCII hyphen-minus) and the character in the text
+(which could be anything because of misuse).
+
+Texts do sometimes (incorrectly) use a dash instead of a hyphen to join a
+compound word, in which case no span is constructed at indexing time. If
+the query then contains a hyphen-minus, it generates a span search, and
+the match is missed.
+
+One possible solution, converting all dash signs into minus signs at
+indexing time, has been dismissed because it would introduce problems with
+*correct* uses of dashes (which should be treated as spaces). This would
+not be a major issue, though, because a matching search would probably use
+white space in this case, and single terms are also generated for the
+span.
+
+There are auxiliary arguments:
+
+ - Treating all dash/hyphen/minus as whitespace (except at eol) makes for a
+   smaller index.
+ - This is especially significant for raw indexes because of
+   multiplicative effects ("jean francois" "Jean francois" "jean Francois"
+   ...)
+
+== Hyphens
+
+Hyphens have several distinct uses, which call for different treatments:
+
+ - Use with prefixes and suffixes: co-worker should probably be transformed
+   into or supplemented by coworker
+ - Use in compound words: American-football in "American-football player"
+   should certainly not be collapsed.
+
+If a hyphen-minus is present in the text in the first case, as will be
+common in practice, there is no way we can get it right anyway, except by
+using a language dictionary.
+
+So, given that the right treatment is ambiguous even for a real hyphen, we
+don't try: we just replace a Unicode hyphen (U+2010) with an ASCII
+hyphen-minus while indexing. This has the best chance of matching what a
+user would type.
+
+The current (1.20) Recoll is unable to match coworker and co-worker. The
+best treatment for this would probably be synonym expansion at search time.
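
The two choices described in the notes (rewriting U+2010 as an ASCII hyphen-minus at indexing time, and synonym expansion at search time) can be sketched roughly as below. This is a minimal Python illustration under stated assumptions, not Recoll's actual code: `normalize_hyphen`, `expand_query_term`, and the `SYNONYMS` table are hypothetical names introduced here, and the table contents are made up for the example.

```python
# Only U+2010 HYPHEN is rewritten; other Unicode dashes are left alone,
# since the notes say they should keep being treated as word breaks.
UNICODE_HYPHEN = "\u2010"
ASCII_HYPHEN_MINUS = "-"

# Hypothetical synonym table (assumption): a real one would be loaded from
# some user- or language-specific source rather than hard-coded.
SYNONYMS = {
    "co-worker": ["coworker"],
    "coworker": ["co-worker"],
}


def normalize_hyphen(text):
    """Replace every U+2010 with an ASCII hyphen-minus."""
    return text.replace(UNICODE_HYPHEN, ASCII_HYPHEN_MINUS)


def expand_query_term(term):
    """Return the normalized, lowercased term followed by its listed synonyms."""
    key = normalize_hyphen(term).lower()
    return [key] + SYNONYMS.get(key, [])


if __name__ == "__main__":
    print(normalize_hyphen("co\u2010worker"))
    print(expand_query_term("coworker"))
```

Applying the same normalization to both indexed text and query terms keeps the two sides consistent, and the lookup table stands in for whatever synonym source would actually be used to make coworker and co-worker match each other.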