Diff of /website/features.html [72fd14] .. [d2b4b9]

--- a/website/features.html
+++ b/website/features.html
@@ -59,7 +59,7 @@
                 <li><var class="literal">html</var>.</li>
 
                 <li><span class="application">OpenOffice</span>
-                files.</li>
+                files (needs <b>unzip</b> command).</li>
 
                 <li><var class="literal">maildir</var> and <var
 		    class="literal">mailbox</var> (<span class=
@@ -122,8 +122,8 @@
 	    <li>Support for multiple charsets. Internal processing and
 	      storage uses Unicode UTF-8.</li>
 
-	    <li>Stemming performed at query time (can switch stemming
-	      language after indexing).</li>
+	    <li><a href="#Stemming">Stemming</a> performed at query
+	      time (can switch stemming language after indexing).</li>
 
 	    <li>Easy installation. No database daemon, web server or
 	      exotic language necessary.</li>
@@ -134,7 +134,47 @@
 	</dd>
       </ul>
 
+      <h2><a name="Stemming"></a>Stemming</h2>
 
+      <p>Stemming is a process which transforms inflected words into
+      their most basic form. For example, <i>flooring</i>,
+      <i>floors</i>, <i>floored</i> would probably all be transformed
+      to <i>floor</i> by a stemmer for the English language.</p>
+
+      <p>In many search engines, the stemming process occurs during
+      indexing. The index will only contain the stemmed form of words,
+      with exceptions for terms which are detected as being probably
+      proper nouns (i.e. capitalized). At query time, the terms entered
+      by the user are stemmed, then matched against the index.</p>
+
+      <p>This process results in a smaller index, but it has the
+	serious drawback of irrevocably losing information during
+	indexing.</p>
+
+      <p>Recoll works in a different way. No stemming is performed at
+	indexing time, so that all information gets into the index. The
+	resulting index is bigger, but most people probably don't care
+	much about this nowadays, because they have a 100 GB disk 95%
+	full of binary data <em>which does not get indexed</em>.</p>
+      <p>At the end of an indexing pass, Recoll builds one or several
+	stemming dictionaries, where all word stems are listed in
+	correspondence to the list of their derivatives.</p>
+
+      <p>At query time, by default, user-entered terms are stemmed,
+	then matched against the stem database, and the query is
+	expanded to include all derivatives. This will yield search
+	results analogous to those obtained by a classical engine.
+	The benefit of this approach is that stem expansion can be
+	controlled instantly at query time in several ways:</p>
+	<ul>
+	<li>It can be selectively turned off for any query term by
+	  capitalizing it (<i>Floor</i>).</li>
+	<li>The stemming language (e.g. English, French...) can be
+	  selected (this assumes that several stemming databases have
+	  been built, which can be configured as part of the indexing,
+	  or done later, in a reasonably fast way).</li>
+      </ul>
+	
     </div>
   </body>
 </html>
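
The query-time stem expansion described in the added Stemming section can be sketched roughly as follows. This is a minimal Python illustration, not Recoll's actual code: `toy_stem`, `build_stem_db` and `expand_query_term` are hypothetical names, the suffix-stripping stemmer is a deliberately naive stand-in for a real English stemmer, and the sketch assumes that index terms are stored lowercased.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical, naive English stemmer (stand-in for a real one)."""
    for suffix in ("ing", "ed", "s"):
        # Only strip when a reasonably long stem remains.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def build_stem_db(index_terms, stem=toy_stem):
    """Built after indexing: map each stem to the unstemmed index terms
    that reduce to it. The index itself keeps the full word forms."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[stem(term)].add(term)
    return stem_db


def expand_query_term(term, stem_db, stem=toy_stem):
    """Return the index terms to search for one user-entered term."""
    if term[:1].isupper():
        # Capitalized term: stem expansion is switched off for it.
        # Assumption: index terms are lowercased, so match on that form.
        return [term.lower()]
    # Default: stem the term, then expand to every known derivative.
    return sorted(stem_db.get(stem(term), {term}))


# The index holds unstemmed terms; the stem database is derived from it.
index_terms = ["floor", "floors", "floored", "flooring", "door"]
stem_db = build_stem_db(index_terms)

expanded = expand_query_term("floors", stem_db)
exact = expand_query_term("Floor", stem_db)
```

Here `expanded` contains all four derivatives of *floor*, while `exact` contains only `floor`. Because the stem database is derived from an index that still holds the full word forms, supporting another stemming language would only mean building one more such mapping, without re-indexing the documents.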