|
a/website/features.html |
|
b/website/features.html |
|
... |
|
... |
57 |
<li><var class="literal">text</var>.</li>
|
57 |
<li><var class="literal">text</var>.</li>
|
58 |
|
58 |
|
59 |
<li><var class="literal">html</var>.</li>
|
59 |
<li><var class="literal">html</var>.</li>
|
60 |
|
60 |
|
61 |
<li><span class="application">OpenOffice</span>
|
61 |
<li><span class="application">OpenOffice</span>
|
62 |
files.</li>
|
62 |
files (needs <b>unzip</b> command).</li>
|
63 |
|
63 |
|
64 |
<li><var class="literal">maildir</var> and <var
|
64 |
<li><var class="literal">maildir</var> and <var
|
65 |
class="literal">mailbox</var> (<span class=
|
65 |
class="literal">mailbox</var> (<span class=
|
66 |
"application">Mozilla</span>, <span class=
|
66 |
"application">Mozilla</span>, <span class=
|
67 |
"application">Thunderbird</span> and <span class=
|
67 |
"application">Thunderbird</span> and <span class=
|
|
... |
|
... |
120 |
<li>Specific file name searches with wildcards.</li>
|
120 |
<li>Specific file name searches with wildcards.</li>
|
121 |
|
121 |
|
122 |
<li>Support for multiple charsets. Internal processing and
|
122 |
<li>Support for multiple charsets. Internal processing and
|
123 |
storage uses Unicode UTF-8.</li>
|
123 |
storage uses Unicode UTF-8.</li>
|
124 |
|
124 |
|
125 |
<li>Stemming performed at query time (can switch stemming
|
125 |
<li><a href="#Stemming">Stemming</a> performed at query
|
126 |
language after indexing).</li>
|
126 |
time (can switch stemming language after indexing).</li>
|
127 |
|
127 |
|
128 |
<li>Easy installation. No database daemon, web server or
|
128 |
<li>Easy installation. No database daemon, web server or
|
129 |
exotic language necessary.</li>
|
129 |
exotic language necessary.</li>
|
130 |
|
130 |
|
131 |
<li>An indexer which runs either as a thread inside the GUI
|
131 |
<li>An indexer which runs either as a thread inside the GUI
|
132 |
or as an external, cron'able program.</li>
|
132 |
or as an external, cron'able program.</li>
|
133 |
</ul>
|
133 |
</ul>
|
134 |
</dd>
|
134 |
</dd>
|
135 |
</ul>
|
135 |
</ul>
|
136 |
|
136 |
|
|
|
137 |
<h2><a name="Stemming"></a>Stemming</h2>
|
137 |
|
138 |
|
|
|
139 |
<p>Stemming is a process which transforms inflected words into
|
|
|
140 |
their most basic form. For example, <i>flooring</i>,
|
|
|
141 |
<i>floors</i>, <i>floored</i> would probably all be transformed
|
|
|
142 |
to <i>floor</i> by a stemmer for the English language.</p>
|
|
|
143 |
|
|
|
144 |
<p>In many search engines, the stemming process occurs during
|
|
|
145 |
indexing. The index will only contain the stemmed form of words,
|
|
|
146 |
with exceptions for terms which are detected as being probably
|
|
|
147 |
proper nouns (i.e. capitalized). At query time, the terms entered
|
|
|
148 |
by the user are stemmed, then matched against the index.</p>
|
|
|
149 |
|
|
|
150 |
<p>This process results in a smaller index, but it has the
|
|
|
151 |
serious drawback of irrevocably losing information during
|
|
|
152 |
indexing.</p>
|
|
|
153 |
|
|
|
154 |
<p>Recoll works in a different way. No stemming is performed at
|
|
|
155 |
indexing time, so that all information gets into the index. The
|
|
|
156 |
resulting index is bigger, but most people probably don't care
|
|
|
157 |
much about this nowadays, because they have a 100 GB disk 95%
|
|
|
158 |
full of binary data <em>which does not get indexed</em>.</p>
|
|
|
159 |
<p>At the end of an indexing pass, Recoll builds one or several
|
|
|
160 |
stemming dictionaries, where all word stems are listed in
|
|
|
161 |
correspondence to the list of their derivatives.</p>
|
|
|
162 |
|
|
|
163 |
<p>At query time, by default, user-entered terms are stemmed,
|
|
|
164 |
then matched against the stem database, and the query is
|
|
|
165 |
expanded to include all derivatives. This will yield search
|
|
|
166 |
results analogous to those obtained by a classical engine.
|
|
|
167 |
The benefit of this approach is that stem expansion can be
|
|
|
168 |
controlled instantly at query time in several ways:</p>
|
|
|
169 |
<ul>
|
|
|
170 |
<li>It can be selectively turned off for any query term by
|
|
|
171 |
capitalizing it (<i>Floor</i>).</li>
|
|
|
172 |
<li>The stemming language (e.g. English, French...) can be
|
|
|
173 |
selected (this assumes that several stemming databases have
|
|
|
174 |
been built, which can be configured as part of the indexing,
|
|
|
175 |
or done later, in a reasonably fast way).</li>
|
|
|
176 |
</ul>
|
|
|
177 |
|
138 |
</div>
|
178 |
</div>
|
139 |
</body>
|
179 |
</body>
|
140 |
</html>
|
180 |
</html>
|
141 |
|
181 |
|
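The query-time stem expansion described in the added Stemming section can be sketched roughly as follows. This is a minimal illustration and not Recoll's actual code: the suffix-stripping `toy_stem` function, the `build_stem_dictionary` and `expand_query_term` names, and the assumption that index terms are stored in lowercase are all hypothetical stand-ins chosen for the example.

```python
from collections import defaultdict

# Toy suffix rules, for illustration only; a real stemmer is far more careful.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one common English suffix (hypothetical stand-in for a real stemmer)."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(index_terms):
    """Map each stem to the set of unstemmed index terms that reduce to it.

    This mirrors the step done at the end of an indexing pass: the index
    itself keeps the full word forms, and the stem dictionary is built
    separately from them.
    """
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[toy_stem(term)].add(term)
    return stem_db


def expand_query_term(term, stem_db):
    """Return the set of index terms to search for a user-entered term.

    A capitalized term is searched as entered (no expansion); any other
    term is stemmed and expanded to every derivative listed under its stem.
    Assumes index terms are stored in lowercase.
    """
    if term[:1].isupper():
        return {term.lower()}
    return stem_db.get(toy_stem(term), set()) | {term.lower()}


if __name__ == "__main__":
    db = build_stem_dictionary(["floor", "floors", "floored", "flooring", "door"])
    print(sorted(expand_query_term("floors", db)))
    print(sorted(expand_query_term("Floor", db)))
```

Because the index keeps the unstemmed terms, switching the stemming language in this sketch only means building another dictionary with a different stemming function; the index itself is left untouched.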