<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>
<body>
<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="usermanual/index.html">User manual</a></li>
<li><a href="index.html#support">Support</a></li>
<li><a href="devel.html">Development</a></li>
</ul>
</div>
<div class="content">
<h1 class="intro">Recoll features</h1>
<dl>
<dt><a name="systems">Supported systems</a></dt>
<dd><span class="application">Recoll</span> has been compiled and
tested on FreeBSD, Linux, Darwin and Solaris. Tested versions
include FreeBSD 5-7, Redhat 7/8/9, Fedora Core 5-13, Suse 10/11,
Gentoo, Debian 3.1 and Solaris 8/9/10. Other releases close to
these should work too.</dd>
<dd>Qt versions from 3.1 to 4.5</dd>
<dt><a name="doctypes">Document types</a></dt>
<dd>Recoll can index many document types (along with their
compressed versions). Some types are handled internally (no
external application needed). Other types need some application to
be installed to extract the text. Types that only need
very common utilities (awk/sed/groff etc.) are listed in the
native section.
<dl>
<dt>Natively</dt>
<dd>
<ul>
<li><span class="literal">text</span>.</li>
<li><span class="literal">html</span>.</li>
<li><span class="literal">maildir</span> and <span
class="literal">mailbox</span> (<span class=
"literal">Mozilla</span>, <span class=
"literal">Thunderbird</span> and <span class=
"literal">Evolution</span> mail ok).</li>
<li><span class="literal">OpenOffice</span>
files (needs <span class="command">unzip</span> command).</li>
<li><span class="literal">Abiword</span> files.</li>
<li><span class="literal">Kword</span> files.</li>
<li><span class="literal">gaim</span> and <span
class="literal">purple</span> log files.</li>
<li><span class="literal">Lyx</span> files (needs
<span class="literal">Lyx</span> to be installed).</li>
<li><span class="literal">Scribus</span> files.</li>
<li><span class="literal">Man pages</span> (need <span
class="command">groff</span>).</li>
</ul>
</dd>
<dt>With external helpers</dt>
<dd>
<p>In addition to the applications listed below, many
document types need the <span
class="command">iconv</span> command.</p>
<ul>
<li><span class="literal">Microsoft Office Open XML</span>
files with the <span class="command">unzip</span>
and <span class="command">xsltproc</span> commands.</li>
<li><span class="literal">pdf</span> with the <span
class="command">pdftotext</span> command, which can be
installed as part of <a href=
"http://www.foolabs.com/xpdf/">xpdf</a> or <a
href="http://poppler.freedesktop.org/">poppler</a>,
depending on your distribution.</li>
<li><span class="literal">msword</span> with <a href=
"http://www.winfield.demon.nl/">antiword</a>.</li>
<li><span class="literal">Powerpoint</span> and
<span class="literal">Excel</span> with the
<a href="http://catdoc.klik.atekon.de">
catdoc</a> utilities.</li>
<li><span class="literal">CHM (Microsoft help)</span>
files (needs <span class="command">Python, pychm or
chmlib</span>).</li>
<li><span class="literal">Zip</span>
archives (needs <span class="command">Python</span>).</li>
<li><span class="literal">iCalendar</span> (.ics) files
(needs <span class="command">Python,
<a href="http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>
<li><span class="literal">Mozilla calendar data</span>.
See <a href="http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
the wiki</a> about this.</li>
<li><span class="literal">Wordperfect</span> with <a href=
"http://libwpd.sourceforge.net">libwpd</a>.</li>
<li><span class="literal">postscript</span> with <a
href=
"http://www.gnu.org/software/ghostscript/ghostscript.html">
ghostscript</a> and <a href=
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.</li>
<li><span class="literal">rtf</span> with <a href=
"http://www.gnu.org/software/unrtf/unrtf.html">unrtf</a>.</li>
<li><span class="literal">TeX</span> with
<span class="command">untex</span>. If there is no untex
package for your distribution,
<a href="untex/untex-1.3.jf.tar.gz">a source package is
stored on this site</a> (as untex has no obvious
home).
<a
href="http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
will also work if it is installed.
</li>
<li><span class="literal">dvi</span> with
<a href="http://www.radicaleye.com/dvips.html">dvips</a>.
</li>
<li><span class="literal">djvu</span> with
<a href="http://djvu.sourceforge.net">DjVuLibre</a>.
</li>
<li><span class="literal">mp3/flac/ogg vorbis</span>
tags support with
<a href="http://id3lib.sourceforge.net/">id3info (id3lib)
</a> (compiling id3lib on recent systems may need
a small patch, see <a href="id3lib.html">here.</a>) or
the ogg and flac tools. Release 1.14 and later use a
python filter based on
<a href="http://code.google.com/p/mutagen/">mutagen</a>
for all audio tags.
</li>
<li>Image file tags support with
<a href="http://www.sno.phy.queensu.ca/~phil/exiftool/">
exiftool</a>. This is a perl program, so you also
need perl on the system. It works with nearly any
image file and tag format (jpg, png, tiff,
gif etc.).
</li>
</ul>
</dd>
</dl>
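<p>As a rough aid, and not something taken from Recoll itself, the
following Python sketch checks whether a few of the helper commands
named above can be found on the PATH. The
<span class="literal">HELPERS</span> table and the
<span class="literal">missing_helpers</span> function are hypothetical
names chosen for this illustration, and the table covers only part of
the list.</p>

```python
import shutil

# Helper commands named in the list above, keyed by the document type
# they serve. The selection is illustrative, not the complete list.
HELPERS = {
    "pdf": ["pdftotext"],
    "msword": ["antiword"],
    "Microsoft Office Open XML": ["unzip", "xsltproc"],
    "rtf": ["unrtf"],
    "dvi": ["dvips"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return a dict of document type to the commands not found on PATH."""
    missing = {}
    for doctype, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[doctype] = absent
    return missing


if __name__ == "__main__":
    for doctype, commands in missing_helpers().items():
        print(doctype + ": missing " + ", ".join(commands))
```

<p>The <span class="literal">which</span> parameter exists only so
that the lookup can be replaced when trying the function out. By
default the standard <span class="literal">shutil.which</span> is
used.</p>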
</dd>
<dt>Other features</dt>
<dd>
<ul>
<li>Can use <b>Beagle</b> browser plug-ins to index web
history. See
<a href="http://bitbucket.org/medoc/recoll/wiki/IndexBeagleWeb">
the Wiki</a> for more detail.</li>
<li>Processes all email attachments.</li>
<li>Multiple selectable databases.</li>
<li>Powerful query facilities, with boolean searches,
phrases, and filtering on file types and directory tree.</li>
<li>Xesam-compatible query language.</li>
<li>Wildcard searches (with a specific and faster function for
file names).</li>
<li>Support for multiple charsets. Internal processing and
storage uses Unicode UTF-8.</li>
<li><a href="#Stemming">Stemming</a> performed at query
time (can switch stemming language after indexing).</li>
<li>Easy installation. No database daemon, web server or
exotic language necessary.</li>
<li>An indexer which runs either as a thread inside the GUI,
as an external, batch, cron'able program, or as a
real-time indexing daemon.</li>
</ul>
</dd>
</dl>
<h2><a name="Stemming"></a>Stemming</h2>
<p>Stemming is a process which transforms inflected words into
their most basic form. For example, <i>flooring</i>,
<i>floors</i>, <i>floored</i> would probably all be transformed
to <i>floor</i> by a stemmer for the English language.</p>
<p>In many search engines, the stemming process occurs during
indexing. The index will only contain the stemmed form of words,
with exceptions for terms which are detected as probable
proper nouns (i.e. capitalized). At query time, the terms entered
by the user are stemmed, then matched against the index.</p>
<p>This process results in a smaller index, but it has the
serious drawback of irrevocably losing information during
indexing.</p>
<p>Recoll works in a different way. No stemming is performed at
indexing time, so that all information gets into the index. The
resulting index is bigger, but most people probably don't care
much about this nowadays, because they have a 100 GB disk 95%
full of binary data <em>which does not get indexed</em>.</p>
<p>At the end of an indexing pass, Recoll builds one or several
stemming dictionaries, in which each word stem is listed
together with its derivatives.</p>
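<p>A minimal Python sketch of such a dictionary follows. It is not
Recoll's actual code: <span class="literal">toy_stem</span> is a crude
suffix stripper standing in for a real English stemmer, and
<span class="literal">build_stem_dictionary</span> is a hypothetical
name used only for this illustration.</p>

```python
from collections import defaultdict


def toy_stem(word):
    """Crude English suffix stripper, a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three letters so that short words stay intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


def build_stem_dictionary(index_terms, stem=toy_stem):
    """Map each stem to the sorted list of index terms that reduce to it."""
    derivatives = defaultdict(set)
    for term in index_terms:
        derivatives[stem(term)].add(term)
    return {s: sorted(terms) for s, terms in derivatives.items()}


# The index itself keeps the unstemmed words; the dictionary is built
# from them afterwards.
terms = ["floor", "floors", "floored", "flooring", "door", "doors"]
stem_db = build_stem_dictionary(terms)
print(stem_db["floor"])  # ['floor', 'floored', 'flooring', 'floors']
```

<p>Because the dictionary is derived from the full list of index
terms, it can be rebuilt, or built for another language, without
reindexing the documents.</p>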
<p>At query time, by default, user-entered terms are stemmed,
then matched against the stem database, and the query is
expanded to include all derivatives. This will yield search
results analogous to those obtained by a classical engine.
The benefit of this approach is that stem expansion can be
controlled instantly at query time in several ways:</p>
<ul>
<li>It can be selectively turned off for any query term by
capitalizing it (<i>Floor</i>).</li>
<li>The stemming language (e.g. English, French...) can be
selected. This supposes that several stemming databases have
been built, which can be configured as part of the indexing,
or done later, in a reasonably fast way.</li>
</ul>
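<p>The query-time behaviour described above could be sketched in
Python as follows. This is an illustration under stated assumptions,
not Recoll's implementation: <span class="literal">STEM_DBS</span>,
<span class="literal">expand_term</span> and
<span class="literal">toy_stem</span> are hypothetical names, the stem
database content is made up, and index terms are assumed to be stored
in lower case.</p>

```python
def toy_stem(word):
    """Crude English suffix stripper, a stand-in for a real stemmer."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)]
    return word


# Hypothetical stem databases, one per stemming language, as they might
# look at the end of an indexing pass. Only English is filled in here.
STEM_DBS = {
    "english": {"floor": ["floor", "floored", "flooring", "floors"]},
}


def expand_term(term, language="english"):
    """Return the index terms that a single query term should match."""
    if term[:1].isupper():
        # Capitalized term: no stem expansion, match the word as typed
        # (assuming lower-cased index terms).
        return [term.lower()]
    # Otherwise stem the term and look up its derivatives in the stem
    # database of the selected language. Unknown stems match themselves.
    stem_db = STEM_DBS.get(language, {})
    return stem_db.get(toy_stem(term), [term.lower()])


print(expand_term("floors"))  # ['floor', 'floored', 'flooring', 'floors']
print(expand_term("Floor"))   # ['floor']
```

<p>Selecting a language for which no stem database has been built
simply leaves the term unexpanded in this sketch. A real
implementation would also use a stemmer that matches the selected
language.</p>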
</div>
</body>
</html>