Parent: [056b8e] (diff)

Child: [a5c2a9] (diff)

Download this file

features.html    477 lines (384 with data), 20.4 kB

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>
<body>
<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="doc.html">Documentation</a></li>
<li><a href="support.html">Support</a></li>
<li><a href="devel.html">Development</a></li>
</ul>
</div>
<div class="content">
<h1>Recoll features</h1>
<div class="intrapage">
<table width=100%>
<tbody>
<tr>
<td><a href="#systems">Supported systems</a></td>
<td><a href="#doctypes">Document types</a></td>
<td><a href="#other">Other features</a></td>
<td><a href="#integration">Desktop and web integration</a></td>
<td><a href="#stemming">Stemming</a></td>
</tr>
</tbody>
</table>
</div>
<h2><a name="general">General features</a></h2>
<ul>
<li>Easy installation, few dependancies. No database daemon,
web server, desktop environment or exotic language necessary.</li>
<li>Will run on most Unix-based <a href="features.html#systems">
systems</a>, and on MS-Windows too.</li>
<li>Qt 4 GUI, plus command line, Unity Lens, KIO and krunner
interfaces.</li>
<li>Searches most common
<a href="features.html#doctypes">document types</a>, emails and
their attachments. Transparently handles decompression
(gzip, bzip2).</li>
<li>Powerful query facilities, with boolean searches,
phrases, proximity, wildcards, filter on file types and directory
tree.</li>
<li>Multi-language and multi-character set with Unicode based
internals.</li>
<li>Extensive documentation, with a
complete <a href="usermanual/usermanual.html">user
manual</a> and manual pages for each command.</li>
</ul>
<h2><a name="systems">Supported systems</a></h2>
<p><span class="application">Recoll</span> has been compiled and
tested on Linux, MS-Windows 7-10, MacOS X and Solaris (initial
versions Redhat 7, Fedora Core 5, Suse 10, Gentoo, Debian 3.1,
Solaris 8). It should compile and run on all subsequent releases
of these systems and probably a few others too.</p>
<p>Qt versions from 4.7 and later</p>
<h2><a name="doctypes">Document types</a></h2>
<p><span class="application">Recoll</span> can index many document
types (along with their compressed versions). Some types are
handled internally (no external application needed). Other types
need a separate application to be installed to extract the
text. Types that only need very common utilities
(awk/sed/groff/Python etc.) are listed in the native section.</p>
<p>The MS-Windows installer includes the supporting application,
the only additional package you will need is the Python language
installation.</p>
<h4>File types indexed natively</h4>
<ul>
<li><span class="application">text</span>.</li>
<li><span class="application">html</span>.</li>
<li><span class="application">maildir</span> and
<span class="application">mailbox</span> (
<span class="application">Mozilla</span>,
<span class="application">Thunderbird</span> and
<span class="application">Evolution</span> mail ok).
<em><b>Evolution note</b>: be sure to remove <tt>.cache</tt> from
the <tt>skippedNames</tt> list in the GUI <tt>Indexing
preferences/Local Parameters/</tt> pane if you want to
index local copies of Imap mail.</em>
</li>
<li><span class="application">gaim</span> and
<span class="application">purple</span> log files.</li>
<li><span class="application">Scribus</span> files.</li>
<li><span class="application">Man pages</span> (needs
<span class="application">groff</span>).</li>
<li><span class="application">Dia</span> diagrams.</li>
<li><span class="application">Excel</span>
and <span class="application">Powerpoint</span>
for <span class="application">Recoll</span> versions 1.19.12
and later.</li>
</ul>
<h4>File types indexed with external helpers</h4>
<p>Many document types need the <span class="command">iconv</span>
command in addition to the applications specifically listed.</p>
<h5>The XML ones</h5>
<p>The following types need <span class="command">
xsltproc</span> from the <b>libxslt</b> package.
Quite a few also need <span class="command">unzip</span>:</p>
<ul>
<li><span class="application">Abiword</span> files.</li>
<li><span class="application">Fb2</span> ebooks.</li>
<li><span class="application">Kword</span> files.</li>
<li><span class="application">Microsoft Office Open XML</span>
files.</li>
<li><span class="application">OpenOffice</span> files.</li>
<li><span class="application">SVG</span> files.</li>
<li><span class="application">Gnumeric</span> files.</li>
<li><span class="application">Okular</span> annotations files.</li>
</ul>
<h5>Other formats</h5>
<p>The following need miscellaneous helper programs to decode
the internal formats.</p>
<ul>
<li><span class="application">pdf</span> with the <span class=
"command">pdftotext</span> command, which comes with
<a href="http://poppler.freedesktop.org/">poppler</a>,
(the package name is quite often <tt>poppler-utils</tt>). <br/>
Note: the older <span class="command">pdftotext</span> command
which comes with <span class="application">xpdf</span> is
not compatible with <span class="application">
Recoll</span><br/>
<em>New in 1.21</em>: if the <span class="application">
tesseract</span> OCR application, and the
<span class="command">pdftoppm</span> command are available
on the system, the <span class="command">rclpdf</span>
filter has the capability to run OCR. See the comments at
the top of <span class="command">rclpdf</span> (usually
found
in <span class="filename">/usr/share/recoll/filters</span>)
for how to enable this and configuration details.<br/>
<em>Opening PDFs at the right page</em>: the default
configuration uses <span class="command">evince</span>,
which has options for direct page access and pre-setting the
search strings (hits will be highlighted). There is an
example line in the default mimeview for doing the same
thing with <span class="command">qpdfview</span>
(<span class="literal">qpdfview --search %s %f#%p</span>).
Okular does not have a search string option (but it does
have a page number one).
</li>
<li><span class="application">msword</span> with <a href=
"http://www.winfield.demon.nl/">antiword</a>. It is also useful to
have <a href="http://wvware.sourceforge.net/">wvWare</a> installed
as it may be be used as a fallback for some files which antiword
does not handle.</li>
<li><span class="application">Wordperfect</span> with the
<span class="command">wpd2html</span> command from <a href=
"http://libwpd.sourceforge.net">libwpd</a>. On some distributions,
the command may come with a package named <span
class="literal">libwpd-tools</span> or such, not the base <a
span="literal">libwpd</a> package.</li>
<li><span class="application">Lyx</span> files (needs
<span class="application">Lyx</span> to be installed).</li>
<li><span class="application">Powerpoint</span> and <span
class="application">Excel</span> with the <a href=
"http://vitus.wagner.pp.ru/software/catdoc/">catdoc</a>
utilities up to recoll 1.19.12. Recoll 1.19.12 and later use
internal Python filters for Excel and Powerpoint, and catdoc
is not needed at all (catdoc did not work on many semi-recent
Excel and Powerpoint files).</li>
<li><span class="application">CHM (Microsoft help)</span> files
with <span class="command">Python,
<a href="http://gnochm.sourceforge.net/pychm.html">pychm</a>
and <a href="http://www.jedrea.com/chmlib/">chmlib</a></span>.</li>
<li><span class="application">GNU info</span> files
with <span class="command">Python</span> and the
<span class="command">info</span> command.</li>
<li><span class="application">EPUB</span> files
with <span class="command">Python</span> and this
<a href="http://pypi.python.org/pypi/epub/">Python epub</a>
decoding module.</li>
<li><span class="application">Tar</span> archives (needs <span
class="command">Python</span>). Tar file indexing is disabled
by default (because tar archives don't typically contain the
kind of documents that people search for), you will need to
enable it explicitely, like with the following in your
<span class="filename">$HOME/.recoll/mimeconf</span> file:
<pre>
[index]
application/x-tar = execm rcltar
</pre>
</li>
<li><span class="application">Zip</span> archives (needs <span
class="command">Python</span>).</li>
<li><span class="application">Rar</span> archives (needs <span
class="command">Python</span>), the
<a href="http://pypi.python.org/pypi/rarfile/">rarfile</a> Python
module and the <a
href="http://www.rarlab.com/rar_add.htm">unrar</a> utility.</li>
<li><span class="application">7zip</span> archives (needs
<span class="command">Python</span> and
the <a href="https://pypi.python.org/pypi/pylzma">pylzma
module</a>). This is a recent addition, and you need to
download the filter from
the <a href="filters/filters.html">filters pages</a> for
all Recoll versions prior to 1.21.</li>
<li><span class="application">iCalendar</span>(.ics) files
(needs <span class="command">Python, <a href=
"http://pypi.python.org/pypi/icalendar/2.1">icalendar</a></span>).</li>
<li><span class="application">Mozilla calendar data</span> See
<a href=
"http://bitbucket.org/medoc/recoll/wiki/IndexMozillaCalendari">
the wiki</a> about this.</li>
<li><span class="application">postscript</span> with <a href=
"http://www.gnu.org/software/ghostscript/ghostscript.html">
ghostscript</a> and <a href=
"http://www.cs.wisc.edu/~ghost/doc/pstotext.htm">pstotext</a>.
Pstotext 1.9 has a serious issue with special characters in
file names, and you should either use the version packaged for
your system which is probably patched, or apply the Debian
patch which is stored <a href=
"files/pstotext-1.9_4-debian.patch">here</a> for
convenience. See http://packages.debian.org/squeeze/pstotext
and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988
for references/explanations.
<blockquote>
To make things a bit easier, I also
store <a href="files/pstotext-1.9-patched.tar.gz">an
already patched version</a>. I added an
install target to the Makefile... This installs to
/usr/local, use <i>make install PREFIX=/usr</i> to
change. So all you need is:
<pre>
tar xvzf pstotext-1.9-patched.tar.gz
cd pstotext-1.9-patched
make
make install
</pre>
</blockquote>
</li>
<li><span class="application">RTF</span> files with
<a href="http://www.gnu.org/software/unrtf/unrtf.html">
unrtf</a>. Please note that up to version 0.21.3,
<span class="command">unrtf</span> mostly does not work with
non western-european character sets. Many serious problems
(crashes with serious security implications and infinite
loops) were fixed in unrtf 0.21.8, so you really want to use
this or a newer release. Building Unrtf from source is quick
and easy.</li>
<li><span class="application">TeX</span> with <span class=
"command">untex</span>. If there is no untex package for
your distribution, <a href="untex/untex-1.3.jf.tar.gz">a
source package is stored on this site</a> (as untex has no
obvious home). Will also work with <a href=
"http://www.cs.purdue.edu/homes/trinkle/detex/">detex</a>
if this is installed.</li>
<li><span class="application">dvi</span> with <a href=
"http://www.radicaleye.com/dvips.html">dvips</a>.</li>
<li><span class="application">djvu</span> with <a href=
"http://djvu.sourceforge.net">DjVuLibre</a>.</li>
<li><span class="application">Audio file tags</span>.
Recoll releases 1.14 and later use a Python filter based
on <a href="http://code.google.com/p/mutagen/">mutagen</a>
for all audio types.</li>
<li><span class="application">Image file tags</span> with <a href=
"http://www.sno.phy.queensu.ca/~phil/exiftool/">exiftool</a>.
This is a perl program, so you also need perl on the
system. This works with about any possible image file and
tag format (jpg, png, tiff, gif etc.).</li>
<li><span class="application">Midi karaoke files</span> with
Python, the
<a href="http://pypi.python.org/pypi/midi/0.2.1">
midi module</a>, and some help
from <a href="http://chardet.feedparser.org/">chardet</a>. There
is probably a <tt>python-chardet</tt> package for your distribution,
but you will quite probably need to build the midi
package. This is easy but see the <a href="helpernotes.html#midi">
notes here</a>.
</li>
<li><span class="application">Konqueror webarchive</span>
format with Python (uses the <tt>tarfile</tt> standard
library module).</li>
<li><span class="application">Mimehtml web archive
format</span> (support based on the mail
filter, which introduces some mild weirdness, but still
usable).</li>
</ul>
<h2><a name="other">Other features</a></h2>
<ul>
<li>Can use a Firefox extension to index visited Web pages
history. See <a href=
"http://bitbucket.org/medoc/recoll/wiki/IndexWebHistory">the
Wiki</a> for more detail.</li>
<li>Processes all email attachments, and more generally any
realistic level of container imbrication (the "msword attachment to
a message inside a mailbox in a zip" thingy...) .</li>
<li>Multiple selectable databases.</li>
<li>Powerful query facilities, with boolean searches,
phrases, filter on file types and directory tree.</li>
<li>Xesam-compatible query language.</li>
<li>Wildcard searches (with a specific and faster function
for file names).</li>
<li>Support for multiple charsets. Internal processing and
storage uses Unicode UTF-8.</li>
<li><a href="#Stemming">Stemming</a> performed at query
time (can switch stemming language after indexing).</li>
<li>Easy installation. No database daemon, web server or
exotic language necessary.</li>
<li>An indexer which runs either as a batch, cron'able
program, or as a real-time indexing daemon, depending on
preference.</li>
</ul>
<h2><a name="integration">Desktop and web integration</a></h2>
<p>The <span class="application">Recoll</span> GUI has many
features that help to specify an efficient search and to manage
the results. However it maybe sometimes preferable to use a
simpler tool with a better integration with your desktop
interfaces. Several solutions exist:</p>
<ul>
<li>The <span class="application">Recoll</span> KIO module
allows starting queries and viewing results from the
Konqueror browser or KDE applications <em>Open</em> dialogs.</li>
<li>The <a href="http://kde-apps.org">recollrunner</a> krunner
module allows integrating Recoll search results into a
krunner query.</li>
<li>The Ubuntu Unity Recoll Lens lets you access Recoll search
from the Unity Dash. More
info <a href="https://bitbucket.org/medoc/recoll/wiki/UnityLens">
here</a>. </li>
<li>The <a href="http://github.com/medoc92/recoll-webui">Recoll
Web UI</a> lets you query a Recoll index from a web browser</li>
</ul>
<p>Recoll also has
<a href="usermanual/usermanual.html#RCL.PROGRAM.API.PYTHON">
<span class="application">Python</span></a> and
<span class="application">PHP</span> modules which can allow
easy integration with web or other applications.</p>
<h2><a name="stemming"></a>Stemming</h2>
<p>Stemming is a process which transforms inflected words
into their most basic form. For example, <i>flooring</i>,
<i>floors</i>, <i>floored</i> would probably all be
transformed to <i>floor</i> by a stemmer for the English
language.</p>
<p>In many search engines, the stemming process occurs during
indexing. The index will only contain the stemmed form of
words, with exceptions for terms which are detected as being
probably proper nouns (ie: capitalized). At query time, the
terms entered by the user are stemmed, then matched against
the index.</p>
<p>This process results into a smaller index, but it has the
grave inconvenient of irrevocably losing information during
indexing.</p>
<p>Recoll works in a different way. No stemming is performed
at query time, so that all information gets into the index.
The resulting index is bigger, but most people probably don't
care much about this nowadays, because they have a 100Gb disk
95% full of binary data <em>which does not get
indexed</em>.</p>
<p>At the end of an indexing pass, Recoll builds one or
several stemming dictionaries, where all word stems are
listed in correspondence to the list of their
derivatives.</p>
<p>At query time, by default, user-entered terms are stemmed,
then matched against the stem database, and the query is
expanded to include all derivatives. This will yield search
results analogous to those obtained by a classical engine.
The benefits of this approach is that stem expansion can be
controlled instantly at query time in several ways:</p>
<ul>
<li>It can be selectively turned-off for any query term by
capitalizing it (<i>Floor</i>).</li>
<li>The stemming language (ie: english, french...) can be
selected (this supposes that several stemming databases
have been built, which can be configured as part of the
indexing, or done later, in a reasonably fast way).</li>
</ul>
</div>
</body>
</html>