|
a/src/doc/user/usermanual.sgml |
|
b/src/doc/user/usermanual.sgml |
|
... |
|
... |
18 |
<address><email>jfd@recoll.org</email></address>
|
18 |
<address><email>jfd@recoll.org</email></address>
|
19 |
</affiliation>
|
19 |
</affiliation>
|
20 |
</author>
|
20 |
</author>
|
21 |
|
21 |
|
22 |
<copyright>
|
22 |
<copyright>
|
23 |
<year>2005</year>
|
23 |
<year>2005-2011</year>
|
24 |
<holder role="mailto:jfd@recoll.org">Jean-Francois
|
24 |
<holder role="mailto:jfd@recoll.org">Jean-Francois
|
25 |
Dockes</holder>
|
25 |
Dockes</holder>
|
26 |
</copyright>
|
26 |
</copyright>
|
27 |
|
27 |
|
28 |
<releaseinfo>$Id: usermanual.sgml,v 1.71 2008-12-15 09:33:49 dockes Exp $</releaseinfo>
|
28 |
<releaseinfo>$Id: usermanual.sgml,v 1.71 2008-12-15 09:33:49 dockes Exp $</releaseinfo>
|
|
... |
|
... |
195 |
<itemizedlist>
|
195 |
<itemizedlist>
|
196 |
|
196 |
|
197 |
<listitem>
|
197 |
<listitem>
|
198 |
<formalpara><title>Periodic indexing:</title>
|
198 |
<formalpara><title>Periodic indexing:</title>
|
199 |
<para>indexing takes place at discrete
|
199 |
<para>indexing takes place at discrete
|
200 |
times, by executing the <command>recollindex</command>
|
200 |
times, by executing the <command>recollindex</command>
|
201 |
command. The typical usage is to have a nightly indexing run
|
201 |
command. The typical usage is to have a nightly indexing run
|
202 |
<link linkend="rcl.indexing.periodic.automat">programmed</link> into your
|
202 |
<link linkend="rcl.indexing.periodic.automat">programmed</link>
|
203 |
<command>cron</command> file.</para>
|
203 |
into your <command>cron</command> file.</para>
|
204 |
</formalpara>
|
204 |
</formalpara>
|
205 |
</listitem>
|
205 |
</listitem>
|
206 |
|
206 |
|
207 |
<listitem>
|
207 |
<listitem>
|
208 |
<formalpara><title>Real time indexing:</title>
|
208 |
<formalpara><title>Real time indexing:</title>
|
209 |
<para>indexing takes place as soon as a file is created or
|
209 |
<para>indexing takes place as soon as a file is created or
|
210 |
changed. <command>recollindex</command> runs as a daemon
|
210 |
changed. <command>recollindex</command> runs as a daemon
|
211 |
and uses a file system alteration monitor such as
|
211 |
and uses a file system alteration monitor such as
|
212 |
<application>inotify</application>,
|
212 |
<application>inotify</application>,
|
213 |
<application>Fam</application> or
|
213 |
<application>Fam</application> or
|
214 |
<application>Gamin</application>
|
214 |
<application>Gamin</application>
|
215 |
to detect file changes.</para>
|
215 |
to detect file changes.</para>
|
216 |
</formalpara>
|
216 |
</formalpara>
|
217 |
</listitem>
|
217 |
</listitem>
|
218 |
</itemizedlist>
|
218 |
</itemizedlist>
|
219 |
|
219 |
|
220 |
<para>The choice between the two methods is mostly a matter of
|
220 |
<para>The choice between the two methods is mostly a matter of
|
221 |
preference, and they can be combined by setting up multiple
|
221 |
preference, and they can be combined by setting up multiple
|
222 |
indexes (e.g. use periodic indexing on a big documentation
|
222 |
indexes (e.g. use periodic indexing on a big documentation
|
223 |
directory, and real time indexing on a small home
|
223 |
directory, and real time indexing on a small home
|
224 |
directory). Monitoring a big file system tree can consume
|
224 |
directory). Monitoring a big file system tree can consume
|
225 |
significant system resources.</para>
|
225 |
significant system resources.</para>
|
226 |
|
226 |
|
227 |
<para>&RCL; knows about quite a few different document
|
227 |
<para>&RCL; knows about quite a few different document
|
228 |
types. The parameters for document types recognition and
|
228 |
types. The parameters for document types recognition and
|
229 |
processing are set in
|
229 |
processing are set in
|
230 |
<link linkend="rcl.indexing.config">configuration files</link>.
|
230 |
<link linkend="rcl.indexing.config">configuration files</link>.</para>
|
231 |
</para>
|
|
|
232 |
|
231 |
|
233 |
<para>Most file types, like HTML or word processing files, only hold
|
232 |
<para>Most file types, like HTML or word processing files, only hold
|
234 |
one document. Some file types, like mail folder files or zip
|
233 |
one document. Some file types, like mail folder files or zip
|
235 |
archives, can hold many individually indexed documents, which may
|
234 |
archives, can hold many individually indexed documents, which may
|
236 |
in turn be themselves compound ones. Such hierarchies can go quite
|
235 |
in turn be themselves compound ones. Such hierarchies can go quite
|
237 |
deep, and &RCL; has no problem processing, for example, an ms-word
|
236 |
deep, and &RCL; has no problem processing, for example, an ms-word
|
238 |
document which would be an attachment to an email message part of
|
237 |
document which would be an attachment to an email message part of
|
239 |
a folder file archived inside a zip file...
|
238 |
a folder file archived inside a zip file...</para>
|
240 |
</para>
|
|
|
241 |
|
239 |
|
242 |
<para>&RCL; indexing processes plain text, HTML, openoffice
|
240 |
<para>&RCL; indexing processes plain text, HTML, openoffice
|
243 |
and e-mail files internally (a few more actually).</para>
|
241 |
and e-mail files internally (a few more actually).</para>
|
244 |
|
242 |
|
245 |
<para>Other file types (e.g. postscript, pdf, ms-word, rtf ...)
|
243 |
<para>Other file types (e.g. postscript, pdf, ms-word, rtf ...)
|
246 |
need external applications for preprocessing. The list is in the
|
244 |
need external applications for preprocessing. The list is in the
|
247 |
<link linkend="rcl.install.external"> installation</link>
|
245 |
<link linkend="rcl.install.external"> installation</link>
|
248 |
section. After every indexing operation, &RCL; updates a list of
|
246 |
section. After every indexing operation, &RCL; updates a list of
|
249 |
commands that would be needed for indexing existing file
|
247 |
commands that would be needed for indexing existing file
|
250 |
types. This list can be displayed from the
|
248 |
types. This list can be displayed from the
|
251 |
<command>recoll</command> <guilabel>File</guilabel> menu. It is
|
249 |
<command>recoll</command> <guilabel>File</guilabel> menu. It is
|
252 |
stored in the <filename>missing</filename> text file
|
250 |
stored in the <filename>missing</filename> text file
|
253 |
inside the configuration directory.</para>
|
251 |
inside the configuration directory.</para>
|
254 |
|
252 |
|
255 |
<para>Without further configuration, &RCL; will index all
|
253 |
<para>Without further configuration, &RCL; will index all
|
256 |
appropriate files from your home directory, with a reasonable
|
254 |
appropriate files from your home directory, with a reasonable
|
257 |
set of defaults.</para>
|
255 |
set of defaults.</para>
|
258 |
|
256 |
|
259 |
<para>In some cases, it may be interesting to index different
|
257 |
<para>In some cases, it may be interesting to index different
|
260 |
areas of the file system to separate databases. You can do this
|
258 |
areas of the file system to separate databases. You can do this
|
261 |
by using multiple configuration directories, each indexing a
|
259 |
by using multiple configuration directories, each indexing a
|
262 |
file system area to a specific database. See the
|
260 |
file system area to a specific database. See the
|
|
... |
|
... |
321 |
</listitem>
|
319 |
</listitem>
|
322 |
|
320 |
|
323 |
</itemizedlist>
|
321 |
</itemizedlist>
|
324 |
|
322 |
|
325 |
<para>The size of the index is determined by the document set size,
|
323 |
<para>The size of the index is determined by the document set size,
|
326 |
but the ratio can vary a lot. For a typical mixed
|
324 |
but the ratio can vary a lot. For a typical mixed
|
327 |
set of documents, the index size will often be close to
|
325 |
set of documents, the index size will often be close to
|
328 |
the data set size. In specific cases (a set of compressed
|
326 |
the data set size. In specific cases (a set of compressed
|
329 |
mbox files for example), the index can become much bigger than
|
327 |
mbox files for example), the index can become much bigger than
|
330 |
the documents. It may also be much smaller if the documents
|
328 |
the documents. It may also be much smaller if the documents
|
331 |
contain a lot of images or other non-indexed data (an extreme
|
329 |
contain a lot of images or other non-indexed data (an extreme
|
332 |
example being a set of mp3 files where only the tags would be
|
330 |
example being a set of mp3 files where only the tags would be
|
333 |
indexed).</para>
|
331 |
indexed).</para>
|
334 |
|
332 |
|
335 |
<para>Of course, images, sound and video do not increase the
|
333 |
<para>Of course, images, sound and video do not increase the
|
336 |
index size, which means that it will be quite typical nowadays
|
334 |
index size, which means that it will be quite typical nowadays
|
337 |
(2006), that even a big index will be negligible against the
|
335 |
(2006), that even a big index will be negligible against the
|
338 |
total amount of data on the computer.</para>
|
336 |
total amount of data on the computer.</para>
|
339 |
|
337 |
|
340 |
<para>The index data directory (<filename>xapiandb</filename>)
|
338 |
<para>The index data directory (<filename>xapiandb</filename>)
|
341 |
only contains data that can be completely rebuilt by an index
|
339 |
only contains data that can be completely rebuilt by an index
|
342 |
run, and it can always be destroyed safely.</para>
|
340 |
run, and it can always be destroyed safely.</para>
|
343 |
|
341 |
|
|
... |
|
... |
383 |
|
381 |
|
384 |
<sect2 id="rcl.indexing.storage.security">
|
382 |
<sect2 id="rcl.indexing.storage.security">
|
385 |
<title>Security aspects</title>
|
383 |
<title>Security aspects</title>
|
386 |
|
384 |
|
387 |
<para>The &RCL; index does not hold copies of the indexed
|
385 |
<para>The &RCL; index does not hold copies of the indexed
|
388 |
documents. But it does hold enough data to allow for an almost
|
386 |
documents. But it does hold enough data to allow for an almost
|
389 |
complete reconstruction. If confidential data is indexed,
|
387 |
complete reconstruction. If confidential data is indexed,
|
390 |
access to the database directory should be restricted. </para>
|
388 |
access to the database directory should be restricted. </para>
|
391 |
|
389 |
|
392 |
<para>As of version 1.4, &RCL; will create the configuration
|
390 |
<para>As of version 1.4, &RCL; will create the configuration
|
393 |
directory with a mode of 0700 (access by owner only). As the
|
391 |
directory with a mode of 0700 (access by owner only). As the
|
394 |
index data directory is by default a sub-directory of the
|
392 |
index data directory is by default a sub-directory of the
|
395 |
configuration directory, this should result in appropriate
|
393 |
configuration directory, this should result in appropriate
|
396 |
protection.</para>
|
394 |
protection.</para>
|
397 |
|
395 |
|
398 |
<para>If you use another setup, you should think of the kind
|
396 |
<para>If you use another setup, you should think of the kind
|
399 |
of protection you need for your index, set the directory
|
397 |
of protection you need for your index, set the directory
|
400 |
and file access modes appropriately, and possibly adjust
|
398 |
and file access modes appropriately, and possibly adjust
|
401 |
the <literal>umask</literal> used during index updates.</para>
|
399 |
the <literal>umask</literal> used during index updates.</para>
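As an illustration of the umask advice above, here is a minimal Python sketch that runs an index update with a restrictive umask so that newly created index files are readable by the owner only. The helper name run_with_umask and the 0o077 default are assumptions made for this example, not part of Recoll; the sketch also assumes a POSIX system and that a bare recollindex invocation performs the update.

```python
import os
import subprocess


def run_with_umask(command, mask=0o077):
    """Run `command` with a temporary umask and return its exit status.

    The child process inherits the umask, so files it creates get no
    group/other permission bits when mask is 0o077. The previous umask
    is restored afterwards, even if starting the command fails.
    """
    previous = os.umask(mask)
    try:
        return subprocess.run(list(command), check=False).returncode
    finally:
        os.umask(previous)


if __name__ == "__main__":
    # Assumption: "recollindex" is on the PATH and updates the default index.
    raise SystemExit(run_with_umask(["recollindex"]))
```

The same effect can be had by setting the umask in the shell or cron environment that starts the indexer; the wrapper only makes the setting explicit and local to the indexing run.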
|
402 |
|
400 |
|
403 |
|
401 |
|
404 |
</sect2>
|
402 |
</sect2>
|
405 |
|
403 |
|
406 |
</sect1>
|
404 |
</sect1>
|
407 |
|
405 |
|
408 |
<sect1 id="rcl.indexing.config">
|
406 |
<sect1 id="rcl.indexing.config">
|
409 |
<title>Indexing configuration</title>
|
407 |
<title>Indexing configuration</title>
|
410 |
|
408 |
|
411 |
<para>Variables set inside the
|
409 |
<para>Variables set inside the
|
412 |
<link linkend="rcl.install.config">&RCL; configuration files</link>
|
410 |
<link linkend="rcl.install.config">&RCL; configuration files</link>
|
413 |
control which areas of the file system are indexed, and how
|
411 |
control which areas of the file system are indexed, and how
|
414 |
files are processed. These variables can be set either by
|
412 |
files are processed. These variables can be set either by
|
415 |
editing the text files or using the dialogs in the
|
413 |
editing the text files or using the dialogs in the
|
416 |
<command>recoll</command> GUI.</para>
|
414 |
<command>recoll</command> GUI.</para>
|
417 |
|
415 |
|
418 |
<para>You can also use <link linkend="rcl.search.multidb">multiple
|
416 |
<para>You can also use <link linkend="rcl.search.multidb">multiple
|
419 |
indexes</link> defined by separate configurations, typically to
|
417 |
indexes</link> defined by separate configurations, typically to
|
420 |
separate personal and shared indexes, or to take advantage of
|
418 |
separate personal and shared indexes, or to take advantage of
|
421 |
the organization of your data to improve search precision.</para>
|
419 |
the organization of your data to improve search precision.</para>
|
422 |
|
420 |
|
423 |
<para>The first time you start <command>recoll</command>, you
|
421 |
<para>The first time you start <command>recoll</command>, you
|
424 |
will be asked whether or not you would like it to build the
|
422 |
will be asked whether or not you would like it to build the
|
425 |
index. If you want to adjust the configuration before indexing,
|
423 |
index. If you want to adjust the configuration before indexing,
|
426 |
just click <guilabel>Cancel</guilabel> at this point, which will get
|
424 |
just click <guilabel>Cancel</guilabel> at this point, which will get
|
427 |
you into the configuration interface. If you exit,
|
425 |
you into the configuration interface. If you exit,
|
428 |
<filename>recoll</filename> will have created a ~/.recoll directory
|
426 |
<filename>recoll</filename> will have created a ~/.recoll directory
|
429 |
containing empty configuration files, which you can edit by hand.</para>
|
427 |
containing empty configuration files, which you can edit by hand.</para>
|
430 |
|
428 |
|
431 |
<para>The configuration is documented inside the <link
|
429 |
<para>The configuration is documented inside the
|
432 |
linkend="rcl.install.config">installation chapter</link> of this
|
430 |
<link linkend="rcl.install.config">installation chapter</link>
|
433 |
document, or in the recoll.conf(5) man page, but the most
|
431 |
of this document, or in the recoll.conf(5) man page, but the most
|
434 |
current information will most likely be the comments inside the
|
432 |
current information will most likely be the comments inside the
|
435 |
sample file. The most immediately useful variable you may be
|
433 |
sample file. The most immediately useful variable you may be
|
436 |
interested in is probably <link
|
434 |
interested in is probably
|
437 |
linkend="rcl.install.config.recollconf.topdirs">topdirs</link>,
|
435 |
<link linkend="rcl.install.config.recollconf.topdirs">topdirs</link>,
|
438 |
which determines what subtrees get indexed.</para>
|
436 |
which determines what subtrees get indexed.</para>
|
439 |
|
437 |
|
440 |
<para>The applications needed to index file types other than
|
438 |
<para>The applications needed to index file types other than
|
441 |
text, HTML or email (e.g. pdf, postscript, ms-word...) are
|
439 |
text, HTML or email (e.g. pdf, postscript, ms-word...) are
|
442 |
described in the <link linkend="rcl.install.external">external
|
440 |
described in the <link linkend="rcl.install.external">external
|
443 |
packages section</link>.</para>
|
441 |
packages section</link>.</para>
|
444 |
|
442 |
|
445 |
<sect2 id="rcl.indexing.config.gui">
|
443 |
<sect2 id="rcl.indexing.config.gui">
|
446 |
<title>The indexing configuration GUI</title>
|
444 |
<title>The indexing configuration GUI</title>
|
447 |
|
445 |
|
448 |
<para>Most parameters for a given indexing configuration can
|
446 |
<para>Most parameters for a given indexing configuration can
|
|
... |
|
... |
508 |
|
506 |
|
509 |
<sect1 id="rcl.indexing.periodic">
|
507 |
<sect1 id="rcl.indexing.periodic">
|
510 |
<title>Periodic indexing</title>
|
508 |
<title>Periodic indexing</title>
|
511 |
|
509 |
|
512 |
<sect2 id="rcl.indexing.periodic.exec">
|
510 |
<sect2 id="rcl.indexing.periodic.exec">
|
513 |
<title>Starting indexing</title>
|
511 |
<title>Running indexing</title>
|
514 |
|
512 |
|
515 |
<para>Indexing is performed either by the
|
513 |
<para>Indexing is performed either by the
|
516 |
<command>recollindex</command> program, or by the
|
514 |
<command>recollindex</command> program, or by the
|
517 |
indexing thread inside the <command>recoll</command>
|
515 |
indexing thread inside the <command>recoll</command>
|
518 |
program (use the <guimenu>File</guimenu> menu). Both programs
|
516 |
program (use the <guimenu>File</guimenu> menu). Both programs
|
|
... |
|
... |
523 |
|
521 |
|
524 |
<para>Reasons to use either the indexing thread or the
|
522 |
<para>Reasons to use either the indexing thread or the
|
525 |
<command>recollindex</command> command:
|
523 |
<command>recollindex</command> command:
|
526 |
<itemizedlist>
|
524 |
<itemizedlist>
|
527 |
<listitem><para>Starting the indexing thread is more convenient,
|
525 |
<listitem><para>Starting the indexing thread is more convenient,
|
528 |
being just one click away.</para>
|
526 |
being just one click away.</para>
|
529 |
</listitem>
|
527 |
</listitem>
|
530 |
<listitem><para>The <command>recollindex</command> command has
|
528 |
<listitem><para>The <command>recollindex</command> command has
|
531 |
more options, especially the one to reset the index
|
529 |
more options, especially the one to reset the index
|
532 |
(<literal>-z</literal>).</para>
|
530 |
(<literal>-z</literal>).</para>
|
533 |
</listitem>
|
531 |
</listitem>
|
534 |
<listitem><para>The <command>recollindex</command> command will
|
532 |
<listitem><para>The <command>recollindex</command> command will
|
535 |
not take down your GUI if it crashes (a rare occurrence, but who
|
533 |
not take down your GUI if it crashes (a rare occurrence,
|
536 |
knows...)</para>
|
534 |
but who knows...)</para>
|
537 |
</listitem>
|
535 |
</listitem>
|
538 |
<listitem><para>The <command>recollindex</command> command uses
|
536 |
<listitem><para>The <command>recollindex</command> command uses
|
539 |
<command>setpriority/nice</command> to lower its priority while
|
537 |
<command>setpriority/nice</command> to lower its priority while
|
540 |
indexing
|
538 |
indexing
|
541 |
(it will also use <command>ionice</command> when this becomes
|
539 |
(it will also use <command>ionice</command> when this becomes
|
542 |
more widely available). The thread cannot do this, as it would
|
540 |
more widely available). The thread cannot do this, as it would
|
543 |
also slow down the user/search interface.</para>
|
541 |
also slow down the user/search interface.</para>
|
544 |
</listitem>
|
542 |
</listitem>
|
545 |
</itemizedlist>
|
543 |
</itemizedlist>
|
546 |
I'll let the reader decide where my heart belongs...</para>
|
544 |
I'll let the reader decide where my heart belongs...</para>
|
547 |
|
545 |
|
548 |
<para>If the <command>recoll</command> program finds no index
|
546 |
<para>If the <command>recoll</command> program finds no index
|
|
... |
|
... |
565 |
interruption point (the full file tree will be traversed,
|
563 |
interruption point (the full file tree will be traversed,
|
566 |
but files that were indexed up to the interruption and are still
|
564 |
but files that were indexed up to the interruption and are still
|
567 |
up to date will not need to be reindexed).</para>
|
565 |
up to date will not need to be reindexed).</para>
|
568 |
|
566 |
|
569 |
<para><command>recollindex</command> has a number of other options
|
567 |
<para><command>recollindex</command> has a number of other options
|
570 |
which are described in its man page.</para>
|
568 |
which are described in its man page.</para>
|
|
|
569 |
|
|
|
570 |
<para>Of particular interest are the <literal>-i</literal> and
|
|
|
571 |
<literal>-f</literal> options. <literal>-i</literal> allows
|
|
|
572 |
indexing an explicit list of files (given as command line
|
|
|
573 |
parameters or read on stdin). <literal>-f</literal> tells
|
|
|
574 |
<command>recollindex</command> to ignore file selection
|
|
|
575 |
parameters from the configuration. Together, these options allow
|
|
|
576 |
building a custom file selection process for some area of the
|
|
|
577 |
file system, by adding the top directory to the
|
|
|
578 |
<literal>skippedPaths</literal> list and using an appropriate
|
|
|
579 |
file selection method to build the file list to be fed to
|
|
|
580 |
<literal>recollindex -if</literal>.</para>
|
|
|
581 |
|
|
|
582 |
<para><literal>recollindex -i</literal> will not descend into
|
|
|
583 |
directory parameters, but just add them as index entries. It is
|
|
|
584 |
up to the external file selection method to build the complete
|
|
|
585 |
file list.</para>
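The custom selection process described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function names and the suffix-based filter are hypothetical, and the sketch assumes that recollindex -if expects one file path per line on standard input. The walk expands directories itself, since -i does not descend into them.

```python
import os
import subprocess
import sys


def select_files(top, suffixes=(".txt", ".pdf")):
    """Walk `top` and return the files accepted by a custom rule.

    The suffix filter is a placeholder for any selection logic. Only
    regular file paths are returned, never directories.
    """
    selected = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in filenames:
            if name.lower().endswith(tuple(suffixes)):
                selected.append(os.path.join(dirpath, name))
    return sorted(selected)


def format_file_list(paths):
    """Assumption: the indexer reads one path per line on stdin."""
    return "".join(path + "\n" for path in paths)


def index_files(paths, command=("recollindex", "-if")):
    """Feed the file list to the indexing command on standard input."""
    result = subprocess.run(
        list(command), input=format_file_list(paths), text=True, check=False
    )
    return result.returncode


if __name__ == "__main__":
    top = sys.argv[1] if len(sys.argv) > 1 else "."
    files = select_files(top)
    if files:
        sys.exit(index_files(files))
```

For this to be the only way the area gets indexed, its top directory would also be listed in skippedPaths, as the text explains, so that normal indexing runs leave it alone.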
|
571 |
</sect2>
|
586 |
</sect2>
|
572 |
|
587 |
|
573 |
<sect2 id="rcl.indexing.periodic.automat">
|
588 |
<sect2 id="rcl.indexing.periodic.automat">
|
574 |
<title>Using <command>cron</command> to automate
|
589 |
<title>Using <command>cron</command> to automate
|
575 |
indexing</title>
|
590 |
indexing</title>
|