a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml
...
...
18
        <address><email>jfd@recoll.org</email></address>
18
        <address><email>jfd@recoll.org</email></address>
19
      </affiliation>
19
      </affiliation>
20
    </author>
20
    </author>
21
21
22
    <copyright>
22
    <copyright>
23
      <year>2005</year>
23
      <year>2005-2011</year>
24
      <holder role="mailto:jfd@recoll.org">Jean-Francois
24
      <holder role="mailto:jfd@recoll.org">Jean-Francois
25
      Dockes</holder>
25
      Dockes</holder>
26
    </copyright>
26
    </copyright>
27
27
28
    <releaseinfo>$Id: usermanual.sgml,v 1.71 2008-12-15 09:33:49 dockes Exp $</releaseinfo>
28
    <releaseinfo>$Id: usermanual.sgml,v 1.71 2008-12-15 09:33:49 dockes Exp $</releaseinfo>
...
...
195
      <itemizedlist>
195
      <itemizedlist>
196
196
197
        <listitem>
197
        <listitem>
198
          <formalpara><title>Periodic indexing:</title>
198
          <formalpara><title>Periodic indexing:</title>
199
            <para>indexing takes place at discrete
199
            <para>indexing takes place at discrete
200
        times, by executing the <command>recollindex</command>
200
              times, by executing the <command>recollindex</command>
201
        command. The typical usage is to have a nightly indexing run 
201
              command. The typical usage is to have a nightly indexing run 
202
      <link linkend="rcl.indexing.periodic.automat">programmed</link> into your
202
              <link linkend="rcl.indexing.periodic.automat">programmed</link>
203
      <command>cron</command> file.</para>
203
              into your <command>cron</command> file.</para>
204
          </formalpara>
204
          </formalpara>
205
        </listitem>
205
        </listitem>
206
206
207
        <listitem>
207
        <listitem>
208
          <formalpara><title>Real time indexing:</title>
208
          <formalpara><title>Real time indexing:</title>
209
            <para>indexing takes place as soon as a file is created or
209
            <para>indexing takes place as soon as a file is created or
210
            changed. <command>recollindex</command> runs as a daemon
210
              changed. <command>recollindex</command> runs as a daemon
211
            and uses a file system alteration monitor such as 
211
              and uses a file system alteration monitor such as 
212
              <application>inotify</application>, 
212
              <application>inotify</application>, 
213
            <application>Fam</application> or
213
            <application>Fam</application> or
214
            <application>Gamin</application>
214
            <application>Gamin</application>
215
            to detect file changes.</para>
215
            to detect file changes.</para>
216
          </formalpara>
216
          </formalpara>
217
        </listitem>
217
        </listitem>
218
      </itemizedlist>
218
      </itemizedlist>
219
219
220
      <para>The choice between the two methods is mostly a matter of
220
      <para>The choice between the two methods is mostly a matter of
221
      preference, and they can be combined by setting up multiple
221
        preference, and they can be combined by setting up multiple
222
      indexes (ie: use periodic indexing on a big documentation
222
        indexes (ie: use periodic indexing on a big documentation
223
      directory, and real time indexing on a small home
223
        directory, and real time indexing on a small home
224
      directory). Monitoring a big file system tree can consume
224
        directory). Monitoring a big file system tree can consume
225
      significant system resources.</para>
225
        significant system resources.</para>
226
226
227
      <para>&RCL; knows about quite a few different document
227
      <para>&RCL; knows about quite a few different document
228
      types. The parameters for document types recognition and
228
        types. The parameters for document types recognition and
229
      processing are set in 
229
        processing are set in 
230
       <link linkend="rcl.indexing.config">configuration files</link>.
230
        <link linkend="rcl.indexing.config">configuration files</link>.</para>
231
      </para>
232
231
233
      <para>Most file types, like HTML or word processing files, only hold
232
      <para>Most file types, like HTML or word processing files, only hold
234
        one document. Some file types, like mail folder files or zip
233
        one document. Some file types, like mail folder files or zip
235
        archives, can hold many individually indexed documents, which may
234
        archives, can hold many individually indexed documents, which may
236
        in turn be themselves compound ones. Such hierarchies can go quite
235
        in turn be themselves compound ones. Such hierarchies can go quite
237
        deep, and &RCL; has no problem processing, for example, an ms-word
236
        deep, and &RCL; has no problem processing, for example, an ms-word
238
        document which would be an attachment to an email message part of
237
        document which would be an attachment to an email message part of
239
        a folder file archived inside a zip file...
238
        a folder file archived inside a zip file...</para>
240
      </para>
241
239
242
      <para>&RCL; indexing processes plain text, HTML, openoffice
240
      <para>&RCL; indexing processes plain text, HTML, openoffice
243
      and e-mail files internally (a few more actually).</para>
241
        and e-mail files internally (a few more actually).</para>
244
242
245
      <para>Other file types (ie: postscript, pdf, ms-word, rtf ...) 
243
      <para>Other file types (ie: postscript, pdf, ms-word, rtf ...) 
246
      need external applications for preprocessing. The list is in the
244
        need external applications for preprocessing. The list is in the
247
      <link linkend="rcl.install.external"> installation</link>
245
        <link linkend="rcl.install.external"> installation</link>
248
      section. After every indexing operation, &RCL; updates a list of
246
        section. After every indexing operation, &RCL; updates a list of
249
      commands that would be needed for indexing existing file
247
        commands that would be needed for indexing existing file
250
      types. This list can be displayed from the
248
        types. This list can be displayed from the
251
      <command>recoll</command> <guilabel>File</guilabel> menu. It is
249
        <command>recoll</command> <guilabel>File</guilabel> menu. It is
252
      stored in the <filename>missing</filename> text file
250
        stored in the <filename>missing</filename> text file
253
      inside the configuration directory.</para>
251
        inside the configuration directory.</para>
254
252
255
      <para>Without further configuration, &RCL; will index all
253
      <para>Without further configuration, &RCL; will index all
256
      appropriate files from your home directory, with a reasonable
254
        appropriate files from your home directory, with a reasonable
257
      set of defaults.</para>
255
        set of defaults.</para>
258
256
259
      <para>In some cases, it may be interesting to index different
257
      <para>In some cases, it may be interesting to index different
260
    areas of the file system to separate databases. You can do this
258
    areas of the file system to separate databases. You can do this
261
    by using multiple configuration directories, each indexing a
259
    by using multiple configuration directories, each indexing a
262
    file system area to a specific database. See the 
260
    file system area to a specific database. See the 
...
...
321
          </listitem>
319
          </listitem>
322
320
323
        </itemizedlist>
321
        </itemizedlist>
324
322
325
      <para>The size of the index is determined by the document set size,
323
      <para>The size of the index is determined by the document set size,
326
      but the ratio can vary a lot. For a typical mixed
324
        but the ratio can vary a lot. For a typical mixed
327
      set of documents, the index size will often be close to
325
        set of documents, the index size will often be close to
328
      the data set size. In specific cases (a set of compressed
326
        the data set size. In specific cases (a set of compressed
329
      mbox files for example), the index can become much bigger than
327
        mbox files for example), the index can become much bigger than
330
      the documents. It may also be much smaller if the documents
328
        the documents. It may also be much smaller if the documents
331
      contain a lot of images or other non-indexed data (an extreme
329
        contain a lot of images or other non-indexed data (an extreme
332
      example being a set of mp3 files where only the tags would be
330
        example being a set of mp3 files where only the tags would be
333
      indexed).</para>
331
        indexed).</para>
334
332
335
      <para>Of course, images, sound and video do not increase the
333
      <para>Of course, images, sound and video do not increase the
336
      index size, which means that it will be quite typical nowadays
334
        index size, which means that it will be quite typical nowadays
337
      (2006), that even a big index will be negligible against the
335
        (2006), that even a big index will be negligible against the
338
      total amount of data on the computer.</para>
336
        total amount of data on the computer.</para>
339
      
337
      
340
      <para>The index data directory (<filename>xapiandb</filename>)
338
      <para>The index data directory (<filename>xapiandb</filename>)
341
    only contains data that can be completely rebuilt by an index
339
    only contains data that can be completely rebuilt by an index
342
    run, and it can always be destroyed safely.</para>
340
    run, and it can always be destroyed safely.</para>
343
341
...
...
383
381
384
      <sect2 id="rcl.indexing.storage.security">
382
      <sect2 id="rcl.indexing.storage.security">
385
        <title>Security aspects</title>
383
        <title>Security aspects</title>
386
384
387
        <para>The &RCL; index does not hold copies of the indexed
385
        <para>The &RCL; index does not hold copies of the indexed
388
        documents. But it does hold enough data to allow for an almost
386
          documents. But it does hold enough data to allow for an almost
389
        complete reconstruction. If confidential data is indexed,
387
          complete reconstruction. If confidential data is indexed,
390
        access to the database directory should be restricted. </para>
388
          access to the database directory should be restricted. </para>
391
389
392
        <para>As of version 1.4, &RCL; will create the configuration
390
        <para>As of version 1.4, &RCL; will create the configuration
393
        directory with a mode of 0700 (access by owner only). As the
391
          directory with a mode of 0700 (access by owner only). As the
394
        index data directory is by default a sub-directory of the
392
          index data directory is by default a sub-directory of the
395
        configuration directory, this should result in appropriate
393
          configuration directory, this should result in appropriate
396
        protection.</para> 
394
          protection.</para> 
397
395
398
        <para>If you use another setup, you should think of the kind
396
        <para>If you use another setup, you should think of the kind
399
        of protection you need for your index, set the directory
397
          of protection you need for your index, set the directory
400
        and files access modes appropriately, and also maybe adjust
398
          and files access modes appropriately, and also maybe adjust
401
        the <literal>umask</literal> used during index updates.</para>
399
          the <literal>umask</literal> used during index updates.</para>
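As an illustration of the advice above for non-default setups, here is a minimal Python sketch, not part of Recoll itself. It assumes a POSIX system, and the helper names (`group_other_bits`, `restrict_to_owner`, `run_with_private_umask`) are hypothetical. It checks whether a directory is readable by anyone other than its owner, restricts it to mode 0700, and runs a task under a umask of 077.

```python
import os
import stat


def group_other_bits(path):
    """Return the permission bits granted to group and others on `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO)


def restrict_to_owner(path):
    """Set `path` to mode 0700 (access by owner only)."""
    os.chmod(path, stat.S_IRWXU)


def run_with_private_umask(func, *args):
    """Call `func` with umask 077 so that files it creates are owner-only.

    The previous umask is restored afterwards, even if `func` raises.
    """
    old = os.umask(0o077)
    try:
        return func(*args)
    finally:
        os.umask(old)
```

A possible use is to call `group_other_bits()` on the index directory you configured and, if the result is non-zero, call `restrict_to_owner()` on it. The same umask effect can be had by setting the umask in the shell or script that starts the index update.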
402
        
400
        
403
401
404
      </sect2>
402
      </sect2>
405
403
406
    </sect1>
404
    </sect1>
407
405
408
    <sect1 id="rcl.indexing.config">
406
    <sect1 id="rcl.indexing.config">
409
      <title>Indexing configuration</title>
407
      <title>Indexing configuration</title>
410
408
411
      <para>Variables set inside the 
409
      <para>Variables set inside the 
412
      <link linkend="rcl.install.config">&RCL; configuration files</link>
410
        <link linkend="rcl.install.config">&RCL; configuration files</link>
413
      control which areas of the file system are indexed, and how
411
        control which areas of the file system are indexed, and how
414
      files are processed. These variables can be set either by
412
        files are processed. These variables can be set either by
415
      editing the text files or using the dialogs in the
413
        editing the text files or using the dialogs in the
416
      <command>recoll</command> GUI.</para>
414
        <command>recoll</command> GUI.</para>
417
415
418
      <para>You can also use <link linkend="rcl.search.multidb">multiple 
416
      <para>You can also use <link linkend="rcl.search.multidb">multiple 
419
      indexes</link> defined by separate configurations, typically to 
417
          indexes</link> defined by separate configurations, typically to 
420
      separate personal and shared indexes, or to take advantage of
418
        separate personal and shared indexes, or to take advantage of
421
      the organization of your data to improve search precision.</para> 
419
        the organization of your data to improve search precision.</para> 
422
420
423
      <para>The first time you start <command>recoll</command>, you
421
      <para>The first time you start <command>recoll</command>, you
424
      will be asked whether or not you would like it to build the
422
        will be asked whether or not you would like it to build the
425
      index. If you want to adjust the configuration before indexing,
423
        index. If you want to adjust the configuration before indexing,
426
      just click <guilabel>Cancel</guilabel> at this point, which will get
424
        just click <guilabel>Cancel</guilabel> at this point, which will get
427
      you into the configuration interface. If you exit, 
425
        you into the configuration interface. If you exit, 
428
      <filename>recoll</filename> will have created a ~/.recoll directory
426
        <filename>recoll</filename> will have created a ~/.recoll directory
429
      containing empty configuration files, which you can edit by hand.</para>
427
        containing empty configuration files, which you can edit by hand.</para>
430
428
431
      <para>The configuration is documented inside the <link
429
      <para>The configuration is documented inside the 
432
      linkend="rcl.install.config">installation chapter</link> of this
430
        <link linkend="rcl.install.config">installation chapter</link> 
433
      document, or in the recoll.conf(5) man page, but the most
431
        of this document, or in the recoll.conf(5) man page, but the most
434
      current information will most likely be the comments inside the
432
        current information will most likely be the comments inside the
435
      sample file. The most immediately useful variable you may be
433
        sample file. The most immediately useful variable you may be
436
      interested in is probably <link
434
        interested in is probably 
437
      linkend="rcl.install.config.recollconf.topdirs">topdirs</link>,
435
        <link linkend="rcl.install.config.recollconf.topdirs">topdirs</link>,
438
      which determines what subtrees get indexed.</para>
436
        which determines what subtrees get indexed.</para>
439
437
440
      <para>The applications needed to index file types other than
438
      <para>The applications needed to index file types other than
441
      text, HTML or email (ie: pdf, postscript, ms-word...) are
439
        text, HTML or email (ie: pdf, postscript, ms-word...) are
442
      described in the <link linkend="rcl.install.external">external
440
        described in the <link linkend="rcl.install.external">external
443
      packages section</link></para>
441
          packages section</link></para>
444
442
445
      <sect2 id="rcl.indexing.config.gui">
443
      <sect2 id="rcl.indexing.config.gui">
446
        <title>The indexing configuration GUI</title>
444
        <title>The indexing configuration GUI</title>
447
445
448
        <para>Most parameters for a given indexing configuration can
446
        <para>Most parameters for a given indexing configuration can
...
...
508
506
509
    <sect1 id="rcl.indexing.periodic">
507
    <sect1 id="rcl.indexing.periodic">
510
      <title>Periodic indexing</title>
508
      <title>Periodic indexing</title>
511
509
512
      <sect2 id="rcl.indexing.periodic.exec">
510
      <sect2 id="rcl.indexing.periodic.exec">
513
        <title>Starting indexing</title>
511
        <title>Running indexing</title>
514
512
515
        <para>Indexing is performed either by the
513
        <para>Indexing is performed either by the
516
          <command>recollindex</command> program, or by the
514
          <command>recollindex</command> program, or by the
517
          indexing thread inside the <command>recoll</command>
515
          indexing thread inside the <command>recoll</command>
518
          program (use the <guimenu>File</guimenu> menu). Both programs
516
          program (use the <guimenu>File</guimenu> menu). Both programs
...
...
523
521
524
        <para>Reasons to use either the indexing thread or the
522
        <para>Reasons to use either the indexing thread or the
525
        <command>recollindex</command> command:
523
        <command>recollindex</command> command:
526
          <itemizedlist>
524
          <itemizedlist>
527
            <listitem><para>Starting the indexing thread is more convenient,
525
            <listitem><para>Starting the indexing thread is more convenient,
528
            being just one click away.</para>
526
                being just one click away.</para>
529
            </listitem>
527
            </listitem>
530
            <listitem><para>The <command>recollindex</command> command has
528
            <listitem><para>The <command>recollindex</command> command has
531
            more options, especially the one to reset the index
529
                more options, especially the one to reset the index
532
            (<literal>-z</literal>).</para>
530
                (<literal>-z</literal>).</para>
533
            </listitem>
531
            </listitem>
534
            <listitem><para>The <command>recollindex</command> command will
532
            <listitem><para>The <command>recollindex</command> command will
535
            not take down your GUI if it crashes (a rare occurrence, but who
533
                not take down your GUI if it crashes (a rare occurrence,
536
            knows...)</para>
534
                but who knows...)</para>
537
            </listitem>
535
            </listitem>
538
            <listitem><para>The <command>recollindex</command> command uses
536
            <listitem><para>The <command>recollindex</command> command uses
539
            <command>setpriority/nice</command> to lower its priority while
537
                <command>setpriority/nice</command> to lower its priority while
540
            indexing 
538
                indexing 
541
            (it will also use <command>ionice</command> when this becomes
539
                (it will also use <command>ionice</command> when this becomes
542
            more widely available), the thread can't do it, else it would
540
                more widely available), the thread can't do it, else it would
543
            also slow down the user/search interface.</para>
541
                also slow down the user/search interface.</para>
544
            </listitem>
542
            </listitem>
545
          </itemizedlist>
543
          </itemizedlist>
546
          I'll let the reader decide where my heart belongs...</para>
544
          I'll let the reader decide where my heart belongs...</para>
547
545
548
        <para>If the <command>recoll</command> program finds no index
546
        <para>If the <command>recoll</command> program finds no index
...
...
565
          interruption point (the full file tree will be traversed,
563
          interruption point (the full file tree will be traversed,
566
          but files that were indexed up to the interruption and are still
564
          but files that were indexed up to the interruption and are still
567
          up to date will not need to be reindexed).</para>
565
          up to date will not need to be reindexed).</para>
568
566
569
        <para><command>recollindex</command> has a number of other options
567
        <para><command>recollindex</command> has a number of other options
570
        which are described in its man page.</para>
568
          which are described in its man page.</para>
569
570
        <para>Of special interest maybe are the <literal>-i</literal> and
571
          <literal>-f</literal> options. <literal>-i</literal> allows
572
          indexing an explicit list of files (given as command line
573
          parameters or read on stdin). <literal>-f</literal> tells
574
          <command>recollindex</command> to ignore file selection
575
          parameters from the configuration. Together, these options allow
576
          building a custom file selection process for some area of the
577
          file system, by adding the top directory to the
578
          <literal>skippedPaths</literal> list and using an appropriate
579
          file selection method to build the file list to be fed to
580
          <literal>recollindex&nbsp;-if</literal> .</para>
581
582
        <para><literal>recollindex&nbsp;-i</literal> will not descend into
583
          directory parameters, but just add them as index entries. It is
584
          up to the external file selection method to build the complete
585
          file list.</para>
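The custom selection process described above could be sketched as follows in Python. This is an illustration under stated assumptions, not a documented recipe. The suffix filter is a hypothetical selection rule, the function names are invented for the example, and the script assumes that `recollindex -if` accepts one path per line on stdin. Only regular files are emitted, since `-i` does not descend into directories.

```python
import os
import shutil
import subprocess


def select_files(top, suffixes=(".txt", ".html")):
    """Walk `top` and return the files whose names end with one of `suffixes`.

    The suffix rule is a placeholder for whatever selection logic is needed.
    Directories are never returned: the walk expands them into their files.
    """
    selected = []
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            if name.lower().endswith(tuple(suffixes)):
                selected.append(os.path.join(dirpath, name))
    return selected


def feed_recollindex(paths):
    """Pipe `paths` to `recollindex -if` (assumption: one path per line)."""
    if not paths or shutil.which("recollindex") is None:
        return None  # nothing to index, or recollindex is not installed
    result = subprocess.run(
        ["recollindex", "-if"],
        input="\n".join(paths) + "\n",
        text=True,
        check=False,
    )
    return result.returncode
```

For this to have the intended effect, the top directory passed to `select_files()` would also be listed in `skippedPaths`, so that the normal indexing pass leaves the area to the custom selection.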
571
      </sect2>
586
      </sect2>
572
587
573
      <sect2 id="rcl.indexing.periodic.automat">
588
      <sect2 id="rcl.indexing.periodic.automat">
574
        <title>Using <command>cron</command> to automate
589
        <title>Using <command>cron</command> to automate
575
          indexing</title>
590
          indexing</title>