a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml
...
...
37
          web site</ulink>.</literal></para>
37
          web site</ulink>.</literal></para>
38
38
39
      <para>This document introduces full text search notions
39
      <para>This document introduces full text search notions
40
      and describes the installation and use of the &RCL;
40
      and describes the installation and use of the &RCL;
41
      application. It currently describes &RCL; &RCLVERSION;.</para>
41
      application. It currently describes &RCL; &RCLVERSION;.</para>
42
<!--      <para>[ <ulink url="index.html">Split HTML</ulink> / 
43
             <ulink url="usermanual-xml.html">Single HTML</ulink> ]</para>
44
-->
45
    </abstract>
42
    </abstract>
46
43
47
44
48
  </bookinfo>
45
  </bookinfo>
49
  
46
  
...
...
139
        punctuation and capitalization are lost).</para>
136
        punctuation and capitalization are lost).</para>
140
137
141
      <para>&RCL; stores all internal data in <application>Unicode
138
      <para>&RCL; stores all internal data in <application>Unicode
142
      UTF-8</application> format, and it can index files with
139
      UTF-8</application> format, and it can index files with
143
      different character sets, encodings, and languages into the same
140
      different character sets, encodings, and languages into the same
144
      index. It has input filters for many document types.</para>
141
      index. It can process many document types.</para>
145
      
142
      
146
      <para>Stemming is the process by which &RCL; reduces words to
143
      <para>Stemming is the process by which &RCL; reduces words to
147
        their radicals so that searching does not depend, for example, on a
144
        their radicals so that searching does not depend, for example, on a
148
        word being singular or plural (floor, floors), or on a verb tense
145
        word being singular or plural (floor, floors), or on a verb tense
149
        (flooring, floored). Because the mechanisms used for stemming
146
        (flooring, floored). Because the mechanisms used for stemming
...
...
379
          be ignored.</para>
376
          be ignored.</para>
380
        <para>Excluding types can be done by adding wildcard name
377
        <para>Excluding types can be done by adding wildcard name
381
          patterns to the <literal>skippedNames</literal> list, which
378
          patterns to the <literal>skippedNames</literal> list, which
382
          can be done from the GUI Index configuration menu. It is
379
          can be done from the GUI Index configuration menu. It is
383
          also possible to exclude a mime type independently of the
380
          also possible to exclude a mime type independently of the
384
          file name by associating it with
381
          file name by associating it with the
385
          the <filename>rclnull</filename> filter. This can be done by
382
          <filename>rclnull</filename> input handler. This can be done
386
          editing the <link linkend="RCL.INSTALL.CONFIG.MIMECONF">
383
          by editing the <link linkend="RCL.INSTALL.CONFIG.MIMECONF">
387
            <filename>mimeconf</filename> configuration
384
            <filename>mimeconf</filename> configuration
388
            file</link>.</para>
385
            file</link>.</para>
389
386
390
        <para>In order to define a positive list, you need to edit the 
387
        <para>In order to define a positive list, you need to edit the 
391
          <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">main
388
          <link linkend="RCL.INSTALL.CONFIG.RECOLLCONF">main
...
...
2461
        stored by default, apart from the values above
2458
        stored by default, apart from the values above
2462
        (only <literal>author</literal>
2459
        (only <literal>author</literal>
2463
        and <literal>filename</literal>), so this feature will need
2460
        and <literal>filename</literal>), so this feature will need
2464
        some custom local configuration to be useful. An example
2461
        some custom local configuration to be useful. An example
2465
        candidate would be the <literal>recipient</literal> field
2462
        candidate would be the <literal>recipient</literal> field
2466
        which is generated by the message filters.</para>
2463
        which is generated by the message input handlers.</para>
2467
2464
2468
        <para>The default value for the paragraph format string is:
2465
        <para>The default value for the paragraph format string is:
2469
        <screen><![CDATA[
2466
        <screen><![CDATA[
2470
<img src="%I" align="left">%R %S %L &nbsp;&nbsp;<b>%T</b><br>
2467
<img src="%I" align="left">%R %S %L &nbsp;&nbsp;<b>%T</b><br>
2471
%M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i>&nbsp;%i<br>
2468
%M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i>&nbsp;%i<br>
...
...
2959
        slow search (or even an incorrect one if the expansion is
2956
        slow search (or even an incorrect one if the expansion is
2960
        truncated because of excessive size). Also see 
2957
        truncated because of excessive size). Also see 
2961
        <link linkend="RCL.SEARCH.WILDCARDS">
2958
        <link linkend="RCL.SEARCH.WILDCARDS">
2962
          More about wildcards</link>.</para>
2959
          More about wildcards</link>.</para>
2963
2960
2964
      <para>The document filters used while indexing have the
2961
      <para>The document input handlers used while indexing have the
2965
        possibility to create other fields with arbitrary names, and
2962
        possibility to create other fields with arbitrary names, and
2966
        aliases may be defined in the configuration, so that the exact
2963
        aliases may be defined in the configuration, so that the exact
2967
        field search possibilities may be different for you if someone
2964
        field search possibilities may be different for you if someone
2968
        took care of the customisation.</para>
2965
        took care of the customisation.</para>
2969
2966
...
...
3291
      <para>&RCL; has an Application Programming Interface, usable both
3288
      <para>&RCL; has an Application Programming Interface, usable both
3292
        for indexing and searching, currently accessible from the
3289
        for indexing and searching, currently accessible from the
3293
        <application>Python</application> language.</para>
3290
        <application>Python</application> language.</para>
3294
3291
3295
      <para>Another less radical way to extend the application is to
3292
      <para>Another less radical way to extend the application is to
3296
        write filters for new types of documents.</para>
3293
        write input handlers for new types of documents.</para>
3297
3294
3298
      <para>The processing of metadata attributes for documents
3295
      <para>The processing of metadata attributes for documents
3299
        (<literal>fields</literal>) is highly configurable.</para>
3296
        (<literal>fields</literal>) is highly configurable.</para>
3300
3297
3301
3298
3302
3299
3303
      <sect1 id="RCL.PROGRAM.FILTERS">
3300
      <sect1 id="RCL.PROGRAM.FILTERS">
3304
        <title>Writing a document filter</title>
3301
        <title>Writing a document input handler</title>
3302
        
3303
        <note><title>Terminology</title>The small programs or pieces
3304
        of code which handle the processing of the different document
3305
        types for &RCL; used to be called <literal>filters</literal>,
3306
        which is still reflected in the name of the directory which
3307
        holds them and many configuration variables. They were named
3308
        this way because one of their primary functions is to filter
3309
        out the formatting directives and keep the text
3310
        content. However these modules may have other behaviours, and
3311
        the term <literal>input handler</literal> is now progressively
3312
        substituted in the documentation. <literal>filter</literal> is
3313
        still used in many places though.</note>
3305
3314
3306
        <para>&RCL; filters cooperate to translate from the multitude
3315
        <para>&RCL; input handlers cooperate to translate from the multitude
3307
        of input document formats, simple ones
3316
        of input document formats, simple ones
3308
        (such as <application>opendocument</application>, 
3317
        (such as <application>opendocument</application>, 
3309
          <application>acrobat</application>), or compound ones such
3318
          <application>acrobat</application>), or compound ones such
3310
          as <application>Zip</application>
3319
          as <application>Zip</application>
3311
          or <application>Email</application>, into the final &RCL;
3320
          or <application>Email</application>, into the final &RCL;
3312
          indexing input format, which may
3321
          indexing input format, which is plain text.
3313
          be <literal>text/plain</literal>
3322
          Most input handlers are executable
3314
          or <literal>text/html</literal>. Most filters are executable
3315
          programs or scripts. A few filters are coded in C++ and live
3323
          programs or scripts. A few handlers are coded in C++ and live
3316
          inside <command>recollindex</command>. This latter kind will not
3324
          inside <command>recollindex</command>. This latter kind will not
3317
          be described here.</para>
3325
          be described here.</para>
3318
3326
3319
        <para>There are currently (1.18 and since 1.13) two kinds of
3327
        <para>There are currently (1.18 and since 1.13) two kinds of
3320
        external executable filters:
3328
        external executable input handlers:
3321
          <itemizedlist>
3329
          <itemizedlist>
3322
        <listitem><para>Simple filters (<literal>exec</literal>
3330
        <listitem><para>Simple <literal>exec</literal> handlers
3323
          filters) run once and
3331
            run once and exit. They can be bare programs like
3324
          exit. They can be bare programs
3332
            <command>antiword</command>, or scripts using other
3325
          like <application>antiword</application>, or scripts
3333
            programs. They are very simple to write, because they just
3326
          using other programs. They are very simple to write,
3334
            need to print the converted document to the standard
3327
          because they just need to print the converted document
3335
            output. Their output can be plain text or HTML. HTML is
3328
          to the standard output. Their output can
3336
            usually preferred because it can store metadata fields and
3329
          be <literal>text/plain</literal>
3337
            it allows preserving some of the formatting for the GUI
3330
          or <literal>text/html</literal>.</para>
3338
            preview.</para>
3331
        </listitem>
3339
        </listitem>
3332
        <listitem><para>Multiple filters (<literal>execm</literal>
3340
        <listitem><para>Multiple <literal>execm</literal> handlers
3333
          filters), run as long as
3341
      can process multiple files (sparing the process startup
3334
          their master process (<command>recollindex</command>) is
3342
      time which can be very significant), or multiple documents
3335
          active. They can process multiple files (sparing the
3343
      per file (e.g.: for <application>zip</application> or
3336
          process startup time which can be very significant),
3344
      <application>chm</application> files). They communicate
3337
          or multiple documents per file (e.g.: for zip or chm
3345
      with the indexer through a simple protocol, but are
3338
          files). They communicate with the indexer through a
3346
      nevertheless a bit more complicated than the older
3339
          simple protocol, but are nevertheless a bit more
3347
      kind. Most new handlers are written in
3340
          complicated than the older kind. Most new
3341
          filters are written
3342
            in <application>Python</application>, using a common
3348
        <application>Python</application>, using a common module
3343
            module to handle the protocol. There is an
3349
        to handle the protocol. There is an exception,
3344
            exception, <command>rclimg</command> which is written
3350
        <command>rclimg</command> which is written in Perl. The
3345
          in Perl. The subdocuments output by these filters can
3351
      subdocuments output by these handlers can be directly
3346
          be directly indexable (text or HTML), or they can be
3352
      indexable (text or HTML), or they can be other simple or
3347
          other simple or compound documents that will need to
3353
      compound documents that will need to be processed by
3348
          be processed by another filter.</para>
3354
      another handler.</para>
3349
        </listitem>
3355
        </listitem>
3350
      </itemizedlist>
3356
      </itemizedlist>
3351
        </para>
3357
        </para>
3352
3358
3353
        <para>In both cases, filters deal with regular file system
3359
        <para>In both cases, handlers deal with regular file system
3354
          files, and can process either a single document, or a
3360
          files, and can process either a single document, or a
3355
          linear list of documents in each file. &RCL; is responsible
3361
          linear list of documents in each file. &RCL; is responsible
3356
          for performing up-to-date checks, dealing with more complex
3362
          for performing up-to-date checks, dealing with more complex
3357
          embedding and other upper level issues.</para>
3363
          embedding and other upper level issues.</para>
3358
3364
3359
        <para>In the extreme case of a simple filter returning a
3365
        <para>A simple handler returning a
3360
          document in <literal>text/plain</literal> format, no
3366
          document in <literal>text/plain</literal> format, can transfer
3361
          metadata can be transferred from the filter to the
3362
          indexer. Generic metadata, like document size or
3367
          no metadata to the indexer. Generic metadata, like document
3363
          modification date, will be gathered and stored by the
3368
          size or modification date, will be gathered and stored by
3364
          indexer.</para> 
3369
          the indexer.</para>
3365
3370
3366
        <para>Filters that produce  <literal>text/html</literal>
3371
        <para>Handlers that produce  <literal>text/html</literal>
3367
          format can return an arbitrary amount of metadata inside HTML
3372
          format can return an arbitrary amount of metadata inside HTML
3368
          <literal>meta</literal> tags. These will be processed
3373
          <literal>meta</literal> tags. These will be processed
3369
          according to the directives found in 
3374
          according to the directives found in 
3370
          the <link linkend="RCL.PROGRAM.FIELDS">
3375
          the <link linkend="RCL.PROGRAM.FIELDS">
3371
            <filename>fields</filename> configuration
3376
            <filename>fields</filename> configuration
3372
            file</link>.</para>
3377
            file</link>.</para>
3373
3378
3374
        <para>The filters that can handle multiple documents per file
3379
        <para>The handlers that can handle multiple documents per file
3375
          return a single piece of data to identify each document inside
3380
          return a single piece of data to identify each document inside
3376
          the file. This piece of data, called
3381
          the file. This piece of data, called
3377
          an <literal>ipath element</literal> will be sent back by
3382
          an <literal>ipath element</literal> will be sent back by
3378
          &RCL; to extract the document at query time, for previewing,
3383
          &RCL; to extract the document at query time, for previewing,
3379
          or for creating a temporary file to be opened by a
3384
          or for creating a temporary file to be opened by a
3380
          viewer.</para>  
3385
          viewer.</para>  
3381
3386
3382
        <para>The following section describes the simple
3387
        <para>The following section describes the simple
3383
          filters, and the next one gives a few explanations about
3388
          handlers, and the next one gives a few explanations about
3384
          the <literal>execm</literal> ones. You could conceivably
3389
          the <literal>execm</literal> ones. You could conceivably
3385
          write a simple filter with only the elements in the
3390
          write a simple handler with only the elements in the
3386
          manual. This will not be the case for the other ones, for
3391
          manual. This will not be the case for the other ones, for
3387
          which you will have to look at the code.</para>
3392
          which you will have to look at the code.</para>
3388
3393
3389
      <sect2 id="RCL.PROGRAM.FILTERS.SIMPLE">
3394
      <sect2 id="RCL.PROGRAM.FILTERS.SIMPLE">
3390
        <title>Simple filters</title>
3395
        <title>Simple input handlers</title>
3391
3396
3392
      <para>&RCL; simple filters are usually shell-scripts, but this is in
3397
      <para>&RCL; simple handlers are usually shell-scripts, but this is in
3393
        no way necessary. Extracting the text from the native format is the
3398
        no way necessary. Extracting the text from the native format is the
3394
        difficult part. Outputting the format expected by &RCL; is
3399
        difficult part. Outputting the format expected by &RCL; is
3395
        trivial. Happily enough, most document formats have translators or
3400
        trivial. Happily enough, most document formats have translators or
3396
        text extractors which can be called from the filter. In some cases
3401
        text extractors which can be called from the handler. In some cases
3397
        the output of the translating program is completely appropriate,
3402
        the output of the translating program is completely appropriate,
3398
        and no intermediate shell-script is needed.</para>
3403
        and no intermediate shell-script is needed.</para>
3399
3404
3400
        <para>Filters are called with a single argument which is the
3405
        <para>Input handlers are called with a single argument which is the
3401
        source file name. They should output the result to stdout.</para>
3406
        source file name. They should output the result to stdout.</para>
3402
3407
3403
      <para>When writing a filter, you should decide if it will output
3408
      <para>When writing a handler, you should decide if it will output
3404
      plain text or HTML. Plain text is simpler, but you will not be able
3409
      plain text or HTML. Plain text is simpler, but you will not be able
3405
      to add metadata or vary the output character encoding (this will be
3410
      to add metadata or vary the output character encoding (this will be
3406
      defined in a configuration file). Additionally, some formatting may
3411
      defined in a configuration file). Additionally, some formatting may
3407
      be easier to preserve when previewing HTML. Actually the deciding factor
3412
      be easier to preserve when previewing HTML. Actually the deciding factor
3408
      is metadata: &RCL; has a way to <link linkend="RCL.PROGRAM.FILTERS.HTML">
3413
      is metadata: &RCL; has a way to <link linkend="RCL.PROGRAM.FILTERS.HTML">
3409
      extract metadata from the HTML header and use it for field 
3414
      extract metadata from the HTML header and use it for field 
3410
      searches</link>.</para>
3415
      searches</link>.</para>
3411
3416
3412
      <para>The <envar>RECOLL_FILTER_FORPREVIEW</envar> environment
3417
      <para>The <envar>RECOLL_FILTER_FORPREVIEW</envar> environment
3413
        variable (values <literal>yes</literal>, <literal>no</literal>)
3418
        variable (values <literal>yes</literal>, <literal>no</literal>)
3414
        tells the filter if the operation is for indexing or
3419
        tells the handler if the operation is for indexing or
3415
        previewing. Some filters use this to output a slightly different
3420
        previewing. Some handlers use this to output a slightly different
3416
        format, for example stripping uninteresting repeated keywords (e.g.
3421
        format, for example stripping uninteresting repeated keywords (e.g.
3417
        <literal>Subject:</literal> for email) when indexing. This is not
3422
        <literal>Subject:</literal> for email) when indexing. This is not
3418
        essential.</para>
3423
        essential.</para>
3419
3424
3420
      <para>You should look at one of the simple filters, for example
3425
      <para>You should look at one of the simple handlers, for example
3421
        <command>rclps</command> for a starting point.</para>
3426
        <command>rclps</command> for a starting point.</para>
3422
3427
3423
        <para>Don't forget to make your filter executable before 
3428
        <para>Don't forget to make your handler executable before 
3424
         testing !</para>
3429
         testing !</para>
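
The calling convention described in this section can be sketched in Python as follows. This is only an illustration under stated assumptions, not one of the handlers shipped with &RCL;: the input is assumed to be a UTF-8 text-like file, and the names `extract_text`, `to_html` and `main` are hypothetical. Only the single file name argument, the output on stdout and the `RECOLL_FILTER_FORPREVIEW` variable come from the text above. The `Subject:` stripping mirrors the email example and is an arbitrary choice for the sketch.

```python
#!/usr/bin/env python3
"""Sketch of a simple (exec) handler: one file name in, HTML on stdout."""
import html
import os
import sys

KEYWORD = "Subject:"  # illustrative repeated keyword, dropped when indexing


def extract_text(path, for_preview):
    """Read the source file; stands in for a real format converter (assumption)."""
    with open(path, encoding="utf-8", errors="replace") as f:
        lines = f.read().splitlines()
    if not for_preview:
        # When indexing, strip the keyword prefix but keep the rest of the line.
        lines = [l[len(KEYWORD):].lstrip() if l.startswith(KEYWORD) else l
                 for l in lines]
    return "\n".join(lines)


def to_html(text, title):
    """Wrap the extracted text in a minimal HTML document (layout is assumed)."""
    return (
        "<html><head>\n"
        "<title>" + html.escape(title) + "</title>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "</head><body>\n<pre>" + html.escape(text) + "</pre>\n</body></html>\n"
    )


def main(argv, environ=os.environ, out=sys.stdout):
    """argv[1] is the source file name; the result is written to `out`."""
    if len(argv) != 2:
        return 1
    for_preview = environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    text = extract_text(argv[1], for_preview)
    out.write(to_html(text, os.path.basename(argv[1])))
    return 0

# Installed as a script, the file would end with:
#     sys.exit(main(sys.argv))
```

Passing the environment and output stream as parameters keeps the sketch easy to try out without running the indexer.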
3425
3430
3426
      </sect2>
3431
      </sect2>
3427
3432
3428
      <sect2 id="RCL.PROGRAM.FILTERS.MULTIPLE">
3433
      <sect2 id="RCL.PROGRAM.FILTERS.MULTIPLE">
3429
        <title>"Multiple" filters</title>
3434
        <title>"Multiple" handlers</title>
3430
3435
3431
        <para>If you can program and want to write
3436
        <para>If you can program and want to write
3432
          an <literal>execm</literal> filter, it should not be too
3437
          an <literal>execm</literal> handler, it should not be too
3433
          difficult to make sense of one of the existing modules. For
3438
          difficult to make sense of one of the existing modules. For
3434
          example, look at <command>rclzip</command> which uses Zip
3439
          example, look at <command>rclzip</command> which uses Zip
3435
          file paths as identifiers (<literal>ipath</literal>),
3440
          file paths as identifiers (<literal>ipath</literal>),
3436
          and <command>rclics</command>, which uses an integer
3441
          and <command>rclics</command>, which uses an integer
3437
          index. Also have a look at the comments inside
3442
          index. Also have a look at the comments inside
3438
          the <filename>internfile/mh_execm.h</filename> file and
3443
          the <filename>internfile/mh_execm.h</filename> file and
3439
          possibly at the corresponding module.</para>
3444
          possibly at the corresponding module.</para>
3440
3445
3441
        <para><literal>execm</literal> filters sometimes need to make
3446
        <para><literal>execm</literal> handlers sometimes need to make
3442
          a choice for the nature of the <literal>ipath</literal>
3447
          a choice for the nature of the <literal>ipath</literal>
3443
          elements that they use in communication with the
3448
          elements that they use in communication with the
3444
          indexer. Here are a few guidelines:
3449
          indexer. Here are a few guidelines:
3445
          <itemizedlist>
3450
          <itemizedlist>
3446
            <listitem><para>Use ASCII or UTF-8 (if the identifier is an
3451
            <listitem><para>Use ASCII or UTF-8 (if the identifier is an
...
...
3451
                debugging.</para></listitem>
3456
                debugging.</para></listitem>
3452
            <listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
3457
            <listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
3453
                separator to store a complex path internally (for
3458
                separator to store a complex path internally (for
3454
                deeper embedding). Colons inside
3459
                deeper embedding). Colons inside
3455
                the <literal>ipath</literal> elements output by a
3460
                the <literal>ipath</literal> elements output by a
3456
                filter will be escaped, but would be a bad choice as a
3461
                handler will be escaped, but would be a bad choice as a
3457
                filter-specific separator (mostly, again, for
3462
                handler-specific separator (mostly, again, for
3458
                debugging issues).</para></listitem>
3463
                debugging issues).</para></listitem>
3459
          </itemizedlist>
3464
          </itemizedlist>
3460
          In any case, the main goal is that it should
3465
          In any case, the main goal is that it should
3461
          be easy for the filter to extract the target document, given
3466
          be easy for the handler to extract the target document, given
3462
          the file name and the <literal>ipath</literal>
3467
          the file name and the <literal>ipath</literal>
3463
          element.</para>
3468
          element.</para>
3464
3469
3465
        <para><literal>execm</literal> filters will also produce
3470
        <para><literal>execm</literal> handlers will also produce
3466
          a document with a null <literal>ipath</literal>
3471
          a document with a null <literal>ipath</literal>
3467
          element. Depending on the type of document, this may have
3472
          element. Depending on the type of document, this may have
3468
          some associated data (e.g. the body of an email message), or
3473
          some associated data (e.g. the body of an email message), or
3469
          none (typical for an archive file). If it is empty, this
3474
          none (typical for an archive file). If it is empty, this
3470
          document will be useful anyway for some operations, as the
3475
          document will be useful anyway for some operations, as the
3471
          parent of the actual data documents.</para>
3476
          parent of the actual data documents.</para>
3472
      </sect2>
3477
      </sect2>
3473
3478
3474
      <sect2 id="RCL.PROGRAM.FILTERS.ASSOCIATION">
3479
      <sect2 id="RCL.PROGRAM.FILTERS.ASSOCIATION">
3475
        <title>Telling &RCL; about the filter</title>
3480
        <title>Telling &RCL; about the handler</title>
3476
3481
3477
      <para>There are two elements that link a file to the filter which
3482
      <para>There are two elements that link a file to the handler which
3478
      should process it: the association of file to mime type and the
3483
      should process it: the association of file to mime type and the
3479
      association of a mime type with a filter.</para>
3484
      association of a mime type with a handler.</para>
3480
3485
3481
      <para>The association of files to mime types is mostly based on
3486
      <para>The association of files to mime types is mostly based on
3482
        name suffixes. The types are defined inside the
3487
        name suffixes. The types are defined inside the
3483
        <link linkend="RCL.INSTALL.CONFIG.MIMEMAP">
3488
        <link linkend="RCL.INSTALL.CONFIG.MIMEMAP">
3484
        <filename>mimemap</filename> file</link>. Example:
3489
        <filename>mimemap</filename> file</link>. Example:
...
...
3488
</programlisting>
3493
</programlisting>
3489
       If no suffix association is found for the file name, &RCL; will try
3494
       If no suffix association is found for the file name, &RCL; will try
3490
       to execute the <command>file -i</command> command to determine a
3495
       to execute the <command>file -i</command> command to determine a
3491
       mime type.</para>
3496
       mime type.</para>
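
The two-step lookup described above (suffix association first, then the `file -i` command) can be sketched as follows. This is an illustration only, not &RCL; code: the `SUFFIX_TO_MIME` table is a hypothetical stand-in for a few `mimemap` entries, the function names are invented, and lower-casing the suffix is a simplification chosen for the sketch. The parsing assumes the usual `name: type; charset=...` output of `file -i`, which may differ between platforms.

```python
import os
import subprocess

# Hypothetical stand-in for a few mimemap entries.
SUFFIX_TO_MIME = {
    ".doc": "application/msword",
    ".ps": "application/postscript",
}


def guess_with_file_command(path):
    """Ask `file -i` for a MIME type; return None if the command fails."""
    try:
        res = subprocess.run(["file", "-i", path], capture_output=True,
                             text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    # Keep what follows the last colon, then drop the "; charset=..." part.
    value = res.stdout.rsplit(":", 1)[-1]
    return value.split(";", 1)[0].strip() or None


def mime_type_for(path, suffix_map=SUFFIX_TO_MIME,
                  fallback=guess_with_file_command):
    """Return the MIME type from the suffix map, else from the fallback."""
    suffix = os.path.splitext(path)[1].lower()
    if suffix in suffix_map:
        return suffix_map[suffix]
    return fallback(path)
```

The fallback is a parameter so that the lookup order can be exercised without the external command being present.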
3492
3497
3493
      <para>The association of file types to filters is performed in
3498
      <para>The association of file types to handlers is performed in
3494
      the <link linkend="RCL.INSTALL.CONFIG.MIMECONF">
3499
      the <link linkend="RCL.INSTALL.CONFIG.MIMECONF">
3495
      <filename>mimeconf</filename> file</link>. A sample will probably be
3500
      <filename>mimeconf</filename> file</link>. A sample will probably be
3496
      of better help than a long explanation:</para>
3501
      of better help than a long explanation:</para>
3497
<programlisting>
3502
<programlisting>
3498
3503
...
...
3530
            <literal>iso-8859-1</literal> encoding is specified because it
3535
            <literal>iso-8859-1</literal> encoding is specified because it
3531
            is not the <literal>utf-8</literal> default, and not output by
3536
            is not the <literal>utf-8</literal> default, and not output by
3532
            <command>unrtf</command> in the HTML header section.</para>
3537
            <command>unrtf</command> in the HTML header section.</para>
3533
      </listitem>
3538
      </listitem>
3534
      <listitem><para><literal>application/x-chm</literal> is processed
3539
      <listitem><para><literal>application/x-chm</literal> is processed
3535
          by a persistent filter. This is determined by the
3540
          by a persistent handler. This is determined by the
3536
          <literal>execm</literal> keyword.</para>
3541
          <literal>execm</literal> keyword.</para>
3537
      </listitem>
3542
      </listitem>
3538
    </itemizedlist>
3543
    </itemizedlist>
3539
       </para> 
3544
       </para> 
3540
3545
3541
      </sect2>
3546
      </sect2>
3542
3547
3543
    <sect2 id="RCL.PROGRAM.FILTERS.HTML">
3548
    <sect2 id="RCL.PROGRAM.FILTERS.HTML">
3544
        <title>Filter HTML output</title>
3549
        <title>Input handler HTML output</title>
3545
3550
3546
        <para>The output HTML could be very minimal like the following
3551
        <para>The output HTML could be very minimal like the following
3547
        example:
3552
        example:
3548
          <programlisting>
3553
          <programlisting>
3549
&lt;html>
3554
&lt;html>
...
...
3605
          <programlisting>
3610
          <programlisting>
3606
&lt;meta name="date" content="2013-02-24 17:50:00">
3611
&lt;meta name="date" content="2013-02-24 17:50:00">
3607
          </programlisting>
3612
          </programlisting>
3608
        </para>
3613
        </para>
3609
3614
3610
        <para>Filters also have the possibility to "invent" field
3615
        <para>Input handlers also have the possibility to "invent" field
3611
          names. These should also be output as meta tags:</para>
3616
          names. These should also be output as meta tags:</para>
3612
3617
3613
        <programlisting>
3618
        <programlisting>
3614
&lt;meta name="somefield" content="Some textual data" /&gt;
3619
&lt;meta name="somefield" content="Some textual data" /&gt;
3615
</programlisting>
3620
</programlisting>
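
A handler written in Python could generate such meta tags with a small helper like the one below. This is a sketch, not code from &RCL;: `meta_tag` and `meta_block` are hypothetical names, and the field names in the usage are examples. The point it illustrates is that field values should be escaped before being placed inside an attribute, so that quotes or ampersands in the data do not break the HTML header.

```python
import html


def meta_tag(name, content):
    """Return one HTML meta element with name and content safely escaped."""
    return '<meta name="%s" content="%s">' % (
        html.escape(name, quote=True),
        html.escape(content, quote=True),
    )


def meta_block(fields):
    """Render a dict of field name -> value as one meta tag per line."""
    return "\n".join(meta_tag(k, v) for k, v in fields.items())
```

For example, `meta_block({"somefield": "Some textual data"})` yields a single line equivalent to the tag shown above, without the trailing slash.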
...
...
3632
3637
3633
    <sect2 id="RCL.PROGRAM.FILTERS.PAGES">
3638
    <sect2 id="RCL.PROGRAM.FILTERS.PAGES">
3634
        <title>Page numbers</title>
3639
        <title>Page numbers</title>
3635
3640
3636
        <para>The indexer will interpret <literal>^L</literal> characters
3641
        <para>The indexer will interpret <literal>^L</literal> characters
3637
          in the filter output as indicating page breaks, and will record
3642
          in the handler output as indicating page breaks, and will record
3638
          them. At query time, this allows starting a viewer on the right
3643
          them. At query time, this allows starting a viewer on the right
3639
          page for a hit or a snippet. Currently, only the PDF, Postscript
3644
          page for a hit or a snippet. Currently, only the PDF, Postscript
3640
          and DVI filters generate page breaks.</para>
3645
          and DVI handlers generate page breaks.</para>
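
As an illustration of the page break convention, the sketch below joins per-page text with the `^L` (form feed) character and shows how a page number can be recovered from a character position. The function names are hypothetical, and `page_of_offset` only illustrates the idea; it is not how the indexer actually records page positions.

```python
FORM_FEED = "\f"  # the ^L character (ASCII 0x0C)


def join_pages(pages):
    """Concatenate per-page text, marking each page break with a form feed."""
    return FORM_FEED.join(pages)


def page_of_offset(text, offset):
    """Return the 1-based page number containing character position `offset`."""
    return text.count(FORM_FEED, 0, offset) + 1
```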
3641
3646
3642
      </sect2>
3647
      </sect2>
3643
3648
3644
    </sect1>
3649
    </sect1>
3645
3650
...
...
3649
      <para><literal>Fields</literal> are named pieces of information
3654
      <para><literal>Fields</literal> are named pieces of information
3650
      in or about documents, like <literal>title</literal>,
3655
      in or about documents, like <literal>title</literal>,
3651
      <literal>author</literal>, <literal>abstract</literal>.</para> 
3656
      <literal>author</literal>, <literal>abstract</literal>.</para> 
3652
3657
3653
      <para>The field values for documents can appear in several ways
3658
      <para>The field values for documents can appear in several ways
3654
      during indexing: either output by filters
3659
      during indexing: either output by input handlers
3655
      as <literal>meta</literal> fields in the HTML header section, or
3660
      as <literal>meta</literal> fields in the HTML header section, or
3656
      extracted from file extended attributes, or added as attributes
3661
      extracted from file extended attributes, or added as attributes
3657
      of the <literal>Doc</literal> object when using the API, or
3662
      of the <literal>Doc</literal> object when using the API, or
3658
      again synthesized internally by &RCL;.</para>
3663
      again synthesized internally by &RCL;.</para>
3659
3664
3660
      <para>The &RCL; query language allows searching for text in a
3665
      <para>The &RCL; query language allows searching for text in a
3661
      specific field.</para>
3666
      specific field.</para>
3662
3667
3663
      <para>&RCL; defines a number of default fields. Additional
3668
      <para>&RCL; defines a number of default fields. Additional
3664
      ones can be output by filters, and described in the
3669
      ones can be output by handlers, and described in the
3665
      <filename>fields</filename> configuration file.</para>
3670
      <filename>fields</filename> configuration file.</para>
3666
3671
3667
      <para>Fields can be:</para>
3672
      <para>Fields can be:</para>
3668
      <itemizedlist>
3673
      <itemizedlist>
3669
3674
...
...
3901
        
3906
        
3902
        <sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DB">
3907
        <sect5 id="RCL.PROGRAM.PYTHON.RECOLL.CLASSES.DB">
3903
          <title>The Db class</title>
3908
          <title>The Db class</title>
3904
3909
3905
          <para>A Db object is created by
3910
          <para>A Db object is created by
3906
            a <literal>connect()</literal> function and holds a 
3911
            a <literal>connect()</literal> call and holds a 
3907
            connection to a Recoll index.</para>
3912
            connection to a Recoll index.</para>
3908
          <variablelist>
3913
          <variablelist>
3909
            <title>Methods</title>
3914
            <title>Methods</title>
3910
            <varlistentry>
3915
            <varlistentry>
3911
              <term>Db.close()</term>
3916
              <term>Db.close()</term>
...
...
4379
    <guilabel>File</guilabel> menu. The list is stored in the
4384
    <guilabel>File</guilabel> menu. The list is stored in the
4380
    <filename>missing</filename> text file inside the configuration
4385
    <filename>missing</filename> text file inside the configuration
4381
    directory.</para>
4386
    directory.</para>
4382
4387
4383
      <para>A list of common file types which need external
4388
      <para>A list of common file types which need external
4384
        commands follows. Many of the filters need the
4389
        commands follows. Many of the handlers need the
4385
        <command>iconv</command> command, which is not always listed as a
4390
        <command>iconv</command> command, which is not always listed as a
4386
        dependency.</para> 
4391
        dependency.</para> 
4387
4392
4388
      <para>Please note that, due to the relatively dynamic nature of this
4393
      <para>Please note that, due to the relatively dynamic nature of this
4389
    information, the most up to date version is now kept on &RCLAPPS;
4394
    information, the most up to date version is now kept on &RCLAPPS;
...
...
4396
        are sometimes outdated, or not the best version for &RCL;, so you
4401
        are sometimes outdated, or not the best version for &RCL;, so you
4397
        should take a look at &RCLAPPS; if a file
4402
        should take a look at &RCLAPPS; if a file
4398
        type is important to you.</para>
4403
        type is important to you.</para>
4399
4404
4400
      <para>As of &RCL; release 1.14, a number of XML-based formats that
4405
      <para>As of &RCL; release 1.14, a number of XML-based formats that
4401
        were handled by ad hoc filter code now use the
4406
        were handled by ad hoc handler code now use the
4402
        <command>xsltproc</command> command, which usually comes with  
4407
        <command>xsltproc</command> command, which usually comes with  
4403
      <application>libxslt</application>. These are: abiword, fb2
4408
      <application>libxslt</application>. These are: abiword, fb2
4404
      (ebooks), kword, openoffice, svg.</para> 
4409
      (ebooks), kword, openoffice, svg.</para> 
4405
4410
4406
      <para>Now for the list:</para>
4411
      <para>Now for the list:</para>
...
...
4423
        <command>antiword</command>. It is also useful to have
4428
        <command>antiword</command>. It is also useful to have
4424
        <command>wvWare</command> installed as it may be 
4429
        <command>wvWare</command> installed as it may be 
4425
        used as a fallback for some files which
4430
        used as a fallback for some files which
4426
        <command>antiword</command> does not handle.</para></listitem>
4431
        <command>antiword</command> does not handle.</para></listitem>
4427
4432
4428
        <listitem><para>MS Excel and PowerPoint need <command>
4433
        <listitem><para>MS Excel and PowerPoint are processed by
4429
            catdoc</command>.</para></listitem>
4434
        internal <command>Python</command> handlers.</para></listitem>
4430
4435
4431
        <listitem><para>MS Open XML (docx) needs <command>
4436
        <listitem><para>MS Open XML (docx) needs <command>
4432
         xsltproc</command>.</para></listitem>
4437
         xsltproc</command>.</para></listitem>
4433
4438
4434
        <listitem><para>Wordperfect files need <command>wpd2html</command>
4439
        <listitem><para>Wordperfect files need <command>wpd2html</command>
...
...
4449
4454
4450
        <listitem><para>djvu files need <command>djvutxt</command> and
4455
        <listitem><para>djvu files need <command>djvutxt</command> and
4451
        <command>djvused</command> from the
4456
        <command>djvused</command> from the
4452
        <application>DjVuLibre</application> package.</para></listitem>
4457
        <application>DjVuLibre</application> package.</para></listitem>
4453
          
4458
          
4454
        <listitem><para>Audio files: &RCL; releases before 1.13
4459
        <listitem><para>Audio files: &RCL; releases 1.14 and later use
4455
          used the <command>id3info</command> command from the <application>
4456
          id3lib</application> package to extract mp3 tag information,
4457
          <command>metaflac</command> (standard flac tools) for flac files,
4458
          and <command>ogginfo</command> (vorbis tools) for ogg
4459
          files. Releases 1.14 and later use a single
4460
          <application>Python</application> filter based 
4460
        a single <application>Python</application> handler based 
4461
          on <application>mutagen</application> for all audio file
4461
        on <application>mutagen</application> for all audio file
4462
          types.</para>
4462
        types.</para>
4463
    </listitem>
4463
    </listitem>
4464
4464
4465
        <listitem><para>Pictures: &RCL; uses the 
4465
        <listitem><para>Pictures: &RCL; uses the 
4466
         <application>Exiftool</application>
4466
         <application>Exiftool</application>
4467
         <application>Perl</application> package to extract tag
4467
         <application>Perl</application> package to extract tag
...
...
4469
         there may not be much interest in indexing the technical tags
4469
         there may not be much interest in indexing the technical tags
4470
         (image size, aperture, etc.). This is only of interest if you
4470
         (image size, aperture, etc.). This is only of interest if you
4471
         store personal tags or textual descriptions inside the image
4471
         store personal tags or textual descriptions inside the image
4472
         files.</para></listitem>
4472
         files.</para></listitem>
4473
4473
4474
    <listitem><para>chm: files in microsoft help format need Python and
4474
    <listitem><para>chm: files in Microsoft help format need Python and
4475
          the <application>pychm</application> module (which needs 
4475
          the <application>pychm</application> module (which needs 
4476
          <application>chmlib</application>).</para></listitem>
4476
          <application>chmlib</application>).</para></listitem>
4477
4477
4478
    <listitem><para>ICS: up to &RCL; 1.13, iCalendar files need 
4478
    <listitem><para>ICS: up to &RCL; 1.13, iCalendar files need 
4479
        <application>Python</application>
4479
        <application>Python</application>
...
...
4496
        </listitem>
4496
        </listitem>
4497
4497
4498
        <listitem><para>Konqueror webarchive format with Python (uses the
4498
        <listitem><para>Konqueror webarchive format with Python (uses the
4499
        Tarfile module).</para></listitem>
4499
        Tarfile module).</para></listitem>
4500
4500
4501
        <listitem><para>mimehtml web archive format (support based on the email
4501
        <listitem><para>Mimehtml web archive format (support based on
4502
          filter, which introduces some mild weirdness, but still
4502
        the email handler, which introduces some mild weirdness, but
4503
          usable).</para></listitem>
4503
        still usable).</para></listitem>
4504
4504
4505
        </itemizedlist>
4505
        </itemizedlist>
4506
4506
4507
        <para>Text, HTML, email folders, and Scribus files are
4507
        <para>Text, HTML, email folders, and Scribus files are
4508
        processed internally. <application>Lyx</application> is used to
4508
        processed internally. <application>Lyx</application> is used to
4509
        index Lyx files. Many filters need <command>iconv</command> and the
4509
        index Lyx files. Many handlers need <command>iconv</command> and the
4510
        standard <command>sed</command> and <command>awk</command>.
4510
        standard <command>sed</command> and <command>awk</command>.
4511
        </para>
4511
        </para>
4512
4512
4513
    </sect1>
4513
    </sect1>
4514
4514
...
...
4992
      <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ZIPSKIPPEDNAMES">
4992
      <varlistentry id="RCL.INSTALL.CONFIG.RECOLLCONF.ZIPSKIPPEDNAMES">
4993
        <term><varname>zipSkippedNames</varname></term>
4993
        <term><varname>zipSkippedNames</varname></term>
4994
        <listitem><para>A space-separated list of patterns for
4994
        <listitem><para>A space-separated list of patterns for
4995
               names of files or directories that should be ignored
4995
               names of files or directories that should be ignored
4996
               inside zip archives. This is used directly by the zip
4996
               inside zip archives. This is used directly by the zip
4997
               filter, and has a function similar to skippedNames, but
4997
               handler, and has a function similar to skippedNames, but
4998
               works independently. Can be redefined for filesystem
4998
               works independently. Can be redefined for filesystem
4999
               subdirectories. For versions up to 1.19, you will need
4999
               subdirectories. For versions up to 1.19, you will need
5000
               to update the Zip filter and install a supplementary
5000
               to update the Zip handler and install a supplementary
5001
               Python module. The details are
5001
               Python module. The details are
5002
               described <ulink url="https://bitbucket.org/medoc/recoll/wiki/Filtering%20out%20Zip%20archive%20members">on
5002
               described <ulink url="https://bitbucket.org/medoc/recoll/wiki/Filtering%20out%20Zip%20archive%20members">on
5003
          the &RCL; wiki</ulink>.
5003
          the &RCL; wiki</ulink>.
5004
        </para></listitem>
5004
        </para></listitem>
5005
      </varlistentry>
5005
      </varlistentry>
...
...
5550
             indexer (default class 3, no data).</para>
5550
             indexer (default class 3, no data).</para>
5551
              </listitem>
5551
              </listitem>
5552
            </varlistentry>
5552
            </varlistentry>
5553
5553
5554
            <varlistentry><term><varname>filtermaxseconds</varname></term>
5554
            <varlistentry><term><varname>filtermaxseconds</varname></term>
5555
              <listitem><para>Maximum filter execution time, after which it
5555
              <listitem><para>Maximum handler execution time, after which it
5556
            is aborted. Some PostScript programs just loop...</para> 
5556
            is aborted. Some PostScript programs just loop...</para> 
5557
              </listitem>
5557
              </listitem>
5558
            </varlistentry>
5558
            </varlistentry>
5559
          <varlistentry><term><varname>filtersdir</varname></term>
5559
          <varlistentry><term><varname>filtersdir</varname></term>
5560
            <listitem><para>A directory to search for the external
5560
            <listitem><para>A directory to search for the external
5561
            filter scripts used to index some types of files. The
5561
            input handler scripts used to index some types of files. The
5562
            value should not be changed, except if you want to modify
5562
            value should not be changed, except if you want to modify
5563
            one of the default scripts. The value can be redefined for
5563
            one of the default scripts. The value can be redefined for
5564
            any sub-directory. </para>
5564
            any sub-directory. </para>
5565
            </listitem>
5565
            </listitem>
5566
          </varlistentry>
5566
          </varlistentry>
...
...
5676
            canonical names used inside the <literal>[prefixes]</literal>
5676
            canonical names used inside the <literal>[prefixes]</literal>
5677
            and <literal>[stored]</literal> sections</para>
5677
            and <literal>[stored]</literal> sections</para>
5678
            </listitem>
5678
            </listitem>
5679
          </varlistentry>
5679
          </varlistentry>
5680
          <varlistentry>
5680
          <varlistentry>
5681
            <term>filter-specific sections</term>
5681
            <term>handler-specific sections</term>
5682
            <listitem><para>Some filters may need specific
5682
            <listitem><para>Some input handlers may need specific
5683
            configuration for handling fields. Only the email message filter
5683
            configuration for handling fields. Only the email message handler
5684
            currently has such a section (named
5684
            currently has such a section (named
5685
            <literal>[mail]</literal>). It allows indexing arbitrary email
5685
            <literal>[mail]</literal>). It allows indexing arbitrary email
5686
            headers in addition to the ones indexed by default. Other such
5686
            headers in addition to the ones indexed by default. Other such
5687
            sections may appear in the future.</para>
5687
            sections may appear in the future.</para>
5688
            </listitem>
5688
            </listitem>
...
...
5692
5692
5693
      <para>Here follows a small example of a personal
5693
      <para>Here follows a small example of a personal
5694
      <filename>fields</filename> 
5694
      <filename>fields</filename> 
5695
      file. This would extract a specific email header and
5695
      file. This would extract a specific email header and
5696
      use it as a searchable field, with data displayable inside result
5696
      use it as a searchable field, with data displayable inside result
5697
      lists. (Side note: as the email filter does no decoding on the values,
5697
      lists. (Side note: as the email handler does no decoding on the values,
5698
      only plain ASCII headers can be indexed, and only the
5698
      only plain ASCII headers can be indexed, and only the
5699
      first occurrence will be used for headers that occur several times).
5699
      first occurrence will be used for headers that occur several times).
5700
5700
5701
<programlisting>[prefixes]
5701
<programlisting>[prefixes]
5702
# Index mailmytag contents (with the given prefix)
5702
# Index mailmytag contents (with the given prefix)
...
...
6005
        (you can also create a category). Categories may be used
6005
        (you can also create a category). Categories may be used
6006
        for filtering in advanced search.</para>
6006
        for filtering in advanced search.</para>
6007
            </listitem>
6007
            </listitem>
6008
          </itemizedlist>
6008
          </itemizedlist>
6009
6009
6010
          <para>The <replaceable>rclblob</replaceable> filter should
6010
          <para>The <replaceable>rclblob</replaceable> handler should
6011
            be an executable program or script which exists inside
6011
            be an executable program or script which exists inside
6012
            <filename>/usr/[local/]share/recoll/filters</filename>. It
6012
            <filename>/usr/[local/]share/recoll/filters</filename>. It
6013
            will be given a file name as argument and should output the
6013
            will be given a file name as argument and should output the
6014
            text or html contents on the standard output.</para>
6014
            text or html contents on the standard output.</para>
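<para>A minimal sketch of such a program in Python is shown below, assuming the simplest case where the file can be read as UTF-8 text. The <literal>extract_text</literal> function is a hypothetical placeholder for the real format-specific extraction, and <literal>rclblob</literal> is the illustrative name used above.</para>

```python
import sys


def extract_text(path):
    """Hypothetical extraction step: read the file as UTF-8 text.

    A real handler would decode its own file format here.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def main(argv):
    """Take the file name as the single argument, write text to stdout."""
    if len(argv) != 2:
        sys.stderr.write("usage: rclblob <file>\n")
        return 1
    sys.stdout.write(extract_text(argv[1]))
    return 0
```

<para>An actual script would end by calling <literal>sys.exit(main(sys.argv))</literal> so that the exit status reflects success or failure.</para>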
6015
6015
6016
          <para>The <link linkend="RCL.PROGRAM.FILTERS">filter
6016
          <para>The <link linkend="RCL.PROGRAM.FILTERS">filter
6017
              programming</link> section describes in more detail how
6017
              programming</link> section describes in more detail how
6018
              to write a filter.</para> 
6018
              to write an input handler.</para> 
6019
6019
6020
        </sect3>
6020
        </sect3>
6021
6021
6022
      </sect2>
6022
      </sect2>
6023
6023