
a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml
...
...
2322
      the older kind. Most of these new filters are written in
2322
      the older kind. Most of these new filters are written in
2323
      <application>Python</application>, using a common module to
2323
      <application>Python</application>, using a common module to
2324
      handle the protocol.</para>
2324
      handle the protocol.</para>
2325
      </listitem>
2325
      </listitem>
2326
    </itemizedlist>
2326
    </itemizedlist>
2327
      The following will just describe the simple filters, if you are
2327
      The following will just describe the simple filters. If you can
2328
      programmer enough to write one of the other kind, it shouldn't be too
2328
      program and want to write one of the other kind, it shouldn't be too
2329
      difficult to make sense of one of the existing modules (ie:
2329
      difficult to make sense of one of the existing modules. For example,
2330
      rclzip).</para> 
2330
      look at <command>rclzip</command> which uses Zip file paths as
2331
      internal identifiers (<literal>ipath</literal>), and
2332
      <command>rclinfo</command>, which uses an integer index.</para> 
2333
2334
      <sect2 id="rcl.program.filters.simple">
2335
        <title>Simple filters</title>
2331
2336
2332
      <para>&RCL; simple filters are usually shell-scripts, but this is in
2337
      <para>&RCL; simple filters are usually shell-scripts, but this is in
2333
        no way necessary. These programs are extremely simple and most
2338
        no way necessary. Extracting the text from the native format is the
2334
        of the difficulty lies in extracting the text from the native
2339
        difficult part. Outputting the format expected by &RCL; is
2335
        format, not outputting what is expected by &RCL;. Happily
2336
        enough, most document formats already have translators or text
2340
        trivial. Happily enough, most document formats have translators or
2337
        extractors which handle the difficult part and can be called
2341
        text extractors which can be called from the filter. In some cases
2338
        from the filter. In some case the output of the translating
2342
        the output of the translating program is completely appropriate,
2339
        program is appropriate, and no intermediate shell-script is
2343
        and no intermediate shell-script is needed.</para>
2340
        needed.</para> 
2341
2344
2342
        <para>Filters are called with a single argument which is the
2345
        <para>Filters are called with a single argument which is the
2343
        source file name. They should output the result to stdout.</para>
2346
        source file name. They should output the result to stdout.</para>
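
The calling convention described above (one argument naming the source file, result on stdout) can be sketched as follows. This is only an illustration: the "native format" handled by `extract_text` (Latin-1 text in which lines starting with `%` are markup) is invented for the example, and the function names are hypothetical, not part of any existing filter.

```python
#!/usr/bin/env python3
"""Sketch of a simple filter: one argument (source file name), text on stdout."""
import sys


def extract_text(raw):
    # Stand-in for the real extraction step. ASSUMPTION: the input is
    # Latin-1 text and lines starting with '%' are markup to be dropped.
    lines = raw.decode("latin-1").splitlines()
    return "\n".join(l for l in lines if not l.startswith("%")) + "\n"


def main(argv):
    # The filter is called with a single argument: the source file name.
    if len(argv) != 2:
        sys.stderr.write("usage: %s <source file>\n" % argv[0])
        return 1
    with open(argv[1], "rb") as f:
        raw = f.read()
    # The result goes to stdout.
    sys.stdout.write(extract_text(raw))
    return 0


if __name__ == "__main__" and len(sys.argv) > 1:
    # Only act as a filter when a file name was actually given.
    sys.exit(main(sys.argv))
```

In a real filter the body of `extract_text` would typically be replaced by a call to an existing translator or text extractor for the format.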
2344
2347
2348
      <para>When writing a filter, you should decide if it will output
2349
      plain text or html. Plain text is simpler, but you will not be able
2350
      to add metadata or vary the output character encoding (this will be
2351
      defined in a configuration file). Additionally, some formatting may
2352
      be easier to preserve when previewing html. Actually the deciding factor
2353
      is metadata: &RCL; has a way to <link linkend="rcl.program.filters.html">
2354
      extract metadata from the html header and use it for field 
2355
      searches</link>.</para>
2356
2345
        <para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
2357
      <para>The <literal>RECOLL_FILTER_FORPREVIEW</literal> environment
2346
        environment variable (values <literal>yes</literal>,
2358
        variable (values <literal>yes</literal>, <literal>no</literal>)
2347
        <literal>no</literal>) tells the filter if the operation is
2359
        tells the filter if the operation is for indexing or
2348
        for indexing or previewing. Some filters use this to output a
2360
        previewing. Some filters use this to output a slightly different
2349
        slightly different format. This is not essential.</para>
2361
        format, for example stripping uninteresting repeated keywords (e.g.
2362
        <literal>Subject:</literal> for email) when indexing. This is not
2363
        essential.</para>
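
As a hedged illustration of how a filter might use this variable, the sketch below keeps header names such as `Subject:` when previewing and drops them when indexing. The helper names and the header representation are hypothetical, and treating an unset variable as "indexing" is an assumption of this example.

```python
import os


def for_preview(environ=os.environ):
    # "yes" means the output is for previewing, "no" for indexing.
    # ASSUMPTION: an unset variable is treated like "no".
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def format_headers(headers, preview):
    # headers is a list of (name, value) pairs, e.g. [("Subject", "Hi")].
    if preview:
        # Keep the header names so the preview stays readable.
        return "\n".join("%s: %s" % (name, value) for name, value in headers)
    # When indexing, drop the repeated keywords and keep only the values.
    return "\n".join(value for _, value in headers)
```

A filter would call `format_headers(headers, for_preview())` before writing its output.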
2364
2365
      <para>You should look at one of the simple filters, for example
2366
        <literal>rclps</literal> for a starting point.</para>
2367
2368
        <para>Don't forget to make your filter executable before 
2369
         testing!</para>
2370
2371
      </sect2>
2372
2373
      <sect2 id="rcl.program.filters.association">
2374
        <title>Telling &RCL; about the filter</title>
2375
2376
      <para>There are two elements that link a file to the filter which
2377
      should process it: the association of file to mime type and the
2378
      association of a mime type with a filter.</para>
2379
2380
      <para>The association of files to mime types is mostly based on
2381
        name suffixes. The types are defined inside the
2382
        <link linkend="rcl.install.config.mimemap">
2383
        <filename>mimemap</filename> file</link>. Example:
2384
<programlisting>
2385
2386
.doc = application/msword
2387
</programlisting>
2388
       If no suffix association is found for the file name, &RCL; will try
2389
       to execute the <command>file -i</command> command to determine a
2390
       mime type.</para>
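
The lookup order described above (suffix table first, then `file -i`) can be sketched roughly as follows. This is not how &RCL; itself is implemented: the function names are hypothetical, and the handling of `#` comment lines and the parsing of the `file -i` output are assumptions of this example.

```python
import os
import subprocess


def parse_mimemap(text):
    # Build a suffix -> mime type table from lines like
    # ".doc = application/msword". ASSUMPTION: blank lines, '#' comments
    # and lines without '=' are ignored.
    table = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        suffix, mime = line.split("=", 1)
        table[suffix.strip().lower()] = mime.strip()
    return table


def mime_type_for(path, table):
    # First try the suffix association.
    suffix = os.path.splitext(path)[1].lower()
    if suffix in table:
        return table[suffix]
    # Otherwise fall back to "file -i", whose output is assumed to look
    # like "name: type/subtype; charset=...".
    result = subprocess.run(["file", "-i", path],
                            capture_output=True, text=True)
    mime = result.stdout.rsplit(": ", 1)[-1].split(";")[0].strip()
    return mime or None
```

For example, with a table parsed from the line shown above, `mime_type_for("report.doc", table)` is resolved from the suffix alone and `file` is never run.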
2350
2391
2351
      <para>The association of file types to filters is performed in
2392
      <para>The association of file types to filters is performed in
2393
      the <link linkend="rcl.install.config.mimeconf">
2352
      the <filename>mimeconf</filename> file. A sample:</para>
2394
      <filename>mimeconf</filename> file</link>. A sample will probably be
2395
      more helpful than a long explanation:</para>
2353
<programlisting>
2396
<programlisting>
2354
2397
2355
[index]
2398
[index]
2356
application/msword = exec antiword -t -i 1 -m UTF-8;\
2399
application/msword = exec antiword -t -i 1 -m UTF-8;\
2357
     mimetype = text/plain ; charset=utf-8
2400
     mimetype = text/plain ; charset=utf-8
...
...
2390
      <listitem><para><literal>application/x-chm</literal> is processed
2433
      <listitem><para><literal>application/x-chm</literal> is processed
2391
          by a persistent filter. This is determined by the
2434
          by a persistent filter. This is determined by the
2392
          <literal>execm</literal> keyword.</para>
2435
          <literal>execm</literal> keyword.</para>
2393
      </listitem>
2436
      </listitem>
2394
    </itemizedlist>
2437
    </itemizedlist>
2395
      The easiest way to write a new filter is probably to start from an
2438
       </para> 
2396
      existing one.</para> 
2397
2439
2398
      <para>Filters which output <literal>text/plain</literal> text
2440
      </sect2>
2399
      are generally simpler, but they cannot specify the character set
2400
      and other metadata, so they are limited to cases where these
2401
      elements are not needed.</para>
2402
2403
2441
2404
    <sect2 id="rcl.program.filters.html">
2442
    <sect2 id="rcl.program.filters.html">
2405
        <title>Filter HTML output</title>
2443
        <title>Filter HTML output</title>
2406
2444
2407
        <para>The output HTML could be very minimal like the following
2445
        <para>The output HTML could be very minimal like the following