a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml
...
...
3035
    </sect1> <!-- rcl.search.desktop -->
3035
    </sect1> <!-- rcl.search.desktop -->
3036
3036
3037
  </chapter> <!-- Search -->
3037
  </chapter> <!-- Search -->
3038
3038
3039
3039
3040
  <chapter id="rcl.program">
3040
    <chapter id="rcl.program">
3041
    <title>Programming interface</title>
3041
      <title>Programming interface</title>
3042
3042
3043
    <para>&RCL; has an Application Programming Interface, usable both
3043
      <para>&RCL; has an Application Programming Interface, usable both
3044
    for indexing and searching, currently accessible from the
3044
        for indexing and searching, currently accessible from the
3045
    <application>Python</application> language.</para>
3045
        <application>Python</application> language.</para>
3046
3046
3047
    <para>Another less radical way to extend the application is to
3047
      <para>Another less radical way to extend the application is to
3048
    write filters for new types of documents.</para>
3048
        write filters for new types of documents.</para>
3049
3049
3050
    <para>The processing of metadata attributes for documents
3050
      <para>The processing of metadata attributes for documents
3051
    (<literal>fields</literal>) is highly configurable.</para>
3051
        (<literal>fields</literal>) is highly configurable.</para>
3052
3052
3053
3054
3053
    <sect1 id="rcl.program.filters">
3055
      <sect1 id="rcl.program.filters">
3054
        <title>Writing a document filter</title>
3056
        <title>Writing a document filter</title>
3055
3057
3056
      <para>&RCL; filters are executable programs which 
3058
        <para>&RCL; filters cooperate to translate from the multitude
3057
        translate from a specific format (ie:
3059
        of input document formats, simple ones
3058
        <application>openoffice</application>,
3060
        as <application>opendocument</application>, 
3061
          <application>acrobat</application>), or compound ones such
3062
          as <application>Zip</application>
3059
        <application>acrobat</application>, etc.) to the &RCL;
3063
          or <application>Email</application>, into the final &RCL;
3060
        indexing input format, which may be
3064
          indexing input format, which may
3061
        <literal>text/plain</literal> or
3065
          be <literal>text/plain</literal>
3062
        <literal>text/html</literal>.</para> 
3066
          or <literal>text/html</literal>. Most filters are executable
3067
          programs or scripts. A few filters are coded in C++ and live
3068
          inside <command>recollindex</command>. This latter kind will not
3069
          be described here.</para>
3063
3070
3064
      <para>As of &RCL; 1.13, there are two kinds of filters:
3071
        <para>There are currently (1.18 and since 1.13) two kinds of
3072
        external executable filters:
3065
        <itemizedlist>
3073
          <itemizedlist>
3066
    <listitem><para>Simple filters (the old ones) run once and
3074
      <listitem><para>Simple filters (<literal>exec</literal>
3075
          filters) run once and
3067
      exit. They can be bare programs like
3076
            exit. They can be bare programs
3068
      <application>antiword</application>, or shell-scripts using other
3077
            like <application>antiword</application>, or scripts
3069
    programs. They are very simple to write, because they just need
3078
          using other programs. They are very simple to write,
3070
    to output the converted to the standard output.</para>
3079
          because they just need to print the converted document
3080
          to the standard output. Their output can
3081
          be <literal>text/plain</literal>
3082
          or <literal>text/html</literal>.</para>
3071
      </listitem>
3083
        </listitem>
3072
    <listitem><para>Multiple filters, new in 1.13, run as long as
3084
      <listitem><para>Multiple filters (<literal>execm</literal>
3073
    their master process (ie: recollindex) is active. They can
3085
          filters), run as long as
3074
    process multiple files (sparing the process startup time which
3086
          their master process (<command>recollindex</command>) is
3075
    can be very significant), or multiple documents per file (ie: for
3087
          active. They can process multiple files (sparing the
3088
          process startup time which can be very significant),
3089
          or multiple documents per file (e.g.: for zip or chm
3076
      zip or chm files). They communicate with the indexer through a
3090
            files). They communicate with the indexer through a
3077
      simple protocol, but are nevertheless a bit more complicated than
3091
            simple protocol, but are nevertheless a bit more
3078
    the older kind. Most of these new filters are written in
3092
          complicated than the older kind. Most of the new
3093
          filters are written
3079
      <application>Python</application>, using a common module to
3094
            in <application>Python</application>, using a common
3080
    handle the protocol.</para>
3095
          module to handle the protocol. There is an
3096
          exception, <command>rclimg</command> which is written
3097
          in Perl. The subdocuments output by these filters can
3098
          be directly indexable (text or HTML), or they can be
3099
          other simple or compound documents that will need to
3100
          be processed by another filter.</para>
3081
      </listitem>
3101
        </listitem>
3082
    </itemizedlist>
3102
      </itemizedlist>
3083
      The following will just describe the simple filters. If you can
3103
        </para>
3084
      program and want to write one of the other kind, it shouldn't be too
3104
3085
      difficult to make sense of one of the existing modules. For example,
3105
        <para>In both cases, filters deal with regular file system
3086
      look at <command>rclzip</command> which uses Zip file paths as
3106
          files, and can process either a single document, or a
3087
      internal identifiers (<literal>ipath</literal>), and
3107
          linear list of documents in each file. &RCL; is responsible
3088
      <command>rclinfo</command>, which uses an integer index.</para> 
3108
          for performing up-to-date checks, dealing with more complex
3109
          embedding and other upper level issues.</para>
3110
3111
        <para>In the extreme case of a simple filter returning a
3112
          document in <literal>text/plain</literal> format, no
3113
          metadata can be transferred from the filter to the
3114
          indexer. Generic metadata, like document size or
3115
          modification date, will be gathered and stored by the
3116
          indexer.</para> 
3117
3118
        <para>Filters that produce  <literal>text/html</literal>
3119
          format can return an arbitrary amount of metadata inside HTML
3120
          <literal>meta</literal> tags. These will be processed
3121
          according to the directives found in 
3122
          the <link linkend="rcl.program.fields">
3123
            <filename>fields</filename> configuration
3124
            file</link>.</para>
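As an illustration of the added paragraphs above, here is a minimal sketch of what a simple (exec) filter producing text/html with metadata could look like. It is not taken from the Recoll sources. The names to_html and main are hypothetical, the "author" field is only an example value, and the sketch assumes that the input file name arrives as the single command-line argument and that the input is UTF-8 text, neither of which this hunk states.

```python
#!/usr/bin/env python3
"""Sketch of a simple (exec) filter: convert one file, print HTML, exit."""
import html
import sys


def to_html(text, metadata):
    """Wrap plain text in a minimal HTML document.

    Each metadata item becomes one <meta name=... content=...> tag in the
    header, which is where the manual says metadata is returned.
    """
    metas = "\n".join(
        '<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True))
        for name, value in metadata.items()
    )
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        + metas
        + "\n</head><body>\n<pre>"
        + html.escape(text)
        + "</pre>\n</body></html>\n"
    )


def main(argv):
    """Convert the file named in argv[1] and print the result to stdout."""
    if len(argv) != 2:
        sys.stderr.write("usage: myfilter <file>\n")
        return 1
    # Assumption: the input is UTF-8 text; a real filter would run a
    # format-specific converter here instead of reading the file directly.
    with open(argv[1], encoding="utf-8", errors="replace") as f:
        text = f.read()
    # Hypothetical metadata value, for illustration only.
    sys.stdout.write(to_html(text, {"author": "unknown"}))
    return 0
```

To use this as a standalone script, it would end with a call such as sys.exit(main(sys.argv)). Whether a given meta name is actually indexed depends on the fields configuration file mentioned above.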
3125
3126
        <para>The filters that can handle multiple documents per file
3127
          return a single piece of data to identify each document inside
3128
          the file. This piece of data, called
3129
          an <literal>ipath element</literal>, will be sent back by
3130
          &RCL; to extract the document at query time, for previewing,
3131
          or for creating a temporary file to be opened by a
3132
          viewer.</para>  
3133
3134
        <para>The following section describes the simple
3135
          filters, and the next one gives a few explanations about
3136
          the <literal>execm</literal> ones. You could conceivably
3137
          write a simple filter with only the elements in the
3138
          manual. This will not be the case for the other ones, for
3139
          which you will have to look at the code.</para>
3089
3140
3090
      <sect2 id="rcl.program.filters.simple">
3141
      <sect2 id="rcl.program.filters.simple">
3091
        <title>Simple filters</title>
3142
        <title>Simple filters</title>
3092
3143
3093
      <para>&RCL; simple filters are usually shell-scripts, but this is in
3144
      <para>&RCL; simple filters are usually shell-scripts, but this is in
...
...
3123
3174
3124
        <para>Don't forget to make your filter executable before 
3175
        <para>Don't forget to make your filter executable before 
3125
         testing !</para>
3176
         testing !</para>
3126
3177
3127
      </sect2>
3178
      </sect2>
3179
3180
      <sect2 id="rcl.program.filters.multiple">
3181
        <title>"Multiple" filters</title>
3182
3183
        <para>If you can program and want to write
3184
          an <literal>execm</literal> filter, it should not be too
3185
          difficult to make sense of one of the existing modules. For
3186
          example, look at <command>rclzip</command> which uses Zip
3187
          file paths as identifiers (<literal>ipath</literal>),
3188
          and <command>rclics</command>, which uses an integer
3189
          index. Also have a look at the comments inside
3190
          the <filename>internfile/mh_execm.h</filename> file and
3191
          possibly at the corresponding module.</para>
3192
3193
        <para><literal>execm</literal> filters sometimes need to make
3194
          a choice for the nature of the <literal>ipath</literal>
3195
          elements that they use in communication with the
3196
          indexer. Here are a few guidelines:
3197
          <itemizedlist>
3198
            <listitem><para>Use ASCII or UTF-8 (if the identifier is an
3199
                integer print it, for example, like printf %d would
3200
                do).</para></listitem>
3201
            <listitem><para>If at all possible, the data should make some
3202
              kind of sense when printed to a log file to help with 
3203
                debugging.</para></listitem>
3204
            <listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
3205
                separator to store a complex path internally (for
3206
                deeper embedding). Colons inside
3207
                the <literal>ipath</literal> elements output by a
3208
                filter will be escaped, but would be a bad choice as a
3209
                filter-specific separator (mostly, again, for
3210
                debugging issues).</para></listitem>
3211
          </itemizedlist>
3212
          In any case, the main goal is that it should
3213
          be easy for the filter to extract the target document, given
3214
          the file name and the <literal>ipath</literal>
3215
          element.</para>
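The guideline above (easy extraction given the file name and the ipath element) can be sketched with a Zip archive, using member paths as ipath elements in the way the text describes for rclzip. This is only an illustration built on Python's standard zipfile module. It does not implement the execm protocol, and list_ipaths and extract are hypothetical names, not functions from the Recoll filters.

```python
import io
import zipfile


def list_ipaths(zip_source):
    """Return one ipath element per regular member of the archive.

    Here the ipath element is simply the member path inside the Zip file,
    which is readable when it shows up in a log.
    """
    with zipfile.ZipFile(zip_source) as zf:
        return [info.filename for info in zf.infolist() if not info.is_dir()]


def extract(zip_source, ipath):
    """Return the bytes of the document identified by (file, ipath)."""
    with zipfile.ZipFile(zip_source) as zf:
        return zf.read(ipath)


# Build a small in-memory archive so that the sketch is self-contained.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("docs/readme.txt", "hello")
    zf.writestr("notes.txt", "world")

for ipath in list_ipaths(archive):
    print(ipath, len(extract(archive, ipath)))
```

An integer index, as described for rclics, would work the same way, with the filter printing the index as decimal text and looking the document up by position.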
3216
3217
        <para><literal>execm</literal> filters will also produce
3218
          a document with a null <literal>ipath</literal>
3219
          element. Depending on the type of document, this may have
3220
          some associated data (e.g. the body of an email message), or
3221
          none (typical for an archive file). If it is empty, this
3222
          document will be useful anyway for some operations, as the
3223
          parent of the actual data documents.</para>
3128
3224
3129
      <sect2 id="rcl.program.filters.association">
3225
      <sect2 id="rcl.program.filters.association">
3130
        <title>Telling &RCL; about the filter</title>
3226
        <title>Telling &RCL; about the filter</title>
3131
3227
3132
      <para>There are two elements that link a file to the filter which
3228
      <para>There are two elements that link a file to the filter which