--- a/src/doc/user/usermanual.sgml
+++ b/src/doc/user/usermanual.sgml
@@ -3037,55 +3037,106 @@
</chapter> <!-- Search -->
- <chapter id="rcl.program">
- <title>Programming interface</title>
-
- <para>&RCL; has an Application Programming Interface, usable both
- for indexing and searching, currently accessible from the
- <application>Python</application> language.</para>
-
- <para>Another less radical way to extend the application is to
- write filters for new types of documents.</para>
-
- <para>The processing of metadata attributes for documents
- (<literal>fields</literal>) is highly configurable.</para>
-
- <sect1 id="rcl.program.filters">
+ <chapter id="rcl.program">
+ <title>Programming interface</title>
+
+ <para>&RCL; has an Application Programming Interface, usable both
+ for indexing and searching, currently accessible from the
+ <application>Python</application> language.</para>
+
+ <para>Another less radical way to extend the application is to
+ write filters for new types of documents.</para>
+
+ <para>The processing of metadata attributes for documents
+ (<literal>fields</literal>) is highly configurable.</para>
+
+
+
+ <sect1 id="rcl.program.filters">
<title>Writing a document filter</title>
- <para>&RCL; filters are executable programs which
- translate from a specific format (ie:
- <application>openoffice</application>,
- <application>acrobat</application>, etc.) to the &RCL;
- indexing input format, which may be
- <literal>text/plain</literal> or
- <literal>text/html</literal>.</para>
-
- <para>As of &RCL; 1.13, there are two kinds of filters:
- <itemizedlist>
- <listitem><para>Simple filters (the old ones) run once and
- exit. They can be bare programs like
- <application>antiword</application>, or shell-scripts using other
- programs. They are very simple to write, because they just need
- to output the converted to the standard output.</para>
- </listitem>
- <listitem><para>Multiple filters, new in 1.13, run as long as
- their master process (ie: recollindex) is active. They can
- process multiple files (sparing the process startup time which
- can be very significant), or multiple documents per file (ie: for
- zip or chm files). They communicate with the indexer through a
- simple protocol, but are nevertheless a bit more complicated than
- the older kind. Most of these new filters are written in
- <application>Python</application>, using a common module to
- handle the protocol.</para>
- </listitem>
- </itemizedlist>
- The following will just describe the simple filters. If you can
- program and want to write one of the other kind, it shouldn't be too
- difficult to make sense of one of the existing modules. For example,
- look at <command>rclzip</command> which uses Zip file paths as
- internal identifiers (<literal>ipath</literal>), and
- <command>rclinfo</command>, which uses an integer index.</para>
+ <para>&RCL; filters cooperate to translate from the multitude
+ of input document formats, simple ones (such
+ as <application>opendocument</application> or
+ <application>acrobat</application>), or compound ones such
+ as <application>Zip</application>
+ or <application>Email</application>, into the final &RCL;
+ indexing input format, which may
+ be <literal>text/plain</literal>
+ or <literal>text/html</literal>. Most filters are executable
+ programs or scripts. A few filters are coded in C++ and live
+ inside <command>recollindex</command>. This latter kind will not
+ be described here.</para>
+
+ <para>As of &RCL; 1.18 (and since 1.13), there are two kinds of
+ external executable filters:
+ <itemizedlist>
+ <listitem><para>Simple filters (<literal>exec</literal>
+ filters) run once and
+ exit. They can be bare programs
+ like <application>antiword</application>, or scripts
+ using other programs. They are very simple to write,
+ because they just need to print the converted document
+ to the standard output. Their output can
+ be <literal>text/plain</literal>
+ or <literal>text/html</literal>.</para>
+ </listitem>
+ <listitem><para>Multiple filters (<literal>execm</literal>
+ filters) run as long as
+ their master process (<command>recollindex</command>) is
+ active. They can process multiple files (sparing the
+ process startup time which can be very significant),
+ or multiple documents per file (e.g. for zip or chm
+ files). They communicate with the indexer through a
+ simple protocol, but are nevertheless a bit more
+ complicated than the older kind. Most of the new
+ filters are written
+ in <application>Python</application>, using a common
+ module to handle the protocol. One
+ exception is <command>rclimg</command>, which is written
+ in Perl. The subdocuments output by these filters can
+ be directly indexable (text or HTML), or they can be
+ other simple or compound documents that will need to
+ be processed by another filter.</para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ <para>In both cases, filters deal with regular file system
+ files, and can process either a single document, or a
+ linear list of documents in each file. &RCL; is responsible
+ for performing up-to-date checks, dealing with more complex
+ embedding, and handling other upper-level issues.</para>
+
+ <para>In the extreme case of a simple filter returning a
+ document in <literal>text/plain</literal> format, no
+ metadata can be transferred from the filter to the
+ indexer. Generic metadata, like document size or
+ modification date, will be gathered and stored by the
+ indexer.</para>
+
+ <para>Filters that produce <literal>text/html</literal>
+ format can return an arbitrary amount of metadata inside HTML
+ <literal>meta</literal> tags. These will be processed
+ according to the directives found in
+ the <link linkend="rcl.program.fields">
+ <filename>fields</filename> configuration
+ file</link>.</para>
+
+ <para>The filters that can handle multiple documents per file
+ return a single piece of data to identify each document inside
+ the file. This piece of data, called
+ an <literal>ipath element</literal>, will be sent back by
+ &RCL; to extract the document at query time, for previewing,
+ or for creating a temporary file to be opened by a
+ viewer.</para>
+
+ <para>The following section describes the simple
+ filters, and the next one gives a few explanations about
+ the <literal>execm</literal> ones. You could conceivably
+ write a simple filter using only the information in this
+ manual. This will not be the case for the other kind, for
+ which you will have to look at the code.</para>
<sect2 id="rcl.program.filters.simple">
<title>Simple filters</title>
@@ -3125,6 +3176,51 @@
testing !</para>
</sect2>
+
+ <sect2 id="rcl.program.filters.multiple">
+ <title>"Multiple" filters</title>
+
+ <para>If you can program and want to write
+ an <literal>execm</literal> filter, it should not be too
+ difficult to make sense of one of the existing modules. For
+ example, look at <command>rclzip</command>, which uses Zip
+ file paths as identifiers (<literal>ipath</literal>),
+ and <command>rclics</command>, which uses an integer
+ index. Also have a look at the comments inside
+ the <filename>internfile/mh_execm.h</filename> file and
+ possibly at the corresponding module.</para>
+
+ <para><literal>execm</literal> filters sometimes need to
+ choose the nature of the <literal>ipath</literal>
+ elements that they use in communication with the
+ indexer. Here are a few guidelines:
+ <itemizedlist>
+ <listitem><para>Use ASCII or UTF-8 (if the identifier is an
+ integer, print it in decimal, for example as
+ <literal>printf %d</literal> would).</para></listitem>
+ <listitem><para>If at all possible, the data should make some
+ kind of sense when printed to a log file to help with
+ debugging.</para></listitem>
+ <listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
+ separator to store a complex path internally (for
+ deeper embedding). Colons inside
+ the <literal>ipath</literal> elements output by a
+ filter will be escaped, but a colon would be a bad choice
+ as a filter-specific separator (mostly, again, because it
+ complicates debugging).</para></listitem>
+ </itemizedlist>
+ In any case, the main goal is that it should
+ be easy for the filter to extract the target document, given
+ the file name and the <literal>ipath</literal>
+ element.</para>
+
+ <para><literal>execm</literal> filters will also produce
+ a document with a null <literal>ipath</literal>
+ element. Depending on the type of document, this may have
+ some associated data (e.g. the body of an email message), or
+ none (typical for an archive file). Even if it is empty, this
+ document is still useful for some operations, as the
+ parent of the actual data documents.</para>
+ </sect2>
<sect2 id="rcl.program.filters.association">
<title>Telling &RCL; about the filter</title>