|
a/src/doc/user/usermanual.sgml |
|
b/src/doc/user/usermanual.sgml |
|
... |
|
... |
3035 |
</sect1> <!-- rcl.search.desktop -->
|
3035 |
</sect1> <!-- rcl.search.desktop -->
|
3036 |
|
3036 |
|
3037 |
</chapter> <!-- Search -->
|
3037 |
</chapter> <!-- Search -->
|
3038 |
|
3038 |
|
3039 |
|
3039 |
|
3040 |
<chapter id="rcl.program">
|
3040 |
<chapter id="rcl.program">
|
3041 |
<title>Programming interface</title>
|
3041 |
<title>Programming interface</title>
|
3042 |
|
3042 |
|
3043 |
<para>&RCL; has an Application Programming Interface, usable both
|
3043 |
<para>&RCL; has an Application Programming Interface, usable both
|
3044 |
for indexing and searching, currently accessible from the
|
3044 |
for indexing and searching, currently accessible from the
|
3045 |
<application>Python</application> language.</para>
|
3045 |
<application>Python</application> language.</para>
|
3046 |
|
3046 |
|
3047 |
<para>Another less radical way to extend the application is to
|
3047 |
<para>Another less radical way to extend the application is to
|
3048 |
write filters for new types of documents.</para>
|
3048 |
write filters for new types of documents.</para>
|
3049 |
|
3049 |
|
3050 |
<para>The processing of metadata attributes for documents
|
3050 |
<para>The processing of metadata attributes for documents
|
3051 |
(<literal>fields</literal>) is highly configurable.</para>
|
3051 |
(<literal>fields</literal>) is highly configurable.</para>
|
3052 |
|
3052 |
|
|
|
3053 |
|
|
|
3054 |
|
3053 |
<sect1 id="rcl.program.filters">
|
3055 |
<sect1 id="rcl.program.filters">
|
3054 |
<title>Writing a document filter</title>
|
3056 |
<title>Writing a document filter</title>
|
3055 |
|
3057 |
|
3056 |
<para>&RCL; filters are executable programs which
|
3058 |
<para>&RCL; filters cooperate to translate from the multitude
|
3057 |
translate from a specific format (ie:
|
3059 |
of input document formats, simple ones
|
3058 |
<application>openoffice</application>,
|
3060 |
(such as <application>opendocument</application>,
|
|
|
3061 |
<application>acrobat</application>), or compound ones such
|
|
|
3062 |
as <application>Zip</application>
|
3059 |
<application>acrobat</application>, etc.) to the &RCL;
|
3063 |
or <application>Email</application>, into the final &RCL;
|
3060 |
indexing input format, which may be
|
3064 |
indexing input format, which may
|
3061 |
<literal>text/plain</literal> or
|
3065 |
be <literal>text/plain</literal>
|
3062 |
<literal>text/html</literal>.</para>
|
3066 |
or <literal>text/html</literal>. Most filters are executable
|
|
|
3067 |
programs or scripts. A few filters are coded in C++ and live
|
|
|
3068 |
inside <command>recollindex</command>. This latter kind will not
|
|
|
3069 |
be described here.</para>
|
3063 |
|
3070 |
|
3064 |
<para>As of &RCL; 1.13, there are two kinds of filters:
|
3071 |
<para>There are currently (1.18 and since 1.13) two kinds of
|
|
|
3072 |
external executable filters:
|
3065 |
<itemizedlist>
|
3073 |
<itemizedlist>
|
3066 |
<listitem><para>Simple filters (the old ones) run once and
|
3074 |
<listitem><para>Simple filters (<literal>exec</literal>
|
|
|
3075 |
filters) run once and
|
3067 |
exit. They can be bare programs like
|
3076 |
exit. They can be bare programs
|
3068 |
<application>antiword</application>, or shell-scripts using other
|
3077 |
like <application>antiword</application>, or scripts
|
3069 |
programs. They are very simple to write, because they just need
|
3078 |
using other programs. They are very simple to write,
|
3070 |
to output the converted to the standard output.</para>
|
3079 |
because they just need to print the converted document
|
|
|
3080 |
to the standard output. Their output can
|
|
|
3081 |
be <literal>text/plain</literal>
|
|
|
3082 |
or <literal>text/html</literal>.</para>
|
3071 |
</listitem>
|
3083 |
</listitem>
|
3072 |
<listitem><para>Multiple filters, new in 1.13, run as long as
|
3084 |
<listitem><para>Multiple filters (<literal>execm</literal>
|
3073 |
their master process (ie: recollindex) is active. They can
|
3085 |
filters) run as long as
|
3074 |
process multiple files (sparing the process startup time which
|
3086 |
their master process (<command>recollindex</command>) is
|
3075 |
can be very significant), or multiple documents per file (ie: for
|
3087 |
active. They can process multiple files (sparing the
|
|
|
3088 |
process startup time which can be very significant),
|
|
|
3089 |
or multiple documents per file (e.g. for zip or chm
|
3076 |
zip or chm files). They communicate with the indexer through a
|
3090 |
files). They communicate with the indexer through a
|
3077 |
simple protocol, but are nevertheless a bit more complicated than
|
3091 |
simple protocol, but are nevertheless a bit more
|
3078 |
the older kind. Most of these new filters are written in
|
3092 |
complicated than the older kind. Most of the new
|
|
|
3093 |
filters are written
|
3079 |
<application>Python</application>, using a common module to
|
3094 |
in <application>Python</application>, using a common
|
3080 |
handle the protocol.</para>
|
3095 |
module to handle the protocol. There is an
|
|
|
3096 |
exception, <command>rclimg</command>, which is written
|
|
|
3097 |
in Perl. The subdocuments output by these filters can
|
|
|
3098 |
be directly indexable (text or HTML), or they can be
|
|
|
3099 |
other simple or compound documents that will need to
|
|
|
3100 |
be processed by another filter.</para>
|
3081 |
</listitem>
|
3101 |
</listitem>
|
3082 |
</itemizedlist>
|
3102 |
</itemizedlist>
|
3083 |
The following will just describe the simple filters. If you can
|
3103 |
</para>
|
3084 |
program and want to write one of the other kind, it shouldn't be too
|
3104 |
|
3085 |
difficult to make sense of one of the existing modules. For example,
|
3105 |
<para>In both cases, filters deal with regular file system
|
3086 |
look at <command>rclzip</command> which uses Zip file paths as
|
3106 |
files, and can process either a single document, or a
|
3087 |
internal identifiers (<literal>ipath</literal>), and
|
3107 |
linear list of documents in each file. &RCL; is responsible
|
3088 |
<command>rclinfo</command>, which uses an integer index.</para>
|
3108 |
for performing up-to-date checks, dealing with more complex
|
|
|
3109 |
embedding and other upper level issues.</para>
|
|
|
3110 |
|
|
|
3111 |
<para>In the extreme case of a simple filter returning a
|
|
|
3112 |
document in <literal>text/plain</literal> format, no
|
|
|
3113 |
metadata can be transferred from the filter to the
|
|
|
3114 |
indexer. Generic metadata, like document size or
|
|
|
3115 |
modification date, will be gathered and stored by the
|
|
|
3116 |
indexer.</para>
|
|
|
3117 |
|
|
|
3118 |
<para>Filters that produce <literal>text/html</literal>
|
|
|
3119 |
format can return an arbitrary amount of metadata inside HTML
|
|
|
3120 |
<literal>meta</literal> tags. These will be processed
|
|
|
3121 |
according to the directives found in
|
|
|
3122 |
the <link linkend="rcl.program.fields">
|
|
|
3123 |
<filename>fields</filename> configuration
|
|
|
3124 |
file</link>.</para>
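As a rough illustration of the <literal>text/html</literal> case, here is a minimal Python sketch of how a filter might wrap extracted text and metadata in an HTML document. The helper name <literal>render_html</literal> and the sample field names are hypothetical. Which <literal>meta</literal> names actually get indexed depends on the <filename>fields</filename> configuration.

```python
import html


def render_html(body_text, metadata):
    """Return an HTML document carrying metadata in <meta> tags.

    Hypothetical helper: name and signature are illustrative only.
    """
    # One <meta> tag per metadata item, with names and values escaped.
    meta_lines = [
        '<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True))
        for name, value in sorted(metadata.items())
    ]
    return "\n".join(
        ["<html><head>",
         '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">']
        + meta_lines
        + ["</head><body>",
           # The extracted text is escaped so it cannot break the markup.
           "<pre>" + html.escape(body_text) + "</pre>",
           "</body></html>"]
    )


if __name__ == "__main__":
    # A simple filter just prints the converted document to standard output.
    print(render_html("Hello & welcome", {"author": "J. Doe", "title": "Sample"}))
```

Escaping both the metadata values and the body keeps the output valid HTML whatever the input document contains.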
|
|
|
3125 |
|
|
|
3126 |
<para>The filters that can handle multiple documents per file
|
|
|
3127 |
return a single piece of data to identify each document inside
|
|
|
3128 |
the file. This piece of data, called
|
|
|
3129 |
an <literal>ipath element</literal>, will be sent back by
|
|
|
3130 |
&RCL; to extract the document at query time, for previewing,
|
|
|
3131 |
or for creating a temporary file to be opened by a
|
|
|
3132 |
viewer.</para>
|
|
|
3133 |
|
|
|
3134 |
<para>The following section describes the simple
|
|
|
3135 |
filters, and the next one gives a few explanations about
|
|
|
3136 |
the <literal>execm</literal> ones. You could conceivably
|
|
|
3137 |
write a simple filter with only the elements in the
|
|
|
3138 |
manual. This will not be the case for the other ones, for
|
|
|
3139 |
which you will have to look at the code.</para>
|
3089 |
|
3140 |
|
3090 |
<sect2 id="rcl.program.filters.simple">
|
3141 |
<sect2 id="rcl.program.filters.simple">
|
3091 |
<title>Simple filters</title>
|
3142 |
<title>Simple filters</title>
|
3092 |
|
3143 |
|
3093 |
<para>&RCL; simple filters are usually shell-scripts, but this is in
|
3144 |
<para>&RCL; simple filters are usually shell-scripts, but this is in
|
|
... |
|
... |
3123 |
|
3174 |
|
3124 |
<para>Don't forget to make your filter executable before
|
3175 |
<para>Don't forget to make your filter executable before
|
3125 |
testing !</para>
|
3176 |
testing !</para>
|
3126 |
|
3177 |
|
3127 |
</sect2>
|
3178 |
</sect2>
|
|
|
3179 |
|
|
|
3180 |
<sect2 id="rcl.program.filters.multiple">
|
|
|
3181 |
<title>"Multiple" filters</title>
|
|
|
3182 |
|
|
|
3183 |
<para>If you can program and want to write
|
|
|
3184 |
an <literal>execm</literal> filter, it should not be too
|
|
|
3185 |
difficult to make sense of one of the existing modules. For
|
|
|
3186 |
example, look at <command>rclzip</command> which uses Zip
|
|
|
3187 |
file paths as identifiers (<literal>ipath</literal>),
|
|
|
3188 |
and <command>rclics</command>, which uses an integer
|
|
|
3189 |
index. Also have a look at the comments inside
|
|
|
3190 |
the <filename>internfile/mh_execm.h</filename> file and
|
|
|
3191 |
possibly at the corresponding module.</para>
|
|
|
3192 |
|
|
|
3193 |
<para><literal>execm</literal> filters sometimes need to make
|
|
|
3194 |
a choice for the nature of the <literal>ipath</literal>
|
|
|
3195 |
elements that they use in communication with the
|
|
|
3196 |
indexer. Here are a few guidelines:
|
|
|
3197 |
<itemizedlist>
|
|
|
3198 |
<listitem><para>Use ASCII or UTF-8 (if the identifier is an
|
|
|
3199 |
integer, print it, for example, like printf %d would
|
|
|
3200 |
do).</para></listitem>
|
|
|
3201 |
<listitem><para>If at all possible, the data should make some
|
|
|
3202 |
kind of sense when printed to a log file to help with
|
|
|
3203 |
debugging.</para></listitem>
|
|
|
3204 |
<listitem><para>&RCL; uses a colon (<literal>:</literal>) as a
|
|
|
3205 |
separator to store a complex path internally (for
|
|
|
3206 |
deeper embedding). Colons inside
|
|
|
3207 |
the <literal>ipath</literal> elements output by a
|
|
|
3208 |
filter will be escaped, but a colon would be a bad choice as a
|
|
|
3209 |
filter-specific separator (mostly, again, for
|
|
|
3210 |
debugging issues).</para></listitem>
|
|
|
3211 |
</itemizedlist>
|
|
|
3212 |
In any case, the main goal is that it should
|
|
|
3213 |
be easy for the filter to extract the target document, given
|
|
|
3214 |
the file name and the <literal>ipath</literal>
|
|
|
3215 |
element.</para>
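The guidelines above can be sketched for a Zip archive, using archive member names as <literal>ipath</literal> elements, in the spirit of what the text says about <command>rclzip</command>. This is not the actual <command>rclzip</command> code and it does not implement the <literal>execm</literal> protocol. The function names <literal>list_ipaths</literal> and <literal>extract_document</literal> are hypothetical, and only the Python standard <literal>zipfile</literal> module is used.

```python
import zipfile


def list_ipaths(archive_path):
    """Return one ipath element (the member name) per document in the archive."""
    with zipfile.ZipFile(archive_path) as archive:
        # Skip directory entries: they hold no document data.
        return [info.filename for info in archive.infolist()
                if not info.is_dir()]


def extract_document(archive_path, ipath):
    """Return the bytes of the document designated by the file name and ipath."""
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(ipath)
```

Member names are readable in a log file, and extraction needs nothing beyond the file name and the <literal>ipath</literal> element, which is the main goal stated above.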
|
|
|
3216 |
|
|
|
3217 |
<para><literal>execm</literal> filters will also produce
|
|
|
3218 |
a document with a null <literal>ipath</literal>
|
|
|
3219 |
element. Depending on the type of document, this may have
|
|
|
3220 |
some associated data (e.g. the body of an email message), or
|
|
|
3221 |
none (typical for an archive file). If it is empty, this
|
|
|
3222 |
document will be useful anyway for some operations, as the
|
|
|
3223 |
parent of the actual data documents.</para></sect2>
|
3128 |
|
3224 |
|
3129 |
<sect2 id="rcl.program.filters.association">
|
3225 |
<sect2 id="rcl.program.filters.association">
|
3130 |
<title>Telling &RCL; about the filter</title>
|
3226 |
<title>Telling &RCL; about the filter</title>
|
3131 |
|
3227 |
|
3132 |
<para>There are two elements that link a file to the filter which
|
3228 |
<para>There are two elements that link a file to the filter which
|