|
a/src/doc/user/usermanual.sgml |
|
b/src/doc/user/usermanual.sgml |
|
... |
|
... |
2322 |
the older kind. Most of these new filters are written in
|
2322 |
the older kind. Most of these new filters are written in
|
2323 |
<application>Python</application>, using a common module to
|
2323 |
<application>Python</application>, using a common module to
|
2324 |
handle the protocol.</para>
|
2324 |
handle the protocol.</para>
|
2325 |
</listitem>
|
2325 |
</listitem>
|
2326 |
</itemizedlist>
|
2326 |
</itemizedlist>
|
2327 |
The following will just describe the simple filters, if you are
|
2327 |
The following will just describe the simple filters. If you can
|
2328 |
programmer enough to write one of the other kind, it shouldn't be too
|
2328 |
program and want to write one of the other kind, it shouldn't be too
|
2329 |
difficult to make sense of one of the existing modules (ie:
|
2329 |
difficult to make sense of one of the existing modules. For example,
|
2330 |
rclzip).</para>
|
2330 |
look at <command>rclzip</command>, which uses Zip file paths as
|
|
|
2331 |
internal identifiers (<literal>ipath</literal>), and
|
|
|
2332 |
<command>rclinfo</command>, which uses an integer index.</para>
|
|
|
2333 |
|
|
|
2334 |
<sect2 id="rcl.program.filters.simple">
|
|
|
2335 |
<title>Simple filters</title>
|
2331 |
|
2336 |
|
2332 |
<para>&RCL; simple filters are usually shell-scripts, but this is in
|
2337 |
<para>&RCL; simple filters are usually shell-scripts, but this is in
|
2333 |
no way necessary. These programs are extremely simple and most
|
2338 |
no way necessary. Extracting the text from the native format is the
|
2334 |
of the difficulty lies in extracting the text from the native
|
2339 |
difficult part. Outputting the format expected by &RCL; is
|
2335 |
format, not outputting what is expected by &RCL;. Happily
|
|
|
2336 |
enough, most document formats already have translators or text
|
2340 |
trivial. Happily enough, most document formats have translators or
|
2337 |
extractors which handle the difficult part and can be called
|
2341 |
text extractors which can be called from the filter. In some cases
|
2338 |
from the filter. In some case the output of the translating
|
2342 |
the output of the translating program is completely appropriate,
|
2339 |
program is appropriate, and no intermediate shell-script is
|
2343 |
and no intermediate shell-script is needed.</para>
|
2340 |
needed.</para>
|
|
|
2341 |
|
2344 |
|
2342 |
<para>Filters are called with a single argument which is the
|
2345 |
<para>Filters are called with a single argument which is the
|
2343 |
source file name. They should output the result to stdout.</para>
|
2346 |
source file name. They should output the result to stdout.</para>
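<para>As an illustration of this calling convention, here is a minimal sketch of a filter written in Python. The names <literal>extract_text</literal> and <literal>main</literal> are hypothetical, and the extraction step is only a placeholder which decodes the file as UTF-8; a real filter would call a translator or text extractor at that point.</para>

```python
#!/usr/bin/env python3
"""Sketch of a simple filter: one file name in, text on stdout."""
import sys


def extract_text(path):
    # Placeholder for the format-specific work (assumption): a real
    # filter would run a translator or text extractor here. This one
    # just decodes the raw bytes as UTF-8.
    with open(path, "rb") as f:
        return f.read().decode("utf-8", errors="replace")


def main(argv, out=sys.stdout):
    # The filter is called with a single argument: the source file name.
    if len(argv) != 2:
        print("usage: %s <file>" % argv[0], file=sys.stderr)
        return 1
    # The result goes to stdout (or to the stream given for testing).
    out.write(extract_text(argv[1]))
    return 0

# A real script would end with: sys.exit(main(sys.argv))
```

<para>Passing the output stream as a parameter is only a convenience for testing the sketch; an installed filter would always write to stdout.</para>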
|
2344 |
|
2347 |
|
|
|
2348 |
<para>When writing a filter, you should decide if it will output
|
|
|
2349 |
plain text or html. Plain text is simpler, but you will not be able
|
|
|
2350 |
to add metadata or vary the output character encoding (this will be
|
|
|
2351 |
defined in a configuration file). Additionally, some formatting may
|
|
|
2352 |
be easier to preserve when previewing html. Actually the deciding factor
|
|
|
2353 |
is metadata: &RCL; has a way to <link linkend="rcl.program.filters.html">
|
|
|
2354 |
extract metadata from the html header and use it for field
|
|
|
2355 |
searches</link>.</para>
|
|
|
2356 |
|
2345 |
<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
|
2357 |
<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal> environment
|
2346 |
environment variable (values <literal>yes</literal>,
|
2358 |
variable (values <literal>yes</literal>, <literal>no</literal>)
|
2347 |
<literal>no</literal>) tells the filter if the operation is
|
2359 |
tells the filter if the operation is for indexing or
|
2348 |
for indexing or previewing. Some filters use this to output a
|
2360 |
previewing. Some filters use this to output a slightly different
|
2349 |
slightly different format. This is not essential.</para>
|
2361 |
format, for example stripping uninteresting repeated keywords (such as
|
|
|
2362 |
<literal>Subject:</literal> for email) when indexing. This is not
|
|
|
2363 |
essential.</para>
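<para>The following sketch shows one way a Python filter might use this variable. The function names <literal>for_preview</literal> and <literal>render_header</literal> are hypothetical, and treating any value other than <literal>yes</literal> as "indexing" is an assumption of the sketch.</para>

```python
import os


def for_preview(environ=os.environ):
    # Documented values are "yes" and "no". Treating a missing or
    # unexpected value as "no" is an assumption made for this sketch.
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def render_header(name, value, preview):
    # When previewing, keep the keyword so the reader sees "Subject: ...".
    # When indexing, drop the repeated keyword and keep only the value.
    if preview:
        return "%s: %s" % (name, value)
    return value
```

<para>A filter would typically call <literal>for_preview()</literal> once at startup and pass the result to its formatting code.</para>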
|
|
|
2364 |
|
|
|
2365 |
<para>You should look at one of the simple filters, for example
|
|
|
2366 |
<literal>rclps</literal> for a starting point.</para>
|
|
|
2367 |
|
|
|
2368 |
<para>Don't forget to make your filter executable before
|
|
|
2369 |
testing!</para>
|
|
|
2370 |
|
|
|
2371 |
</sect2>
|
|
|
2372 |
|
|
|
2373 |
<sect2 id="rcl.program.filters.association">
|
|
|
2374 |
<title>Telling &RCL; about the filter</title>
|
|
|
2375 |
|
|
|
2376 |
<para>There are two elements that link a file to the filter which
|
|
|
2377 |
should process it: the association of file to mime type and the
|
|
|
2378 |
association of a mime type with a filter.</para>
|
|
|
2379 |
|
|
|
2380 |
<para>The association of files to mime types is mostly based on
|
|
|
2381 |
name suffixes. The types are defined inside the
|
|
|
2382 |
<link linkend="rcl.install.config.mimemap">
|
|
|
2383 |
<filename>mimemap</filename> file</link>. Example:
|
|
|
2384 |
<programlisting>
|
|
|
2385 |
|
|
|
2386 |
.doc = application/msword
|
|
|
2387 |
</programlisting>
|
|
|
2388 |
If no suffix association is found for the file name, &RCL; will try
|
|
|
2389 |
to execute the <command>file -i</command> command to determine a
|
|
|
2390 |
mime type.</para>
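<para>The lookup order described above can be sketched in Python as follows. This is not the actual &RCL; implementation: the function names are hypothetical, the parser only handles plain <literal>suffix = type</literal> lines, and the lower-casing of suffixes and the skipping of <literal>#</literal> comment lines are assumptions.</para>

```python
import os
import subprocess


def parse_mimemap(text):
    # Build a suffix -> mime type table from "suffix = type" lines.
    table = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything without an "=" sign.
        if not line or line.startswith("#") or "=" not in line:
            continue
        suffix, mimetype = line.split("=", 1)
        table[suffix.strip().lower()] = mimetype.strip()
    return table


def mime_type(path, table):
    # First try the name suffix.
    suffix = os.path.splitext(path)[1].lower()
    if suffix in table:
        return table[suffix]
    # No suffix association: fall back to "file -i" (-b omits the name).
    result = subprocess.run(["file", "-i", "-b", path],
                            capture_output=True, text=True, check=True)
    # Typical output looks like "text/plain; charset=us-ascii".
    return result.stdout.split(";", 1)[0].strip()
```

<para>For example, with a table parsed from the line <literal>.doc = application/msword</literal>, a file named <filename>report.doc</filename> is resolved without running any external command.</para>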
|
2350 |
|
2391 |
|
2351 |
<para>The association of file types to filters is performed in
|
2392 |
<para>The association of file types to filters is performed in
|
|
|
2393 |
the <link linkend="rcl.install.config.mimeconf">
|
2352 |
the <filename>mimeconf</filename> file. A sample:</para>
|
2394 |
<filename>mimeconf</filename> file</link>. A sample will probably be
|
|
|
2395 |
more helpful than a long explanation:</para>
|
2353 |
<programlisting>
|
2396 |
<programlisting>
|
2354 |
|
2397 |
|
2355 |
[index]
|
2398 |
[index]
|
2356 |
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
2399 |
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
2357 |
mimetype = text/plain ; charset=utf-8
|
2400 |
mimetype = text/plain ; charset=utf-8
|
|
... |
|
... |
2390 |
<listitem><para><literal>application/x-chm</literal> is processed
|
2433 |
<listitem><para><literal>application/x-chm</literal> is processed
|
2391 |
by a persistent filter. This is determined by the
|
2434 |
by a persistent filter. This is determined by the
|
2392 |
<literal>execm</literal> keyword.</para>
|
2435 |
<literal>execm</literal> keyword.</para>
|
2393 |
</listitem>
|
2436 |
</listitem>
|
2394 |
</itemizedlist>
|
2437 |
</itemizedlist>
|
2395 |
The easiest way to write a new filter is probably to start from an
|
2438 |
</para>
|
2396 |
existing one.</para>
|
|
|
2397 |
|
2439 |
|
2398 |
<para>Filters which output <literal>text/plain</literal> text
|
2440 |
</sect2>
|
2399 |
are generally simpler, but they cannot specify the character set
|
|
|
2400 |
and other metadata, so they are limited to cases where these
|
|
|
2401 |
elements are not needed.</para>
|
|
|
2402 |
|
|
|
2403 |
|
2441 |
|
2404 |
<sect2 id="rcl.program.filters.html">
|
2442 |
<sect2 id="rcl.program.filters.html">
|
2405 |
<title>Filter HTML output</title>
|
2443 |
<title>Filter HTML output</title>
|
2406 |
|
2444 |
|
2407 |
<para>The output HTML could be very minimal like the following
|
2445 |
<para>The output HTML could be very minimal like the following
|