--- a/src/doc/user/usermanual.sgml
+++ b/src/doc/user/usermanual.sgml
@@ -24,11 +24,12 @@
Dockes</holder>
</copyright>
- <releaseinfo>$Id: usermanual.sgml,v 1.44 2007-06-08 16:46:53 dockes Exp $</releaseinfo>
+ <releaseinfo>$Id: usermanual.sgml,v 1.45 2007-06-26 16:58:25 dockes Exp $</releaseinfo>
<abstract>
<para>This document introduces full text search notions
- and describes the installation and use of the &RCL; application.</para>
+ and describes the installation and use of the &RCL;
+ application. It currently describes &RCL; 1.9.</para>
</abstract>
@@ -770,30 +771,6 @@
<replaceable>live</replaceable> or
<replaceable>unplugged</replaceable> but not
<replaceable>potatoes</replaceable> (in any part of the document).</para>
-
- <para>The first element <literal>author:"john doe"</literal> is
- a phrase search limited to a specific field. Phrase searches are
- specified as usual by enclosing the words in double quotes. The
- field specification appears before the colon (of course this is
- not limited to phrases, <literal>author:Balzac</literal> would
- be ok too). &RCL; currently manages the following fields:</para>
-
- <itemizedlist>
- <listitem><para><literal>title</literal>,
- <literal>subject</literal> or <literal>caption</literal> are
- synonyms which specify data to be searched for in the
- document title or subject.</para>
- </listitem>
- <listitem><para><literal>author</literal> or
- <literal>from</literal> for searching the documents originators.</para>
- </listitem>
- <listitem><para><literal>keyword</literal> for searching the
- document specified keywords (few documents actually have any).</para>
- </listitem>
- </itemizedlist>
-
- <para>The query language is currently the only way to use the
- &RCL; field search capability.</para>
<para>All elements in the search entry are normally combined
with an implicit AND. It is possible to specify that elements be
@@ -817,8 +794,54 @@
<para>An entry preceded by a <literal>-</literal> specifies a
term that should <emphasis>not</emphasis> appear.</para>
+ <para>The first element in the above example,
+ <literal>author:"john doe"</literal>, is a phrase search limited
+ to a specific field. Phrase searches are specified as usual by
+ enclosing the words in double quotes. The field specification
+ appears before the colon (this is not limited to phrases:
+ <literal>author:Balzac</literal> would work
+ too). &RCL; currently manages the following fields:</para>
+ <itemizedlist>
+ <listitem><para><literal>title</literal>,
+ <literal>subject</literal> or <literal>caption</literal> are
+ synonyms which specify data to be searched for in the
+ document title or subject.</para>
+ </listitem>
+ <listitem><para><literal>author</literal> or
+ <literal>from</literal> for searching the document originators.</para>
+ </listitem>
+ <listitem><para><literal>keyword</literal> for searching the
+ keywords specified in the document (few documents actually have any).</para>
+ </listitem>
+ </itemizedlist>
+
+ <para>As of release 1.9, filters can create additional fields
+ with arbitrary names. No standard filter uses this
+ capability yet.</para>
+
+ <para>There are two other elements which may be specified
+ through the field syntax, but are somewhat special:</para>
+ <itemizedlist>
+ <listitem><para><literal>ext</literal> for specifying the file
+ name extension (Ex: <literal>ext:html</literal>).</para>
+ </listitem>
+ <listitem><para><literal>mime</literal> for specifying the
+ MIME type. This one is quite special because you can specify
+ several values, which will be ORed (the normal default for the
+ language is AND). Ex: <literal>mime:text/plain
+ mime:text/html</literal>. Specifying an explicit boolean
+ operator or negation (<literal>-</literal>) before a
+ <literal>mime</literal> specification is not supported and
+ will produce unexpected results.</para>
+ </listitem>
+ </itemizedlist>
+ <para>The query language is currently the only way to use the
+ &RCL; field search capability.</para>
+
<para>Words inside phrases and capitalized words are not
- stem-expanded. Wildcards may be used anywhere.</para>
+ stem-expanded. Wildcards may be used anywhere inside a term.
+ Specifying a wildcard at the start of a term can produce a very
+ slow search.</para>
<para>You can use the <literal>show query</literal> link at the
top of the result list to check the exact query which was
@@ -2089,36 +2112,91 @@
will be given a file name as argument and should output the
text contents in html format on the standard output.</para>
- <para>The html could be very minimal like the following
- example:</para>
- <programlisting><html><head>
+ <para>You can find more details about writing a &RCL; filter
+ in the <link linkend="rcl.extending.filters">section about
+ writing filters</link>.</para>
+ </sect3>
+
+ </sect2>
+
+ </sect1>
+
+ <sect1 id="rcl.extending">
+ <title>Extending &RCL;</title>
+
+ <sect2 id="rcl.extending.filters">
+ <title>Writing a document filter</title>
+
+ <para>&RCL; filters are executable programs which
+ translate from a specific format (e.g.
+ <application>openoffice</application>,
+ <application>acrobat</application>, etc.) to the &RCL;
+ indexing input format, which was chosen to be HTML.</para>
+
+ <para>&RCL; filters are usually shell scripts, but this is
+ not a requirement. These programs are extremely simple and most
+ of the difficulty lies in extracting the text from the native
+ format, not in outputting what &RCL; expects. Fortunately,
+ most document formats already have translators or text
+ extractors which handle the difficult part and can be called
+ from the filter.</para>
+
+ <para>Filters are called with a single argument which is the
+ source file name. They should output the result to stdout.</para>
+
+ <para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
+ environment variable (values <literal>yes</literal>,
+ <literal>no</literal>) tells the filter whether the operation is
+ for indexing or previewing. Some filters use this to output a
+ slightly different format. Handling it is optional.</para>
+
+ <para>The output HTML can be minimal, as in the following
+ example:</para>
+
+ <programlisting>&lt;html&gt;&lt;head&gt;
&lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8"&gt;
&lt;/head&gt;
&lt;body&gt;some text content&lt;/body&gt;&lt;/html&gt;
</programlisting>
- <para>You should take care to escape some characters inside
+ <para>You should take care to escape some characters inside
the text by transforming them into appropriate
entities. "<literal>&amp;</literal>" should be transformed into
"<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
should be transformed into "<literal>&amp;lt;</literal>".</para>
- <para>The character set needs to be specified in the
+ <para>The character set needs to be specified in the
header. It does not need to be UTF-8 (&RCL; will take care
of translating it), but it must be accurate for good
results.</para>
- <para>&RCL; will also make use of other header fields if
+ <para>&RCL; will also make use of other header fields if
they are present: <literal>title</literal>,
- <literal>description</literal>, <literal>keywords</literal>.
- <para>
- <para>The easiest way to write a new filter is probably to start
+ <literal>description</literal>,
+ <literal>keywords</literal>.</para>
+
+ <para>As of &RCL; release 1.9, filters can also define
+ their own field names. Such fields should be output as
+ meta tags:</para>
+
+ <programlisting>
+&lt;meta name="somefield" content="Some textual data" /&gt;
+</programlisting>
+
+ <para>In this case, a correspondence between the field name and
+ the &XAP; prefix should also be added to the
+ <filename>mimeconf</filename> file. See the existing entries
+ for inspiration. The field can then be used inside the query
+ language to narrow searches.</para>
+
+ <para>The easiest way to write a new filter is probably to start
from an existing one.</para>
- </sect3>
-
+
+
</sect2>
</sect1>
+
</chapter>
</book>