|
a/src/doc/user/usermanual.sgml |
|
b/src/doc/user/usermanual.sgml |
|
... |
|
... |
22 |
<year>2005</year>
|
22 |
<year>2005</year>
|
23 |
<holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
|
23 |
<holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
|
24 |
Dockes</holder>
|
24 |
Dockes</holder>
|
25 |
</copyright>
|
25 |
</copyright>
|
26 |
|
26 |
|
27 |
<releaseinfo>$Id: usermanual.sgml,v 1.44 2007-06-08 16:46:53 dockes Exp $</releaseinfo>
|
27 |
<releaseinfo>$Id: usermanual.sgml,v 1.45 2007-06-26 16:58:25 dockes Exp $</releaseinfo>
|
28 |
|
28 |
|
29 |
<abstract>
|
29 |
<abstract>
|
30 |
<para>This document introduces full text search notions
|
30 |
<para>This document introduces full text search notions
|
31 |
and describes the installation and use of the &RCL; application.</para>
|
31 |
and describes the installation and use of the &RCL;
|
|
|
32 |
application. It currently describes &RCL; 1.9.</para>
|
32 |
</abstract>
|
33 |
</abstract>
|
33 |
|
34 |
|
34 |
|
35 |
|
35 |
</bookinfo>
|
36 |
</bookinfo>
|
36 |
|
37 |
|
|
... |
|
... |
768 |
and containing either <replaceable>beatles</replaceable> or
|
769 |
and containing either <replaceable>beatles</replaceable> or
|
769 |
<replaceable>lennon</replaceable> and either
|
770 |
<replaceable>lennon</replaceable> and either
|
770 |
<replaceable>live</replaceable> or
|
771 |
<replaceable>live</replaceable> or
|
771 |
<replaceable>unplugged</replaceable> but not
|
772 |
<replaceable>unplugged</replaceable> but not
|
772 |
<replaceable>potatoes</replaceable> (in any part of the document).</para>
|
773 |
<replaceable>potatoes</replaceable> (in any part of the document).</para>
|
773 |
|
|
|
774 |
<para>The first element <literal>author:"john doe"</literal> is
|
|
|
775 |
a phrase search limited to a specific field. Phrase searches are
|
|
|
776 |
specified as usual by enclosing the words in double quotes. The
|
|
|
777 |
field specification appears before the colon (of course this is
|
|
|
778 |
not limited to phrases, <literal>author:Balzac</literal> would
|
|
|
779 |
be ok too). &RCL; currently manages the following fields:</para>
|
|
|
780 |
|
|
|
781 |
<itemizedlist>
|
|
|
782 |
<listitem><para><literal>title</literal>,
|
|
|
783 |
<literal>subject</literal> or <literal>caption</literal> are
|
|
|
784 |
synonyms which specify data to be searched for in the
|
|
|
785 |
document title or subject.</para>
|
|
|
786 |
</listitem>
|
|
|
787 |
<listitem><para><literal>author</literal> or
|
|
|
788 |
<literal>from</literal> for searching the document's originators.</para>
|
|
|
789 |
</listitem>
|
|
|
790 |
<listitem><para><literal>keyword</literal> for searching the
|
|
|
791 |
document specified keywords (few documents actually have any).</para>
|
|
|
792 |
</listitem>
|
|
|
793 |
</itemizedlist>
|
|
|
794 |
|
|
|
795 |
<para>The query language is currently the only way to use the
|
|
|
796 |
&RCL; field search capability.</para>
|
|
|
797 |
|
774 |
|
798 |
<para>All elements in the search entry are normally combined
|
775 |
<para>All elements in the search entry are normally combined
|
799 |
with an implicit AND. It is possible to specify that elements be
|
776 |
with an implicit AND. It is possible to specify that elements be
|
800 |
OR'ed instead, as in <replaceable>Beatles</replaceable>
|
777 |
OR'ed instead, as in <replaceable>Beatles</replaceable>
|
801 |
<literal>OR</literal> <replaceable>Lennon</replaceable>. The
|
778 |
<literal>OR</literal> <replaceable>Lennon</replaceable>. The
|
|
... |
|
... |
815 |
parenthesis, they are not supported for now.</para>
|
792 |
parenthesis, they are not supported for now.</para>
|
816 |
|
793 |
|
817 |
<para>An entry preceded by a <literal>-</literal> specifies a
|
794 |
<para>An entry preceded by a <literal>-</literal> specifies a
|
818 |
term that should <emphasis>not</emphasis> appear.</para>
|
795 |
term that should <emphasis>not</emphasis> appear.</para>
|
819 |
|
796 |
|
|
|
797 |
<para>The first element in the above example,
|
|
|
798 |
<literal>author:"john doe"</literal> is a phrase search limited
|
|
|
799 |
to a specific field. Phrase searches are specified as usual by
|
|
|
800 |
enclosing the words in double quotes. The field specification
|
|
|
801 |
appears before the colon (of course this is not limited to
|
|
|
802 |
phrases, <literal>author:Balzac</literal> would be ok
|
|
|
803 |
too). &RCL; currently manages the following fields:</para>
|
|
|
804 |
<itemizedlist>
|
|
|
805 |
<listitem><para><literal>title</literal>,
|
|
|
806 |
<literal>subject</literal> or <literal>caption</literal> are
|
|
|
807 |
synonyms which specify data to be searched for in the
|
|
|
808 |
document title or subject.</para>
|
|
|
809 |
</listitem>
|
|
|
810 |
<listitem><para><literal>author</literal> or
|
|
|
811 |
<literal>from</literal> for searching the document's originators.</para>
|
|
|
812 |
</listitem>
|
|
|
813 |
<listitem><para><literal>keyword</literal> for searching the
|
|
|
814 |
document specified keywords (few documents actually have any).</para>
|
|
|
815 |
</listitem>
|
|
|
816 |
</itemizedlist>
|
|
|
817 |
|
|
|
818 |
<para>As of release 1.9, filters are able to
|
|
|
819 |
create other fields with arbitrary names. No standard filters
|
|
|
820 |
use this possibility yet.</para>
|
|
|
821 |
|
|
|
822 |
<para>There are two other elements which may be specified
|
|
|
823 |
through the field syntax, but are somewhat special:</para>
|
|
|
824 |
<itemizedlist>
|
|
|
825 |
<listitem><para><literal>ext</literal> for specifying the file
|
|
|
826 |
name extension (Ex: <literal>ext:html</literal>)</para>
|
|
|
827 |
</listitem>
|
|
|
828 |
<listitem><para><literal>mime</literal> for specifying the
|
|
|
829 |
mime type. This one is quite special because you can specify
|
|
|
830 |
several values which will be OR'ed (the normal default for the
|
|
|
831 |
language is AND). Ex: <literal>mime:text/plain
|
|
|
832 |
mime:text/html</literal>. Specifying an explicit boolean
|
|
|
833 |
operator or negation (<literal>-</literal>) before a
|
|
|
834 |
<literal>mime</literal> specification is not supported and
|
|
|
835 |
will produce strange results.</para>
|
|
|
836 |
</listitem>
|
|
|
837 |
</itemizedlist>
|
|
|
838 |
<para>The query language is currently the only way to use the
|
|
|
839 |
&RCL; field search capability.</para>
|
|
|
840 |
|
820 |
<para>Words inside phrases and capitalized words are not
|
841 |
<para>Words inside phrases and capitalized words are not
|
821 |
stem-expanded. Wildcards may be used anywhere.</para>
|
842 |
stem-expanded. Wildcards may be used anywhere inside a term.
|
|
|
843 |
Specifying a wildcard on the left of a term can produce a very
|
|
|
844 |
slow search.</para>
|
822 |
|
845 |
|
823 |
<para>You can use the <literal>show query</literal> link at the
|
846 |
<para>You can use the <literal>show query</literal> link at the
|
824 |
top of the result list to check the exact query which was
|
847 |
top of the result list to check the exact query which was
|
825 |
finally executed by Xapian.</para>
|
848 |
finally executed by Xapian.</para>
|
826 |
|
849 |
|
|
... |
|
... |
2087 |
be an executable program or script which exists inside
|
2110 |
be an executable program or script which exists inside
|
2088 |
<filename>/usr/[local/]share/recoll/filters</filename>. It
|
2111 |
<filename>/usr/[local/]share/recoll/filters</filename>. It
|
2089 |
will be given a file name as argument and should output the
|
2112 |
will be given a file name as argument and should output the
|
2090 |
text contents in html format on the standard output.</para>
|
2113 |
text contents in html format on the standard output.</para>
|
2091 |
|
2114 |
|
|
|
2115 |
<para>You can find more details about writing a &RCL; filter
|
|
|
2116 |
in the <link linkend="rcl.extending.filters">section about
|
|
|
2117 |
writing filters</link>.</para>
|
|
|
2118 |
</sect3>
|
|
|
2119 |
|
|
|
2120 |
</sect2>
|
|
|
2121 |
|
|
|
2122 |
</sect1>
|
|
|
2123 |
|
|
|
2124 |
<sect1 id="rcl.extending">
|
|
|
2125 |
<title>Extending &RCL;</title>
|
|
|
2126 |
|
|
|
2127 |
<sect2 id="rcl.extending.filters">
|
|
|
2128 |
<title>Writing a document filter</title>
|
|
|
2129 |
|
|
|
2130 |
<para>&RCL; filters are executable programs which
|
|
|
2131 |
translate from a specific format (e.g.
|
|
|
2132 |
<application>openoffice</application>,
|
|
|
2133 |
<application>acrobat</application>, etc.) to the &RCL;
|
|
|
2134 |
indexing input format, which was chosen to be HTML.</para>
|
|
|
2135 |
|
|
|
2136 |
<para>&RCL; filters are usually shell-scripts, but this is in
|
|
|
2137 |
no way necessary. These programs are extremely simple and most
|
|
|
2138 |
of the difficulty lies in extracting the text from the native
|
|
|
2139 |
format, not outputting what is expected by &RCL;. Happily
|
|
|
2140 |
enough, most document formats already have translators or text
|
|
|
2141 |
extractors which handle the difficult part and can be called
|
|
|
2142 |
from the filter.</para>
|
|
|
2143 |
|
|
|
2144 |
<para>Filters are called with a single argument which is the
|
|
|
2145 |
source file name. They should output the result to stdout.</para>
|
|
|
2146 |
|
|
|
2147 |
<para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
|
|
|
2148 |
environment variable (values <literal>yes</literal>,
|
|
|
2149 |
<literal>no</literal>) tells the filter if the operation is
|
|
|
2150 |
for indexing or previewing. Some filters use this to output a
|
|
|
2151 |
slightly different format. This is not essential.</para>
|
|
|
2152 |
|
2092 |
<para>The html could be very minimal like the following
|
2153 |
<para>The output HTML could be very minimal like the following
|
2093 |
example:</para>
|
2154 |
example:</para>
|
|
|
2155 |
|
2094 |
<programlisting><html><head>
|
2156 |
<programlisting><html><head>
|
2095 |
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
|
2157 |
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
|
2096 |
</head>
|
2158 |
</head>
|
2097 |
<body>some text content</body></html>
|
2159 |
<body>some text content</body></html>
|
2098 |
</programlisting>
|
2160 |
</programlisting>
|
2099 |
|
2161 |
|
2100 |
<para>You should take care to escape some characters inside
|
2162 |
<para>You should take care to escape some characters inside
|
2101 |
the text by transforming them into appropriate
|
2163 |
the text by transforming them into appropriate
|
2102 |
entities. "<literal>&</literal>" should be transformed into
|
2164 |
entities. "<literal>&</literal>" should be transformed into
|
2103 |
"<literal>&amp;</literal>", "<literal><</literal>"
|
2165 |
"<literal>&amp;</literal>", "<literal><</literal>"
|
2104 |
should be transformed into "<literal>&lt;</literal>".</para>
|
2166 |
should be transformed into "<literal>&lt;</literal>".</para>
|
2105 |
|
2167 |
|
2106 |
<para>The character set needs to be specified in the
|
2168 |
<para>The character set needs to be specified in the
|
2107 |
header. It does not need to be UTF-8 (&RCL; will take care
|
2169 |
header. It does not need to be UTF-8 (&RCL; will take care
|
2108 |
of translating it), but it must be accurate for good
|
2170 |
of translating it), but it must be accurate for good
|
2109 |
results.</para>
|
2171 |
results.</para>
|
2110 |
|
2172 |
|
2111 |
<para>&RCL; will also make use of other header fields if
|
2173 |
<para>&RCL; will also make use of other header fields if
|
2112 |
they are present: <literal>title</literal>,
|
2174 |
they are present: <literal>title</literal>,
|
2113 |
<literal>description</literal>, <literal>keywords</literal>.
|
2175 |
<literal>description</literal>,
|
2114 |
<para>
|
2176 |
<literal>keywords</literal>.</para>
|
|
|
2177 |
|
|
|
2178 |
<para>As of &RCL; release 1.9, filters also have the
|
|
|
2179 |
possibility to "invent" field names. These should be output as
|
|
|
2180 |
meta tags:</para>
|
|
|
2181 |
|
|
|
2182 |
<programlisting>
|
|
|
2183 |
<meta name="somefield" content="Some textual data" />
|
|
|
2184 |
</programlisting>
|
|
|
2185 |
|
|
|
2186 |
<para>In this case, a correspondence between field name and
|
|
|
2187 |
&XAP; prefix should also be added to the
|
|
|
2188 |
<filename>mimeconf</filename> file. See the existing entries
|
|
|
2189 |
for inspiration. The field can then be used inside the query
|
|
|
2190 |
language to narrow searches.</para>
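<para>As an illustration only, a filter written in Python could build such
a meta tag with a small helper like the one below. The function name
<literal>meta_tag</literal> is hypothetical, and escaping the attribute
values as HTML entities is an assumption about what the indexer expects,
not something this manual specifies. The matching
<filename>mimeconf</filename> entry is not shown.</para>

```python
import html


def meta_tag(name, content):
    """Render one <meta> element carrying a custom field value.

    Both attribute values are entity-escaped (including double quotes)
    so that arbitrary text cannot break out of the attribute.
    """
    return '<meta name="%s" content="%s" />' % (
        html.escape(name, quote=True),
        html.escape(content, quote=True),
    )


if __name__ == "__main__":
    # "somefield" is the placeholder field name used in the text above.
    print(meta_tag("somefield", "Some textual data"))
```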
|
|
|
2191 |
|
2115 |
<para>The easiest way to write a new filter is probably to start
|
2192 |
<para>The easiest way to write a new filter is probably to start
|
2116 |
from an existing one.</para>
|
2193 |
from an existing one.</para>
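<para>For readers who have no existing filter at hand, the points above
can be put together in a minimal sketch. It is written in Python rather
than as a shell script, and it assumes the input is a plain text file
encoded in UTF-8. The names <literal>to_html</literal> and
<literal>main</literal> are illustrative and not part of &RCL;. A real
filter would call an external text extractor instead of reading the file
directly.</para>

```python
#!/usr/bin/env python3
"""Minimal filter sketch: one file name in, HTML on standard output."""
import html
import sys


def to_html(text, title=""):
    """Wrap extracted text in a minimal HTML document.

    The charset is declared in the header, and "&", "<" and ">" in the
    text are turned into entities.
    """
    head = '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">\n'
    if title:
        head += "<title>" + html.escape(title, quote=False) + "</title>\n"
    body = html.escape(text, quote=False)
    return "<html><head>\n" + head + "</head>\n<body>" + body + "</body></html>\n"


def main(path):
    # Assumption: the source file is UTF-8 text; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    # Write bytes so the output really is UTF-8, whatever the locale says.
    sys.stdout.buffer.write(to_html(text).encode("utf-8"))


# The filter receives the source file name as its single argument.
# Nothing happens when the script is started without one.
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```

<para>Writing the output as encoded bytes keeps the declared character
set and the actual encoding in agreement, which matters because the
declared character set must be accurate.</para>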
|
2117 |
</sect3>
|
|
|
2118 |
|
2194 |
|
|
|
2195 |
|
2119 |
</sect2>
|
2196 |
</sect2>
|
2120 |
|
2197 |
|
2121 |
</sect1>
|
2198 |
</sect1>
|
|
|
2199 |
|
2122 |
</chapter>
|
2200 |
</chapter>
|
2123 |
|
2201 |
|
2124 |
</book>
|
2202 |
</book>
|
2125 |
|
2203 |
|