a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml
...
...
22
      <year>2005</year>
22
      <year>2005</year>
23
      <holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
23
      <holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
24
      Dockes</holder>
24
      Dockes</holder>
25
    </copyright>
25
    </copyright>
26
26
27
    <releaseinfo>$Id: usermanual.sgml,v 1.44 2007-06-08 16:46:53 dockes Exp $</releaseinfo>
27
    <releaseinfo>$Id: usermanual.sgml,v 1.45 2007-06-26 16:58:25 dockes Exp $</releaseinfo>
28
28
29
    <abstract>
29
    <abstract>
30
      <para>This document introduces full text search notions
30
      <para>This document introduces full text search notions
31
      and describes the installation and use of the &RCL; application.</para>
31
      and describes the installation and use of the &RCL;
32
      application. It currently describes &RCL; 1.9.</para>
32
    </abstract>
33
    </abstract>
33
34
34
35
35
  </bookinfo>
36
  </bookinfo>
36
  
37
  
...
...
768
      and containing either <replaceable>beatles</replaceable> or
769
      and containing either <replaceable>beatles</replaceable> or
769
      <replaceable>lennon</replaceable> and either
770
      <replaceable>lennon</replaceable> and either
770
      <replaceable>live</replaceable> or
771
      <replaceable>live</replaceable> or
771
      <replaceable>unplugged</replaceable> but not
772
      <replaceable>unplugged</replaceable> but not
772
      <replaceable>potatoes</replaceable> (in any part of the document).</para>
773
      <replaceable>potatoes</replaceable> (in any part of the document).</para>
773
774
      <para>The first element <literal>author:"john doe"</literal> is
775
      a phrase search limited to a specific field. Phrase searches are
776
      specified as usual by enclosing the words in double quotes. The
777
      field specification appears before the colon (of course this is
778
      not limited to phrases, <literal>author:Balzac</literal> would
779
      be ok too). &RCL; currently manages the following fields:</para>
780
781
      <itemizedlist>
782
  <listitem><para><literal>title</literal>,
783
  <literal>subject</literal> or <literal>caption</literal> are
784
  synonyms which specify data to be searched for in the
785
  document title or subject.</para>
786
  </listitem>
787
  <listitem><para><literal>author</literal> or
788
  <literal>from</literal> for searching the document's originators.</para>
789
  </listitem>
790
  <listitem><para><literal>keyword</literal> for searching the
791
  document-specified keywords (few documents actually have any).</para>
792
  </listitem>
793
      </itemizedlist>
794
795
      <para>The query language is currently the only way to use the
796
      &RCL; field search capability.</para>
797
774
798
      <para>All elements in the search entry are normally combined
775
      <para>All elements in the search entry are normally combined
799
      with an implicit AND. It is possible to specify that elements be
776
      with an implicit AND. It is possible to specify that elements be
800
      OR'ed instead, as in <replaceable>Beatles</replaceable>
777
      OR'ed instead, as in <replaceable>Beatles</replaceable>
801
      <literal>OR</literal> <replaceable>Lennon</replaceable>. The
778
      <literal>OR</literal> <replaceable>Lennon</replaceable>. The
...
...
815
      parenthesis, they are not supported for now.</para>
792
      parenthesis, they are not supported for now.</para>
816
793
817
      <para>An entry preceded by a <literal>-</literal> specifies a
794
      <para>An entry preceded by a <literal>-</literal> specifies a
818
      term that should <emphasis>not</emphasis> appear.</para>
795
      term that should <emphasis>not</emphasis> appear.</para>
819
796
797
      <para>The first element in the above example,
798
      <literal>author:"john doe"</literal> is a phrase search limited
799
      to a specific field. Phrase searches are specified as usual by
800
      enclosing the words in double quotes. The field specification
801
      appears before the colon (of course this is not limited to
802
      phrases, <literal>author:Balzac</literal> would be ok
803
      too). &RCL; currently manages the following fields:</para>
804
      <itemizedlist>
805
  <listitem><para><literal>title</literal>,
806
  <literal>subject</literal> or <literal>caption</literal> are
807
  synonyms which specify data to be searched for in the
808
  document title or subject.</para>
809
  </listitem>
810
  <listitem><para><literal>author</literal> or
811
  <literal>from</literal> for searching the document's originators.</para>
812
  </listitem>
813
  <listitem><para><literal>keyword</literal> for searching the
814
  document-specified keywords (few documents actually have any).</para>
815
  </listitem>
816
      </itemizedlist>
817
818
      <para>As of release 1.9, filters can create additional
819
      fields with arbitrary names. No standard filter uses this
820
      capability yet.</para>
821
822
      <para>There are two other elements which may be specified
823
      through the field syntax, but are somewhat special:</para>
824
      <itemizedlist>
825
  <listitem><para><literal>ext</literal> for specifying the file
826
  name extension (Ex: <literal>ext:html</literal>)</para>
827
  </listitem>
828
  <listitem><para><literal>mime</literal> for specifying the
829
  mime type. This one is quite special because you can specify
830
  several values which will be OR'ed (the normal default for the
831
  language is AND). Ex: <literal>mime:text/plain
832
  mime:text/html</literal>. Specifying an explicit boolean
833
  operator or negation (<literal>-</literal>) before a
834
  <literal>mime</literal> specification is not supported and
835
  will produce strange results.</para>
836
  </listitem>
837
      </itemizedlist>
838
      <para>The query language is currently the only way to use the
839
      &RCL; field search capability.</para>
840
820
      <para>Words inside phrases and capitalized words are not
841
      <para>Words inside phrases and capitalized words are not
821
      stem-expanded. Wildcards may be used anywhere.</para>
842
      stem-expanded. Wildcards may be used anywhere inside a term.
843
      Specifying a wildcard at the start of a term can produce a very
844
      slow search.</para>
822
845
823
      <para>You can use the <literal>show query</literal> link at the
846
      <para>You can use the <literal>show query</literal> link at the
824
      top of the result list to check the exact query which was
847
      top of the result list to check the exact query which was
825
      finally executed by Xapian.</para>
848
      finally executed by Xapian.</para>
826
849
...
...
2087
      be an executable program or script which exists inside
2110
      be an executable program or script which exists inside
2088
      <filename>/usr/[local/]share/recoll/filters</filename>. It
2111
      <filename>/usr/[local/]share/recoll/filters</filename>. It
2089
      will be given a file name as argument and should output the
2112
      will be given a file name as argument and should output the
2090
      text contents in html format on the standard output.</para>
2113
      text contents in html format on the standard output.</para>
2091
2114
2115
    <para>You can find more details about writing a &RCL; filter
2116
    in the <link linkend="rcl.extending.filters">section about
2117
    writing filters</link>.</para>
2118
  </sect3>
2119
2120
      </sect2>
2121
2122
    </sect1>
2123
2124
    <sect1 id="rcl.extending">
2125
      <title>Extending &RCL;</title>
2126
      
2127
      <sect2 id="rcl.extending.filters">
2128
  <title>Writing a document filter</title>
2129
2130
  <para>&RCL; filters are executable programs which 
2131
  translate from a specific format (e.g.
2132
  <application>openoffice</application>,
2133
  <application>acrobat</application>, etc.) to the &RCL;
2134
  indexing input format, which was chosen to be HTML.</para>
2135
2136
  <para>&RCL; filters are usually shell-scripts, but this is in
2137
  no way necessary. These programs are extremely simple and most
2138
  of the difficulty lies in extracting the text from the native
2139
  format, not outputting what is expected by &RCL;. Happily
2140
  enough, most document formats already have translators or text
2141
  extractors which handle the difficult part and can be called
2142
  from the filter.</para>
2143
2144
  <para>Filters are called with a single argument which is the
2145
  source file name. They should output the result to stdout.</para>
2146
2147
  <para>The <literal>RECOLL_FILTER_FORPREVIEW</literal>
2148
  environment variable (values <literal>yes</literal>,
2149
  <literal>no</literal>) tells the filter if the operation is
2150
  for indexing or previewing. Some filters use this to output a
2151
  slightly different format. This is not essential.</para>
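The preview/indexing switch described above can be read as in the following minimal Python sketch. The helper name `is_preview` is hypothetical, and treating an unset variable as "no" is an assumption made here, not something the manual states.

```python
import os


def is_preview(environ=None):
    """Return True when the filter is being run to build a preview."""
    env = os.environ if environ is None else environ
    # The manual lists "yes" and "no" as the possible values. Falling back
    # to "no" (indexing) when the variable is unset is an assumption.
    return env.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
```

A filter could call `is_preview()` once at startup and, for example, keep more layout in its output when the result is true.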
2152
2092
      <para>The html could be very minimal like the following
2153
    <para>The output HTML could be very minimal like the following
2093
      example:</para>
2154
    example:</para>
2155
2094
      <programlisting>&lt;html>&lt;head>
2156
    <programlisting>&lt;html>&lt;head>
2095
&lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
2157
&lt;meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
2096
&lt;/head>
2158
&lt;/head>
2097
&lt;body>some text content&lt;/body>&lt;/html>
2159
&lt;body>some text content&lt;/body>&lt;/html>
2098
          </programlisting>
2160
          </programlisting>
2099
2161
2100
      <para>You should take care to escape some characters inside
2162
    <para>You should take care to escape some characters inside
2101
      the text by transforming them into appropriate
2163
      the text by transforming them into appropriate
2102
      entities. "<literal>&amp;</literal>" should be transformed into
2164
      entities. "<literal>&amp;</literal>" should be transformed into
2103
      "<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
2165
      "<literal>&amp;amp;</literal>", "<literal>&lt;</literal>"
2104
      should be transformed into "<literal>&amp;lt;</literal>".</para>
2166
      should be transformed into "<literal>&amp;lt;</literal>".</para>
2105
2167
2106
      <para>The character set needs to be specified in the
2168
    <para>The character set needs to be specified in the
2107
      header. It does not need to be UTF-8 (&RCL; will take care
2169
      header. It does not need to be UTF-8 (&RCL; will take care
2108
      of translating it), but it must be accurate for good
2170
      of translating it), but it must be accurate for good
2109
      results.</para>
2171
      results.</para>
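As an illustration of the output rules above (single file name argument, HTML on standard output, escaped "&" and "<", accurate charset in the header), here is a minimal sketch of a filter for plain text input, written in Python. It is not one of the filters shipped with Recoll; the function names are hypothetical, and reading the input as UTF-8 is an assumption.

```python
import sys


def to_recoll_html(text, charset="UTF-8"):
    """Wrap already-extracted text in the minimal HTML described above."""
    # Escape "&" first so that the "&" of "&lt;" is not escaped a second time.
    escaped = text.replace("&", "&amp;").replace("<", "&lt;")
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" '
        'content="text/html;charset=' + charset + '">\n'
        "</head>\n"
        "<body>" + escaped + "</body></html>\n"
    )


def run_filter(path):
    """Read the source file and write the HTML to standard output."""
    # Assumption: the input is UTF-8 text. A real filter would call a
    # format-specific extractor here instead of reading the file directly.
    with open(path, encoding="utf-8", errors="replace") as source:
        text = source.read()
    # Write bytes so that the declared charset matches the actual encoding.
    sys.stdout.buffer.write(to_recoll_html(text).encode("utf-8"))


# Filters receive exactly one argument: the source file name.
if __name__ == "__main__" and len(sys.argv) == 2:
    run_filter(sys.argv[1])
```

Encoding the output explicitly keeps the header truthful even when the terminal or locale encoding differs from UTF-8.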
2110
2172
2111
      <para>&RCL; will also make use of other header fields if
2173
    <para>&RCL; will also make use of other header fields if
2112
      they are present: <literal>title</literal>,
2174
      they are present: <literal>title</literal>,
2113
    <literal>description</literal>, <literal>keywords</literal>.
2175
    <literal>description</literal>,
2114
          <para>
2176
    <literal>keywords</literal>.</para>
2177
2178
  <para>As of &RCL; release 1.9, filters also have the
2179
  possibility to "invent" field names. These should be output as
2180
  meta tags:</para>
2181
2182
  <programlisting>
2183
&lt;meta name="somefield" content="Some textual data" /&gt;
2184
</programlisting>
2185
  
2186
  <para>In this case, a correspondence between field name and
2187
  &XAP; prefix should also be added to the
2188
  <filename>mimeconf</filename> file. See the existing entries
2189
  for inspiration. The field can then be used inside the query
2190
  language to narrow searches.</para>
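A filter written in Python might build such a meta tag as sketched below. The helper name `meta_tag` is hypothetical; the matching `mimeconf` entry is not shown because its syntax is not described here.

```python
import html


def meta_tag(field, content):
    """Build a meta tag carrying a filter-defined field."""
    # html.escape with quote=True also escapes double quotes, which keeps
    # the attribute values well-formed.
    return '<meta name="{}" content="{}" />'.format(
        html.escape(field, quote=True), html.escape(content, quote=True)
    )
```

The returned line would go inside the `head` element of the filter output, next to the Content-Type declaration.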
2191
2115
          <para>The easiest way to write a new filter is probably to start
2192
  <para>The easiest way to write a new filter is probably to start
2116
          from an existing one.</para>
2193
          from an existing one.</para>
2117
  </sect3>
2118
2194
2195
  
2119
      </sect2>
2196
      </sect2>
2120
2197
2121
    </sect1>
2198
    </sect1>
2199
2122
  </chapter>
2200
  </chapter>
2123
2201
2124
</book>
2202
</book>
2125
2203