|
a/src/doc/user/usermanual.sgml |
|
b/src/doc/user/usermanual.sgml |
|
... |
|
... |
138 |
languages in the same index is possible, and useful in
|
138 |
languages in the same index is possible, and useful in
|
139 |
practice, but does introduce possibilities of confusion. &RCL;
|
139 |
practice, but does introduce possibilities of confusion. &RCL;
|
140 |
currently makes no attempt at automatic language recognition.</para>
|
140 |
currently makes no attempt at automatic language recognition.</para>
|
141 |
|
141 |
|
142 |
<para>&RCL; has many parameters which define exactly what to
|
142 |
<para>&RCL; has many parameters which define exactly what to
|
143 |
index, and how to classify and decode the source
|
143 |
index, and how to classify and decode the source documents. These
|
144 |
documents. These are kept in <link
|
|
|
145 |
linkend="rcl.indexing.config">configuration files</link>. A
|
144 |
are kept in <link linkend="rcl.indexing.config">configuration
|
146 |
default configuration is copied into a standard location
|
145 |
files</link>. A default configuration is copied into a standard
|
147 |
(usually something like
|
146 |
location (usually something like
|
148 |
<filename>/usr/[local/]share/recoll/examples</filename>)
|
147 |
<filename>/usr/[local/]share/recoll/examples</filename>) during
|
149 |
during installation. The default parameters from this file may
|
148 |
installation. The default parameters from this file may be
|
150 |
be overridden by values that you set inside your personal
|
149 |
overridden by values that you set inside your personal
|
151 |
configuration, found by default in the
|
150 |
configuration, found by default in the <filename>.recoll</filename>
|
152 |
<filename>.recoll</filename> sub-directory of your home
|
151 |
sub-directory of your home directory. The default configuration
|
153 |
directory. The default configuration will index your home
|
152 |
will index your home directory with default parameters and should
|
154 |
directory with default parameters and should be sufficient for
|
|
|
155 |
giving &RCL; a try, but you may want to adjust it
|
153 |
be sufficient for giving &RCL; a try, but you may want to adjust it
|
|
|
154 |
later, which can be done either by editing the text files or by
|
|
|
155 |
using configuration menus in the <command>recoll</command>
|
156 |
later.</para>
|
156 |
GUI.</para>
|
157 |
|
157 |
|
158 |
<para><link linkend="rcl.indexing.periodic.exec">Indexing</link>
|
158 |
<para><link linkend="rcl.indexing.periodic.exec">Indexing</link>
|
159 |
is started automatically the first time you execute the
|
159 |
is started automatically the first time you execute the
|
160 |
<command>recoll</command> search graphical user interface, or by
|
160 |
<command>recoll</command> search graphical user interface, or by
|
161 |
executing the <command>recollindex</command> command.</para>
|
161 |
executing the <command>recollindex</command> command.</para>
|
|
... |
|
... |
182 |
<title>Introduction</title>
|
182 |
<title>Introduction</title>
|
183 |
|
183 |
|
184 |
<para>Indexing is the process by which the set of documents is
|
184 |
<para>Indexing is the process by which the set of documents is
|
185 |
analyzed and the data entered into the database. &RCL; indexing
|
185 |
analyzed and the data entered into the database. &RCL; indexing
|
186 |
is normally incremental: documents will only be processed if
|
186 |
is normally incremental: documents will only be processed if
|
187 |
they have been modified. On the first execution, of course, all
|
187 |
they have been modified. On the first execution, all
|
188 |
documents will need processing. A full index build can be forced
|
188 |
documents will need processing. A full index build can be forced
|
189 |
later by specifying an option to the indexing command
|
189 |
later by specifying an option to the indexing command
|
190 |
(<command>recollindex -z</command>).</para>
|
190 |
(<command>recollindex -z</command>).</para>
|
191 |
|
191 |
|
192 |
<para>&RCL; indexing can be performed with two different
|
192 |
<para>&RCL; indexing can be performed with two different
|
|
... |
|
... |
236 |
deep, and &RCL; has no problem processing, for example, an ms-word
|
236 |
deep, and &RCL; has no problem processing, for example, an ms-word
|
237 |
document which would be an attachment to an email message part of
|
237 |
document which would be an attachment to an email message part of
|
238 |
a folder file archived inside a zip file...</para>
|
238 |
a folder file archived inside a zip file...</para>
|
239 |
|
239 |
|
240 |
<para>&RCL; indexing processes plain text, HTML, openoffice
|
240 |
<para>&RCL; indexing processes plain text, HTML, openoffice
|
241 |
and e-mail files internally (a few more actually).</para>
|
241 |
and e-mail files, and a few others internally.</para>
|
242 |
|
242 |
|
243 |
<para>Other file types (e.g. PostScript, PDF, MS Word, RTF...)
|
243 |
<para>Other file types (e.g. PostScript, PDF, MS Word, RTF...)
|
244 |
need external applications for preprocessing. The list is in the
|
244 |
need external applications for preprocessing. The list is in the
|
245 |
<link linkend="rcl.install.external"> installation</link>
|
245 |
<link linkend="rcl.install.external"> installation</link>
|
246 |
section. After every indexing operation, &RCL; updates a list of
|
246 |
section. After every indexing operation, &RCL; updates a list of
|
|
... |
|
... |
340 |
run, and it can always be destroyed safely.</para>
|
340 |
run, and it can always be destroyed safely.</para>
|
341 |
|
341 |
|
342 |
<sect2 id="rcl.indexing.storage.format">
|
342 |
<sect2 id="rcl.indexing.storage.format">
|
343 |
<title>Xapian index formats</title>
|
343 |
<title>Xapian index formats</title>
|
344 |
|
344 |
|
345 |
<para>If your first installation of &RCL; was 1.9.0 or more
|
345 |
<para>&XAP; versions usually support several formats for index
|
346 |
recent, you can skip this section.</para>
|
346 |
storage. A given major &XAP; version will have a current format,
|
347 |
|
347 |
used to create new indexes, and will also support the format from
|
348 |
<para>&XAP; has had two possible index formats for quite some
|
348 |
the previous major version.</para>
|
349 |
time. The "old" one named <literal>Quartz</literal>, and the
|
|
|
350 |
new one named <literal>Flint</literal>. &XAP; 0.9 used
|
|
|
351 |
<literal>Quartz</literal> by default, but could use
|
|
|
352 |
<literal>Flint</literal> if a specific environment variable
|
|
|
353 |
(<literal>XAPIAN_PREFER_FLINT</literal>) was set. &XAP; 1.0
|
|
|
354 |
still supports <literal>Quartz</literal> but will use
|
|
|
355 |
<literal>Flint</literal> by default for new index
|
|
|
356 |
creations.</para>
|
|
|
357 |
|
|
|
358 |
<para>The number of disk accesses performed during indexing
|
|
|
359 |
has been much optimized in the new <literal>Flint</literal>
|
|
|
360 |
engine and you may see indexing times improved by 50% in some
|
|
|
361 |
cases (compared to <literal>Quartz</literal>), typically for
|
|
|
362 |
big indexes where disk accesses dominate the indexing
|
|
|
363 |
time. There is also a more modest improvement of index
|
|
|
364 |
size.</para>
|
|
|
365 |
|
349 |
|
366 |
<para>&XAP; will not automatically convert an existing index
|
350 |
<para>&XAP; will not automatically convert an existing index
|
367 |
from the <literal>Quartz</literal> to the
|
351 |
from the older format to the newer one. If you want to upgrade to
|
368 |
<literal>Flint</literal> format. If you have an older index
|
352 |
the new format, or if a very old index needs to be converted
|
369 |
and want to take advantage of the new format (which can be
|
353 |
because its format is not supported any more, you will have to
|
370 |
done without setting the environment variable as of &RCL;
|
|
|
371 |
1.8.2 and &XAP; 1.0.0), you will have to explicitly delete
|
|
|
372 |
the old index, then run a normal indexing process.</para>
|
354 |
explicitly delete the old index, then run a normal indexing
|
|
|
355 |
process.</para>
|
373 |
|
356 |
|
374 |
<para>Unfortunately, using the <literal>-z</literal> option to
|
357 |
<para>Unfortunately, using the <literal>-z</literal> option to
|
375 |
<command>recollindex</command> is not sufficient to change the
|
358 |
<command>recollindex</command> is not sufficient to change the
|
376 |
format, you have to delete all files inside the index
|
359 |
format: you will have to delete all files inside the index
|
377 |
directory (typically <filename>~/.recoll/xapiandb</filename>)
|
360 |
directory (typically <filename>~/.recoll/xapiandb</filename>)
|
378 |
before starting indexing.</para>
|
361 |
before starting the indexing.</para>
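The procedure described above (empty the index directory, then run a normal indexing pass) could be scripted roughly as follows. This is only a sketch: the helper names `clear_index_dir` and `rebuild_index` are made up for illustration, the index location is the typical default quoted above, and `recollindex` is assumed to be on the `PATH` and to use the default configuration.

```python
import shutil
import subprocess
from pathlib import Path


def clear_index_dir(index_dir: Path) -> int:
    """Delete every entry inside index_dir, keeping the directory itself.

    Returns the number of top-level entries removed.
    """
    removed = 0
    for entry in index_dir.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)   # real sub-directory: remove recursively
        else:
            entry.unlink()         # plain file or symbolic link
        removed += 1
    return removed


def rebuild_index(index_dir: Path) -> None:
    """Empty the index directory, then run a normal indexing pass."""
    clear_index_dir(index_dir)
    # Assumption: recollindex is on PATH and the default configuration applies.
    subprocess.run(["recollindex"], check=True)
```

A typical call would be `rebuild_index(Path.home() / ".recoll" / "xapiandb")`. Since this destroys the existing index, it is worth checking the path before running it.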
|
379 |
|
362 |
|
380 |
</sect2>
|
363 |
</sect2>
|
381 |
|
364 |
|
382 |
<sect2 id="rcl.indexing.storage.security">
|
365 |
<sect2 id="rcl.indexing.storage.security">
|
383 |
<title>Security aspects</title>
|
366 |
<title>Security aspects</title>
|
|
... |
|
... |
385 |
<para>The &RCL; index does not hold copies of the indexed
|
368 |
<para>The &RCL; index does not hold copies of the indexed
|
386 |
documents. But it does hold enough data to allow for an almost
|
369 |
documents. But it does hold enough data to allow for an almost
|
387 |
complete reconstruction. If confidential data is indexed,
|
370 |
complete reconstruction. If confidential data is indexed,
|
388 |
access to the database directory should be restricted. </para>
|
371 |
access to the database directory should be restricted. </para>
|
389 |
|
372 |
|
390 |
<para>As of version 1.4, &RCL; will create the configuration
|
373 |
<para>&RCL; (since version 1.4) will create the configuration
|
391 |
directory with a mode of 0700 (access by owner only). As the
|
374 |
directory with a mode of 0700 (access by owner only). As the
|
392 |
index data directory is by default a sub-directory of the
|
375 |
index data directory is by default a sub-directory of the
|
393 |
configuration directory, this should result in appropriate
|
376 |
configuration directory, this should result in appropriate
|
394 |
protection.</para>
|
377 |
protection.</para>
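If the configuration directory was created by an older version, or its mode was changed by hand, the owner-only protection can be checked with a few lines of Python. This is a minimal sketch; `is_owner_only` is a hypothetical helper, not part of &RCL;.

```python
import os
import stat


def is_owner_only(path: str) -> bool:
    """True if neither group nor others have any permission bit on path."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return (mode & (stat.S_IRWXG | stat.S_IRWXO)) == 0
```

For example, `is_owner_only(os.path.expanduser("~/.recoll"))` should return `True` for a directory with mode 0700. If it does not, `os.chmod(path, 0o700)` restores the restrictive mode.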
|
395 |
|
378 |
|
|
... |
|
... |
509 |
|
492 |
|
510 |
<sect2 id="rcl.indexing.periodic.exec">
|
493 |
<sect2 id="rcl.indexing.periodic.exec">
|
511 |
<title>Running indexing</title>
|
494 |
<title>Running indexing</title>
|
512 |
|
495 |
|
513 |
<para>Indexing is performed either by the
|
496 |
<para>Indexing is performed either by the
|
514 |
<command>recollindex</command> program, or by the
|
497 |
<command>recollindex</command> program, or by the indexing thread
|
515 |
indexing thread inside the <command>recoll</command>
|
498 |
inside the <command>recoll</command> program (start it from the
|
516 |
program (use the <guimenu>File</guimenu> menu). Both programs
|
499 |
<guimenu>File</guimenu> menu). Both programs will use the
|
517 |
will use the <literal>RECOLL_CONFDIR</literal>
|
500 |
<literal>RECOLL_CONFDIR</literal> variable or accept a
|
518 |
variable or accept a <literal>-c</literal>
|
501 |
<literal>-c</literal> <replaceable>confdir</replaceable> option
|
519 |
<replaceable>confdir</replaceable> option to specify a non-default
|
|
|
520 |
configuration directory.</para>
|
502 |
to specify a non-default configuration directory.</para>
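The two ways of selecting a non-default configuration directory can be sketched as follows when starting the indexer from a script. The function name `recollindex_invocation` and the example path are illustrative only; the `-c` option and the `RECOLL_CONFDIR` variable are the ones described above.

```python
import os


def recollindex_invocation(confdir=None, use_env=False):
    """Return (argv, env) for running recollindex with an optional confdir.

    With use_env=False the directory is passed with -c; with use_env=True
    it is passed through the RECOLL_CONFDIR environment variable instead.
    """
    argv = ["recollindex"]
    env = dict(os.environ)          # copy, so the caller's environment is untouched
    if confdir is not None:
        if use_env:
            env["RECOLL_CONFDIR"] = confdir
        else:
            argv += ["-c", confdir]
    return argv, env
```

The result can be handed to `subprocess.run(argv, env=env, check=True)`, assuming `recollindex` is on the `PATH`.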
|
521 |
|
503 |
|
522 |
<para>Reasons to use either the indexing thread or the
|
504 |
<para>There are reasons to use either the indexing thread or the
|
523 |
<command>recollindex</command> command:
|
505 |
<command>recollindex</command> command, but it is also a matter of
|
|
|
506 |
personal preference:
|
524 |
<itemizedlist>
|
507 |
<itemizedlist>
|
525 |
<listitem><para>Starting the indexing thread is more convenient,
|
508 |
<listitem><para>Starting the indexing thread is more convenient,
|
526 |
being just one click away.</para>
|
509 |
being just one click away.</para>
|
527 |
</listitem>
|
510 |
</listitem>
|
528 |
<listitem><para>The <command>recollindex</command> command has
|
511 |
<listitem><para>The <command>recollindex</command> command has
|
|
... |
|
... |
532 |
<listitem><para>The <command>recollindex</command> command will
|
515 |
<listitem><para>The <command>recollindex</command> command will
|
533 |
not take down your GUI if it crashes (a rare occurrence,
|
516 |
not take down your GUI if it crashes (a rare occurrence,
|
534 |
but who knows...)</para>
|
517 |
but who knows...)</para>
|
535 |
</listitem>
|
518 |
</listitem>
|
536 |
<listitem><para>The <command>recollindex</command> command uses
|
519 |
<listitem><para>The <command>recollindex</command> command uses
|
537 |
<command>setpriority/nice</command> to lower its priority while
|
520 |
<command>setpriority/nice</command> to lower its priority
|
538 |
indexing
|
521 |
while indexing. When available (and for &RCL; version
|
539 |
(it will also use <command>ionice</command> when this becomes
|
522 |
1.16.2 and newer), it also uses the
|
|
|
523 |
<command>ionice</command> command to lower its IO
|
540 |
more widely available), the thread can't do it, else it would
|
524 |
priority. The indexing thread cannot do this, as it would also slow
|
541 |
also slow down the user/search interface.</para>
|
525 |
down the user/search interface.</para>
|
542 |
</listitem>
|
526 |
</listitem>
|
543 |
</itemizedlist>
|
527 |
</itemizedlist>
|
544 |
I'll let the reader decide where my heart belongs...</para>
|
528 |
</para>
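The priority lowering mentioned in the last item can be illustrated with the equivalent call from Python's standard library. This is only a sketch of the `setpriority` mechanism, not the code used by `recollindex`: the niceness value of 19 is an assumption, and the I/O priority part is left out.

```python
import os


def lower_priority(niceness: int = 19) -> int:
    """Set the nice value of the current process and return the value in effect.

    An unprivileged process can raise its nice value (lower its priority)
    but cannot lower it again afterwards.
    """
    os.setpriority(os.PRIO_PROCESS, 0, niceness)   # 0 means "the calling process"
    return os.getpriority(os.PRIO_PROCESS, 0)
```

Because the setting is made for the process that calls it, doing this from inside the GUI program would slow down searching as well as indexing.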
|
545 |
|
529 |
|
546 |
<para>If the <command>recoll</command> program finds no index
|
530 |
<para>If the <command>recoll</command> program finds no index
|
547 |
when it starts, it will automatically start indexing (except
|
531 |
when it starts, it will automatically start indexing (except
|
548 |
if canceled).</para>
|
532 |
if canceled).</para>
|
549 |
|
533 |
|
|
... |
|
... |
629 |
<para>The real time indexing support can be customised during package
|
613 |
<para>The real time indexing support can be customised during package
|
630 |
<link linkend="rcl.install.building.build">configuration</link>
|
614 |
<link linkend="rcl.install.building.build">configuration</link>
|
631 |
with the <literal>--with[out]-fam</literal> or
|
615 |
with the <literal>--with[out]-fam</literal> or
|
632 |
<literal>--with[out]-inotify</literal> options. The default is
|
616 |
<literal>--with[out]-inotify</literal> options. The default is
|
633 |
currently to include inotify monitoring on systems that support
|
617 |
currently to include inotify monitoring on systems that support
|
634 |
it.</para>
|
618 |
it, and, as of &RCL; 1.17, gamin support on FreeBSD.</para>
|
635 |
|
619 |
|
636 |
<para>The <filename>rclmon.sh</filename> script can be used to
|
620 |
<para>The <filename>rclmon.sh</filename> script can be used to
|
637 |
easily start and stop the daemon. It can be found in the
|
621 |
easily start and stop the daemon. It can be found in the
|
638 |
<filename>examples</filename> directory (typically
|
622 |
<filename>examples</filename> directory (typically
|
639 |
<filename>/usr/[local/]share/recoll/examples</filename>).</para>
|
623 |
<filename>/usr/[local/]share/recoll/examples</filename>).</para>
|
|
... |
|
... |
1309 |
|
1293 |
|
1310 |
<sect2 id="rcl.search.sort">
|
1294 |
<sect2 id="rcl.search.sort">
|
1311 |
<title>Sorting search results and collapsing duplicates</title>
|
1295 |
<title>Sorting search results and collapsing duplicates</title>
|
1312 |
|
1296 |
|
1313 |
<para>The documents in a result list are normally sorted in
|
1297 |
<para>The documents in a result list are normally sorted in
|
1314 |
order of relevance. It is possible to specify different sort
|
1298 |
order of relevance. It is possible to specify a different sort
|
1315 |
parameters by using the <guimenu>Sort parameters</guimenu>
|
1299 |
order, either by using the vertical arrows in the GUI toolbox to
|
1316 |
dialog (located in the <guimenu>Tools</guimenu> menu).</para>
|
1300 |
sort by date, or switching to the result table display and clicking
|
1317 |
|
1301 |
on any header. The sort order chosen inside the result table
|
1318 |
<para>The tool sorts a specified number of the most
|
1302 |
remains active if you switch back to the result list, until you
|
1319 |
relevant documents in the result list, according to specified
|
1303 |
click the vertical arrows so that both are unchecked (you are then
|
1320 |
criteria. The currently available criteria are
|
1304 |
back to sort by relevance).</para>
|
1321 |
<emphasis>date</emphasis> and <emphasis>mime
|
|
|
1322 |
type</emphasis>.</para>
|
|
|
1323 |
|
|
|
1324 |
<para>The sort parameters stay in effect until they are
|
|
|
1325 |
explicitly reset, or the program exits. An activated sort is
|
|
|
1326 |
indicated in the result list header.</para>
|
|
|
1327 |
|
1305 |
|
1328 |
<para>Sort parameters are remembered between program
|
1306 |
<para>Sort parameters are remembered between program
|
1329 |
invocations, but result sorting is normally always inactive
|
1307 |
invocations, but result sorting is normally always inactive
|
1330 |
when the program starts. It is possible to keep the sorting
|
1308 |
when the program starts. It is possible to keep the sorting
|
1331 |
activation state between program invocations by checking the
|
1309 |
activation state between program invocations by checking the
|
|
... |
|
... |
1425 |
(except <guilabel>This exact phrase</guilabel>).</para>
|
1403 |
(except <guilabel>This exact phrase</guilabel>).</para>
|
1426 |
</formalpara>
|
1404 |
</formalpara>
|
1427 |
|
1405 |
|
1428 |
<formalpara><title>AutoPhrases</title>
|
1406 |
<formalpara><title>AutoPhrases</title>
|
1429 |
<para>This option can be set in the preferences dialog. If it is
|
1407 |
<para>This option can be set in the preferences dialog. If it is
|
1430 |
set, a phrase will be automatically built and added to simple
|
1408 |
set, a phrase will be automatically built and added to simple
|
1431 |
searches when looking for <literal>Any terms</literal>. This
|
1409 |
searches when looking for <literal>Any terms</literal>. This
|
1432 |
will not radically change the results, but will give a relevance
|
1410 |
will not radically change the results, but will give a relevance
|
1433 |
boost to the results where the search terms appear as a
|
1411 |
boost to the results where the search terms appear as a
|
1434 |
phrase. For example, searching for <literal>virtual reality</literal>
|
1412 |
phrase. For example, searching for <literal>virtual reality</literal>
|
1435 |
will still find all documents where either
|
1413 |
will still find all documents where either
|
1436 |
<literal>virtual</literal> or <literal>reality</literal> or
|
1414 |
<literal>virtual</literal> or <literal>reality</literal> or
|
1437 |
both appear, but those which contain <literal>virtual
|
1415 |
both appear, but those which contain <literal>virtual
|
1438 |
reality</literal> should appear sooner in the list.</para>
|
1416 |
reality</literal> should appear sooner in the list.</para>
|
|
|
1417 |
|
|
|
1418 |
<para>Phrase searches can strongly slow down a query if most of the
|
|
|
1419 |
terms in the phrase are common. This is why the
|
|
|
1420 |
<literal>autophrase</literal> option is off by default for &RCL;
|
|
|
1421 |
versions before 1.17. As of version 1.17,
|
|
|
1422 |
<literal>autophrase</literal> is on by default, but very common
|
|
|
1423 |
terms will be removed from the constructed phrase. The removal
|
|
|
1424 |
threshold can be adjusted from the search preferences.</para>
|
|
|
1425 |
|
|
|
1426 |
<formalpara><title>Phrases and abbreviations</title> <para>As of
|
|
|
1427 |
&RCL; version 1.17, dotted abbreviations like
|
|
|
1428 |
<literal>I.B.M.</literal> are also automatically indexed as a word
|
|
|
1429 |
without the dots: <literal>IBM</literal>. Searching for the word
|
|
|
1430 |
inside a phrase (e.g. <literal>"the IBM company"</literal>) will only
|
|
|
1431 |
match the dotted abbreviation if you increase the phrase slack (using the
|
|
|
1432 |
advanced search panel control, or the <literal>o</literal> query
|
|
|
1433 |
language modifier). Literal occurrences of the word will be matched
|
|
|
1434 |
normally.</para>
|
|
|
1435 |
|
1439 |
|
1436 |
|
1440 |
</sect3>
|
1437 |
</sect3>
|
1441 |
|
1438 |
|
1442 |
<sect3 id="rcl.search.tips.misc">
|
1439 |
<sect3 id="rcl.search.tips.misc">
|
1443 |
<title>Others</title>
|
1440 |
<title>Others</title>
|
|
... |
|
... |
3404 |
<para>Example of use for skipping text files only in a
|
3401 |
<para>Example of use for skipping text files only in a
|
3405 |
specific directory:</para>
|
3402 |
specific directory:</para>
|
3406 |
<programlisting>
|
3403 |
<programlisting>
|
3407 |
skippedPaths = ~/somedir/∗.txt
|
3404 |
skippedPaths = ~/somedir/∗.txt
|
3408 |
</programlisting>
|
3405 |
</programlisting>
|
|
|
3406 |
<para>The values in the <literal>*skippedPaths</literal>
|
|
|
3407 |
variables are currently matched with
|
|
|
3408 |
<literal>fnmatch(3)</literal>, with the FNM_PATHNAME and
|
|
|
3409 |
FNM_LEADING_DIR flags. This means that '/' characters must
|
|
|
3410 |
be matched explicitly, which is probably
|
|
|
3411 |
unfortunate.</para>
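The effect of the two flags can be illustrated with a small emulation. Python's standard `fnmatch` module lets `*` match `/`, so it does not behave like `fnmatch(3)` with FNM_PATHNAME. The sketch below therefore translates the pattern by hand. It is an approximation under stated assumptions: `pathname_match` is a hypothetical helper, only `*`, `?` and literal characters are handled (no bracket expressions or backslash escapes), and the paths are made-up examples.

```python
import re


def pathname_match(pattern: str, path: str) -> bool:
    """Approximate fnmatch(3) with FNM_PATHNAME | FNM_LEADING_DIR.

    FNM_PATHNAME: '*' and '?' never match a '/' character.
    FNM_LEADING_DIR: a match of the pattern against a leading part of the
    path is accepted if what follows starts with '/'.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append("[^/]*")        # any run of characters except '/'
        elif ch == "?":
            parts.append("[^/]")         # exactly one character except '/'
        else:
            parts.append(re.escape(ch))  # literal character
    regex = "".join(parts) + r"(?:/.*)?"  # optional trailing "/rest" (leading-dir rule)
    return re.fullmatch(regex, path, flags=re.DOTALL) is not None
```

With this emulation, `/home/me/somedir/*.txt` matches `/home/me/somedir/notes.txt`, but `/home/me/*.txt` does not, because `*` cannot cross the `/` after `somedir`. A pattern naming a directory, such as `/home/me/somedir`, also matches everything below it.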
|
|
|
3412 |
|
3409 |
</listitem>
|
3413 |
</listitem>
|
3410 |
</varlistentry>
|
3414 |
</varlistentry>
|
3411 |
|
3415 |
|
3412 |
<varlistentry id="rcl.install.config.recollconf.followlinks">
|
3416 |
<varlistentry id="rcl.install.config.recollconf.followlinks">
|
3413 |
<term><literal>followLinks</literal></term>
|
3417 |
<term><literal>followLinks</literal></term>
|