|
a/src/README |
|
b/src/README |
|
... |
|
... |
6 |
|
6 |
|
7 |
Jean-Francois Dockes
|
7 |
Jean-Francois Dockes
|
8 |
|
8 |
|
9 |
<jfd@recoll.org>
|
9 |
<jfd@recoll.org>
|
10 |
|
10 |
|
11 |
Copyright (c) 2005-2011 Jean-Francois Dockes
|
11 |
Copyright (c) 2005-2012 Jean-Francois Dockes
|
12 |
|
12 |
|
13 |
This document introduces full text search notions and describes the
|
13 |
This document introduces full text search notions and describes the
|
14 |
installation and use of the Recoll application. It currently describes
|
14 |
installation and use of the Recoll application. It currently describes
|
15 |
Recoll 1.16.
|
15 |
Recoll 1.17.
|
16 |
|
16 |
|
17 |
[ Split HTML / Single HTML ]
|
17 |
[ Split HTML / Single HTML ]
|
18 |
|
18 |
|
19 |
----------------------------------------------------------------------
|
19 |
----------------------------------------------------------------------
|
20 |
|
20 |
|
|
... |
|
... |
108 |
|
108 |
|
109 |
4. Programming interface
|
109 |
4. Programming interface
|
110 |
|
110 |
|
111 |
4.1. Writing a document filter
|
111 |
4.1. Writing a document filter
|
112 |
|
112 |
|
|
|
113 |
4.1.1. Simple filters
|
|
|
114 |
|
|
|
115 |
4.1.2. Telling Recoll about the filter
|
|
|
116 |
|
113 |
4.1.1. Filter HTML output
|
117 |
4.1.3. Filter HTML output
|
114 |
|
118 |
|
115 |
4.2. Field data processing
|
119 |
4.2. Field data processing
|
116 |
|
120 |
|
117 |
4.3. API
|
121 |
4.3. API
|
118 |
|
122 |
|
|
... |
|
... |
244 |
something like /usr/[local/]share/recoll/examples) during installation.
|
248 |
something like /usr/[local/]share/recoll/examples) during installation.
|
245 |
The default parameters from this file may be overridden by values that you
|
249 |
The default parameters from this file may be overridden by values that you
|
246 |
set inside your personal configuration, found by default in the .recoll
|
250 |
set inside your personal configuration, found by default in the .recoll
|
247 |
sub-directory of your home directory. The default configuration will index
|
251 |
sub-directory of your home directory. The default configuration will index
|
248 |
your home directory with default parameters and should be sufficient for
|
252 |
your home directory with default parameters and should be sufficient for
|
249 |
giving Recoll a try, but you may want to adjust it later.
|
253 |
giving Recoll a try, but you may want to adjust it later, which can be
|
|
|
254 |
done either by editing the text files or by using configuration menus in
|
|
|
255 |
the recoll GUI.
|
250 |
|
256 |
|
251 |
Indexing is started automatically the first time you execute the recoll
|
257 |
Indexing is started automatically the first time you execute the recoll
|
252 |
search graphical user interface, or by executing the recollindex command.
|
258 |
search graphical user interface, or by executing the recollindex command.
|
253 |
|
259 |
|
254 |
Searches are usually performed inside the recoll graphical user interface
|
260 |
Searches are usually performed inside the recoll graphical user interface
|
|
... |
|
... |
264 |
2.1. Introduction
|
270 |
2.1. Introduction
|
265 |
|
271 |
|
266 |
Indexing is the process by which the set of documents is analyzed and the
|
272 |
Indexing is the process by which the set of documents is analyzed and the
|
267 |
data entered into the database. Recoll indexing is normally incremental:
|
273 |
data entered into the database. Recoll indexing is normally incremental:
|
268 |
documents will only be processed if they have been modified. On the first
|
274 |
documents will only be processed if they have been modified. On the first
|
269 |
execution, of course, all documents will need processing. A full index
|
275 |
execution, all documents will need processing. A full index build can be
|
270 |
build can be forced later by specifying an option to the indexing command
|
276 |
forced later by specifying an option to the indexing command (recollindex
|
271 |
(recollindex -z).
|
277 |
-z).
|
272 |
|
278 |
|
273 |
Recoll indexing can be performed with two different methods:
|
279 |
Recoll indexing can be performed with two different methods:
|
274 |
|
280 |
|
275 |
* Periodic indexing: indexing takes place at discrete times, by
|
281 |
* Periodic indexing: indexing takes place at discrete times, by
|
276 |
executing the recollindex command. The typical usage is to have a
|
282 |
executing the recollindex command. The typical usage is to have a
|
|
... |
|
... |
284 |
The choice between the two methods is mostly a matter of preference, and
|
290 |
The choice between the two methods is mostly a matter of preference, and
|
285 |
they can be combined by setting up multiple indexes (ie: use periodic
|
291 |
they can be combined by setting up multiple indexes (ie: use periodic
|
286 |
indexing on a big documentation directory, and real time indexing on a
|
292 |
indexing on a big documentation directory, and real time indexing on a
|
287 |
small home directory). Monitoring a big file system tree can consume
|
293 |
small home directory). Monitoring a big file system tree can consume
|
288 |
significant system resources.
|
294 |
significant system resources.
|
289 |
|
|
|
290 |
|
|
|
291 |
|
295 |
|
292 |
Recoll knows about quite a few different document types. The parameters
|
296 |
Recoll knows about quite a few different document types. The parameters
|
293 |
for document types recognition and processing are set in configuration
|
297 |
for document types recognition and processing are set in configuration
|
294 |
files.
|
298 |
files.
|
295 |
|
299 |
|
|
... |
|
... |
299 |
compound ones. Such hierarchies can go quite deep, and Recoll has no
|
303 |
compound ones. Such hierarchies can go quite deep, and Recoll has no
|
300 |
problem processing, for example, an ms-word document which would be an
|
304 |
problem processing, for example, an ms-word document which would be an
|
301 |
attachment to an email message part of a folder file archived inside a zip
|
305 |
attachment to an email message part of a folder file archived inside a zip
|
302 |
file...
|
306 |
file...
|
303 |
|
307 |
|
304 |
Recoll indexing processes plain text, HTML, openoffice and e-mail files
|
308 |
Recoll indexing processes plain text, HTML, openoffice and e-mail files,
|
305 |
internally (a few more actually).
|
309 |
and a few others internally.
|
306 |
|
310 |
|
307 |
Other file types (ie: postscript, pdf, ms-word, rtf ...) need external
|
311 |
Other file types (ie: postscript, pdf, ms-word, rtf ...) need external
|
308 |
applications for preprocessing. The list is in the installation section.
|
312 |
applications for preprocessing. The list is in the installation section.
|
309 |
After every indexing operation, Recoll updates a list of commands that
|
313 |
After every indexing operation, Recoll updates a list of commands that
|
310 |
would be needed for indexing existing file types. This list can be
|
314 |
would be needed for indexing existing file types. This list can be
|
|
... |
|
... |
341 |
different areas of the file system to different indexes. For example,
|
345 |
different areas of the file system to different indexes. For example,
|
342 |
if you were to issue the following commands:
|
346 |
if you were to issue the following commands:
|
343 |
|
347 |
|
344 |
export RECOLL_CONFDIR=~/.indexes-email
|
348 |
export RECOLL_CONFDIR=~/.indexes-email
|
345 |
recoll
|
349 |
recoll
|
346 |
|
350 |
|
347 |
|
351 |
|
348 |
Then Recoll would use configuration files stored in ~/.indexes-email/
|
352 |
Then Recoll would use configuration files stored in ~/.indexes-email/
|
349 |
and, (unless specified otherwise in recoll.conf) would look for the
|
353 |
and, (unless specified otherwise in recoll.conf) would look for the
|
350 |
index in ~/.indexes-email/xapiandb/.
|
354 |
index in ~/.indexes-email/xapiandb/.
|
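As a rough illustration of the directory resolution described above, the following shell sketch prints the configuration directory and the default index location that follow from RECOLL_CONFDIR. The function names are hypothetical, and the sketch assumes that recoll.conf does not relocate the index.

```shell
# Hypothetical helper: print the configuration directory Recoll would use.
# Assumes the default is the .recoll sub-directory of the home directory.
recoll_confdir() {
    printf '%s\n' "${RECOLL_CONFDIR:-$HOME/.recoll}"
}

# Hypothetical helper: print the default index location for that directory.
# Assumes recoll.conf does not specify another index directory.
recoll_default_dbdir() {
    printf '%s/xapiandb\n' "$(recoll_confdir)"
}
```

With RECOLL_CONFDIR set to ~/.indexes-email as in the example, recoll_default_dbdir would print the xapiandb sub-directory of that directory.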
351 |
|
355 |
|
|
... |
|
... |
378 |
|
382 |
|
379 |
----------------------------------------------------------------------
|
383 |
----------------------------------------------------------------------
|
380 |
|
384 |
|
381 |
2.2.1. Xapian index formats
|
385 |
2.2.1. Xapian index formats
|
382 |
|
386 |
|
383 |
If your first installation of Recoll was 1.9.0 or more recent, you can
|
387 |
Xapian versions usually support several formats for index storage. A given
|
384 |
skip this section.
|
388 |
major Xapian version will have a current format, used to create new
|
|
|
389 |
indexes, and will also support the format from the previous major version.
|
385 |
|
390 |
|
386 |
Xapian has had two possible index formats for quite some time. The "old"
|
|
|
387 |
one named Quartz, and the new one named Flint. Xapian 0.9 used Quartz by
|
|
|
388 |
default, but could use Flint if a specific environment variable
|
|
|
389 |
(XAPIAN_PREFER_FLINT) was set. Xapian 1.0 still supports Quartz but will
|
|
|
390 |
use Flint by default for new index creations.
|
|
|
391 |
|
|
|
392 |
The number of disk accesses performed during indexing has been much
|
|
|
393 |
optimized in the new Flint engine and you may see indexing times improved
|
|
|
394 |
by 50% in some cases (compared to Quartz), typically for big indexes where
|
|
|
395 |
disk accesses dominate the indexing time. There is also a more modest
|
|
|
396 |
improvement of index size.
|
|
|
397 |
|
|
|
398 |
Xapian will not convert automatically an existing index from the Quartz to
|
391 |
Xapian will not convert automatically an existing index from the older
|
399 |
the Flint format. If you have an older index and want to take advantage of
|
392 |
format to the newer one. If you want to upgrade to the new format, or if a
|
400 |
the new format (which can be done without setting the environment variable
|
393 |
very old index needs to be converted because its format is not supported
|
401 |
as of Recoll 1.8.2 and Xapian 1.0.0), you will have to explicitly delete
|
394 |
any more, you will have to explicitly delete the old index, then run a
|
402 |
the old index, then run a normal indexing process.
|
395 |
normal indexing process.
|
403 |
|
396 |
|
404 |
Unfortunately, using the -z option to recollindex is not sufficient to
|
397 |
Unfortunately, using the -z option to recollindex is not sufficient to
|
405 |
change the format, you have to delete all files inside the index directory
|
398 |
change the format, you will have to delete all files inside the index
|
406 |
(typically ~/.recoll/xapiandb) before starting indexing.
|
399 |
directory (typically ~/.recoll/xapiandb) before starting the indexing.
|
407 |
|
400 |
|
408 |
----------------------------------------------------------------------
|
401 |
----------------------------------------------------------------------
|
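The format upgrade procedure above could be scripted roughly as follows. This is only a sketch: the function name and the RECOLLINDEX override variable are hypothetical, the index is assumed to live in the default xapiandb sub-directory of the configuration directory, and the -mindepth and -delete options assume a GNU or BSD find.

```shell
# Hypothetical helper: empty the index directory, then run a normal indexing pass.
# RECOLLINDEX is an assumed override for the indexer command, handy for dry runs.
rebuild_index() {
    confdir="${RECOLL_CONFDIR:-$HOME/.recoll}"
    dbdir="$confdir/xapiandb"
    if [ ! -d "$dbdir" ]; then
        echo "no index directory at $dbdir" >&2
        return 1
    fi
    # recollindex -z is not sufficient to change the format: remove every file.
    find "$dbdir" -mindepth 1 -delete || return 1
    # A normal indexing run then recreates the index in the current format.
    "${RECOLLINDEX:-recollindex}" -c "$confdir"
}
```

Since this deletes the whole index, it may be worth checking the printed directory name by hand before running it against a real configuration.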
409 |
|
402 |
|
410 |
2.2.2. Security aspects
|
403 |
2.2.2. Security aspects
|
411 |
|
404 |
|
412 |
The Recoll index does not hold copies of the indexed documents. But it
|
405 |
The Recoll index does not hold copies of the indexed documents. But it
|
413 |
does hold enough data to allow for an almost complete reconstruction. If
|
406 |
does hold enough data to allow for an almost complete reconstruction. If
|
414 |
confidential data is indexed, access to the database directory should be
|
407 |
confidential data is indexed, access to the database directory should be
|
415 |
restricted.
|
408 |
restricted.
|
416 |
|
409 |
|
417 |
As of version 1.4, Recoll will create the configuration directory with a
|
410 |
Recoll (since version 1.4) will create the configuration directory with a
|
418 |
mode of 0700 (access by owner only). As the index data directory is by
|
411 |
mode of 0700 (access by owner only). As the index data directory is by
|
419 |
default a sub-directory of the configuration directory, this should result
|
412 |
default a sub-directory of the configuration directory, this should result
|
420 |
in appropriate protection.
|
413 |
in appropriate protection.
|
421 |
|
414 |
|
422 |
If you use another setup, you should think of the kind of protection you
|
415 |
If you use another setup, you should think of the kind of protection you
|
|
... |
|
... |
505 |
2.5. Periodic indexing
|
498 |
2.5. Periodic indexing
|
506 |
|
499 |
|
507 |
2.5.1. Running indexing
|
500 |
2.5.1. Running indexing
|
508 |
|
501 |
|
509 |
Indexing is performed either by the recollindex program, or by the
|
502 |
Indexing is performed either by the recollindex program, or by the
|
510 |
indexing thread inside the recoll program (use the File menu). Both
|
503 |
indexing thread inside the recoll program (start it from the File menu).
|
511 |
programs will use the RECOLL_CONFDIR variable or accept a -c confdir
|
504 |
Both programs will use the RECOLL_CONFDIR variable or accept a -c confdir
|
512 |
option to specify a non-default configuration directory.
|
505 |
option to specify a non-default configuration directory.
|
513 |
|
506 |
|
514 |
Reasons to use either the indexing thread or the recollindex command:
|
507 |
There are reasons to use either the indexing thread or the recollindex
|
|
|
508 |
command, but it is also a matter of personal preferences:
|
515 |
|
509 |
|
516 |
* Starting the indexing thread is more convenient, being just one click
|
510 |
* Starting the indexing thread is more convenient, being just one click
|
517 |
away.
|
511 |
away.
|
518 |
|
512 |
|
519 |
* The recollindex command has more options, especially the one to reset
|
513 |
* The recollindex command has more options, especially the one to reset
|
|
... |
|
... |
521 |
|
515 |
|
522 |
* The recollindex command will not take down your GUI if it crashes (a
|
516 |
* The recollindex command will not take down your GUI if it crashes (a
|
523 |
rare occurrence, but who knows...)
|
517 |
rare occurrence, but who knows...)
|
524 |
|
518 |
|
525 |
* The recollindex command uses setpriority/nice to lower its priority
|
519 |
* The recollindex command uses setpriority/nice to lower its priority
|
526 |
while indexing (it will also use ionice when this becomes more widely
|
520 |
while indexing. When available (and for Recoll version 1.16.2 and
|
|
|
521 |
newer), it also uses the ionice command to lower its IO priority. The
|
527 |
available), the thread can't do it, else it would also slow down the
|
522 |
thread can't do it, else it would also slow down the user/search
|
528 |
user/search interface.
|
523 |
interface.
|
529 |
|
|
|
530 |
I'll let the reader decide where my heart belongs...
|
|
|
531 |
|
524 |
|
532 |
If the recoll program finds no index when it starts, it will automatically
|
525 |
If the recoll program finds no index when it starts, it will automatically
|
533 |
start indexing (except if canceled).
|
526 |
start indexing (except if canceled).
|
534 |
|
527 |
|
535 |
The recollindex indexing process can be interrupted by sending an
|
528 |
The recollindex indexing process can be interrupted by sending an
|
|
... |
|
... |
594 |
index.
|
587 |
index.
|
595 |
|
588 |
|
596 |
The real time indexing support can be customised during package
|
589 |
The real time indexing support can be customised during package
|
597 |
configuration with the --with[out]-fam or --with[out]-inotify options. The
|
590 |
configuration with the --with[out]-fam or --with[out]-inotify options. The
|
598 |
default is currently to include inotify monitoring on systems that support
|
591 |
default is currently to include inotify monitoring on systems that support
|
599 |
it.
|
592 |
it, and, as of recoll 1.17, gamin support on FreeBSD.
|
600 |
|
593 |
|
601 |
The rclmon.sh script can be used to easily start and stop the daemon. It
|
594 |
The rclmon.sh script can be used to easily start and stop the daemon. It
|
602 |
can be found in the examples directory (typically
|
595 |
can be found in the examples directory (typically
|
603 |
/usr/local/[share/]recoll/examples).
|
596 |
/usr/local/[share/]recoll/examples).
|
604 |
|
597 |
|
|
... |
|
... |
608 |
|
601 |
|
609 |
recollconf=$HOME/.recoll-home
|
602 |
recollconf=$HOME/.recoll-home
|
610 |
recolldata=/usr/local/share/recoll
|
603 |
recolldata=/usr/local/share/recoll
|
611 |
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
|
604 |
RECOLL_CONFDIR=$recollconf $recolldata/examples/rclmon.sh start
|
612 |
|
605 |
|
613 |
fvwm
|
606 |
fvwm
|
614 |
|
607 |
|
615 |
The indexing daemon gets started, then the window manager, for which the
|
608 |
The indexing daemon gets started, then the window manager, for which the
|
616 |
session waits.
|
609 |
session waits.
|
617 |
|
610 |
|
618 |
By default the indexing daemon will monitor the state of the X11 session,
|
611 |
By default the indexing daemon will monitor the state of the X11 session,
|
|
... |
|
... |
622 |
Under KDE, you can place a small script to start recollindex -m under
|
615 |
Under KDE, you can place a small script to start recollindex -m under
|
623 |
$HOME/.kde/Autostart. This will be executed when the session begins.
|
616 |
$HOME/.kde/Autostart. This will be executed when the session begins.
|
624 |
|
617 |
|
625 |
There is a similar mechanism under Gnome (find the session control tool in
|
618 |
There is a similar mechanism under Gnome (find the session control tool in
|
626 |
the menus and use the "Startup programs" tab).
|
619 |
the menus and use the "Startup programs" tab).
|
|
|
620 |
|
|
|
621 |
If you use the daemon completely out of an X11 session, you need to add
|
|
|
622 |
option -x to disable X11 session monitoring (else the daemon will not
|
|
|
623 |
start).
|
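A startup script might pick the daemon options along these lines. This is a sketch under one assumption that is not taken from the manual: an unset or empty DISPLAY variable is treated as "no X11 session". The function name is hypothetical.

```shell
# Hypothetical helper: print the monitor command line for the current session.
# Assumes an unset or empty DISPLAY means there is no X11 session to monitor.
recoll_monitor_cmd() {
    if [ -n "${DISPLAY:-}" ]; then
        echo "recollindex -m"
    else
        # -x disables X11 session monitoring; without it the daemon will not start.
        echo "recollindex -m -x"
    fi
}
```

An autostart script could then end with a line such as exec $(recoll_monitor_cmd).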
627 |
|
624 |
|
628 |
By default, the messages from the indexing daemon will be discarded. You
|
625 |
By default, the messages from the indexing daemon will be discarded. You
|
629 |
may want to change this by setting the daemlogfilename and daemloglevel
|
626 |
may want to change this by setting the daemlogfilename and daemloglevel
|
630 |
configuration parameters. Also the log file will only be truncated when
|
627 |
configuration parameters. Also the log file will only be truncated when
|
631 |
the daemon starts. If the daemon runs permanently, the log file may grow
|
628 |
the daemon starts. If the daemon runs permanently, the log file may grow
|
|
... |
|
... |
880 |
can be resized, and their order can be changed (by dragging). All the
|
877 |
can be resized, and their order can be changed (by dragging). All the
|
881 |
changes are recorded when you quit recoll.
|
878 |
changes are recorded when you quit recoll.
|
882 |
|
879 |
|
883 |
Hovering over a table row will update the detail area at the bottom of the
|
880 |
Hovering over a table row will update the detail area at the bottom of the
|
884 |
window with the corresponding values. You can click the row to freeze the
|
881 |
window with the corresponding values. You can click the row to freeze the
|
885 |
display. The bottom area is equivalent to a classical result list
|
882 |
display. The bottom area is equivalent to a result list paragraph, with
|
886 |
paragraph, with links for starting a preview or a native application, and
|
883 |
links for starting a preview or a native application, and an equivalent
|
887 |
an equivalent right-click menu. Typing Esc (the Escape key) will unfreeze
|
884 |
right-click menu. Typing Esc (the Escape key) will unfreeze the display.
|
888 |
the display.
|
|
|
889 |
|
885 |
|
890 |
----------------------------------------------------------------------
|
886 |
----------------------------------------------------------------------
|
891 |
|
887 |
|
892 |
3.1.4. The preview window
|
888 |
3.1.4. The preview window
|
893 |
|
889 |
|
|
... |
|
... |
1115 |
----------------------------------------------------------------------
|
1111 |
----------------------------------------------------------------------
|
1116 |
|
1112 |
|
1117 |
3.1.9. Sorting search results and collapsing duplicates
|
1113 |
3.1.9. Sorting search results and collapsing duplicates
|
1118 |
|
1114 |
|
1119 |
The documents in a result list are normally sorted in order of relevance.
|
1115 |
The documents in a result list are normally sorted in order of relevance.
|
1120 |
It is possible to specify different sort parameters by using the Sort
|
1116 |
It is possible to specify a different sort order, either by using the
|
1121 |
parameters dialog (located in the Tools menu).
|
1117 |
vertical arrows in the GUI toolbox to sort by date, or switching to the
|
1122 |
|
1118 |
result table display and clicking on any header. The sort order chosen
|
1123 |
The tool sorts a specified number of the most relevant documents in the
|
1119 |
inside the result table remains active if you switch back to the result
|
1124 |
result list, according to specified criteria. The currently available
|
1120 |
list, until you click one of the vertical arrows, until both are unchecked
|
1125 |
criteria are date and mime type.
|
1121 |
(you are back to sort by relevance).
|
1126 |
|
|
|
1127 |
The sort parameters stay in effect until they are explicitly reset, or the
|
|
|
1128 |
program exits. An activated sort is indicated in the result list header.
|
|
|
1129 |
|
1122 |
|
1130 |
Sort parameters are remembered between program invocations, but result
|
1123 |
Sort parameters are remembered between program invocations, but result
|
1131 |
sorting is normally always inactive when the program starts. It is
|
1124 |
sorting is normally always inactive when the program starts. It is
|
1132 |
possible to keep the sorting activation state between program invocations
|
1125 |
possible to keep the sorting activation state between program invocations
|
1133 |
by checking the Remember sort activation state option in the preferences.
|
1126 |
by checking the Remember sort activation state option in the preferences.
|
|
... |
|
... |
1197 |
but will give a relevance boost to the results where the search terms
|
1190 |
but will give a relevance boost to the results where the search terms
|
1198 |
appear as a phrase. Ie: searching for virtual reality will still find all
|
1191 |
appear as a phrase. Ie: searching for virtual reality will still find all
|
1199 |
documents where either virtual or reality or both appear, but those which
|
1192 |
documents where either virtual or reality or both appear, but those which
|
1200 |
contain virtual reality should appear sooner in the list.
|
1193 |
contain virtual reality should appear sooner in the list.
|
1201 |
|
1194 |
|
|
|
1195 |
Phrase searches can strongly slow down a query if most of the terms in the
|
|
|
1196 |
phrase are common. This is why the autophrase option is off by default for
|
|
|
1197 |
Recoll versions before 1.17. As of version 1.17, autophrase is on by
|
|
|
1198 |
default, but very common terms will be removed from the constructed
|
|
|
1199 |
phrase. The removal threshold can be adjusted from the search preferences.
|
|
|
1200 |
|
|
|
1201 |
Phrases and abbreviations. As of Recoll version 1.17, dotted abbreviations
|
|
|
1202 |
like I.B.M. are also automatically indexed as a word without the dots:
|
|
|
1203 |
IBM. Searching for the word inside a phrase (ie: "the IBM company") will
|
|
|
1204 |
only match the dotted abbreviation if you increase the phrase slack (using
|
|
|
1205 |
the advanced search panel control, or the o query language modifier).
|
|
|
1206 |
Literal occurrences of the word will be matched normally.
|
|
|
1207 |
|
1202 |
----------------------------------------------------------------------
|
1208 |
----------------------------------------------------------------------
|
1203 |
|
1209 |
|
1204 |
3.1.10.3. Others
|
1210 |
3.1.10.3. Others
|
1205 |
|
1211 |
|
1206 |
Using fields. You can use the query language and field specifications to
|
1212 |
Using fields. You can use the query language and field specifications to
|
|
... |
|
... |
1245 |
the parameters used for searching and returning results, and what indexes
|
1251 |
the parameters used for searching and returning results, and what indexes
|
1246 |
are searched.
|
1252 |
are searched.
|
1247 |
|
1253 |
|
1248 |
User interface parameters:
|
1254 |
User interface parameters:
|
1249 |
|
1255 |
|
1250 |
* Number of results in a result page:
|
|
|
1251 |
|
|
|
1252 |
* Hide duplicate results: decides if result list entries are shown for
|
|
|
1253 |
identical documents found in different places.
|
|
|
1254 |
|
|
|
1255 |
* Highlight color for query terms: Terms from the user query are
|
1256 |
* Highlight color for query terms: Terms from the user query are
|
1256 |
highlighted in the result list samples and the preview window. The
|
1257 |
highlighted in the result list samples and the preview window. The
|
1257 |
color can be chosen here. Any Qt color string should work (ie red,
|
1258 |
color can be chosen here. Any Qt color string should work (ie red,
|
1258 |
#ff0000). The default is blue.
|
1259 |
#ff0000). The default is blue.
|
1259 |
|
1260 |
|
1260 |
* Result list font: There is quite a lot of information shown in the
|
1261 |
* Style sheet: The name of a Qt style sheet text file which is applied
|
1261 |
result list, and you may want to customize the font and/or font size.
|
1262 |
to the whole Recoll application on startup. The default value is
|
1262 |
The rest of the fonts used by Recoll are determined by your generic Qt
|
1263 |
empty, but there is a skeleton style sheet (recoll.qss) inside the
|
1263 |
config (try the qtconfig command).
|
1264 |
/usr/share/recoll/examples directory. Using a style sheet, you can
|
1264 |
|
1265 |
change most Recoll graphical parameters: colors, fonts, etc. See the
|
1265 |
* Result paragraph format string: allows you to change the presentation
|
1266 |
sample file for a few simple examples.
|
1266 |
of each result list entry. This is described in its own section.
|
|
|
1267 |
|
|
|
1268 |
* Abstract snippet separator: for synthetic abstracts built from index
|
|
|
1269 |
data, which are usually made of several snippets from different parts
|
|
|
1270 |
of the document, this defines the snippet separator, an ellipsis by
|
|
|
1271 |
default.
|
|
|
1272 |
|
1267 |
|
1273 |
* Maximum text size highlighted for preview: inserting highlights on
|
1268 |
* Maximum text size highlighted for preview: inserting highlights on
|
1274 |
search terms inside the text before inserting it in the preview window
|
1269 |
search terms inside the text before inserting it in the preview window
|
1275 |
involves quite a lot of processing, and can be disabled over the given
|
1270 |
involves quite a lot of processing, and can be disabled over the given
|
1276 |
text size to speed up loading.
|
1271 |
text size to speed up loading.
|
|
|
1272 |
|
|
|
1273 |
* Prefer HTML to plain text for preview: if set, Recoll will display HTML
|
|
|
1274 |
as such inside the preview window. If this causes problems with the Qt
|
|
|
1275 |
HTML display, you can uncheck it to display the plain text version
|
|
|
1276 |
instead.
|
|
|
1277 |
|
|
|
1278 |
* Use <PRE> tags instead of <BR> to display plain text as HTML in
|
|
|
1279 |
preview: when displaying plain text inside the preview window, Recoll
|
|
|
1280 |
tries to preserve some of the original text line breaks and
|
|
|
1281 |
indentation. It can either use PRE HTML tags, which will well preserve
|
|
|
1282 |
the indentation but will force horizontal scrolling for long lines, or
|
|
|
1283 |
use BR tags to break at the original line breaks, which will let the
|
|
|
1284 |
editor introduce other line breaks according to the window width, but
|
|
|
1285 |
will lose some of the original indentation.
|
1277 |
|
1286 |
|
1278 |
* Use desktop preferences to choose document editor: if this is checked,
|
1287 |
* Use desktop preferences to choose document editor: if this is checked,
|
1279 |
the xdg-open utility will be used to open files when you click the
|
1288 |
the xdg-open utility will be used to open files when you click the
|
1280 |
Open link in the result list, instead of the application defined in
|
1289 |
Open link in the result list, instead of the application defined in
|
1281 |
mimeview. xdg-open will in turn use your desktop preferences to choose
|
1290 |
mimeview. xdg-open will in turn use your desktop preferences to choose
|
|
... |
|
... |
1299 |
|
1308 |
|
1300 |
* Remember sort activation state: if set, Recoll will remember the sort
|
1309 |
* Remember sort activation state: if set, Recoll will remember the sort
|
1301 |
tool state between invocations. It normally starts with sorting
|
1310 |
tool state between invocations. It normally starts with sorting
|
1302 |
disabled.
|
1311 |
disabled.
|
1303 |
|
1312 |
|
1304 |
* Prefer HTML to plain text for preview: if set, Recoll will display HTML
|
1313 |
Result list parameters:
|
1305 |
as such inside the preview window. If this causes problems with the Qt
|
1314 |
|
1306 |
HTML display, you can uncheck it to display the plain text version
|
1315 |
* Number of results in a result page
|
1307 |
instead.
|
1316 |
|
|
|
1317 |
* Result list font: There is quite a lot of information shown in the
|
|
|
1318 |
result list, and you may want to customize the font and/or font size.
|
|
|
1319 |
The rest of the fonts used by Recoll are determined by your generic Qt
|
|
|
1320 |
config (try the qtconfig command).
|
|
|
1321 |
|
|
|
1322 |
* Edit result list paragraph format string: allows you to change the
|
|
|
1323 |
presentation of each result list entry. See the result list
|
|
|
1324 |
customisation section.
|
|
|
1325 |
|
|
|
1326 |
* Edit result page html header insert: allows you to define text
|
|
|
1327 |
inserted at the end of the result page html header. More detail in the
|
|
|
1328 |
result list customisation section.
|
|
|
1329 |
|
|
|
1330 |
* Date format: allows specifying the format used for displaying dates
|
|
|
1331 |
inside the result list. This should be specified as an strftime()
|
|
|
1332 |
string (man strftime).
|
|
|
1333 |
|
|
|
1334 |
* Abstract snippet separator: for synthetic abstracts built from index
|
|
|
1335 |
data, which are usually made of several snippets from different parts
|
|
|
1336 |
of the document, this defines the snippet separator, an ellipsis by
|
|
|
1337 |
default.
|
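To try out a candidate value for the Date format parameter before entering it, one option is to render it from a shell. The helper name below is hypothetical, and the sketch relies on date(1) accepting most of the same conversion specifications as strftime(), which may not hold for every specifier on every system.

```shell
# Hypothetical helper: show the current date rendered with a candidate format.
# date(1) takes strftime-style conversions after a leading '+'.
preview_date_format() {
    date "+$1"
}
```

For example, preview_date_format '%Y-%m-%d %H:%M' prints the current time in year-month-day order followed by hours and minutes.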
1308 |
|
1338 |
|
1309 |
Search parameters:
|
1339 |
Search parameters:
|
|
|
1340 |
|
|
|
1341 |
* Hide duplicate results: decides if result list entries are shown for
|
|
|
1342 |
identical documents found in different places.
|
1310 |
|
1343 |
|
1311 |
* Stemming language: stemming obviously depends on the document's
|
1344 |
* Stemming language: stemming obviously depends on the document's
|
1312 |
language. This listbox will let you choose among the stemming databases
|
1345 |
language. This listbox will let you choose among the stemming databases
|
1313 |
which were built during indexing (this is set in the main
|
1346 |
which were built during indexing (this is set in the main
|
1314 |
configuration file), or later added with recollindex -s (See the
|
1347 |
configuration file), or later added with recollindex -s (See the
|
1315 |
recollindex manual). Stemming languages which are dynamically added
|
1348 |
recollindex manual). Stemming languages which are dynamically added
|
1316 |
will be deleted at the next indexing pass unless they are also added
|
1349 |
will be deleted at the next indexing pass unless they are also added
|
1317 |
in the configuration file.
|
1350 |
in the configuration file.
|
1318 |
|
1351 |
|
1319 |
* Dynamically add phrase to simple searches: a phrase will be
|
1352 |
* Automatically add phrase to simple searches: a phrase will be
|
1320 |
automatically built and added to simple searches when looking for Any
|
1353 |
automatically built and added to simple searches when looking for Any
|
1321 |
terms. This will give a relevance boost to the results where the
|
1354 |
terms. This will give a relevance boost to the results where the
|
1322 |
search terms appear as a phrase (consecutive and in order).
|
1355 |
search terms appear as a phrase (consecutive and in order).
|
|
|
1356 |
|
|
|
1357 |
* Autophrase term frequency threshold percentage: very frequent terms
|
|
|
1358 |
should not be included in automatic phrase searches for performance
|
|
|
1359 |
reasons. The parameter defines the cutoff percentage (percentage of
|
|
|
1360 |
the documents where the term appears).
|
1323 |
|
1361 |
|
1324 |
* Replace abstracts from documents: this decides if we should synthesize
|
1362 |
* Replace abstracts from documents: this decides if we should synthesize
|
1325 |
and display an abstract in place of an explicit abstract found within
|
1363 |
and display an abstract in place of an explicit abstract found within
|
1326 |
the document itself.
|
1364 |
the document itself.
|
1327 |
|
1365 |
|
|
... |
|
... |
1356 |
alternative indexer may also need to implement a way of purging the index
|
1394 |
alternative indexer may also need to implement a way of purging the index
|
1357 |
from stale data.
|
1395 |
from stale data.
|
1358 |
|
1396 |
|
1359 |
----------------------------------------------------------------------
|
1397 |
----------------------------------------------------------------------
|
1360 |
|
1398 |
|
1361 |
3.1.11.1. The result list paragraph format
|
1399 |
3.1.11.1. The result list format
|
1362 |
|
1400 |
|
1363 |
The presentation of each result inside the result list can be customized
|
1401 |
The result list presentation can be exhaustively customized by adjusting
|
1364 |
by setting the result list paragraph format inside the User Interface tab
|
1402 |
two elements:
|
1365 |
of the Query configuration.
|
|
|
1366 |
|
1403 |
|
|
|
1404 |
* The paragraph format
|
|
|
1405 |
|
|
|
1406 |
* HTML code inside the header section
|
|
|
1407 |
|
|
|
1408 |
These can be edited from the Result list tab of the Query configuration.
|
|
|
1409 |
|
|
|
1410 |
Newer versions of Recoll (from 1.17) use a WebKit HTML object by default
|
|
|
1411 |
(this may be disabled at build time), and total customisation is possible
|
|
|
1412 |
with full support for CSS and JavaScript. Conversely, there are limits to
|
|
|
1413 |
what you can do with the older Qt QTextBrowser, but still, it is possible
|
|
|
1414 |
to decide what data each result will contain, and how it will be
|
|
|
1415 |
displayed.
|
|
|
1416 |
|
|
|
1417 |
No more detail will be given about the header part (only useful with the
|
|
|
1418 |
WebKit build). If there are restrictions to what you can do, they are
|
|
|
1419 |
beyond this author's HTML/CSS/JavaScript abilities...
|
|
|
1420 |
|
|
|
1421 |
----------------------------------------------------------------------
|
|
|
1422 |
|
|
|
1423 |
3.1.11.1.1. The paragraph format
|
|
|
1424 |
|
1367 |
This is a Qt HTML string where the following printf-like % substitutions
|
1425 |
This is an arbitrary HTML string where the following printf-like %
|
1368 |
will be performed:
|
1426 |
substitutions will be performed:
|
1369 |
|
1427 |
|
1370 |
* %A. Abstract
|
1428 |
* %A. Abstract
|
1371 |
|
1429 |
|
1372 |
* %D. Date
|
1430 |
* %D. Date
|
1373 |
|
1431 |
|
1374 |
* %I. Icon image name
|
1432 |
* %I. Icon image name. This is normally determined from the mime type.
|
|
|
1433 |
The associations are defined inside the mimeconf configuration file.
|
|
|
1434 |
If a thumbnail for the file is found at the standard Freedesktop
|
|
|
1435 |
location, this will be displayed instead.
|
1375 |
|
1436 |
|
1376 |
* %K. Keywords (if any)
|
1437 |
* %K. Keywords (if any)
|
1377 |
|
1438 |
|
1378 |
* %L. Preview and Edit links
|
1439 |
* %L. Precooked Preview and Edit links
|
1379 |
|
1440 |
|
1380 |
* %M. Mime type
|
1441 |
* %M. Mime type
|
1381 |
|
1442 |
|
1382 |
* %N. result Number
|
1443 |
* %N. result Number inside the result page
|
1383 |
|
1444 |
|
1384 |
* %R. Relevance percentage
|
1445 |
* %R. Relevance percentage
|
1385 |
|
1446 |
|
1386 |
* %S. Size information
|
1447 |
* %S. Size information
|
1387 |
|
1448 |
|
1388 |
* %T. Title
|
1449 |
* %T. Title
|
1389 |
|
1450 |
|
1390 |
* %U. Url
|
1451 |
* %U. Url
|
1391 |
|
1452 |
|
1392 |
The format of the Preview and Edit links is <a href="P%N"> and <a
|
1453 |
The format of the Preview and Edit links is <a href="P%N"> and <a
|
1393 |
href="E%N"> where docnum (%N expands to the document number inside the
|
1454 |
href="E%N"> where docnum (%N) expands to the document number inside the
|
1394 |
result list).
|
1455 |
result page.
|
1395 |
|
1456 |
|
1396 |
In addition to the predefined values above, all strings like %(fieldname)
|
1457 |
In addition to the predefined values above, all strings like %(fieldname)
|
1397 |
will be replaced by the value of the field named fieldname for this
|
1458 |
will be replaced by the value of the field named fieldname for this
|
1398 |
document. Only stored fields can be accessed in this way; the value of
|
1459 |
document. Only stored fields can be accessed in this way; the value of
|
1399 |
indexed but not stored fields is not known at this point in the search
|
1460 |
indexed but not stored fields is not known at this point in the search
|
|
... |
|
... |
1408 |
The default value for the paragraph format string is:
|
1469 |
The default value for the paragraph format string is:
|
1409 |
|
1470 |
|
1410 |
<img src="%I" align="left">%R %S %L <b>%T</b><br>
|
1471 |
<img src="%I" align="left">%R %S %L <b>%T</b><br>
|
1411 |
%M %D <i>%U</i> %i<br>
|
1472 |
%M %D <i>%U</i> %i<br>
|
1412 |
%A %K
|
1473 |
%A %K
|
1413 |
|
1474 |
|
1414 |
|
1475 |
|
1415 |
You may, for example, try the following for a more web-like experience:
|
1476 |
You may, for example, try the following for a more web-like experience:
|
1416 |
|
1477 |
|
1417 |
<u><b><a href="P%N">%T</a></b></u><br>
|
1478 |
<u><b><a href="P%N">%T</a></b></u><br>
|
1418 |
%A<font color=#008000>%U - %S</font> - %L
|
1479 |
%A<font color=#008000>%U - %S</font> - %L
|
1419 |
|
1480 |
|
1420 |
|
1481 |
|
1421 |
Or the clean looking:
|
1482 |
Or the clean looking:
|
1422 |
|
1483 |
|
1423 |
<img src="%I" align="left">%L <font color="#900000">%R</font>
|
1484 |
<img src="%I" align="left">%L <font color="#900000">%R</font>
|
1424 |
<b>%T</b><br>%S
|
1485 |
<b>%T</b><br>%S
|
1425 |
<font color="#808080"><i>%U</i></font>
|
1486 |
<font color="#808080"><i>%U</i></font>
|
1426 |
<table bgcolor="#e0e0e0">
|
1487 |
<table bgcolor="#e0e0e0">
|
1427 |
<tr><td><div>%A</div></td></tr>
|
1488 |
<tr><td><div>%A</div></td></tr>
|
1428 |
</table>%K
|
1489 |
</table>%K
|
1429 |
|
1490 |
|
1430 |
|
1491 |
|
1431 |
Note that the P%N link in the above paragraph makes the title a preview
|
1492 |
Note that the P%N link in the above paragraph makes the title a preview
|
1432 |
link.
|
1493 |
link.
|
|
|
1494 |
|
|
|
1495 |
These samples, and some others, are on the web site, with pictures to show
|
|
|
1496 |
how they look.
|
1433 |
|
1497 |
|
1434 |
It is also possible to define the value of the snippet separator inside
|
1498 |
It is also possible to define the value of the snippet separator inside
|
1435 |
the abstract section.
|
1499 |
the abstract section.
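
The % substitutions performed on the paragraph format can be pictured with
the following Python sketch. This is only an illustration, not Recoll's
actual code: the expand_format function, its arguments and the sample
values are made up for the example, and unknown markers are simply assumed
to expand to an empty string.

```python
import re

# Matches either %(fieldname) or a single-letter marker such as %T.
_MARKER = re.compile(r"%\((\w+)\)|%([A-Za-z])")

def expand_format(fmt, values, fields=None):
    """Expand %X and %(fieldname) markers in a paragraph format string.

    values maps single letters ("T", "U", ...) to strings, fields maps
    stored field names to strings. Unknown markers expand to nothing
    (an assumption of this sketch).
    """
    fields = fields or {}

    def replace(match):
        fieldname, letter = match.group(1), match.group(2)
        if fieldname is not None:
            return fields.get(fieldname, "")
        return values.get(letter, "")

    return _MARKER.sub(replace, fmt)

if __name__ == "__main__":
    sample = {"N": "3", "T": "Some title", "U": "file:///home/me/doc.txt"}
    print(expand_format('<a href="P%N">%T</a> <i>%U</i> %(author)',
                        sample, {"author": "someone"}))
```

With the sample values, the P%N link becomes P3, which is how the title
ends up being a preview link for the third result of the page.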
|
1436 |
|
1500 |
|
1437 |
----------------------------------------------------------------------
|
1501 |
----------------------------------------------------------------------
|
|
... |
|
... |
1482 |
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
|
1546 |
window.location.href = 'recoll://search/query?qtp=a&p=0&q=' +
|
1483 |
encodeURIComponent(t);
|
1547 |
encodeURIComponent(t);
|
1484 |
}
|
1548 |
}
|
1485 |
</script>
|
1549 |
</script>
|
1486 |
....
|
1550 |
....
|
1487 |
<body ondblclick="recollsearch()">
|
1551 |
<body ondblclick="recollsearch()">
|
1488 |
|
1552 |
|
1489 |
----------------------------------------------------------------------
|
1553 |
----------------------------------------------------------------------
|
1490 |
|
1554 |
|
1491 |
3.3. Searching on the command line
|
1555 |
3.3. Searching on the command line
|
1492 |
|
1556 |
|
|
... |
|
... |
1544 |
The query language processor is activated in the GUI simple search entry
|
1608 |
The query language processor is activated in the GUI simple search entry
|
1545 |
when the search mode selector is set to Query Language. It can also be
|
1609 |
when the search mode selector is set to Query Language. It can also be
|
1546 |
used with the KIO slave or the command line search. It broadly has the
|
1610 |
used with the KIO slave or the command line search. It broadly has the
|
1547 |
same capabilities as the complex search interface in the GUI.
|
1611 |
same capabilities as the complex search interface in the GUI.
|
1548 |
|
1612 |
|
1549 |
The language is roughly based on the Xesam user search language
|
1613 |
The language is roughly based on the (seemingly defunct) Xesam user search
|
1550 |
specification.
|
1614 |
language specification.
|
1551 |
|
1615 |
|
1552 |
If the results of a query language search puzzle you and you doubt what
|
1616 |
If the results of a query language search puzzle you and you doubt what
|
1553 |
has been actually searched for, you can use the GUI show query link at the
|
1617 |
has been actually searched for, you can use the GUI show query link at the
|
1554 |
top of the result list to check the exact query which was finally executed
|
1618 |
top of the result list to check the exact query which was finally executed
|
1555 |
by Xapian.
|
1619 |
by Xapian.
|
1556 |
|
1620 |
|
1557 |
Here follows a sample request that we are going to explain:
|
1621 |
Here follows a sample request that we are going to explain:
|
1558 |
|
1622 |
|
1559 |
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
1623 |
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
1560 |
|
1624 |
|
1561 |
|
1625 |
|
1562 |
This would search for all documents with John Doe appearing as a phrase in
|
1626 |
This would search for all documents with John Doe appearing as a phrase in
|
1563 |
the author field (exactly what this is would depend on the document type,
|
1627 |
the author field (exactly what this is would depend on the document type,
|
1564 |
ie: the From: header, for an email message), and containing either beatles
|
1628 |
ie: the From: header, for an email message), and containing either beatles
|
1565 |
or lennon and either live or unplugged but not potatoes (in any part of
|
1629 |
or lennon and either live or unplugged but not potatoes (in any part of
|
|
... |
|
... |
1583 |
|
1647 |
|
1584 |
As usual, words inside quotes define a phrase (the order of words is
|
1648 |
As usual, words inside quotes define a phrase (the order of words is
|
1585 |
significant), so that title:"prejudice pride" is not the same as
|
1649 |
significant), so that title:"prejudice pride" is not the same as
|
1586 |
title:prejudice title:pride, and is unlikely to find a result.
|
1650 |
title:prejudice title:pride, and is unlikely to find a result.
|
1587 |
|
1651 |
|
1588 |
Most Xesam phrase modifiers are unsupported, except for l (small ell) to
|
1652 |
Modifiers can be set on a phrase clause, for example to specify a
|
1589 |
disable stemming, and p to turn a phrase into a NEAR (unordered proximity)
|
1653 |
proximity search (unordered). See the modifier section.
|
1590 |
search. Example: "prejudice pride"p
|
|
|
1591 |
|
1654 |
|
1592 |
Recoll currently manages the following default fields:
|
1655 |
Recoll currently manages the following default fields:
|
1593 |
|
1656 |
|
1594 |
* title, subject or caption are synonyms which specify data to be
|
1657 |
* title, subject or caption are synonyms which specify data to be
|
1595 |
searched for in the document title or subject.
|
1658 |
searched for in the document title or subject.
|
|
... |
|
... |
1607 |
|
1670 |
|
1608 |
The field syntax also supports a few field-like, but special, criteria:
|
1671 |
The field syntax also supports a few field-like, but special, criteria:
|
1609 |
|
1672 |
|
1610 |
* dir for filtering the results on file location (Ex:
|
1673 |
* dir for filtering the results on file location (Ex:
|
1611 |
dir:/home/me/somedir). -dir also works to find results out of the
|
1674 |
dir:/home/me/somedir). -dir also works to find results out of the
|
1612 |
specified directory, only after release 1.15.8.
|
1675 |
specified directory, only after release 1.15.8. A tilde inside the
|
|
|
1676 |
value will be expanded to the home directory. dir is not a regular
|
|
|
1677 |
field and only one value makes sense in a query (you can't use
|
|
|
1678 |
dir:dir1 OR dir:dir2). Relative paths make sense, for example,
|
|
|
1679 |
dir:share/doc would match either /usr/share/doc or
|
|
|
1680 |
/usr/local/share/doc.
|
|
|
1681 |
|
|
|
1682 |
* size for filtering the results on file size. Example: size<10000. You
|
|
|
1683 |
can use <, > or = as operators. You can specify a range like the
|
|
|
1684 |
following: size>100 size<1000. The usual k/K, m/M, g/G, t/T can be
|
|
|
1685 |
used as (decimal) multipliers. Ex: size>1k to search for files bigger
|
|
|
1686 |
than 1000 bytes.
|
1613 |
|
1687 |
|
1614 |
* date for searching or filtering on dates. The syntax for the argument
|
1688 |
* date for searching or filtering on dates. The syntax for the argument
|
1615 |
is based on the ISO8601 standard for dates and time intervals. Only
|
1689 |
is based on the ISO8601 standard for dates and time intervals. Only
|
1616 |
dates are supported, no times. The general syntax is 2 elements
|
1690 |
dates are supported, no times. The general syntax is 2 elements
|
1617 |
separated by a / character. Each element can be a date or a period of
|
1691 |
separated by a / character. Each element can be a date or a period of
|
|
... |
|
... |
1826 |
documents per file (ie: for zip or chm files). They communicate with
|
1900 |
documents per file (ie: for zip or chm files). They communicate with
|
1827 |
the indexer through a simple protocol, but are nevertheless a bit more
|
1901 |
the indexer through a simple protocol, but are nevertheless a bit more
|
1828 |
complicated than the older kind. Most of these new filters are written
|
1902 |
complicated than the older kind. Most of these new filters are written
|
1829 |
in Python, using a common module to handle the protocol.
|
1903 |
in Python, using a common module to handle the protocol.
|
1830 |
|
1904 |
|
1831 |
The following will just describe the simple filters, if you are programmer
|
1905 |
The following will just describe the simple filters. If you can program
|
1832 |
enough to write one of the other kind, it shouldn't be too difficult to
|
1906 |
and want to write one of the other kind, it shouldn't be too difficult to
|
1833 |
make sense of one of the existing modules (ie: rclzip).
|
1907 |
make sense of one of the existing modules. For example, look at rclzip
|
|
|
1908 |
which uses Zip file paths as internal identifiers (ipath), and rclinfo,
|
|
|
1909 |
which uses an integer index.
|
|
|
1910 |
|
|
|
1911 |
----------------------------------------------------------------------
|
|
|
1912 |
|
|
|
1913 |
4.1.1. Simple filters
|
1834 |
|
1914 |
|
1835 |
Recoll simple filters are usually shell-scripts, but this is in no way
|
1915 |
Recoll simple filters are usually shell-scripts, but this is in no way
|
1836 |
necessary. These programs are extremely simple and most of the difficulty
|
1916 |
necessary. Extracting the text from the native format is the difficult
|
1837 |
lies in extracting the text from the native format, not outputting what is
|
1917 |
part. Outputting the format expected by Recoll is trivial. Happily enough,
|
1838 |
expected by Recoll. Happily enough, most document formats already have
|
1918 |
most document formats have translators or text extractors which can be
|
1839 |
translators or text extractors which handle the difficult part and can be
|
|
|
1840 |
called from the filter. In some case the output of the translating program
|
1919 |
called from the filter. In some cases the output of the translating
|
1841 |
is appropriate, and no intermediate shell-script is needed.
|
1920 |
program is completely appropriate, and no intermediate shell-script is
|
|
|
1921 |
needed.
|
1842 |
|
1922 |
|
1843 |
Filters are called with a single argument which is the source file name.
|
1923 |
Filters are called with a single argument which is the source file name.
|
1844 |
They should output the result to stdout.
|
1924 |
They should output the result to stdout.
|
1845 |
|
1925 |
|
|
|
1926 |
When writing a filter, you should decide if it will output plain text or
|
|
|
1927 |
HTML. Plain text is simpler, but you will not be able to add metadata or
|
|
|
1928 |
vary the output character encoding (this will be defined in a
|
|
|
1929 |
configuration file). Additionally, some formatting may be easier to preserve
|
|
|
1930 |
when previewing HTML. Actually, the deciding factor is metadata: Recoll has
|
|
|
1931 |
a way to extract metadata from the HTML header and use it for field
|
|
|
1932 |
searches.
|
|
|
1933 |
|
1846 |
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
|
1934 |
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
|
1847 |
the filter if the operation is for indexing or previewing. Some filters
|
1935 |
the filter if the operation is for indexing or previewing. Some filters
|
1848 |
use this to output a slightly different format. This is not essential.
|
1936 |
use this to output a slightly different format, for example stripping
|
|
|
1937 |
uninteresting repeated keywords (ie: Subject: for email) when indexing.
|
|
|
1938 |
This is not essential.
|
|
|
1939 |
|
|
|
1940 |
You should look at one of the simple filters, for example rclps, for a
|
|
|
1941 |
starting point.
|
|
|
1942 |
|
|
|
1943 |
Don't forget to make your filter executable before testing!
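
As a rough illustration, a simple filter could look like the following
Python sketch. It assumes a hypothetical input format which is already
plain text, so that the "extraction" step is just reading the file; a real
filter would call the appropriate translator instead. The function names
and the choice of wrapping the text in a pre element when previewing are
assumptions of the example. Only the single file name argument, the output
on stdout and the RECOLL_FILTER_FORPREVIEW variable come from the
description above.

```python
#!/usr/bin/env python3
import html
import os
import sys

def build_html(text, keep_layout=False):
    """Wrap extracted text in a minimal HTML document, escaping & and <."""
    body = html.escape(text, quote=False)
    if keep_layout:
        # Assumed design choice: preserve line breaks when previewing.
        body = "<pre>" + body + "</pre>"
    return ('<html><head>\n'
            '<meta http-equiv="Content-Type" '
            'content="text/html;charset=UTF-8">\n'
            '</head>\n'
            '<body>' + body + '</body></html>\n')

def convert(path, for_preview=False):
    """Read the source file (hypothetical plain text format) and convert."""
    with open(path, encoding="utf-8", errors="replace") as f:
        text = f.read()
    return build_html(text, keep_layout=for_preview)

# Only act as a filter when called with exactly one file name argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    preview = os.environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    sys.stdout.write(convert(sys.argv[1], for_preview=preview))
```

The script could be tried from the shell by running it on a test file,
once with RECOLL_FILTER_FORPREVIEW=yes and once without, and comparing the
two outputs.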
|
|
|
1944 |
|
|
|
1945 |
----------------------------------------------------------------------
|
|
|
1946 |
|
|
|
1947 |
4.1.2. Telling Recoll about the filter
|
|
|
1948 |
|
|
|
1949 |
There are two elements that link a file to the filter which should process
|
|
|
1950 |
it: the association of a file to a mime type and the association of a mime
|
|
|
1951 |
type with a filter.
|
|
|
1952 |
|
|
|
1953 |
The association of files to mime types is mostly based on name suffixes.
|
|
|
1954 |
The types are defined inside the mimemap file. Example:
|
|
|
1955 |
|
|
|
1956 |
|
|
|
1957 |
.doc = application/msword
|
|
|
1958 |
|
|
|
1959 |
If no suffix association is found for the file name, Recoll will try to
|
|
|
1960 |
execute the file -i command to determine a mime type.
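
The lookup order described above can be sketched as follows in Python.
This is not how Recoll itself is implemented: the table is a hypothetical
two-line excerpt in the spirit of mimemap, the function names are made up,
and the parsing of the file -i output assumes the usual "type; charset=..."
form.

```python
import os
import subprocess

# Hypothetical excerpt of suffix associations, in the spirit of mimemap.
SUFFIX_TO_MIME = {
    ".doc": "application/msword",
    ".txt": "text/plain",
}

def mime_from_suffix(path, table=SUFFIX_TO_MIME):
    """Return the mime type associated with the name suffix, or None."""
    suffix = os.path.splitext(path)[1]
    return table.get(suffix)

def mime_type(path, table=SUFFIX_TO_MIME):
    """Try the suffix table first, then fall back to the file command."""
    mime = mime_from_suffix(path, table)
    if mime is not None:
        return mime
    try:
        result = subprocess.run(["file", "-b", "-i", path],
                                capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    # Assumed output form: "text/plain; charset=us-ascii"
    return result.stdout.split(";")[0].strip() or None
```

The suffix table is consulted first because it is cheap; the external
command is only run for names which have no known suffix.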
|
1849 |
|
1961 |
|
1850 |
The association of file types to filters is performed in the mimeconf
|
1962 |
The association of file types to filters is performed in the mimeconf
|
1851 |
file. A sample:
|
1963 |
file. A sample will probably be of better help than a long explanation:
|
1852 |
|
1964 |
|
|
|
1965 |
|
1853 |
[index]
|
1966 |
[index]
|
1854 |
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
1967 |
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
1855 |
mimetype = text/plain ; charset=utf-8
|
1968 |
mimetype = text/plain ; charset=utf-8
|
1856 |
|
1969 |
|
1857 |
application/ogg = exec rclogg
|
1970 |
application/ogg = exec rclogg
|
1858 |
|
1971 |
|
|
... |
|
... |
1874 |
and not output by unrtf in the HTML header section.
|
1987 |
and not output by unrtf in the HTML header section.
|
1875 |
|
1988 |
|
1876 |
* application/x-chm is processed by a persistent filter. This is
|
1989 |
* application/x-chm is processed by a persistent filter. This is
|
1877 |
determined by the execm keyword.
|
1990 |
determined by the execm keyword.
|
1878 |
|
1991 |
|
1879 |
The easiest way to write a new filter is probably to start from an
|
|
|
1880 |
existing one.
|
|
|
1881 |
|
|
|
1882 |
Filters which output text/plain text are generally simpler, but they
|
|
|
1883 |
cannot specify the character set and other metadata, so they are limited
|
|
|
1884 |
to cases where these elements are not needed.
|
|
|
1885 |
|
|
|
1886 |
----------------------------------------------------------------------
|
1992 |
----------------------------------------------------------------------
|
1887 |
|
1993 |
|
1888 |
4.1.1. Filter HTML output
|
1994 |
4.1.3. Filter HTML output
|
1889 |
|
1995 |
|
1890 |
The output HTML could be very minimal like the following example:
|
1996 |
The output HTML could be very minimal like the following example:
|
1891 |
|
1997 |
|
1892 |
<html><head>
|
1998 |
<html><head>
|
1893 |
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
|
1999 |
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
|
1894 |
</head>
|
2000 |
</head>
|
1895 |
<body>some text content</body></html>
|
2001 |
<body>some text content</body></html>
|
1896 |
|
2002 |
|
1897 |
|
2003 |
|
1898 |
You should take care to escape some characters inside the text by
|
2004 |
You should take care to escape some characters inside the text by
|
1899 |
transforming them into appropriate entities. "&" should be transformed
|
2005 |
transforming them into appropriate entities. "&" should be transformed
|
1900 |
into "&amp;", "<" should be transformed into "&lt;". This is not always
|
2006 |
into "&amp;", "<" should be transformed into "&lt;". This is not always
|
1901 |
properly done by translating programs which output HTML, and of course
|
2007 |
properly done by translating programs which output HTML, and of course
|
|
... |
|
... |
2208 |
confdir specifies a Recoll configuration directory
|
2314 |
confdir specifies a Recoll configuration directory
|
2209 |
(the default is built like for any Recoll program).
|
2315 |
(the default is built like for any Recoll program).
|
2210 |
extra_dbs is a list of external databases (xapian directories)
|
2316 |
extra_dbs is a list of external databases (xapian directories)
|
2211 |
writable decides if we can index new data through this connection
|
2317 |
writable decides if we can index new data through this connection
|
2212 |
|
2318 |
|
2213 |
|
|
|
2214 |
|
|
|
2215 |
----------------------------------------------------------------------
|
2319 |
----------------------------------------------------------------------
|
2216 |
|
2320 |
|
2217 |
4.3.2.3. Example code
|
2321 |
4.3.2.3. Example code
|
2218 |
|
2322 |
|
2219 |
The following sample would query the index with a user language string.
|
2323 |
The following sample would query the index with a user language string.
|
|
... |
|
... |
2239 |
print k, ":", getattr(doc, k).encode('utf-8')
|
2343 |
print k, ":", getattr(doc, k).encode('utf-8')
|
2240 |
abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
2344 |
abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
2241 |
print abs
|
2345 |
print abs
|
2242 |
print
|
2346 |
print
|
2243 |
|
2347 |
|
2244 |
|
2348 |
|
2245 |
|
2349 |
|
2246 |
----------------------------------------------------------------------
|
2350 |
----------------------------------------------------------------------
|
2247 |
|
2351 |
|
2248 |
Chapter 5. Installation and configuration
|
2352 |
Chapter 5. Installation and configuration
|
2249 |
|
2353 |
|
|
... |
|
... |
2470 |
|
2574 |
|
2471 |
* --with-file-command Specify the version of the 'file' command to use
|
2575 |
* --with-file-command Specify the version of the 'file' command to use
|
2472 |
(ie: --with-file-command=/usr/local/bin/file). Can be useful to enable
|
2576 |
(ie: --with-file-command=/usr/local/bin/file). Can be useful to enable
|
2473 |
the gnu version on systems where the native one is bad.
|
2577 |
the gnu version on systems where the native one is bad.
|
2474 |
|
2578 |
|
2475 |
* --without-gui Disable the Qt interface, and auxiliary uses of X11, and
|
2579 |
* --disable-qtgui Disable the Qt interface. Will allow building the
|
2476 |
compile the command line version.
|
2580 |
indexer and the command line search program in absence of a Qt
|
|
|
2581 |
environment.
|
|
|
2582 |
|
|
|
2583 |
* --disable-x11mon Disable X11 connection monitoring inside recollindex.
|
|
|
2584 |
Together with --disable-qtgui, this allows building recoll without Qt
|
|
|
2585 |
and X11.
|
2477 |
|
2586 |
|
2478 |
* Of course the usual autoconf configure options, like --prefix apply.
|
2587 |
* Of course the usual autoconf configure options, like --prefix apply.
|
2479 |
|
2588 |
|
2480 |
Normal procedure:
|
2589 |
Normal procedure:
|
2481 |
|
2590 |
|
2482 |
cd recoll-xxx
|
2591 |
cd recoll-xxx
|
2483 |
configure
|
2592 |
configure
|
2484 |
make
|
2593 |
make
|
2485 |
(practices usual hardship-repelling invocations)
|
2594 |
(practices usual hardship-repelling invocations)
|
2486 |
|
2595 |
|
2487 |
|
2596 |
|
2488 |
There is little auto-configuration. The configure script will mainly link
|
2597 |
There is little auto-configuration. The configure script will mainly link
|
2489 |
one of the system-specific files in the mk directory to mk/sysconf. If
|
2598 |
one of the system-specific files in the mk directory to mk/sysconf. If
|
2490 |
your system is not known yet, it will tell you as much, and you may want
|
2599 |
your system is not known yet, it will tell you as much, and you may want
|
2491 |
to manually copy and modify one of the existing files (the new file name
|
2600 |
to manually copy and modify one of the existing files (the new file name
|
|
... |
|
... |
2511 |
----------------------------------------------------------------------
|
2620 |
----------------------------------------------------------------------
|
2512 |
|
2621 |
|
2513 |
5.4. Configuration overview
|
2622 |
5.4. Configuration overview
|
2514 |
|
2623 |
|
2515 |
Most of the parameters specific to the recoll GUI are set through the
|
2624 |
Most of the parameters specific to the recoll GUI are set through the
|
2516 |
Preferences menu and stored in the standard Qt place ($HOME/.qt/recollrc).
|
2625 |
Preferences menu and stored in the standard Qt place
|
2517 |
You probably do not want to edit this by hand.
|
2626 |
($HOME/.config/Recoll.org/recoll.conf). You probably do not want to edit
|
|
|
2627 |
this by hand.
|
2518 |
|
2628 |
|
2519 |
Recoll indexing options are set inside text configuration files located in
|
2629 |
Recoll indexing options are set inside text configuration files located in
|
2520 |
a configuration directory. There can be several such directories, each of
|
2630 |
a configuration directory. There can be several such directories, each of
|
2521 |
which defines the parameters for one index.
|
2631 |
which defines the parameters for one index.
|
2522 |
|
2632 |
|
|
... |
|
... |
2556 |
# Space-separated list of directories to index.
|
2666 |
# Space-separated list of directories to index.
|
2557 |
topdirs = ~/docs /usr/share/doc
|
2667 |
topdirs = ~/docs /usr/share/doc
|
2558 |
|
2668 |
|
2559 |
[~/somedirectory-with-utf8-txt-files]
|
2669 |
[~/somedirectory-with-utf8-txt-files]
|
2560 |
defaultcharset = utf-8
|
2670 |
defaultcharset = utf-8
|
2561 |
|
2671 |
|
2562 |
|
2672 |
|
2563 |
There are three kinds of lines:
|
2673 |
There are three kinds of lines:
|
2564 |
|
2674 |
|
2565 |
* Comment (starts with #) or empty.
|
2675 |
* Comment (starts with #) or empty.
|
2566 |
|
2676 |
|
|
... |
|
... |
2615 |
A space-separated list of patterns for names of files or
|
2725 |
A space-separated list of patterns for names of files or
|
2616 |
directories that should be completely ignored. The list defined in
|
2726 |
directories that should be completely ignored. The list defined in
|
2617 |
the default file is:
|
2727 |
the default file is:
|
2618 |
|
2728 |
|
2619 |
skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \
|
2729 |
skippedNames = #* bin CVS Cache cache* caughtspam tmp .thumbnails .svn \
|
2620 |
*~ .beagle .git .hg .bzr loop.ps .xsession-errors \
|
2730 |
*~ .beagle .git .hg .bzr loop.ps .xsession-errors \
|
2621 |
.recoll* xapiandb recollrc recoll.conf
|
2731 |
.recoll* xapiandb recollrc recoll.conf
|
2622 |
|
2732 |
|
2623 |
The list can be redefined at any sub-directory in the indexed
|
2733 |
The list can be redefined at any sub-directory in the indexed
|
2624 |
area.
|
2734 |
area.
|
2625 |
|
2735 |
|
2626 |
The top-level directories are not affected by this list (that is,
|
2736 |
The top-level directories are not affected by this list (that is,
|
|
... |
|
... |
2650 |
indexed at startup, but not monitored.
|
2760 |
indexed at startup, but not monitored.
|
2651 |
|
2761 |
|
2652 |
Example of use for skipping text files only in a specific
|
2762 |
Example of use for skipping text files only in a specific
|
2653 |
directory:
|
2763 |
directory:
|
2654 |
|
2764 |
|
2655 |
skippedPaths = ~/somedir/*.txt
|
2765 |
skippedPaths = ~/somedir/*.txt
|
2656 |
|
2766 |
|
|
|
2767 |
|
|
|
2768 |
skippedPathsFnmPathname
|
|
|
2769 |
|
|
|
2770 |
The values in the *skippedPaths variables are matched by default
|
|
|
2771 |
with fnmatch(3), with the FNM_PATHNAME and FNM_LEADING_DIR flags.
|
|
|
2772 |
This means that '/' characters must be matched explicitly. You
|
|
|
2773 |
can set skippedPathsFnmPathname to 0 to disable the use of
|
|
|
2774 |
FNM_PATHNAME (meaning that /*/dir3 will match /dir1/dir2/dir3).
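
The effect of FNM_PATHNAME can be emulated with a few lines of Python. The
standard fnmatch module has no such flag, so the sketch below compares
pattern and path component by component when pathname matching is
requested. The path_match function is made up for the illustration, it
does not emulate FNM_LEADING_DIR, and it is not the code used by the
indexer.

```python
from fnmatch import fnmatchcase

def path_match(pattern, path, pathname=True):
    """Match path against a shell pattern.

    With pathname=True, wildcards never match '/', mimicking
    FNM_PATHNAME: pattern and path must have the same number of
    components and each component must match. With pathname=False,
    '*' may span several directory levels.
    """
    if not pathname:
        return fnmatchcase(path, pattern)
    pattern_parts = pattern.split("/")
    path_parts = path.split("/")
    return (len(pattern_parts) == len(path_parts) and
            all(fnmatchcase(part, pat)
                for pat, part in zip(pattern_parts, path_parts)))
```

With this sketch, path_match("/*/dir3", "/dir1/dir2/dir3") is false, while
the same call with pathname=False is true, which corresponds to setting
skippedPathsFnmPathname to 0.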
|
2657 |
|
2775 |
|
2658 |
followLinks
|
2776 |
followLinks
|
2659 |
|
2777 |
|
2660 |
Specifies if the indexer should follow symbolic links while
|
2778 |
Specifies if the indexer should follow symbolic links while
|
2661 |
walking the file tree. The default is to ignore symbolic links to
|
2779 |
walking the file tree. The default is to ignore symbolic links to
|
|
... |
|
... |
2799 |
needed when the index is initialized. If this is not an absolute
|
2917 |
needed when the index is initialized. If this is not an absolute
|
2800 |
path, it will be interpreted relative to the configuration
|
2918 |
path, it will be interpreted relative to the configuration
|
2801 |
directory. The value can have embedded spaces but starting or
|
2919 |
directory. The value can have embedded spaces but starting or
|
2802 |
trailing spaces will be trimmed. You cannot use quotes here.
|
2920 |
trailing spaces will be trimmed. You cannot use quotes here.
|
2803 |
|
2921 |
|
|
|
2922 |
idxstatusfile
|
|
|
2923 |
|
|
|
2924 |
The name of the scratch file where the indexer process updates its
|
|
|
2925 |
status. Default: idxstatus.txt inside the configuration directory.
|
|
|
2926 |
|
2804 |
maxfsoccuppc
|
2927 |
maxfsoccuppc
|
2805 |
|
2928 |
|
2806 |
Maximum file system occupation before we stop indexing. The value
|
2929 |
Maximum file system occupation before we stop indexing. The value
|
2807 |
is a percentage, corresponding to what the "Capacity" df output
|
2930 |
is a percentage, corresponding to what the "Capacity" df output
|
2808 |
column shows. The default value is 0, meaning no checking.
|
2931 |
column shows. The default value is 0, meaning no checking.
|
|
... |
|
... |
2864 |
space-separated list, each entry being a pattern and a time in
|
2987 |
space-separated list, each entry being a pattern and a time in
|
2865 |
seconds, separated by a colon. You can use double quotes if a path
|
2988 |
seconds, separated by a colon. You can use double quotes if a path
|
2866 |
entry contains white space. Example:
|
2989 |
entry contains white space. Example:
|
2867 |
|
2990 |
|
2868 |
mondelaypatterns = *.log:20 "this one has spaces*:10"
|
2991 |
mondelaypatterns = *.log:20 "this one has spaces*:10"
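
To make the syntax concrete, here is a small Python sketch which splits
such a value into (pattern, seconds) pairs. It is only meant to show how
the quoting and the colon separator combine: the parse_mondelaypatterns
function is made up for the example, and using the last colon of each
entry as the separator is an assumption.

```python
import shlex

def parse_mondelaypatterns(value):
    """Split a mondelaypatterns value into (pattern, seconds) pairs.

    Entries are separated by white space, double quotes protect
    embedded spaces, and the delay follows the last colon of the entry.
    Raises ValueError if an entry has no numeric delay.
    """
    pairs = []
    for entry in shlex.split(value):
        pattern, _, seconds = entry.rpartition(":")
        pairs.append((pattern, int(seconds)))
    return pairs

if __name__ == "__main__":
    print(parse_mondelaypatterns('*.log:20 "this one has spaces*:10"'))
```

For the example value, this yields the pattern *.log with a delay of 20
seconds and the pattern "this one has spaces*" with a delay of 10 seconds.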
|
2869 |
|
2992 |
|
2870 |
|
2993 |
|
2871 |
monixinterval
|
2994 |
monixinterval
|
2872 |
|
2995 |
|
2873 |
Minimum interval (seconds) for processing the indexing queue. The
|
2996 |
Minimum interval (seconds) for processing the indexing queue. The
|
2874 |
real time monitor does not process each event when it comes in,
|
2997 |
real time monitor does not process each event when it comes in,
|
|
... |
|
... |
3105 |
|
3228 |
|
3106 |
.blob = application/x-blobapp
|
3229 |
.blob = application/x-blobapp
|
3107 |
|
3230 |
|
3108 |
Note that the mime type is made up here, and you could call it
|
3231 |
Note that the mime type is made up here, and you could call it
|
3109 |
diesel/oil just the same.
|
3232 |
diesel/oil just the same.
|
3110 |
|
|
|
3111 |
* In $RECOLL_CONFDIR/mimeview under the [view] section, add:
|
3233 |
* In $RECOLL_CONFDIR/mimeview under the [view] section, add:
|
3112 |
|
3234 |
|
3113 |
application/x-blobapp = blobviewer %f
|
3235 |
application/x-blobapp = blobviewer %f
|
3114 |
|
3236 |
|
3115 |
We are supposing that blobviewer wants a file name parameter here, you
|
3237 |
We are supposing that blobviewer wants a file name parameter here, you
|