|
a/src/README |
|
b/src/README |
|
... |
|
... |
132 |
|
132 |
|
133 |
3.8.2. The KDE Kicker Recoll applet
|
133 |
3.8.2. The KDE Kicker Recoll applet
|
134 |
|
134 |
|
135 |
4. Programming interface
|
135 |
4. Programming interface
|
136 |
|
136 |
|
137 |
4.1. Writing a document filter
|
137 |
4.1. Writing a document input handler
|
138 |
|
138 |
|
139 |
4.1.1. Simple filters
|
139 |
4.1.1. Simple input handlers
|
140 |
|
140 |
|
141 |
4.1.2. "Multiple" filters
|
141 |
4.1.2. "Multiple" handlers
|
142 |
|
142 |
|
143 |
4.1.3. Telling Recoll about the filter
|
143 |
4.1.3. Telling Recoll about the handler
|
144 |
|
144 |
|
145 |
4.1.4. Filter HTML output
|
145 |
4.1.4. Input handler HTML output
|
146 |
|
146 |
|
147 |
4.1.5. Page numbers
|
147 |
4.1.5. Page numbers
|
148 |
|
148 |
|
149 |
4.2. Field data processing
|
149 |
4.2. Field data processing
|
150 |
|
150 |
|
|
... |
|
... |
257 |
index, but the result is not nice, as all formatting, punctuation and
|
257 |
index, but the result is not nice, as all formatting, punctuation and
|
258 |
capitalization are lost).
|
258 |
capitalization are lost).
|
259 |
|
259 |
|
260 |
Recoll stores all internal data in Unicode UTF-8 format, and it can index
|
260 |
Recoll stores all internal data in Unicode UTF-8 format, and it can index
|
261 |
files with different character sets, encodings, and languages into the
|
261 |
files with different character sets, encodings, and languages into the
|
262 |
same index. It has input filters for many document types.
|
262 |
same index. It can process many document types.
|
263 |
|
263 |
|
264 |
Stemming is the process by which Recoll reduces words to their radicals so
|
264 |
Stemming is the process by which Recoll reduces words to their radicals so
|
265 |
that searching does not depend, for example, on a word being singular or
|
265 |
that searching does not depend, for example, on a word being singular or
|
266 |
plural (floor, floors), or on a verb tense (flooring, floored). Because
|
266 |
plural (floor, floors), or on a verb tense (flooring, floored). Because
|
267 |
the mechanisms used for stemming depend on the specific grammatical rules
|
267 |
the mechanisms used for stemming depend on the specific grammatical rules
|
|
... |
|
... |
416 |
to be indexed. In the latter case, any type not in the list will be
|
416 |
to be indexed. In the latter case, any type not in the list will be
|
417 |
ignored.
|
417 |
ignored.
|
418 |
|
418 |
|
419 |
Excluding types can be done by adding wildcard name patterns to the
|
419 |
Excluding types can be done by adding wildcard name patterns to the
|
420 |
skippedNames list, which can be done from the GUI Index configuration
|
420 |
skippedNames list, which can be done from the GUI Index configuration
|
421 |
menu. It is also possible to exclude a mime type independantly of the file
|
421 |
menu. For versions 1.20 and later, you can alternatively set the
|
422 |
name by associating it with the rclnull filter. This can be done by
|
422 |
excludedmimetypes list in the configuration file. This can be redefined
|
423 |
editing the mimeconf configuration file.
|
423 |
for subdirectories.
|
424 |
|
424 |
|
425 |
In order to define a positive list, You need to edit the main
|
425 |
You can also define an exclusive list of MIME types to be indexed (no
|
426 |
configuration file (recoll.conf) and set the indexedmimetypes
|
426 |
others will be indexed), by setting the indexedmimetypes configuration
|
427 |
configuration variable. Example:
|
427 |
variable. Example:
|
428 |
|
428 |
|
429 |
indexedmimetypes = text/html application/pdf
|
429 |
indexedmimetypes = text/html application/pdf
|
430 |
|
430 |
|
431 |
|
431 |
|
432 |
It is possible to redefine this parameter for subdirectories. Example:
|
432 |
It is possible to redefine this parameter for subdirectories. Example:
|
|
... |
|
... |
434 |
[/path/to/my/dir]
|
434 |
[/path/to/my/dir]
|
435 |
indexedmimetypes = application/pdf
|
435 |
indexedmimetypes = application/pdf
|
436 |
|
436 |
|
437 |
|
437 |
|
438 |
(When using sections like this, don't forget that they remain in effect
|
438 |
(When using sections like this, don't forget that they remain in effect
|
439 |
until the end of the file or another section indicator). There is no GUI
|
439 |
until the end of the file or another section indicator).
|
440 |
way to edit the parameter, because this option runs contrary to Recoll
|
440 |
|
441 |
main goal which is to help you find information, independantly of how it
|
441 |
excludedmimetypes or indexedmimetypes, can be set either by editing the
|
442 |
may be stored.
|
442 |
main configuration file (recoll.conf), or from the GUI index configuration
|
|
|
443 |
tool.
|
443 |
|
444 |
|
444 |
2.1.4. Recovery
|
445 |
2.1.4. Recovery
|
445 |
|
446 |
|
446 |
In the rare case where the index becomes corrupted (which can signal
|
447 |
In the rare case where the index becomes corrupted (which can signal
|
447 |
itself by weird search results or crashes), the index files need to be
|
448 |
itself by weird search results or crashes), the index files need to be
|
|
... |
|
... |
700 |
A freedesktop standard defines a few special attributes, which are handled
|
701 |
A freedesktop standard defines a few special attributes, which are handled
|
701 |
as such by Recoll:
|
702 |
as such by Recoll:
|
702 |
|
703 |
|
703 |
mime_type
|
704 |
mime_type
|
704 |
|
705 |
|
705 |
If set, this overrides any other determination of the file mime
|
706 |
If set, this overrides any other determination of the file MIME
|
706 |
type.
|
707 |
type.
|
707 |
|
708 |
|
708 |
charset
|
709 |
charset
|
709 |
If set, this defines the file character set (mostly useful for
|
710 |
If set, this defines the file character set (mostly useful for
|
710 |
plain text files).
|
711 |
plain text files).
|
|
... |
|
... |
1016 |
default, Recoll lets the desktop choose the appropriate application for
|
1017 |
default, Recoll lets the desktop choose the appropriate application for
|
1017 |
most document types (there is a short list of exceptions, see further). If
|
1018 |
most document types (there is a short list of exceptions, see further). If
|
1018 |
you prefer to completely customize the choice of applications, you can
|
1019 |
you prefer to completely customize the choice of applications, you can
|
1019 |
uncheck the Use desktop preferences option in the GUI preferences dialog,
|
1020 |
uncheck the Use desktop preferences option in the GUI preferences dialog,
|
1020 |
and click the Choose editor applications button to adjust the predefined
|
1021 |
and click the Choose editor applications button to adjust the predefined
|
1021 |
Recoll choices. The tool accepts multiple selections of mime types (e.g.
|
1022 |
Recoll choices. The tool accepts multiple selections of MIME types (e.g.
|
1022 |
to set up the editor for the dozens of office file types).
|
1023 |
to set up the editor for the dozens of office file types).
|
1023 |
|
1024 |
|
1024 |
Even when Use desktop preferences is checked, there is a small list of
|
1025 |
Even when Use desktop preferences is checked, there is a small list of
|
1025 |
exceptions, for mime types where the Recoll choice should override the
|
1026 |
exceptions, for MIME types where the Recoll choice should override the
|
1026 |
desktop one. These are applications which are well integrated with Recoll,
|
1027 |
desktop one. These are applications which are well integrated with Recoll,
|
1027 |
especially evince for viewing PDF and Postscript files because of its
|
1028 |
especially evince for viewing PDF and Postscript files because of its
|
1028 |
support for opening the document at a specific page and passing a search
|
1029 |
support for opening the document at a specific page and passing a search
|
1029 |
string as an argument. Of course, you can edit the list (in the GUI
|
1030 |
string as an argument. Of course, you can edit the list (in the GUI
|
1030 |
preferences) if you would prefer to lose the functionality and use the
|
1031 |
preferences) if you would prefer to lose the functionality and use the
|
|
... |
|
... |
1240 |
|
1241 |
|
1241 |
1. The first tab lets you specify terms to search for, and permits
|
1242 |
1. The first tab lets you specify terms to search for, and permits
|
1242 |
specifying multiple clauses which are combined to build the search.
|
1243 |
specifying multiple clauses which are combined to build the search.
|
1243 |
|
1244 |
|
1244 |
2. The second tab lets you filter the results according to file size, date of
|
1245 |
2. The second tab lets you filter the results according to file size, date of
|
1245 |
modification, mime type, or location.
|
1246 |
modification, MIME type, or location.
|
1246 |
|
1247 |
|
1247 |
Click on the Start Search button in the advanced search dialog, or type
|
1248 |
Click on the Start Search button in the advanced search dialog, or type
|
1248 |
Enter in any text field to start the search. The button in the main window
|
1249 |
Enter in any text field to start the search. The button in the main window
|
1249 |
always performs a simple search.
|
1250 |
always performs a simple search.
|
1250 |
|
1251 |
|
|
... |
|
... |
1303 |
o The next section allows filtering the results by file size. There are
|
1304 |
o The next section allows filtering the results by file size. There are
|
1304 |
two entries for minimum and maximum size. Enter decimal numbers. You
|
1305 |
two entries for minimum and maximum size. Enter decimal numbers. You
|
1305 |
can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12
|
1306 |
can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12
|
1306 |
respectively.
|
1307 |
respectively.
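The suffix multipliers described above can be illustrated with a short sketch. This is not Recoll's own parser: parse_size and MULTIPLIERS are hypothetical names, and the sketch assumes that only whole numbers are entered.

```python
# Hypothetical helper, not taken from the Recoll sources: decimal
# multipliers as listed in the text (k/K, m/M, g/G, t/T).
MULTIPLIERS = {"k": 10**3, "m": 10**6, "g": 10**9, "t": 10**12}


def parse_size(text):
    """Convert a string such as '10k' or '2G' to a number of bytes."""
    text = text.strip()
    if not text:
        raise ValueError("empty size")
    suffix = text[-1].lower()
    if suffix in MULTIPLIERS:
        # Whole numbers only in this sketch; int() rejects '1.5'.
        return int(text[:-1]) * MULTIPLIERS[suffix]
    return int(text)
```

With these assumptions, parse_size("10k") evaluates to 10000, and an entry without a suffix is taken as a plain byte count.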
|
1307 |
|
1308 |
|
1308 |
o The next section allows filtering the results by their mime types, or
|
1309 |
o The next section allows filtering the results by their MIME types, or
|
1309 |
mime categories (ie: media/text/message/etc.).
|
1310 |
MIME categories (ie: media/text/message/etc.).
|
1310 |
|
1311 |
|
1311 |
You can transfer the types between two boxes, to define which will be
|
1312 |
You can transfer the types between two boxes, to define which will be
|
1312 |
included or excluded by the search.
|
1313 |
included or excluded by the search.
|
1313 |
|
1314 |
|
1314 |
The state of the file type selection can be saved as the default (the
|
1315 |
The state of the file type selection can be saved as the default (the
|
|
... |
|
... |
1645 |
Open link in the result list, instead of the application defined in
|
1646 |
Open link in the result list, instead of the application defined in
|
1646 |
mimeview. xdg-open will in turn use your desktop preferences to choose
|
1647 |
mimeview. xdg-open will in turn use your desktop preferences to choose
|
1647 |
an appropriate application.
|
1648 |
an appropriate application.
|
1648 |
|
1649 |
|
1649 |
o Exceptions: when using the desktop preferences for opening documents,
|
1650 |
o Exceptions: when using the desktop preferences for opening documents,
|
1650 |
these are mime types that will still be opened according to Recoll
|
1651 |
these are MIME types that will still be opened according to Recoll
|
1651 |
preferences. This is useful for passing parameters like page numbers
|
1652 |
preferences. This is useful for passing parameters like page numbers
|
1652 |
or search strings to applications that support them (e.g. evince).
|
1653 |
or search strings to applications that support them (e.g. evince).
|
1653 |
This cannot be done with xdg-open which only supports passing one
|
1654 |
This cannot be done with xdg-open which only supports passing one
|
1654 |
parameter.
|
1655 |
parameter.
|
1655 |
|
1656 |
|
|
... |
|
... |
1787 |
|
1788 |
|
1788 |
o %A. Abstract
|
1789 |
o %A. Abstract
|
1789 |
|
1790 |
|
1790 |
o %D. Date
|
1791 |
o %D. Date
|
1791 |
|
1792 |
|
1792 |
o %I. Icon image name. This is normally determined from the mime type.
|
1793 |
o %I. Icon image name. This is normally determined from the MIME type.
|
1793 |
The associations are defined inside the mimeconf configuration file.
|
1794 |
The associations are defined inside the mimeconf configuration file.
|
1794 |
If a thumbnail for the file is found at the standard Freedesktop
|
1795 |
If a thumbnail for the file is found at the standard Freedesktop
|
1795 |
location, this will be displayed instead.
|
1796 |
location, this will be displayed instead.
|
1796 |
|
1797 |
|
1797 |
o %K. Keywords (if any)
|
1798 |
o %K. Keywords (if any)
|
1798 |
|
1799 |
|
1799 |
o %L. Precooked Preview, Edit, and possibly Snippets links
|
1800 |
o %L. Precooked Preview, Edit, and possibly Snippets links
|
1800 |
|
1801 |
|
1801 |
o %M. Mime type
|
1802 |
o %M. MIME type
|
1802 |
|
1803 |
|
1803 |
o %N. result Number inside the result page
|
1804 |
o %N. result Number inside the result page
|
1804 |
|
1805 |
|
1805 |
o %R. Relevance percentage
|
1806 |
o %R. Relevance percentage
|
1806 |
|
1807 |
|
|
... |
|
... |
1822 |
indexed but not stored fields is not known at this point in the search
|
1823 |
indexed but not stored fields is not known at this point in the search
|
1823 |
process (see field configuration). There are currently very few fields
|
1824 |
process (see field configuration). There are currently very few fields
|
1824 |
stored by default, apart from the values above (only author and filename),
|
1825 |
stored by default, apart from the values above (only author and filename),
|
1825 |
so this feature will need some custom local configuration to be useful. An
|
1826 |
so this feature will need some custom local configuration to be useful. An
|
1826 |
example candidate would be the recipient field which is generated by the
|
1827 |
example candidate would be the recipient field which is generated by the
|
1827 |
message filters.
|
1828 |
message input handlers.
|
1828 |
|
1829 |
|
1829 |
The default value for the paragraph format string is:
|
1830 |
The default value for the paragraph format string is:
|
1830 |
|
1831 |
|
1831 |
<img src="%I" align="left">%R %S %L <b>%T</b><br>
|
1832 |
<img src="%I" align="left">%R %S %L <b>%T</b><br>
|
1832 |
%M %D <i>%U</i> %i<br>
|
1833 |
%M %D <i>%U</i> %i<br>
|
|
... |
|
... |
1947 |
-b : basic. Just output urls, no mime types or titles
|
1948 |
-b : basic. Just output urls, no mime types or titles
|
1948 |
-Q : no result lines, just the processed query and result count
|
1949 |
-Q : no result lines, just the processed query and result count
|
1949 |
-m : dump the whole document meta[] array for each result
|
1950 |
-m : dump the whole document meta[] array for each result
|
1950 |
-A : output the document abstracts
|
1951 |
-A : output the document abstracts
|
1951 |
-S fld : sort by field <fld>
|
1952 |
-S fld : sort by field <fld>
|
|
|
1953 |
-s stemlang : set stemming language to use (must exist in index...)
|
|
|
1954 |
Use -s "" to turn off stem expansion
|
1952 |
-D : sort descending
|
1955 |
-D : sort descending
|
1953 |
-i <dbdir> : additional index, several can be given
|
1956 |
-i <dbdir> : additional index, several can be given
|
1954 |
-e use url encoding (%xx) for urls
|
1957 |
-e use url encoding (%xx) for urls
|
1955 |
-F <field name list> : output exactly these fields for each result.
|
1958 |
-F <field name list> : output exactly these fields for each result.
|
1956 |
The field values are encoded in base64, output in one line and
|
1959 |
The field values are encoded in base64, output in one line and
|
|
... |
|
... |
2137 |
|
2140 |
|
2138 |
o /2003 all documents from 2003 or older.
|
2141 |
o /2003 all documents from 2003 or older.
|
2139 |
|
2142 |
|
2140 |
Periods can also be specified with small letters (ie: p2y).
|
2143 |
Periods can also be specified with small letters (ie: p2y).
|
2141 |
|
2144 |
|
2142 |
o mime or format for specifying the mime type. This one is quite special
|
2145 |
o mime or format for specifying the MIME type. This one is quite special
|
2143 |
because you can specify several values which will be OR'ed (the normal
|
2146 |
because you can specify several values which will be OR'ed (the normal
|
2144 |
default for the language is AND). Ex: mime:text/plain mime:text/html.
|
2147 |
default for the language is AND). Ex: mime:text/plain mime:text/html.
|
2145 |
Specifying an explicit boolean operator before a mime specification is
|
2148 |
Specifying an explicit boolean operator before a mime specification is
|
2146 |
not supported and will produce strange results. You can filter out
|
2149 |
not supported and will produce strange results. You can filter out
|
2147 |
certain types by using negation (-mime:some/type), and you can use
|
2150 |
certain types by using negation (-mime:some/type), and you can use
|
2148 |
wildcards in the value (mime:text/*). Note that mime is the ONLY field
|
2151 |
wildcards in the value (mime:text/*). Note that mime is the ONLY field
|
2149 |
with an OR default. You do need to use OR with ext terms for example.
|
2152 |
with an OR default. You do need to use OR with ext terms for example.
|
2150 |
|
2153 |
|
2151 |
o type or rclcat for specifying the category (as in
|
2154 |
o type or rclcat for specifying the category (as in
|
2152 |
text/media/presentation/etc.). The classification of mime types in
|
2155 |
text/media/presentation/etc.). The classification of MIME types in
|
2153 |
categories is defined in the Recoll configuration (mimeconf), and can
|
2156 |
categories is defined in the Recoll configuration (mimeconf), and can
|
2154 |
be modified or extended. The default category names are those which
|
2157 |
be modified or extended. The default category names are those which
|
2155 |
permit filtering results in the main GUI screen. Categories are OR'ed
|
2158 |
permit filtering results in the main GUI screen. Categories are OR'ed
|
2156 |
like mime types above. This can't be negated with - either.
|
2159 |
like MIME types above. This can't be negated with - either.
|
2157 |
|
2160 |
|
2158 |
Words inside phrases and capitalized words are not stem-expanded.
|
2161 |
Words inside phrases and capitalized words are not stem-expanded.
|
2159 |
Wildcards may be used anywhere inside a term. Specifying a wild-card on
|
2162 |
Wildcards may be used anywhere inside a term. Specifying a wild-card on
|
2160 |
the left of a term can produce a very slow search (or even an incorrect
|
2163 |
the left of a term can produce a very slow search (or even an incorrect
|
2161 |
one if the expansion is truncated because of excessive size). Also see
|
2164 |
one if the expansion is truncated because of excessive size). Also see
|
2162 |
More about wildcards.
|
2165 |
More about wildcards.
|
2163 |
|
2166 |
|
2164 |
The document filters used while indexing have the possibility to create
|
2167 |
The document input handlers used while indexing have the possibility to
|
2165 |
other fields with arbitrary names, and aliases may be defined in the
|
2168 |
create other fields with arbitrary names, and aliases may be defined in
|
2166 |
configuration, so that the exact field search possibilities may be
|
2169 |
the configuration, so that the exact field search possibilities may be
|
2167 |
different for you if someone took care of the customisation.
|
2170 |
different for you if someone took care of the customisation.
|
2168 |
|
2171 |
|
2169 |
3.5.1. Modifiers
|
2172 |
3.5.1. Modifiers
|
2170 |
|
2173 |
|
2171 |
Some characters are recognized as search modifiers when found immediately
|
2174 |
Some characters are recognized as search modifiers when found immediately
|
|
... |
|
... |
2376 |
Chapter 4. Programming interface
|
2379 |
Chapter 4. Programming interface
|
2377 |
|
2380 |
|
2378 |
Recoll has an Application Programming Interface, usable both for indexing
|
2381 |
Recoll has an Application Programming Interface, usable both for indexing
|
2379 |
and searching, currently accessible from the Python language.
|
2382 |
and searching, currently accessible from the Python language.
|
2380 |
|
2383 |
|
2381 |
Another less radical way to extend the application is to write filters for
|
2384 |
Another less radical way to extend the application is to write input
|
2382 |
new types of documents.
|
2385 |
handlers for new types of documents.
|
2383 |
|
2386 |
|
2384 |
The processing of metadata attributes for documents (fields) is highly
|
2387 |
The processing of metadata attributes for documents (fields) is highly
|
2385 |
configurable.
|
2388 |
configurable.
|
2386 |
|
2389 |
|
2387 |
4.1. Writing a document filter
|
2390 |
4.1. Writing a document input handler
|
2388 |
|
2391 |
|
|
|
2392 |
Terminology
|
|
|
2393 |
|
|
|
2394 |
The small programs or pieces of code which handle the processing of the
|
|
|
2395 |
different document types for Recoll used to be called filters, which is
|
|
|
2396 |
still reflected in the name of the directory which holds them and many
|
|
|
2397 |
configuration variables. They were named this way because one of their
|
|
|
2398 |
primary functions is to filter out the formatting directives and keep the
|
|
|
2399 |
text content. However these modules may have other behaviours, and the
|
|
|
2400 |
term input handler is now progressively substituted in the documentation.
|
|
|
2401 |
filter is still used in many places though.
|
|
|
2402 |
|
2389 |
Recoll filters cooperate to translate from the multitude of input document
|
2403 |
Recoll input handlers cooperate to translate from the multitude of input
|
2390 |
formats, simple ones (such as opendocument, acrobat), or compound ones such as
|
2404 |
document formats, simple ones (such as opendocument, acrobat), or compound ones
|
2391 |
Zip or Email, into the final Recoll indexing input format, which may be
|
2405 |
such as Zip or Email, into the final Recoll indexing input format, which
|
2392 |
text/plain or text/html. Most filters are executable programs or scripts.
|
2406 |
is plain text. Most input handlers are executable programs or scripts. A
|
2393 |
A few filters are coded in C++ and live inside recollindex. This latter
|
2407 |
few handlers are coded in C++ and live inside recollindex. This latter
|
2394 |
kind will not be described here.
|
2408 |
kind will not be described here.
|
2395 |
|
2409 |
|
2396 |
There are currently (1.18 and since 1.13) two kinds of external executable
|
2410 |
There are currently (1.18 and since 1.13) two kinds of external executable
|
2397 |
filters:
|
2411 |
input handlers:
|
2398 |
|
2412 |
|
2399 |
o Simple filters (exec filters) run once and exit. They can be bare
|
2413 |
o Simple exec handlers run once and exit. They can be bare programs like
|
2400 |
programs like antiword, or scripts using other programs. They are very
|
2414 |
antiword, or scripts using other programs. They are very simple to
|
2401 |
simple to write, because they just need to print the converted
|
2415 |
write, because they just need to print the converted document to the
|
2402 |
document to the standard output. Their output can be text/plain or
|
2416 |
standard output. Their output can be plain text or HTML. HTML is
|
2403 |
text/html.
|
2417 |
usually preferred because it can store metadata fields and it allows
|
|
|
2418 |
preserving some of the formatting for the GUI preview.
|
2404 |
|
2419 |
|
2405 |
o Multiple filters (execm filters), run as long as their master process
|
2420 |
o Multiple execm handlers can process multiple files (sparing the
|
2406 |
(recollindex) is active. They can process multiple files (sparing the
|
|
|
2407 |
process startup time which can be very significant), or multiple
|
2421 |
process startup time which can be very significant), or multiple
|
2408 |
documents per file (e.g.: for zip or chm files). They communicate with
|
2422 |
documents per file (e.g.: for zip or chm files). They communicate with
|
2409 |
the indexer through a simple protocol, but are nevertheless a bit more
|
2423 |
the indexer through a simple protocol, but are nevertheless a bit more
|
2410 |
complicated than the older kind. Most new filters are written in
|
2424 |
complicated than the older kind. Most new handlers are written in
|
2411 |
Python, using a common module to handle the protocol. There is an
|
2425 |
Python, using a common module to handle the protocol. There is an
|
2412 |
exception, rclimg which is written in Perl. The subdocuments output by
|
2426 |
exception, rclimg which is written in Perl. The subdocuments output by
|
2413 |
these filters can be directly indexable (text or HTML), or they can be
|
2427 |
these handlers can be directly indexable (text or HTML), or they can
|
2414 |
other simple or compound documents that will need to be processed by
|
2428 |
be other simple or compound documents that will need to be processed
|
2415 |
another filter.
|
2429 |
by another handler.
|
2416 |
|
2430 |
|
2417 |
In both cases, filters deal with regular file system files, and can
|
2431 |
In both cases, handlers deal with regular file system files, and can
|
2418 |
process either a single document, or a linear list of documents in each
|
2432 |
process either a single document, or a linear list of documents in each
|
2419 |
file. Recoll is responsible for performing up-to-date checks, dealing with
|
2433 |
file. Recoll is responsible for performing up-to-date checks, dealing with
|
2420 |
more complex embedding and other upper level issues.
|
2434 |
more complex embedding and other upper level issues.
|
2421 |
|
2435 |
|
2422 |
In the extreme case of a simple filter returning a document in text/plain
|
2436 |
A simple handler returning a document in text/plain format can transfer
|
2423 |
format, no metadata can be transferred from the filter to the indexer.
|
2437 |
no metadata to the indexer. Generic metadata, like document size or
|
2424 |
Generic metadata, like document size or modification date, will be
|
2438 |
modification date, will be gathered and stored by the indexer.
|
2425 |
gathered and stored by the indexer.
|
|
|
2426 |
|
2439 |
|
2427 |
Filters that produce text/html format can return an arbitrary amount of
|
2440 |
Handlers that produce text/html format can return an arbitrary amount of
|
2428 |
metadata inside HTML meta tags. These will be processed according to the
|
2441 |
metadata inside HTML meta tags. These will be processed according to the
|
2429 |
directives found in the fields configuration file.
|
2442 |
directives found in the fields configuration file.
|
2430 |
|
2443 |
|
2431 |
The filters that can handle multiple documents per file return a single
|
2444 |
The handlers that can handle multiple documents per file return a single
|
2432 |
piece of data to identify each document inside the file. This piece of
|
2445 |
piece of data to identify each document inside the file. This piece of
|
2433 |
data, called an ipath element will be sent back by Recoll to extract the
|
2446 |
data, called an ipath element will be sent back by Recoll to extract the
|
2434 |
document at query time, for previewing, or for creating a temporary file
|
2447 |
document at query time, for previewing, or for creating a temporary file
|
2435 |
to be opened by a viewer.
|
2448 |
to be opened by a viewer.
|
2436 |
|
2449 |
|
2437 |
The following section describes the simple filters, and the next one gives
|
2450 |
The following section describes the simple handlers, and the next one
|
2438 |
a few explanations about the execm ones. You could conceivably write a
|
2451 |
gives a few explanations about the execm ones. You could conceivably write
|
2439 |
simple filter with only the elements in the manual. This will not be the
|
2452 |
a simple handler with only the elements in the manual. This will not be
|
2440 |
case for the other ones, for which you will have to look at the code.
|
2453 |
the case for the other ones, for which you will have to look at the code.
|
2441 |
|
2454 |
|
2442 |
4.1.1. Simple filters
|
2455 |
4.1.1. Simple input handlers
|
2443 |
|
2456 |
|
2444 |
Recoll simple filters are usually shell-scripts, but this is in no way
|
2457 |
Recoll simple handlers are usually shell-scripts, but this is in no way
|
2445 |
necessary. Extracting the text from the native format is the difficult
|
2458 |
necessary. Extracting the text from the native format is the difficult
|
2446 |
part. Outputting the format expected by Recoll is trivial. Happily enough,
|
2459 |
part. Outputting the format expected by Recoll is trivial. Happily enough,
|
2447 |
most document formats have translators or text extractors which can be
|
2460 |
most document formats have translators or text extractors which can be
|
2448 |
called from the filter. In some cases the output of the translating
|
2461 |
called from the handler. In some cases the output of the translating
|
2449 |
program is completely appropriate, and no intermediate shell-script is
|
2462 |
program is completely appropriate, and no intermediate shell-script is
|
2450 |
needed.
|
2463 |
needed.
|
2451 |
|
2464 |
|
2452 |
Filters are called with a single argument which is the source file name.
|
2465 |
Input handlers are called with a single argument which is the source file
|
2453 |
They should output the result to stdout.
|
2466 |
name. They should output the result to stdout.
|
2454 |
|
2467 |
|
2455 |
When writing a filter, you should decide if it will output plain text or
|
2468 |
When writing a handler, you should decide if it will output plain text or
|
2456 |
HTML. Plain text is simpler, but you will not be able to add metadata or
|
2469 |
HTML. Plain text is simpler, but you will not be able to add metadata or
|
2457 |
vary the output character encoding (this will be defined in a
|
2470 |
vary the output character encoding (this will be defined in a
|
2458 |
configuration file). Additionally, some formatting may be easier to
|
2471 |
configuration file). Additionally, some formatting may be easier to
|
2459 |
preserve when previewing HTML. Actually the deciding factor is metadata:
|
2472 |
preserve when previewing HTML. Actually the deciding factor is metadata:
|
2460 |
Recoll has a way to extract metadata from the HTML header and use it for
|
2473 |
Recoll has a way to extract metadata from the HTML header and use it for
|
2461 |
field searches.
|
2474 |
field searches.
|
2462 |
|
2475 |
|
2463 |
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
|
2476 |
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
|
2464 |
the filter if the operation is for indexing or previewing. Some filters
|
2477 |
the handler if the operation is for indexing or previewing. Some handlers
|
2465 |
use this to output a slightly different format, for example stripping
|
2478 |
use this to output a slightly different format, for example stripping
|
2466 |
uninteresting repeated keywords (ie: Subject: for email) when indexing.
|
2479 |
uninteresting repeated keywords (ie: Subject: for email) when indexing.
|
2467 |
This is not essential.
|
2480 |
This is not essential.
|
2468 |
|
2481 |
|
2469 |
You should look at one of the simple filters, for example rclps for a
|
2482 |
You should look at one of the simple handlers, for example rclps for a
|
2470 |
starting point.
|
2483 |
starting point.
|
2471 |
|
2484 |
|
2472 |
Don't forget to make your filter executable before testing !
|
2485 |
Don't forget to make your handler executable before testing !
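The points made in this section (single file name argument, result on stdout, HTML output carrying metadata in meta tags, RECOLL_FILTER_FORPREVIEW) can be put together in a minimal sketch. It is only an illustration, not one of the handlers shipped with Recoll: the function names, the "author" placeholder value and the Subject: stripping rule are assumptions, and the "extraction" step just reads the file as UTF-8 text where a real handler would call a translator.

```python
#!/usr/bin/env python3
# Sketch of a simple (exec) handler. Function names, the "author" field
# value and the Subject: stripping rule are illustrative assumptions.
import html
import os
import sys


def strip_repeated_keywords(text):
    """Drop a leading 'Subject:' keyword from lines (used when indexing)."""
    out = []
    for line in text.splitlines():
        if line.startswith("Subject:"):
            line = line[len("Subject:"):].lstrip()
        out.append(line)
    return "\n".join(out)


def to_html(text, title, fields):
    """Wrap extracted text in HTML, with metadata carried in meta tags."""
    metas = "".join(
        '<meta name="%s" content="%s">\n'
        % (html.escape(name, quote=True), html.escape(value, quote=True))
        for name, value in fields.items()
    )
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n'
        "<title>%s</title>\n%s"
        "</head><body><pre>%s</pre></body></html>\n"
        % (html.escape(title), metas, html.escape(text))
    )


def main(path):
    # Read the source file; a real handler would run a text extractor here.
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        text = f.read()
    # "yes" means the output is for previewing, "no" means indexing.
    for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    if not for_preview:
        text = strip_repeated_keywords(text)
    # The converted document goes to standard output.
    fields = {"author": "unknown"}  # placeholder metadata value
    sys.stdout.write(to_html(text, os.path.basename(path), fields))


# The handler is called with a single argument: the source file name.
if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv[1])
```

Running the script by hand with a file name as its only argument, and with RECOLL_FILTER_FORPREVIEW set to yes or no, is a quick way to compare the two outputs.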
|
2473 |
|
2486 |
|
2474 |
4.1.2. "Multiple" filters
|
2487 |
4.1.2. "Multiple" handlers
|
2475 |
|
2488 |
|
2476 |
If you can program and want to write an execm filter, it should not be too
|
2489 |
If you can program and want to write an execm handler, it should not be
|
2477 |
difficult to make sense of one of the existing modules. For example, look
|
2490 |
too difficult to make sense of one of the existing modules. For example,
|
2478 |
at rclzip which uses Zip file paths as identifiers (ipath), and rclics,
|
2491 |
look at rclzip which uses Zip file paths as identifiers (ipath), and
|
2479 |
which uses an integer index. Also have a look at the comments inside the
|
2492 |
rclics, which uses an integer index. Also have a look at the comments
|
2480 |
internfile/mh_execm.h file and possibly at the corresponding module.
|
2493 |
inside the internfile/mh_execm.h file and possibly at the corresponding
|
|
|
2494 |
module.
|
2481 |
|
2495 |
|
2482 |
execm filters sometimes need to make a choice for the nature of the ipath
|
2496 |
execm handlers sometimes need to make a choice for the nature of the ipath
|
2483 |
elements that they use in communication with the indexer. Here are a few
|
2497 |
elements that they use in communication with the indexer. Here are a few
|
2484 |
guidelines:
|
2498 |
guidelines:
|
2485 |
|
2499 |
|
2486 |
o Use ASCII or UTF-8 (if the identifier is an integer print it, for
|
2500 |
o Use ASCII or UTF-8 (if the identifier is an integer print it, for
|
2487 |
example, like printf %d would do).
|
2501 |
example, like printf %d would do).
|
|
... |
|
... |
2489 |
o If at all possible, the data should make some kind of sense when
|
2503 |
o If at all possible, the data should make some kind of sense when
|
2490 |
printed to a log file to help with debugging.
|
2504 |
printed to a log file to help with debugging.
|
2491 |
|
2505 |
|
2492 |
o Recoll uses a colon (:) as a separator to store a complex path
|
2506 |
o Recoll uses a colon (:) as a separator to store a complex path
|
2493 |
internally (for deeper embedding). Colons inside the ipath elements
|
2507 |
internally (for deeper embedding). Colons inside the ipath elements
|
2494 |
output by a filter will be escaped, but would be a bad choice as a
|
2508 |
output by a handler will be escaped, but would be a bad choice as a
|
2495 |
filter-specific separator (mostly, again, for debugging issues).
|
2509 |
handler-specific separator (mostly, again, for debugging issues).
|
2496 |
|
2510 |
|
2497 |
In any case, the main goal is that it should be easy for the filter to
|
2511 |
In any case, the main goal is that it should be easy for the handler to
|
2498 |
extract the target document, given the file name and the ipath element.
|
2512 |
extract the target document, given the file name and the ipath element.
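
The guidelines above could be applied along the following lines. This is a sketch under assumptions: the helper names and the "/" separator are hypothetical choices made for the example, not something mandated by Recoll, and the sketch does not handle parts that themselves contain the separator.

```python
# Hypothetical handler-specific separator. Anything readable in a log works,
# but ":" is avoided because Recoll uses it internally for complex paths.
SEPARATOR = "/"


def make_ipath(*parts):
    """Build an ipath element from integers and/or strings."""
    # str() on an integer gives the same decimal text as printf %d.
    return SEPARATOR.join(str(part) for part in parts)


def split_ipath(ipath):
    """Recover the parts of an ipath element built by make_ipath()."""
    return ipath.split(SEPARATOR)
```

For an integer index, make_ipath(3) simply yields the text 3, which is easy to read in a log and easy for the handler to convert back.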
|
2499 |
|
2513 |
|
2500 |
execm filters will also produce a document with a null ipath element.
|
2514 |
execm handlers will also produce a document with a null ipath element.
|
2501 |
Depending on the type of document, this may have some associated data
|
2515 |
Depending on the type of document, this may have some associated data
|
2502 |
(e.g. the body of an email message), or none (typical for an archive
|
2516 |
(e.g. the body of an email message), or none (typical for an archive
|
2503 |
file). If it is empty, this document will be useful anyway for some
|
2517 |
file). If it is empty, this document will be useful anyway for some
|
2504 |
operations, as the parent of the actual data documents.
|
2518 |
operations, as the parent of the actual data documents.
|
2505 |
|
2519 |
|
2506 |
4.1.3. Telling Recoll about the filter
|
2520 |
4.1.3. Telling Recoll about the handler
|
2507 |
|
2521 |
|
2508 |
There are two elements that link a file to the filter which should process
|
2522 |
There are two elements that link a file to the handler which should
|
2509 |
it: the association of file to mime type and the association of a mime
|
2523 |
process it: the association of file to MIME type and the association of a
|
2510 |
type with a filter.
|
2524 |
MIME type with a handler.
|
2511 |
|
2525 |
|
2512 |
The association of files to mime types is mostly based on name suffixes.
|
2526 |
The association of files to MIME types is mostly based on name suffixes.
|
2513 |
The types are defined inside the mimemap file. Example:
|
2527 |
The types are defined inside the mimemap file. Example:
|
2514 |
|
2528 |
|
2515 |
|
2529 |
|
2516 |
.doc = application/msword
|
2530 |
.doc = application/msword
|
2517 |
|
2531 |
|
2518 |
If no suffix association is found for the file name, Recoll will try to
|
2532 |
If no suffix association is found for the file name, Recoll will try to
|
2519 |
execute the file -i command to determine a mime type.
|
2533 |
execute the file -i command to determine a MIME type.
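
The lookup order described above (suffix first, then the file -i command) can be pictured with the following sketch. It is not Recoll code: the dictionary stands in for the mimemap file, and the fallback argument stands in for running file -i, so that the example stays self-contained.

```python
import os

# Stand-in for the content of mimemap, reduced to the example entry.
MIMEMAP = {".doc": "application/msword"}


def mime_from_suffix(path, mimemap=MIMEMAP):
    """Return the MIME type for the file name suffix, or None if unknown."""
    suffix = os.path.splitext(path)[1]
    return mimemap.get(suffix)


def mime_type(path, fallback=None, mimemap=MIMEMAP):
    """Try the suffix association first, then the fallback if there is one."""
    found = mime_from_suffix(path, mimemap)
    if found is None and fallback is not None:
        # Placeholder for the "file -i" step.
        found = fallback(path)
    return found
```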
|
2520 |
|
2534 |
|
2521 |
The association of file types to filters is performed in the mimeconf
|
2535 |
The association of file types to handlers is performed in the mimeconf
|
2522 |
file. A sample will probably be of better help than a long explanation:
|
2536 |
file. A sample will probably be of better help than a long explanation:
|
2523 |
|
2537 |
|
2524 |
|
2538 |
|
2525 |
[index]
|
2539 |
[index]
|
2526 |
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
2540 |
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
|
... |
|
... |
2543 |
|
2557 |
|
2544 |
o text/rtf is processed by unrtf, which outputs text/html. The
|
2558 |
o text/rtf is processed by unrtf, which outputs text/html. The
|
2545 |
iso-8859-1 encoding is specified because it is not the utf-8 default,
|
2559 |
iso-8859-1 encoding is specified because it is not the utf-8 default,
|
2546 |
and not output by unrtf in the HTML header section.
|
2560 |
and not output by unrtf in the HTML header section.
|
2547 |
|
2561 |
|
2548 |
o application/x-chm is processed by a persistent filter. This is
|
2562 |
o application/x-chm is processed by a persistent handler. This is
|
2549 |
determined by the execm keyword.
|
2563 |
determined by the execm keyword.
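
To make the structure of these entries more concrete, here is a small sketch that splits the right-hand side of a mimeconf [index] entry into its keyword, command and attributes. It is an illustration only, not the parser used by Recoll: the function names are hypothetical, the attribute names used in the usage note are assumptions based on the description above, and a command containing a semicolon would not be handled.

```python
def parse_handler_value(value):
    """Split a mimeconf [index] value into (keyword, command, attributes)."""
    # Join lines continued with a trailing backslash.
    value = value.replace("\\\n", "")
    fields = [field.strip() for field in value.split(";")]
    # The first word is the keyword (exec or execm), the rest is the command.
    keyword, _, command = fields[0].partition(" ")
    attributes = {}
    for field in fields[1:]:
        if "=" in field:
            name, _, setting = field.partition("=")
            attributes[name.strip()] = setting.strip()
    return keyword, command.strip(), attributes


def is_persistent(value):
    """True when the value uses the execm keyword (persistent handler)."""
    return parse_handler_value(value)[0] == "execm"
```

With an assumed value such as "exec unrtf --nopict --html;charset=iso-8859-1;mimetype=text/html", the sketch returns the exec keyword, the unrtf command line, and a dictionary holding the charset and mimetype settings.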
|
2550 |
|
2564 |
|
2551 |
4.1.4. Filter HTML output
|
2565 |
4.1.4. Input handler HTML output
|
2552 |
|
2566 |
|
2553 |
The output HTML could be very minimal like the following example:
|
2567 |
The output HTML could be very minimal like the following example:
|
2554 |
|
2568 |
|
2555 |
<html>
|
2569 |
<html>
|
2556 |
<head>
|
2570 |
<head>
|
|
... |
|
... |
2598 |
Example:
|
2612 |
Example:
|
2599 |
|
2613 |
|
2600 |
<meta name="date" content="2013-02-24 17:50:00">
|
2614 |
<meta name="date" content="2013-02-24 17:50:00">
|
2601 |
|
2615 |
|
2602 |
|
2616 |
|
2603 |
Filters also have the possibility to "invent" field names. This should
|
2617 |
Input handlers also have the possibility to "invent" field names. This
|
2604 |
also be output as meta tags:
|
2618 |
should also be output as meta tags:
|
2605 |
|
2619 |
|
2606 |
<meta name="somefield" content="Some textual data" />
|
2620 |
<meta name="somefield" content="Some textual data" />
|
2607 |
|
2621 |
|
2608 |
You can embed HTML markup inside the content of custom fields, for
|
2622 |
You can embed HTML markup inside the content of custom fields, for
|
2609 |
improving the display inside result lists. In this case, add a (wildly
|
2623 |
improving the display inside result lists. In this case, add a (wildly
|
|
... |
|
... |
2615 |
As written above, the processing of fields is described in a further
|
2629 |
As written above, the processing of fields is described in a further
|
2616 |
section.
|
2630 |
section.
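
As an illustration of the output format discussed in this section, a Python handler could assemble its HTML roughly as follows. This is a sketch, not code from a Recoll handler: the function name is hypothetical, the Content-Type line is an assumption about what a minimal document would declare, and every value is escaped as plain text (so this sketch does not cover fields that embed HTML markup).

```python
import html


def handler_html(body_text, fields):
    """Build a minimal HTML document with one meta tag per field."""
    metas = "\n".join(
        '<meta name="%s" content="%s">'
        % (html.escape(name, quote=True), html.escape(value, quote=True))
        for name, value in fields.items()
    )
    return (
        "<html>\n<head>\n"
        '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">\n'
        "%s\n</head>\n<body>\n<pre>%s</pre>\n</body>\n</html>\n"
        % (metas, html.escape(body_text))
    )
```

Calling handler_html(text, {"date": "2013-02-24 17:50:00", "somefield": "Some textual data"}) produces a document whose header carries both meta tags.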
|
2617 |
|
2631 |
|
2618 |
4.1.5. Page numbers
|
2632 |
4.1.5. Page numbers
|
2619 |
|
2633 |
|
2620 |
The indexer will interpret ^L characters in the filter output as
|
2634 |
The indexer will interpret ^L characters in the handler output as
|
2621 |
indicating page breaks, and will record them. At query time, this allows
|
2635 |
indicating page breaks, and will record them. At query time, this allows
|
2622 |
starting a viewer on the right page for a hit or a snippet. Currently,
|
2636 |
starting a viewer on the right page for a hit or a snippet. Currently,
|
2623 |
only the PDF, Postscript and DVI filters generate page breaks.
|
2637 |
only the PDF, Postscript and DVI handlers generate page breaks.
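
The convention above can be pictured with a short sketch. The ^L character is the ASCII form feed (code 12, written "\f" in Python). The helper names are hypothetical, and page_of_offset() only illustrates how a position can be mapped back to a page; it does not describe how the indexer stores the breaks.

```python
FORM_FEED = "\f"  # the ^L character, ASCII code 12


def join_pages(pages):
    """Join per-page text so that each page break is marked by ^L."""
    return FORM_FEED.join(pages)


def page_of_offset(text, offset):
    """Return the 1-based page number containing a character offset."""
    # Each form feed before the offset starts a new page.
    return text.count(FORM_FEED, 0, offset) + 1
```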
|
2624 |
|
2638 |
|
2625 |
4.2. Field data processing
|
2639 |
4.2. Field data processing
|
2626 |
|
2640 |
|
2627 |
Fields are named pieces of information in or about documents, like title,
|
2641 |
Fields are named pieces of information in or about documents, like title,
|
2628 |
author, abstract.
|
2642 |
author, abstract.
|
2629 |
|
2643 |
|
2630 |
The field values for documents can appear in several ways during indexing:
|
2644 |
The field values for documents can appear in several ways during indexing:
|
2631 |
either output by filters as meta fields in the HTML header section, or
|
2645 |
either output by input handlers as meta fields in the HTML header section,
|
2632 |
extracted from file extended attributes, or added as attributes of the Doc
|
2646 |
or extracted from file extended attributes, or added as attributes of the
|
2633 |
object when using the API, or again synthesized internally by Recoll.
|
2647 |
Doc object when using the API, or again synthesized internally by Recoll.
|
2634 |
|
2648 |
|
2635 |
The Recoll query language allows searching for text in a specific field.
|
2649 |
The Recoll query language allows searching for text in a specific field.
|
2636 |
|
2650 |
|
2637 |
Recoll defines a number of default fields. Additional ones can be output
|
2651 |
Recoll defines a number of default fields. Additional ones can be output
|
2638 |
by filters, and described in the fields configuration file.
|
2652 |
by handlers, and described in the fields configuration file.
|
2639 |
|
2653 |
|
2640 |
Fields can be:
|
2654 |
Fields can be:
|
2641 |
|
2655 |
|
2642 |
o indexed, meaning that their terms are separately stored in inverted
|
2656 |
o indexed, meaning that their terms are separately stored in inverted
|
2643 |
lists (with a specific prefix), and that a field-specific search is
|
2657 |
lists (with a specific prefix), and that a field-specific search is
|
|
... |
|
... |
2792 |
|
2806 |
|
2793 |
Classes
|
2807 |
Classes
|
2794 |
|
2808 |
|
2795 |
The Db class
|
2809 |
The Db class
|
2796 |
|
2810 |
|
2797 |
A Db object is created by a connect() function and holds a connection to a
|
2811 |
A Db object is created by a connect() call and holds a connection to a
|
2798 |
Recoll index.
|
2812 |
Recoll index.
|
2799 |
|
2813 |
|
2800 |
Methods
|
2814 |
Methods
|
2801 |
|
2815 |
|
2802 |
Db.close()
|
2816 |
Db.close()
|
|
... |
|
... |
3086 |
After an indexing pass, the commands that were found missing can be
|
3100 |
After an indexing pass, the commands that were found missing can be
|
3087 |
displayed from the recoll File menu. The list is stored in the missing
|
3101 |
displayed from the recoll File menu. The list is stored in the missing
|
3088 |
text file inside the configuration directory.
|
3102 |
text file inside the configuration directory.
|
3089 |
|
3103 |
|
3090 |
A list of common file types which need external commands follows. Many of
|
3104 |
A list of common file types which need external commands follows. Many of
|
3091 |
the filters need the iconv command, which is not always listed as a
|
3105 |
the handlers need the iconv command, which is not always listed as a
|
3092 |
dependency.
|
3106 |
dependency.
|
3093 |
|
3107 |
|
3094 |
Please note that, due to the relatively dynamic nature of this
|
3108 |
Please note that, due to the relatively dynamic nature of this
|
3095 |
information, the most up to date version is now kept on
|
3109 |
information, the most up to date version is now kept on
|
3096 |
http://www.recoll.org/features.html along with links to the home pages or
|
3110 |
http://www.recoll.org/features.html along with links to the home pages or
|
|
... |
|
... |
3101 |
from the package repositories. However, the packages are sometimes
|
3115 |
from the package repositories. However, the packages are sometimes
|
3102 |
outdated, or not the best version for Recoll, so you should take a look at
|
3116 |
outdated, or not the best version for Recoll, so you should take a look at
|
3103 |
http://www.recoll.org/features.html if a file type is important to you.
|
3117 |
http://www.recoll.org/features.html if a file type is important to you.
|
3104 |
|
3118 |
|
3105 |
As of Recoll release 1.14, a number of XML-based formats that were handled
|
3119 |
As of Recoll release 1.14, a number of XML-based formats that were handled
|
3106 |
by ad hoc filter code now use the xsltproc command, which usually comes
|
3120 |
by ad hoc handler code now use the xsltproc command, which usually comes
|
3107 |
with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
|
3121 |
with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
|
3108 |
|
3122 |
|
3109 |
Now for the list:
|
3123 |
Now for the list:
|
3110 |
|
3124 |
|
3111 |
o Openoffice files need unzip and xsltproc.
|
3125 |
o Openoffice files need unzip and xsltproc.
|
|
... |
|
... |
3119 |
|
3133 |
|
3120 |
o MS Word needs antiword. It is also useful to have wvWare installed as
|
3134 |
o MS Word needs antiword. It is also useful to have wvWare installed as
|
3121 |
it may be used as a fallback for some files which antiword does not
|
3135 |
it may be used as a fallback for some files which antiword does not
|
3122 |
handle.
|
3136 |
handle.
|
3123 |
|
3137 |
|
3124 |
o MS Excel and PowerPoint need catdoc.
|
3138 |
o MS Excel and PowerPoint are processed by internal Python handlers.
|
3125 |
|
3139 |
|
3126 |
o MS Open XML (docx) needs xsltproc.
|
3140 |
o MS Open XML (docx) needs xsltproc.
|
3127 |
|
3141 |
|
3128 |
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
|
3142 |
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
|
3129 |
Ubuntu) package.
|
3143 |
Ubuntu) package.
|
|
... |
|
... |
3138 |
|
3152 |
|
3139 |
o dvi files need dvips.
|
3153 |
o dvi files need dvips.
|
3140 |
|
3154 |
|
3141 |
o djvu files need djvutxt and djvused from the DjVuLibre package.
|
3155 |
o djvu files need djvutxt and djvused from the DjVuLibre package.
|
3142 |
|
3156 |
|
3143 |
o Audio files: Recoll releases before 1.13 used the id3info command from
|
3157 |
o Audio files: Recoll releases 1.14 and later use a single Python
|
3144 |
the id3lib package to extract mp3 tag information, metaflac (standard
|
3158 |
handler based on mutagen for all audio file types.
|
3145 |
flac tools) for flac files, and ogginfo (vorbis tools) for ogg files.
|
|
|
3146 |
Releases 1.14 and later use a single Python filter based on mutagen
|
|
|
3147 |
for all audio file types.
|
|
|
3148 |
|
3159 |
|
3149 |
o Pictures: Recoll uses the Exiftool Perl package to extract tag
|
3160 |
o Pictures: Recoll uses the Exiftool Perl package to extract tag
|
3150 |
information. Most image file formats are supported. Note that there
|
3161 |
information. Most image file formats are supported. Note that there
|
3151 |
may not be much interest in indexing the technical tags (image size,
|
3162 |
may not be much interest in indexing the technical tags (image size,
|
3152 |
aperture, etc.). This is only of interest if you store personal tags
|
3163 |
aperture, etc.). This is only of interest if you store personal tags
|
3153 |
or textual descriptions inside the image files.
|
3164 |
or textual descriptions inside the image files.
|
3154 |
|
3165 |
|
3155 |
o chm: files in microsoft help format need Python and the pychm module
|
3166 |
o chm: files in Microsoft help format need Python and the pychm module
|
3156 |
(which needs chmlib).
|
3167 |
(which needs chmlib).
|
3157 |
|
3168 |
|
3158 |
o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
|
3169 |
o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
|
3159 |
module. icalendar is not needed for newer versions, which use internal
|
3170 |
module. icalendar is not needed for newer versions, which use internal
|
3160 |
code.
|
3171 |
code.
|
|
... |
|
... |
3166 |
|
3177 |
|
3167 |
o Midi karaoke files need Python and the Midi module
|
3178 |
o Midi karaoke files need Python and the Midi module
|
3168 |
|
3179 |
|
3169 |
o Konqueror webarchive format with Python (uses the Tarfile module).
|
3180 |
o Konqueror webarchive format with Python (uses the Tarfile module).
|
3170 |
|
3181 |
|
3171 |
o mimehtml web archive format (support based on the email filter, which
|
3182 |
o Mimehtml web archive format (support based on the email handler, which
|
3172 |
introduces some mild weirdness, but still usable).
|
3183 |
introduces some mild weirdness, but still usable).
|
3173 |
|
3184 |
|
3174 |
Text, HTML, email folders, and Scribus files are processed internally. Lyx
|
3185 |
Text, HTML, email folders, and Scribus files are processed internally. Lyx
|
3175 |
is used to index Lyx files. Many filters need iconv and the standard sed
|
3186 |
is used to index Lyx files. Many handlers need iconv and the standard sed
|
3176 |
and awk.
|
3187 |
and awk.
|
3177 |
|
3188 |
|
3178 |
5.3. Building from source
|
3189 |
5.3. Building from source
|
3179 |
|
3190 |
|
3180 |
5.3.1. Prerequisites
|
3191 |
5.3.1. Prerequisites
|
|
... |
|
... |
3493 |
|
3504 |
|
3494 |
zipSkippedNames
|
3505 |
zipSkippedNames
|
3495 |
|
3506 |
|
3496 |
A space-separated list of patterns for names of files or
|
3507 |
A space-separated list of patterns for names of files or
|
3497 |
directories that should be ignored inside zip archives. This is
|
3508 |
directories that should be ignored inside zip archives. This is
|
3498 |
used directly by the zip filter, and has a function similar to
|
3509 |
used directly by the zip handler, and has a function similar to
|
3499 |
skippedNames, but works independently. Can be redefined for
|
3510 |
skippedNames, but works independently. Can be redefined for
|
3500 |
filesystem subdirectories. For versions up to 1.19, you will need
|
3511 |
filesystem subdirectories. For versions up to 1.19, you will need
|
3501 |
to update the Zip filter and install a supplementary Python
|
3512 |
to update the Zip handler and install a supplementary Python
|
3502 |
module. The details are described on the Recoll wiki.
|
3513 |
module. The details are described on the Recoll wiki.
|
3503 |
|
3514 |
|
3504 |
followLinks
|
3515 |
followLinks
|
3505 |
|
3516 |
|
3506 |
Specifies if the indexer should follow symbolic links while
|
3517 |
Specifies if the indexer should follow symbolic links while
|
|
... |
|
... |
3511 |
sections. It can not be changed below the topdirs level.
|
3522 |
sections. It can not be changed below the topdirs level.
|
3512 |
|
3523 |
|
3513 |
indexedmimetypes
|
3524 |
indexedmimetypes
|
3514 |
|
3525 |
|
3515 |
Recoll normally indexes any file which it knows how to read. This
|
3526 |
Recoll normally indexes any file which it knows how to read. This
|
3516 |
list lets you restrict the indexed mime types to what you specify.
|
3527 |
list lets you restrict the indexed MIME types to what you specify.
|
3517 |
If the variable is unspecified or the list empty (the default),
|
3528 |
If the variable is unspecified or the list empty (the default),
|
3518 |
all supported types are processed. Can be redefined for
|
3529 |
all supported types are processed. Can be redefined for
|
3519 |
subdirectories.
|
3530 |
subdirectories.
|
|
|
3531 |
|
|
|
3532 |
excludedmimetypes
|
|
|
3533 |
|
|
|
3534 |
This list lets you exclude some MIME types from indexing. Can be
|
|
|
3535 |
redefined for subdirectories.
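
The combined effect of indexedmimetypes and excludedmimetypes can be sketched as below. This is an assumption about how the two lists interact (a type must pass both tests), written as a hypothetical function; it is not taken from the Recoll source.

```python
def should_index(mime, indexedmimetypes=(), excludedmimetypes=()):
    """Decide whether a MIME type passes both lists (assumed semantics)."""
    # An excluded type is never indexed.
    if mime in excludedmimetypes:
        return False
    # An empty or unset indexedmimetypes list means "all supported types".
    if indexedmimetypes and mime not in indexedmimetypes:
        return False
    return True
```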
|
3520 |
|
3536 |
|
3521 |
compressedfilemaxkbs
|
3537 |
compressedfilemaxkbs
|
3522 |
|
3538 |
|
3523 |
Size limit for compressed (.gz or .bz2) files. These need to be
|
3539 |
Size limit for compressed (.gz or .bz2) files. These need to be
|
3524 |
decompressed in a temporary directory for identification, which
|
3540 |
decompressed in a temporary directory for identification, which
|
|
... |
|
... |
3548 |
indexallfilenames
|
3564 |
indexallfilenames
|
3549 |
|
3565 |
|
3550 |
Recoll indexes file names in a special section of the database to
|
3566 |
Recoll indexes file names in a special section of the database to
|
3551 |
allow specific file names searches using wild cards. This
|
3567 |
allow specific file names searches using wild cards. This
|
3552 |
parameter decides if file name indexing is performed only for
|
3568 |
parameter decides if file name indexing is performed only for
|
3553 |
files with mime types that would qualify them for full text
|
3569 |
files with MIME types that would qualify them for full text
|
3554 |
indexing, or for all files inside the selected subtrees,
|
3570 |
indexing, or for all files inside the selected subtrees,
|
3555 |
independently of mime type.
|
3571 |
independently of MIME type.
|
3556 |
|
3572 |
|
3557 |
usesystemfilecommand
|
3573 |
usesystemfilecommand
|
3558 |
|
3574 |
|
3559 |
Decide if we use the file -i system command as a final step for
|
3575 |
Decide if we use the file -i system command as a final step for
|
3560 |
determining the mime type for a file (the main procedure uses
|
3576 |
determining the MIME type for a file (the main procedure uses
|
3561 |
suffix associations as defined in the mimemap file). This can be
|
3577 |
suffix associations as defined in the mimemap file). This can be
|
3562 |
useful for files with suffix-less names, but it will also cause
|
3578 |
useful for files with suffix-less names, but it will also cause
|
3563 |
the indexing of many bogus "text" files.
|
3579 |
the indexing of many bogus "text" files.
|
3564 |
|
3580 |
|
3565 |
processwebqueue
|
3581 |
processwebqueue
|
|
... |
|
... |
3768 |
|
3784 |
|
3769 |
webcachemaxmbs
|
3785 |
webcachemaxmbs
|
3770 |
|
3786 |
|
3771 |
This is only used by the web browser plugin indexing code, and
|
3787 |
This is only used by the web browser plugin indexing code, and
|
3772 |
defines the maximum size for the web page cache. Default: 40 MB.
|
3788 |
defines the maximum size for the web page cache. Default: 40 MB.
|
|
|
3789 |
Quite unfortunately, this is only taken into account when creating
|
|
|
3790 |
the cache file. You need to delete the file for a change to be
|
|
|
3791 |
taken into account.
|
3773 |
|
3792 |
|
3774 |
idxflushmb
|
3793 |
idxflushmb
|
3775 |
|
3794 |
|
3776 |
Threshold (megabytes of new text data) where we flush from memory
|
3795 |
Threshold (megabytes of new text data) where we flush from memory
|
3777 |
to disk index. Setting this can help control memory usage. A value
|
3796 |
to disk index. Setting this can help control memory usage. A value
|
|
... |
|
... |
3907 |
These allow defining the ionice class and data used by the indexer
|
3926 |
These allow defining the ionice class and data used by the indexer
|
3908 |
(default class 3, no data).
|
3927 |
(default class 3, no data).
|
3909 |
|
3928 |
|
3910 |
filtermaxseconds
|
3929 |
filtermaxseconds
|
3911 |
|
3930 |
|
3912 |
Maximum filter execution time, after which it is aborted. Some
|
3931 |
Maximum handler execution time, after which it is aborted. Some
|
3913 |
postscript programs just loop...
|
3932 |
postscript programs just loop...
|
3914 |
|
3933 |
|
3915 |
filtersdir
|
3934 |
filtersdir
|
3916 |
|
3935 |
|
3917 |
A directory to search for the external filter scripts used to
|
3936 |
A directory to search for the external input handler scripts used
|
3918 |
index some types of files. The value should not be changed, except
|
3937 |
to index some types of files. The value should not be changed,
|
3919 |
if you want to modify one of the default scripts. The value can be
|
3938 |
except if you want to modify one of the default scripts. The value
|
3920 |
redefined for any sub-directory.
|
3939 |
can be redefined for any sub-directory.
|
3921 |
|
3940 |
|
3922 |
iconsdir
|
3941 |
iconsdir
|
3923 |
|
3942 |
|
3924 |
The name of the directory where recoll result list icons are
|
3943 |
The name of the directory where recoll result list icons are
|
3925 |
stored. You can change this if you want different images.
|
3944 |
stored. You can change this if you want different images.
|
|
... |
|
... |
3996 |
[aliases]
|
4015 |
[aliases]
|
3997 |
|
4016 |
|
3998 |
This section defines lists of synonyms for the canonical names
|
4017 |
This section defines lists of synonyms for the canonical names
|
3999 |
used inside the [prefixes] and [stored] sections
|
4018 |
used inside the [prefixes] and [stored] sections
|
4000 |
|
4019 |
|
4001 |
filter-specific sections
|
4020 |
handler-specific sections
|
4002 |
|
4021 |
|
4003 |
Some filters may need specific configuration for handling fields.
|
4022 |
Some input handlers may need specific configuration for handling
|
4004 |
Only the email message filter currently has such a section (named
|
4023 |
fields. Only the email message handler currently has such a
|
4005 |
[mail]). It allows indexing arbitrary email headers in addition to
|
4024 |
section (named [mail]). It allows indexing arbitrary email headers
|
4006 |
the ones indexed by default. Other such sections may appear in the
|
4025 |
in addition to the ones indexed by default. Other such sections
|
4007 |
future.
|
4026 |
may appear in the future.
|
4008 |
|
4027 |
|
4009 |
Here follows a small example of a personal fields file. This would extract
|
4028 |
Here follows a small example of a personal fields file. This would extract
|
4010 |
a specific email header and use it as a searchable field, with data
|
4029 |
a specific email header and use it as a searchable field, with data
|
4011 |
displayable inside result lists. (Side note: as the email filter does no
|
4030 |
displayable inside result lists. (Side note: as the email handler does no
|
4012 |
decoding on the values, only plain ascii headers can be indexed, and only
|
4031 |
decoding on the values, only plain ascii headers can be indexed, and only
|
4013 |
the first occurrence will be used for headers that occur several times).
|
4032 |
the first occurrence will be used for headers that occur several times).
|
4014 |
|
4033 |
|
4015 |
[prefixes]
|
4034 |
[prefixes]
|
4016 |
# Index mailmytag contents (with the given prefix)
|
4035 |
# Index mailmytag contents (with the given prefix)
|
|
... |
|
... |
4038 |
translations from extended attributes names to Recoll field names. An
|
4057 |
translations from extended attributes names to Recoll field names. An
|
4039 |
empty translation disables use of the corresponding attribute data.
|
4058 |
empty translation disables use of the corresponding attribute data.
|
4040 |
|
4059 |
|
4041 |
5.4.3. The mimemap file
|
4060 |
5.4.3. The mimemap file
|
4042 |
|
4061 |
|
4043 |
mimemap specifies the file name extension to mime type mappings.
|
4062 |
mimemap specifies the file name extension to MIME type mappings.
|
4044 |
|
4063 |
|
4045 |
For file names without an extension, or with an unknown one, the system's
|
4064 |
For file names without an extension, or with an unknown one, the system's
|
4046 |
file -i command will be executed to determine the mime type (this can be
|
4065 |
file -i command will be executed to determine the MIME type (this can be
|
4047 |
switched off inside the main configuration file).
|
4066 |
switched off inside the main configuration file).
|
4048 |
|
4067 |
|
4049 |
The mappings can be specified on a per-subtree basis, which may be useful
|
4068 |
The mappings can be specified on a per-subtree basis, which may be useful
|
4050 |
in some cases. Example: gaim logs have a .txt extension but should be
|
4069 |
in some cases. Example: gaim logs have a .txt extension but should be
|
4051 |
handled specially, which is possible because they are usually all located
|
4070 |
handled specially, which is possible because they are usually all located
|
|
... |
|
... |
4062 |
given Recoll version. Having it there avoids cluttering the more
|
4081 |
given Recoll version. Having it there avoids cluttering the more
|
4063 |
user-oriented and locally customized skippedNames.
|
4082 |
user-oriented and locally customized skippedNames.
|
4064 |
|
4083 |
|
4065 |
5.4.4. The mimeconf file
|
4084 |
5.4.4. The mimeconf file
|
4066 |
|
4085 |
|
4067 |
mimeconf specifies how the different mime types are handled for indexing,
|
4086 |
mimeconf specifies how the different MIME types are handled for indexing,
|
4068 |
and which icons are displayed in the recoll result lists.
|
4087 |
and which icons are displayed in the recoll result lists.
|
4069 |
|
4088 |
|
4070 |
Changing the parameters in the [index] section is probably not a good idea
|
4089 |
Changing the parameters in the [index] section is probably not a good idea
|
4071 |
except if you are a Recoll developer.
|
4090 |
except if you are a Recoll developer.
|
4072 |
|
4091 |
|
|
... |
|
... |
4086 |
|
4105 |
|
4087 |
If Use desktop preferences to choose document editor is checked in the
|
4106 |
If Use desktop preferences to choose document editor is checked in the
|
4088 |
Recoll GUI preferences, all mimeview entries will be ignored except the
|
4107 |
Recoll GUI preferences, all mimeview entries will be ignored except the
|
4089 |
one labelled application/x-all (which is set to use xdg-open by default).
|
4108 |
one labelled application/x-all (which is set to use xdg-open by default).
|
4090 |
|
4109 |
|
4091 |
In this case, the xallexcepts top level variable defines a list of mime
|
4110 |
In this case, the xallexcepts top level variable defines a list of MIME
|
4092 |
type exceptions which will be processed according to the local entries
|
4111 |
type exceptions which will be processed according to the local entries
|
4093 |
instead of being passed to the desktop. This is so that specific Recoll
|
4112 |
instead of being passed to the desktop. This is so that specific Recoll
|
4094 |
options such as a page number or a search string can be passed to
|
4113 |
options such as a page number or a search string can be passed to
|
4095 |
applications that support them, such as the evince viewer.
|
4114 |
applications that support them, such as the evince viewer.
|
4096 |
|
4115 |
|
|
... |
|
... |
4099 |
non-default entries, which will override those from the central
|
4118 |
non-default entries, which will override those from the central
|
4100 |
configuration file.
|
4119 |
configuration file.
|
4101 |
|
4120 |
|
4102 |
All viewer definition entries must be placed under a [view] section.
|
4121 |
All viewer definition entries must be placed under a [view] section.
|
4103 |
|
4122 |
|
4104 |
The keys in the file are normally mime types. You can add an application
|
4123 |
The keys in the file are normally MIME types. You can add an application
|
4105 |
tag to specialize the choice for an area of the filesystem (using a
|
4124 |
tag to specialize the choice for an area of the filesystem (using a
|
4106 |
localfields specification in mimeconf). The syntax for the key is
|
4125 |
localfields specification in mimeconf). The syntax for the key is
|
4107 |
mimetype|tag
|
4126 |
mimetype|tag
|
4108 |
|
4127 |
|
4109 |
The nouncompforviewmts entry (placed at the top level, outside of the
|
4128 |
The nouncompforviewmts entry (placed at the top level, outside of the
|
4110 |
[view] section) holds a list of mime types that should not be
|
4129 |
[view] section) holds a list of MIME types that should not be
|
4111 |
uncompressed before starting the viewer (if they are found compressed, e.g.
|
4130 |
uncompressed before starting the viewer (if they are found compressed, e.g.
|
4112 |
mydoc.doc.gz).
|
4131 |
mydoc.doc.gz).
|
4113 |
|
4132 |
|
4114 |
The right side of each assignment holds a command to be executed for
|
4133 |
The right side of each assignment holds a command to be executed for
|
4115 |
opening the file. The following substitutions are performed:
|
4134 |
opening the file. The following substitutions are performed:
|
|
... |
|
... |
4125 |
o %i. Internal path, for subdocuments of containers. The format depends
|
4144 |
o %i. Internal path, for subdocuments of containers. The format depends
|
4126 |
on the container type. If this appears in the command line, Recoll
|
4145 |
on the container type. If this appears in the command line, Recoll
|
4127 |
will not create a temporary file to extract the subdocument, expecting
|
4146 |
will not create a temporary file to extract the subdocument, expecting
|
4128 |
the called application (possibly a script) to be able to handle it.
|
4147 |
the called application (possibly a script) to be able to handle it.
|
4129 |
|
4148 |
|
4130 |
o %M. Mime type
|
4149 |
o %M. MIME type
|
4131 |
|
4150 |
|
4132 |
o %p. Page index. Only significant for a subset of document types,
|
4151 |
o %p. Page index. Only significant for a subset of document types,
|
4133 |
currently only PDF, Postscript and DVI files. Can be used to start the
|
4152 |
currently only PDF, PostScript and DVI files. Can be used to start the
|
4134 |
editor at the right page for a match or snippet.
|
4153 |
editor at the right page for a match or snippet.
|
4135 |
|
4154 |
|
|
... |
|
... |
4178 |
o In $RECOLL_CONFDIR/mimemap (typically ~/.recoll/mimemap), add the
|
4197 |
o In $RECOLL_CONFDIR/mimemap (typically ~/.recoll/mimemap), add the
|
4179 |
following line:
|
4198 |
following line:
|
4180 |
|
4199 |
|
4181 |
.blob = application/x-blobapp
|
4200 |
.blob = application/x-blobapp
|
4182 |
|
4201 |
|
4183 |
Note that the mime type is made up here, and you could call it
|
4202 |
Note that the MIME type is made up here, and you could call it
|
4184 |
diesel/oil just the same.
|
4203 |
diesel/oil just the same.
|
4185 |
|
4204 |
|
4186 |
o In $RECOLL_CONFDIR/mimeview under the [view] section, add:
|
4205 |
o In $RECOLL_CONFDIR/mimeview under the [view] section, add:
|
4187 |
|
4206 |
|
4188 |
application/x-blobapp = blobviewer %f
|
4207 |
application/x-blobapp = blobviewer %f
|
4189 |
|
4208 |
|
4190 |
We are supposing that blobviewer wants a file name parameter here, you
|
4209 |
We are supposing that blobviewer wants a file name parameter here; you
|
4191 |
would use %u if it liked URLs better.
|
4210 |
would use %u if it liked URLs better.
|
4192 |
|
4211 |
|
4193 |
If you just wanted to change the application used by Recoll to display a
|
4212 |
If you just wanted to change the application used by Recoll to display a
|
4194 |
mime type which it already knows, you would just need to edit mimeview.
|
4213 |
MIME type which it already knows, you would just need to edit mimeview.
|
4195 |
The entries you add in your personal file override those in the central
|
4214 |
The entries you add in your personal file override those in the central
|
4196 |
configuration, which you do not need to alter. mimeview can also be
|
4215 |
configuration, which you do not need to alter. mimeview can also be
|
4197 |
modified from the Gui.
|
4216 |
modified from the GUI.
|
4198 |
|
4217 |
|
4199 |
5.4.7.2. Adding indexing support for a new file type
|
4218 |
5.4.7.2. Adding indexing support for a new file type
|
|
... |
|
... |
4211 |
|
4230 |
|
4212 |
o Under the [icons] section, you should choose an icon to be displayed
|
4231 |
o Under the [icons] section, you should choose an icon to be displayed
|
4213 |
for the files inside the result lists. Icons are normally 64x64 pixels
|
4232 |
for the files inside the result lists. Icons are normally 64x64 pixels
|
4214 |
PNG files which live in /usr/[local/]share/recoll/images.
|
4233 |
PNG files which live in /usr/[local/]share/recoll/images.
|
4215 |
|
4234 |
|
4216 |
o Under the [categories] section, you should add the mime type where it
|
4235 |
o Under the [categories] section, you should add the MIME type where it
|
4217 |
makes sense (you can also create a category). Categories may be used
|
4236 |
makes sense (you can also create a category). Categories may be used
|
4218 |
for filtering in advanced search.
|
4237 |
for filtering in advanced search.
|
4219 |
|
4238 |
|
4220 |
The rclblob filter should be an executable program or script which exists
|
4239 |
The rclblob handler should be an executable program or script which exists
|
4221 |
inside /usr/[local/]share/recoll/filters. It will be given a file name as
|
4240 |
inside /usr/[local/]share/recoll/filters. It will be given a file name as
|
4222 |
argument and should output the text or html contents on the standard
|
4241 |
argument and should output the text or HTML contents on the standard
|
4223 |
output.
|
4242 |
output.
|
4224 |
|
4243 |
|
4225 |
The filter programming section describes in more detail how to write a
|
4244 |
The input handler programming section describes in more detail how to write an
|
4226 |
filter.
|
4245 |
input handler.
|
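As an illustration of the contract described above (file name as argument,
text or HTML on the standard output), here is a minimal sketch of what an
rclblob handler could look like. It is not taken from the Recoll
distribution: the blob_to_html helper and the assumption that a .blob file
can simply be decoded as UTF-8 text are made up for the example, just like
the application/x-blobapp MIME type itself.

```python
#!/usr/bin/env python3
"""Hypothetical rclblob handler sketch: file name in, HTML on stdout."""
import html
import os
import sys


def blob_to_html(data, title):
    """Wrap the decoded blob contents in a small HTML document."""
    # Assumption: the made-up .blob format is plain text, decoded as UTF-8.
    # Undecodable bytes become U+FFFD instead of aborting the handler.
    text = data.decode("utf-8", errors="replace")
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        "<title>" + html.escape(title) + "</title>"
        "</head><body><pre>" + html.escape(text) + "</pre></body></html>"
    )


def main(path):
    """Read the file named by path and write the HTML to standard output."""
    with open(path, "rb") as f:
        data = f.read()
    doc = blob_to_html(data, os.path.basename(path))
    # Write bytes so the output is UTF-8 whatever the locale is.
    sys.stdout.buffer.write(doc.encode("utf-8") + b"\n")
    return 0


# Only run when called with exactly one file name argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv[1]))
```

A real handler would replace the decoding step with whatever is needed to
extract text from the actual format, and would need to be made executable
and placed in the filters directory mentioned above.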