|
a/src/README |
|
b/src/README |
|
... |
|
... |
76 |
|
76 |
|
77 |
3.11. Search tips, shortcuts
|
77 |
3.11. Search tips, shortcuts
|
78 |
|
78 |
|
79 |
3.12. Customizing the search interface
|
79 |
3.12. Customizing the search interface
|
80 |
|
80 |
|
|
|
81 |
4. Programming interface
|
|
|
82 |
|
|
|
83 |
4.1. Writing a document filter
|
|
|
84 |
|
|
|
85 |
4.1.1. Filter HTML output
|
|
|
86 |
|
|
|
87 |
4.2. Field data processing configuration
|
|
|
88 |
|
|
|
89 |
4.3. API
|
|
|
90 |
|
|
|
91 |
4.3.1. Interface elements
|
|
|
92 |
|
|
|
93 |
4.3.2. Python interface
|
|
|
94 |
|
81 |
4. Installation
|
95 |
5. Installation
|
82 |
|
96 |
|
83 |
4.1. Installing a prebuilt copy
|
97 |
5.1. Installing a prebuilt copy
|
84 |
|
98 |
|
85 |
4.1.1. Installing through a package system
|
99 |
5.1.1. Installing through a package system
|
86 |
|
100 |
|
87 |
4.1.2. Installing a prebuilt Recoll
|
101 |
5.1.2. Installing a prebuilt Recoll
|
88 |
|
102 |
|
89 |
4.2. Supporting packages
|
103 |
5.2. Supporting packages
|
90 |
|
104 |
|
91 |
4.3. Building from source
|
105 |
5.3. Building from source
|
92 |
|
106 |
|
93 |
4.3.1. Prerequisites
|
107 |
5.3.1. Prerequisites
|
94 |
|
108 |
|
95 |
4.3.2. Building
|
109 |
5.3.2. Building
|
96 |
|
110 |
|
97 |
4.3.3. Installation
|
111 |
5.3.3. Installation
|
98 |
|
112 |
|
99 |
4.4. Configuration overview
|
113 |
5.4. Configuration overview
|
100 |
|
114 |
|
101 |
4.4.1. Main configuration file
|
115 |
5.4.1. Main configuration file
|
102 |
|
116 |
|
103 |
4.4.2. The mimemap file
|
117 |
5.4.2. The mimemap file
|
104 |
|
118 |
|
105 |
4.4.3. The mimeconf file
|
119 |
5.4.3. The mimeconf file
|
106 |
|
120 |
|
107 |
4.4.4. The mimeview file
|
121 |
5.4.4. The mimeview file
|
108 |
|
122 |
|
109 |
4.4.5. Examples of configuration adjustments
|
123 |
5.4.5. Examples of configuration adjustments
|
110 |
|
124 |
|
111 |
4.5. The KDE Kicker Recoll applet
|
125 |
5.5. The KDE Kicker Recoll applet
|
112 |
|
|
|
113 |
4.6. Extending Recoll
|
|
|
114 |
|
|
|
115 |
4.6.1. Writing a document filter
|
|
|
116 |
|
126 |
|
117 |
----------------------------------------------------------------------
|
127 |
----------------------------------------------------------------------
|
118 |
|
128 |
|
119 |
Chapter 1. Introduction
|
129 |
Chapter 1. Introduction
|
120 |
|
130 |
|
|
... |
|
... |
254 |
files Most file types, like HTML or word processing files, only hold one
|
264 |
files Most file types, like HTML or word processing files, only hold one
|
255 |
document. Some file types, like mail folder files can hold many
|
265 |
document. Some file types, like mail folder files can hold many
|
256 |
individually indexed documents.
|
266 |
individually indexed documents.
|
257 |
|
267 |
|
258 |
Recoll indexing processes plain text, HTML, openoffice and e-mail files
|
268 |
Recoll indexing processes plain text, HTML, openoffice and e-mail files
|
|
|
269 |
internally.
|
|
|
270 |
|
259 |
internally. Other types (e.g. postscript, pdf, ms-word, rtf) need external
|
271 |
Other file types (e.g. postscript, pdf, ms-word, rtf ...) need external
|
260 |
applications for preprocessing. The list is in the installation section.
|
272 |
applications for preprocessing. The list is in the installation section.
|
|
|
273 |
After every indexing operation, Recoll updates a list of commands that
|
|
|
274 |
would be needed for indexing existing file types. This list can be
|
|
|
275 |
displayed from the recoll File menu. It is stored in the missing text file
|
|
|
276 |
inside the configuration directory.
|
261 |
|
277 |
|
262 |
Without further configuration, Recoll will index all appropriate files
|
278 |
Without further configuration, Recoll will index all appropriate files
|
263 |
from your home directory, with a reasonable set of defaults.
|
279 |
from your home directory, with a reasonable set of defaults.
|
264 |
|
280 |
|
265 |
In some cases, it may be interesting to index different areas of the file
|
281 |
In some cases, it may be interesting to index different areas of the file
|
|
... |
|
... |
715 |
3.4. The query language
|
731 |
3.4. The query language
|
716 |
|
732 |
|
717 |
The query language processor is activated on the simple search entry when
|
733 |
The query language processor is activated on the simple search entry when
|
718 |
the search mode selector is set to Query Language.
|
734 |
the search mode selector is set to Query Language.
|
719 |
|
735 |
|
|
|
736 |
The language is roughly based on the Xesam user search language
|
|
|
737 |
specification.
|
|
|
738 |
|
720 |
Here follows a sample request that we are going to explain:
|
739 |
Here follows a sample request that we are going to explain:
|
721 |
|
740 |
|
722 |
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
741 |
author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
|
723 |
|
742 |
|
724 |
|
743 |
|
725 |
This would search for all documents with John Doe appearing as a phrase in
|
744 |
This would search for all documents with John Doe appearing as a phrase in
|
726 |
the author field (exactly what this is would depend on the document type,
|
745 |
the author field (exactly what this is would depend on the document type,
|
727 |
e.g. the From: header, for an email message), and containing either beatles
|
746 |
e.g. the From: header, for an email message), and containing either beatles
|
728 |
or lennon and either live or unplugged but not potatoes (in any part of
|
747 |
or lennon and either live or unplugged but not potatoes (in any part of
|
729 |
the document).
|
748 |
the document).
|
|
|
749 |
|
|
|
750 |
An element is composed of an optional field specification, and a value,
|
|
|
751 |
separated by a colon. Example: Beatles, author:balzac, dc:title:grandet
|
|
|
752 |
|
|
|
753 |
The colon, if present, means "contains". Xesam defines other relations,
|
|
|
754 |
which are not supported for now.
|
730 |
|
755 |
|
731 |
All elements in the search entry are normally combined with an implicit
|
756 |
All elements in the search entry are normally combined with an implicit
|
732 |
AND. It is possible to specify that elements be OR'ed instead, as in
|
757 |
AND. It is possible to specify that elements be OR'ed instead, as in
|
733 |
Beatles OR Lennon. The OR must be entered literally (capitals), and it has
|
758 |
Beatles OR Lennon. The OR must be entered literally (capitals), and it has
|
734 |
priority over the AND associations: word1 word2 OR word3 means word1 AND
|
759 |
priority over the AND associations: word1 word2 OR word3 means word1 AND
|
735 |
(word2 OR word3) not (word1 AND word2) OR word3. Do not enter explicit
|
760 |
(word2 OR word3) not (word1 AND word2) OR word3. Do not enter explicit
|
736 |
parentheses; they are not supported for now.
|
761 |
parentheses; they are not supported for now.
|
737 |
|
762 |
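The precedence rule above can be illustrated with a short Python sketch. It is not Recoll's parser: the function name group_query is hypothetical, and the sketch ignores quoted phrases, field prefixes and negation. It only shows how OR binds tighter than the implicit AND.

```python
def group_query(query):
    """Return a list of AND-ed groups, each group being a list of OR-ed terms."""
    groups = []
    join_next = False
    for token in query.split():
        if token == "OR":
            # A leading OR has no previous term to attach to, so it is ignored
            join_next = bool(groups)
            continue
        if join_next:
            groups[-1].append(token)   # OR attaches to the previous term only
        else:
            groups.append([token])     # implicit AND starts a new group
        join_next = False
    return groups


print(group_query("word1 word2 OR word3"))
```

Each inner list is one OR group, and the outer list is the AND of those groups, so the sample query yields word1 AND (word2 OR word3).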
|
738 |
An entry preceded by a - specifies a term that should not appear.
|
763 |
An element preceded by a - specifies a term that should not appear. Pure
|
|
|
764 |
negative queries are forbidden.
|
739 |
|
765 |
|
740 |
The first element in the above example, author:"john doe" is a phrase
|
766 |
As usual, words inside quotes define a phrase (the order of words is
|
741 |
search limited to a specific field. Phrase searches are specified as usual
|
767 |
significant), so that title:"prejudice pride" is not the same as
|
742 |
by enclosing the words in double quotes. The field specification appears
|
768 |
title:prejudice title:pride, and is unlikely to find a result.
|
743 |
before the colon (of course this is not limited to phrases, author:Balzac
|
769 |
|
744 |
would be ok too). Recoll currently manages the following fields:
|
770 |
Recoll currently manages the following default fields:
|
745 |
|
771 |
|
746 |
* title, subject or caption are synonyms which specify data to be
|
772 |
* title, subject or caption are synonyms which specify data to be
|
747 |
searched for in the document title or subject.
|
773 |
searched for in the document title or subject.
|
748 |
|
774 |
|
749 |
* author or from for searching the document's originators.
|
775 |
* author or from for searching the document's originators.
|
750 |
|
776 |
|
|
|
777 |
* recipient or to for searching the document's recipients.
|
|
|
778 |
|
751 |
* keyword for searching the document specified keywords (few documents
|
779 |
* keyword for searching the document-specified keywords (few documents
|
752 |
actually have any).
|
780 |
actually have any).
|
753 |
|
781 |
|
754 |
As of release 1.9, the filters have the possibility to create other fields
|
782 |
* filename for the document's file name.
|
755 |
with arbitrary names. No standard filters use this possibility yet.
|
|
|
756 |
|
783 |
|
757 |
There are two other elements which may be specified through the field
|
|
|
758 |
syntax, but are somewhat special:
|
|
|
759 |
|
|
|
760 |
* ext for specifying the file name extension (Ex: ext:html)
|
784 |
* ext specifies the file name extension (Ex: ext:html)
|
761 |
|
785 |
|
762 |
* dir for specifying the file location (Ex: dir:/home/me/somedir).
|
786 |
The field syntax also supports a few field-like, but special, criteria:
|
763 |
Please note that this is quite inefficient, that it may produce very
|
787 |
|
764 |
slow searches, and that it may be worth in some cases to set up
|
788 |
* dir for filtering the results on file location (Ex:
|
|
|
789 |
dir:/home/me/somedir). Please note that this is quite inefficient,
|
|
|
790 |
that it may produce very slow searches, and that it may be worthwhile in
|
765 |
separate databases instead.
|
791 |
some cases to set up separate databases instead.
|
766 |
|
792 |
|
767 |
* mime for specifying the mime type. This one is quite special because
|
793 |
* mime or format for specifying the mime type. This one is quite special
|
768 |
you can specify several values which will be OR'ed (the normal default
|
794 |
because you can specify several values which will be OR'ed (the normal
|
769 |
for the language is AND). Ex: mime:text/plain mime:text/html.
|
795 |
default for the language is AND). Ex: mime:text/plain mime:text/html.
|
770 |
Specifying an explicit boolean operator or negation (-) before a mime
|
796 |
Specifying an explicit boolean operator or negation (-) before a mime
|
771 |
specification is not supported and will produce strange results.
|
797 |
specification is not supported and will produce strange results.
|
772 |
|
798 |
|
|
|
799 |
* type or rclcat for specifying the category (as in
|
|
|
800 |
text/media/presentation/etc.). The classification of mime types in
|
|
|
801 |
categories is defined in the Recoll configuration (mimeconf), and can
|
|
|
802 |
be modified or extended. The default category names are those which
|
|
|
803 |
permit filtering results in the main GUI screen. Categories are OR'ed
|
|
|
804 |
like mime types above.
|
|
|
805 |
|
|
|
806 |
The document filters used while indexing have the possibility to create
|
|
|
807 |
other fields with arbitrary names, and aliases may be defined in the
|
|
|
808 |
configuration, so that the exact field search possibilities may be
|
|
|
809 |
different for you if someone took care of the customisation.
|
|
|
810 |
|
773 |
The query language is currently the only way to use the Recoll field
|
811 |
The query language is currently the only way to use the Recoll field
|
774 |
search capability.
|
812 |
search capability.
|
775 |
|
813 |
|
776 |
Words inside phrases and capitalized words are not stem-expanded.
|
814 |
Words inside phrases and capitalized words are not stem-expanded.
|
777 |
Wildcards may be used anywhere inside a term. Specifying a wild-card on
|
815 |
Wildcards may be used anywhere inside a term. Specifying a wild-card on
|
778 |
the left of a term can produce a very slow search.
|
816 |
the left of a term can produce a very slow search (or even an incorrect
|
|
|
817 |
one if the expansion is truncated because of excessive size).
|
779 |
|
818 |
|
780 |
You can use the show query link at the top of the result list to check the
|
819 |
You can use the show query link at the top of the result list to check the
|
781 |
exact query which was finally executed by Xapian.
|
820 |
exact query which was finally executed by Xapian.
|
|
|
821 |
|
|
|
822 |
Most Xesam phrase modifiers are unsupported, except for l (small ell) to
|
|
|
823 |
disable stemming, and p to turn a phrase into a NEAR (unordered) search.
|
|
|
824 |
Example: "prejudice pride"p
|
782 |
|
825 |
|
783 |
----------------------------------------------------------------------
|
826 |
----------------------------------------------------------------------
|
784 |
|
827 |
|
785 |
3.5. Complex/advanced search
|
828 |
3.5. Complex/advanced search
|
786 |
|
829 |
|
|
... |
|
... |
1192 |
you can choose which ones you want to use at any moment by checking or
|
1235 |
you can choose which ones you want to use at any moment by checking or
|
1193 |
unchecking their entries.
|
1236 |
unchecking their entries.
|
1194 |
|
1237 |
|
1195 |
Your main database (the one the current configuration indexes to), is
|
1238 |
Your main database (the one the current configuration indexes to), is
|
1196 |
always implicitly active. If this is not desirable, you can set up your
|
1239 |
always implicitly active. If this is not desirable, you can set up your
|
1197 |
configuration so that it indexes, for example, an empty directory.
|
1240 |
configuration so that it indexes, for example, an empty directory. An
|
|
|
1241 |
alternative indexer may also need to implement a way of purging the index
|
|
|
1242 |
from stale data.
|
1198 |
|
1243 |
|
|
|
1244 |
----------------------------------------------------------------------
|
|
|
1245 |
|
|
|
1246 |
Chapter 4. Programming interface
|
|
|
1247 |
|
|
|
1248 |
Recoll has an application programming interface (API), usable both for indexing
|
|
|
1249 |
and searching, currently accessible from the Python language.
|
|
|
1250 |
|
|
|
1251 |
Another less radical way to extend the application is to write filters for
|
|
|
1252 |
new types of documents.
|
|
|
1253 |
|
|
|
1254 |
The processing of metadata attributes for documents (fields) is highly
|
|
|
1255 |
configurable.
|
|
|
1256 |
|
|
|
1257 |
----------------------------------------------------------------------
|
|
|
1258 |
|
|
|
1259 |
4.1. Writing a document filter
|
|
|
1260 |
|
|
|
1261 |
Recoll filters are executable programs which translate from a specific
|
|
|
1262 |
format (e.g. openoffice, acrobat) to the Recoll indexing input
|
|
|
1263 |
format, which may be text/plain or text/html.
|
|
|
1264 |
|
|
|
1265 |
Recoll filters are usually shell-scripts, but this is in no way necessary.
|
|
|
1266 |
These programs are extremely simple and most of the difficulty lies in
|
|
|
1267 |
extracting the text from the native format, not outputting what is
|
|
|
1268 |
expected by Recoll. Happily enough, most document formats already have
|
|
|
1269 |
translators or text extractors which handle the difficult part and can be
|
|
|
1270 |
called from the filter. In some cases the output of the translating program
|
|
|
1271 |
is appropriate, and no intermediate shell-script is needed.
|
|
|
1272 |
|
|
|
1273 |
Filters are called with a single argument which is the source file name.
|
|
|
1274 |
They should output the result to stdout.
|
|
|
1275 |
|
|
|
1276 |
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
|
|
|
1277 |
the filter if the operation is for indexing or previewing. Some filters
|
|
|
1278 |
use this to output a slightly different format. This is not essential.
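The calling convention described above (one file name argument, result on stdout, optional RECOLL_FILTER_FORPREVIEW hint) can be sketched in Python. This is an illustrative skeleton only: extract_text is a placeholder that just reads the file as text, and the whitespace collapsing done for indexing is a made-up example of "a slightly different format", not something Recoll requires.

```python
import os
import sys


def is_preview(environ=os.environ):
    """True when the filter is run for previewing rather than indexing."""
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def extract_text(path):
    # Placeholder: a real filter would decode the native document format here
    with open(path, "r", encoding="utf-8", errors="replace") as f:
        return f.read()


def main(argv):
    if len(argv) != 2:
        sys.stderr.write("usage: %s <file>\n" % argv[0])
        return 1
    text = extract_text(argv[1])
    if not is_preview():
        # Hypothetical choice: layout does not matter for indexing
        text = " ".join(text.split())
    sys.stdout.write(text)
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    sys.exit(main(sys.argv))
```

A filter like this outputs text/plain, so the matching mimeconf entry would need to declare the mimetype and charset explicitly, as in the antiword line of the sample below.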
|
|
|
1279 |
|
|
|
1280 |
The association of file types to filters is performed in the mimeconf
|
|
|
1281 |
file. A sample:
|
|
|
1282 |
|
|
|
1283 |
[index]
|
|
|
1284 |
application/msword = exec antiword -t -i 1 -m UTF-8;\
|
|
|
1285 |
mimetype=text/plain;charset=utf-8
|
|
|
1286 |
|
|
|
1287 |
application/ogg = exec rclogg
|
|
|
1288 |
|
|
|
1289 |
text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
|
|
|
1290 |
|
|
|
1291 |
The fragment specifies that:
|
|
|
1292 |
|
|
|
1293 |
* application/msword files are processed by executing the antiword
|
|
|
1294 |
program, which outputs text/plain encoded in utf-8.
|
|
|
1295 |
|
|
|
1296 |
* application/ogg files are processed by the rclogg script, with default
|
|
|
1297 |
output type (text/html, with encoding specified in the header, or
|
|
|
1298 |
utf-8 by default).
|
|
|
1299 |
|
|
|
1300 |
* text/rtf is processed by unrtf, which outputs text/html. The
|
|
|
1301 |
iso-8859-1 encoding is specified because it is not the utf-8 default,
|
|
|
1302 |
and not output by unrtf in the HTML header section.
|
|
|
1303 |
|
|
|
1304 |
The easiest way to write a new filter is probably to start from an
|
|
|
1305 |
existing one.
|
|
|
1306 |
|
|
|
1307 |
Filters which output text/plain text are generally simpler, but they
|
|
|
1308 |
cannot specify the character set and other metadata, so they are limited
|
|
|
1309 |
to cases where these elements are not needed.
|
|
|
1310 |
|
|
|
1311 |
----------------------------------------------------------------------
|
|
|
1312 |
|
|
|
1313 |
4.1.1. Filter HTML output
|
|
|
1314 |
|
|
|
1315 |
The output HTML could be very minimal like the following example:
|
|
|
1316 |
|
|
|
1317 |
<html><head>
|
|
|
1318 |
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
|
|
|
1319 |
</head>
|
|
|
1320 |
<body>some text content</body></html>
|
|
|
1321 |
|
|
|
1322 |
|
|
|
1323 |
You should take care to escape some characters inside the text by
|
|
|
1324 |
transforming them into appropriate entities. "&" should be transformed
|
|
|
1325 |
into "&amp;", "<" should be transformed into "&lt;". This is not always
|
|
|
1326 |
properly done by translating programs which output HTML, and of course
|
|
|
1327 |
never by those which output plain text.
|
|
|
1328 |
|
|
|
1329 |
The character set needs to be specified in the header. It does not need to
|
|
|
1330 |
be UTF-8 (Recoll will take care of translating it), but it must be
|
|
|
1331 |
accurate for good results.
|
|
|
1332 |
|
|
|
1333 |
Recoll will also make use of other header fields if they are present:
|
|
|
1334 |
title, description, keywords.
|
|
|
1335 |
|
|
|
1336 |
Filters also have the possibility to "invent" field names. These should be
|
|
|
1337 |
output as meta tags:
|
|
|
1338 |
|
|
|
1339 |
<meta name="somefield" content="Some textual data" />
|
|
|
1340 |
|
|
|
1341 |
See the following section for details about configuring how field data is
|
|
|
1342 |
processed by the indexer.
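As a sketch of the output rules in this section, the following Python function builds a minimal HTML document with escaped body text and optional meta fields. The function name filter_html and its parameters are hypothetical; only html.escape from the Python standard library is relied upon, and the charset is fixed to UTF-8 for simplicity.

```python
from html import escape


def filter_html(body_text, fields=None):
    """Build minimal filter output: charset header, meta fields, escaped body."""
    head = ['<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">']
    for name, value in sorted((fields or {}).items()):
        # Attribute values need quotes escaped as well as & and <
        head.append('<meta name="%s" content="%s" />'
                    % (escape(name, quote=True), escape(value, quote=True)))
    return "<html><head>\n%s\n</head>\n<body>%s</body></html>\n" % (
        "\n".join(head),
        escape(body_text, quote=False),  # turns & into &amp; and < into &lt;
    )


print(filter_html("AT&T <draft>", {"somefield": "Some textual data"}))
```

Escaping at the last step, when the HTML is assembled, avoids double-escaping text that an upstream extractor may already have processed.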
|
|
|
1343 |
|
|
|
1344 |
----------------------------------------------------------------------
|
|
|
1345 |
|
|
|
1346 |
4.2. Field data processing configuration
|
|
|
1347 |
|
|
|
1348 |
Fields are named pieces of information in or about documents, like title,
|
|
|
1349 |
author, abstract.
|
|
|
1350 |
|
|
|
1351 |
The field values for documents can appear in several ways during indexing:
|
|
|
1352 |
either output by filters as meta fields in the HTML header section, or
|
|
|
1353 |
added as attributes of the Doc object when using the API, or again
|
|
|
1354 |
synthesized internally by Recoll.
|
|
|
1355 |
|
|
|
1356 |
The Recoll query language allows searching for text in a specific field.
|
|
|
1357 |
|
|
|
1358 |
Recoll defines a number of default fields. Additional ones can be output
|
|
|
1359 |
by filters, and described in the fields configuration file.
|
|
|
1360 |
|
|
|
1361 |
Fields can be:
|
|
|
1362 |
|
|
|
1363 |
* indexed, meaning that their terms are separately stored in inverted
|
|
|
1364 |
lists (with a specific prefix), and that a field-specific search is
|
|
|
1365 |
possible.
|
|
|
1366 |
|
|
|
1367 |
* stored, meaning that their value is recorded in the index data record
|
|
|
1368 |
for the document, and can be returned and displayed with search
|
|
|
1369 |
results.
|
|
|
1370 |
|
|
|
1371 |
A field can be either or both indexed and stored.
|
|
|
1372 |
|
|
|
1373 |
A field becomes indexed by having a prefix defined in the [prefixes]
|
|
|
1374 |
section of the fields file. See the comments in there for details.
|
|
|
1375 |
|
|
|
1376 |
A field becomes stored by appearing in the [stored] section of the fields
|
|
|
1377 |
file.
|
|
|
1378 |
|
|
|
1379 |
----------------------------------------------------------------------
|
|
|
1380 |
|
|
|
1381 |
4.3. API
|
|
|
1382 |
|
|
|
1383 |
4.3.1. Interface elements
|
|
|
1384 |
|
|
|
1385 |
A few elements in the interface are specific and need an explanation.
|
|
|
1386 |
|
|
|
1387 |
udi
|
|
|
1388 |
|
|
|
1389 |
A udi (unique document identifier) identifies a document. Because
|
|
|
1390 |
of limitations inside the index engine, it is restricted in length
|
|
|
1391 |
(to 200 bytes), which is why a regular URI cannot be used. The
|
|
|
1392 |
structure and contents of the udi are defined by the application
|
|
|
1393 |
and opaque to the index engine. For example, the internal file
|
|
|
1394 |
system indexer uses the complete document path (file path +
|
|
|
1395 |
internal path), truncated to length, the suppressed part being
|
|
|
1396 |
replaced by a hash value.
|
|
|
1397 |
|
|
|
1398 |
ipath
|
|
|
1399 |
|
|
|
1400 |
This data value (set as a field in the Doc object) is stored,
|
|
|
1401 |
along with the URL, but not indexed by Recoll. Its contents are
|
|
|
1402 |
not interpreted, and its use is up to the application. For
|
|
|
1403 |
example, the Recoll internal file system indexer stores the part
|
|
|
1404 |
of the document access path internal to the container file (ipath
|
|
|
1405 |
in this case is a list of subdocument sequential numbers). url and
|
|
|
1406 |
ipath are returned in every search result and permit access to the
|
|
|
1407 |
original document.
|
|
|
1408 |
|
|
|
1409 |
Stored and indexed fields
|
|
|
1410 |
|
|
|
1411 |
The fields file inside the Recoll configuration defines which
|
|
|
1412 |
document fields are either "indexed" (searchable), "stored"
|
|
|
1413 |
(retrievable with search results), or both.
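The udi construction described above (complete path, truncated to the length limit, with the suppressed part replaced by a hash) could look like the following Python sketch. The separator character, the use of MD5 and the function name make_udi are assumptions made for illustration; the text does not specify how the Recoll file system indexer actually builds its identifiers.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the text


def make_udi(file_path, internal_path=""):
    """Return a udi (bytes) of at most UDI_MAX_BYTES for a document path."""
    # Assumed layout: file path and internal path joined by "|"
    full = (file_path + "|" + internal_path).encode("utf-8")
    if len(full) <= UDI_MAX_BYTES:
        return full
    # Hash the whole path so that two long paths with the same prefix differ
    digest = hashlib.md5(full).hexdigest().encode("ascii")
    return full[:UDI_MAX_BYTES - len(digest)] + digest


print(make_udi("/home/me/mail/inbox", "3"))
```

Because the udi is opaque to the index engine, an external indexer is free to pick any scheme, as long as the result is unique and stable for a given document.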
|
|
|
1414 |
|
|
|
1415 |
Data for an external indexer should be stored in a separate index, not
|
|
|
1416 |
the one for the Recoll internal file system indexer (except if the latter
|
|
|
1417 |
is not used at all). The reason is that the main document indexer purge
|
|
|
1418 |
pass would remove all the other indexer's documents, as they were not seen
|
|
|
1419 |
during indexing. The main indexer documents would also probably be a
|
|
|
1420 |
problem for the external indexer purge operation.
|
|
|
1421 |
|
|
|
1422 |
----------------------------------------------------------------------
|
|
|
1423 |
|
|
|
1424 |
4.3.2. Python interface
|
|
|
1425 |
|
|
|
1426 |
4.3.2.1. Introduction
|
|
|
1427 |
|
|
|
1428 |
Recoll versions after 1.11 define a Python programming interface, both for
|
|
|
1429 |
searching and indexing.
|
|
|
1430 |
|
|
|
1431 |
The python interface is not built by default and can be found in the
|
|
|
1432 |
source package, under python/recoll. The directory contains the usual
|
|
|
1433 |
setup.py script which you can use to build and install the module:
|
|
|
1434 |
|
|
|
1435 |
cd recoll-xxx/python/recoll
|
|
|
1436 |
python setup.py build
|
|
|
1437 |
python setup.py install
|
|
|
1438 |
|
|
|
1439 |
|
|
|
1440 |
----------------------------------------------------------------------
|
|
|
1441 |
|
|
|
1442 |
4.3.2.2. Interface manual
|
|
|
1443 |
|
|
|
1444 |
NAME
|
|
|
1445 |
recoll - This is an interface to the Recoll full text indexer.
|
|
|
1446 |
|
|
|
1447 |
FILE
|
|
|
1448 |
/usr/local/lib/python2.5/site-packages/recoll.so
|
|
|
1449 |
|
|
|
1450 |
CLASSES
|
|
|
1451 |
Db
|
|
|
1452 |
Doc
|
|
|
1453 |
Query
|
|
|
1454 |
SearchData
|
|
|
1455 |
|
|
|
1456 |
class Db(__builtin__.object)
|
|
|
1457 |
| Db([confdir=None], [extra_dbs=None], [writable = False])
|
|
|
1458 |
|
|
|
|
1459 |
| A Db object holds a connection to a Recoll index. Use the connect()
|
|
|
1460 |
| function to create one.
|
|
|
1461 |
| confdir specifies a Recoll configuration directory (default:
|
|
|
1462 |
| $RECOLL_CONFDIR or ~/.recoll).
|
|
|
1463 |
| extra_dbs is a list of external databases (xapian directories)
|
|
|
1464 |
| writable decides if we can index new data through this connection
|
|
|
1465 |
|
|
|
|
1466 |
| Methods defined here:
|
|
|
1467 |
|
|
|
|
1468 |
|
|
|
|
1469 |
| addOrUpdate(...)
|
|
|
1470 |
| addOrUpdate(udi, doc, parent_udi=None) -> None
|
|
|
1471 |
| Add or update index data for a given document
|
|
|
1472 |
| The udi string must define a unique id for the document. It is not
|
|
|
1473 |
| interpreted inside Recoll
|
|
|
1474 |
| doc is a Doc object
|
|
|
1475 |
| if parent_udi is set, this is a unique identifier for the
|
|
|
1476 |
| top-level container (e.g. an mbox file)
|
|
|
1477 |
|
|
|
|
1478 |
| delete(...)
|
|
|
1479 |
| delete(udi) -> Bool.
|
|
|
1480 |
| Purge index from all data for udi. If udi matches a container
|
|
|
1481 |
| document, purge all subdocs (docs with a parent_udi matching udi).
|
|
|
1482 |
|
|
|
|
1483 |
| makeDocAbstract(...)
|
|
|
1484 |
| makeDocAbstract(Doc, Query) -> string
|
|
|
1485 |
| Build and return 'keyword-in-context' abstract for document
|
|
|
1486 |
| and query.
|
|
|
1487 |
|
|
|
|
1488 |
| needUpdate(...)
|
|
|
1489 |
| needUpdate(udi, sig) -> Bool.
|
|
|
1490 |
| Check if the index is up to date for the document defined by udi,
|
|
|
1491 |
| having the current signature sig.
|
|
|
1492 |
|
|
|
|
1493 |
| purge(...)
|
|
|
1494 |
| purge() -> Bool.
|
|
|
1495 |
| Delete all documents that were not touched during the just finished
|
|
|
1496 |
| indexing pass (since open-for-write). These are the documents for which
|
|
|
1497 |
| the needUpdate() call was not performed, indicating that they no
|
|
|
1498 |
| longer exist in the primary storage system.
|
|
|
1499 |
|
|
|
|
1500 |
| query(...)
|
|
|
1501 |
| query() -> Query. Return a new, blank query object for this index.
|
|
|
1502 |
|
|
|
|
1503 |
| setAbstractParams(...)
|
|
|
1504 |
| setAbstractParams(maxchars, contextwords).
|
|
|
1505 |
| Set the parameters used to build 'keyword-in-context' abstracts
|
|
|
1506 |
|
|
1199 |
----------------------------------------------------------------------
|
1507 |
| ----------------------------------------------------------------------
|
|
|
1508 |
| Data and other attributes defined here:
|
|
|
1509 |
|
|
|
|
1510 |
|
|
|
1511 |
class Doc(__builtin__.object)
|
|
|
1512 |
| Doc()
|
|
|
1513 |
|
|
|
|
1514 |
| A Doc object contains index data for a given document.
|
|
|
1515 |
| The data is extracted from the index when searching, or set by the
|
|
|
1516 |
| indexer program when updating. The Doc object has no useful methods but
|
|
|
1517 |
| many attributes to be read or set by its user. It matches exactly the
|
|
|
1518 |
| Rcl::Doc c++ object. Some of the attributes are predefined, but,
|
|
|
1519 |
| especially when indexing, others can be set, the name of which will be
|
|
|
1520 |
| processed as field names by the indexing configuration.
|
|
|
1521 |
| Inputs can be specified as unicode or strings.
|
|
|
1522 |
| Outputs are unicode objects.
|
|
|
1523 |
| All dates are specified as unix timestamps, printed as strings
|
|
|
1524 |
| Predefined attributes (index/query/both):
|
|
|
1525 |
| text (index): document plain text
|
|
|
1526 |
| url (both)
|
|
|
1527 |
| fbytes (both) optional file size in bytes
|
|
|
1528 |
| filename (both)
|
|
|
1529 |
| fmtime (both) optional file modification date. Unix time printed
|
|
|
1530 |
| as string
|
|
|
1531 |
| dbytes (both) document text bytes
|
|
|
1532 |
| dmtime (both) document creation/modification date
|
|
|
1533 |
| ipath (both) value private to the app.: internal access path
|
|
|
1534 |
| inside file
|
|
|
1535 |
| mtype (both) mime type for original document
|
|
|
1536 |
| mtime (query) dmtime if set else fmtime
|
|
|
1537 |
| origcharset (both) charset the text was converted from
|
|
|
1538 |
| size (query) dbytes if set, else fbytes
|
|
|
1539 |
| sig (both) app-defined file modification signature.
|
|
|
1540 |
| For up to date checks
|
|
|
1541 |
| relevancyrating (query)
|
|
|
1542 |
| abstract (both)
|
|
|
1543 |
| author (both)
|
|
|
1544 |
| title (both)
|
|
|
1545 |
| keywords (both)
|
|
|
1546 |
|
|
|
|
1547 |
| Methods defined here:
|
|
|
1548 |
|
|
|
|
1549 |
|
|
|
|
1550 |
| ----------------------------------------------------------------------
|
|
|
1551 |
| Data and other attributes defined here:
|
|
|
1552 |
|
|
|
|
1553 |
|
|
|
1554 |
class Query(__builtin__.object)
|
|
|
1555 |
| Recoll Query objects are used to execute index searches.
|
|
|
1556 |
| They must be created by the Db.query() method.
|
|
|
1557 |
|
|
|
|
1558 |
| Methods defined here:
|
|
|
1559 |
|
|
|
|
1560 |
|
|
|
|
1561 |
| execute(...)
|
|
|
1562 |
| execute(query_string, stemming=1|0)
|
|
|
1563 |
|
|
|
|
1564 |
| Starts a search for query_string, a Recoll search language string
|
|
|
1565 |
| (mostly Xesam-compatible).
|
|
|
1566 |
| The query can be a simple list of terms (and'ed by default), or more
|
|
|
1567 |
| complicated with field specs etc. See the Recoll manual.
|
|
|
1568 |
|
|
|
|
1569 |
| executesd(...)
|
|
|
1570 |
| executesd(SearchData)
|
|
|
1571 |
|
|
|
|
1572 |
| Starts a search for the query defined by the SearchData object.
|
|
|
1573 |
|
|
|
|
1574 |
| fetchone(...)
|
|
|
1575 |
| fetchone(None) -> Doc
|
|
|
1576 |
|
|
|
|
1577 |
| Fetches the next Doc object in the current search results.
|
|
|
1578 |
|
|
|
|
1579 |
| sortby(...)
|
|
|
1580 |
| sortby(field=fieldname, ascending=true)
|
|
|
1581 |
| Sort results by 'fieldname', in ascending or descending order.
|
|
|
1582 |
| Only one field can be used, no subsorts for now.
|
|
|
1583 |
| Must be called before executing the search
|
|
|
1584 |
|
|
|
|
1585 |
| ----------------------------------------------------------------------
|
|
|
1586 |
| Data descriptors defined here:
|
|
|
1587 |
|
|
|
|
1588 |
| next
|
|
|
1589 |
| Next index to be fetched from results. Normally increments after
|
|
|
1590 |
| each fetchone() call, but can be set/reset before the call to effect
|
|
|
1591 |
| seeking. Starts at 0
|
|
|
1592 |
|
|
|
|
1593 |
| ----------------------------------------------------------------------
|
|
|
1594 |
| Data and other attributes defined here:
|
|
|
1595 |
|
|
|
|
1596 |
|
|
|
1597 |
class SearchData(__builtin__.object)
|
|
|
1598 |
| SearchData()
|
|
|
1599 |
|
|
|
|
1600 |
| A SearchData object describes a query. It has a number of global
|
|
|
1601 |
| parameters and a chain of search clauses.
|
|
|
1602 |
|
|
|
|
1603 |
| Methods defined here:
|
|
|
1604 |
|
|
|
|
1605 |
|
|
|
|
1606 |
| addclause(...)
|
|
|
1607 |
| addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
|
|
1608 |
| qstring=string, slack=int, field=string, stemming=1|0,
|
|
|
1609 |
| subSearch=SearchData)
|
|
|
1610 |
| Adds a simple clause to the SearchData And/Or chain, or a subquery
|
|
|
1611 |
| defined by another SearchData object
|
|
|
1612 |
|
|
|
|
1613 |
| ----------------------------------------------------------------------
|
|
|
1614 |
| Data and other attributes defined here:
|
|
|
1615 |
|
|
1200 |
|
1616 |
|
|
|
1617 |
FUNCTIONS
|
|
|
1618 |
connect(...)
|
|
|
1619 |
connect([confdir=None], [extra_dbs=None], [writable = False])
|
|
|
1620 |
-> Db.
|
|
|
1621 |
|
|
|
1622 |
Connects to a Recoll database and returns a Db object.
|
|
|
1623 |
confdir specifies a Recoll configuration directory
|
|
|
1624 |
(the default is built like for any Recoll program).
|
|
|
1625 |
extra_dbs is a list of external databases (xapian directories)
|
|
|
1626 |
writable decides if we can index new data through this connection
|
|
|
1627 |
|
|
|
1628 |
|
|
|
1629 |
|
|
|
1630 |
----------------------------------------------------------------------
|
|
|
1631 |
|
|
|
1632 |
4.3.2.3. Example code
|
|
|
1633 |
|
|
|
1634 |
The following sample would query the index with a user language string.
|
|
|
1635 |
See the python/samples directory inside the Recoll source for other
|
|
|
1636 |
examples.
|
|
|
1637 |
|
|
|
1638 |
#!/usr/bin/env python
|
|
|
1639 |
|
|
|
1640 |
import recoll
|
|
|
1641 |
|
|
|
1642 |
db = recoll.connect()
|
|
|
1643 |
db.setAbstractParams(maxchars=80, contextwords=2)
|
|
|
1644 |
|
|
|
1645 |
query = db.query()
|
|
|
1646 |
nres = query.execute("some user question")
|
|
|
1647 |
print "Result count: ", nres
|
|
|
1648 |
if nres > 5:
|
|
|
1649 |
    nres = 5
|
|
|
1650 |
while query.next >= 0 and query.next < nres:
|
|
|
1651 |
    doc = query.fetchone()
|
|
|
1652 |
    print query.next
|
|
|
1653 |
    for k in ("title", "size"):
|
|
|
1654 |
        print k, ":", getattr(doc, k).encode('utf-8')
|
|
|
1655 |
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
|
|
1656 |
    print abs
|
|
|
1657 |
    print
|
|
|
1658 |
|
|
|
1659 |
|
|
|
1660 |
|
|
|
1661 |
----------------------------------------------------------------------
|
|
|
1662 |
|
1201 |
Chapter 4. Installation
|
1663 |
Chapter 5. Installation
|
1202 |
|
1664 |
|
1203 |
4.1. Installing a prebuilt copy
|
1665 |
5.1. Installing a prebuilt copy
|
1204 |
|
1666 |
|
1205 |
Recoll binary packages from the Recoll web site are always linked
|
1667 |
Recoll binary packages from the Recoll web site are always linked
|
1206 |
statically to the Xapian libraries, and have no other dependencies. You
|
1668 |
statically to the Xapian libraries, and have no other dependencies. You
|
1207 |
will only have to check or install supporting applications for the file
|
1669 |
will only have to check or install supporting applications for the file
|
1208 |
types that you want to index beyond text, HTML and mail files, and maybe
|
1670 |
types that you want to index beyond text, HTML and mail files, and maybe
|
1209 |
have a look at the configuration section (but this may not be necessary
|
1671 |
have a look at the configuration section (but this may not be necessary
|
1210 |
for a quick test with default parameters).
|
1672 |
for a quick test with default parameters).
|
1211 |
|
1673 |
|
1212 |
----------------------------------------------------------------------
|
1674 |
----------------------------------------------------------------------
|
1213 |
|
1675 |
|
1214 |
4.1.1. Installing through a package system
|
1676 |
5.1.1. Installing through a package system
|
1215 |
|
1677 |
|
1216 |
If you use a BSD-type port system or a prebuilt package (RPM or other),
|
1678 |
If you use a BSD-type port system or a prebuilt package (RPM or other),
|
1217 |
just follow the usual procedure for your system.
|
1679 |
just follow the usual procedure for your system.
|
1218 |
|
1680 |
|
1219 |
----------------------------------------------------------------------
|
1681 |
----------------------------------------------------------------------
|
1220 |
|
1682 |
|
1221 |
4.1.2. Installing a prebuilt Recoll
|
1683 |
5.1.2. Installing a prebuilt Recoll
|
1222 |
|
1684 |
|
1223 |
The unpackaged binary versions on the Recoll web site are just compressed
|
1685 |
The unpackaged binary versions on the Recoll web site are just compressed
|
1224 |
tar files of a build tree, where only the useful parts were kept
|
1686 |
tar files of a build tree, where only the useful parts were kept
|
1225 |
(executables and sample configuration).
|
1687 |
(executables and sample configuration).
|
1226 |
|
1688 |
|
|
... |
|
... |
1231 |
had built the package from source (that is, just type make install). The
|
1693 |
had built the package from source (that is, just type make install). The
|
1232 |
binary trees are built for installation to /usr/local.
|
1694 |
binary trees are built for installation to /usr/local.
|
1233 |
|
1695 |
|
1234 |
----------------------------------------------------------------------
|
1696 |
----------------------------------------------------------------------
|
1235 |
|
1697 |
|
1236 |
4.2. Supporting packages
|
1698 |
5.2. Supporting packages
|
1237 |
|
1699 |
|
1238 |
Recoll uses external applications to index some file types. You need to
|
1700 |
Recoll uses external applications to index some file types. You need to
|
1239 |
install them for the file types that you wish to have indexed (these are
|
1701 |
install them for the file types that you wish to have indexed (these are
|
1240 |
run-time dependencies. None is needed for building Recoll):
|
1702 |
run-time dependencies. None is needed for building Recoll).
|
|
|
1703 |
|
|
|
1704 |
After an indexing pass, the commands that were found missing can be
|
|
|
1705 |
displayed from the recoll File menu. The list is stored in the missing
|
|
|
1706 |
text file inside the configuration directory.
|
|
|
1707 |
|
|
|
1708 |
A list of common file types which need external commands:
|
1241 |
|
1709 |
|
1242 |
* Openoffice: supported natively, but needs the unzip command to be
|
1710 |
* Openoffice: supported natively, but needs the unzip command to be
|
1243 |
installed.
|
1711 |
installed.
|
1244 |
|
1712 |
|
1245 |
* PDF: pdftotext is part of the Xpdf package.
|
1713 |
* PDF: pdftotext is part of the Xpdf package.
|
|
... |
|
... |
1273 |
Text, HTML, mail folders, Openoffice and Scribus files are processed
|
1741 |
Text, HTML, mail folders, Openoffice and Scribus files are processed
|
1274 |
internally. Lyx is used to index Lyx files. Many filters need sed and awk.
|
1742 |
internally. Lyx is used to index Lyx files. Many filters need sed and awk.
|
1275 |
|
1743 |
|
1276 |
----------------------------------------------------------------------
|
1744 |
----------------------------------------------------------------------
|
1277 |
|
1745 |
|
1278 |
4.3. Building from source
|
1746 |
5.3. Building from source
|
1279 |
|
1747 |
|
1280 |
4.3.1. Prerequisites
|
1748 |
5.3.1. Prerequisites
|
1281 |
|
1749 |
|
1282 |
At the very least, you will need to download and install the xapian core
|
1750 |
At the very least, you will need to download and install the xapian core
|
1283 |
package (Recoll 1.9 normally uses version 1.0.2, but any 0.9 or 1.0.x
|
1751 |
package (Recoll 1.9 normally uses version 1.0.2, but any 0.9 or 1.0.x
|
1284 |
version will work too), and the qt run-time and development packages
|
1752 |
version will work too), and the qt run-time and development packages
|
1285 |
(Recoll development currently uses version 3.3.5, but any 3.3 version is
|
1753 |
(Recoll development currently uses version 3.3.5, but any 3.3 version is
|
|
... |
|
... |
1293 |
not be critical). On Linux systems, the iconv interface is part of libc
|
1761 |
not be critical). On Linux systems, the iconv interface is part of libc
|
1294 |
and you should not need to do anything special.
|
1762 |
and you should not need to do anything special.
|
1295 |
|
1763 |
|
1296 |
----------------------------------------------------------------------
|
1764 |
----------------------------------------------------------------------
|
1297 |
|
1765 |
|
1298 |
4.3.2. Building
|
1766 |
5.3.2. Building
|
1299 |
|
1767 |
|
1300 |
Recoll has been built on Linux (redhat7.3, mandriva 2005/6, Fedora Core
|
1768 |
Recoll has been built on Linux (redhat7.3, mandriva 2005/6, Fedora Core
|
1301 |
3/4/5/6), FreeBSD 5/6, macosx, and Solaris 8. If you build on another
|
1769 |
3/4/5/6), FreeBSD 5/6, macosx, and Solaris 8. If you build on another
|
1302 |
system, and need to modify things, I would very much welcome patches.
|
1770 |
system, and need to modify things, I would very much welcome patches.
|
1303 |
|
1771 |
|
|
... |
|
... |
1333 |
manually copy and modify one of the existing files (the new file name
|
1801 |
manually copy and modify one of the existing files (the new file name
|
1334 |
should be the output of uname -s).
|
1802 |
should be the output of uname -s).
|
1335 |
|
1803 |
|
1336 |
----------------------------------------------------------------------
|
1804 |
----------------------------------------------------------------------
|
1337 |
|
1805 |
|
1338 |
4.3.3. Installation
|
1806 |
5.3.3. Installation
|
1339 |
|
1807 |
|
1340 |
Either type make install or execute recollinstall prefix, in the root of
|
1808 |
Either type make install or execute recollinstall prefix, in the root of
|
1341 |
the source tree. This will copy the commands to prefix/bin and the sample
|
1809 |
the source tree. This will copy the commands to prefix/bin and the sample
|
1342 |
configuration files, scripts and other shared data to prefix/share/recoll.
|
1810 |
configuration files, scripts and other shared data to prefix/share/recoll.
|
1343 |
|
1811 |
|
|
... |
|
... |
1348 |
|
1816 |
|
1349 |
You can then proceed to configuration.
|
1817 |
You can then proceed to configuration.
|
1350 |
|
1818 |
|
1351 |
----------------------------------------------------------------------
|
1819 |
----------------------------------------------------------------------
|
1352 |
|
1820 |
|
1353 |
4.4. Configuration overview
|
1821 |
5.4. Configuration overview
|
1354 |
|
1822 |
|
1355 |
Most of the parameters specific to the recoll GUI are set through the
|
1823 |
Most of the parameters specific to the recoll GUI are set through the
|
1356 |
Preferences menu and stored in the standard QT place ($HOME/.qt/recollrc).
|
1824 |
Preferences menu and stored in the standard QT place ($HOME/.qt/recollrc).
|
1357 |
You probably do not want to edit this by hand.
|
1825 |
You probably do not want to edit this by hand.
|
1358 |
|
1826 |
|
|
... |
|
... |
1408 |
White space is used for separation inside lists. List elements with
|
1876 |
White space is used for separation inside lists. List elements with
|
1409 |
embedded spaces can be quoted using double-quotes.
|
1877 |
embedded spaces can be quoted using double-quotes.
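
The quoting rule can be illustrated with a short Python sketch. It is only an approximation: Recoll reads its configuration files with its own parser, and split_list is a hypothetical helper name, not part of Recoll. The standard shlex module follows the same white space and double-quote conventions for simple values, but it also interprets single quotes and backslashes, which the text above does not mention.

```python
import shlex

def split_list(value):
    """Split a list value on white space, keeping double-quoted elements whole."""
    return shlex.split(value)

# The directory names are made-up examples.
print(split_list('/home/me/docs "/home/me/My Documents" /usr/share/doc'))
```

The second element keeps its embedded space because it is enclosed in double quotes.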
|
1410 |
|
1878 |
|
1411 |
----------------------------------------------------------------------
|
1879 |
----------------------------------------------------------------------
|
1412 |
|
1880 |
|
1413 |
4.4.1. Main configuration file
|
1881 |
5.4.1. Main configuration file
|
1414 |
|
1882 |
|
1415 |
recoll.conf is the main configuration file. It defines things like what to
|
1883 |
recoll.conf is the main configuration file. It defines things like what to
|
1416 |
index (top directories and things to ignore), and the default character
|
1884 |
index (top directories and things to ignore), and the default character
|
1417 |
set to use for document types which do not specify it internally.
|
1885 |
set to use for document types which do not specify it internally.
|
1418 |
|
1886 |
|
|
... |
|
... |
1614 |
cases. A value of 3 would allow more precision and efficiency on
|
2082 |
cases. A value of 3 would allow more precision and efficiency on
|
1615 |
longer words, but the index will be approximately twice as large.
|
2083 |
longer words, but the index will be approximately twice as large.
|
1616 |
|
2084 |
|
1617 |
----------------------------------------------------------------------
|
2085 |
----------------------------------------------------------------------
|
1618 |
|
2086 |
|
1619 |
4.4.2. The mimemap file
|
2087 |
5.4.2. The mimemap file
|
1620 |
|
2088 |
|
1621 |
mimemap specifies the file name extension to mime type mappings.
|
2089 |
mimemap specifies the file name extension to mime type mappings.
|
1622 |
|
2090 |
|
1623 |
For file names without an extension, or with an unknown one, the system's
|
2091 |
For file names without an extension, or with an unknown one, the system's
|
1624 |
file -i command will be executed to determine the mime type (this can be
|
2092 |
file -i command will be executed to determine the mime type (this can be
|
|
... |
|
... |
1640 |
there avoids cluttering the more user-oriented and locally customized
|
2108 |
there avoids cluttering the more user-oriented and locally customized
|
1641 |
skippedNames.
|
2109 |
skippedNames.
|
1642 |
|
2110 |
|
1643 |
----------------------------------------------------------------------
|
2111 |
----------------------------------------------------------------------
|
1644 |
|
2112 |
|
1645 |
4.4.3. The mimeconf file
|
2113 |
5.4.3. The mimeconf file
|
1646 |
|
2114 |
|
1647 |
mimeconf specifies how the different mime types are handled for indexing,
|
2115 |
mimeconf specifies how the different mime types are handled for indexing,
|
1648 |
and which icons are displayed in the recoll result lists.
|
2116 |
and which icons are displayed in the recoll result lists.
|
1649 |
|
2117 |
|
1650 |
Changing the parameters in the [index] section is probably not a good idea
|
2118 |
Changing the parameters in the [index] section is probably not a good idea
|
|
... |
|
... |
1654 |
recoll in the result lists (the values are the basenames of the png images
|
2122 |
recoll in the result lists (the values are the basenames of the png images
|
1655 |
inside the iconsdir directory (specified in recoll.conf)).
|
2123 |
inside the iconsdir directory (specified in recoll.conf)).
|
1656 |
|
2124 |
|
1657 |
----------------------------------------------------------------------
|
2125 |
----------------------------------------------------------------------
|
1658 |
|
2126 |
|
1659 |
4.4.4. The mimeview file
|
2127 |
5.4.4. The mimeview file
|
1660 |
|
2128 |
|
1661 |
mimeview specifies which programs are started when you click on an Edit
|
2129 |
mimeview specifies which programs are started when you click on an Edit
|
1662 |
link in a result list. For example: HTML is normally displayed using firefox, but
|
2130 |
link in a result list. For example: HTML is normally displayed using firefox, but
|
1663 |
you may prefer Konqueror; your openoffice.org program might be named
|
2131 |
you may prefer Konqueror; your openoffice.org program might be named
|
1664 |
ooffice instead of openoffice, etc.
|
2132 |
ooffice instead of openoffice, etc.
|
|
... |
|
... |
1677 |
user preferences, all mimeview entries will be ignored except the one
|
2145 |
user preferences, all mimeview entries will be ignored except the one
|
1678 |
labelled application/x-all (which is set to use xdg-open by default).
|
2146 |
labelled application/x-all (which is set to use xdg-open by default).
|
1679 |
|
2147 |
|
1680 |
----------------------------------------------------------------------
|
2148 |
----------------------------------------------------------------------
|
1681 |
|
2149 |
|
1682 |
4.4.5. Examples of configuration adjustments
|
2150 |
5.4.5. Examples of configuration adjustments
|
1683 |
|
2151 |
|
1684 |
4.4.5.1. Adding an external viewer for a non-indexed type
|
2152 |
5.4.5.1. Adding an external viewer for a non-indexed type
|
1685 |
|
2153 |
|
1686 |
Imagine that you have some kind of file which does not have indexable
|
2154 |
Imagine that you have some kind of file which does not have indexable
|
1687 |
content, but for which you would like to have a functional Edit link in
|
2155 |
content, but for which you would like to have a functional Edit link in
|
1688 |
the result list (when found by file name). The file names end in .blob and
|
2156 |
the result list (when found by file name). The file names end in .blob and
|
1689 |
can be displayed by application blobviewer.
|
2157 |
can be displayed by application blobviewer.
|
|
... |
|
... |
1712 |
The entries you add in your personal file override those in the central
|
2180 |
The entries you add in your personal file override those in the central
|
1713 |
configuration, which you do not need to alter.
|
2181 |
configuration, which you do not need to alter.
|
1714 |
|
2182 |
|
1715 |
----------------------------------------------------------------------
|
2183 |
----------------------------------------------------------------------
|
1716 |
|
2184 |
|
1717 |
4.4.5.2. Adding indexing support for a new file type
|
2185 |
5.4.5.2. Adding indexing support for a new file type
|
1718 |
|
2186 |
|
1719 |
Let us now imagine that the above .blob files actually contain indexable
|
2187 |
Let us now imagine that the above .blob files actually contain indexable
|
1720 |
text and that you know how to extract it with a command line program.
|
2188 |
text and that you know how to extract it with a command line program.
|
1721 |
Getting Recoll to index the files is easy. You need to perform the above
|
2189 |
Getting Recoll to index the files is easy. You need to perform the above
|
1722 |
alteration, and also to add data to the mimeconf file (typically in
|
2190 |
alteration, and also to add data to the mimeconf file (typically in
|
|
... |
|
... |
1736 |
makes sense (you can also create a category). Categories may be used
|
2204 |
makes sense (you can also create a category). Categories may be used
|
1737 |
for filtering in advanced search.
|
2205 |
for filtering in advanced search.
|
1738 |
|
2206 |
|
1739 |
The rclblob filter should be an executable program or script which exists
|
2207 |
The rclblob filter should be an executable program or script which exists
|
1740 |
inside /usr/[local/]share/recoll/filters. It will be given a file name as
|
2208 |
inside /usr/[local/]share/recoll/filters. It will be given a file name as
|
1741 |
argument and should output the text contents in html format on the
|
2209 |
argument and should output the text contents on the standard output.
|
1742 |
standard output.
|
|
|
1743 |
|
2210 |
|
1744 |
You can find more details about writing a Recoll filter in the section
|
2211 |
The filter programming section describes in more detail how to write a
|
1745 |
about writing filters
|
2212 |
filter.
|
1746 |
|
2213 |
|
1747 |
----------------------------------------------------------------------
|
2214 |
----------------------------------------------------------------------
|
1748 |
|
2215 |
|
1749 |
4.5. The KDE Kicker Recoll applet
|
2216 |
5.5. The KDE Kicker Recoll applet
|
1750 |
|
2217 |
|
1751 |
The Recoll source tree contains the source code to the recoll_applet, a
|
2218 |
The Recoll source tree contains the source code to the recoll_applet, a
|
1752 |
small application derived from the find_applet. This can be used to add a
|
2219 |
small application derived from the find_applet. This can be used to add a
|
1753 |
small Recoll launcher to the KDE panel.
|
2220 |
small Recoll launcher to the KDE panel.
|
1754 |
|
2221 |
|
1755 |
The applet is not automatically built with the main Recoll programs. To
|
2222 |
The applet is not automatically built with the main Recoll programs, nor
|
1756 |
build it, you need to unpack the Recoll source code, then go to the
|
2223 |
is it included with the main source distribution (because the KDE build
|
1757 |
kde/recoll_applet/ directory, and type the usual configure;make;make
|
2224 |
boilerplate makes it relatively big). You can download its source from the
|
1758 |
install.
|
2225 |
recoll.org download page. Use the omnipotent configure;make;make install
|
|
|
2226 |
incantation to build and install.
|
1759 |
|
2227 |
|
1760 |
You can then add the applet to the panel by right-clicking the panel and
|
2228 |
You can then add the applet to the panel by right-clicking the panel and
|
1761 |
choosing the Add applet entry.
|
2229 |
choosing the Add applet entry.
|
1762 |
|
2230 |
|
1763 |
The recoll_applet has a small text window where you can type a Recoll
|
2231 |
The recoll_applet has a small text window where you can type a Recoll
|
1764 |
query (in query language form), and an icon which can be used to restrict
|
2232 |
query (in query language form), and an icon which can be used to restrict
|
1765 |
the search to certain types of files.
|
2233 |
the search to certain types of files. It is quite primitive, and launches
|
|
|
2234 |
a new recoll GUI instance every time (even if it is already running). You
|
|
|
2235 |
may find it useful anyway.
|
1766 |
|
2236 |
|
1767 |
----------------------------------------------------------------------
|
2237 |
----------------------------------------------------------------------
|
1768 |
|
|
|
1769 |
4.6. Extending Recoll
|
|
|
1770 |
|
|
|
1771 |
4.6.1. Writing a document filter
|
|
|
1772 |
|
|
|
1773 |
Recoll filters are executable programs which translate from a specific
|
|
|
1774 |
format (e.g. openoffice, acrobat, etc.) to the Recoll indexing input
|
|
|
1775 |
format, which was chosen to be HTML.
|
|
|
1776 |
|
|
|
1777 |
Recoll filters are usually shell-scripts, but this is in no way necessary.
|
|
|
1778 |
These programs are extremely simple and most of the difficulty lies in
|
|
|
1779 |
extracting the text from the native format, not outputting what is
|
|
|
1780 |
expected by Recoll. Happily enough, most document formats already have
|
|
|
1781 |
translators or text extractors which handle the difficult part and can be
|
|
|
1782 |
called from the filter.
|
|
|
1783 |
|
|
|
1784 |
Filters are called with a single argument which is the source file name.
|
|
|
1785 |
They should output the result to stdout.
|
|
|
1786 |
|
|
|
1787 |
The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
|
|
|
1788 |
the filter if the operation is for indexing or previewing. Some filters
|
|
|
1789 |
use this to output a slightly different format. This is not essential.
|
|
|
1790 |
|
|
|
1791 |
The output HTML could be very minimal like the following example:
|
|
|
1792 |
|
|
|
1793 |
<html><head>
|
|
|
1794 |
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
|
|
|
1795 |
</head>
|
|
|
1796 |
<body>some text content</body></html>
|
|
|
1797 |
|
|
|
1798 |
|
|
|
1799 |
You should take care to escape some characters inside the text by
|
|
|
1800 |
transforming them into appropriate entities. "&" should be transformed
|
|
|
1801 |
into "&amp;", "<" should be transformed into "&lt;".
|
|
|
1802 |
|
|
|
1803 |
The character set needs to be specified in the header. It does not need to
|
|
|
1804 |
be UTF-8 (Recoll will take care of translating it), but it must be
|
|
|
1805 |
accurate for good results.
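
The points above (single file name argument, output on stdout, escaping, character set declaration) can be put together in a minimal filter sketch. It is written for Python 3 rather than as a shell script, and several parts are assumptions made for illustration: extract_text simply reads the file as text where a real filter would call a format-specific extractor, and wrapping the preview output in a pre element is an arbitrary example of a "slightly different format", not something Recoll requires.

```python
import html
import os
import sys

def extract_text(path):
    # Placeholder: a real filter would run a format-specific extractor here.
    with open(path, encoding="utf-8", errors="replace") as f:
        return f.read()

def to_html(text, for_preview=False):
    # html.escape turns "&" into "&amp;" and "<" into "&lt;" (and ">" into "&gt;").
    body = html.escape(text, quote=False)
    if for_preview:
        # Arbitrary illustration of a different output when previewing.
        body = "<pre>" + body + "</pre>"
    return ("<html><head>\n"
            '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">\n'
            "</head>\n"
            "<body>" + body + "</body></html>\n")

def main(argv):
    # Filters receive exactly one argument: the source file name.
    if len(argv) != 2:
        sys.stderr.write("usage: filter <file>\n")
        return 1
    for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"
    document = to_html(extract_text(argv[1]), for_preview)
    # Write bytes so that the output really is UTF-8, as declared in the header.
    sys.stdout.buffer.write(document.encode("utf-8"))
    return 0
```

A script built on this sketch would end by calling sys.exit(main(sys.argv)).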
|
|
|
1806 |
|
|
|
1807 |
Recoll will also make use of other header fields if they are present:
|
|
|
1808 |
title, description, keywords.
|
|
|
1809 |
|
|
|
1810 |
As of Recoll release 1.9, filters also have the possibility to "invent"
|
|
|
1811 |
field names. This should be output as meta tags:
|
|
|
1812 |
|
|
|
1813 |
<meta name="somefield" content="Some textual data" />
|
|
|
1814 |
|
|
|
1815 |
In this case, a correspondence between field name and Xapian prefix should
|
|
|
1816 |
also be added to the mimeconf file. See the existing entries for
|
|
|
1817 |
inspiration. The field can then be used inside the query language to
|
|
|
1818 |
narrow searches.
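
A filter written in Python 3 could build such a tag as sketched below. The helper name meta_tag is hypothetical. Escaping the attribute values, including double quotes, is a general HTML precaution rather than a documented Recoll requirement.

```python
import html

def meta_tag(name, content):
    """Return a meta element carrying a custom field name and its value."""
    return '<meta name="%s" content="%s" />' % (
        html.escape(name, quote=True),
        html.escape(content, quote=True),
    )

print(meta_tag("somefield", "Some textual data"))
```

The field still has to be declared in the mimeconf file, as described above, before it can be used in queries.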
|
|
|
1819 |
|
|
|
1820 |
The easiest way to write a new filter is probably to start from an
|
|
|
1821 |
existing one.
|
|
|
1822 |
|
|
|
1823 |
----------------------------------------------------------------------
|
|
|