--- a/src/README
+++ b/src/README
@@ -9,6 +9,12 @@
<jfd@recoll.org>
Copyright (c) 2005-2012 Jean-Francois Dockes
+
+ Permission is granted to copy, distribute and/or modify this document
+ under the terms of the GNU Free Documentation License, Version 1.3 or any
+ later version published by the Free Software Foundation; with no Invariant
+ Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
+ license can be found on the GNU web site.
This document introduces full text search notions and describes the
installation and use of the Recoll application. It currently describes
@@ -52,17 +58,21 @@
2.3.3. The index configuration GUI
- 2.4. Index WEB visited page history
-
- 2.5. Periodic indexing
-
- 2.5.1. Running indexing
-
- 2.5.2. Using cron to automate indexing
-
- 2.6. Real time indexing
-
- 2.6.1. Slowing down the reindexing rate for fast
+ 2.4. Indexing WEB pages you visit
+
+ 2.5. Extended attributes data
+
+ 2.6. Importing external tags
+
+ 2.7. Periodic indexing
+
+ 2.7.1. Running indexing
+
+ 2.7.2. Using cron to automate indexing
+
+ 2.8. Real time indexing
+
+ 2.8.1. Slowing down the reindexing rate for fast
changing files
3. Searching
@@ -102,23 +112,25 @@
3.3. Searching on the command line
- 3.4. The query language
-
- 3.4.1. Modifiers
-
- 3.5. Search case and diacritics sensitivity
-
- 3.6. Anchored searches and wildcards
-
- 3.6.1. More about wildcards
-
- 3.6.2. Anchored searches
-
- 3.7. Desktop integration
-
- 3.7.1. Hotkeying recoll
-
- 3.7.2. The KDE Kicker Recoll applet
+ 3.4. Path translations
+
+ 3.5. The query language
+
+ 3.5.1. Modifiers
+
+ 3.6. Search case and diacritics sensitivity
+
+ 3.7. Anchored searches and wildcards
+
+ 3.7.1. More about wildcards
+
+ 3.7.2. Anchored searches
+
+ 3.8. Desktop integration
+
+ 3.8.1. Hotkeying recoll
+
+ 3.8.2. The KDE Kicker Recoll applet
4. Programming interface
@@ -172,7 +184,9 @@
5.4.5. The mimeview file
- 5.4.6. Examples of configuration adjustments
+ 5.4.6. The ptrans file
+
+ 5.4.7. Examples of configuration adjustments
Chapter 1. Introduction
@@ -396,6 +410,29 @@
recoll GUI. It is stored in the missing text file inside the configuration
directory.
+ By default, Recoll will try to index any file type that it has a way to
+ read. This is sometimes not desirable, and there are ways to either
+ exclude some types, or on the contrary to define a positive list of types
+ to be indexed. In the latter case, any type not in the list will be
+ ignored.
+
+ Excluding types can be done by adding name patterns to the skippedNames
+ list, for example from the GUI Index configuration menu. It is also
+ possible to exclude a mime type independently of the file name by
+ associating it with the rclnull filter. This can be done by editing the
+ mimeconf configuration file.
+
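+ As a purely illustrative sketch (the mime type is hypothetical, and the
+ exact entry format should be checked against the mimeconf file shipped
+ with your Recoll version), the association could look like this in the
+ [index] section of mimeconf:
+
+     application/x-mytype = rclnull
+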
+ In order to define a positive list, you need to edit the main
+ configuration file (recoll.conf) and set the indexedmimetypes
+ configuration variable. Example:
+
+ indexedmimetypes = text/html application/pdf
+
+
+ There is no GUI way to do this, because this option runs a bit contrary to
+ Recoll's main goal, which is to help you find information independently of
+ how it may be stored.
+
2.1.4. Recovery
In the rare case where the index becomes corrupted (which can signal
@@ -615,7 +652,7 @@
use it on hand-edited files, which you might nevertheless want to backup
first...
-2.4. Index WEB visited page history
+2.4. Indexing WEB pages you visit
With the help of a Firefox extension, Recoll can index the Internet pages
that you visit. The extension was initially designed for the Beagle
@@ -638,9 +675,43 @@
make room for new ones, so you need to explicitly archive in some other
place the pages that you want to keep indefinitely.
-2.5. Periodic indexing
-
- 2.5.1. Running indexing
+2.5. Extended attributes data
+
+ User extended attributes are named pieces of information that most modern
+ file systems can attach to any file.
+
+ Recoll versions 1.19 and later process extended attributes as document
+ fields by default. For older versions, this has to be activated at build
+ time.
+
+ A freedesktop standard defines a few special attributes, which are handled
+ as such by Recoll:
+
+ mime_type
+
+ If set, this overrides any other determination of the file mime
+ type.
+
+ charset
+ If set, this defines the file character set (mostly useful for
+ plain text files).
+
+ By default, other attributes are handled as Recoll fields. On Linux, the
+ user prefix is removed from the name. This can be configured more
+ precisely inside the fields configuration file.
+
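+ For example (a hypothetical illustration, assuming a Linux system with
+ the attr tools installed and a file system with user extended attributes
+ enabled), the following shell command would attach a keywords attribute
+ to a file, which would then normally be indexed as the keywords field
+ during the next indexing pass:
+
+     setfattr -n user.keywords -v "sailing boat plans" plan.pdf
+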
+2.6. Importing external tags
+
+ During indexing, it is possible to import metadata for each file by
+ executing commands. For example, this could extract user tag data for the
+ file and store it in a field for indexing.
+
+ See the section about the metadatacmds field in the main configuration
+ chapter for more detail.
+
+2.7. Periodic indexing
+
+ 2.7.1. Running indexing
Indexing is always performed by the recollindex program, which can be
started either from the command line or from the File menu in the recoll
@@ -696,7 +767,7 @@
parameters, but just add them as index entries. It is up to the external
file selection method to build the complete file list.
- 2.5.2. Using cron to automate indexing
+ 2.7.2. Using cron to automate indexing
The most common way to set up indexing is to have a cron task execute it
every night. For example the following crontab entry would do it every day
@@ -722,7 +793,7 @@
Especially the PATH variable may be of concern. Please check the crontab
manual pages about possible issues.
-2.6. Real time indexing
+2.8. Real time indexing
Real time monitoring/indexing is performed by starting the recollindex -m
command. With this option, recollindex will detach from the terminal and
@@ -781,7 +852,7 @@
it if your system is short on resources. Periodic indexing is adequate in
most cases.
- 2.6.1. Slowing down the reindexing rate for fast changing files
+ 2.8.1. Slowing down the reindexing rate for fast changing files
When using the real time monitor, it may happen that some files need to be
indexed, but change so often that they impose an excessive load for the
@@ -1275,7 +1346,8 @@
to search for (ie a wildcard expression like *coll), the expansion can
take quite a long time because the full index term list will have to be
processed. The expansion is currently limited at 10000 results for
- wildcards and regular expressions.
+ wildcards and regular expressions. It is possible to change the limit in
+ the configuration file.
Double-clicking on a term in the result list will insert it into the
simple search entry field. You can also cut/paste between the result list
@@ -1294,14 +1366,19 @@
Index selection is performed in two phases. A set of all usable indexes
must first be defined, and then the subset of indexes to be used for
- searching. Of course, these parameters are retained across program
- executions (there are kept separately for each Recoll configuration). The
- set of all indexes is usually quite stable, while the active ones might
- typically be adjusted quite frequently.
+ searching. These parameters are retained across program executions (they
+ are kept separately for each Recoll configuration). The set of all indexes
+ is usually quite stable, while the active ones might typically be adjusted
+ quite frequently.
The main index (defined by RECOLL_CONFDIR) is always active. If this is
undesirable, you can set up your base configuration to index an empty
directory.
+
+ When adding a new index to the set, you can select either a Recoll
+ configuration directory, or directly a Xapian index directory. In the
+ first case, the Xapian index directory will be obtained from the selected
+ configuration.
As building the set of all indexes can be a little tedious when done
through the user interface, you can use the RECOLL_EXTRA_DBS environment
@@ -1455,6 +1532,11 @@
PageDown to scroll the result list, Shift+Home to go back to the first
page. These work even while the focus is in the search entry.
+ Editing a new search while the focus is not in the search entry. You can
+ use the Ctrl-Shift-S shortcut to return the cursor to the search entry
+ (and select the current search text) from anywhere in the main window.
+
Forced opening of a preview window. You can use Shift+Click on a result
list Preview link to force the creation of a preview window instead of a
new tab in the existing one.
@@ -1490,6 +1572,14 @@
/usr/share/recoll/examples directory. Using a style sheet, you can
change most recoll graphical parameters: colors, fonts, etc. See the
sample file for a few simple examples.
+
+ You should be aware that parameters (e.g. the background color) set
+ inside the Recoll GUI style sheet override the global system
+ preferences, which can have strange side effects. For example, if the
+ desktop preferences set a light foreground and a dark background, but
+ the Recoll style sheet only sets the background, and sets it to a
+ light color, then text will appear light-on-light inside the Recoll
+ GUI.
o Maximum text size highlighted for preview Inserting highlights on
search term inside the text before inserting it in the preview window
@@ -1693,11 +1783,9 @@
indexed but not stored fields is not known at this point in the search
process (see field configuration). There are currently very few fields
stored by default, apart from the values above (only author and filename),
- so this feature will need some custom local configuration to be useful.
- For example, you could look at the fields for the document types of
- interest (use the right-click menu inside the preview window), and add
- what you want to the list of stored fields. A candidate example would be
- the recipient field which is generated by the message filters.
+ so this feature will need some custom local configuration to be useful. An
+ example candidate would be the recipient field which is generated by the
+ message filters.
The default value for the paragraph format string is:
@@ -1759,7 +1847,7 @@
of HTML documents (for example a manual) so that they become their own
search interface inside konqueror.
- This can be done by either explicitly inserting <a href="recoll:/...">
+ This can be done by either explicitly inserting <a href="recoll://...">
links around some document areas, or automatically by adding a very small
javascript program to the documents, like the following example, which
would initiate a search by double-clicking any term:
@@ -1842,7 +1930,44 @@
text/html [file:///Users/uncrypted-dockes/projets/pagepers/index.html] [psxtcl/writemime/recoll]...
text/html [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
-3.4. The query language
+3.4. Path translations
+
+ In some cases, the document paths stored inside the index do not match the
+ actual ones, so that document previews and accesses will fail. This can
+ occur in a number of circumstances:
+
+ o When using multiple indexes it is a relatively common occurrence that
+ some will actually reside on a remote volume, for example mounted via
+ NFS. In this case, the paths used to access the documents on the local
+ machine are not necessarily the same as the ones used while indexing
+ on the remote machine. For example, /home/me may have been used as a
+ topdirs element while indexing, but the directory might be mounted as
+ /net/server/home/me on the local machine.
+
+ o The case may also occur with removable disks. It is perfectly possible
+ to configure an index to live with the documents on the removable
+ disk, but it may happen that the disk is not mounted at the same place
+ so that the document paths from the index are invalid.
+
+ o As a last example, one could imagine that a big directory has been
+ moved, but that it is currently inconvenient to run the indexer.
+
+ More generally, the path translation facility may be useful whenever the
+ document paths seen by the indexer are not the same as the ones which
+ should be used at query time.
+
+ Recoll has a facility for rewriting access paths when extracting the data
+ from the index. The translations can be defined for the main index and for
+ any additional query index.
+
+ In the above NFS example, Recoll could be instructed to rewrite any
+ file:///home/me URL from the index to file:///net/server/home/me, allowing
+ accesses from the client.
+
+ The translations are defined in the ptrans configuration file, which can
+ be edited by hand or from the GUI external indexes configuration dialog.
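+
+ For the NFS example above, a ptrans section along the following lines
+ would perform the rewrite (the Xapian directory path is an arbitrary
+ example; see the description of the ptrans file in the configuration
+ chapter for the exact format):
+
+     [/home/me/.recoll/xapiandb]
+     /home/me = /net/server/home/me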
+
+3.5. The query language
The query language processor is activated in the GUI simple search entry
when the search mode selector is set to Query Language. It can also be
@@ -1914,9 +2039,9 @@
o dir for filtering the results on file location (Ex:
dir:/home/me/somedir). -dir also works to find results not in the
specified directory (release >= 1.15.8). A tilde inside the value will
- be expanded to the home directory. Wildcards will not be expanded. You
- cannot use OR with dir clauses (this restriction may go away in the
- future).
+ be expanded to the home directory. Wildcards will be expanded, but
+ please have a look at an important limitation of wildcards in path
+ filters.
Relative paths also make sense, for example, dir:share/doc would match
either /usr/share/doc or /usr/local/share/doc
@@ -1930,8 +2055,10 @@
This would select results which have both recoll and src in the path
(in any order), and which have not either utils or common.
- Another special aspect of dir clauses is that the values in the index
- are not transcoded to UTF-8, and never lower-cased or unaccented, but
+ You can also use OR conjunctions with dir: clauses.
+
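+ For example, the following query (the paths are arbitrary) would accept
+ results located under either directory:
+
+     dir:/home/me/projects OR dir:/usr/share/doc
+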
+ A special aspect of dir clauses is that the values in the index are
+ not transcoded to UTF-8, and never lower-cased or unaccented, but
stored as binary. This means that you need to enter the values in the
exact lower or upper case, and that searches for names with diacritics
may sometimes be impossible because of character set conversion
@@ -2000,7 +2127,7 @@
configuration, so that the exact field search possibilities may be
different for you if someone took care of the customisation.
- 3.4.1. Modifiers
+ 3.5.1. Modifiers
Some characters are recognized as search modifiers when found immediately
after the closing double quote of a phrase, as in "some
@@ -2025,7 +2152,7 @@
o A weight can be specified for a query element by specifying a decimal
value at the start of the modifiers. Example: "Important"2.5.
-3.5. Search case and diacritics sensitivity
+3.6. Search case and diacritics sensitivity
For Recoll versions 1.18 and later, and when working with a raw index (not
the default), searches can be made sensitive to character case and
@@ -2075,7 +2202,7 @@
When either case or diacritics sensitivity is activated, stem expansion is
turned off. Having both does not make much sense.
-3.6. Anchored searches and wildcards
+3.7. Anchored searches and wildcards
Some special characters are interpreted by Recoll in search strings to
expand or specialize the search. Wildcards expand a root term in
@@ -2083,7 +2210,7 @@
if the match is found at or near the beginning of the document or one of
its fields.
- 3.6.1. More about wildcards
+ 3.7.1. More about wildcards
All words entered in Recoll search fields will be processed for wildcard
expansion before the request is finally executed.
@@ -2098,15 +2225,18 @@
matches a single character which may be 'a' or 'b' or 'c', [0-9]
matches any number.
- You should be aware of a few things before using wildcards.
+ You should be aware of a few things when using wildcards.
o Using a wildcard character at the beginning of a word can make for a
slow search because Recoll will have to scan the whole index term list
- to find the matches.
-
- o When working with a raw index (preserving character case and
- diacritics), the literal part of a wildcard expression will be matched
- exactly for case and diacritics.
+ to find the matches. However, this is much less of a problem for field
+ searches, and queries like author:*@domain.com can sometimes be very
+ useful.
+
+ o For Recoll version 1.18 only, when working with a raw index (preserving
+ character case and diacritics), the literal part of a wildcard
+ expression will be matched exactly for case and diacritics. This is
+ no longer true for versions 1.19 and later.
o Using a * at the end of a word can produce more matches than you would
think, and strange search results. You can use the term explorer tool
@@ -2116,7 +2246,22 @@
expansion will produce better results than an ending * (stem expansion
is turned off when any wildcard character appears in the term).
- 3.6.2. Anchored searches
+ 3.7.1.1. Wildcards and path filtering
+
+ Due to the way that Recoll processes wildcards inside dir path filtering
+ clauses, they will have a multiplicative effect on the query size. A
+ clause containing wildcards in several path elements, for example
+ dir:/home/me/*/*/docdir, will almost certainly fail if your indexed tree
+ is of any realistic size.
+
+ Depending on the case, you may be able to work around the issue by
+ specifying the path elements more narrowly, with a constant prefix, or by
+ using two separate dir: clauses instead of multiple wildcards, as in
+ dir:/home/me dir:docdir. The latter query is not equivalent to the initial
+ one because it does not specify the number of directory levels, but that's
+ the best we can do (and it may actually be more useful in some cases).
+
+ 3.7.2. Anchored searches
Two characters are used to specify that a search hit should occur at the
beginning or at the end of the text. ^ at the beginning of a term or
@@ -2145,7 +2290,7 @@
matches inside the abstract or the list of authors (which occur at the top
of the document).
-3.7. Desktop integration
+3.8. Desktop integration
Being independant of the desktop type has its drawbacks: Recoll desktop
integration is minimal. However there are a few tools available:
@@ -2159,14 +2304,14 @@
Here follow a few other things that may help.
- 3.7.1. Hotkeying recoll
+ 3.8.1. Hotkeying recoll
It is surprisingly convenient to be able to show or hide the Recoll GUI
with a single keystroke. Recoll comes with a small Python script, based on
the libwnck window manager interface library, which will allow you to do
just this. The detailed instructions are on this wiki page.
- 3.7.2. The KDE Kicker Recoll applet
+ 3.8.2. The KDE Kicker Recoll applet
This is probably obsolete now. Anyway:
@@ -2368,32 +2513,68 @@
The output HTML could be very minimal like the following example:
- <html><head>
- <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
- </head>
- <body>some text content</body></html>
+ <html>
+ <head>
+ <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
+ </head>
+ <body>
+ Some text content
+ </body>
+ </html>
You should take care to escape some characters inside the text by
- transforming them into appropriate entities. "&" should be transformed
- into "&", "<" should be transformed into "<". This is not always
- properly done by translating programs which output HTML, and of course
- never by those which output plain text.
+ transforming them into appropriate entities. At the very minimum, "&"
+ should be transformed into "&amp;", and "<" should be transformed into
+ "&lt;".
+ This is not always properly done by translating programs which output
+ HTML, and of course never by those which output plain text.
+
+ When encapsulating plain text in an HTML body, the display of a preview
+ may be improved by enclosing the text inside <pre> tags.
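+
+ For example (a minimal sketch), a plain text converter could output the
+ body part as:
+
+     <body><pre>
+     Some plain text content, with line breaks and spacing preserved.
+     </pre></body>
+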
The character set needs to be specified in the header. It does not need to
be UTF-8 (Recoll will take care of translating it), but it must be
accurate for good results.
- Recoll will also make use of other header fields if they are present:
- title, description, keywords.
-
- Filters also have the possibility to "invent" field names. This should be
- output as meta tags:
+ Recoll will process meta tags inside the header as possible document
+ field candidates. Document fields can be processed by the indexer in
+ different ways, for searching or displaying inside query results. This is
+ described in a following section.
+
+ By default, the indexer will process the standard header fields if they
+ are present: title, meta/description, and meta/keywords are both indexed
+ and stored for query-time display.
+
+ A predefined non-standard meta tag will also be processed by Recoll
+ without further configuration: if a date tag is present and has the right
+ format, it will be used as the document date (for display and sorting), in
+ preference to the file modification date. The date format should be as
+ follows:
+
+ <meta name="date" content="YYYY-mm-dd HH:MM:SS">
+ or
+ <meta name="date" content="YYYY-mm-ddTHH:MM:SS">
+
+
+ Example:
+
+ <meta name="date" content="2013-02-24 17:50:00">
+
+
+ Filters also have the possibility to "invent" field names. These should
+ also be output as meta tags:
<meta name="somefield" content="Some textual data" />
- See the following section for details about configuring how field data is
- processed by the indexer.
+ You can embed HTML markup inside the content of custom fields, for
+ improving the display inside result lists. In this case, add a (wildly
+ non-standard) markup attribute to tell Recoll that the value is HTML and
+ should not be escaped for display.
+
+ <meta name="somefield" markup="html" content="Some <i>textual</i> data" />
+
+ As written above, the processing of fields is described in a further
+ section.
4.1.5. Page numbers
@@ -2409,8 +2590,8 @@
The field values for documents can appear in several ways during indexing:
either output by filters as meta fields in the HTML header section, or
- added as attributes of the Doc object when using the API, or again
- synthetized internally by Recoll.
+ extracted from file extended attributes, or added as attributes of the Doc
+ object when using the API, or again synthesized internally by Recoll.
The Recoll query language allows searching for text in a specific field.
@@ -2511,234 +2692,237 @@
Recoll versions after 1.11 define a Python programming interface, both for
searching and indexing.
+ The API is inspired by the Python database API specification, version 1.0
+ for Recoll versions up to 1.18, and version 2.0 for Recoll versions 1.19
+ and later. The package structure also changed with 1.19. We will mostly
+ describe the new API and package structure here. A paragraph at the end of
+ this section will explain a few differences and ways to write code
+ compatible with both versions.
+
The Python interface can be found in the source package, under
python/recoll.
- In order to build the module, you should first build or re-build the
- Recoll library using position-independant objects:
-
- cd recoll-xxx/
- configure --enable-pic
- make
-
- There is no significant disadvantage in using PIC objects for the main
- Recoll executables, so you can use the --enable-pic option for the main
- build too.
-
- The python/recoll/ directory contains the usual setup.py script which you
- can then use to build and install the module:
-
- cd recoll-xxx/python/recoll
- python setup.py build
- python setup.py install
-
- 4.3.2.2. Interface manual
-
- NAME
- recoll - This is an interface to the Recoll full text indexer.
-
- FILE
- /usr/local/lib/python2.5/site-packages/recoll.so
-
- CLASSES
- Db
- Doc
- Query
- SearchData
-
- class Db(__builtin__.object)
- | Db([confdir=None], [extra_dbs=None], [writable = False])
- |
- | A Db object holds a connection to a Recoll index. Use the connect()
- | function to create one.
- | confdir specifies a Recoll configuration directory (default:
- | $RECOLL_CONFDIR or ~/.recoll).
- | extra_dbs is a list of external databases (xapian directories)
- | writable decides if we can index new data through this connection
- |
- | Methods defined here:
- |
- |
- | addOrUpdate(...)
- | addOrUpdate(udi, doc, parent_udi=None) -> None
- | Add or update index data for a given document
- | The udi string must define a unique id for the document. It is not
- | interpreted inside Recoll
- | doc is a Doc object
- | if parent_udi is set, this is a unique identifier for the
- | top-level container (ie mbox file)
- |
- | delete(...)
- | delete(udi) -> Bool.
- | Purge index from all data for udi. If udi matches a container
- | document, purge all subdocs (docs with a parent_udi matching udi).
- |
- | makeDocAbstract(...)
- | makeDocAbstract(Doc, Query) -> string
- | Build and return 'keyword-in-context' abstract for document
- | and query.
- |
- | needUpdate(...)
- | needUpdate(udi, sig) -> Bool.
- | Check if the index is up to date for the document defined by udi,
- | having the current signature sig.
- |
- | purge(...)
- | purge() -> Bool.
- | Delete all documents that were not touched during the just finished
- | indexing pass (since open-for-write). These are the documents for
- | the needUpdate() call was not performed, indicating that they no
- | longer exist in the primary storage system.
- |
- | query(...)
- | query() -> Query. Return a new, blank query object for this index.
- |
- | setAbstractParams(...)
- | setAbstractParams(maxchars, contextwords).
- | Set the parameters used to build 'keyword-in-context' abstracts
- |
- | ----------------------------------------------------------------------
- | Data and other attributes defined here:
- |
-
- class Doc(__builtin__.object)
- | Doc()
- |
- | A Doc object contains index data for a given document.
- | The data is extracted from the index when searching, or set by the
- | indexer program when updating. The Doc object has no useful methods but
- | many attributes to be read or set by its user. It matches exactly the
- | Rcl::Doc c++ object. Some of the attributes are predefined, but,
- | especially when indexing, others can be set, the name of which will be
- | processed as field names by the indexing configuration.
- | Inputs can be specified as unicode or strings.
- | Outputs are unicode objects.
- | All dates are specified as unix timestamps, printed as strings
- | Predefined attributes (index/query/both):
- | text (index): document plain text
- | url (both)
- | fbytes (both) optional) file size in bytes
- | filename (both)
- | fmtime (both) optional file modification date. Unix time printed
- | as string
- | dbytes (both) document text bytes
- | dmtime (both) document creation/modification date
- | ipath (both) value private to the app.: internal access path
- | inside file
- | mtype (both) mime type for original document
- | mtime (query) dmtime if set else fmtime
- | origcharset (both) charset the text was converted from
- | size (query) dbytes if set, else fbytes
- | sig (both) app-defined file modification signature.
- | For up to date checks
- | relevancyrating (query)
- | abstract (both)
- | author (both)
- | title (both)
- | keywords (both)
- |
- | Methods defined here:
- |
- |
- | ----------------------------------------------------------------------
- | Data and other attributes defined here:
- |
-
- class Query(__builtin__.object)
- | Recoll Query objects are used to execute index searches.
- | They must be created by the Db.query() method.
- |
- | Methods defined here:
- |
- |
- | execute(...)
- | execute(query_string, stemming=1|0, stemlang="stemming language")
- |
- | Starts a search for query_string, a Recoll search language string
- | (mostly Xesam-compatible).
- | The query can be a simple list of terms (and'ed by default), or more
- | complicated with field specs etc. See the Recoll manual.
- |
- | executesd(...)
- | executesd(SearchData)
- |
- | Starts a search for the query defined by the SearchData object.
- |
- | fetchone(...)
- | fetchone(None) -> Doc
- |
- | Fetches the next Doc object in the current search results.
- |
- | sortby(...)
- | sortby(field=fieldname, ascending=true)
- | Sort results by 'fieldname', in ascending or descending order.
- | Only one field can be used, no subsorts for now.
- | Must be called before executing the search
- |
- | ----------------------------------------------------------------------
- | Data descriptors defined here:
- |
- | next
- | Next index to be fetched from results. Normally increments after
- | each fetchone() call, but can be set/reset before the call effect
- | seeking. Starts at 0
- |
- | ----------------------------------------------------------------------
- | Data and other attributes defined here:
- |
-
- class SearchData(__builtin__.object)
- | SearchData()
- |
- | A SearchData object describes a query. It has a number of global
- | parameters and a chain of search clauses.
- |
- | Methods defined here:
- |
- |
- | addclause(...)
- | addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
- | qstring=string, slack=int, field=string, stemming=1|0,
- | subSearch=SearchData)
- | Adds a simple clause to the SearchData And/Or chain, or a subquery
- | defined by another SearchData object
- |
- | ----------------------------------------------------------------------
- | Data and other attributes defined here:
- |
-
- FUNCTIONS
- connect(...)
- connect([confdir=None], [extra_dbs=None], [writable = False])
- -> Db.
-
- Connects to a Recoll database and returns a Db object.
- confdir specifies a Recoll configuration directory
- (the default is built like for any Recoll program).
- extra_dbs is a list of external databases (xapian directories)
- writable decides if we can index new data through this connection
-
- 4.3.2.3. Example code
+ The python/recoll/ directory contains the usual setup.py. After
+ configuring the main Recoll code, you can use the script to build and
+ install the Python module:
+
+ cd recoll-xxx/python/recoll
+ python setup.py build
+ python setup.py install
+
+
+ 4.3.2.2. Recoll package
+
+ The recoll package contains two modules:
+
+ o The recoll module contains functions and classes used to query (or
+ update) the index.
+
+ o The rclextract module contains functions and classes used to access
+ document data.
+
+ 4.3.2.3. The recoll module
+
+ Functions
+
+ connect(confdir=None, extra_dbs=None, writable = False)
+ The connect() function connects to one or several Recoll index(es)
+ and returns a Db object.
+ o confdir may specify a configuration directory. The usual
+ defaults apply.
+ o extra_dbs is a list of additional indexes (Xapian
+ directories).
+ o writable decides if we can index new data through this
+ connection.
+ This call initializes the recoll module, and it should always be
+ performed before any other call or object creation.
+
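+ A minimal connection sketch (the paths are arbitrary examples):
+
+     db = recoll.connect(confdir="/home/me/.recoll-src",
+                         extra_dbs=["/home/me/otherconfig/xapiandb"])
+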
+ Classes
+
+ The Db class
+
+ A Db object is created by a connect() function and holds a connection to a
+ Recoll index.
+
+ Methods
+
+ Db.close()
+ Closes the connection. You can't do anything with the Db object
+ after this.
+
+ Db.query(), Db.cursor()
+ These aliases return a blank Query object for this index.
+
+ Db.setAbstractParams(maxchars, contextwords)
+ Set the parameters used to build snippets.
+
+ The Query class
+
+ A Query object (equivalent to a cursor in the Python DB API) is created by
+ a Db.query() call. It is used to execute index searches.
+
+ Methods
+
+ Query.sortby(fieldname, ascending=True)
+ Sort results by fieldname, in ascending or descending order. Must
+ be called before executing the search.
+
+ Query.execute(query_string, stemming=1, stemlang="english")
+ Starts a search for query_string, a Recoll search language string.
+
+ Query.executesd(SearchData)
+ Starts a search for the query defined by the SearchData object.
+
+ Query.fetchmany(size=query.arraysize)
+ Fetches the next Doc objects in the current search results, and
+ returns them as an array of the required size, which is by default
+ the value of the arraysize data member.
+
+ Query.fetchone()
+ Fetches the next Doc object from the current search results.
+
+ Query.close()
+ Closes the connection. The object is unusable after the call.
+
+ Query.scroll(value, mode='relative')
+ Adjusts the position in the current result set. mode can be
+ relative or absolute.
+
+ Query.getgroups()
+ Retrieves the expanded query terms as a list of pairs. Meaningful
+ only after executexx. In each pair, the first entry is a list of
+ user terms, the second a list of query terms as derived from the
+ user terms and used in the Xapian Query. The size of each list is
+ one for simple terms, or more for group and phrase clauses.
+
+ Query.getxquery()
+ Return the Xapian query description as a Unicode string.
+ Meaningful only after executexx.
+
+ Query.highlight(text, ishtml = 0, methods = object)
+ Will insert <span class="rclmatch"> and </span> tags around the match
+ areas in the input text and return the modified text. ishtml can
+ be set to indicate that the input text is HTML and that HTML
+ special characters should not be escaped. methods, if set, should be
+ an object with methods startMatch(i) and endMatch() which will be
+ called for each match and should return a begin and end tag.
+
+ Query.makedocabstract(doc, methods = object)
+ Create a snippets abstract for doc (a Doc object) by selecting
+ text around the match terms. If methods is set, will also perform
+ highlighting. See the highlight method.
+
+ Query.__iter__() and Query.next()
+ So that things like for doc in query: will work.
+
+ Data descriptors
+
+ Query.arraysize
+ Default number of records processed by fetchmany (r/w).
+
+ Query.rowcount
+ Number of records returned by the last execute.
+
+ Query.rownumber
+ Next index to be fetched from results. Normally increments after
+ each fetchone() call, but can be set/reset before the call to effect
+ seeking. Starts at 0.
+
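+ A short usage sketch for the Query class (db is a connected Db object;
+ the sort field and search string are arbitrary examples):
+
+     query = db.query()
+     query.sortby("mtime", ascending=False)
+     nres = query.execute("some search terms")
+     for doc in query.fetchmany(5):
+         print doc.url
+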
+ The Doc class
+
+ A Doc object contains index data for a given document. The data is
+ extracted from the index when searching, or set by the indexer program
+ when updating. The Doc object has many attributes to be read or set by its
+ user. It matches exactly the Rcl::Doc C++ object. Some of the attributes
+ are predefined, but, especially when indexing, others can be set, the name
+ of which will be processed as field names by the indexing configuration.
+ Inputs can be specified as Unicode or strings. Outputs are Unicode
+ objects. All dates are specified as Unix timestamps, printed as strings.
+ Please refer to the rcldb/rcldoc.h C++ file for a description of the
+ predefined attributes.
+
+ At query time, only the fields that are defined as stored either by
+ default or in the fields configuration file will be meaningful in the Doc
+ object. In particular, this will not be the case for the document text. See
+ the rclextract module for accessing document contents.
+
+ Methods
+
+ get(key), [] operator
+ Retrieve the named doc attribute.
+
+ getbinurl()
+ Retrieve the URL in byte array format (no transcoding), for use as
+ parameter to a system call.
+
+ items()
+ Return a dictionary of doc object keys/values.
+
+ keys()
+ Return a list of doc object keys (attribute names).
+
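+ For example (a minimal sketch; doc is a Doc object from a query, and the
+ available keys depend on which fields are stored):
+
+     print doc.get("title")
+     print doc["url"]
+     for fieldname in doc.keys():
+         print fieldname
+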
+ The SearchData class
+
+ A SearchData object allows building a query by combining clauses, for
+ execution by Query.executesd(). It can be used in place of the query
+ language approach. The interface is going to change a little, so no
+ detailed doc for now...
+
+ Methods
+
+ addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', qstring=string,
+ slack=0, field='', stemming=1, subSearch=SearchData)
+
+ 4.3.2.4. The rclextract module
+
+ Document content is not provided by an index query. To access it, the data
+ extraction part of the indexing process must be performed (subdocument
+ access and format translation). This is not trivial in general. The
+ rclextract module currently provides a single class which can be used to
+ access the data content for result documents.
+
+ Classes
+
+ The Extractor class
+
+ Methods
+
+ Extractor(doc)
+ An Extractor object is built from a Doc object, output from a
+ query.
+
+ Extractor.textextract(ipath)
+ Extract document defined by ipath and return a Doc object. The
+ doc.text field has the document text as either text/plain or
+ text/html according to doc.mimetype.
+
+ Extractor.idoctofile()
+ Extracts document into an output file, which can be given
+ explicitly or will be created as a temporary file to be deleted by
+ the caller.
+
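+ A minimal usage sketch, assuming that doc is a Doc object obtained from
+ a query as in the example code below:
+
+     from recoll import rclextract
+
+     extractor = rclextract.Extractor(doc)
+     newdoc = extractor.textextract(doc.ipath)
+     print newdoc.text.encode('utf-8')
+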
+ 4.3.2.5. Example code
The following sample would query the index with a user language string.
See the python/samples directory inside the Recoll source for other
- examples.
+ examples. The recollgui subdirectory has a very embryonic GUI which
+ demonstrates the highlighting and data extraction functions.
#!/usr/bin/env python
- import recoll
+ from recoll import recoll
db = recoll.connect()
- db.setAbstractParams(maxchars=80, contextwords=2)
+ db.setAbstractParams(maxchars=80, contextwords=4)
query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
if nres > 5:
nres = 5
- while query.next >= 0 and query.next < nres:
+ for i in range(nres):
doc = query.fetchone()
- print query.next
+ print "Result #%d" % (query.rownumber,)
for k in ("title", "size"):
print k, ":", getattr(doc, k).encode('utf-8')
abs = db.makeDocAbstract(doc, query).encode('utf-8')
@@ -2746,6 +2930,32 @@
print
+
+ 4.3.2.6. Compatibility with the previous version
+
+ The following code fragments can be used to ensure that code can run with
+ both the old and the new API (as long as it does not use the new abilities
+ of the new API of course).
+
+ Adapting to the new package structure:
+
+
+ try:
+ from recoll import recoll
+ from recoll import rclextract
+ hasextract = True
+ except:
+ import recoll
+ hasextract = False
+
+
+ Adapting to the change of nature of the next Query member. The same test
+ can be used to choose between using the scroll() method (new) and setting
+ the next value (old).
+
+
+ rownum = query.next if type(query.next) == int else \
+ query.rownumber
Chapter 5. Installation and configuration
@@ -3359,10 +3569,22 @@
This allows setting fields for all documents under a given
directory. Typical usage would be to set an "rclaptg" field, to be
used in mimeview to select a specific viewer. If several fields
- are to be set, they should be separated with a colon (':')
- character (which there is currently no way to escape). Ie:
- localfields= rclaptg=gnus:other = val, then select specifier
- viewer with mimetype|tag=... in mimeview.
+ are to be set, they should be separated with a semi-colon (';')
+ character, which there is currently no way to escape. Also note
+ the initial semi-colon. Example: localfields= ;rclaptg=gnus;other
+ = val, then select a specific viewer with mimetype|tag=... in
+ mimeview.
+
+ metadatacmds
+
+ This allows executing external commands for each file and storing
+ the output in a Recoll field. This could be used for example to
+ index external tag data. The value is a list of field names and
+ commands; do not forget the initial semi-colon. Example:
+
+ [/some/area/of/the/fs]
+ metadatacmds = ; tags = tmsu tags %f; otherfield = somecmd -xx %f
+
5.4.1.3. Parameters affecting where and how we store things:
@@ -3592,6 +3814,18 @@
# mailmytag field name
x-my-tag = mailmytag
+ 5.4.2.1. Extended attributes in the fields file
+
+ Recoll versions 1.19 and later process user extended file attributes as
+ document fields by default.
+
+ Attributes are processed as fields of the same name, after removing the
+ user prefix on Linux.
+
+ The [xattrtofields] section of the fields file allows specifying
+ translations from extended attribute names to Recoll field names. An
+ empty translation disables use of the corresponding attribute data.
+
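+ For example (the attribute and field names are hypothetical; check the
+ fields file distributed with Recoll for the exact attribute name
+ spelling to use):
+
+     [xattrtofields]
+     # index the "tags" attribute as the "keywords" field
+     tags = keywords
+     # ignore the "checksum" attribute
+     checksum =
+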
5.4.3. The mimemap file
mimemap specifies the file name extension to mime type mappings.
@@ -3699,9 +3933,28 @@
document. This could be used in combination with field customisation to
help with opening the document.
- 5.4.6. Examples of configuration adjustments
-
- 5.4.6.1. Adding an external viewer for an non-indexed type
+ 5.4.6. The ptrans file
+
+ ptrans specifies query-time path translations. These can be useful in
+ multiple cases.
+
+ The file has a section for any index which needs translations, either the
+ main one or additional query indexes. The sections are named with the
+ Xapian index directory names. No slash character should exist at the end
+ of the paths (all comparisons are textual). An exemple should make things
+ sufficiently clear
+
+ [/home/me/.recoll/xapiandb]
+ /this/directory/moved = /to/this/place
+
+ [/path/to/additional/xapiandb]
+ /server/volume1/docdir = /net/server/volume1/docdir
+ /server/volume2/docdir = /net/server/volume2/docdir
+
+
+ 5.4.7. Examples of configuration adjustments
+
+ 5.4.7.1. Adding an external viewer for a non-indexed type
Imagine that you have some kind of file which does not have indexable
content, but for which you would like to have a functional Open link in
@@ -3731,7 +3984,7 @@
configuration, which you do not need to alter. mimeview can also be
modified from the Gui.
- 5.4.6.2. Adding indexing support for a new file type
+ 5.4.7.2. Adding indexing support for a new file type
Let us now imagine that the above .blob files actually contain indexable
text and that you know how to extract it with a command line program.