Diff of /src/README [156f2a] .. [ad4f24]

--- a/src/README
+++ b/src/README
@@ -102,7 +102,7 @@
 
                              6.1.1. Filter HTML output
 
-                6.2. Field data processing configuration
+                6.2. Field data processing
 
                 6.3. API
 
@@ -132,13 +132,15 @@
 
                              7.4.1. Main configuration file
 
-                             7.4.2. The mimemap file
-
-                             7.4.3. The mimeconf file
-
-                             7.4.4. The mimeview file
-
-                             7.4.5. Examples of configuration adjustments
+                             7.4.2. The fields file
+
+                             7.4.3. The mimemap file
+
+                             7.4.4. The mimeconf file
+
+                             7.4.5. The mimeview file
+
+                             7.4.6. Examples of configuration adjustments
 
                 7.5. The KDE Kicker Recoll applet
 
@@ -867,6 +869,32 @@
        dir:/home/me/somedir). Please note that this is quite inefficient,
        that it may produce very slow searches, and that it may be worth in
        some cases to set up separate databases instead.
+
+     * date for searching or filtering on dates. The syntax for the argument
+       is based on the ISO8601 standard for dates and time intervals. Only
+       dates are supported, no times. The general syntax is 2 elements
+       separated by a / character. Each element can be a date or a period of
+       time. Periods are specified as PnYnMnD. The n numbers are the
+       respective numbers of years, months or days, any of which may be
+       missing. Dates are specified as YYYY-MM-DD. The days and months parts
+       may be missing. If the / is present but an element is missing, the
+       missing element is interpreted as the lowest or highest date in the
+       index. Examples:
+
+          * 2001-03-01/2002-05-01 the basic syntax for an interval of dates.
+
+          * 2001-03-01/P1Y2M the same specified with a period.
+
+          * 2001/ from the beginning of 2001 to the latest date in the index.
+
+          * 2001 the whole year of 2001
+
+          * P2D/ means 2 days ago up to now if there are no documents with
+            dates in the future.
+
+          * /2003 all documents from 2003 or older.
+
+       Periods can also be specified with lowercase letters (e.g. p2y).
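
The element syntax described above can be sketched in Python as follows. This is only an illustration and not Recoll's own parser: the function names and the tuple layout are hypothetical, and the sketch only classifies each element as a date or a period, without resolving missing elements against the index.

```python
import re

# PnYnMnD, where any of the three parts may be missing; lowercase is accepted.
PERIOD_RE = re.compile(r"^[Pp](?:(\d+)[Yy])?(?:(\d+)[Mm])?(?:(\d+)[Dd])?$")
# YYYY-MM-DD, where the day and month parts may be missing.
DATE_RE = re.compile(r"^(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?$")


def parse_element(text):
    """Return None for a missing element, else a ("date"|"period", y, m, d) tuple."""
    if text == "":
        return None
    m = PERIOD_RE.match(text)
    if m and any(m.groups()):
        years, months, days = (int(g) if g else 0 for g in m.groups())
        return ("period", years, months, days)
    m = DATE_RE.match(text)
    if m:
        year, month, day = (int(g) if g else None for g in m.groups())
        return ("date", year, month, day)
    raise ValueError("not a date or period: %r" % text)


def parse_date_clause(arg):
    """Split the argument of a date clause into its one or two elements."""
    if "/" not in arg:
        return (parse_element(arg),)
    first, second = arg.split("/", 1)
    return (parse_element(first), parse_element(second))
```

For instance, parse_date_clause("2001/") yields a year-only date followed by None, which a caller would then replace by the highest date in the index.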
 
      * mime or format for specifying the mime type. This one is quite special
        because you can specify several values which will be OR'ed (the normal
@@ -1156,6 +1184,10 @@
    Wildcards. Wildcards can be used inside search terms in all forms of
    searches. More about wildcards.
 
+   Automatic suffixes. Words like odt or ods can be automatically turned into
+   query language ext:xxx clauses. This can be enabled in the Search
+   preferences panel in the GUI.
+
    Disabling stem expansion. Entering a capitalized word in any search field
    will prevent stem expansion (no search for gardening if you enter Garden
    instead of garden). This is the only case where character case should make
@@ -1321,14 +1353,15 @@
        the search terms. This can slow down result list display significantly
        for big documents, and you may want to turn it off.
 
-     * Replace abstracts from documents: this decides if we should synthesize
-       and display an abstract in place of an explicit abstract found within
-       the document itself.
-
      * Synthetic abstract size: adjust to taste...
 
      * Synthetic abstract context words: how many words should be displayed
        around each term occurrence.
+
+     * Query language magic file name suffixes: a list of words which
+       automatically get turned into ext:xxx file name suffix clauses when
+       starting a query language query (ie: doc xls xlsx...). This will save
+       some typing for people who use file types a lot when querying.
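
As a rough illustration of the suffix rewriting described above, here is a minimal Python sketch. It is not Recoll's implementation: the function name is hypothetical, and it assumes that every whitespace-separated word found in the configured list is rewritten, wherever it appears in the query.

```python
def expand_magic_suffixes(query, suffixes):
    """Rewrite words found in the suffix list as ext:xxx clauses."""
    wanted = set(suffixes)
    words = []
    for word in query.split():
        # Words from the configured list become file name suffix clauses;
        # all other words are passed through unchanged.
        words.append("ext:" + word if word in wanted else word)
    return " ".join(words)


print(expand_magic_suffixes("budget xls", ["doc", "xls", "xlsx"]))
```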
 
    External indexes: This panel will let you browse for additional indexes
    that you may want to search. External indexes are designated by their
@@ -1650,7 +1683,7 @@
 
      ----------------------------------------------------------------------
 
-6.2. Field data processing configuration
+6.2. Field data processing
 
    Fields are named pieces of information in or about documents, like title,
    author, abstract.
@@ -1675,15 +1708,11 @@
        for the document, and can be returned and displayed with search
        results.
 
-   A field can be either or both indexed and stored.
-
-   A field becomes indexed by having a prefix defined in the [prefixes]
-   section of the fields file. See the comments in there for details
-
-   A field becomes stored by appearing in the [stored] section of the fields
-   file.
-
-   See the comments inside the fields for more details.
+   A field can be indexed, stored, or both. This and other aspects of
+   field handling are defined inside the fields configuration file.
+
+   You can find more information in the section about the fields file, or in
+   comments inside the file.
 
      ----------------------------------------------------------------------
 
@@ -2041,7 +2070,13 @@
    displayed from the recoll File menu. The list is stored in the missing
    text file inside the configuration directory.
 
-   A list of common file types which need external commands:
+   A list of common file types which need external commands follows. Many of
+   the filters need the iconv command, which is not always listed as a
+   dependency.
+
+   As of Recoll release 1.14, a number of XML-based formats that were handled
+   by ad hoc filter code now use xsltproc, which usually comes with libxslt.
+   These are: abiword, fb2 (ebooks), kword, openoffice, svg.
 
      * Openoffice: supported natively, but needs the unzip command to be
        installed.
@@ -2053,6 +2088,8 @@
      * MS Word: antiword.
 
      * MS Excel and PowerPoint: catdoc.
+
+     * MS Open XML (docx): needs xsltproc.
 
      * Wordperfect files: libwpd.
 
@@ -2067,13 +2104,12 @@
 
      * djvu: DjVuLibre
 
-     * mp3: Recoll will use the id3info command from the id3lib package to
-       extract tag information. Without it, only the file names will be
-       indexed.
-
-     * flac files need metaflac.
-
-     * ogg files need ogginfo.
+     * mp3, flac, ogg vorbis: Recoll releases before 1.13 use the id3info
+       command from the id3lib package to extract mp3 tag information,
+       metaflac (standard flac tools) for flac files, and ogginfo (vorbis
+       tools) for ogg files. (Some gcc versions after 4.4 may have trouble
+       compiling id3lib; a workaround is available.) Releases 1.14 and later
+       use a single Python filter based on mutagen for all audio file types.
 
      * Pictures: Recoll uses the Exiftool Perl package to extract tag
        information. Most image file formats are supported. Note that there
@@ -2084,12 +2120,14 @@
      * chm: files in microsoft help format need Python and the pychm module
        (which needs chmlib).
 
-     * ics: iCalendar files need Python and the icalendar module.
+     * ics: up to Recoll 1.13, iCalendar files need Python and the icalendar
+       module. For newer versions, icalendar is not needed.
 
      * zip: Zip archives need Python (and the standard zipfile module).
 
    Text, HTML, mail folders, Openoffice and Scribus files are processed
-   internally. Lyx is used to index Lyx files. Many filters need sed and awk.
+   internally. Lyx is used to index Lyx files. Many filters need iconv and
+   the standard sed and awk.
 
      ----------------------------------------------------------------------
 
@@ -2097,11 +2135,18 @@
 
   7.3.1. Prerequisites
 
-   At the very least, you will need to download and install the xapian core
-   package and the qt run-time and development packages. Check the Recoll
-   download page for up to date version information.
-
-   You will most probably be able to find a binary package for qt for your
+   C++ compiler. Up to Recoll version 1.13.04, its absence can manifest
+   itself by strange messages about a missing iconv_open.
+
+   Development files for Xapian core.
+
+   Development files for Qt.
+
+   Development files for X11 and zlib.
+
+   Check the Recoll download page for up to date version information.
+
+   You will most probably be able to find a binary package for Qt for your
    system. You may have to compile Xapian but this is not difficult (if you
    are using FreeBSD, there is a port).
 
@@ -2113,7 +2158,7 @@
 
   7.3.2. Building
 
-   Recoll has been built on Linux, FreeBSD, macosx, and Solaris, most
+   Recoll has been built on Linux, FreeBSD, Mac OS X, and Solaris, most
    versions after 2005 should be ok, maybe some older ones too (Solaris 8 is
    ok). If you build on another system, and need to modify things, I would
    very much welcome patches.
@@ -2282,14 +2327,20 @@
    and edit the configuration file before restarting the command. This will
    start the initial indexing, which may take some time.
 
-   Paramers affecting what we index:
+   Most of the following parameters can be changed from the Index
+   Configuration menu in the recoll interface. Some can only be set by
+   editing the configuration file.
+
+     ----------------------------------------------------------------------
+
+    7.4.1.1. Parameters affecting what documents we index:
 
    topdirs
 
            Specifies the list of directories or files to index (recursively
-           for directories). The indexer will not follow symbolic links
-           inside the indexed trees by default (see the followLinks options
-           though).
+           for directories). You can use symbolic links as elements of this
+           list. See the followLinks option about following symbolic links
+           found under the top elements (not followed by default).
 
    skippedNames
 
@@ -2403,67 +2454,39 @@
            Beagle plugin as ~/.beagle/ToIndex so there should be no need to
            change it.
 
-   Parameters affecting where and how we store things:
-
-   dbdir
-
-           The name of the Xapian data directory. It will be created if
-           needed when the index is initialized. If this is not an absolute
-           path, it will be interpreted relative to the configuration
-           directory. The value can have embedded spaces but starting or
-           trailing spaces will be trimmed. You cannot use quotes here.
-
-   maxfsoccuppc
-
-           Maximum file system occupation before we stop indexing. The value
-           is a percentage, corresponding to what the "Capacity" df output
-           column shows. The default value is 0, meaning no checking.
-
-   mboxcachedir
-
-           The directory where mbox message offsets cache files are held.
-           This is normally $RECOLL_CONFDIR/mboxcache, but it may be useful
-           to share a directory between different configurations.
-
-   mboxcacheminmbs
-
-           The minimum mbox file size over which we cache the offsets. There
-           is really no sense in caching offsets for small files. The default
-           is 5 MB.
-
-   webcachedir
-
-           This is only used by the Beagle web browser plugin indexing code,
-           and defines where the cache for visited pages will live. Default:
-           $RECOLL_CONFDIR/webcache
-
-   webcachemaxmbs
-
-           This is only used by the Beagle web browser plugin indexing code,
-           and defines the maximum size for the web page cache. Default: 40
-           MB.
-
-   idxflushmb
-
-           Threshold (megabytes of new text data) where we flush from memory
-           to disk index. Setting this can help control memory usage. A value
-           of 0 means no explicit flushing, letting Xapian use its own
-           default, which is flushing every 10000 documents (memory usage
-           depends on average document size). The default value is 10.
-
-   Miscellani:
-
-   loglevel,daemloglevel
-
-           Verbosity level for recoll and recollindex. A value of 4 lists
-           quite a lot of debug/information messages. 2 only lists errors.
-           The daemversion is specific to the indexing monitor daemon.
-
-   logfilename, daemlogfilename
-
-           Where the messages should go. 'stderr' can be used as a special
-           value, and is the default. The daemversion is specific to the
-           indexing monitor daemon.
+     ----------------------------------------------------------------------
+
+    7.4.1.2. Parameters affecting how we generate terms:
+
+   Changing some of these parameters will imply a full reindex. Also, when
+   using multiple indexes, it may not make sense to search indexes that don't
+   share the values for these parameters, because they usually affect both
+   search and index operations.
+
+   nonumbers
+
+           If this is set to true, no terms will be generated for numbers. For
+           example, "123", "1.5e6" and "192.168.1.4" would not be indexed
+           ("value123" would still be). Numbers are often quite interesting
+           to search for, and this should probably not be set except for
+           special situations, e.g. scientific documents with huge amounts of
+           numbers in them. This can only be set for a whole index, not for a
+           subtree.
+
+   nocjk
+
+           If this is set to true, specific East Asian (Chinese, Korean,
+           Japanese) character/word splitting is turned off. This will save a
+           small amount of cpu if you have no CJK documents. If your document
+           base does include such text but you are not interested in searching
+           it, setting nocjk may be a significant time and space saver.
+
+   cjkngramlen
+
+           This lets you adjust the size of n-grams used for indexing CJK
+           text. The default value of 2 is probably appropriate in most
+           cases. A value of 3 would allow more precision and efficiency on
+           longer words, but the index will be approximately twice as large.
 
    indexstemminglanguages
 
@@ -2482,11 +2505,6 @@
            character set used is the one defined by the nls environment
            (LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
 
-   filtermaxseconds
-
-           Maximum filter execution time, after which it is aborted. Some
-           postscript programs just loop...
-
    maildefcharset
 
            This can be used to define the default character set specifically
@@ -2498,10 +2516,81 @@
            This allows setting fields for all documents under a given
            directory. Typical usage would be to set an "rclaptg" field, to be
            used in mimeview to select a specific viewer. If several fields
-           are to be set, they should be separated with a ':' character
-           (which there is currently no way to escape). Ie: localfields=
-           rclaptg=gnus:other = val, then select specifier viewer with
-           mimetype|tag=... in mimeview.
+           are to be set, they should be separated with a colon (':')
+           character (which there is currently no way to escape). Ie:
+           localfields= rclaptg=gnus:other = val, then select the specific
+           viewer with mimetype|tag=... in mimeview.
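
The localfields value format described above (colon-separated name = value pairs, with no way to escape the colon) can be illustrated with a short Python sketch. The function name is hypothetical and this is not Recoll's own parsing code. It assumes that whitespace around names and values is not significant.

```python
def parse_localfields(value):
    """Split a localfields value into a dict of field names to values."""
    fields = {}
    for part in value.split(":"):
        if not part.strip():
            continue
        # Split on the first '=' only, so a value may itself contain '='.
        name, _, val = part.partition("=")
        fields[name.strip()] = val.strip()
    return fields


print(parse_localfields("rclaptg=gnus:other = val"))
```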
+
+     ----------------------------------------------------------------------
+
+    7.4.1.3. Parameters affecting where and how we store things:
+
+   dbdir
+
+           The name of the Xapian data directory. It will be created if
+           needed when the index is initialized. If this is not an absolute
+           path, it will be interpreted relative to the configuration
+           directory. The value can have embedded spaces but starting or
+           trailing spaces will be trimmed. You cannot use quotes here.
+
+   maxfsoccuppc
+
+           Maximum file system occupation before we stop indexing. The value
+           is a percentage, corresponding to what the "Capacity" df output
+           column shows. The default value is 0, meaning no checking.
+
+   mboxcachedir
+
+           The directory where mbox message offsets cache files are held.
+           This is normally $RECOLL_CONFDIR/mboxcache, but it may be useful
+           to share a directory between different configurations.
+
+   mboxcacheminmbs
+
+           The minimum mbox file size over which we cache the offsets. There
+           is really no sense in caching offsets for small files. The default
+           is 5 MB.
+
+   webcachedir
+
+           This is only used by the Beagle web browser plugin indexing code,
+           and defines where the cache for visited pages will live. Default:
+           $RECOLL_CONFDIR/webcache
+
+   webcachemaxmbs
+
+           This is only used by the Beagle web browser plugin indexing code,
+           and defines the maximum size for the web page cache. Default: 40
+           MB.
+
+   idxflushmb
+
+           Threshold (megabytes of new text data) where we flush from memory
+           to disk index. Setting this can help control memory usage. A value
+           of 0 means no explicit flushing, letting Xapian use its own
+           default, which is flushing every 10000 documents (memory usage
+           depends on average document size). The default value is 10.
+
+     ----------------------------------------------------------------------
+
+    7.4.1.4. Miscellaneous parameters:
+
+   loglevel, daemloglevel
+
+           Verbosity level for recoll and recollindex. A value of 4 lists
+           quite a lot of debug/information messages. 2 only lists errors.
+           The daem version is specific to the indexing monitor daemon.
+
+   logfilename, daemlogfilename
+
+           Where the messages should go. 'stderr' can be used as a special
+           value, and is the default. The daem version is specific to the
+           indexing monitor daemon.
+
+   filtermaxseconds
+
+           Maximum filter execution time, after which it is aborted. Some
+           postscript programs just loop...
 
    filtersdir
 
@@ -2542,21 +2631,6 @@
            Useful for cases where you don't need the functionality or when it
            is unusable because aspell crashes during dictionary generation.
 
-   nocjk
-
-           If this set to true, specific east asian (Chinese Korean Japanese)
-           characters/word splitting is turned off. This will save a small
-           amount of cpu if you have no CJK documents. If your document base
-           does include such text but you are not interested in searching it,
-           setting nocjk may be a significant time and space saver.
-
-   cjkngramlen
-
-           This lets you adjust the size of n-grams used for indexing CJK
-           text. The default value of 2 is probably appropriate in most
-           cases. A value of 3 would allow more precision and efficiency on
-           longer words, but the index will be approximately twice as large.
-
    guesscharset
 
            Decide if we try to guess the character set of files if no
@@ -2565,7 +2639,69 @@
 
      ----------------------------------------------------------------------
 
-  7.4.2. The mimemap file
+  7.4.2. The fields file
+
+   This file contains information about dynamic fields handling in Recoll.
+   Some very basic fields have hard-wired behaviour, and, mostly, you should
+   not change the original data inside the fields file. But you can create
+   custom fields fitting your data and handle them just like they were native
+   ones.
+
+   The fields file has several sections, which each define an aspect of
+   fields processing. Quite often, you'll have to modify several sections to
+   obtain the desired behaviour.
+
+   We will only give a short description here; you should refer to the
+   comments inside the file for more detailed information.
+
+   Field names should be lowercase alphabetic ASCII.
+
+   [prefixes]
+
+           A field becomes indexed (searchable) by having a prefix defined in
+           this section.
+
+   [stored]
+
+           A field becomes stored (displayable inside results) by having its
+           name listed in this section (typically with an empty value).
+
+   [aliases]
+
+           This section defines lists of synonyms for the canonical names
+           used inside the [prefixes] and [stored] sections.
+
+   filter-specific sections
+
+           Some filters may need specific configuration for handling fields.
+           Only the mail message filter currently has such a section (named
+           [mail]). It allows indexing arbitrary mail headers in addition to
+           the ones indexed by default. Other such sections may appear in the
+           future.
+
+   Here follows a small example of a personal fields file. This would extract
+   a specific mail header and use it as a searchable field, with data
+   displayable inside result lists. (Side note: as the mail filter does no
+   decoding on the values, only plain ascii headers can be indexed, and only
+   the first occurrence will be used for headers that occur several times).
+
+ [prefixes]
+ # Index mailmytag contents (with the given prefix)
+ mailmytag = XMTAG
+
+ [stored]
+ # Store mailmytag inside the document data record (so that it can be
+ # displayed - as %(mailmytag) - in result lists).
+ mailmytag =
+
+ [mail]
+ # Extract the X-My-Tag mail header, and use it internally with the
+ # mailmytag field name
+ x-my-tag = mailmytag
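
To see how the three sections of this example fit together, the sketch below reads the same text with Python's standard configparser module. This is only an illustration: Recoll uses its own configuration parser, and the variable names are hypothetical.

```python
import configparser

FIELDS_TEXT = """
[prefixes]
# Index mailmytag contents (with the given prefix)
mailmytag = XMTAG

[stored]
# Store mailmytag inside the document data record
mailmytag =

[mail]
# Extract the X-My-Tag mail header as the mailmytag field
x-my-tag = mailmytag
"""

# Interpolation is disabled so that values are taken literally.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(FIELDS_TEXT)

indexed = dict(parser["prefixes"])    # field name -> index prefix
stored = list(parser["stored"])       # names of the stored fields
mail_headers = dict(parser["mail"])   # mail header -> field name

for header, field in mail_headers.items():
    print(header, "->", field,
          "| indexed:", field in indexed,
          "| stored:", field in stored)
```

The mailmytag field appears in all three sections, which is why the header is both searchable and displayable in result lists.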
+
+     ----------------------------------------------------------------------
+
+  7.4.3. The mimemap file
 
    mimemap specifies the file name extension to mime type mappings.
 
@@ -2591,7 +2727,7 @@
 
      ----------------------------------------------------------------------
 
-  7.4.3. The mimeconf file
+  7.4.4. The mimeconf file
 
    mimeconf specifies how the different mime types are handled for indexing,
    and which icons are displayed in the recoll result lists.
@@ -2605,7 +2741,7 @@
 
      ----------------------------------------------------------------------
 
-  7.4.4. The mimeview file
+  7.4.5. The mimeview file
 
    mimeview specifies which programs are started when you click on an Edit
    link in a result list. Ie: HTML is normally displayed using firefox, but
@@ -2633,9 +2769,9 @@
 
      ----------------------------------------------------------------------
 
-  7.4.5. Examples of configuration adjustments
-
-    7.4.5.1. Adding an external viewer for an non-indexed type
+  7.4.6. Examples of configuration adjustments
+
+    7.4.6.1. Adding an external viewer for a non-indexed type
 
    Imagine that you have some kind of file which does not have indexable
    content, but for which you would like to have a functional Edit link in
@@ -2667,7 +2803,7 @@
 
      ----------------------------------------------------------------------
 
-    7.4.5.2. Adding indexing support for a new file type
+    7.4.6.2. Adding indexing support for a new file type
 
    Let us now imagine that the above .blob files actually contain indexable
    text and that you know how to extract it with a command line program.