
Diff of /src/INSTALL [156f2a] .. [ad4f24]


--- a/src/INSTALL
+++ b/src/INSTALL
@@ -91,7 +91,13 @@
    displayed from the recoll File menu. The list is stored in the missing
    text file inside the configuration directory.
 
-   A list of common file types which need external commands:
+   A list of common file types which need external commands follows. Many of
+   the filters need the iconv command, which is not always listed as a
+   dependency.
+
+   As of Recoll release 1.14, a number of XML-based formats that were handled
+   by ad hoc filter code now use xsltproc, which usually comes with libxslt.
+   These are: abiword, fb2 (ebooks), kword, openoffice, svg.
 
      * Openoffice: supported natively, but needs the unzip command to be
        installed.
@@ -103,6 +109,8 @@
      * MS Word: antiword.
 
      * MS Excel and PowerPoint: catdoc.
+
+     * MS Open XML (docx): needs xsltproc.
 
      * Wordperfect files: libwpd.
 
@@ -117,13 +125,12 @@
 
      * djvu: DjVuLibre
 
-     * mp3: Recoll will use the id3info command from the id3lib package to
-       extract tag information. Without it, only the file names will be
-       indexed.
-
-     * flac files need metaflac.
-
-     * ogg files need ogginfo.
+     * mp3, flac, ogg vorbis: Recoll releases before 1.13 use the id3info
+       command from the id3lib package to extract mp3 tag information,
+       metaflac (standard flac tools) for flac files, and ogginfo (vorbis
+       tools) for ogg files. (Some gcc versions after 4.4 may have trouble
+       compiling id3lib; a workaround exists.) Releases 1.14 and later use
+       a single Python filter based on mutagen for all audio file types.
 
      * Pictures: Recoll uses the Exiftool Perl package to extract tag
        information. Most image file formats are supported. Note that there
@@ -134,12 +141,14 @@
      * chm: files in microsoft help format need Python and the pychm module
        (which needs chmlib).
 
-     * ics: iCalendar files need Python and the icalendar module.
+     * ics: up to Recoll 1.13, iCalendar files need Python and the icalendar
+       module. For newer versions, icalendar is not needed.
 
      * zip: Zip archives need Python (and the standard zipfile module).
 
    Text, HTML, mail folders, Openoffice and Scribus files are processed
-   internally. Lyx is used to index Lyx files. Many filters need sed and awk.
+   internally. Lyx is used to index Lyx files. Many filters need iconv and
+   the standard sed and awk.
 
    --------------------------------------------------------------------------
 
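The list above names several helper commands. A minimal sketch for checking which of them are installed follows. It assumes only that the helpers are looked up on PATH; the HELPERS list and the missing_helpers function are hypothetical names, not part of Recoll.

```python
import shutil

# Helper commands named in the list above; extend as needed.
HELPERS = ["iconv", "xsltproc", "unzip", "antiword", "catdoc", "sed", "awk"]


def missing_helpers(commands, which=shutil.which):
    """Return the commands from `commands` that are not found on PATH."""
    return [cmd for cmd in commands if which(cmd) is None]


if __name__ == "__main__":
    missing = missing_helpers(HELPERS)
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All listed helper commands were found.")
```

The which argument exists only so that the lookup can be replaced in tests. Python modules such as mutagen or pychm are not commands and would need a separate import check.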
@@ -159,11 +168,18 @@
 
 7.3.1. Prerequisites
 
-   At the very least, you will need to download and install the xapian core
-   package and the qt run-time and development packages. Check the Recoll
-   download page for up to date version information.
-
-   You will most probably be able to find a binary package for qt for your
+   C++ compiler. Up to Recoll version 1.13.04, its absence can manifest
+   itself by strange messages about a missing iconv_open.
+
+   Development files for Xapian core
+
+   Development files for Qt.
+
+   Development files for X11 and zlib.
+
+   Check the Recoll download page for up to date version information.
+
+   You will most probably be able to find a binary package for Qt for your
    system. You may have to compile Xapian but this is not difficult (if you
    are using FreeBSD, there is a port).
 
@@ -173,7 +189,7 @@
 
 7.3.2. Building
 
-   Recoll has been built on Linux, FreeBSD, macosx, and Solaris, most
+   Recoll has been built on Linux, FreeBSD, Mac OS X, and Solaris; most
    versions after 2005 should be ok, maybe some older ones too (Solaris 8 is
    ok). If you build on another system, and need to modify things, I would
    very much welcome patches.
@@ -350,14 +366,18 @@
    and edit the configuration file before restarting the command. This will
    start the initial indexing, which may take some time.
 
-   Paramers affecting what we index:
+   Most of the following parameters can be changed from the Index
+   Configuration menu in the recoll interface. Some can only be set by
+   editing the configuration file.
+
+  7.4.1.1. Parameters affecting what documents we index:
 
    topdirs
 
            Specifies the list of directories or files to index (recursively
-           for directories). The indexer will not follow symbolic links
-           inside the indexed trees by default (see the followLinks options
-           though).
+           for directories). You can use symbolic links as elements of this
+           list. See the followLinks option about following symbolic links
+           found under the top elements (not followed by default).
 
    skippedNames
 
@@ -471,67 +491,37 @@
            Beagle plugin as ~/.beagle/ToIndex so there should be no need to
            change it.
 
-   Parameters affecting where and how we store things:
-
-   dbdir
-
-           The name of the Xapian data directory. It will be created if
-           needed when the index is initialized. If this is not an absolute
-           path, it will be interpreted relative to the configuration
-           directory. The value can have embedded spaces but starting or
-           trailing spaces will be trimmed. You cannot use quotes here.
-
-   maxfsoccuppc
-
-           Maximum file system occupation before we stop indexing. The value
-           is a percentage, corresponding to what the "Capacity" df output
-           column shows. The default value is 0, meaning no checking.
-
-   mboxcachedir
-
-           The directory where mbox message offsets cache files are held.
-           This is normally $RECOLL_CONFDIR/mboxcache, but it may be useful
-           to share a directory between different configurations.
-
-   mboxcacheminmbs
-
-           The minimum mbox file size over which we cache the offsets. There
-           is really no sense in caching offsets for small files. The default
-           is 5 MB.
-
-   webcachedir
-
-           This is only used by the Beagle web browser plugin indexing code,
-           and defines where the cache for visited pages will live. Default:
-           $RECOLL_CONFDIR/webcache
-
-   webcachemaxmbs
-
-           This is only used by the Beagle web browser plugin indexing code,
-           and defines the maximum size for the web page cache. Default: 40
-           MB.
-
-   idxflushmb
-
-           Threshold (megabytes of new text data) where we flush from memory
-           to disk index. Setting this can help control memory usage. A value
-           of 0 means no explicit flushing, letting Xapian use its own
-           default, which is flushing every 10000 documents (memory usage
-           depends on average document size). The default value is 10.
-
-   Miscellani:
-
-   loglevel,daemloglevel
-
-           Verbosity level for recoll and recollindex. A value of 4 lists
-           quite a lot of debug/information messages. 2 only lists errors.
-           The daemversion is specific to the indexing monitor daemon.
-
-   logfilename, daemlogfilename
-
-           Where the messages should go. 'stderr' can be used as a special
-           value, and is the default. The daemversion is specific to the
-           indexing monitor daemon.
+  7.4.1.2. Parameters affecting how we generate terms:
+
+   Changing some of these parameters will imply a full reindex. Also, when
+   using multiple indexes, it may not make sense to search indexes that don't
+   share the values for these parameters, because they usually affect both
+   search and index operations.
+
+   nonumbers
+
+           If this is set to true, no terms will be generated for numbers. For
+           example "123", "1.5e6", or "192.168.1.4" would not be indexed
+           ("value123" would still be). Numbers are often quite interesting
+           to search for, and this should probably not be set except for
+           special situations, e.g. scientific documents with huge amounts of
+           numbers in them. This can only be set for a whole index, not for a
+           subtree.
+
+   nocjk
+
+           If this is set to true, specific East Asian (Chinese, Korean,
+           Japanese) character/word splitting is turned off. This will save a
+           small amount of CPU if you have no CJK documents. If your document
+           base does include such text but you are not interested in searching
+           it, setting nocjk may be a significant time and space saver.
+
+   cjkngramlen
+
+           This lets you adjust the size of n-grams used for indexing CJK
+           text. The default value of 2 is probably appropriate in most
+           cases. A value of 3 would allow more precision and efficiency on
+           longer words, but the index will be approximately twice as large.
 
    indexstemminglanguages
 
@@ -550,11 +540,6 @@
            character set used is the one defined by the nls environment
            (LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
 
-   filtermaxseconds
-
-           Maximum filter execution time, after which it is aborted. Some
-           postscript programs just loop...
-
    maildefcharset
 
            This can be used to define the default character set specifically
@@ -566,10 +551,77 @@
            This allows setting fields for all documents under a given
            directory. Typical usage would be to set an "rclaptg" field, to be
            used in mimeview to select a specific viewer. If several fields
-           are to be set, they should be separated with a ':' character
-           (which there is currently no way to escape). Ie: localfields=
-           rclaptg=gnus:other = val, then select specifier viewer with
-           mimetype|tag=... in mimeview.
+           are to be set, they should be separated with a colon (':')
+           character (which there is currently no way to escape). Example:
+           localfields= rclaptg=gnus:other = val, then select a specific
+           viewer with mimetype|tag=... in mimeview.
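To illustrate the localfields value format described above, here is a small sketch that splits such a value into name/value pairs. It is not Recoll's own parser: parse_localfields is a hypothetical helper, and trimming the spaces around names and values is an assumption.

```python
def parse_localfields(value):
    """Split a localfields value into a {name: value} dict.

    Fields are separated by ':' (with no escaping, as the text notes) and
    each field is a 'name = value' pair. Trimming the surrounding spaces is
    an assumption about how Recoll treats them.
    """
    fields = {}
    for part in value.split(":"):
        if not part.strip():
            continue  # ignore empty segments such as a trailing ':'
        name, sep, val = part.partition("=")
        if not sep:
            raise ValueError(f"missing '=' in field: {part!r}")
        fields[name.strip()] = val.strip()
    return fields


if __name__ == "__main__":
    print(parse_localfields("rclaptg=gnus:other = val"))
```

Because ':' cannot be escaped, a value that itself contains a colon would be split in the wrong place, which is the limitation the text mentions.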
+
+  7.4.1.3. Parameters affecting where and how we store things:
+
+   dbdir
+
+           The name of the Xapian data directory. It will be created if
+           needed when the index is initialized. If this is not an absolute
+           path, it will be interpreted relative to the configuration
+           directory. The value can have embedded spaces but starting or
+           trailing spaces will be trimmed. You cannot use quotes here.
+
+   maxfsoccuppc
+
+           Maximum file system occupation before we stop indexing. The value
+           is a percentage, corresponding to what the "Capacity" df output
+           column shows. The default value is 0, meaning no checking.
+
+   mboxcachedir
+
+           The directory where mbox message offsets cache files are held.
+           This is normally $RECOLL_CONFDIR/mboxcache, but it may be useful
+           to share a directory between different configurations.
+
+   mboxcacheminmbs
+
+           The minimum mbox file size over which we cache the offsets. There
+           is really no sense in caching offsets for small files. The default
+           is 5 MB.
+
+   webcachedir
+
+           This is only used by the Beagle web browser plugin indexing code,
+           and defines where the cache for visited pages will live. Default:
+           $RECOLL_CONFDIR/webcache
+
+   webcachemaxmbs
+
+           This is only used by the Beagle web browser plugin indexing code,
+           and defines the maximum size for the web page cache. Default: 40
+           MB.
+
+   idxflushmb
+
+           Threshold (megabytes of new text data) where we flush from memory
+           to disk index. Setting this can help control memory usage. A value
+           of 0 means no explicit flushing, letting Xapian use its own
+           default, which is flushing every 10000 documents (memory usage
+           depends on average document size). The default value is 10.
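The maxfsoccuppc check described above can be pictured with the following sketch. It assumes that the percentage is computed as used / (used + available), which is how the df "Capacity" column is usually derived (df also rounds up), and that indexing stops once the limit is reached. Both function names are hypothetical, and Recoll's exact comparison may differ.

```python
import shutil


def occupancy_percent(used, available):
    """Percentage of used space, computed like the df "Capacity" column."""
    total = used + available
    return 100.0 * used / total if total else 0.0


def should_stop_indexing(percent_used, maxfsoccuppc):
    """True when the limit is active (non-zero) and has been reached.

    A value of 0 means no checking, as stated for maxfsoccuppc above.
    Using >= rather than > is an assumption.
    """
    return maxfsoccuppc > 0 and percent_used >= maxfsoccuppc


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    pct = occupancy_percent(usage.used, usage.free)
    print(f"{pct:.1f}% used; stop at 90? {should_stop_indexing(pct, 90)}")
```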
+
+  7.4.1.4. Miscellaneous parameters:
+
+   loglevel, daemloglevel
+
+           Verbosity level for recoll and recollindex. A value of 4 lists
+           quite a lot of debug/information messages. 2 only lists errors.
+           The daem version is specific to the indexing monitor daemon.
+
+   logfilename, daemlogfilename
+
+           Where the messages should go. 'stderr' can be used as a special
+           value, and is the default. The daem version is specific to the
+           indexing monitor daemon.
+
+   filtermaxseconds
+
+           Maximum filter execution time, after which it is aborted. Some
+           postscript programs just loop...
 
    filtersdir
 
@@ -610,28 +662,73 @@
            Useful for cases where you don't need the functionality or when it
            is unusable because aspell crashes during dictionary generation.
 
-   nocjk
-
-           If this set to true, specific east asian (Chinese Korean Japanese)
-           characters/word splitting is turned off. This will save a small
-           amount of cpu if you have no CJK documents. If your document base
-           does include such text but you are not interested in searching it,
-           setting nocjk may be a significant time and space saver.
-
-   cjkngramlen
-
-           This lets you adjust the size of n-grams used for indexing CJK
-           text. The default value of 2 is probably appropriate in most
-           cases. A value of 3 would allow more precision and efficiency on
-           longer words, but the index will be approximately twice as large.
-
    guesscharset
 
            Decide if we try to guess the character set of files if no
            internal value is available (ie: for plain text files). This does
            not work well in general, and should probably not be used.
 
-7.4.2. The mimemap file
+7.4.2. The fields file
+
+   This file contains information about dynamic fields handling in Recoll.
+   Some very basic fields have hard-wired behaviour, and, mostly, you should
+   not change the original data inside the fields file. But you can create
+   custom fields fitting your data and handle them just like they were native
+   ones.
+
+   The fields file has several sections, which each define an aspect of
+   fields processing. Quite often, you'll have to modify several sections to
+   obtain the desired behaviour.
+
+   We will only give a short description here; you should refer to the
+   comments inside the file for more detailed information.
+
+   Field names should be lowercase alphabetic ASCII.
+
+   [prefixes]
+
+           A field becomes indexed (searchable) by having a prefix defined in
+           this section.
+
+   [stored]
+
+           A field becomes stored (displayable inside results) by having its
+           name listed in this section (typically with an empty value).
+
+   [aliases]
+
+           This section defines lists of synonyms for the canonical names
+           used inside the [prefixes] and [stored] sections.
+
+   filter-specific sections
+
+           Some filters may need specific configuration for handling fields.
+           Only the mail message filter currently has such a section (named
+           [mail]). It allows indexing arbitrary mail headers in addition to
+           the ones indexed by default. Other such sections may appear in the
+           future.
+
+   Here follows a small example of a personal fields file. This would extract
+   a specific mail header and use it as a searchable field, with data
+   displayable inside result lists. (Side note: as the mail filter does no
+   decoding on the values, only plain ASCII headers can be indexed, and only
+   the first occurrence will be used for headers that occur several times).
+
+ [prefixes]
+ # Index mailmytag contents (with the given prefix)
+ mailmytag = XMTAG
+
+ [stored]
+ # Store mailmytag inside the document data record (so that it can be
+ # displayed - as %(mailmytag) - in result lists).
+ mailmytag =
+
+ [mail]
+ # Extract the X-My-Tag mail header, and use it internally with the
+ # mailmytag field name
+ x-my-tag = mailmytag
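As a rough way to see what the example fields file declares, the sketch below reads the same three sections with Python's standard configparser module. This is only a stand-in: Recoll uses its own configuration parser, and the summarize_fields helper is hypothetical.

```python
import configparser

SAMPLE = """\
[prefixes]
# Index mailmytag contents (with the given prefix)
mailmytag = XMTAG

[stored]
# Store mailmytag inside the document data record
mailmytag =

[mail]
# Extract the X-My-Tag mail header as the mailmytag field
x-my-tag = mailmytag
"""


def summarize_fields(text):
    """Return (searchable, stored, mail_headers) from fields-file text."""
    # interpolation=None keeps '%' characters in values literal.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    searchable = dict(parser["prefixes"]) if parser.has_section("prefixes") else {}
    stored = list(parser["stored"]) if parser.has_section("stored") else []
    mail = dict(parser["mail"]) if parser.has_section("mail") else {}
    return searchable, stored, mail


if __name__ == "__main__":
    searchable, stored, mail = summarize_fields(SAMPLE)
    print("searchable fields and prefixes:", searchable)
    print("stored fields:", stored)
    print("mail headers mapped to fields:", mail)
```

In this reading, mailmytag is searchable because it has a prefix, displayable because it is listed under [stored], and filled from the X-My-Tag header because of the [mail] entry.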
+
+7.4.3. The mimemap file
 
    mimemap specifies the file name extension to mime type mappings.
 
@@ -655,7 +752,7 @@
    given Recoll version. Having it there avoids cluttering the more
    user-oriented and locally customized skippedNames.
 
-7.4.3. The mimeconf file
+7.4.4. The mimeconf file
 
    mimeconf specifies how the different mime types are handled for indexing,
    and which icons are displayed in the recoll result lists.
@@ -667,7 +764,7 @@
    recoll in the result lists (the values are the basenames of the png images
    inside the iconsdir directory (specified in recoll.conf).
 
-7.4.4. The mimeview file
+7.4.5. The mimeview file
 
    mimeview specifies which programs are started when you click on an Edit
    link in a result list. Ie: HTML is normally displayed using firefox, but
@@ -693,9 +790,9 @@
    user preferences, all mimeview entries will be ignored except the one
    labelled application/x-all (which is set to use xdg-open by default).
 
-7.4.5. Examples of configuration adjustments
-
-  7.4.5.1. Adding an external viewer for an non-indexed type
+7.4.6. Examples of configuration adjustments
+
+  7.4.6.1. Adding an external viewer for a non-indexed type
 
    Imagine that you have some kind of file which does not have indexable
    content, but for which you would like to have a functional Edit link in
@@ -725,7 +822,7 @@
    configuration, which you do not need to alter. mimeview can also be
    modified from the Gui.
 
-  7.4.5.2. Adding indexing support for a new file type
+  7.4.6.2. Adding indexing support for a new file type
 
    Let us now imagine that the above .blob files actually contain indexable
    text and that you know how to extract it with a command line program.