a/src/INSTALL b/src/INSTALL
1
1
2
More documentation can be found in the doc/ directory or at http://www.recoll.org
3
3
4
4
5
   Link: HOME
6
   Link: PREVIOUS
7
   Link: NEXT
8
9
                               Recoll user manual
10
   Prev                                                                  Next 
11
12
   --------------------------------------------------------------------------
13
14
                   Chapter 5. Installation and configuration
15
16
   Table of Contents
17
18
   5.1. Installing a binary copy
19
20
   5.2. Supporting packages
21
22
   5.3. Building from source
23
24
   5.4. Configuration overview
25
26
                         5.1. Installing a binary copy
27
28
   There are three types of binary Recoll installations:
29
30
     * Through your system's normal software distribution framework (e.g.,
31
       Debian/Ubuntu apt, FreeBSD ports, etc.).
32
33
     * From a package downloaded from the Recoll web site.
34
35
     * From a prebuilt tree downloaded from the Recoll web site.
36
37
   In all cases, the strict software dependencies (e.g. on Xapian or iconv)
38
   will be automatically satisfied; you should not have to worry about them.
39
40
   You will only have to check or install supporting applications for the
41
   file types that you want to index beyond those that are natively processed
42
   by Recoll (text, HTML, email files, and a few others).
43
44
   You may also want to look at the configuration section (but this
45
   may not be necessary for a quick test with default parameters). Most
46
   parameters can be more conveniently set from the GUI.
47
48
5.1.1. Installing through a package system
49
50
   If you use a BSD-type port system or a prebuilt package (DEB, RPM,
51
   manually or through the system software configuration utility), just
52
   follow the usual procedure for your system.
53
54
5.1.2. Installing a prebuilt Recoll
55
56
   The unpackaged binary versions on the Recoll web site are just compressed
57
   tar files of a build tree, where only the useful parts were kept
58
   (executables and sample configuration).
59
60
   The executable binary files are built with a static link to libxapian and
61
   libiconv, to make installation easier (no dependencies).
62
63
   After extracting the tar file, you can proceed with installation as if you
64
   had built the package from source (that is, just type make install). The
65
   binary trees are built for installation to /usr/local.
66
80
81
                            5.2. Supporting packages
82
83
   Recoll uses external applications to index some file types. You need to
84
   install them for the file types that you wish to have indexed (these are
85
   run-time optional dependencies. None is needed for building or running
86
   Recoll except for indexing their specific file type).
87
88
   After an indexing pass, the commands that were found missing can be
89
   displayed from the recoll File menu. The list is stored in the missing
90
   text file inside the configuration directory.
91
92
   A list of common file types which need external commands follows. Many of
93
   the filters need the iconv command, which is not always listed as a
94
   dependency.
95
96
   Please note that, due to the relatively dynamic nature of this
97
   information, the most up to date version is now kept on the Recoll helper
98
   applications page along with links to the home pages or best
99
   source/patches pages, and misc tips. The list below is not updated often
100
   and may be quite stale.
101
102
   For many Linux distributions, most of the commands listed can be installed
103
   from the package repositories. However, the packages are sometimes
104
   outdated, or not the best version for Recoll, so you should take a look at
105
   the Recoll helper applications page if a file type is important to you.
106
107
   As of Recoll release 1.14, a number of XML-based formats that were handled
108
   by ad hoc filter code now use the xsltproc command, which usually comes
109
   with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
110
111
   Now for the list:
112
113
     * Openoffice files need unzip and xsltproc.
114
115
     * PDF files need pdftotext which is part of the Xpdf or Poppler
116
       packages.
117
118
     * Postscript files need pstotext. The original version has an issue with
119
       shell characters in file names, which is corrected in recent packages.
120
       See the Recoll helper applications page for more detail.
121
122
     * MS Word needs antiword. It is also useful to have wvWare installed as
123
       it may be used as a fallback for some files which antiword does not
124
       handle.
125
126
     * MS Excel and PowerPoint need catdoc.
127
128
     * MS Open XML (docx) needs xsltproc.
129
130
     * Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
131
       Ubuntu) package.
132
133
     * RTF files need unrtf, which, in its standard version, has much trouble
134
       with non-western character sets. Check the Recoll helper applications
135
       page.
136
137
     * TeX files need untex or detex. Check the Recoll helper applications
138
       page for sources if it's not packaged for your distribution.
139
140
     * dvi files need dvips.
141
142
     * djvu files need djvutxt and djvused from the DjVuLibre package.
143
144
     * Audio files: Recoll releases before 1.13 used the id3info command from
145
       the id3lib package to extract mp3 tag information, metaflac (standard
146
       flac tools) for flac files, and ogginfo (vorbis tools) for ogg files.
147
       Releases 1.14 and later use a single Python filter based on mutagen
148
       for all audio file types.
149
150
     * Pictures: Recoll uses the Exiftool Perl package to extract tag
151
       information. Most image file formats are supported. Note that there
152
       may not be much interest in indexing the technical tags (image size,
153
       aperture, etc.). This is only of interest if you store personal tags
154
       or textual descriptions inside the image files.
155
156
     * chm: files in Microsoft help format need Python and the pychm module
157
       (which needs chmlib).
158
159
     * ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
160
       module. icalendar is not needed for newer versions, which use internal
161
       code.
162
163
     * Zip archives need Python (and the standard zipfile module).
164
165
     * Rar archives need Python, the rarfile Python module and the unrar
166
       utility.
167
168
     * Midi karaoke files need Python and the Midi module.
169
170
     * Konqueror webarchive files need Python (uses the tarfile module).
171
172
     * mimehtml web archive format (support based on the email filter, which
173
       introduces some mild weirdness, but still usable).
174
175
   Text, HTML, email folders, and Scribus files are processed internally. Lyx
176
   is used to index Lyx files. Many filters need iconv and the standard sed
177
   and awk.
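
   Recoll itself reports missing commands after an indexing pass, as
   described above. If you want a quick check beforehand, the following is a
   minimal sketch that tests whether some of the helper commands named in the
   list are on your PATH. The selection of commands is only an illustration
   taken from this section; trim it to the file types you care about.

```shell
# Helper commands named in the list above (an illustrative selection).
helpers="unzip xsltproc pdftotext pstotext antiword catdoc wpd2html unrtf detex dvips djvutxt djvused iconv sed awk"

missing=""
for cmd in $helpers; do
    # 'command -v' is the POSIX way to test whether a command is on PATH.
    if ! command -v "$cmd" >/dev/null 2>&1; then
        missing="$missing $cmd"
    fi
done

if [ -n "$missing" ]; then
    echo "Not found on PATH:$missing"
else
    echo "All listed helper commands were found."
fi
```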
178
192
193
                           5.3. Building from source
194
195
5.3.1. Prerequisites
196
197
   C++ compiler. Up to Recoll version 1.13.04, its absence can manifest
198
   itself by strange messages about a missing iconv_open.
199
200
   Development files for Xapian core.
201
202
     Important: If you are building Xapian for an older CPU (before Pentium 4
203
     or Athlon 64), you need to add the --disable-sse flag to the configure
204
     command. Otherwise all Xapian applications will crash with an illegal
205
     instruction error.
206
207
   Development files for Qt.
208
209
   Development files for X11 and zlib.
210
211
   Check the Recoll download page for up to date version information.
212
213
   You will most probably be able to find a binary package for Qt for your
214
   system. You may have to compile Xapian but this is not difficult (if you
215
   are using FreeBSD, there is a port).
216
217
   You may also need libiconv. Recoll currently uses version 1.9 (this should
218
   not be critical). On Linux systems, the iconv interface is part of libc
219
   and you should not need to do anything special.
220
221
5.3.2. Building
222
223
   Recoll has been built on Linux, FreeBSD, Mac OS X, and Solaris. Most
224
   versions after 2005 should be ok, maybe some older ones too (Solaris 8 is
225
   ok). If you build on another system, and need to modify things, I would
226
   very much welcome patches.
227
228
   Depending on the Qt 3 configuration on your system, you may have to set
229
   the QTDIR and QMAKESPECS variables in your environment:
230
231
     * QTDIR should point to the directory above the one that holds the qt
232
       include files (e.g. if qt.h is /usr/local/qt/include/qt.h, QTDIR should
233
       be /usr/local/qt).
234
235
     * QMAKESPECS should be set to the name of one of the Qt mkspecs
236
       sub-directories (e.g. linux-g++).
237
238
   On many Linux systems, QTDIR is set by the login scripts, and QMAKESPECS
239
   is not needed because there is a default link in mkspecs/.
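
   As a sketch of the Qt 3 case for a (ba)sh environment, using the example
   values from the list above (both values are illustrations, not defaults,
   and must be adapted to your installation):

```shell
# Example values taken from the text above; adjust to your Qt 3 installation.
QTDIR=/usr/local/qt      # directory above the one holding qt.h
QMAKESPECS=linux-g++     # name of one of the Qt mkspecs sub-directories
export QTDIR QMAKESPECS

# Sanity check: warn if the header is not where QTDIR implies it should be.
if [ ! -f "$QTDIR/include/qt.h" ]; then
    echo "warning: $QTDIR/include/qt.h not found, QTDIR is probably wrong" >&2
fi
```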
240
241
   Neither QTDIR nor QMAKESPECS should be needed with Qt 4: configuration
242
   details are entirely determined by qmake (which is quite often installed
243
   as qmake-qt4).
244
245
   Configure options:
246
247
     * --without-aspell will disable the code for phonetic matching of search
248
       terms.
249
250
     * --with-fam or --with-inotify will enable the code for real time
251
       indexing. Inotify support is enabled by default on recent Linux
252
       systems.
253
254
     * --disable-webkit is available from version 1.17 to implement the
255
       result list with a Qt QTextBrowser instead of a WebKit widget if you
256
       do not or can't depend on the latter.
257
258
     * --enable-xattr will enable code to fetch data from file extended
259
       attributes. This is only useful if some application stores data in
260
       there, and also needs some simple configuration (see comments in the
261
       fields configuration file).
262
263
     * --enable-camelcase will enable splitting camelCase words. This is not
264
       enabled by default as it has the unfortunate side-effect of making
265
       some phrase searches quite confusing: e.g., "MySQL manual" would be
266
       matched by "MySQL manual" and "my sql manual" but not "mysql manual"
267
       (only inside phrase searches).
268
269
     * --with-file-command Specify the version of the 'file' command to use
270
       (e.g. --with-file-command=/usr/local/bin/file). Can be useful to enable
271
       the GNU version on systems where the native one is bad.
272
273
     * --disable-qtgui Disable the Qt interface. Will allow building the
274
       indexer and the command line search program in absence of a Qt
275
       environment.
276
277
     * --disable-x11mon Disable X11 connection monitoring inside recollindex.
278
       Together with --disable-qtgui, this allows building recoll without Qt
279
       and X11.
280
281
     * The usual autoconf configure options, such as --prefix, also apply.
282
283
   Normal procedure:
284
285
         cd recoll-xxx
         ./configure
         make
         (practices usual hardship-repelling invocations)
289
      
290
291
   There is little auto-configuration. The configure script will mainly link
292
   one of the system-specific files in the mk directory to mk/sysconf. If
293
   your system is not known yet, it will tell you as much, and you may want
294
   to manually copy and modify one of the existing files (the new file name
295
   should be the output of uname -s).
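
   To see which file name is expected for your system, you can print the
   output of uname -s yourself. The sketch below only reports what it finds;
   the check on the mk directory assumes that it is run from the root of the
   Recoll source tree.

```shell
# The system-specific file under mk/ is named after the output of 'uname -s'.
sysname=$(uname -s)
echo "expected system file: mk/$sysname"

# Only meaningful when run from the root of the Recoll source tree.
if [ -d mk ] && [ ! -f "mk/$sysname" ]; then
    echo "No mk/$sysname yet: copy the closest existing file in mk/ to mk/$sysname and edit it."
fi
```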
296
297
5.3.3. Installation
298
299
   Either type make install or execute recollinstall prefix, in the root of
300
   the source tree. This will copy the commands to prefix/bin and the sample
301
   configuration files, scripts and other shared data to prefix/share/recoll.
302
303
   If the installation prefix given to recollinstall is different from either
304
   the system default or the value which was specified when executing
305
   configure (as in configure --prefix /some/path), you will have to set the
306
   RECOLL_DATADIR environment variable to indicate where the shared data is
307
   to be found (e.g. for (ba)sh: export
308
   RECOLL_DATADIR=/some/path/share/recoll).
309
310
   You can then proceed to configuration.
311
324
325
                          5.4. Configuration overview
326
327
   Most of the parameters specific to the recoll GUI are set through the
328
   Preferences menu and stored in the standard Qt place
329
   ($HOME/.config/Recoll.org/recoll.conf). You probably do not want to edit
330
   this by hand.
331
332
   Recoll indexing options are set inside text configuration files located in
333
   a configuration directory. There can be several such directories, each of
334
   which defines the parameters for one index.
335
336
   The configuration files can be edited by hand or through the Index
337
   configuration dialog (Preferences menu). The GUI tool will try to respect
338
   your formatting and comments as much as possible, so it is quite possible
339
   to use both ways.
340
341
   The most accurate documentation for the configuration parameters is given
342
   by comments inside the default files, and we will just give a general
343
   overview here.
344
345
   For each index, there are two sets of configuration files. System-wide
346
   configuration files are kept in a directory named like
347
   /usr/[local/]share/recoll/examples, and define default values, shared by
348
   all indexes. For each index, a parallel set of files defines the
349
   customized parameters.
350
351
   The default location of the configuration is the .recoll directory in your
352
   home. Most people will only use this directory.
353
354
   This location can be changed, or others can be added with the
355
   RECOLL_CONFDIR environment variable or the -c option parameter to recoll
356
   and recollindex.
357
358
   If the .recoll directory does not exist when recoll or recollindex are
359
   started, it will be created with a set of empty configuration files.
360
   recoll will give you a chance to edit the configuration file before
361
   starting indexing. recollindex will proceed immediately. To avoid
362
   mistakes, the automatic directory creation will only occur for the default
363
   location, not if -c or RECOLL_CONFDIR were used (in the latter cases, you
364
   will have to create the directory).
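
   The following sketch sets up a second, non-default configuration
   directory for a (ba)sh environment. The directory name is hypothetical,
   and the recollindex and recoll invocations are left as comments so that
   the example runs without Recoll being installed.

```shell
# Hypothetical location for a second index configuration.
confdir="${TMPDIR:-/tmp}/recoll-testconf.$$"

# Automatic creation only happens for the default ~/.recoll,
# so a non-default directory has to be created by hand.
mkdir -p "$confdir"

# Either export the variable for all later commands...
RECOLL_CONFDIR="$confdir"
export RECOLL_CONFDIR

# ...or pass the directory explicitly on each invocation:
#   recollindex -c "$confdir"
#   recoll -c "$confdir"
echo "configuration directory: $RECOLL_CONFDIR"
```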
365
366
   All configuration files share the same format. For example, a short
367
   extract of the main configuration file might look as follows:
368
369
         # Space-separated list of directories to index.
370
         topdirs =  ~/docs /usr/share/doc
371
372
         [~/somedirectory-with-utf8-txt-files]
373
         defaultcharset = utf-8
374
        
375
376
   There are three kinds of lines:
377
378
     * Comment (starts with #) or empty.
379
380
     * Parameter assignment (name = value).
381
382
     * Section definition ([somedirname]).
383
384
   Depending on the type of configuration file, section definitions either
385
   separate groups of parameters or allow redefining some parameters for a
386
   directory sub-tree. They stay in effect until another section definition,
387
   or the end of file, is encountered. Some of the parameters used for
388
   indexing are looked up hierarchically from the current directory location
389
   upwards. Not all parameters can be meaningfully redefined; this is
390
   specified for each in the next section.
391
392
   When found at the beginning of a file path, the tilde character (~) is
393
   expanded to the name of the user's home directory, as a shell would do.
394
395
   White space is used for separation inside lists. List elements with
396
   embedded spaces can be quoted using double-quotes.
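
   As an illustration of list quoting, the sketch below writes a one-line
   configuration extract to a temporary file. The paths are hypothetical;
   the second list element contains a space and is therefore enclosed in
   double quotes.

```shell
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Three list elements; the second contains a space and is quoted.
topdirs = /home/me/docs "/home/me/my documents" /usr/share/doc
EOF
cat "$conf"
```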
397
398
   Encoding issues. Most of the configuration parameters are plain ASCII. Two
399
   particular sets of values may cause encoding issues:
400
401
     * File path parameters may contain non-ascii characters and should use
402
       the exact same byte values as found in the file system directory.
403
       Usually, this means that the configuration file should use the system
404
       default locale encoding.
405
406
     * The unac_except_trans parameter should be encoded in UTF-8. If your
407
       system locale is not UTF-8, and you need to also specify non-ascii
408
       file paths, this poses a difficulty because common text editors cannot
409
       handle multiple encodings in a single file. In this relatively
410
       unlikely case, you can edit the configuration file as two separate
411
       text files with appropriate encodings, and concatenate them to create
412
       the complete configuration.
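
   The two-file approach could look like the sketch below. The file names
   and the path are hypothetical, and printf with octal escapes stands in for
   the two editing sessions so that the byte values are explicit: the first
   part uses ISO-8859-1, the second UTF-8.

```shell
workdir=$(mktemp -d)

# Part 1: a file path in ISO-8859-1 ('\351' is e-acute in that encoding).
printf 'topdirs = /home/me/r\351sum\351s\n' > "$workdir/paths.latin1"

# Part 2: unac_except_trans in UTF-8
# ('\303\244' is a-diaeresis, '\303\204' is A-diaeresis).
printf 'unac_except_trans = \303\244\303\244 \303\204\303\244\n' > "$workdir/unac.utf8"

# Concatenate the two parts into the complete configuration file.
cat "$workdir/paths.latin1" "$workdir/unac.utf8" > "$workdir/recoll.conf"
wc -l < "$workdir/recoll.conf"
```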
413
414
5.4.1. Main configuration file
415
416
   recoll.conf is the main configuration file. It defines things like what to
417
   index (top directories and things to ignore), and the default character
418
   set to use for document types which do not specify it internally.
419
420
   The default configuration will index your home directory. If this is not
421
   appropriate, start recoll to create a blank configuration, click Cancel,
422
   and edit the configuration file before restarting the command. This will
423
   start the initial indexing, which may take some time.
424
425
   Most of the following parameters can be changed from the Index
426
   Configuration menu in the recoll interface. Some can only be set by
427
   editing the configuration file.
428
429
  5.4.1.1. Parameters affecting what documents we index:
430
431
   topdirs
432
433
           Specifies the list of directories or files to index (recursively
434
           for directories). You can use symbolic links as elements of this
435
           list. See the followLinks option about following symbolic links
436
           found under the top elements (not followed by default).
437
438
   skippedNames
439
440
           A space-separated list of patterns for names of files or
441
           directories that should be completely ignored. The list defined in
442
           the default file is:
443
444
 skippedNames = #* bin CVS  Cache cache* caughtspam  tmp .thumbnails .svn \
                 *~ .beagle .git .hg .bzr loop.ps .xsession-errors \
                 .recoll* xapiandb recollrc recoll.conf
447
448
           The list can be redefined at any sub-directory in the indexed
449
           area.
450
451
           The top-level directories are not affected by this list (that is,
452
           a directory in topdirs might match and would still be indexed).
453
454
           The list in the default configuration does not exclude hidden
455
           directories (names beginning with a dot), which means that it may
456
           index quite a few things that you do not want. On the other hand,
457
           email user agents like thunderbird usually store messages in
458
           hidden directories, and you probably want this indexed. One
459
           possible solution is to have .* in skippedNames, and add things
460
           like ~/.thunderbird or ~/.evolution in topdirs.
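
           A sketch of that approach, written to a temporary file for
           illustration. It assumes that redefining skippedNames replaces
           the default list, so you may want to keep the other default
           patterns as well.

```shell
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Index the home directory, plus the hidden mail directory named explicitly.
topdirs = ~ ~/.thunderbird
# Ignore every hidden file or directory met while walking the tree.
# Entries of topdirs are not affected by skippedNames.
skippedNames = .*
EOF
cat "$conf"
```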
461
462
           Not even the file names are indexed for patterns in this list. See
463
           the recoll_noindex variable in mimemap for an alternative approach
464
           which indexes the file names.
465
466
   skippedPaths and daemSkippedPaths
467
468
           A space-separated list of patterns for paths of files or
469
           directories that should be skipped. There is no default in the
470
           sample configuration file, but the code always adds the
471
           configuration and database directories in there.
472
473
           skippedPaths is used both by batch and real time indexing.
474
           daemSkippedPaths can be used to specify things that should be
475
           indexed at startup, but not monitored.
476
477
           Example of use for skipping text files only in a specific
478
           directory:
479
480
 skippedPaths = ~/somedir/*.txt
481
              
482
483
   skippedPathsFnmPathname
484
485
           The values in the *skippedPaths variables are matched by default
486
           with fnmatch(3), with the FNM_PATHNAME and FNM_LEADING_DIR flags.
487
           This means that '/' characters must be matched explicitly. You
488
           can set skippedPathsFnmPathname to 0 to disable the use of
489
           FNM_PATHNAME (meaning that /*/dir3 will match /dir1/dir2/dir3).
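
           The difference can be reproduced with the shell's own pattern
           matching, which is used here as a stand-in and is not Recoll's
           code: pathname expansion never lets '*' cross a '/', as with
           FNM_PATHNAME, while a case pattern lets '*' match '/' too, as
           fnmatch(3) does without that flag. The directory names are the
           ones from the example above, created under a temporary root.

```shell
root=$(mktemp -d)
mkdir -p "$root/dir1/dir2/dir3"

# Pathname expansion: '*' stops at '/', so this pattern would need
# dir3 to sit exactly one level below the root. It does not.
set -- "$root"/*/dir3
if [ -e "$1" ]; then glob_match=yes; else glob_match=no; fi

# 'case' pattern: '*' may also match '/', so dir1/dir2 is covered.
case "$root/dir1/dir2/dir3" in
    "$root"/*/dir3) case_match=yes ;;
    *)              case_match=no ;;
esac

echo "with '/' matched explicitly: $glob_match, without: $case_match"
```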
490
491
   followLinks
492
493
           Specifies if the indexer should follow symbolic links while
494
           walking the file tree. The default is to ignore symbolic links to
495
           avoid multiple indexing of linked files. No effort is made to
496
           avoid duplication when this option is set to true. This option can
497
           be set individually for each of the topdirs members by using
498
           sections. It cannot be changed below the topdirs level.
499
500
   indexedmimetypes
501
502
           Recoll normally indexes any file which it knows how to read. This
503
           list lets you restrict the indexed mime types to what you specify.
504
           If the variable is unspecified or the list empty (the default),
505
           all supported types are processed.
506
507
   compressedfilemaxkbs
508
509
           Size limit for compressed (.gz or .bz2) files. These need to be
510
           decompressed in a temporary directory for identification, which
511
           can be very wasteful if 'uninteresting' big compressed files are
512
           present. Negative means no limit, 0 means no processing of any
513
           compressed file. Defaults to -1.
514
515
   textfilemaxmbs
516
517
           Maximum size for text files. Very big text files are often
518
           uninteresting logs. Set to -1 to disable (default 20MB).
519
520
   textfilepagekbs
521
522
           If set to other than -1, text files will be indexed as multiple
523
           documents of the given page size. This may be useful if you do
524
           want to index very big text files as it will both reduce memory
525
           usage at index time and help with loading data to the preview
526
           window. A size of a few megabytes would seem reasonable (default:
527
           1MB).
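
           A sketch of how the three size limits above could be set
           together, written to a temporary file for illustration. The
           compressedfilemaxkbs value is a hypothetical choice; the other
           two values restate the defaults given above, assuming that 1MB
           corresponds to 1000 kilobytes.

```shell
conf=$(mktemp)
cat > "$conf" <<'EOF'
# Skip compressed files bigger than about 20 MB (hypothetical limit).
compressedfilemaxkbs = 20000
# Skip text files bigger than 20 MB.
textfilemaxmbs = 20
# Index big text files as separate documents of about 1 MB each.
textfilepagekbs = 1000
EOF
cat "$conf"
```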
528
529
   membermaxkbs
530
531
           This defines the maximum size in kilobytes for an archive member
532
           (zip, tar or rar at the moment). Bigger entries will be skipped.
533
534
   indexallfilenames
535
536
           Recoll indexes file names in a special section of the database to
537
           allow specific file names searches using wild cards. This
538
           parameter decides if file name indexing is performed only for
539
           files with mime types that would qualify them for full text
540
           indexing, or for all files inside the selected subtrees,
541
           independently of mime type.
542
543
   usesystemfilecommand
544
545
           Decide if we use the file -i system command as a final step for
546
           determining the mime type for a file (the main procedure uses
547
           suffix associations as defined in the mimemap file). This can be
548
           useful for files with suffix-less names, but it will also cause
549
           the indexing of many bogus "text" files.
550
551
   processbeaglequeue
552
553
           If this is set, process the directory where Beagle Web browser
554
           plugins copy visited pages for indexing. Of course, Beagle MUST
555
           NOT be running, else things will behave strangely.
556
557
   beaglequeuedir
558
559
           The path to the Beagle indexing queue. This is hard-coded in the
560
           Beagle plugin as ~/.beagle/ToIndex so there should be no need to
561
           change it.
562
563
  5.4.1.2. Parameters affecting how we generate terms:
564
565
   Changing some of these parameters will imply a full reindex. Also, when
566
   using multiple indexes, it may not make sense to search indexes that don't
567
   share the values for these parameters, because they usually affect both
568
   search and index operations.
569
570
   indexStripChars
571
572
           Decide if we strip characters of diacritics and convert them to
573
           lower-case before terms are indexed. If we don't, searches
574
           sensitive to case and diacritics can be performed, but the index
575
           will be bigger, and some marginal weirdness may sometimes occur.
576
           The default is a stripped index (indexStripChars = 1) for now.
577
           When using multiple indexes for a search, this parameter must be
578
           defined identically for all. Changing the value implies an index
579
           reset.
580
581
   maxTermExpand
582
583
           Maximum expansion count for a single term (e.g.: when using
584
           wildcards). The default of 10000 is reasonable and will avoid
585
           queries that appear frozen while the engine is walking the term
586
           list.
587
588
   maxXapianClauses
589
590
           Maximum number of elementary clauses we can add to a single Xapian
591
           query. In some cases, the result of term expansion can be
592
           multiplicative, and we want to avoid using excessive memory. The
593
           default of 100 000 should be both high enough in most cases and
594
           compatible with current typical hardware configurations.
595
596
   nonumbers
597
598
           If this is set to true, no terms will be generated for numbers. For
599
           example "123", "1.5e6", and "192.168.1.4" would not be indexed
600
           ("value123" would still be). Numbers are often quite interesting
601
           to search for, and this should probably not be set except for
602
           special situations, e.g. scientific documents with huge amounts of
603
           numbers in them. This can only be set for a whole index, not for a
604
           subtree.
605
606
   nocjk
607
608
           If this is set to true, specific East Asian (Chinese, Korean, Japanese)
609
           characters/word splitting is turned off. This will save a small
610
           amount of cpu if you have no CJK documents. If your document base
611
           does include such text but you are not interested in searching it,
612
           setting nocjk may be a significant time and space saver.
613
614
   cjkngramlen
615
616
           This lets you adjust the size of n-grams used for indexing CJK
617
           text. The default value of 2 is probably appropriate in most
618
           cases. A value of 3 would allow more precision and efficiency on
619
           longer words, but the index will be approximately twice as large.
620
621
   indexstemminglanguages
622
623
           A list of languages for which the stem expansion databases will be
624
           built. See recollindex(1) or use the recollindex -l command for
625
           possible values. You can add a stem expansion database for a
626
           different language by using recollindex -s, but it will be deleted
627
           during the next indexing. Only languages listed in the
628
           configuration file are permanent.
629
630
   defaultcharset
631
632
           The name of the character set used for files that do not contain a
633
           character set definition (e.g. plain text files). This can be
634
           redefined for any sub-directory. If it is not set at all, the
635
           character set used is the one defined by the NLS environment
           (LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
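
           The fallback order can be sketched as a small shell function.
           This is an illustration and not Recoll's code: the function name
           is hypothetical, and treating a locale value without a codeset
           part (such as "C") like an unset one is a simplifying
           assumption.

```shell
# default_charset LC_ALL LC_CTYPE LANG
# Prints the codeset of the first non-empty locale value, or iso8859-1.
default_charset() {
    locale_value="${1:-${2:-${3:-}}}"
    case "$locale_value" in
        *.*)
            codeset="${locale_value#*.}"   # drop "language_TERRITORY."
            echo "${codeset%%@*}"          # drop an optional "@modifier"
            ;;
        *)
            echo "iso8859-1"
            ;;
    esac
}

# Report the value for the current environment.
default_charset "${LC_ALL:-}" "${LC_CTYPE:-}" "${LANG:-}"
```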
637
638
   unac_except_trans
639
640
           This is a list of characters, encoded in UTF-8, which should be
641
           handled specially when converting text to unaccented lowercase.
642
           For example, in Swedish, the letter a with diaeresis has full
643
           alphabet citizenship and should not be turned into an a. Each
644
           element in the space-separated list has the special character as
645
           first element and the translation following. The handling of both
646
           the lowercase and upper-case versions of a character should be
647
           specified, as membership in the list will turn off both standard
648
           accent and case processing. Example for Swedish:
649
650
 unac_except_trans =  åå Åå ää Ää öö Öö
651
            
652
653
           Note that the translation is not limited to a single character,
654
           you could very well have something like üue in the list.
655
656
           The default value set for unac_except_trans can't be listed here
657
           because I have trouble with SGML and UTF-8, but it only contains
658
           ligature decompositions: german ss, oe, ae, fi, fl.
659
660
           This parameter can't be defined for subdirectories, it is global,
661
           because there is no way to do otherwise when querying. If you have
662
           document sets which would need different values, you will have to
663
           index and query them separately.
664
665
   maildefcharset
666
667
           This can be used to define the default character set specifically
668
           for email messages which don't specify it. This is mainly useful
669
           for readpst (libpst) dumps, which are utf-8 but do not say so.
670
671
   localfields
672
673
           This allows setting fields for all documents under a given
674
           directory. Typical usage would be to set an "rclaptg" field, to be
675
           used in mimeview to select a specific viewer. If several fields
676
           are to be set, they should be separated with a colon (':')
677
           character (which there is currently no way to escape). For example:
678
           localfields= rclaptg=gnus:other = val, then select the specific
679
           viewer with mimetype|tag=... in mimeview.
680
681
  5.4.1.3. Parameters affecting where and how we store things:

   dbdir

           The name of the Xapian data directory. It will be created if
           needed when the index is initialized. If this is not an absolute
           path, it will be interpreted relative to the configuration
           directory. The value can have embedded spaces but leading or
           trailing spaces will be trimmed. You cannot use quotes here.

   idxstatusfile

           The name of the scratch file where the indexer process updates its
           status. Default: idxstatus.txt inside the configuration directory.

   maxfsoccuppc

           Maximum file system occupation before we stop indexing. The value
           is a percentage, corresponding to what the "Capacity" df output
           column shows. The default value is 0, meaning no checking.

   mboxcachedir

           The directory where mbox message offsets cache files are held.
           This is normally $RECOLL_CONFDIR/mboxcache, but it may be useful
           to share a directory between different configurations.

   mboxcacheminmbs

           The minimum mbox file size over which we cache the offsets. There
           is really no sense in caching offsets for small files. The default
           is 5 MB.

   webcachedir

           This is only used by the Beagle web browser plugin indexing code,
           and defines where the cache for visited pages will live. Default:
           $RECOLL_CONFDIR/webcache

   webcachemaxmbs

           This is only used by the Beagle web browser plugin indexing code,
           and defines the maximum size for the web page cache. Default: 40
           MB.

   idxflushmb

           Threshold (megabytes of new text data) at which we flush from
           memory to the disk index. Setting this can help control memory
           usage. A value of 0 means no explicit flushing, letting Xapian use
           its own default, which is to flush every 10000 (or
           XAPIAN_FLUSH_THRESHOLD) documents. This gives little control over
           memory usage, which then depends on the average document size. The
           default value is 10.

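           For illustration, the storage parameters above could be combined
           as follows in recoll.conf. The values are arbitrary examples, not
           recommendations: a relative dbdir resolved against the
           configuration directory, indexing stopped at 90% file system
           occupation, offsets cached only for mbox files over 10 MB, and a
           flush every 50 MB of new text:

 dbdir = xapiandb
 maxfsoccuppc = 90
 mboxcacheminmbs = 10
 idxflushmb = 50
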
  5.4.1.4. Miscellaneous parameters:

   autodiacsens

           If the index is not stripped, decide if we automatically trigger
           diacritics sensitivity if the search term has accented characters
           (not in unac_except_trans). Otherwise you need to use the query
           language and the D modifier to specify diacritics sensitivity.
           Default is no.

   autocasesens

           If the index is not stripped, decide if we automatically trigger
           character case sensitivity if the search term has upper-case
           characters in any but the first position. Otherwise you need to
           use the query language and the C modifier to specify
           character-case sensitivity. Default is yes.

   loglevel, daemloglevel

           Verbosity level for recoll and recollindex. A value of 4 lists
           quite a lot of debug/information messages. 2 only lists errors.
           The daem version is specific to the indexing monitor daemon.

   logfilename, daemlogfilename

           Where the messages should go. 'stderr' can be used as a special
           value, and is the default. The daem version is specific to the
           indexing monitor daemon.

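           As an illustration, the following recoll.conf fragment keeps
           recoll and recollindex at error-only verbosity on stderr, and
           gives the indexing monitor daemon a more verbose log in its own
           file. The log file path is a hypothetical example:

 loglevel = 2
 logfilename = stderr
 daemloglevel = 4
 daemlogfilename = /tmp/recollindex.log
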
   mondelaypatterns

           This allows specifying wildcard path patterns (processed with
           fnmatch(3) with flags set to 0), to match files which change too
           often and for which a delay should be observed before re-indexing.
           This is a space-separated list, each entry being a pattern and a
           time in seconds, separated by a colon. You can use double quotes
           if a path entry contains white space. Example:

 mondelaypatterns = *.log:20 "this one has spaces*:10"

   monixinterval

           Minimum interval (seconds) for processing the indexing queue. The
           real time monitor does not process each event when it comes in,
           but will wait this time for the queue to accumulate, to diminish
           overhead and to aggregate multiple events to the same file. The
           default is 30 seconds.

   monauxinterval

           Period (in seconds) at which the real time monitor will regenerate
           the auxiliary databases (spelling, stemming) if needed. The
           default is one hour.

   monioniceclass, monioniceclassdata

           These allow defining the ionice class and data used by the indexer
           (default class 3, no data).

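           For reference, a recoll.conf fragment spelling out the defaults
           stated above for the real time monitor would look like this (one
           hour is written as 3600 seconds; monioniceclassdata is left
           unset):

 monixinterval = 30
 monauxinterval = 3600
 monioniceclass = 3
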
   filtermaxseconds

           Maximum filter execution time (in seconds), after which it is
           aborted. Some PostScript programs just loop...

   filtersdir

           A directory to search for the external filter scripts used to
           index some types of files. The value should not be changed, except
           if you want to modify one of the default scripts. The value can be
           redefined for any sub-directory.

   iconsdir

           The name of the directory where recoll result list icons are
           stored. You can change this if you want different images.

   idxabsmlen

           Recoll stores an abstract for each indexed file inside the
           database. The text can come from an actual 'abstract' section in
           the document or will just be the beginning of the document. It is
           stored in the index so that it can be displayed inside the result
           lists without decoding the original file. The idxabsmlen parameter
           defines the size of the stored abstract. The default value is 250
           bytes. The search interface gives you the choice to display this
           stored text or a synthetic abstract built by extracting text
           around the search terms. If you always prefer the synthetic
           abstract, you can reduce this value and save a little space.

   aspellLanguage

           Language definitions to use when creating the aspell dictionary.
           The value must match a set of aspell language definition files.
           You can type "aspell config" to see where these are installed
           (look for data-dir). If the variable is not set, the default is to
           use your desktop national language environment to guess the value.

   noaspell

           If this is set, the aspell dictionary generation is turned off.
           Useful for cases where you don't need the functionality or when it
           is unusable because aspell crashes during dictionary generation.

   mhmboxquirks

           This allows defining location-related quirks for the mailbox
           handler. Currently only the tbird flag is defined, and it should
           be set for directories which hold Thunderbird data, as their
           folder format is weird.

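           A minimal sketch, assuming that the Thunderbird folders live under
           the hypothetical directory shown and that the flag is set in a
           recoll.conf section named after that directory:

 [/home/me/.thunderbird]
 mhmboxquirks = tbird
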
5.4.2. The fields file

   This file contains information about dynamic fields handling in Recoll.
   Some very basic fields have hard-wired behaviour, and, mostly, you should
   not change the original data inside the fields file. But you can create
   custom fields fitting your data and handle them as if they were native
   ones.

   The fields file has several sections, which each define an aspect of
   fields processing. Quite often, you'll have to modify several sections to
   obtain the desired behaviour.

   We will only give a short description here; you should refer to the
   comments inside the file for more detailed information.

   Field names should be lowercase alphabetic ASCII.

   [prefixes]

           A field becomes indexed (searchable) by having a prefix defined in
           this section.

   [stored]

           A field becomes stored (displayable inside results) by having its
           name listed in this section (typically with an empty value).

   [aliases]

           This section defines lists of synonyms for the canonical names
           used inside the [prefixes] and [stored] sections.

   filter-specific sections

           Some filters may need specific configuration for handling fields.
           Only the email message filter currently has such a section (named
           [mail]). It allows indexing arbitrary email headers in addition to
           the ones indexed by default. Other such sections may appear in the
           future.

   Here follows a small example of a personal fields file. This would extract
   a specific email header and use it as a searchable field, with data
   displayable inside result lists. (Side note: as the email filter does no
   decoding on the values, only plain ASCII headers can be indexed, and only
   the first occurrence will be used for headers that occur several times.)

 [prefixes]
 # Index mailmytag contents (with the given prefix)
 mailmytag = XMTAG

 [stored]
 # Store mailmytag inside the document data record (so that it can be
 # displayed - as %(mailmytag) - in result lists).
 mailmytag =

 [mail]
 # Extract the X-My-Tag mail header, and use it internally with the
 # mailmytag field name
 x-my-tag = mailmytag

5.4.3. The mimemap file

   mimemap specifies the file name extension to mime type mappings.

   For file names without an extension, or with an unknown one, the system's
   file -i command will be executed to determine the mime type (this can be
   switched off inside the main configuration file).

   The mappings can be specified on a per-subtree basis, which may be useful
   in some cases. Example: gaim logs have a .txt extension but should be
   handled specially, which is possible because they are usually all located
   in one place.

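   As a sketch of such a per-subtree mapping, a section named after the
   directory can hold the override. The ~/.gaim location and the
   text/x-gaim-log type name used here are assumptions for illustration:

 [~/.gaim]
 .txt = text/x-gaim-log
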
   mimemap also has a recoll_noindex variable which is a list of suffixes.
   Matching files will be skipped (which avoids unnecessary decompressions or
   file executions). This is partially redundant with skippedNames in the
   main configuration file, with a few differences: it will not affect
   directories, it cannot be made dependent on the file-system location (it
   is a configuration-wide parameter), and the file names will still be
   indexed (not even the file names are indexed for patterns in
   skippedNames). recoll_noindex is used mostly for things known to be
   unindexable by a given Recoll version. Having it there avoids cluttering
   the more user-oriented and locally customized skippedNames.

5.4.4. The mimeconf file

   mimeconf specifies how the different mime types are handled for indexing,
   and which icons are displayed in the recoll result lists.

   Changing the parameters in the [index] section is probably not a good idea
   except if you are a Recoll developer.

   The [icons] section allows you to change the icons which are displayed by
   recoll in the result lists. The values are the basenames of the PNG images
   inside the iconsdir directory (specified in recoll.conf).

5.4.5. The mimeview file

   mimeview specifies which programs are started when you click on an Open
   link in a result list. For example, HTML is normally displayed using
   firefox, but you may prefer Konqueror, or your openoffice.org program
   might be named ooffice instead of openoffice.

   Changes to this file can be done by direct editing, or through the recoll
   GUI preferences dialog.

   If "Use desktop preferences to choose document editor" is checked in the
   Recoll GUI preferences, all mimeview entries will be ignored except the
   one labelled application/x-all (which is set to use xdg-open by default).

   In this case, the xallexcepts top level variable defines a list of mime
   type exceptions which will be processed according to the local entries
   instead of being passed to the desktop. This is so that specific Recoll
   options such as a page number or a search string can be passed to
   applications that support them, such as the evince viewer.

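   For instance, assuming evince is installed and accepts the --page-index
   and --find options, a personal mimeview could keep PDF files out of the
   desktop handling and pass the page and search term (the %p, %s and %f
   substitutions are described further down):

 xallexcepts = application/pdf

 [view]
 application/pdf = evince --page-index=%p --find=%s %f
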
   As for the other configuration files, the normal usage is to have a
   mimeview inside your own configuration directory, with just the
   non-default entries, which will override those from the central
   configuration file.

   All viewer definition entries must be placed under a [view] section.

   The keys in the file are normally mime types. You can add an application
   tag to specialize the choice for an area of the filesystem (using a
   localfields specification in the main configuration file). The syntax for
   the key is mimetype|tag.

   The nouncompforviewmts entry (placed at the top level, outside of the
   [view] section) holds a list of mime types that should not be
   uncompressed before starting the viewer (if they are found compressed, for
   example mydoc.doc.gz).

   The right side of each assignment holds a command to be executed for
   opening the file. The following substitutions are performed:

     * %D. Document date.

     * %f. File name. This may be the name of a temporary file if it was
       necessary to create one (for example to extract a subdocument from a
       container).

     * %F. Original file name. Same as %f except if a temporary file is used.

     * %i. Internal path, for subdocuments of containers. The format depends
       on the container type. If this appears in the command line, Recoll
       will not create a temporary file to extract the subdocument, expecting
       the called application (possibly a script) to be able to handle it.

     * %M. Mime type.

     * %p. Page index. Only significant for a subset of document types,
       currently only PDF, PostScript and DVI files. Can be used to start the
       editor at the right page for a match or snippet.

     * %s. Search term. The value will only be set for documents with indexed
       page numbers (for example PDF). The value will be one of the matched
       search terms. It would allow pre-setting the value in the "Find" entry
       inside Evince for example, for easy highlighting of the term.

     * %U, %u. URL.

   In addition to the predefined values above, all strings like %(fieldname)
   will be replaced by the value of the field named fieldname for the
   document. This could be used in combination with field customisation to
   help with opening the document.

5.4.6. Examples of configuration adjustments

  5.4.6.1. Adding an external viewer for a non-indexed type

   Imagine that you have some kind of file which does not have indexable
   content, but for which you would like to have a functional Open link in
   the result list (when found by file name). The file names end in .blob and
   can be displayed by the blobviewer application.

   You need two entries in the configuration files for this to work:

     * In $RECOLL_CONFDIR/mimemap (typically ~/.recoll/mimemap), add the
       following line:

 .blob = application/x-blobapp

       Note that the mime type is made up here, and you could call it
       diesel/oil just the same.

     * In $RECOLL_CONFDIR/mimeview under the [view] section, add:

 application/x-blobapp = blobviewer %f

       We are supposing that blobviewer wants a file name parameter here; you
       would use %u if it liked URLs better.

   If you just wanted to change the application used by Recoll to display a
   mime type which it already knows, you would just need to edit mimeview.
   The entries you add in your personal file override those in the central
   configuration, which you do not need to alter. mimeview can also be
   modified from the GUI.

  5.4.6.2. Adding indexing support for a new file type

   Let us now imagine that the above .blob files actually contain indexable
   text and that you know how to extract it with a command line program.
   Getting Recoll to index the files is easy. You need to perform the above
   alteration, and also to add data to the mimeconf file (typically in
   ~/.recoll/mimeconf):

     * Under the [index] section, add the following line (more about the
       rclblob indexing script later):

 application/x-blobapp = exec rclblob

     * Under the [icons] section, you should choose an icon to be displayed
       for the files inside the result lists. Icons are normally 64x64 pixels
       PNG files which live in /usr/[local/]share/recoll/images.

     * Under the [categories] section, you should add the mime type where it
       makes sense (you can also create a category). Categories may be used
       for filtering in advanced search.

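   Put together, the additions to the personal mimeconf might look like the
   following sketch. The icon basename blob is hypothetical (it assumes that
   a blob.png file exists in the icons directory), and the [categories]
   change is left out because it depends on the existing category lists:

 [index]
 application/x-blobapp = exec rclblob

 [icons]
 application/x-blobapp = blob
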
   The rclblob filter should be an executable program or script which exists
   inside /usr/[local/]share/recoll/filters. It will be given a file name as
   argument and should output the text or HTML contents on the standard
   output.

   The filter programming section describes in more detail how to write a
   filter.