a/src/INSTALL b/src/INSTALL
More documentation can be found in the doc/ directory or at http://www.recoll.org
Chapter 5. Installation and configuration
16
17
5.1. Installing a binary copy
18
19
   There are three types of binary Recoll installations:
20
21
     o Through your system's normal software distribution framework (for
       example Debian/Ubuntu apt or FreeBSD ports).
23
24
     o From a package downloaded from the Recoll web site.
25
26
     o From a prebuilt tree downloaded from the Recoll web site.
27
28
   In all cases, the strict software dependencies (e.g. on Xapian or iconv)
   will be satisfied automatically, so you should not have to worry about
   them.
30
31
   You will only have to check or install supporting applications for the
32
   file types that you want to index beyond those that are natively processed
33
   by Recoll (text, HTML, email files, and a few others).
34
35
   You may also want to look at the configuration section (this may not be
   necessary for a quick test with default parameters). Most parameters can
   be set more conveniently from the GUI.
38
39
  5.1.1. Installing through a package system
40
41
   If you use a BSD-type port system or a prebuilt package (DEB, RPM,
42
   manually or through the system software configuration utility), just
43
   follow the usual procedure for your system.
44
45
  5.1.2. Installing a prebuilt Recoll
46
47
   The unpackaged binary versions on the Recoll web site are just compressed
48
   tar files of a build tree, where only the useful parts were kept
49
   (executables and sample configuration).
50
51
   The executable binary files are built with a static link to libxapian and
52
   libiconv, to make installation easier (no dependencies).
53
54
   After extracting the tar file, you can proceed with installation as if you
55
   had built the package from source (that is, just type make install). The
56
   binary trees are built for installation to /usr/local.
57
5.2. Supporting packages
73
74
   Recoll uses external applications to index some file types. You need to
   install them for the file types that you wish to have indexed. These are
   optional run-time dependencies: none is needed for building or running
   Recoll, only for indexing its specific file type.
78
79
   After an indexing pass, the commands that were found missing can be
80
   displayed from the recoll File menu. The list is stored in the missing
81
   text file inside the configuration directory.
82
83
   A list of common file types which need external commands follows. Many of
84
   the filters need the iconv command, which is not always listed as a
   dependency.
86
87
   Please note that, due to the relatively dynamic nature of this
88
   information, the most up to date version is now kept on the Recoll helper
89
   applications page along with links to the home pages or best
90
   source/patches pages, and misc tips. The list below is not updated often
91
   and may be quite stale.
92
93
   For many Linux distributions, most of the commands listed can be installed
94
   from the package repositories. However, the packages are sometimes
95
   outdated, or not the best version for Recoll, so you should take a look at
96
   the Recoll helper applications page if a file type is important to you.
97
98
   As of Recoll release 1.14, a number of XML-based formats that were handled
99
   by ad hoc filter code now use the xsltproc command, which usually comes
100
   with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
101
102
   Now for the list:
103
104
     o Openoffice files need unzip and xsltproc.
105
106
     o PDF files need pdftotext which is part of the Xpdf or Poppler
107
       packages.
108
109
     o Postscript files need pstotext. The original version has an issue with
       shell characters in file names, which is corrected in recent packages.
       See the Recoll helper applications page for more detail.
112
113
     o MS Word needs antiword. It is also useful to have wvWare installed as
       it may be used as a fallback for some files which antiword does not
       handle.
116
117
     o MS Excel and PowerPoint need catdoc.
118
119
     o MS Open XML (docx) needs xsltproc.
120
121
     o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
122
       Ubuntu) package.
123
124
     o RTF files need unrtf, which, in its standard version, has much trouble
125
       with non-western character sets. Check the Recoll helper applications
126
       page.
127
128
     o TeX files need untex or detex. Check the Recoll helper applications
129
       page for sources if it's not packaged for your distribution.
130
131
     o dvi files need dvips.
132
133
     o djvu files need djvutxt and djvused from the DjVuLibre package.
134
135
     o Audio files: Recoll releases before 1.13 used the id3info command from
136
       the id3lib package to extract mp3 tag information, metaflac (standard
137
       flac tools) for flac files, and ogginfo (vorbis tools) for ogg files.
138
       Releases 1.14 and later use a single Python filter based on mutagen
139
       for all audio file types.
140
141
     o Pictures: Recoll uses the Exiftool Perl package to extract tag
142
       information. Most image file formats are supported. Note that there
143
       may not be much interest in indexing the technical tags (image size,
144
       aperture, etc.). This is only of interest if you store personal tags
145
       or textual descriptions inside the image files.
146
147
     o chm: files in Microsoft help format need Python and the pychm module
       (which needs chmlib).
149
150
     o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
151
       module. icalendar is not needed for newer versions, which use internal
152
       code.
153
154
     o Zip archives need Python (and the standard zipfile module).
155
156
     o Rar archives need Python, the rarfile Python module and the unrar
157
       utility.
158
159
     o Midi karaoke files need Python and the Midi module.

     o Konqueror webarchive files need Python (the tarfile module is used).
162
163
     o mimehtml web archive format (support based on the email filter, which
164
       introduces some mild weirdness, but still usable).
165
166
   Text, HTML, email folders, and Scribus files are processed internally. Lyx
167
   is used to index Lyx files. Many filters need iconv and the standard sed
168
   and awk.
169
5.3. Building from source
185
186
  5.3.1. Prerequisites
187
188
   C++ compiler. Up to Recoll version 1.13.04, its absence can manifest
189
   itself by strange messages about a missing iconv_open.
190
191
   Development files for Xapian core.
192
193
  Important
194
195
   If you are building Xapian for an older CPU (before Pentium 4 or Athlon
   64), you need to add the --disable-sse flag to the configure command.
   Otherwise all Xapian applications will crash with an illegal instruction
   error.
198
199
   Development files for Qt.
200
201
   Development files for X11 and zlib.
202
203
   Check the Recoll download page for up to date version information.
204
205
   You will most probably be able to find a binary package for Qt for your
206
   system. You may have to compile Xapian but this is not difficult (if you
207
   are using FreeBSD, there is a port).
208
209
   You may also need libiconv. Recoll currently uses version 1.9 (this should
210
   not be critical). On Linux systems, the iconv interface is part of libc
211
   and you should not need to do anything special.
212
213
  5.3.2. Building
214
215
   Recoll has been built on Linux, FreeBSD, Mac OS X, and Solaris. Most
   versions after 2005 should be ok, and maybe some older ones too (Solaris 8
   is ok). If you build on another system and need to modify things, I would
   very much welcome patches.
219
220
   Depending on the Qt 3 configuration on your system, you may have to set
221
   the QTDIR and QMAKESPECS variables in your environment:
222
223
     o QTDIR should point to the directory above the one that holds the qt
       include files (e.g. if qt.h is /usr/local/qt/include/qt.h, QTDIR
       should be /usr/local/qt).

     o QMAKESPECS should be set to the name of one of the Qt mkspecs
       sub-directories (e.g. linux-g++).
229
230
   On many Linux systems, QTDIR is set by the login scripts, and QMAKESPECS
231
   is not needed because there is a default link in mkspecs/.
232
233
   Neither QTDIR nor QMAKESPECS should be needed with Qt 4, where
   configuration details are entirely determined by qmake (which is quite
   often installed as qmake-qt4).
236
237
   Configure options: 
238
239
     o --without-aspell will disable the code for phonetic matching of search
240
       terms.
241
242
     o --with-fam or --with-inotify will enable the code for real time
243
       indexing. Inotify support is enabled by default on recent Linux
244
       systems.
245
246
     o --disable-webkit is available from version 1.17 to implement the
247
       result list with a Qt QTextBrowser instead of a WebKit widget if you
248
       do not or can't depend on the latter.
249
250
     o --enable-xattr will enable code to fetch data from file extended
       attributes. This is only useful if some application stores data in
       there, and also needs some simple configuration (see comments in the
       fields configuration file).
254
255
     o --enable-camelcase will enable splitting camelCase words. This is not
       enabled by default as it has the unfortunate side-effect of making
       some phrase searches quite confusing: e.g. "MySQL manual" would be
       matched by "MySQL manual" and "my sql manual" but not "mysql manual"
       (only inside phrase searches).
260
261
     o --with-file-command Specify the version of the 'file' command to use
       (e.g. --with-file-command=/usr/local/bin/file). Can be useful to
       enable the GNU version on systems where the native one is bad.
264
265
     o --disable-qtgui Disable the Qt interface. Will allow building the
266
       indexer and the command line search program in absence of a Qt
267
       environment.
268
269
     o --disable-x11mon Disable X11 connection monitoring inside recollindex.
270
       Together with --disable-qtgui, this allows building recoll without Qt
271
       and X11.
272
273
     o Of course, the usual autoconf configure options, like --prefix, apply.
274
275
   Normal procedure:
276
277
         cd recoll-xxx
         ./configure
         make
         (practices usual hardship-repelling invocations)
281
      
282
283
   There is little auto-configuration. The configure script will mainly link
284
   one of the system-specific files in the mk directory to mk/sysconf. If
285
   your system is not known yet, it will tell you as much, and you may want
286
   to manually copy and modify one of the existing files (the new file name
287
   should be the output of uname -s).
288
289
  5.3.3. Installation
290
291
   Either type make install or execute recollinstall prefix, in the root of
292
   the source tree. This will copy the commands to prefix/bin and the sample
293
   configuration files, scripts and other shared data to prefix/share/recoll.
294
295
   If the installation prefix given to recollinstall is different from either
296
   the system default or the value which was specified when executing
297
   configure (as in configure --prefix /some/path), you will have to set the
298
   RECOLL_DATADIR environment variable to indicate where the shared data is
299
   to be found (e.g. for (ba)sh: export
   RECOLL_DATADIR=/some/path/share/recoll).
301
302
   You can then proceed to configuration.
303
5.4. Configuration overview
318
319
   Most of the parameters specific to the recoll GUI are set through the
320
   Preferences menu and stored in the standard Qt place
321
   ($HOME/.config/Recoll.org/recoll.conf). You probably do not want to edit
322
   this by hand.
323
324
   Recoll indexing options are set inside text configuration files located in
   a configuration directory. There can be several such directories, each of
   which defines the parameters for one index.
327
328
   The configuration files can be edited by hand or through the Index
329
   configuration dialog (Preferences menu). The GUI tool will try to respect
330
   your formatting and comments as much as possible, so it is quite possible
331
   to use both ways.
332
333
   The most accurate documentation for the configuration parameters is given
334
   by comments inside the default files, and we will just give a general
335
   overview here.
336
337
   For each index, there are two sets of configuration files. System-wide
338
   configuration files are kept in a directory named like
339
   /usr/[local/]share/recoll/examples, and define default values, shared by
340
   all indexes. For each index, a parallel set of files defines the
341
   customized parameters.
342
343
   The default location of the configuration is the .recoll directory in your
344
   home. Most people will only use this directory.
345
346
   This location can be changed, or others can be added with the
347
   RECOLL_CONFDIR environment variable or the -c option parameter to recoll
348
   and recollindex.
349
350
   If the .recoll directory does not exist when recoll or recollindex are
351
   started, it will be created with a set of empty configuration files.
352
   recoll will give you a chance to edit the configuration file before
353
   starting indexing. recollindex will proceed immediately. To avoid
354
   mistakes, the automatic directory creation will only occur for the default
355
   location, not if -c or RECOLL_CONFDIR were used (in the latter cases, you
356
   will have to create the directory).
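
   The selection of the configuration directory described above can be
   sketched as follows. This is only an illustration: the function name
   resolve_confdir is made up, and the assumption that -c takes precedence
   over RECOLL_CONFDIR is mine and is not stated in this section.

```python
import os

def resolve_confdir(cli_confdir=None, environ=None):
    """Return (directory, may_autocreate) for a Recoll configuration.

    Hypothetical helper: cli_confdir stands for the value given to -c.
    Assumption: -c wins over RECOLL_CONFDIR when both are given.
    """
    environ = os.environ if environ is None else environ
    if cli_confdir:
        return cli_confdir, False          # explicit location: never auto-created
    if environ.get("RECOLL_CONFDIR"):
        return environ["RECOLL_CONFDIR"], False
    # Default location: ~/.recoll, created automatically if missing.
    return os.path.join(os.path.expanduser("~"), ".recoll"), True
```

   The second element of the result reflects the rule that automatic
   creation only happens for the default location.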
357
358
   All configuration files share the same format. For example, a short
359
   extract of the main configuration file might look as follows:
360
361
         # Space-separated list of directories to index.
362
         topdirs =  ~/docs /usr/share/doc
363
364
         [~/somedirectory-with-utf8-txt-files]
365
         defaultcharset = utf-8
366
        
367
368
   There are three kinds of lines:
369
370
     o Comment (starts with #) or empty.
371
372
     o Parameter assignment (name = value).
373
374
     o Section definition ([somedirname]).
375
376
   Depending on the type of configuration file, section definitions either
377
   separate groups of parameters or allow redefining some parameters for a
378
   directory sub-tree. They stay in effect until another section definition,
379
   or the end of file, is encountered. Some of the parameters used for
380
   indexing are looked up hierarchically from the current directory location
381
   upwards. Not all parameters can be meaningfully redefined; this is
   specified for each in the next section.
383
384
   When found at the beginning of a file path, the tilde character (~) is
385
   expanded to the name of the user's home directory, as a shell would do.
386
387
   White space is used for separation inside lists. List elements with
388
   embedded spaces can be quoted using double-quotes.
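
   To make the format concrete, here is a minimal Python sketch of a reader
   for such files. It is not Recoll's own parser: parse_recoll_conf and
   split_list are names invented for this example, parameters found before
   any section are stored under the empty string, and the handling of a
   trailing backslash as a line continuation is inferred from the
   skippedNames listing in this chapter.

```python
import os
import shlex

def parse_recoll_conf(text):
    """Parse configuration text into {section: {name: value}}.

    Hypothetical helper for illustration only.
    """
    sections = {"": {}}
    current = ""
    pending = ""
    for raw in text.splitlines():
        line = pending + raw.strip()
        pending = ""
        if line.endswith("\\"):            # value continues on the next line
            pending = line[:-1] + " "
            continue
        if not line or line.startswith("#"):
            continue                        # comment or empty line
        if line.startswith("[") and line.endswith("]"):
            # Section definition; a leading ~ is expanded like a shell would.
            current = os.path.expanduser(line[1:-1].strip())
            sections.setdefault(current, {})
            continue
        name, sep, value = line.partition("=")
        if sep:                             # parameter assignment
            sections[current][name.strip()] = value.strip()
    return sections

def split_list(value):
    """Split a list value on white space, keeping double-quoted elements whole.

    shlex also interprets single quotes and backslashes, which the real
    parser may not do.
    """
    return shlex.split(value)
```

   For instance, split_list('a "b c" d') yields three elements, the second
   one containing the embedded space.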
389
390
   Encoding issues. Most of the configuration parameters are plain ASCII. Two
391
   particular sets of values may cause encoding issues:
392
393
     o File path parameters may contain non-ascii characters and should use
394
       the exact same byte values as found in the file system directory.
395
       Usually, this means that the configuration file should use the system
396
       default locale encoding.
397
398
     o The unac_except_trans parameter should be encoded in UTF-8. If your
399
       system locale is not UTF-8, and you need to also specify non-ascii
400
       file paths, this poses a difficulty because common text editors cannot
401
       handle multiple encodings in a single file. In this relatively
402
       unlikely case, you can edit the configuration file as two separate
403
       text files with appropriate encodings, and concatenate them to create
404
       the complete configuration.
405
406
  5.4.1. Main configuration file
407
408
   recoll.conf is the main configuration file. It defines things like what to
409
   index (top directories and things to ignore), and the default character
410
   set to use for document types which do not specify it internally.
411
412
   The default configuration will index your home directory. If this is not
413
   appropriate, start recoll to create a blank configuration, click Cancel,
414
   and edit the configuration file before restarting the command. This will
415
   start the initial indexing, which may take some time.
416
417
   Most of the following parameters can be changed from the Index
418
   Configuration menu in the recoll interface. Some can only be set by
419
   editing the configuration file.
420
421
    5.4.1.1. Parameters affecting what documents we index:
422
423
   topdirs
424
425
           Specifies the list of directories or files to index (recursively
426
           for directories). You can use symbolic links as elements of this
427
           list. See the followLinks option about following symbolic links
428
           found under the top elements (not followed by default).
429
430
   skippedNames
431
432
           A space-separated list of patterns for names of files or
433
           directories that should be completely ignored. The list defined in
434
           the default file is:
435
436
 skippedNames = #* bin CVS  Cache cache* caughtspam  tmp .thumbnails .svn \
437
                *~ .beagle .git .hg .bzr loop.ps .xsession-errors \
438
                .recoll* xapiandb recollrc recoll.conf
439
440
           The list can be redefined at any sub-directory in the indexed
441
           area.
442
443
           The top-level directories are not affected by this list (that is,
444
           a directory in topdirs might match and would still be indexed).
445
446
           The list in the default configuration does not exclude hidden
447
           directories (names beginning with a dot), which means that it may
448
           index quite a few things that you do not want. On the other hand,
449
           email user agents like thunderbird usually store messages in
450
           hidden directories, and you probably want this indexed. One
451
           possible solution is to have .* in skippedNames, and add things
452
           like ~/.thunderbird or ~/.evolution in topdirs.
453
454
           Not even the file names are indexed for patterns in this list. See
455
           the recoll_noindex variable in mimemap for an alternative approach
456
           which indexes the file names.
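
           The effect of the default list can be tried out with a few lines
           of Python. This sketch assumes shell-style wildcard matching on
           the bare file or directory name; is_skipped is a name made up
           for the example and the real matching code may differ in
           details.

```python
from fnmatch import fnmatchcase

# Default list quoted above (from the sample configuration).
SKIPPED_NAMES = (
    "#* bin CVS Cache cache* caughtspam tmp .thumbnails .svn "
    "*~ .beagle .git .hg .bzr loop.ps .xsession-errors "
    ".recoll* xapiandb recollrc recoll.conf"
).split()

def is_skipped(name, patterns=SKIPPED_NAMES):
    """Return True if a file or directory name matches one of the patterns."""
    return any(fnmatchcase(name, pat) for pat in patterns)
```

           With the default list, is_skipped(".thunderbird") is false, so
           hidden directories are indexed. Adding ".*" to the patterns makes
           it true, which is why such directories then have to be listed in
           topdirs.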
457
458
   skippedPaths and daemSkippedPaths
459
460
           A space-separated list of patterns for paths of files or
461
           directories that should be skipped. There is no default in the
462
           sample configuration file, but the code always adds the
463
           configuration and database directories in there.
464
465
           skippedPaths is used both by batch and real time indexing.
466
           daemSkippedPaths can be used to specify things that should be
467
           indexed at startup, but not monitored.
468
469
           Example of use for skipping text files only in a specific
470
           directory:
471
472
 skippedPaths = ~/somedir/*.txt
473
              
474
475
   skippedPathsFnmPathname
476
477
           The values in the *skippedPaths variables are matched by default
478
           with fnmatch(3), with the FNM_PATHNAME and FNM_LEADING_DIR flags.
479
           This means that '/' characters must be matched explicitly. You
480
           can set skippedPathsFnmPathname to 0 to disable the use of
481
           FNM_PATHNAME (meaning that /*/dir3 will match /dir1/dir2/dir3).
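
           The difference made by FNM_PATHNAME can be illustrated with a
           small Python emulation. Python's own fnmatch module has no such
           flag, so this sketch translates patterns to regular expressions
           by hand; it only understands '*' and '?', and the function names
           are invented for the example.

```python
import re

def glob_to_regex(pattern, pathname=True):
    """Translate a pattern using only '*' and '?' into a regular expression.

    With pathname=True, wildcards do not match '/', which mimics the
    effect of FNM_PATHNAME. Bracket expressions are not handled.
    """
    any_char = "[^/]" if pathname else "."
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(any_char + "*")
        elif ch == "?":
            parts.append(any_char)
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def path_matches(pattern, path, pathname=True):
    """Match in the manner of FNM_LEADING_DIR: the pattern may match the
    whole path, or a leading part of it that is followed by '/'."""
    regex = glob_to_regex(pattern, pathname) + r"(/.*)?\Z"
    return re.match(regex, path, re.DOTALL) is not None
```

           Here path_matches("/*/dir3", "/dir1/dir2/dir3") is false, but it
           becomes true when called with pathname=False, which corresponds
           to setting skippedPathsFnmPathname to 0.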
482
483
   followLinks
484
485
           Specifies if the indexer should follow symbolic links while
486
           walking the file tree. The default is to ignore symbolic links to
487
           avoid multiple indexing of linked files. No effort is made to
488
           avoid duplication when this option is set to true. This option can
489
           be set individually for each of the topdirs members by using
490
           sections. It cannot be changed below the topdirs level.
491
492
   indexedmimetypes
493
494
           Recoll normally indexes any file which it knows how to read. This
495
           list lets you restrict the indexed mime types to what you specify.
496
           If the variable is unspecified or the list empty (the default),
497
           all supported types are processed.
498
499
   compressedfilemaxkbs
500
501
           Size limit for compressed (.gz or .bz2) files. These need to be
502
           decompressed in a temporary directory for identification, which
503
           can be very wasteful if 'uninteresting' big compressed files are
504
           present. Negative means no limit, 0 means no processing of any
505
           compressed file. Defaults to -1.
506
507
   textfilemaxmbs
508
509
           Maximum size for text files. Very big text files are often
510
           uninteresting logs. Set to -1 to disable (default 20MB).
511
512
   textfilepagekbs
513
514
           If set to other than -1, text files will be indexed as multiple
515
           documents of the given page size. This may be useful if you do
516
           want to index very big text files as it will both reduce memory
517
           usage at index time and help with loading data to the preview
518
           window. A size of a few megabytes would seem reasonable (default:
519
           1MB).
520
521
   membermaxkbs
522
523
           This defines the maximum size in kilobytes for an archive member
524
           (zip, tar or rar at the moment). Bigger entries will be skipped.
525
526
   indexallfilenames
527
528
           Recoll indexes file names in a special section of the database to
529
           allow specific file names searches using wild cards. This
530
           parameter decides if file name indexing is performed only for
531
           files with mime types that would qualify them for full text
532
           indexing, or for all files inside the selected subtrees,
533
           independently of mime type.
534
535
   usesystemfilecommand
536
537
           Decide if we use the file -i system command as a final step for
538
           determining the mime type for a file (the main procedure uses
539
           suffix associations as defined in the mimemap file). This can be
540
           useful for files with suffix-less names, but it will also cause
541
           the indexing of many bogus "text" files.
542
543
   processwebqueue
544
545
           If this is set, process the directory where Web browser plugins
546
           copy visited pages for indexing.
547
548
   webqueuedir
549
550
           The path to the web indexing queue. This is hard-coded in the
551
           Firefox plugin as ~/.recollweb/ToIndex so there should be no need
552
           to change it.
553
554
    5.4.1.2. Parameters affecting how we generate terms:
555
556
   Changing some of these parameters will imply a full reindex. Also, when
557
   using multiple indexes, it may not make sense to search indexes that don't
558
   share the values for these parameters, because they usually affect both
559
   search and index operations.
560
561
   indexStripChars
562
563
           Decide if we strip characters of diacritics and convert them to
564
           lower-case before terms are indexed. If we don't, searches
565
           sensitive to case and diacritics can be performed, but the index
566
           will be bigger, and some marginal weirdness may sometimes occur.
567
           The default is a stripped index (indexStripChars = 1) for now.
568
           When using multiple indexes for a search, this parameter must be
569
           defined identically for all. Changing the value implies an index
570
           reset.
571
572
   maxTermExpand
573
574
           Maximum expansion count for a single term (e.g.: when using
575
           wildcards). The default of 10000 is reasonable and will avoid
576
           queries that appear frozen while the engine is walking the term
577
           list.
578
579
   maxXapianClauses
580
581
           Maximum number of elementary clauses we can add to a single Xapian
582
           query. In some cases, the result of term expansion can be
583
           multiplicative, and we want to avoid using excessive memory. The
584
           default of 100 000 should be both high enough in most cases and
585
           compatible with current typical hardware configurations.
586
587
   nonumbers
588
589
           If this is set to true, no terms will be generated for numbers.
           For example "123", "1.5e6" or "192.168.1.4" would not be indexed
           ("value123" would still be). Numbers are often quite interesting
           to search for, and this should probably not be set except for
           special situations, e.g. scientific documents with huge amounts of
           numbers in them. This can only be set for a whole index, not for a
           subtree.
596
597
   nocjk
598
599
           If this is set to true, specific East Asian (Chinese, Japanese,
           Korean) character/word splitting is turned off. This will save a
           small amount of CPU time if you have no CJK documents. If your
           document base does include such text but you are not interested in
           searching it, setting nocjk may be a significant time and space
           saver.
604
605
   cjkngramlen
606
607
           This lets you adjust the size of n-grams used for indexing CJK
608
           text. The default value of 2 is probably appropriate in most
609
           cases. A value of 3 would allow more precision and efficiency on
610
           longer words, but the index will be approximately twice as large.
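
           As a simplified picture of what an n-gram size means, the
           following sketch cuts a string into overlapping n-grams. The
           function name is made up, and the actual splitter may emit
           additional terms, so take this as an illustration of the
           principle only.

```python
def cjk_ngrams(text, n=2):
    """Return the overlapping character n-grams of a string."""
    if len(text) < n:
        return [text] if text else []
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

           A five-character string gives four terms with n=2 and three
           longer terms with n=3.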
611
612
   indexstemminglanguages
613
614
           A list of languages for which the stem expansion databases will be
615
           built. See recollindex(1) or use the recollindex -l command for
616
           possible values. You can add a stem expansion database for a
617
           different language by using recollindex -s, but it will be deleted
618
           during the next indexing. Only languages listed in the
619
           configuration file are permanent.
620
621
   defaultcharset
622
623
           The name of the character set used for files that do not contain a
           character set definition (e.g. plain text files). This can be
           redefined for any sub-directory. If it is not set at all, the
           character set used is the one defined by the NLS environment
           (LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
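
           The fallback order can be pictured with the sketch below. It is
           an approximation written for this text, not the indexer's code:
           it assumes the usual precedence LC_ALL, then LC_CTYPE, then LANG,
           and simply reads the codeset part of the locale name.

```python
def fallback_charset(environ):
    """Pick a character set name from locale variables, else iso8859-1.

    Hypothetical helper. The first non-empty variable decides; locale
    names look like "fr_FR.UTF-8" or "de_DE.ISO8859-15@euro".
    """
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = environ.get(var)
        if value:
            _, dot, codeset = value.partition(".")
            if dot:
                return codeset.split("@", 1)[0]   # drop any @modifier
            break                                  # e.g. "C": no codeset given
    return "iso8859-1"
```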
628
629
   unac_except_trans
630
631
           This is a list of characters, encoded in UTF-8, which should be
632
           handled specially when converting text to unaccented lowercase.
633
           For example, in Swedish, the letter a with diaeresis has full
634
           alphabet citizenship and should not be turned into an a. Each
635
           element in the space-separated list has the special character as
636
           first element and the translation following. The handling of both
637
           the lowercase and upper-case versions of a character should be
           specified, as membership in the list turns off both standard
           accent and case processing. Example for Swedish:
640
641
 unac_except_trans =  åå Åå ää Ää öö Öö

           Note that the translation is not limited to a single character:
           you could very well have something like üue in the list.
646
647
           The default value set for unac_except_trans can't be listed here
648
           because I have trouble with SGML and UTF-8, but it only contains
649
           ligature decompositions: german ss, oe, ae, fi, fl.
650
651
           This parameter can't be defined for subdirectories, it is global,
652
           because there is no way to do otherwise when querying. If you have
653
           document sets which would need different values, you will have to
654
           index and query them separately.
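
           The way such a list is read (first character is the special
           character, the rest is its translation) can be sketched in
           Python. This assumes that the entries of the Swedish example
           stand for the letters å, ä and ö; the function names are
           invented, and the accent stripping shown here is a plain Unicode
           decomposition, not Recoll's unac code.

```python
import unicodedata

def parse_except_trans(value):
    """Map each special character to its replacement string."""
    return {item[0]: item[1:] for item in value.split()}

def strip_accents_lower(ch):
    """Default processing: remove combining marks, then lowercase."""
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()

def fold(text, exceptions):
    """Apply the exception table first, default processing otherwise."""
    return "".join(
        exceptions[ch] if ch in exceptions else strip_accents_lower(ch)
        for ch in text
    )
```

           With the Swedish table, the a with ring and the o with diaeresis
           survive (in lowercase), while the accent in a word like café is
           still removed.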
655
656
   maildefcharset
657
658
           This can be used to define the default character set specifically
659
           for email messages which don't specify it. This is mainly useful
660
           for readpst (libpst) dumps, which are utf-8 but do not say so.
661
662
   localfields
663
664
           This allows setting fields for all documents under a given
665
           directory. Typical usage would be to set an "rclaptg" field, to be
666
           used in mimeview to select a specific viewer. If several fields
           are to be set, they should be separated with a colon (':')
           character (which there is currently no way to escape). For
           example: localfields = rclaptg=gnus:other = val, then select a
           specific viewer with mimetype|tag=... in mimeview.
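
           The colon-separated syntax can be read as in the following
           sketch. parse_localfields is a made-up name, and the trimming of
           spaces around names and values is an assumption based on the
           example above.

```python
def parse_localfields(value):
    """Split a localfields value into a {field: value} dictionary.

    Hypothetical helper: fields are separated by ':' (which cannot be
    escaped), and each field has the form name = value.
    """
    fields = {}
    for item in value.split(":"):
        name, sep, val = item.partition("=")
        if sep and name.strip():
            fields[name.strip()] = val.strip()
    return fields
```

           Applied to the value from the example, this gives an rclaptg
           field set to gnus and a field named other set to val.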

    5.4.1.3. Parameters affecting where and how we store things:

   dbdir

           The name of the Xapian data directory. It will be created if
           needed when the index is initialized. If this is not an absolute
           path, it will be interpreted relative to the configuration
           directory. The value can have embedded spaces, but leading or
           trailing spaces will be trimmed. You cannot use quotes here.

   idxstatusfile

           The name of the scratch file where the indexer process updates its
           status. Default: idxstatus.txt inside the configuration directory.

   maxfsoccuppc

           Maximum file system occupation before we stop indexing. The value
           is a percentage, corresponding to what the "Capacity" df output
           column shows. The default value is 0, meaning no checking.

   mboxcachedir

           The directory where mbox message offsets cache files are held.
           This is normally $RECOLL_CONFDIR/mboxcache, but it may be useful
           to share a directory between different configurations.

   mboxcacheminmbs

           The minimum mbox file size over which we cache the offsets. There
           is really no sense in caching offsets for small files. The default
           is 5 MB.

   webcachedir

           This is only used by the web browser plugin indexing code, and
           defines where the cache for visited pages will live. Default:
           $RECOLL_CONFDIR/webcache

   webcachemaxmbs

           This is only used by the web browser plugin indexing code, and
           defines the maximum size for the web page cache. Default: 40 MB.

   idxflushmb

           Threshold (megabytes of new text data) at which we flush from
           memory to the disk index. Setting this can help control memory
           usage. A value of 0 means no explicit flushing, letting Xapian
           use its own default, which is to flush every 10000 (or
           XAPIAN_FLUSH_THRESHOLD) documents. This gives little control over
           memory usage, which also depends on the average document size.
           The default value is 10, and it is probably a bit low. If your
           system usually has free memory, you can try higher values between
           20 and 80. In my experience, values beyond 100 are always
           counterproductive.
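
           As a rough sketch, the storage-related parameters above could be
           combined as follows in the main configuration file. All values
           are illustrative assumptions, not recommendations:

 # Relative path: interpreted relative to the configuration directory
 dbdir = xapiandb
 # Stop indexing when the file system is 90% full
 maxfsoccuppc = 90
 # Flush to the disk index after about 50 MB of new text data
 idxflushmb = 50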

    5.4.1.4. Miscellaneous parameters:

   autodiacsens

           If the index is not stripped, decide if we automatically trigger
           diacritics sensitivity if the search term has accented characters
           (not in unac_except_trans). Otherwise you need to use the query
           language and the D modifier to specify diacritics sensitivity.
           Default is no.

   autocasesens

           If the index is not stripped, decide if we automatically trigger
           character case sensitivity if the search term has upper-case
           characters in any but the first position. Otherwise you need to
           use the query language and the C modifier to specify
           character-case sensitivity. Default is yes.

   loglevel, daemloglevel

           Verbosity level for recoll and recollindex. A value of 4 lists
           quite a lot of debug/information messages. 2 only lists errors.
           The daem version is specific to the indexing monitor daemon.

   logfilename, daemlogfilename

           Where the messages should go. 'stderr' can be used as a special
           value, and is the default. The daem version is specific to the
           indexing monitor daemon.
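
           For instance, to get verbose messages into a file while
           debugging, and keep the monitor daemon quieter, something like
           the following could be used. The file paths are hypothetical:

 loglevel = 4
 logfilename = /tmp/recoll.log
 daemloglevel = 2
 daemlogfilename = /tmp/recollindex-daemon.log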

   mondelaypatterns

           This allows specifying wildcard path patterns (processed with
           fnmatch(3) with 0 flag), to match files which change too often and
           for which a delay should be observed before re-indexing. This is a
           space-separated list, each entry being a pattern and a time in
           seconds, separated by a colon. You can use double quotes if a path
           entry contains white space. Example:

 mondelaypatterns = *.log:20 "this one has spaces*:10"

   monixinterval

           Minimum interval (seconds) for processing the indexing queue. The
           real time monitor does not process each event when it comes in,
           but will wait this time for the queue to accumulate, to diminish
           overhead and to aggregate multiple events to the same file.
           Default: 30 seconds.

   monauxinterval

           Period (in seconds) at which the real time monitor will regenerate
           the auxiliary databases (spelling, stemming) if needed. The
           default is one hour.

   monioniceclass, monioniceclassdata

           These allow defining the ionice class and data used by the indexer
           (default class 3, no data).
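
           A possible combination of the real time monitor settings is
           sketched below. The values are assumptions chosen for
           illustration. The class and data values are taken to be those
           understood by ionice(1), where class 2 is best-effort with a
           priority from 0 to 7:

 # Wait at least a minute between indexing queue runs
 monixinterval = 60
 # Regenerate the auxiliary databases every two hours
 monauxinterval = 7200
 # Best-effort I/O class, lowest priority
 monioniceclass = 2
 monioniceclassdata = 7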

   filtermaxseconds

           Maximum filter execution time, after which it is aborted. Some
           postscript programs just loop...

   filtersdir

           A directory to search for the external filter scripts used to
           index some types of files. The value should not be changed, except
           if you want to modify one of the default scripts. The value can be
           redefined for any sub-directory.

   iconsdir

           The name of the directory where recoll result list icons are
           stored. You can change this if you want different images.

   idxabsmlen

           Recoll stores an abstract for each indexed file inside the
           database. The text can come from an actual 'abstract' section in
           the document or will just be the beginning of the document. It is
           stored in the index so that it can be displayed inside the result
           lists without decoding the original file. The idxabsmlen parameter
           defines the size of the stored abstract. The default value is 250
           bytes. The search interface gives you the choice to display this
           stored text or a synthetic abstract built by extracting text
           around the search terms. If you always prefer the synthetic
           abstract, you can reduce this value and save a little space.

   aspellLanguage

           Language definitions to use when creating the aspell dictionary.
           The value must match a set of aspell language definition files.
           You can type "aspell config" to see where these are installed
           (look for data-dir). The default if the variable is not set is to
           use your desktop national language environment to guess the value.

   noaspell

           If this is set, the aspell dictionary generation is turned off.
           Useful for cases where you don't need the functionality or when it
           is unusable because aspell crashes during dictionary generation.
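
           For example, assuming English aspell language definition files
           are installed, the dictionary language could be forced with:

 aspellLanguage = en

           Or, to turn dictionary generation off entirely (the value 1 is an
           assumption about how the flag is expressed):

 noaspell = 1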

   mhmboxquirks

           This allows defining location-related quirks for the mailbox
           handler. Currently only the tbird flag is defined, and it should
           be set for directories which hold Thunderbird data, as their
           folder format is weird.
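
           A minimal sketch, assuming a hypothetical Thunderbird data
           location (adjust the path to your own profile directory):

 [/home/me/.thunderbird]
 mhmboxquirks = tbird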

  5.4.2. The fields file

   This file contains information about dynamic fields handling in Recoll.
   Some very basic fields have hard-wired behaviour, and, mostly, you should
   not change the original data inside the fields file. But you can create
   custom fields fitting your data and handle them just as if they were
   native ones.

   The fields file has several sections, each of which defines an aspect of
   fields processing. Quite often, you'll have to modify several sections to
   obtain the desired behaviour.

   We will only give a short description here; you should refer to the
   comments inside the file for more detailed information.

   Field names should be lowercase alphabetic ASCII.

   [prefixes]

           A field becomes indexed (searchable) by having a prefix defined in
           this section.

   [stored]

           A field becomes stored (displayable inside results) by having its
           name listed in this section (typically with an empty value).

   [aliases]

           This section defines lists of synonyms for the canonical names
           used inside the [prefixes] and [stored] sections.

   filter-specific sections

           Some filters may need specific configuration for handling fields.
           Only the email message filter currently has such a section (named
           [mail]). It allows indexing arbitrary email headers in addition to
           the ones indexed by default. Other such sections may appear in the
           future.

   Here follows a small example of a personal fields file. This would extract
   a specific email header and use it as a searchable field, with data
   displayable inside result lists. (Side note: as the email filter does no
   decoding on the values, only plain ASCII headers can be indexed, and only
   the first occurrence will be used for headers that occur several times).

 [prefixes]
 # Index mailmytag contents (with the given prefix)
 mailmytag = XMTAG

 [stored]
 # Store mailmytag inside the document data record (so that it can be
 # displayed - as %(mailmytag) - in result lists).
 mailmytag =

 [mail]
 # Extract the X-My-Tag mail header, and use it internally with the
 # mailmytag field name
 x-my-tag = mailmytag

  5.4.3. The mimemap file

   mimemap specifies the file name extension to mime type mappings.

   For file names without an extension, or with an unknown one, the system's
   file -i command will be executed to determine the mime type (this can be
   switched off inside the main configuration file).

   The mappings can be specified on a per-subtree basis, which may be useful
   in some cases. Example: gaim logs have a .txt extension but should be
   handled specially, which is possible because they are usually all located
   in one place.
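
   As an illustration, a per-subtree mapping for such logs could look like
   the following. Both the directory and the mime type name are assumptions
   made for this sketch:

 [/home/me/.gaim/logs]
 .txt = text/x-gaim-log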

   mimemap also has a recoll_noindex variable which is a list of suffixes.
   Matching files will be skipped (which avoids unnecessary decompressions or
   file executions). This is partially redundant with skippedNames in the
   main configuration file, with a few differences: it will not affect
   directories, it cannot be made dependent on the file-system location (it
   is a configuration-wide parameter), and the file names will still be
   indexed (not even the file names are indexed for patterns in
   skippedNames). recoll_noindex is used mostly for things known to be
   unindexable by a given Recoll version. Having it there avoids cluttering
   the more user-oriented and locally customized skippedNames.
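
   For example, a value like the following would skip a few archive and
   object file types. The suffix list and the space-separated syntax are
   assumptions for this sketch; check the comments in the distributed
   mimemap for the actual default:

 recoll_noindex = .tar.gz .tgz .o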

  5.4.4. The mimeconf file

   mimeconf specifies how the different mime types are handled for indexing,
   and which icons are displayed in the recoll result lists.

   Changing the parameters in the [index] section is probably not a good idea
   except if you are a Recoll developer.

   The [icons] section allows you to change the icons which are displayed by
   recoll in the result lists. The values are the basenames of the png images
   inside the iconsdir directory (specified in recoll.conf).

  5.4.5. The mimeview file

   mimeview specifies which programs are started when you click on an Open
   link in a result list. For example, HTML is normally displayed using
   firefox, but you may prefer Konqueror, or your openoffice.org program
   might be named ooffice instead of openoffice, etc.

   Changes to this file can be done by direct editing, or through the recoll
   GUI preferences dialog.

   If Use desktop preferences to choose document editor is checked in the
   Recoll GUI preferences, all mimeview entries will be ignored except the
   one labelled application/x-all (which is set to use xdg-open by default).

   In this case, the xallexcepts top level variable defines a list of mime
   type exceptions which will be processed according to the local entries
   instead of being passed to the desktop. This is so that specific Recoll
   options such as a page number or a search string can be passed to
   applications that support them, such as the evince viewer.
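
   A sketch of such an exception list, placed at the top level of mimeview
   (outside any section). The chosen mime types and the space-separated
   syntax are assumptions for illustration:

 xallexcepts = application/pdf application/postscript application/x-dvi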

   As for the other configuration files, the normal usage is to have a
   mimeview inside your own configuration directory, with just the
   non-default entries, which will override those from the central
   configuration file.

   All viewer definition entries must be placed under a [view] section.

   The keys in the file are normally mime types. You can add an application
   tag to specialize the choice for an area of the filesystem (using a
   localfields specification in the main configuration file). The syntax for
   the key is mimetype|tag

   The nouncompforviewmts entry (placed at the top level, outside of the
   [view] section) holds a list of mime types that should not be
   uncompressed before starting the viewer (if they are found compressed,
   e.g. mydoc.doc.gz).

   The right side of each assignment holds a command to be executed for
   opening the file. The following substitutions are performed:

     o %D. Document date

     o %f. File name. This may be the name of a temporary file if it was
       necessary to create one (e.g. to extract a subdocument from a
       container).

     o %F. Original file name. Same as %f except if a temporary file is used.

     o %i. Internal path, for subdocuments of containers. The format depends
       on the container type. If this appears in the command line, Recoll
       will not create a temporary file to extract the subdocument, expecting
       the called application (possibly a script) to be able to handle it.

     o %M. Mime type

     o %p. Page index. Only significant for a subset of document types,
       currently only PDF, Postscript and DVI files. Can be used to start the
       editor at the right page for a match or snippet.

     o %s. Search term. The value will only be set for documents with indexed
       page numbers (e.g. PDF). The value will be one of the matched search
       terms. It would allow pre-setting the value in the "Find" entry inside
       Evince for example, for easy highlighting of the term.

     o %U, %u. URL.

   In addition to the predefined values above, all strings like %(fieldname)
   will be replaced by the value of the field named fieldname for the
   document. This could be used in combination with field customisation to
   help with opening the document.
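
   As an example of these substitutions, an entry like the following would
   open PDF files at the matching page and preset the search term. It
   assumes that the installed evince accepts the --page-index and --find
   options:

 [view]
 application/pdf = evince --page-index=%p --find=%s %f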

  5.4.6. Examples of configuration adjustments

    5.4.6.1. Adding an external viewer for a non-indexed type

   Imagine that you have some kind of file which does not have indexable
   content, but for which you would like to have a functional Open link in
   the result list (when found by file name). The file names end in .blob and
   can be displayed by application blobviewer.

   You need two entries in the configuration files for this to work:

     o In $RECOLL_CONFDIR/mimemap (typically ~/.recoll/mimemap), add the
       following line:

 .blob = application/x-blobapp

       Note that the mime type is made up here, and you could call it
       diesel/oil just the same.

     o In $RECOLL_CONFDIR/mimeview under the [view] section, add:

 application/x-blobapp = blobviewer %f

       We are supposing that blobviewer wants a file name parameter here; you
       would use %u if it liked URLs better.

   If you just wanted to change the application used by Recoll to display a
   mime type which it already knows, you would just need to edit mimeview.
   The entries you add in your personal file override those in the central
   configuration, which you do not need to alter. mimeview can also be
   modified from the GUI.

    5.4.6.2. Adding indexing support for a new file type

   Let us now imagine that the above .blob files actually contain indexable
   text and that you know how to extract it with a command line program.
   Getting Recoll to index the files is easy. You need to perform the above
   alteration, and also to add data to the mimeconf file (typically in
   ~/.recoll/mimeconf):

     o Under the [index] section, add the following line (more about the
       rclblob indexing script later):

 application/x-blobapp = exec rclblob

     o Under the [icons] section, you should choose an icon to be displayed
       for the files inside the result lists. Icons are normally 64x64 pixels
       PNG files which live in /usr/[local/]share/recoll/images.

     o Under the [categories] section, you should add the mime type where it
       makes sense (you can also create a category). Categories may be used
       for filtering in advanced search.
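
   Put together, the mimeconf additions might look like the sketch below.
   The icon name is an assumption: it should be the basename of an existing
   image in the icons directory.

 [index]
 application/x-blobapp = exec rclblob

 [icons]
 application/x-blobapp = document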

   The rclblob filter should be an executable program or script which exists
   inside /usr/[local/]share/recoll/filters. It will be given a file name as
   argument and should output the text or HTML contents on the standard
   output.

   The filter programming section describes in more detail how to write a
   filter.