a/src/README b/src/README
...
...
132
132
133
                             3.8.2. The KDE Kicker Recoll applet
133
                             3.8.2. The KDE Kicker Recoll applet
134
134
135
   4. Programming interface
135
   4. Programming interface
136
136
137
                4.1. Writing a document filter
137
                4.1. Writing a document input handler
138
138
139
                             4.1.1. Simple filters
139
                             4.1.1. Simple input handlers
140
140
141
                             4.1.2. "Multiple" filters
141
                             4.1.2. "Multiple" handlers
142
142
143
                             4.1.3. Telling Recoll about the filter
143
                             4.1.3. Telling Recoll about the handler
144
144
145
                             4.1.4. Filter HTML output
145
                             4.1.4. Input handler HTML output
146
146
147
                             4.1.5. Page numbers
147
                             4.1.5. Page numbers
148
148
149
                4.2. Field data processing
149
                4.2. Field data processing
150
150
...
...
257
   index, but the result is not nice, as all formatting, punctuation and
257
   index, but the result is not nice, as all formatting, punctuation and
258
   capitalization are lost).
258
   capitalization are lost).
259
259
260
   Recoll stores all internal data in Unicode UTF-8 format, and it can index
260
   Recoll stores all internal data in Unicode UTF-8 format, and it can index
261
   files with different character sets, encodings, and languages into the
261
   files with different character sets, encodings, and languages into the
262
   same index. It has input filters for many document types.
262
   same index. It can process many document types.
263
263
264
   Stemming is the process by which Recoll reduces words to their radicals so
264
   Stemming is the process by which Recoll reduces words to their radicals so
265
   that searching does not depend, for example, on a word being singular or
265
   that searching does not depend, for example, on a word being singular or
266
   plural (floor, floors), or on a verb tense (flooring, floored). Because
266
   plural (floor, floors), or on a verb tense (flooring, floored). Because
267
   the mechanisms used for stemming depend on the specific grammatical rules
267
   the mechanisms used for stemming depend on the specific grammatical rules
...
...
416
   to be indexed. In the latter case, any type not in the list will be
416
   to be indexed. In the latter case, any type not in the list will be
417
   ignored.
417
   ignored.
418
418
419
   Excluding types can be done by adding wildcard name patterns to the
419
   Excluding types can be done by adding wildcard name patterns to the
420
   skippedNames list, which can be done from the GUI Index configuration
420
   skippedNames list, which can be done from the GUI Index configuration
421
   menu. It is also possible to exclude a mime type independently of the file
421
   menu. For versions 1.20 and later, you can alternatively set the
422
   name by associating it with the rclnull filter. This can be done by
422
   excludedmimetypes list in the configuration file. This can be redefined
423
   editing the mimeconf configuration file.
423
   for subdirectories.
424
424
425
   In order to define a positive list, You need to edit the main
425
   You can also define an exclusive list of MIME types to be indexed (no
426
   configuration file (recoll.conf) and set the indexedmimetypes
426
   others will be indexed), by setting the indexedmimetypes configuration
427
   configuration variable. Example:
427
   variable. Example:
428
428
429
 indexedmimetypes = text/html application/pdf
429
 indexedmimetypes = text/html application/pdf
430
          
430
          
431
431
432
   It is possible to redefine this parameter for subdirectories. Example:
432
   It is possible to redefine this parameter for subdirectories. Example:
...
...
434
 [/path/to/my/dir]
434
 [/path/to/my/dir]
435
 indexedmimetypes = application/pdf
435
 indexedmimetypes = application/pdf
436
          
436
          
437
437
438
   (When using sections like this, don't forget that they remain in effect
438
   (When using sections like this, don't forget that they remain in effect
439
   until the end of the file or another section indicator). There is no GUI
439
   until the end of the file or another section indicator).
440
   way to edit the parameter, because this option runs contrary to Recoll
440
441
   main goal which is to help you find information, independently of how it
441
   excludedmimetypes or indexedmimetypes, can be set either by editing the
442
   may be stored.
442
   main configuration file (recoll.conf), or from the GUI index configuration
443
   tool.
443
444
444
  2.1.4. Recovery
445
  2.1.4. Recovery
445
446
446
   In the rare case where the index becomes corrupted (which can signal
447
   In the rare case where the index becomes corrupted (which can signal
447
   itself by weird search results or crashes), the index files need to be
448
   itself by weird search results or crashes), the index files need to be
...
...
700
   A freedesktop standard defines a few special attributes, which are handled
701
   A freedesktop standard defines a few special attributes, which are handled
701
   as such by Recoll:
702
   as such by Recoll:
702
703
703
   mime_type
704
   mime_type
704
705
705
           If set, this overrides any other determination of the file mime
706
           If set, this overrides any other determination of the file MIME
706
           type.
707
           type.
707
708
708
   charset
709
   charset
709
           If set, this defines the file character set (mostly useful for
710
           If set, this defines the file character set (mostly useful for
710
           plain text files).
711
           plain text files).
...
...
1016
   default, Recoll lets the desktop choose the appropriate application for
1017
   default, Recoll lets the desktop choose the appropriate application for
1017
   most document types (there is a short list of exceptions, see further). If
1018
   most document types (there is a short list of exceptions, see further). If
1018
   you prefer to completely customize the choice of applications, you can
1019
   you prefer to completely customize the choice of applications, you can
1019
   uncheck the Use desktop preferences option in the GUI preferences dialog,
1020
   uncheck the Use desktop preferences option in the GUI preferences dialog,
1020
   and click the Choose editor applications button to adjust the predefined
1021
   and click the Choose editor applications button to adjust the predefined
1021
   Recoll choices. The tool accepts multiple selections of mime types (e.g.
1022
   Recoll choices. The tool accepts multiple selections of MIME types (e.g.
1022
   to set up the editor for the dozens of office file types).
1023
   to set up the editor for the dozens of office file types).
1023
1024
1024
   Even when Use desktop preferences is checked, there is a small list of
1025
   Even when Use desktop preferences is checked, there is a small list of
1025
   exceptions, for mime types where the Recoll choice should override the
1026
   exceptions, for MIME types where the Recoll choice should override the
1026
   desktop one. These are applications which are well integrated with Recoll,
1027
   desktop one. These are applications which are well integrated with Recoll,
1027
   especially evince for viewing PDF and Postscript files because of its
1028
   especially evince for viewing PDF and Postscript files because of its
1028
   support for opening the document at a specific page and passing a search
1029
   support for opening the document at a specific page and passing a search
1029
   string as an argument. Of course, you can edit the list (in the GUI
1030
   string as an argument. Of course, you can edit the list (in the GUI
1030
   preferences) if you would prefer to lose the functionality and use the
1031
   preferences) if you would prefer to lose the functionality and use the
...
...
1240
1241
1241
    1. The first tab lets you specify terms to search for, and permits
1242
    1. The first tab lets you specify terms to search for, and permits
1242
       specifying multiple clauses which are combined to build the search.
1243
       specifying multiple clauses which are combined to build the search.
1243
1244
1244
    2. The second tab lets you filter the results according to file size, date of
1245
    2. The second tab lets you filter the results according to file size, date of
1245
       modification, mime type, or location.
1246
       modification, MIME type, or location.
1246
1247
1247
   Click on the Start Search button in the advanced search dialog, or type
1248
   Click on the Start Search button in the advanced search dialog, or type
1248
   Enter in any text field to start the search. The button in the main window
1249
   Enter in any text field to start the search. The button in the main window
1249
   always performs a simple search.
1250
   always performs a simple search.
1250
1251
...
...
1303
     o The next section allows filtering the results by file size. There are
1304
     o The next section allows filtering the results by file size. There are
1304
       two entries for minimum and maximum size. Enter decimal numbers. You
1305
       two entries for minimum and maximum size. Enter decimal numbers. You
1305
       can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12
1306
       can use suffix multipliers: k/K, m/M, g/G, t/T for 1E3, 1E6, 1E9, 1E12
1306
       respectively.
1307
       respectively.
1307
1308
1308
     o The next section allows filtering the results by their mime types, or
1309
     o The next section allows filtering the results by their MIME types, or
1309
       mime categories (ie: media/text/message/etc.).
1310
       MIME categories (ie: media/text/message/etc.).
1310
1311
1311
       You can transfer the types between two boxes, to define which will be
1312
       You can transfer the types between two boxes, to define which will be
1312
       included or excluded by the search.
1313
       included or excluded by the search.
1313
1314
1314
       The state of the file type selection can be saved as the default (the
1315
       The state of the file type selection can be saved as the default (the
...
...
1645
       Open link in the result list, instead of the application defined in
1646
       Open link in the result list, instead of the application defined in
1646
       mimeview. xdg-open will in turn use your desktop preferences to choose
1647
       mimeview. xdg-open will in turn use your desktop preferences to choose
1647
       an appropriate application.
1648
       an appropriate application.
1648
1649
1649
     o Exceptions: when using the desktop preferences for opening documents,
1650
     o Exceptions: when using the desktop preferences for opening documents,
1650
       these are mime types that will still be opened according to Recoll
1651
       these are MIME types that will still be opened according to Recoll
1651
       preferences. This is useful for passing parameters like page numbers
1652
       preferences. This is useful for passing parameters like page numbers
1652
       or search strings to applications that support them (e.g. evince).
1653
       or search strings to applications that support them (e.g. evince).
1653
       This cannot be done with xdg-open which only supports passing one
1654
       This cannot be done with xdg-open which only supports passing one
1654
       parameter.
1655
       parameter.
1655
1656
...
...
1787
1788
1788
     o %A. Abstract
1789
     o %A. Abstract
1789
1790
1790
     o %D. Date
1791
     o %D. Date
1791
1792
1792
     o %I. Icon image name. This is normally determined from the mime type.
1793
     o %I. Icon image name. This is normally determined from the MIME type.
1793
       The associations are defined inside the mimeconf configuration file.
1794
       The associations are defined inside the mimeconf configuration file.
1794
       If a thumbnail for the file is found at the standard Freedesktop
1795
       If a thumbnail for the file is found at the standard Freedesktop
1795
       location, this will be displayed instead.
1796
       location, this will be displayed instead.
1796
1797
1797
     o %K. Keywords (if any)
1798
     o %K. Keywords (if any)
1798
1799
1799
     o %L. Precooked Preview, Edit, and possibly Snippets links
1800
     o %L. Precooked Preview, Edit, and possibly Snippets links
1800
1801
1801
     o %M. Mime type
1802
     o %M. MIME type
1802
1803
1803
     o %N. result Number inside the result page
1804
     o %N. result Number inside the result page
1804
1805
1805
     o %R. Relevance percentage
1806
     o %R. Relevance percentage
1806
1807
...
...
1822
   indexed but not stored fields is not known at this point in the search
1823
   indexed but not stored fields is not known at this point in the search
1823
   process (see field configuration). There are currently very few fields
1824
   process (see field configuration). There are currently very few fields
1824
   stored by default, apart from the values above (only author and filename),
1825
   stored by default, apart from the values above (only author and filename),
1825
   so this feature will need some custom local configuration to be useful. An
1826
   so this feature will need some custom local configuration to be useful. An
1826
   example candidate would be the recipient field which is generated by the
1827
   example candidate would be the recipient field which is generated by the
1827
   message filters.
1828
   message input handlers.
1828
1829
1829
   The default value for the paragraph format string is:
1830
   The default value for the paragraph format string is:
1830
1831
1831
 <img src="%I" align="left">%R %S %L &nbsp;&nbsp;<b>%T</b><br>
1832
 <img src="%I" align="left">%R %S %L &nbsp;&nbsp;<b>%T</b><br>
1832
 %M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i>&nbsp;%i<br>
1833
 %M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i>&nbsp;%i<br>
...
...
1947
     -b : basic. Just output urls, no mime types or titles
1948
     -b : basic. Just output urls, no mime types or titles
1948
     -Q : no result lines, just the processed query and result count
1949
     -Q : no result lines, just the processed query and result count
1949
     -m : dump the whole document meta[] array for each result
1950
     -m : dump the whole document meta[] array for each result
1950
     -A : output the document abstracts
1951
     -A : output the document abstracts
1951
     -S fld : sort by field <fld>
1952
     -S fld : sort by field <fld>
1953
     -s stemlang : set stemming language to use (must exist in index...)
1954
        Use -s "" to turn off stem expansion
1952
     -D : sort descending
1955
     -D : sort descending
1953
     -i <dbdir> : additional index, several can be given
1956
     -i <dbdir> : additional index, several can be given
1954
     -e use url encoding (%xx) for urls
1957
     -e use url encoding (%xx) for urls
1955
     -F <field name list> : output exactly these fields for each result.
1958
     -F <field name list> : output exactly these fields for each result.
1956
        The field values are encoded in base64, output in one line and
1959
        The field values are encoded in base64, output in one line and
...
...
2137
2140
2138
          o /2003 all documents from 2003 or older.
2141
          o /2003 all documents from 2003 or older.
2139
2142
2140
       Periods can also be specified with small letters (ie: p2y).
2143
       Periods can also be specified with small letters (ie: p2y).
2141
2144
2142
     o mime or format for specifying the mime type. This one is quite special
2145
     o mime or format for specifying the MIME type. This one is quite special
2143
       because you can specify several values which will be OR'ed (the normal
2146
       because you can specify several values which will be OR'ed (the normal
2144
       default for the language is AND). Ex: mime:text/plain mime:text/html.
2147
       default for the language is AND). Ex: mime:text/plain mime:text/html.
2145
       Specifying an explicit boolean operator before a mime specification is
2148
       Specifying an explicit boolean operator before a mime specification is
2146
       not supported and will produce strange results. You can filter out
2149
       not supported and will produce strange results. You can filter out
2147
       certain types by using negation (-mime:some/type), and you can use
2150
       certain types by using negation (-mime:some/type), and you can use
2148
       wildcards in the value (mime:text/*). Note that mime is the ONLY field
2151
       wildcards in the value (mime:text/*). Note that mime is the ONLY field
2149
       with an OR default. You do need to use OR with ext terms for example.
2152
       with an OR default. You do need to use OR with ext terms for example.
2150
2153
2151
     o type or rclcat for specifying the category (as in
2154
     o type or rclcat for specifying the category (as in
2152
       text/media/presentation/etc.). The classification of mime types in
2155
       text/media/presentation/etc.). The classification of MIME types in
2153
       categories is defined in the Recoll configuration (mimeconf), and can
2156
       categories is defined in the Recoll configuration (mimeconf), and can
2154
       be modified or extended. The default category names are those which
2157
       be modified or extended. The default category names are those which
2155
       permit filtering results in the main GUI screen. Categories are OR'ed
2158
       permit filtering results in the main GUI screen. Categories are OR'ed
2156
       like mime types above. This can't be negated with - either.
2159
       like MIME types above. This can't be negated with - either.
2157
2160
2158
   Words inside phrases and capitalized words are not stem-expanded.
2161
   Words inside phrases and capitalized words are not stem-expanded.
2159
   Wildcards may be used anywhere inside a term. Specifying a wild-card on
2162
   Wildcards may be used anywhere inside a term. Specifying a wild-card on
2160
   the left of a term can produce a very slow search (or even an incorrect
2163
   the left of a term can produce a very slow search (or even an incorrect
2161
   one if the expansion is truncated because of excessive size). Also see
2164
   one if the expansion is truncated because of excessive size). Also see
2162
   More about wildcards.
2165
   More about wildcards.
2163
2166
2164
   The document filters used while indexing have the possibility to create
2167
   The document input handlers used while indexing have the possibility to
2165
   other fields with arbitrary names, and aliases may be defined in the
2168
   create other fields with arbitrary names, and aliases may be defined in
2166
   configuration, so that the exact field search possibilities may be
2169
   the configuration, so that the exact field search possibilities may be
2167
   different for you if someone took care of the customisation.
2170
   different for you if someone took care of the customisation.
2168
2171
2169
  3.5.1. Modifiers
2172
  3.5.1. Modifiers
2170
2173
2171
   Some characters are recognized as search modifiers when found immediately
2174
   Some characters are recognized as search modifiers when found immediately
...
...
2376
Chapter 4. Programming interface
2379
Chapter 4. Programming interface
2377
2380
2378
   Recoll has an Application Programming Interface, usable both for indexing
2381
   Recoll has an Application Programming Interface, usable both for indexing
2379
   and searching, currently accessible from the Python language.
2382
   and searching, currently accessible from the Python language.
2380
2383
2381
   Another less radical way to extend the application is to write filters for
2384
   Another less radical way to extend the application is to write input
2382
   new types of documents.
2385
   handlers for new types of documents.
2383
2386
2384
   The processing of metadata attributes for documents (fields) is highly
2387
   The processing of metadata attributes for documents (fields) is highly
2385
   configurable.
2388
   configurable.
2386
2389
2387
4.1. Writing a document filter
2390
4.1. Writing a document input handler
2388
2391
2392
  Terminology
2393
2394
   The small programs or pieces of code which handle the processing of the
2395
   different document types for Recoll used to be called filters, which is
2396
   still reflected in the name of the directory which holds them and many
2397
   configuration variables. They were named this way because one of their
2398
   primary functions is to filter out the formatting directives and keep the
2399
   text content. However these modules may have other behaviours, and the
2400
   term input handler is now progressively substituted in the documentation.
2401
   The word filter is still used in many places though.
2402
2389
   Recoll filters cooperate to translate from the multitude of input document
2403
   Recoll input handlers cooperate to translate from the multitude of input
2390
   formats (simple ones such as opendocument or acrobat, or compound ones such as
2404
   document formats (simple ones such as opendocument or acrobat, or compound ones
2391
   Zip or Email) into the final Recoll indexing input format, which may be
2405
   such as Zip or Email) into the final Recoll indexing input format, which
2392
   text/plain or text/html. Most filters are executable programs or scripts.
2406
   is plain text. Most input handlers are executable programs or scripts. A
2393
   A few filters are coded in C++ and live inside recollindex. This latter
2407
   few handlers are coded in C++ and live inside recollindex. This latter
2394
   kind will not be described here.
2408
   kind will not be described here.
2395
2409
2396
   There are currently (1.18 and since 1.13) two kinds of external executable
2410
   There are currently (1.18 and since 1.13) two kinds of external executable
2397
   filters:
2411
   input handlers:
2398
2412
2399
     o Simple filters (exec filters) run once and exit. They can be bare
2413
     o Simple exec handlers run once and exit. They can be bare programs like
2400
       programs like antiword, or scripts using other programs. They are very
2414
       antiword, or scripts using other programs. They are very simple to
2401
       simple to write, because they just need to print the converted
2415
       write, because they just need to print the converted document to the
2402
       document to the standard output. Their output can be text/plain or
2416
       standard output. Their output can be plain text or HTML. HTML is
2403
       text/html.
2417
       usually preferred because it can store metadata fields and it allows
2418
       preserving some of the formatting for the GUI preview.
2404
2419
2405
     o Multiple filters (execm filters), run as long as their master process
2420
     o Multiple execm handlers can process multiple files (sparing the
2406
       (recollindex) is active. They can process multiple files (sparing the
2407
       process startup time which can be very significant), or multiple
2421
       process startup time which can be very significant), or multiple
2408
       documents per file (e.g.: for zip or chm files). They communicate with
2422
       documents per file (e.g.: for zip or chm files). They communicate with
2409
       the indexer through a simple protocol, but are nevertheless a bit more
2423
       the indexer through a simple protocol, but are nevertheless a bit more
2410
       complicated than the older kind. Most new filters are written in
2424
       complicated than the older kind. Most new handlers are written in
2411
       Python, using a common module to handle the protocol. There is an
2425
       Python, using a common module to handle the protocol. There is an
2412
       exception, rclimg which is written in Perl. The subdocuments output by
2426
       exception, rclimg which is written in Perl. The subdocuments output by
2413
       these filters can be directly indexable (text or HTML), or they can be
2427
       these handlers can be directly indexable (text or HTML), or they can
2414
       other simple or compound documents that will need to be processed by
2428
       be other simple or compound documents that will need to be processed
2415
       another filter.
2429
       by another handler.
2416
2430
2417
   In both cases, filters deal with regular file system files, and can
2431
   In both cases, handlers deal with regular file system files, and can
2418
   process either a single document, or a linear list of documents in each
2432
   process either a single document, or a linear list of documents in each
2419
   file. Recoll is responsible for performing up-to-date checks, dealing with
2433
   file. Recoll is responsible for performing up-to-date checks, dealing with
2420
   more complex embedding and other upper level issues.
2434
   more complex embedding and other upper level issues.
2421
2435
2422
   In the extreme case of a simple filter returning a document in text/plain
2436
   A simple handler returning a document in text/plain format can transfer
2423
   format, no metadata can be transferred from the filter to the indexer.
2437
   no metadata to the indexer. Generic metadata, like document size or
2424
   Generic metadata, like document size or modification date, will be
2438
   modification date, will be gathered and stored by the indexer.
2425
   gathered and stored by the indexer.
2426
2439
2427
   Filters that produce text/html format can return an arbitrary amount of
2440
   Handlers that produce text/html format can return an arbitrary amount of
2428
   metadata inside HTML meta tags. These will be processed according to the
2441
   metadata inside HTML meta tags. These will be processed according to the
2429
   directives found in the fields configuration file.
2442
   directives found in the fields configuration file.
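
   As an illustration of the paragraph above, here is a minimal sketch of how
   a handler might wrap extracted text and metadata into HTML. The function
   name to_recoll_html and the sample field value are assumptions made for
   this example, not part of Recoll; which meta names end up indexed or
   stored depends on the fields configuration file.

```python
import html
import sys


def to_recoll_html(body_text, metadata):
    """Wrap extracted text in a small HTML page, one meta tag per field.

    Hypothetical helper: metadata maps field names to string values.
    """
    head = ['<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">']
    for name, value in metadata.items():
        # Escape quotes so that a value cannot break out of the attribute.
        head.append('<meta name="{}" content="{}">'.format(
            html.escape(name, quote=True), html.escape(value, quote=True)))
    return ("<html><head>\n" + "\n".join(head) + "\n</head><body>\n<pre>"
            + html.escape(body_text) + "</pre>\n</body></html>\n")


if __name__ == "__main__" and len(sys.argv) == 2:
    # Illustration only: treat the argument as plain UTF-8 text.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        sys.stdout.write(to_recoll_html(f.read(), {"author": "unknown"}))
```

   The body is escaped and placed in a pre element only to keep the sketch
   short; a real handler would usually emit proper paragraphs.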
2430
2443
2431
   The filters that can handle multiple documents per file return a single
2444
   The handlers that can handle multiple documents per file return a single
2432
   piece of data to identify each document inside the file. This piece of
2445
   piece of data to identify each document inside the file. This piece of
2433
   data, called an ipath element will be sent back by Recoll to extract the
2446
   data, called an ipath element will be sent back by Recoll to extract the
2434
   document at query time, for previewing, or for creating a temporary file
2447
   document at query time, for previewing, or for creating a temporary file
2435
   to be opened by a viewer.
2448
   to be opened by a viewer.
2436
2449
2437
   The following section describes the simple filters, and the next one gives
2450
   The following section describes the simple handlers, and the next one
2438
   a few explanations about the execm ones. You could conceivably write a
2451
   gives a few explanations about the execm ones. You could conceivably write
2439
   simple filter with only the elements in the manual. This will not be the
2452
   a simple handler with only the elements in the manual. This will not be
2440
   case for the other ones, for which you will have to look at the code.
2453
   the case for the other ones, for which you will have to look at the code.
2441
2454
2442
  4.1.1. Simple filters
2455
  4.1.1. Simple input handlers
2443
2456
2444
   Recoll simple filters are usually shell-scripts, but this is in no way
2457
   Recoll simple handlers are usually shell-scripts, but this is in no way
2445
   necessary. Extracting the text from the native format is the difficult
2458
   necessary. Extracting the text from the native format is the difficult
2446
   part. Outputting the format expected by Recoll is trivial. Happily enough,
2459
   part. Outputting the format expected by Recoll is trivial. Happily enough,
2447
   most document formats have translators or text extractors which can be
2460
   most document formats have translators or text extractors which can be
2448
   called from the filter. In some cases the output of the translating
2461
   called from the handler. In some cases the output of the translating
2449
   program is completely appropriate, and no intermediate shell-script is
2462
   program is completely appropriate, and no intermediate shell-script is
2450
   needed.
2463
   needed.
2451
2464
2452
   Filters are called with a single argument which is the source file name.
2465
   Input handlers are called with a single argument which is the source file
2453
   They should output the result to stdout.
2466
   name. They should output the result to stdout.
2454
2467
2455
   When writing a filter, you should decide if it will output plain text or
2468
   When writing a handler, you should decide if it will output plain text or
2456
   HTML. Plain text is simpler, but you will not be able to add metadata or
2469
   HTML. Plain text is simpler, but you will not be able to add metadata or
2457
   vary the output character encoding (this will be defined in a
2470
   vary the output character encoding (this will be defined in a
2458
   configuration file). Additionally, some formatting may be easier to
2471
   configuration file). Additionally, some formatting may be easier to
2459
   preserve when previewing HTML. Actually the deciding factor is metadata:
2472
   preserve when previewing HTML. Actually the deciding factor is metadata:
2460
   Recoll has a way to extract metadata from the HTML header and use it for
2473
   Recoll has a way to extract metadata from the HTML header and use it for
2461
   field searches.
2474
   field searches.
2462
2475
2463
   The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
2476
   The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
2464
   the filter if the operation is for indexing or previewing. Some filters
2477
   the handler if the operation is for indexing or previewing. Some handlers
2465
   use this to output a slightly different format, for example stripping
2478
   use this to output a slightly different format, for example stripping
2466
   uninteresting repeated keywords (ie: Subject: for email) when indexing.
2479
   uninteresting repeated keywords (ie: Subject: for email) when indexing.
2467
   This is not essential.
2480
   This is not essential.
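
   The calling convention described above (one file name argument, result on
   stdout, RECOLL_FILTER_FORPREVIEW set to yes or no) can be sketched as
   follows. This is an assumed toy handler for a text format with a Subject:
   line, written in Python for illustration; the function names are
   hypothetical and the real handlers shipped with Recoll should be
   consulted for actual practice.

```python
import os
import sys


def for_preview(environ=os.environ):
    """Return True when the handler is run for previewing, not indexing."""
    return environ.get("RECOLL_FILTER_FORPREVIEW", "no") == "yes"


def render(text, preview):
    """Produce plain text output from the (already decoded) input text."""
    lines = text.splitlines()
    if not preview:
        # When indexing, drop the repeated "Subject:" keyword, keep its value.
        lines = [line[len("Subject:"):].strip()
                 if line.startswith("Subject:") else line
                 for line in lines]
    return "\n".join(lines) + "\n"


if __name__ == "__main__" and len(sys.argv) == 2:
    # The single argument is the source file name; output goes to stdout.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        sys.stdout.write(render(f.read(), for_preview()))
```

   Keeping the conversion in a function separate from the argument handling
   makes it easy to try the logic without running the indexer.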
2468
2481
2469
   You should look at one of the simple filters, for example rclps for a
2482
   You should look at one of the simple handlers, for example rclps for a
2470
   starting point.
2483
   starting point.
2471
2484
2472
   Don't forget to make your filter executable before testing!
2485
   Don't forget to make your handler executable before testing!
2473
2486
2474
  4.1.2. "Multiple" filters
2487
  4.1.2. "Multiple" handlers
2475
2488
2476
   If you can program and want to write an execm filter, it should not be too
2489
   If you can program and want to write an execm handler, it should not be
2477
   difficult to make sense of one of the existing modules. For example, look
2490
   too difficult to make sense of one of the existing modules. For example,
2478
   at rclzip which uses Zip file paths as identifiers (ipath), and rclics,
2491
   look at rclzip which uses Zip file paths as identifiers (ipath), and
2479
   which uses an integer index. Also have a look at the comments inside the
2492
   rclics, which uses an integer index. Also have a look at the comments
2480
   internfile/mh_execm.h file and possibly at the corresponding module.
2493
   inside the internfile/mh_execm.h file and possibly at the corresponding
2494
   module.
2481
2495
2482
   execm filters sometimes need to make a choice for the nature of the ipath
2496
   execm handlers sometimes need to make a choice for the nature of the ipath
2483
   elements that they use in communication with the indexer. Here are a few
2497
   elements that they use in communication with the indexer. Here are a few
2484
   guidelines:
2498
   guidelines:
2485
2499
2486
     o Use ASCII or UTF-8 (if the identifier is an integer print it, for
2500
     o Use ASCII or UTF-8 (if the identifier is an integer print it, for
2487
       example, like printf %d would do).
2501
       example, like printf %d would do).
...
...
2489
     o If at all possible, the data should make some kind of sense when
2503
     o If at all possible, the data should make some kind of sense when
2490
       printed to a log file to help with debugging.
2504
       printed to a log file to help with debugging.
2491
2505
2492
     o Recoll uses a colon (:) as a separator to store a complex path
2506
     o Recoll uses a colon (:) as a separator to store a complex path
2493
       internally (for deeper embedding). Colons inside the ipath elements
2507
       internally (for deeper embedding). Colons inside the ipath elements
2494
       output by a filter will be escaped, but would be a bad choice as a
2508
       output by a handler will be escaped, but would be a bad choice as a
2495
       filter-specific separator (mostly, again, for debugging issues).
2509
       handler-specific separator (mostly, again, for debugging issues).
2496
2510
2497
   In any case, the main goal is that it should be easy for the filter to
2511
   In any case, the main goal is that it should be easy for the handler to
2498
   extract the target document, given the file name and the ipath element.
2512
   extract the target document, given the file name and the ipath element.
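
   The goal stated above (easy extraction from the file name plus the ipath
   element) can be sketched for a Zip archive, using the member path as
   ipath element as rclzip is said to do. This is not the execm protocol
   itself, which is described in internfile/mh_execm.h; the function names
   and the demo archive are assumptions made for the example.

```python
import io
import zipfile


def list_ipaths(archive):
    """Return one ipath element per archive member (here: the member path)."""
    with zipfile.ZipFile(archive) as zf:
        # Directory entries end with "/" and carry no document data.
        return [name for name in zf.namelist() if not name.endswith("/")]


def extract(archive, ipath):
    """Return the bytes of the member designated by the ipath element."""
    with zipfile.ZipFile(archive) as zf:
        return zf.read(ipath)


def make_demo_archive():
    """Build a small in-memory Zip file for the example."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("docs/readme.txt", "hello")
        zf.writestr("notes.txt", "world")
    buf.seek(0)
    return buf
```

   The ipath elements are plain text and make sense when printed to a log,
   in line with the guidelines above.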
2499
2513
2500
   execm filters will also produce a document with a null ipath element.
2514
   execm handlers will also produce a document with a null ipath element.
2501
   Depending on the type of document, this may have some associated data
2515
   Depending on the type of document, this may have some associated data
2502
   (e.g. the body of an email message), or none (typical for an archive
2516
   (e.g. the body of an email message), or none (typical for an archive
2503
   file). If it is empty, this document will be useful anyway for some
2517
   file). If it is empty, this document will be useful anyway for some
2504
   operations, as the parent of the actual data documents.
2518
   operations, as the parent of the actual data documents.
2505
2519
2506
  4.1.3. Telling Recoll about the filter
2520
  4.1.3. Telling Recoll about the handler
2507
2521
2508
   There are two elements that link a file to the filter which should process
2522
   There are two elements that link a file to the handler which should
2509
   it: the association of file to mime type and the association of a mime
2523
   process it: the association of file to MIME type and the association of a
2510
   type with a filter.
2524
   MIME type with a handler.
2511
2525
2512
   The association of files to mime types is mostly based on name suffixes.
2526
   The association of files to MIME types is mostly based on name suffixes.
2513
   The types are defined inside the mimemap file. Example:
2527
   The types are defined inside the mimemap file. Example:
2514
2528
2515
2529
2516
 .doc = application/msword
2530
 .doc = application/msword
2517
2531
2518
   If no suffix association is found for the file name, Recoll will try to
2532
   If no suffix association is found for the file name, Recoll will try to
2519
   execute the file -i command to determine a mime type.
2533
   execute the file -i command to determine a MIME type.
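
   The lookup order described above (suffix first, then file -i) could be
   sketched as follows. The MIMEMAP dictionary stands in for the mimemap
   file, lowercasing the suffix is an assumption, and the -b option is only
   added here to keep the file name out of the command output. This is not
   Recoll's own code.

```python
import os
import subprocess

# Hypothetical in-memory copy of a few mimemap entries.
MIMEMAP = {".doc": "application/msword"}


def run_file_command(path):
    """Ask the system 'file -i' command for a MIME type."""
    out = subprocess.run(["file", "-b", "-i", path], capture_output=True,
                         text=True, check=True).stdout
    # Typical output: "text/plain; charset=us-ascii"
    return out.split(";")[0].strip()


def mime_type(path, mimemap=MIMEMAP, fallback=run_file_command):
    """Use the suffix association if there is one, else the fallback."""
    suffix = os.path.splitext(path)[1].lower()
    if suffix in mimemap:
        return mimemap[suffix]
    return fallback(path)
```

   The fallback is a parameter so that the suffix logic can be exercised
   without running an external command.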
2520
2534
2521
   The association of file types to filters is performed in the mimeconf
2535
   The association of file types to handlers is performed in the mimeconf
2522
   file. A sample will probably be of better help than a long explanation:
2536
   file. A sample will probably be of better help than a long explanation:
2523
2537
2524
2538
2525
 [index]
2539
 [index]
2526
 application/msword = exec antiword -t -i 1 -m UTF-8;\
2540
 application/msword = exec antiword -t -i 1 -m UTF-8;\
...
...
2543
2557
2544
     o text/rtf is processed by unrtf, which outputs text/html. The
2558
     o text/rtf is processed by unrtf, which outputs text/html. The
2545
       iso-8859-1 encoding is specified because it is not the utf-8 default,
2559
       iso-8859-1 encoding is specified because it is not the utf-8 default,
2546
       and not output by unrtf in the HTML header section.
2560
       and not output by unrtf in the HTML header section.
2547
2561
2548
     o application/x-chm is processed by a persistent filter. This is
2562
     o application/x-chm is processed by a persistent handler. This is
2549
       determined by the execm keyword.
2563
       determined by the execm keyword.
2550
2564
2551
  4.1.4. Filter HTML output
2565
  4.1.4. Input handler HTML output
2552
2566
2553
   The output HTML could be very minimal like the following example:
2567
   The output HTML could be very minimal like the following example:
2554
2568
2555
 <html>
2569
 <html>
2556
   <head>
2570
   <head>
...
...
2598
   Example:
2612
   Example:
2599
2613
2600
 <meta name="date" content="2013-02-24 17:50:00">
2614
 <meta name="date" content="2013-02-24 17:50:00">
2601
          
2615
          
2602
2616
2603
   Filters also have the possibility to "invent" field names. This should
2617
   Input handlers also have the possibility to "invent" field names. This
2604
   also be output as meta tags:
2618
   should also be output as meta tags:
2605
2619
2606
 <meta name="somefield" content="Some textual data" />
2620
 <meta name="somefield" content="Some textual data" />
2607
2621
2608
   You can embed HTML markup inside the content of custom fields, for
2622
   You can embed HTML markup inside the content of custom fields, for
2609
   improving the display inside result lists. In this case, add a (wildly
2623
   improving the display inside result lists. In this case, add a (wildly
...
...
2615
   As written above, the processing of fields is described in a further
2629
   As written above, the processing of fields is described in a further
2616
   section.
2630
   section.
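
   For a handler written in Python, the meta tags discussed in this section
   could be produced with a small helper like the one below. It is only a
   sketch: the function names are hypothetical, and escaping the attribute
   values with html.escape() is ordinary HTML hygiene rather than something
   the text requires.

```python
import html


def meta_tag(name, content):
    """Build one <meta> element, escaping the attribute values."""
    return '<meta name="%s" content="%s" />' % (
        html.escape(name, quote=True), html.escape(content, quote=True))


def meta_tags(fields):
    """Build one <meta> line per (name, value) pair of a dict."""
    return "\n".join(meta_tag(k, v) for k, v in fields.items())
```

   For example, meta_tag("somefield", "Some textual data") returns the
   custom field line shown earlier in this section.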
2617
2631
2618
  4.1.5. Page numbers
2632
  4.1.5. Page numbers
2619
2633
2620
   The indexer will interpret ^L characters in the filter output as
2634
   The indexer will interpret ^L characters in the handler output as
2621
   indicating page breaks, and will record them. At query time, this allows
2635
   indicating page breaks, and will record them. At query time, this allows
2622
   starting a viewer on the right page for a hit or a snippet. Currently,
2636
   starting a viewer on the right page for a hit or a snippet. Currently,
2623
   only the PDF, Postscript and DVI filters generate page breaks.
2637
   only the PDF, Postscript and DVI handlers generate page breaks.
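
   To illustrate how ^L (form feed) characters map to page numbers, here is
   a minimal sketch which computes the page containing a given character
   offset in handler output. It does not show how Recoll stores page breaks
   in the index; the function name is hypothetical.

```python
FORM_FEED = "\f"  # the ^L character


def page_for_offset(text, offset):
    """Return the 1-based page number containing the character at offset."""
    # Every form feed found before the offset starts a new page.
    return text.count(FORM_FEED, 0, offset) + 1
```

   A viewer could then be started on the page returned for the offset of a
   hit.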
2624
2638
2625
4.2. Field data processing
2639
4.2. Field data processing
2626
2640
2627
   Fields are named pieces of information in or about documents, like title,
2641
   Fields are named pieces of information in or about documents, like title,
2628
   author, abstract.
2642
   author, abstract.
2629
2643
2630
   The field values for documents can appear in several ways during indexing:
2644
   The field values for documents can appear in several ways during indexing:
2631
   either output by filters as meta fields in the HTML header section, or
2645
   either output by input handlers as meta fields in the HTML header section,
2632
   extracted from file extended attributes, or added as attributes of the Doc
2646
   or extracted from file extended attributes, or added as attributes of the
2633
   object when using the API, or again synthesized internally by Recoll.
2647
   Doc object when using the API, or again synthesized internally by Recoll.
2634
2648
2635
   The Recoll query language allows searching for text in a specific field.
2649
   The Recoll query language allows searching for text in a specific field.
2636
2650
2637
   Recoll defines a number of default fields. Additional ones can be output
2651
   Recoll defines a number of default fields. Additional ones can be output
2638
   by filters, and described in the fields configuration file.
2652
   by handlers, and described in the fields configuration file.
2639
2653
2640
   Fields can be:
2654
   Fields can be:
2641
2655
2642
     o indexed, meaning that their terms are separately stored in inverted
2656
     o indexed, meaning that their terms are separately stored in inverted
2643
       lists (with a specific prefix), and that a field-specific search is
2657
       lists (with a specific prefix), and that a field-specific search is
...
...
2792
2806
2793
      Classes
2807
      Classes
2794
2808
2795
        The Db class
2809
        The Db class
2796
2810
2797
   A Db object is created by a connect() function and holds a connection to a
2811
   A Db object is created by a connect() call and holds a connection to a
2798
   Recoll index.
2812
   Recoll index.
2799
2813
2800
   Methods
2814
   Methods
2801
2815
2802
   Db.close()
2816
   Db.close()
...
...
3086
   After an indexing pass, the commands that were found missing can be
3100
   After an indexing pass, the commands that were found missing can be
3087
   displayed from the recoll File menu. The list is stored in the missing
3101
   displayed from the recoll File menu. The list is stored in the missing
3088
   text file inside the configuration directory.
3102
   text file inside the configuration directory.
3089
3103
3090
   A list of common file types which need external commands follows. Many of
3104
   A list of common file types which need external commands follows. Many of
3091
   the filters need the iconv command, which is not always listed as a
3105
   the handlers need the iconv command, which is not always listed as a
3092
   dependency.
3106
   dependency.
3093
3107
3094
   Please note that, due to the relatively dynamic nature of this
3108
   Please note that, due to the relatively dynamic nature of this
3095
   information, the most up to date version is now kept on
3109
   information, the most up to date version is now kept on
3096
   http://www.recoll.org/features.html along with links to the home pages or
3110
   http://www.recoll.org/features.html along with links to the home pages or
...
...
3101
   from the package repositories. However, the packages are sometimes
3115
   from the package repositories. However, the packages are sometimes
3102
   outdated, or not the best version for Recoll, so you should take a look at
3116
   outdated, or not the best version for Recoll, so you should take a look at
3103
   http://www.recoll.org/features.html if a file type is important to you.
3117
   http://www.recoll.org/features.html if a file type is important to you.
3104
3118
3105
   As of Recoll release 1.14, a number of XML-based formats that were handled
3119
   As of Recoll release 1.14, a number of XML-based formats that were handled
3106
   by ad hoc filter code now use the xsltproc command, which usually comes
3120
   by ad hoc handler code now use the xsltproc command, which usually comes
3107
   with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
3121
   with libxslt. These are: abiword, fb2 (ebooks), kword, openoffice, svg.
3108
3122
3109
   Now for the list:
3123
   Now for the list:
3110
3124
3111
     o Openoffice files need unzip and xsltproc.
3125
     o Openoffice files need unzip and xsltproc.
...
...
3119
3133
3120
     o MS Word needs antiword. It is also useful to have wvWare installed as
3134
     o MS Word needs antiword. It is also useful to have wvWare installed as
3121
       it may be used as a fallback for some files which antiword does not
3135
       it may be used as a fallback for some files which antiword does not
3122
       handle.
3136
       handle.
3123
3137
3124
     o MS Excel and PowerPoint need catdoc.
3138
     o MS Excel and PowerPoint are processed by internal Python handlers.
3125
3139
3126
     o MS Open XML (docx) needs xsltproc.
3140
     o MS Open XML (docx) needs xsltproc.
3127
3141
3128
     o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
3142
     o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
3129
       Ubuntu) package.
3143
       Ubuntu) package.
...
...
3138
3152
3139
     o dvi files need dvips.
3153
     o dvi files need dvips.
3140
3154
3141
     o djvu files need djvutxt and djvused from the DjVuLibre package.
3155
     o djvu files need djvutxt and djvused from the DjVuLibre package.
3142
3156
3143
     o Audio files: Recoll releases before 1.13 used the id3info command from
3157
     o Audio files: Recoll releases 1.14 and later use a single Python
3144
       the id3lib package to extract mp3 tag information, metaflac (standard
3158
       handler based on mutagen for all audio file types.
3145
       flac tools) for flac files, and ogginfo (vorbis tools) for ogg files.
3146
       Releases 1.14 and later use a single Python filter based on mutagen
3147
       for all audio file types.
3148
3159
3149
     o Pictures: Recoll uses the Exiftool Perl package to extract tag
3160
     o Pictures: Recoll uses the Exiftool Perl package to extract tag
3150
       information. Most image file formats are supported. Note that there
3161
       information. Most image file formats are supported. Note that there
3151
       may not be much interest in indexing the technical tags (image size,
3162
       may not be much interest in indexing the technical tags (image size,
3152
       aperture, etc.). This is only of interest if you store personal tags
3163
       aperture, etc.). This is only of interest if you store personal tags
3153
       or textual descriptions inside the image files.
3164
       or textual descriptions inside the image files.
3154
3165
3155
     o chm: files in microsoft help format need Python and the pychm module
3166
     o chm: files in Microsoft help format need Python and the pychm module
3156
       (which needs chmlib).
3167
       (which needs chmlib).
3157
3168
3158
     o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
3169
     o ICS: up to Recoll 1.13, iCalendar files need Python and the icalendar
3159
       module. icalendar is not needed for newer versions, which use internal
3170
       module. icalendar is not needed for newer versions, which use internal
3160
       code.
3171
       code.
...
...
3166
3177
3167
     o Midi karaoke files need Python and the Midi module.
3178
     o Midi karaoke files need Python and the Midi module.
3168
3179
3169
     o Konqueror webarchive format with Python (uses the Tarfile module).
3180
     o Konqueror webarchive format with Python (uses the Tarfile module).
3170
3181
3171
     o mimehtml web archive format (support based on the email filter, which
3182
     o Mimehtml web archive format (support based on the email handler, which
3172
       introduces some mild weirdness, but is still usable).
3183
       introduces some mild weirdness, but is still usable).
3173
3184
3174
   Text, HTML, email folders, and Scribus files are processed internally. Lyx
3185
   Text, HTML, email folders, and Scribus files are processed internally. Lyx
3175
   is used to index Lyx files. Many filters need iconv and the standard sed
3186
   is used to index Lyx files. Many handlers need iconv and the standard sed
3176
   and awk.
3187
   and awk.
3177
3188
3178
5.3. Building from source
3189
5.3. Building from source
3179
3190
3180
  5.3.1. Prerequisites
3191
  5.3.1. Prerequisites
...
...
3493
3504
3494
   zipSkippedNames
3505
   zipSkippedNames
3495
3506
3496
           A space-separated list of patterns for names of files or
3507
           A space-separated list of patterns for names of files or
3497
           directories that should be ignored inside zip archives. This is
3508
           directories that should be ignored inside zip archives. This is
3498
           used directly by the zip filter, and has a function similar to
3509
           used directly by the zip handler, and has a function similar to
3499
           skippedNames, but works independently. Can be redefined for
3510
           skippedNames, but works independently. Can be redefined for
3500
           filesystem subdirectories. For versions up to 1.19, you will need
3511
           filesystem subdirectories. For versions up to 1.19, you will need
3501
           to update the Zip filter and install a supplementary Python
3512
           to update the Zip handler and install a supplementary Python
3502
           module. The details are described on the Recoll wiki.
3513
           module. The details are described on the Recoll wiki.
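
           As an illustration of what a space-separated pattern list
           could mean for archive members, the sketch below matches each
           path component against shell-style patterns. The matching
           rules (per component, fnmatch syntax) are assumptions made for
           the example and may differ from what the zip handler does.

```python
import fnmatch


def is_skipped(member_path, skipped_names):
    """True if any component of member_path matches one of the patterns."""
    patterns = skipped_names.split()
    components = [c for c in member_path.split("/") if c]
    return any(fnmatch.fnmatchcase(c, p)
               for c in components for p in patterns)
```

           With the value "*.o .git", both build/main.o and
           src/.git/config would be ignored, while docs/readme.txt would
           be kept.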
3503
3514
3504
   followLinks
3515
   followLinks
3505
3516
3506
           Specifies if the indexer should follow symbolic links while
3517
           Specifies if the indexer should follow symbolic links while
...
...
3511
           sections. It cannot be changed below the topdirs level.
3522
           sections. It cannot be changed below the topdirs level.
3512
3523
3513
   indexedmimetypes
3524
   indexedmimetypes
3514
3525
3515
           Recoll normally indexes any file which it knows how to read. This
3526
           Recoll normally indexes any file which it knows how to read. This
3516
           list lets you restrict the indexed mime types to what you specify.
3527
           list lets you restrict the indexed MIME types to what you specify.
3517
           If the variable is unspecified or the list empty (the default),
3528
           If the variable is unspecified or the list empty (the default),
3518
           all supported types are processed. Can be redefined for
3529
           all supported types are processed. Can be redefined for
3519
           subdirectories.
3530
           subdirectories.
3531
3532
   excludedmimetypes
3533
3534
           This list lets you exclude some MIME types from indexing. Can be
3535
           redefined for subdirectories.
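
           The combined effect of indexedmimetypes and excludedmimetypes
           can be pictured with the sketch below. It assumes that both
           values are space-separated lists and that an exclusion wins
           over an inclusion; neither point is stated in the text, and
           the function name is hypothetical.

```python
def should_index(mime, indexedmimetypes="", excludedmimetypes=""):
    """Decide whether a supported MIME type should be indexed."""
    indexed = indexedmimetypes.split()
    excluded = excludedmimetypes.split()
    # Assumption: an excluded type is never indexed.
    if mime in excluded:
        return False
    # An unset or empty indexedmimetypes means: all supported types.
    return not indexed or mime in indexed
```

           With both variables unset, every supported type passes, which
           matches the default described above.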
3520
3536
3521
   compressedfilemaxkbs
3537
   compressedfilemaxkbs
3522
3538
3523
           Size limit for compressed (.gz or .bz2) files. These need to be
3539
           Size limit for compressed (.gz or .bz2) files. These need to be
3524
           decompressed in a temporary directory for identification, which
3540
           decompressed in a temporary directory for identification, which
...
...
3548
   indexallfilenames
3564
   indexallfilenames
3549
3565
3550
           Recoll indexes file names in a special section of the database to
3566
           Recoll indexes file names in a special section of the database to
3551
           allow specific file name searches using wildcards. This
3567
           allow specific file name searches using wildcards. This
3552
           parameter decides if file name indexing is performed only for
3568
           parameter decides if file name indexing is performed only for
3553
           files with mime types that would qualify them for full text
3569
           files with MIME types that would qualify them for full text
3554
           indexing, or for all files inside the selected subtrees,
3570
           indexing, or for all files inside the selected subtrees,
3555
           independently of mime type.
3571
           independently of MIME type.
3556
3572
3557
   usesystemfilecommand
3573
   usesystemfilecommand
3558
3574
3559
           Decide if we use the file -i system command as a final step for
3575
           Decide if we use the file -i system command as a final step for
3560
           determining the mime type for a file (the main procedure uses
3576
           determining the MIME type for a file (the main procedure uses
3561
           suffix associations as defined in the mimemap file). This can be
3577
           suffix associations as defined in the mimemap file). This can be
3562
           useful for files with suffix-less names, but it will also cause
3578
           useful for files with suffix-less names, but it will also cause
3563
           the indexing of many bogus "text" files.
3579
           the indexing of many bogus "text" files.
3564
3580
3565
   processwebqueue
3581
   processwebqueue
...
...
3768
3784
3769
   webcachemaxmbs
3785
   webcachemaxmbs
3770
3786
3771
           This is only used by the web browser plugin indexing code, and
3787
           This is only used by the web browser plugin indexing code, and
3772
           defines the maximum size for the web page cache. Default: 40 MB.
3788
           defines the maximum size for the web page cache. Default: 40 MB.
3789
           Quite unfortunately, this is only taken into account when creating
3790
           the cache file. You need to delete the file for a change to be
3791
           taken into account.
3773
3792
3774
   idxflushmb
3793
   idxflushmb
3775
3794
3776
           Threshold (megabytes of new text data) where we flush from memory
3795
           Threshold (megabytes of new text data) where we flush from memory
3777
           to disk index. Setting this can help control memory usage. A value
3796
           to disk index. Setting this can help control memory usage. A value
...
...
3907
           These allow defining the ionice class and data used by the indexer
3926
           These allow defining the ionice class and data used by the indexer
3908
           (default class 3, no data).
3927
           (default class 3, no data).
3909
3928
3910
   filtermaxseconds
3929
   filtermaxseconds
3911
3930
3912
           Maximum filter execution time, after which it is aborted. Some
3931
           Maximum handler execution time, after which it is aborted. Some
3913
           postscript programs just loop...
3932
           postscript programs just loop...
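
           The idea of a maximum execution time can be sketched with the
           Python standard library as below. This only illustrates the
           concept; it is not how the indexer implements
           filtermaxseconds, and the names are made up for the example.

```python
import subprocess
import sys

# Trivial command standing in for a real handler script.
DEMO_CMD = [sys.executable, "-c", "print('ok')"]


def run_handler(cmd, max_seconds):
    """Run a handler command, returning its output or None on timeout."""
    try:
        done = subprocess.run(cmd, capture_output=True, timeout=max_seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child process before raising.
        return None
    return done.stdout
```

           A caller would treat a None result as a failed document and
           move on to the next file.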
3914
3933
3915
   filtersdir
3934
   filtersdir
3916
3935
3917
           A directory to search for the external filter scripts used to
3936
           A directory to search for the external input handler scripts used
3918
           index some types of files. The value should not be changed, except
3937
           to index some types of files. The value should not be changed,
3919
           if you want to modify one of the default scripts. The value can be
3938
           except if you want to modify one of the default scripts. The value
3920
           redefined for any sub-directory.
3939
           can be redefined for any sub-directory.
3921
3940
3922
   iconsdir
3941
   iconsdir
3923
3942
3924
           The name of the directory where recoll result list icons are
3943
           The name of the directory where recoll result list icons are
3925
           stored. You can change this if you want different images.
3944
           stored. You can change this if you want different images.
...
...
3996
   [aliases]
4015
   [aliases]
3997
4016
3998
           This section defines lists of synonyms for the canonical names
4017
           This section defines lists of synonyms for the canonical names
3999
           used inside the [prefixes] and [stored] sections.
4018
           used inside the [prefixes] and [stored] sections.
4000
4019
4001
   filter-specific sections
4020
   handler-specific sections
4002
4021
4003
           Some filters may need specific configuration for handling fields.
4022
           Some input handlers may need specific configuration for handling
4004
           Only the email message filter currently has such a section (named
4023
           fields. Only the email message handler currently has such a
4005
           [mail]). It allows indexing arbitrary email headers in addition to
4024
           section (named [mail]). It allows indexing arbitrary email headers
4006
           the ones indexed by default. Other such sections may appear in the
4025
           in addition to the ones indexed by default. Other such sections
4007
           future.
4026
           may appear in the future.
4008
4027
4009
   Here follows a small example of a personal fields file. This would extract
4028
   Here follows a small example of a personal fields file. This would extract
4010
   a specific email header and use it as a searchable field, with data
4029
   a specific email header and use it as a searchable field, with data
4011
   displayable inside result lists. (Side note: as the email filter does no
4030
   displayable inside result lists. (Side note: as the email handler does no
4012
   decoding on the values, only plain ASCII headers can be indexed, and only
4031
   decoding on the values, only plain ASCII headers can be indexed, and only
4013
   the first occurrence will be used for headers that occur several times).
4032
   the first occurrence will be used for headers that occur several times).
4014
4033
4015
 [prefixes]
4034
 [prefixes]
4016
 # Index mailmytag contents (with the given prefix)
4035
 # Index mailmytag contents (with the given prefix)
...
...
4038
   translations from extended attributes names to Recoll field names. An
4057
   translations from extended attributes names to Recoll field names. An
4039
   empty translation disables use of the corresponding attribute data.
4058
   empty translation disables use of the corresponding attribute data.
4040
4059
4041
  5.4.3. The mimemap file
4060
  5.4.3. The mimemap file
4042
4061
4043
   mimemap specifies the file name extension to mime type mappings.
4062
   mimemap specifies the file name extension to MIME type mappings.
4044
4063
4045
   For file names without an extension, or with an unknown one, the system's
4064
   For file names without an extension, or with an unknown one, the system's
4046
   file -i command will be executed to determine the mime type (this can be
4065
   file -i command will be executed to determine the MIME type (this can be
4047
   switched off inside the main configuration file).
4066
   switched off inside the main configuration file).
4048
4067
4049
   The mappings can be specified on a per-subtree basis, which may be useful
4068
   The mappings can be specified on a per-subtree basis, which may be useful
4050
   in some cases. Example: gaim logs have a .txt extension but should be
4069
   in some cases. Example: gaim logs have a .txt extension but should be
4051
   handled specially, which is possible because they are usually all located
4070
   handled specially, which is possible because they are usually all located
...
...
4062
   given Recoll version. Having it there avoids cluttering the more
4081
   given Recoll version. Having it there avoids cluttering the more
4063
   user-oriented and locally customized skippedNames.
4082
   user-oriented and locally customized skippedNames.
4064
4083
4065
  5.4.4. The mimeconf file
4084
  5.4.4. The mimeconf file
4066
4085
4067
   mimeconf specifies how the different mime types are handled for indexing,
4086
   mimeconf specifies how the different MIME types are handled for indexing,
4068
   and which icons are displayed in the recoll result lists.
4087
   and which icons are displayed in the recoll result lists.
4069
4088
4070
   Changing the parameters in the [index] section is probably not a good idea
4089
   Changing the parameters in the [index] section is probably not a good idea
4071
   except if you are a Recoll developer.
4090
   except if you are a Recoll developer.
4072
4091
...
...
4086
4105
4087
   If Use desktop preferences to choose document editor is checked in the
4106
   If Use desktop preferences to choose document editor is checked in the
4088
   Recoll GUI preferences, all mimeview entries will be ignored except the
4107
   Recoll GUI preferences, all mimeview entries will be ignored except the
4089
   one labelled application/x-all (which is set to use xdg-open by default).
4108
   one labelled application/x-all (which is set to use xdg-open by default).
4090
4109
4091
   In this case, the xallexcepts top level variable defines a list of mime
4110
   In this case, the xallexcepts top level variable defines a list of MIME
4092
   type exceptions which will be processed according to the local entries
4111
   type exceptions which will be processed according to the local entries
4093
   instead of being passed to the desktop. This is so that specific Recoll
4112
   instead of being passed to the desktop. This is so that specific Recoll
4094
   options such as a page number or a search string can be passed to
4113
   options such as a page number or a search string can be passed to
4095
   applications that support them, such as the evince viewer.
4114
   applications that support them, such as the evince viewer.
4096
4115
...
...
4099
   non-default entries, which will override those from the central
4118
   non-default entries, which will override those from the central
4100
   configuration file.
4119
   configuration file.
4101
4120
4102
   All viewer definition entries must be placed under a [view] section.
4121
   All viewer definition entries must be placed under a [view] section.
4103
4122
4104
   The keys in the file are normally mime types. You can add an application
4123
   The keys in the file are normally MIME types. You can add an application
4105
   tag to specialize the choice for an area of the filesystem (using a
4124
   tag to specialize the choice for an area of the filesystem (using a
4106
   localfields specification in mimeconf). The syntax for the key is
4125
   localfields specification in mimeconf). The syntax for the key is
4107
   mimetype|tag
4126
   mimetype|tag
4108
4127
4109
   The nouncompforviewmts entry (placed at the top level, outside of the
4128
   The nouncompforviewmts entry (placed at the top level, outside of the
4110
   [view] section), holds a list of mime types that should not be
4129
   [view] section), holds a list of MIME types that should not be
4111
   uncompressed before starting the viewer (if they are found compressed, e.g.
4130
   uncompressed before starting the viewer (if they are found compressed, e.g.
4112
   mydoc.doc.gz).
4131
   mydoc.doc.gz).
4113
4132
4114
   The right side of each assignment holds a command to be executed for
4133
   The right side of each assignment holds a command to be executed for
4115
   opening the file. The following substitutions are performed:
4134
   opening the file. The following substitutions are performed:
...
...
4125
     o %i. Internal path, for subdocuments of containers. The format depends
4144
     o %i. Internal path, for subdocuments of containers. The format depends
4126
       on the container type. If this appears in the command line, Recoll
4145
       on the container type. If this appears in the command line, Recoll
4127
       will not create a temporary file to extract the subdocument, expecting
4146
       will not create a temporary file to extract the subdocument, expecting
4128
       the called application (possibly a script) to be able to handle it.
4147
       the called application (possibly a script) to be able to handle it.
4129
4148
4130
     o %M. Mime type
4149
     o %M. MIME type
4131
4150
4132
     o %p. Page index. Only significant for a subset of document types,
4151
     o %p. Page index. Only significant for a subset of document types,
4133
       currently only PDF, Postscript and DVI files. Can be used to start the
4152
       currently only PDF, Postscript and DVI files. Can be used to start the
4134
       editor at the right page for a match or snippet.
4153
       editor at the right page for a match or snippet.
4135
4154
...
...
     o In $RECOLL_CONFDIR/mimemap (typically ~/.recoll/mimemap), add the
       following line:

 .blob = application/x-blobapp

       Note that the MIME type is made up here, and you could call it
       diesel/oil just the same.

     o In $RECOLL_CONFDIR/mimeview under the [view] section, add:

 application/x-blobapp = blobviewer %f

       We are supposing that blobviewer wants a file name parameter here; you
       would use %u if it liked URLs better.

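   As an illustration of how a command such as "blobviewer %f" could be
   expanded, here is a minimal Python sketch. It is not Recoll's actual
   code: the function name expand_viewer_command, the sample paths and the
   quoting behaviour (shell-like word splitting before substitution) are
   all assumptions made for this example.

```python
import shlex


def expand_viewer_command(template, values):
    """Expand %-placeholders in a mimeview-style command template.

    `values` maps a placeholder letter (for example "f" or "u") to its
    replacement text. Placeholders without a value are left untouched.
    Returns an argument list suitable for subprocess.run().
    """
    argv = []
    # Split first, substitute second, so that a file name containing
    # spaces still ends up as a single argument.
    for word in shlex.split(template):
        out = []
        i = 0
        while i < len(word):
            if word[i] == "%" and i + 1 < len(word) and word[i + 1] in values:
                out.append(values[word[i + 1]])
                i += 2
            else:
                out.append(word[i])
                i += 1
        argv.append("".join(out))
    return argv


# Hypothetical values for one result list entry.
sample = {
    "f": "/home/me/docs/my file.blob",
    "u": "file:///home/me/docs/my%20file.blob",
    "M": "application/x-blobapp",
}
print(expand_viewer_command("blobviewer %f", sample))
# prints ['blobviewer', '/home/me/docs/my file.blob']
```

   Swapping %f for %u in the template would pass the URL form instead of
   the file name.
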
   If you just wanted to change the application used by Recoll to display a
   MIME type which it already knows, you would just need to edit mimeview.
   The entries you add in your personal file override those in the central
   configuration, which you do not need to alter. mimeview can also be
   modified from the GUI.

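   The override rule can be pictured as a layered lookup in which the
   personal file is consulted before the central one. The Python sketch
   below only models that rule; it does not parse real configuration files,
   and the viewer commands and the name viewer_for are made up for the
   example.

```python
from collections import ChainMap

# Hypothetical [view] entries, already parsed into dictionaries.
central_view = {
    "application/pdf": "evince %f",
    "text/html": "firefox %u",
}
personal_view = {
    "application/pdf": "okular %f",            # overrides the central entry
    "application/x-blobapp": "blobviewer %f",  # type unknown to the central file
}

# Earlier maps win, so personal entries shadow central ones.
effective_view = ChainMap(personal_view, central_view)


def viewer_for(mime_type):
    """Return the command template for a MIME type, or None if unknown."""
    return effective_view.get(mime_type)
```

   Here viewer_for("application/pdf") returns the personal entry, while
   viewer_for("text/html") falls through to the central one.
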
    5.4.7.2. Adding indexing support for a new file type
...

     o Under the [icons] section, you should choose an icon to be displayed
       for the files inside the result lists. Icons are normally 64x64 pixel
       PNG files which live in /usr/[local/]share/recoll/images.

     o Under the [categories] section, you should add the MIME type where it
       makes sense (you can also create a category). Categories may be used
       for filtering in advanced search.

   The rclblob handler should be an executable program or script which exists
   inside /usr/[local/]share/recoll/filters. It will be given a file name as
   argument and should output the text or HTML contents on the standard
   output.

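   A minimal sketch of what such an rclblob script might look like in
   Python follows. The blob format is imaginary, so the extraction step
   (keeping runs of printable ASCII characters) is purely an assumption, as
   are the function names; the exact output conventions expected by Recoll
   are the subject of the handler programming section.

```python
#!/usr/bin/env python3
import html
import re
import sys


def blob_to_html(data, title):
    """Turn raw bytes into a small HTML document.

    Assumption: the made-up blob format embeds its text as runs of
    printable ASCII, so runs of 4 or more such characters are kept.
    """
    runs = re.findall(rb"[\x20-\x7e]{4,}", data)
    text = "\n".join(run.decode("ascii") for run in runs)
    return (
        "<html><head>"
        '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
        f"<title>{html.escape(title)}</title>"
        "</head><body><pre>\n"
        f"{html.escape(text)}\n"
        "</pre></body></html>\n"
    )


def main(path):
    # The file name is the single argument; the result goes to stdout.
    with open(path, "rb") as f:
        data = f.read()
    sys.stdout.write(blob_to_html(data, path))
    return 0


if __name__ == "__main__" and len(sys.argv) == 2:
    # Invoked as: rclblob /path/to/file.blob
    sys.exit(main(sys.argv[1]))
```

   Escaping the extracted text with html.escape keeps stray "<" or "&"
   characters in the data from being read as markup.
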
   The filter programming section describes in more detail how to write an
   input handler.