a/src/README b/src/README
...
...
9
   <jean-francois.dockes@wanadoo.fr>
9
   <jean-francois.dockes@wanadoo.fr>
10
10
11
   Copyright (c) 2005 Jean-Francois Dockes
11
   Copyright (c) 2005 Jean-Francois Dockes
12
12
13
   This document introduces full text search notions and describes the
13
   This document introduces full text search notions and describes the
14
   installation and use of the Recoll application.
14
   installation and use of the Recoll application. It currently describes
15
   Recoll 1.9.
15
16
16
   [ Split HTML / Single HTML ]
17
   [ Split HTML / Single HTML ]
17
18
18
     ----------------------------------------------------------------------
19
     ----------------------------------------------------------------------
19
20
...
...
102
                             4.4.3. The mimeconf file
103
                             4.4.3. The mimeconf file
103
104
104
                             4.4.4. The mimeview file
105
                             4.4.4. The mimeview file
105
106
106
                             4.4.5. Examples of configuration adjustments
107
                             4.4.5. Examples of configuration adjustments
108
109
                4.5. Extending Recoll
110
111
                             4.5.1. Writing a document filter
107
112
108
     ----------------------------------------------------------------------
113
     ----------------------------------------------------------------------
109
114
110
                            Chapter 1. Introduction
115
                            Chapter 1. Introduction
111
116
...
...
368
   configuration before indexing, just click Cancel at this point. That way,
373
   configuration before indexing, just click Cancel at this point. That way,
369
   recoll will have created a ~/.recoll directory containing empty
374
   recoll will have created a ~/.recoll directory containing empty
370
   configuration files.
375
   configuration files.
371
376
372
   The configuration is documented inside the installation chapter of this
377
   The configuration is documented inside the installation chapter of this
373
   document, or in the recoll.conf(5) man page. The most immediately useful
378
   document, or in the recoll.conf(5) man page, but the most current
374
   variable you may interested in is probably topdirs, which determines what
379
   information will most likely be the comments inside the sample file. The
375
   subtrees get indexed.
380
   most immediately useful variable you may be interested in is probably
381
   topdirs, which determines what subtrees get indexed.
376
382
377
   The applications needed to index file types other than text, HTML or email
383
   The applications needed to index file types other than text, HTML or email
378
   (ie: pdf, postscript, ms-word...) are described in the external packages
384
   (ie: pdf, postscript, ms-word...) are described in the external packages
379
   section
385
   section
380
386
...
...
658
   the author field (exactly what this is would depend on the document type,
664
   the author field (exactly what this is would depend on the document type,
659
   ie: the From: header, for an email message), and containing either beatles
665
   ie: the From: header, for an email message), and containing either beatles
660
   or lennon and either live or unplugged but not potatoes (in any part of
666
   or lennon and either live or unplugged but not potatoes (in any part of
661
   the document).
667
   the document).
662
668
663
   The first element author:"john doe" is a phrase search limited to a
664
   specific field. Phrase searches are specified as usual by enclosing the
665
   words in double quotes. The field specification appears before the colon
666
   (of course this is not limited to phrases, author:Balzac would be ok too).
667
   Recoll currently manages the following fields:
668
669
     * title, subject or caption are synonyms which specify data to be
670
       searched for in the document title or subject.
671
672
     * author or from for searching the documents originators.
673
674
     * keyword for searching the document specified keywords (few documents
675
       actually have any).
676
677
   The query language is currently the only way to use the Recoll field
678
   search capability.
679
680
   All elements in the search entry are normally combined with an implicit
669
   All elements in the search entry are normally combined with an implicit
681
   AND. It is possible to specify that elements be OR'ed instead, as in
670
   AND. It is possible to specify that elements be OR'ed instead, as in
682
   Beatles OR Lennon. The OR must be entered literally (capitals), and it has
671
   Beatles OR Lennon. The OR must be entered literally (capitals), and it has
683
   priority over the AND associations: word1 word2 OR word3 means word1 AND
672
   priority over the AND associations: word1 word2 OR word3 means word1 AND
684
   (word2 OR word3) not (word1 AND word2) OR word3. Do not enter explicit
673
   (word2 OR word3) not (word1 AND word2) OR word3. Do not enter explicit
685
   parentheses, they are not supported for now.
674
   parentheses, they are not supported for now.
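
   The precedence rule can be illustrated with a small Python sketch. This
   is not Recoll's parser: group_terms is a hypothetical helper that only
   shows how OR binds more tightly than the implicit AND. It assumes that
   shlex quoting rules are close enough for phrases, and it does not
   interpret fields, negation or stemming.

```python
import shlex


def group_terms(query):
    """Split a query into AND-combined groups of OR-combined terms.

    Illustration only: each inner list is an OR group, and the outer
    list is the implicit AND of those groups.
    """
    groups = []
    join_next = False
    for token in shlex.split(query):
        if token == "OR":
            # OR attaches the next term to the previous group, if any.
            join_next = bool(groups)
            continue
        if join_next:
            groups[-1].append(token)
        else:
            groups.append([token])
        join_next = False
    return groups


# word1 AND (word2 OR word3)
print(group_terms("word1 word2 OR word3"))
```

   The printed structure has two groups, the second one holding both word2
   and word3, which matches the word1 AND (word2 OR word3) reading.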
686
675
687
   An entry preceded by a - specifies a term that should not appear.
676
   An entry preceded by a - specifies a term that should not appear.
688
677
678
   The first element in the above example, author:"john doe" is a phrase
679
   search limited to a specific field. Phrase searches are specified as usual
680
   by enclosing the words in double quotes. The field specification appears
681
   before the colon (of course this is not limited to phrases, author:Balzac
682
   would be ok too). Recoll currently manages the following fields:
683
684
     * title, subject or caption are synonyms which specify data to be
685
       searched for in the document title or subject.
686
687
     * author or from for searching the document's originators.
688
689
     * keyword for searching the document specified keywords (few documents
690
       actually have any).
691
692
   As of release 1.9, the filters have the possibility to create other fields
693
   with arbitrary names. No standard filters use this possibility yet.
694
695
   There are two other elements which may be specified through the field
696
   syntax, but are somewhat special:
697
698
     * ext for specifying the file name extension (Ex: ext:html)
699
700
     * mime for specifying the mime type. This one is quite special because
701
       you can specify several values which will be OR'ed (the normal default
702
       for the language is AND). Ex: mime:text/plain mime:text/html.
703
       Specifying an explicit boolean operator or negation (-) before a mime
704
       specification is not supported and will produce strange results.
705
706
   The query language is currently the only way to use the Recoll field
707
   search capability.
708
689
   Words inside phrases and capitalized words are not stem-expanded.
709
   Words inside phrases and capitalized words are not stem-expanded.
690
   Wildcards may be used anywhere.
710
   Wildcards may be used anywhere inside a term. Specifying a wild-card on
711
   the left of a term can produce a very slow search.
691
712
692
   You can use the show query link at the top of the result list to check the
713
   You can use the show query link at the top of the result list to check the
693
   exact query which was finally executed by Xapian.
714
   exact query which was finally executed by Xapian.
694
715
695
     ----------------------------------------------------------------------
716
     ----------------------------------------------------------------------
...
...
871
     ----------------------------------------------------------------------
892
     ----------------------------------------------------------------------
872
893
873
3.9. Document history
894
3.9. Document history
874
895
875
   Documents that you actually view (with the internal preview or an external
896
   Documents that you actually view (with the internal preview or an external
876
   tool) are entered into the document history, which is remembered. You can
897
   tool) are entered into the document history, which is remembered.
898
877
   display the history list by using the Tools/Doc History menu entry.
899
   You can display the history list by using the Tools/Doc History menu
900
   entry.
901
902
   You can erase the document history by using the Erase document history
903
   entry in the File menu.
878
904
879
     ----------------------------------------------------------------------
905
     ----------------------------------------------------------------------
880
906
881
3.10. Sorting search results
907
3.10. Sorting search results
882
908
...
...
888
   result list, according to specified criteria. The currently available
914
   result list, according to specified criteria. The currently available
889
   criteria are date and mime type.
915
   criteria are date and mime type.
890
916
891
   The sort parameters stay in effect until they are explicitly reset, or the
917
   The sort parameters stay in effect until they are explicitly reset, or the
892
   program exits. An activated sort is indicated in the result list header.
918
   program exits. An activated sort is indicated in the result list header.
919
920
   Sort parameters are remembered between program invocations, but result
921
   sorting is normally always inactive when the program starts. It is
922
   possible to keep the sorting activation state between program invocations
923
   by checking the Remember sort activation state option in the preferences.
893
924
894
     ----------------------------------------------------------------------
925
     ----------------------------------------------------------------------
895
926
896
3.11. Search tips, shortcuts
927
3.11. Search tips, shortcuts
897
928
...
...
982
1013
983
          * %A. Abstract
1014
          * %A. Abstract
984
1015
985
          * %D. Date
1016
          * %D. Date
986
1017
1018
          * %I. Icon image name
1019
987
          * %K. Keywords (if any)
1020
          * %K. Keywords (if any)
988
1021
989
          * %L. Preview and Edit links
1022
          * %L. Preview and Edit links
990
1023
991
          * %M. Mime type
1024
          * %M. Mime type
...
...
1000
1033
1001
          * %U. Url
1034
          * %U. Url
1002
1035
1003
       The default value for the string is:
1036
       The default value for the string is:
1004
1037
1005
 %R %S %L &nbsp;&nbsp;<b>%T</b><br>
1038
 <img src="%I" align="left">%R %S %L &nbsp;&nbsp;<b>%T</b><br>
1006
 %M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i><br>
1039
 %M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i><br>
1007
 %A %K
1040
 %A %K
1008
       
1041
       
1009
1042
1010
       You may, for example, try the following for a more web-like
1043
       You may, for example, try the following for a more web-like
...
...
1012
1045
1013
 <u><b><a href="P%N">%T</a></b></u><br>
1046
 <u><b><a href="P%N">%T</a></b></u><br>
1014
 %A<font color=#008000>%U - %S</font> - %L
1047
 %A<font color=#008000>%U - %S</font> - %L
1015
       
1048
       
1016
1049
1050
       Or the clean looking:
1051
1052
 <img src="%I" align="left">%L <font color="#900000">%R</font>
1053
   <b>%T</b><br>%S 
1054
 <font color="#808080"><i>%U</i></font>
1055
 <table bgcolor="#e0e0e0">
1056
 <tr><td><div>%A</div></td></tr>
1057
 </table>%K
1058
       
1059
1017
       The format of the Preview and Edit links is <a href="Pdocnum"> and <a
1060
       The format of the Preview and Edit links is <a href="Pdocnum"> and <a
1018
       href="Edocnum"> where docnum is what %N would print. This makes the
1061
       href="Edocnum"> where docnum is what %N would print. This makes the
1019
       title a preview link in the above format.
1062
       title a preview link in the above format.
1063
1064
       Please note that, due to the way the program handles right mouse
1065
       clicks in the result list, if the custom formatting results in
1066
       multiple paragraphs per result, right clicks will only work inside the
1067
       first one.
1020
1068
1021
     * HTML help browser: this will let you choose your preferred browser
1069
     * HTML help browser: this will let you choose your preferred browser
1022
       which will be started from the Help menu to read the user manual. You
1070
       which will be started from the Help menu to read the user manual. You
1023
       can enter a simple name if the command is in your PATH, or browse for
1071
       can enter a simple name if the command is in your PATH, or browse for
1024
       a full pathname.
1072
       a full pathname.
1025
1026
     * Show document type icons in result list: icons in the result list can
1027
       be turned off. They take quite a lot of space and convey relatively
1028
       little useful information.
1029
1073
1030
     * Auto-start simple search on white space entry: if this is checked, a
1074
     * Auto-start simple search on white space entry: if this is checked, a
1031
       search will be executed each time you enter a space in the simple
1075
       search will be executed each time you enter a space in the simple
1032
       search input field. This lets you look at the result list as you enter
1076
       search input field. This lets you look at the result list as you enter
1033
       new terms. This is off by default, you may like it or not...
1077
       new terms. This is off by default, you may like it or not...
...
...
1084
1128
1085
                            Chapter 4. Installation
1129
                            Chapter 4. Installation
1086
1130
1087
4.1. Installing a prebuilt copy
1131
4.1. Installing a prebuilt copy
1088
1132
1089
   Recoll binary installations are always linked statically to the xapian
1133
   Recoll binary packages from the Recoll web site are always linked
1090
   libraries, and have no other dependencies. You will only have to check or
1134
   statically to the Xapian libraries, and have no other dependencies. You
1091
   install supporting applications for the file types that you want to index
1135
   will only have to check or install supporting applications for the file
1092
   beyond text, HTML and mail files.
1136
   types that you want to index beyond text, HTML and mail files, and maybe
1137
   have a look at the configuration section (but this may not be necessary
1138
   for a quick test with default parameters).
1093
1139
1094
     ----------------------------------------------------------------------
1140
     ----------------------------------------------------------------------
1095
1141
1096
  4.1.1. Installing through a package system
1142
  4.1.1. Installing through a package system
1097
1143
1098
   If you use a BSD-type port system or a prebuilt package (RPM or other),
1144
   If you use a BSD-type port system or a prebuilt package (RPM or other),
1099
   just follow the usual procedure, and maybe have a look at the
1145
   just follow the usual procedure for your system.
1100
   configuration section (but this may not be necessary for a quick test with
1101
   default parameters).
1102
1146
1103
     ----------------------------------------------------------------------
1147
     ----------------------------------------------------------------------
1104
1148
1105
  4.1.2. Installing a prebuilt Recoll
1149
  4.1.2. Installing a prebuilt Recoll
1106
1150
1107
   The unpackaged binary versions are just compressed tar files of a build
1151
   The unpackaged binary versions on the Recoll web site are just compressed
1108
   tree, where only the useful parts were kept (executables and sample
1152
   tar files of a build tree, where only the useful parts were kept
1109
   configuration).
1153
   (executables and sample configuration).
1110
1154
1111
   The executable binary files are built with a static link to libxapian and
1155
   The executable binary files are built with a static link to libxapian and
1112
   libiconv, to make installation easier (no dependencies). However, this
1156
   libiconv, to make installation easier (no dependencies).
1113
   also means that you cannot change the versions which are used.
1114
1157
1115
   After extracting the tar file, you can proceed with installation as if you
1158
   After extracting the tar file, you can proceed with installation as if you
1116
   had built the package from source (that is, just type make install). The
1159
   had built the package from source (that is, just type make install). The
1117
   binary trees are built for installation to /usr/local.
1160
   binary trees are built for installation to /usr/local.
1118
1161
1119
   You may then need to install external applications to process some file
1120
   types that you want indexed (ie: acrobat, postscript ...). See next
1121
   section.
1122
1123
   Finally, you may want to have a look at the configuration section.
1124
1125
     ----------------------------------------------------------------------
1162
     ----------------------------------------------------------------------
1126
1163
1127
4.2. Supporting packages
1164
4.2. Supporting packages
1128
1165
1129
   Recoll uses external applications to index some file types. You need to
1166
   Recoll uses external applications to index some file types. You need to
...
...
1159
4.3. Building from source
1196
4.3. Building from source
1160
1197
1161
  4.3.1. Prerequisites
1198
  4.3.1. Prerequisites
1162
1199
1163
   At the very least, you will need to download and install the xapian core
1200
   At the very least, you will need to download and install the xapian core
1164
   package (Recoll development currently uses version 0.9.5), and the qt
1201
   package (Recoll 1.9 normally uses version 1.0.2, but any 0.9 or 1.0.x
1165
   run-time and development packages (Recoll development currently uses
1202
   version will work too), and the qt run-time and development packages
1166
   version 3.3.5, but any 3.3 version is probably OK).
1203
   (Recoll development currently uses version 3.3.5, but any 3.3 version is
1204
   probably OK).
1167
1205
1168
   You will most probably be able to find a binary package for qt for your
1206
   You will most probably be able to find a binary package for qt for your
1169
   system. You may have to compile Xapian but this is not difficult (if you
1207
   system. You may have to compile Xapian but this is not difficult (if you
1170
   are using FreeBSD, there is a port).
1208
   are using FreeBSD, there is a port).
1171
1209
...
...
1176
     ----------------------------------------------------------------------
1214
     ----------------------------------------------------------------------
1177
1215
1178
  4.3.2. Building
1216
  4.3.2. Building
1179
1217
1180
   Recoll has been built on Linux (redhat7.3, mandriva 2005/6, Fedora Core
1218
   Recoll has been built on Linux (redhat7.3, mandriva 2005/6, Fedora Core
1181
   3/4/5), FreeBSD and Solaris 8. If you build on another system, I would
1219
   3/4/5/6), FreeBSD 5/6, macosx, and Solaris 8. If you build on another
1182
   very much welcome patches.
1220
   system, and need to modify things, I would very much welcome patches.
1183
1221
1184
   Depending on the qt configuration on your system, you may have to set the
1222
   Depending on the qt configuration on your system, you may have to set the
1185
   QTDIR and QMAKESPECS variables in your environment:
1223
   QTDIR and QMAKESPECS variables in your environment:
1186
1224
1187
     * QTDIR should point to the directory above the one that holds the qt
1225
     * QTDIR should point to the directory above the one that holds the qt
...
...
1368
1406
1369
           Where the messages should go. 'stderr' can be used as a special
1407
           Where the messages should go. 'stderr' can be used as a special
1370
           value, and is the default. The daemversion is specific to the
1408
           value, and is the default. The daemversion is specific to the
1371
           indexing monitor daemon.
1409
           indexing monitor daemon.
1372
1410
1373
   filtersdir
1374
1375
           A directory to search for the external filter scripts used to
1376
           index some types of files. The value should not be changed, except
1377
           if you want to modify one of the default scripts. The value can be
1378
           redefined for any sub-directory.
1379
1380
   indexstemminglanguages
1411
   indexstemminglanguages
1381
1412
1382
           A list of languages for which the stem expansion databases will be
1413
           A list of languages for which the stem expansion databases will be
1383
           built. See recollindex(1) for possible values. You can add a stem
1414
           built. See recollindex(1) or use the recollindex -l command for
1384
           expansion database for a different language by using recollindex
1415
           possible values. You can add a stem expansion database for a
1385
           -s, but it will be deleted during the next indexing. Only
1416
           different language by using recollindex -s, but it will be deleted
1417
           during the next indexing. Only languages listed in the
1386
           languages listed in the configuration file are permanent.
1418
           configuration file are permanent.
1387
1419
1388
   defaultcharset
1420
   defaultcharset
1389
1421
1390
           The name of the character set used for files that do not contain a
1422
           The name of the character set used for files that do not contain a
1391
           character set definition (ie: plain text files). This can be
1423
           character set definition (ie: plain text files). This can be
1392
           redefined for any sub-directory. If it is not set at all, the
1424
           redefined for any sub-directory. If it is not set at all, the
1393
           character set used is the one defined by the nls environment
1425
           character set used is the one defined by the nls environment
1394
           (LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
1426
           (LC_ALL, LC_CTYPE, LANG), or iso8859-1 if nothing is set.
1427
1428
   maxfsoccuppc
1429
1430
           Maximum file system occupation before we stop indexing. The value
1431
           is a percentage, corresponding to what the "Capacity" df output
1432
           column shows. The default value is 0, meaning no checking.
1433
1434
   idxflushmb
1435
1436
           Threshold (megabytes of new text data) where we flush from memory
1437
           to disk index. Setting this can help control memory usage. A value
1438
           of 0 means no explicit flushing, letting Xapian use its own
1439
           default, which is flushing every 10000 documents (memory usage
1440
           depends on average document size). The default value is 10.
1441
1442
   filtersdir
1443
1444
           A directory to search for the external filter scripts used to
1445
           index some types of files. The value should not be changed, except
1446
           if you want to modify one of the default scripts. The value can be
1447
           redefined for any sub-directory.
1448
1449
   iconsdir
1450
1451
           The name of the directory where recoll result list icons are
1452
           stored. You can change this if you want different images.
1395
1453
1396
   guesscharset
1454
   guesscharset
1397
1455
1398
           Decide if we try to guess the character set of files if no
1456
           Decide if we try to guess the character set of files if no
1399
           internal value is available (ie: for plain text files). This does
1457
           internal value is available (ie: for plain text files). This does
...
...
1422
           database. This is so that they can be displayed inside the result
1480
           database. This is so that they can be displayed inside the result
1423
           lists without decoding the original file. This parameter defines
1481
           lists without decoding the original file. This parameter defines
1424
           the size of the stored abstract (which can come from an actual
1482
           the size of the stored abstract (which can come from an actual
1425
           section or just be the beginning of the text). The default value
1483
           section or just be the beginning of the text). The default value
1426
           is 250.
1484
           is 250.
1427
1428
   iconsdir
1429
1430
           The name of the directory where recoll result list icons are
1431
           stored. You can change this if you want different images.
1432
1485
1433
   aspellLanguage
1486
   aspellLanguage
1434
1487
1435
           Language definitions to use when creating the aspell dictionary.
1488
           Language definitions to use when creating the aspell dictionary.
1436
           The value must match a set of aspell language definition files.
1489
           The value must match a set of aspell language definition files.
...
...
1569
   The rclblob filter should be an executable program or script which exists
1622
   The rclblob filter should be an executable program or script which exists
1570
   inside /usr/[local/]share/recoll/filters. It will be given a file name as
1623
   inside /usr/[local/]share/recoll/filters. It will be given a file name as
1571
   argument and should output the text contents in html format on the
1624
   argument and should output the text contents in html format on the
1572
   standard output.
1625
   standard output.
1573
1626
1627
   You can find more details about writing a Recoll filter in the section
1628
   about writing filters
1629
1630
     ----------------------------------------------------------------------
1631
1632
4.5. Extending Recoll
1633
1634
  4.5.1. Writing a document filter
1635
1636
   Recoll filters are executable programs which translate from a specific
1637
   format (ie: openoffice, acrobat, etc.) to the Recoll indexing input
1638
   format, which was chosen to be HTML.
1639
1640
   Recoll filters are usually shell-scripts, but this is in no way necessary.
1641
   These programs are extremely simple and most of the difficulty lies in
1642
   extracting the text from the native format, not outputting what is
1643
   expected by Recoll. Happily enough, most document formats already have
1644
   translators or text extractors which handle the difficult part and can be
1645
   called from the filter.
1646
1647
   Filters are called with a single argument which is the source file name.
1648
   They should output the result to stdout.
1649
1650
   The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
1651
   the filter if the operation is for indexing or previewing. Some filters
1652
   use this to output a slightly different format. This is not essential.
1653
1574
   The html could be very minimal like the following example:
1654
   The output HTML could be very minimal like the following example:
1575
1655
1576
 <html><head>
1656
 <html><head>
1577
 <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
1657
 <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
1578
 </head>
1658
 </head>
1579
 <body>some text content</body></html>
1659
 <body>some text content</body></html>
...
...
1588
   accurate for good results.
1668
   accurate for good results.
1589
1669
1590
   Recoll will also make use of other header fields if they are present:
1670
   Recoll will also make use of other header fields if they are present:
1591
   title, description, keywords.
1671
   title, description, keywords.
1592
1672
1673
   As of Recoll release 1.9, filters also have the possibility to "invent"
1674
   field names. This should be output as meta tags:
1675
1676
 <meta name="somefield" content="Some textual data" />
1677
1678
   In this case, a correspondence between field name and Xapian prefix should
1679
   also be added to the mimeconf file. See the existing entries for
1680
   inspiration. The field can then be used inside the query language to
1681
   narrow searches.
1682
1593
   The easiest way to write a new filter is probably to start from an
1683
   The easiest way to write a new filter is probably to start from an
1594
   existing one.
1684
   existing one.
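
   As an illustration only, here is a minimal sketch of the output side of
   a filter, written in Python and assuming the input file is already plain
   UTF-8 text. The render_html and filter_file names and the somefield
   field are placeholders, not part of Recoll. A real filter would replace
   the file read with a call to an extractor for its format.

```python
import html


def render_html(text, fields=None):
    """Wrap extracted text in the minimal HTML that a filter outputs.

    fields is an optional dict of extra field names and values, emitted
    as meta tags (hypothetical names, for illustration).
    """
    metas = "".join(
        '<meta name="%s" content="%s" />\n'
        % (html.escape(name), html.escape(value))
        for name, value in (fields or {}).items()
    )
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">\n'
        + metas
        + "</head>\n<body>"
        + html.escape(text)
        + "</body></html>\n"
    )


def filter_file(path, fields=None):
    """Read a plain text file and return the HTML for indexing."""
    # Assumption: the source is UTF-8 text; undecodable bytes are replaced.
    with open(path, encoding="utf-8", errors="replace") as source:
        return render_html(source.read(), fields)
```

   Since filters receive the source file name as their single argument and
   write to stdout, a complete script would finish by writing the result of
   filter_file(sys.argv[1]) to sys.stdout.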
1595
1685
1596
     ----------------------------------------------------------------------
1686
     ----------------------------------------------------------------------