a/src/README b/src/README
...
...
7
  Jean-Francois Dockes
7
  Jean-Francois Dockes
8
8
9
   <jfd@recoll.org>
9
   <jfd@recoll.org>
10
10
11
   Copyright (c) 2005-2012 Jean-Francois Dockes
11
   Copyright (c) 2005-2012 Jean-Francois Dockes
12
13
   Permission is granted to copy, distribute and/or modify this document
14
   under the terms of the GNU Free Documentation License, Version 1.3 or any
15
   later version published by the Free Software Foundation; with no Invariant
16
   Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the
17
   license can be found at the following location: GNU web site.
12
18
13
   This document introduces full text search notions and describes the
19
   This document introduces full text search notions and describes the
14
   installation and use of the Recoll application. It currently describes
20
   installation and use of the Recoll application. It currently describes
15
   Recoll 1.19.
21
   Recoll 1.19.
16
22
...
...
50
56
51
                             2.3.2. Index case and diacritics sensitivity
57
                             2.3.2. Index case and diacritics sensitivity
52
58
53
                             2.3.3. The index configuration GUI
59
                             2.3.3. The index configuration GUI
54
60
55
                2.4. Index WEB visited page history
61
                2.4. Indexing WEB pages you visit
56
62
63
                2.5. Extended attributes data
64
65
                2.6. Importing external tags
66
57
                2.5. Periodic indexing
67
                2.7. Periodic indexing
58
68
59
                             2.5.1. Running indexing
69
                             2.7.1. Running indexing
60
70
61
                             2.5.2. Using cron to automate indexing
71
                             2.7.2. Using cron to automate indexing
62
72
63
                2.6. Real time indexing
73
                2.8. Real time indexing
64
74
65
                             2.6.1. Slowing down the reindexing rate for fast
75
                             2.8.1. Slowing down the reindexing rate for fast
66
                             changing files
76
                             changing files
67
77
68
   3. Searching
78
   3. Searching
69
79
70
                3.1. Searching with the Qt graphical user interface
80
                3.1. Searching with the Qt graphical user interface
...
...
100
110
101
                             3.2.2. Searchable documents
111
                             3.2.2. Searchable documents
102
112
103
                3.3. Searching on the command line
113
                3.3. Searching on the command line
104
114
115
                3.4. Path translations
116
105
                3.4. The query language
117
                3.5. The query language
106
118
107
                             3.4.1. Modifiers
119
                             3.5.1. Modifiers
108
120
109
                3.5. Search case and diacritics sensitivity
121
                3.6. Search case and diacritics sensitivity
110
122
111
                3.6. Anchored searches and wildcards
123
                3.7. Anchored searches and wildcards
112
124
113
                             3.6.1. More about wildcards
125
                             3.7.1. More about wildcards
114
126
115
                             3.6.2. Anchored searches
127
                             3.7.2. Anchored searches
116
128
117
                3.7. Desktop integration
129
                3.8. Desktop integration
118
130
119
                             3.7.1. Hotkeying recoll
131
                             3.8.1. Hotkeying recoll
120
132
121
                             3.7.2. The KDE Kicker Recoll applet
133
                             3.8.2. The KDE Kicker Recoll applet
122
134
123
   4. Programming interface
135
   4. Programming interface
124
136
125
                4.1. Writing a document filter
137
                4.1. Writing a document filter
126
138
...
...
170
182
171
                             5.4.4. The mimeconf file
183
                             5.4.4. The mimeconf file
172
184
173
                             5.4.5. The mimeview file
185
                             5.4.5. The mimeview file
174
186
187
                             5.4.6. The ptrans file
188
175
                             5.4.6. Examples of configuration adjustments
189
                             5.4.7. Examples of configuration adjustments
176
190
177
Chapter 1. Introduction
191
Chapter 1. Introduction
178
192
179
1.1. Giving it a try
193
1.1. Giving it a try
180
194
...
...
394
   would be needed for indexing existing file types. This list can be
408
   would be needed for indexing existing file types. This list can be
395
   displayed by selecting the menu option File -> Show Missing Helpers in the
409
   displayed by selecting the menu option File -> Show Missing Helpers in the
396
   recoll GUI. It is stored in the missing text file inside the configuration
410
   recoll GUI. It is stored in the missing text file inside the configuration
397
   directory.
411
   directory.
398
412
413
   By default, Recoll will try to index any file type that it has a way to
414
   read. This is sometimes not desirable, and there are ways to either
415
   exclude some types, or on the contrary to define a positive list of types
416
   to be indexed. In the latter case, any type not in the list will be
417
   ignored.
418
419
   Excluding types can be done by adding name patterns to the skippedNames
420
   list, which can be done from the GUI Index configuration menu. It is also
421
   possible to exclude a mime type independently of the file name by
422
   associating it with the rclnull filter. This can be done by editing the
423
   mimeconf configuration file.
424
425
   In order to define a positive list, you need to edit the main
426
   configuration file (recoll.conf) and set the indexedmimetypes
427
   configuration variable. Example:
428
429
 indexedmimetypes = text/html application/pdf
430
          
431
432
   There is no GUI way to do this, because this option runs a bit contrary to
433
   Recoll's main goal, which is to help you find information, independently of
434
   how it may be stored.
435
399
  2.1.4. Recovery
436
  2.1.4. Recovery
400
437
401
   In the rare case where the index becomes corrupted (which can signal
438
   In the rare case where the index becomes corrupted (which can signal
402
   itself by weird search results or crashes), the index files need to be
439
   itself by weird search results or crashes), the index files need to be
403
   erased before restarting a clean indexing pass. Just delete the xapiandb
440
   erased before restarting a clean indexing pass. Just delete the xapiandb
...
...
613
   The configuration tool normally respects the comments and most of the
650
   The configuration tool normally respects the comments and most of the
614
   formatting inside the configuration file, so that it is quite possible to
651
   formatting inside the configuration file, so that it is quite possible to
615
   use it on hand-edited files, which you might nevertheless want to back up
652
   use it on hand-edited files, which you might nevertheless want to back up
616
   first...
653
   first...
617
654
618
2.4. Index WEB visited page history
655
2.4. Indexing WEB pages you visit
619
656
620
   With the help of a Firefox extension, Recoll can index the Internet pages
657
   With the help of a Firefox extension, Recoll can index the Internet pages
621
   that you visit. The extension was initially designed for the Beagle
658
   that you visit. The extension was initially designed for the Beagle
622
   indexer, but it has recently been renamed and better adapted to Recoll.
659
   indexer, but it has recently been renamed and better adapted to Recoll.
623
660
...
...
636
   the Index configuration / Web history panel. Once the maximum size is
673
   the Index configuration / Web history panel. Once the maximum size is
637
   reached, old pages are purged - both from the cache and the index - to
674
   reached, old pages are purged - both from the cache and the index - to
638
   make room for new ones, so you need to explicitly archive in some other
675
   make room for new ones, so you need to explicitly archive in some other
639
   place the pages that you want to keep indefinitely.
676
   place the pages that you want to keep indefinitely.
640
677
678
2.5. Extended attributes data
679
680
   User extended attributes are named pieces of information that most modern
681
   file systems can attach to any file.
682
683
   Recoll versions 1.19 and later process extended attributes as document
684
   fields by default. For older versions, this has to be activated at build
685
   time.
686
687
   A freedesktop standard defines a few special attributes, which are handled
688
   as such by Recoll:
689
690
   mime_type
691
692
           If set, this overrides any other determination of the file mime
693
           type.
694
695
   charset
696
           If set, this defines the file character set (mostly useful for
697
           plain text files).
698
699
   By default, other attributes are handled as Recoll fields. On Linux, the
700
   user prefix is removed from the name. This can be configured more
701
   precisely inside the fields configuration file.
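
   As an illustration only, the attribute handling described above can be
   sketched as follows. This is not Recoll code: the function names are
   hypothetical, and the sketch assumes that the two special attributes are
   recognized after the Linux user prefix has been removed.

```python
# Hypothetical sketch of the handling described above, not Recoll's code.
SPECIAL_ATTRIBUTES = {"mime_type", "charset"}


def field_name(attr_name):
    """Drop the Linux 'user.' namespace prefix from an attribute name."""
    prefix = "user."
    if attr_name.startswith(prefix):
        return attr_name[len(prefix):]
    return attr_name


def classify(attrs):
    """Split attributes into special ones and ordinary document fields."""
    special, fields = {}, {}
    for name, value in attrs.items():
        short = field_name(name)
        if short in SPECIAL_ATTRIBUTES:
            special[short] = value
        else:
            fields[short] = value
    return special, fields
```

   For example, an attribute named user.tags would end up as a field named
   tags, while user.mime_type would be set aside as a special attribute.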
702
703
2.6. Importing external tags
704
705
   During indexing, it is possible to import metadata for each file by
706
   executing commands. For example, this could extract user tag data for the
707
   file and store it in a field for indexing.
708
709
   See the section about the metadatacmds field in the main configuration
710
   chapter for more detail.
711
641
2.5. Periodic indexing
712
2.7. Periodic indexing
642
713
643
  2.5.1. Running indexing
714
  2.7.1. Running indexing
644
715
645
   Indexing is always performed by the recollindex program, which can be
716
   Indexing is always performed by the recollindex program, which can be
646
   started either from the command line or from the File menu in the recoll
717
   started either from the command line or from the File menu in the recoll
647
   GUI program. When started from the GUI, the indexing will run on the same
718
   GUI program. When started from the GUI, the indexing will run on the same
648
   configuration recoll was started on. When started from the command line,
719
   configuration recoll was started on. When started from the command line,
...
...
694
765
695
   recollindex -i will not descend into subdirectories specified as
766
   recollindex -i will not descend into subdirectories specified as
696
   parameters, but just add them as index entries. It is up to the external
767
   parameters, but just add them as index entries. It is up to the external
697
   file selection method to build the complete file list.
768
   file selection method to build the complete file list.
698
769
699
  2.5.2. Using cron to automate indexing
770
  2.7.2. Using cron to automate indexing
700
771
701
   The most common way to set up indexing is to have a cron task execute it
772
   The most common way to set up indexing is to have a cron task execute it
702
   every night. For example the following crontab entry would do it every day
773
   every night. For example the following crontab entry would do it every day
703
   at 3:30AM (supposing recollindex is in your PATH):
774
   at 3:30AM (supposing recollindex is in your PATH):
704
775
...
...
720
   Please be aware that there may be differences between your usual
791
   Please be aware that there may be differences between your usual
721
   interactive command line environment and the one seen by crontab commands.
792
   interactive command line environment and the one seen by crontab commands.
722
   Especially the PATH variable may be of concern. Please check the crontab
793
   Especially the PATH variable may be of concern. Please check the crontab
723
   manual pages about possible issues.
794
   manual pages about possible issues.
724
795
725
2.6. Real time indexing
796
2.8. Real time indexing
726
797
727
   Real time monitoring/indexing is performed by starting the recollindex -m
798
   Real time monitoring/indexing is performed by starting the recollindex -m
728
   command. With this option, recollindex will detach from the terminal and
799
   command. With this option, recollindex will detach from the terminal and
729
   become a daemon, permanently monitoring file changes and updating the
800
   become a daemon, permanently monitoring file changes and updating the
730
   index.
801
   index.
...
...
779
   email folders change. Also, monitoring large file trees by itself
850
   email folders change. Also, monitoring large file trees by itself
780
   significantly taxes system resources. You probably do not want to enable
851
   significantly taxes system resources. You probably do not want to enable
781
   it if your system is short on resources. Periodic indexing is adequate in
852
   it if your system is short on resources. Periodic indexing is adequate in
782
   most cases.
853
   most cases.
783
854
784
  2.6.1. Slowing down the reindexing rate for fast changing files
855
  2.8.1. Slowing down the reindexing rate for fast changing files
785
856
786
   When using the real time monitor, it may happen that some files need to be
857
   When using the real time monitor, it may happen that some files need to be
787
   indexed, but change so often that they impose an excessive load for the
858
   indexed, but change so often that they impose an excessive load for the
788
   system.
859
   system.
789
860
...
...
1273
1344
1274
   Note that in cases where Recoll does not know the beginning of the string
1345
   Note that in cases where Recoll does not know the beginning of the string
1275
   to search for (i.e. a wildcard expression like *coll), the expansion can
1346
   to search for (i.e. a wildcard expression like *coll), the expansion can
1276
   take quite a long time because the full index term list will have to be
1347
   take quite a long time because the full index term list will have to be
1277
   processed. The expansion is currently limited to 10000 results for
1348
   processed. The expansion is currently limited to 10000 results for
1278
   wildcards and regular expressions.
1349
   wildcards and regular expressions. It is possible to change the limit in
1350
   the configuration file.
1279
1351
1280
   Double-clicking on a term in the result list will insert it into the
1352
   Double-clicking on a term in the result list will insert it into the
1281
   simple search entry field. You can also cut/paste between the result list
1353
   simple search entry field. You can also cut/paste between the result list
1282
   and any entry field (the end of lines will be taken care of).
1354
   and any entry field (the end of lines will be taken care of).
1283
1355
...
...
1292
   can be selected through the external indexes tab in the preferences
1364
   can be selected through the external indexes tab in the preferences
1293
   dialog.
1365
   dialog.
1294
1366
1295
   Index selection is performed in two phases. A set of all usable indexes
1367
   Index selection is performed in two phases. A set of all usable indexes
1296
   must first be defined, and then the subset of indexes to be used for
1368
   must first be defined, and then the subset of indexes to be used for
1297
   searching. Of course, these parameters are retained across program
1369
   searching. These parameters are retained across program executions (they
1298
   executions (they are kept separately for each Recoll configuration). The
1370
   are kept separately for each Recoll configuration). The set of all indexes
1299
   set of all indexes is usually quite stable, while the active ones might
1371
   is usually quite stable, while the active ones might typically be adjusted
1300
   typically be adjusted quite frequently.
1372
   quite frequently.
1301
1373
1302
   The main index (defined by RECOLL_CONFDIR) is always active. If this is
1374
   The main index (defined by RECOLL_CONFDIR) is always active. If this is
1303
   undesirable, you can set up your base configuration to index an empty
1375
   undesirable, you can set up your base configuration to index an empty
1304
   directory.
1376
   directory.
1377
1378
   When adding a new index to the set, you can select either a Recoll
1379
   configuration directory, or directly a Xapian index directory. In the
1380
   first case, the Xapian index directory will be obtained from the selected
1381
   configuration.
1305
1382
1306
   As building the set of all indexes can be a little tedious when done
1383
   As building the set of all indexes can be a little tedious when done
1307
   through the user interface, you can use the RECOLL_EXTRA_DBS environment
1384
   through the user interface, you can use the RECOLL_EXTRA_DBS environment
1308
   variable to provide an initial set. This might typically be set up by a
1385
   variable to provide an initial set. This might typically be set up by a
1309
   system administrator so that every user does not have to do it. The
1386
   system administrator so that every user does not have to do it. The
...
...
1453
1530
1454
   Scrolling the result list from the keyboard. You can use PageUp and
1531
   Scrolling the result list from the keyboard. You can use PageUp and
1455
   PageDown to scroll the result list, Shift+Home to go back to the first
1532
   PageDown to scroll the result list, Shift+Home to go back to the first
1456
   page. These work even while the focus is in the search entry.
1533
   page. These work even while the focus is in the search entry.
1457
1534
1535
   Editing a new search while the focus is not in the search entry. You can
1536
   use the Ctrl-Shift-S shortcut to return the cursor to the search entry
1537
   (and select the current search text), while the focus is anywhere in the
1538
   main window.
1539
1458
   Forced opening of a preview window. You can use Shift+Click on a result
1540
   Forced opening of a preview window. You can use Shift+Click on a result
1459
   list Preview link to force the creation of a preview window instead of a
1541
   list Preview link to force the creation of a preview window instead of a
1460
   new tab in the existing one.
1542
   new tab in the existing one.
1461
1543
1462
   Closing previews. Entering Ctrl-W in a tab will close it (and, for the
1544
   Closing previews. Entering Ctrl-W in a tab will close it (and, for the
...
...
1488
       to the whole Recoll application on startup. The default value is
1570
       to the whole Recoll application on startup. The default value is
1489
       empty, but there is a skeleton style sheet (recoll.qss) inside the
1571
       empty, but there is a skeleton style sheet (recoll.qss) inside the
1490
       /usr/share/recoll/examples directory. Using a style sheet, you can
1572
       /usr/share/recoll/examples directory. Using a style sheet, you can
1491
       change most recoll graphical parameters: colors, fonts, etc. See the
1573
       change most recoll graphical parameters: colors, fonts, etc. See the
1492
       sample file for a few simple examples.
1574
       sample file for a few simple examples.
1575
1576
       You should be aware that parameters (e.g.: the background color) set
1577
       inside the Recoll GUI style sheet will override global system
1578
       preferences, with possible strange side effects: for example if you
1579
       set the foreground to a light color and the background to a dark one
1580
       in the desktop preferences, but only the background is set inside the
1581
       Recoll style sheet, and it is light too, then text will appear
1582
       light-on-light inside the Recoll GUI.
1493
1583
1494
     o Maximum text size highlighted for preview Inserting highlights on
1584
     o Maximum text size highlighted for preview Inserting highlights on
1495
       search term inside the text before inserting it in the preview window
1585
       search term inside the text before inserting it in the preview window
1496
       involves quite a lot of processing, and can be disabled over the given
1586
       involves quite a lot of processing, and can be disabled over the given
1497
       text size to speed up loading.
1587
       text size to speed up loading.
...
...
1691
   will be replaced by the value of the field named fieldname for this
1781
   will be replaced by the value of the field named fieldname for this
1692
   document. Only stored fields can be accessed in this way, the value of
1782
   document. Only stored fields can be accessed in this way, the value of
1693
   indexed but not stored fields is not known at this point in the search
1783
   indexed but not stored fields is not known at this point in the search
1694
   process (see field configuration). There are currently very few fields
1784
   process (see field configuration). There are currently very few fields
1695
   stored by default, apart from the values above (only author and filename),
1785
   stored by default, apart from the values above (only author and filename),
1696
   so this feature will need some custom local configuration to be useful.
1786
   so this feature will need some custom local configuration to be useful. An
1697
   For example, you could look at the fields for the document types of
1787
   example candidate would be the recipient field which is generated by the
1698
   interest (use the right-click menu inside the preview window), and add
1788
   message filters.
1699
   what you want to the list of stored fields. A candidate example would be
1700
   the recipient field which is generated by the message filters.
1701
1789
1702
   The default value for the paragraph format string is:
1790
   The default value for the paragraph format string is:
1703
1791
1704
 <img src="%I" align="left">%R %S %L &nbsp;&nbsp;<b>%T</b><br>
1792
 <img src="%I" align="left">%R %S %L &nbsp;&nbsp;<b>%T</b><br>
1705
 %M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i>&nbsp;%i<br>
1793
 %M&nbsp;%D&nbsp;&nbsp;&nbsp;<i>%U</i>&nbsp;%i<br>
...
...
1757
1845
1758
   As a sample application, the Recoll KIO slave could allow preparing a set
1846
   As a sample application, the Recoll KIO slave could allow preparing a set
1759
   of HTML documents (for example a manual) so that they become their own
1847
   of HTML documents (for example a manual) so that they become their own
1760
   search interface inside konqueror.
1848
   search interface inside konqueror.
1761
1849
1762
   This can be done by either explicitly inserting <a href="recoll:/...">
1850
   This can be done by either explicitly inserting <a href="recoll://...">
1763
   links around some document areas, or automatically by adding a very small
1851
   links around some document areas, or automatically by adding a very small
1764
   javascript program to the documents, like the following example, which
1852
   javascript program to the documents, like the following example, which
1765
   would initiate a search by double-clicking any term:
1853
   would initiate a search by double-clicking any term:
1766
1854
1767
 <script language="JavaScript">
1855
 <script language="JavaScript">
...
...
1840
 text/html       [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html]      [comptes.html]  18593   bytes  
1928
 text/html       [file:///Users/uncrypted-dockes/projets/bateaux/ilur/comptes.html]      [comptes.html]  18593   bytes  
1841
 text/html       [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
1929
 text/html       [file:///Users/uncrypted-dockes/projets/nautique/webnautique/articles/ilur1/index.html] [Constructio...
1842
 text/html       [file:///Users/uncrypted-dockes/projets/pagepers/index.html]    [psxtcl/writemime/recoll]...
1930
 text/html       [file:///Users/uncrypted-dockes/projets/pagepers/index.html]    [psxtcl/writemime/recoll]...
1843
 text/html       [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
1931
 text/html       [file:///Users/uncrypted-dockes/projets/bateaux/ilur/factEtCie/recu-chasse-maree....
1844
1932
1933
3.4. Path translations
1934
1935
   In some cases, the document paths stored inside the index do not match the
1936
   actual ones, so that document previews and accesses will fail. This can
1937
   occur in a number of circumstances:
1938
1939
     o When using multiple indexes it is a relatively common occurrence that
1940
       some will actually reside on a remote volume, for example mounted via
1941
       NFS. In this case, the paths used to access the documents on the local
1942
       machine are not necessarily the same as the ones used while indexing
1943
       on the remote machine. For example, /home/me may have been used as a
1944
       topdirs element while indexing, but the directory might be mounted as
1945
       /net/server/home/me on the local machine.
1946
1947
     o The case may also occur with removable disks. It is perfectly possible
1948
       to configure an index to live with the documents on the removable
1949
       disk, but it may happen that the disk is not mounted at the same place
1950
       so that the document paths from the index are invalid.
1951
1952
     o As a last example, one could imagine that a big directory has been
1953
       moved, but that it is currently inconvenient to run the indexer.
1954
1955
   More generally, the path translation facility may be useful whenever the
1956
   document paths seen by the indexer are not the same as the ones which
1957
   should be used at query time.
1958
1959
   Recoll has a facility for rewriting access paths when extracting the data
1960
   from the index. The translations can be defined for the main index and for
1961
   any additional query index.
1962
1963
   In the above NFS example, Recoll could be instructed to rewrite any
1964
   file:///home/me URL from the index to file:///net/server/home/me, allowing
1965
   accesses from the client.
1966
1967
   The translations are defined in the ptrans configuration file, which can
1968
   be edited by hand or from the GUI external indexes configuration dialog.
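
   The rewriting described in the NFS example can be sketched as below.
   This is a minimal illustration, not Recoll's implementation: the function
   name translate_url and the translations mapping are hypothetical, and
   both the longest-prefix rule and the check on path component boundaries
   are assumptions made for the sketch.

```python
# Hypothetical sketch of query-time path translation, not Recoll's code.
def translate_url(url, translations):
    """Rewrite a file:// URL whose path starts with a translated prefix."""
    scheme = "file://"
    if not url.startswith(scheme):
        return url  # only local file URLs are rewritten in this sketch
    path = url[len(scheme):]
    # Assumption: try the longest index-side prefix first.
    for old in sorted(translations, key=len, reverse=True):
        old_prefix = old.rstrip("/")
        new_prefix = translations[old].rstrip("/")
        # Assumption: match whole path components only, so that
        # /home/me does not also match /home/media.
        if path == old_prefix or path.startswith(old_prefix + "/"):
            return scheme + new_prefix + path[len(old_prefix):]
    return url


ptrans = {"/home/me": "/net/server/home/me"}
print(translate_url("file:///home/me/doc/notes.txt", ptrans))
```

   With the mapping above, a document indexed under /home/me is returned
   with a /net/server/home/me path, while URLs outside that tree are left
   unchanged.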
1969
1845
3.4. The query language
1970
3.5. The query language
1846
1971
1847
   The query language processor is activated in the GUI simple search entry
1972
   The query language processor is activated in the GUI simple search entry
1848
   when the search mode selector is set to Query Language. It can also be
1973
   when the search mode selector is set to Query Language. It can also be
1849
   used with the KIO slave or the command line search. It broadly has the
1974
   used with the KIO slave or the command line search. It broadly has the
1850
   same capabilities as the complex search interface in the GUI.
1975
   same capabilities as the complex search interface in the GUI.
...
...
1912
   The field syntax also supports a few field-like, but special, criteria:
2037
   The field syntax also supports a few field-like, but special, criteria:
1913
2038
1914
     o dir for filtering the results on file location (Ex:
2039
     o dir for filtering the results on file location (Ex:
1915
       dir:/home/me/somedir). -dir also works to find results not in the
2040
       dir:/home/me/somedir). -dir also works to find results not in the
1916
       specified directory (release >= 1.15.8). A tilde inside the value will
2041
       specified directory (release >= 1.15.8). A tilde inside the value will
1917
       be expanded to the home directory. Wildcards will not be expanded. You
2042
       be expanded to the home directory. Wildcards will be expanded, but
1918
       cannot use OR with dir clauses (this restriction may go away in the
2043
       please have a look at an important limitation of wildcards in path
1919
       future).
2044
       filters.
1920
2045
1921
       Relative paths also make sense, for example, dir:share/doc would match
2046
       Relative paths also make sense, for example, dir:share/doc would match
1922
       either /usr/share/doc or /usr/local/share/doc
2047
       either /usr/share/doc or /usr/local/share/doc
1923
2048
1924
       Several dir clauses can be specified, both positive and negative. For
2049
       Several dir clauses can be specified, both positive and negative. For
...
...
1928
            
2053
            
1929
2054
1930
       This would select results which have both recoll and src in the path
2055
       This would select results which have both recoll and src in the path
1931
       (in any order), and which have neither utils nor common.
2056
       (in any order), and which have neither utils nor common.
1932
2057
2058
       You can also use OR conjunctions with dir: clauses.
2059
1933
       Another special aspect of dir clauses is that the values in the index
2060
       A special aspect of dir clauses is that the values in the index are
1934
       are not transcoded to UTF-8, and never lower-cased or unaccented, but
2061
       not transcoded to UTF-8, and never lower-cased or unaccented, but
1935
       stored as binary. This means that you need to enter the values in the
2062
       stored as binary. This means that you need to enter the values in the
1936
       exact lower or upper case, and that searches for names with diacritics
2063
       exact lower or upper case, and that searches for names with diacritics
1937
       may sometimes be impossible because of character set conversion
2064
       may sometimes be impossible because of character set conversion
1938
       issues. Non-ASCII UNIX file paths are an unending source of trouble
2065
       issues. Non-ASCII UNIX file paths are an unending source of trouble
1939
       and are best avoided.
2066
       and are best avoided.
...
...
1998
   The document filters used while indexing have the possibility to create
2125
   The document filters used while indexing have the possibility to create
1999
   other fields with arbitrary names, and aliases may be defined in the
2126
   other fields with arbitrary names, and aliases may be defined in the
2000
   configuration, so that the exact field search possibilities may be
2127
   configuration, so that the exact field search possibilities may be
2001
   different for you if someone took care of the customisation.
2128
   different for you if someone took care of the customisation.
2002
2129
2003
  3.4.1. Modifiers
2130
  3.5.1. Modifiers
2004
2131
2005
   Some characters are recognized as search modifiers when found immediately
2132
   Some characters are recognized as search modifiers when found immediately
2006
   after the closing double quote of a phrase, as in "some
2133
   after the closing double quote of a phrase, as in "some
2007
   term"modifierchars. The actual "phrase" can be a single term of course.
2134
   term"modifierchars. The actual "phrase" can be a single term of course.
2008
   Supported modifiers:
2135
   Supported modifiers:
...
...
2023
     o D will turn on diacritics sensitivity (if the index supports it).
2150
     o D will turn on diacritics sensitivity (if the index supports it).
2024
2151
2025
     o A weight can be specified for a query element by specifying a decimal
2152
     o A weight can be specified for a query element by specifying a decimal
2026
       value at the start of the modifiers. Example: "Important"2.5.
2153
       value at the start of the modifiers. Example: "Important"2.5.
2027
2154
3.6. Search case and diacritics sensitivity

   For Recoll versions 1.18 and later, and when working with a raw index (not
   the default), searches can be made sensitive to character case and
   diacritics. How this happens is controlled by configuration variables and
   what search data is entered.
...
...
   will search for the term resume exactly (resume will not be a match).

   When either case or diacritics sensitivity is activated, stem expansion is
   turned off. Having both does not make much sense.

3.7. Anchored searches and wildcards

   Some special characters are interpreted by Recoll in search strings to
   expand or specialize the search. Wildcards expand a root term in
   controlled ways. Anchor characters can restrict a search to succeed only
   if the match is found at or near the beginning of the document or one of
   its fields.

  3.7.1. More about wildcards

   All words entered in Recoll search fields will be processed for wildcard
   expansion before the request is finally executed.

   The wildcard characters are:
...
...

     o [] which allow defining sets of characters to be matched (ex: [abc]
       matches a single character which may be 'a' or 'b' or 'c', [0-9]
       matches any single digit).

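   The wildcard syntax is close to shell globbing, so Python's standard
   fnmatch module can be used to get a feel for how a pattern expands
   against a list of terms. This is only an analogy run on a made-up term
   list: it is not how Recoll performs the expansion, and the expand()
   helper is hypothetical.

```python
from fnmatch import fnmatchcase

# Made-up stand-in for an index term list.
terms = ["flood", "floor", "flour", "floor9", "Floor"]


def expand(pattern, term_list):
    """Return the terms matching a glob pattern, case-sensitively."""
    return [term for term in term_list if fnmatchcase(term, pattern)]


print(expand("floo[dr]", terms))    # one character out of a set
print(expand("floor[0-9]", terms))  # one digit
print(expand("flo*", terms))        # trailing *, usually the widest match
```

   The trailing * pattern matches most of the list, which shows why an
   ending wildcard can return more results than expected.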
   You should be aware of a few things when using wildcards.

     o Using a wildcard character at the beginning of a word can make for a
       slow search because Recoll will have to scan the whole index term list
       to find the matches. However, this is much less of a problem for field
       searches, and queries like author:*@domain.com can sometimes be very
       useful.

     o For Recoll version 1.18 only, when working with a raw index
       (preserving character case and diacritics), the literal part of a
       wildcard expression will be matched exactly for case and diacritics.
       This is not true any more for versions 1.19 and later.

     o Using a * at the end of a word can produce more matches than you would
       think, and strange search results. You can use the term explorer tool
       to check what completions exist for a given term. You can also see
       exactly what search was performed by clicking on the link at the top
       of the result list. In general, for natural language terms, stem
       expansion will produce better results than an ending * (stem expansion
       is turned off when any wildcard character appears in the term).

    3.7.1.1. Wildcards and path filtering

   Due to the way that Recoll processes wildcards inside dir path filtering
   clauses, they will have a multiplicative effect on the query size. A
   clause containing wildcards in several path elements, like, for example,
   dir:/home/me/*/*/docdir, will almost certainly fail if your indexed tree
   is of any realistic size.

   Depending on the case, you may be able to work around the issue by
   specifying the path elements more narrowly, with a constant prefix, or by
   using 2 separate dir: clauses instead of multiple wildcards, as in
   dir:/home/me dir:docdir. The latter query is not equivalent to the initial
   one because it does not specify a number of directory levels, but that's
   the best we can do (and it may be actually more useful in some cases).

  3.7.2. Anchored searches

   Two characters are used to specify that a search hit should occur at the
   beginning or at the end of the text. ^ at the beginning of a term or
   phrase constrains the search to happen at the start, $ at the end forces it
   to happen at the end.
...
...
   structured documents like scientific articles, in case explicit metadata
   has not been supplied (a most frequent case), for example to look for
   matches inside the abstract or the list of authors (which occur at the top
   of the document).

3.8. Desktop integration

   Being independent of the desktop type has its drawbacks: Recoll desktop
   integration is minimal. However there are a few tools available:

     o The KDE KIO Slave was described in a previous section.
...
...

     o There is also an independently developed Krunner plugin.

   Here follow a few other things that may help.

  3.8.1. Hotkeying recoll

   It is surprisingly convenient to be able to show or hide the Recoll GUI
   with a single keystroke. Recoll comes with a small Python script, based on
   the libwnck window manager interface library, which will allow you to do
   just this. The detailed instructions are on this wiki page.

  3.8.2. The KDE Kicker Recoll applet

   This is probably obsolete now. Anyway:

   The Recoll source tree contains the source code to the recoll_applet, a
   small application derived from the find_applet. This can be used to add a
...
...

  4.1.4. Filter HTML output

   The output HTML could be very minimal like the following example:

 <html>
   <head>
     <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
   </head>
   <body>
    Some text content
   </body>
 </html>

   You should take care to escape some characters inside the text by
   transforming them into appropriate entities. At the very minimum, "&"
   should be transformed into "&amp;", "<" should be transformed into "&lt;".
   This is not always properly done by translating programs which output
   HTML, and of course never by those which output plain text.

   When encapsulating plain text in an HTML body, the display of a preview
   may be improved by enclosing the text inside <pre> tags.

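   A filter written in Python could handle the escaping and the <pre>
   wrapping along the following lines. This is a minimal sketch, not code
   taken from the Recoll filters: the function name text_to_filter_html is
   hypothetical, and it relies only on html.escape() from the Python
   standard library.

```python
import html


def text_to_filter_html(text, charset="UTF-8"):
    """Wrap plain text in a minimal HTML document for the indexer."""
    # Escape "&", "<" and ">" so that the text cannot be read as markup.
    body = html.escape(text, quote=False)
    return (
        "<html>\n"
        "  <head>\n"
        '    <meta http-equiv="Content-Type" '
        'content="text/html;charset=%s">\n'
        "  </head>\n"
        "  <body>\n"
        "<pre>%s</pre>\n"
        "  </body>\n"
        "</html>\n"
    ) % (charset, body)


print(text_to_filter_html("R&D: a < b"))
```

   The charset value must describe the actual encoding of the bytes that the
   filter writes, as explained below.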
   The character set needs to be specified in the header. It does not need to
   be UTF-8 (Recoll will take care of translating it), but it must be
   accurate for good results.

   Recoll will process meta tags inside the header as candidate document
   fields. Document fields can be processed by the indexer in different
   ways, for searching or displaying inside query results. This is
   described in a following section.

   By default, the indexer will process the standard header fields if they
   are present: title, meta/description, and meta/keywords are both indexed
   and stored for query-time display.

   A predefined non-standard meta tag will also be processed by Recoll
   without further configuration: if a date tag is present and has the right
   format, it will be used as the document date (for display and sorting), in
   preference to the file modification date. The date format should be as
   follows:

 <meta name="date" content="YYYY-mm-dd HH:MM:SS">
 or
 <meta name="date" content="YYYY-mm-ddTHH:MM:SS">

   Example:

 <meta name="date" content="2013-02-24 17:50:00">

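   In a Python filter, the date tag can be produced with strftime(). The
   sketch below uses the first of the two formats shown above. The function
   name date_meta_tag is hypothetical.

```python
from datetime import datetime


def date_meta_tag(when):
    """Format a datetime as the date meta tag described above."""
    # %Y-%m-%d %H:%M:%S corresponds to the YYYY-mm-dd HH:MM:SS format.
    return '<meta name="date" content="%s">' % when.strftime(
        "%Y-%m-%d %H:%M:%S")


print(date_meta_tag(datetime(2013, 2, 24, 17, 50, 0)))
```

   With the value used here, the output is the same tag as in the example
   above.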
   Filters also have the possibility to "invent" field names. These should
   also be output as meta tags:

 <meta name="somefield" content="Some textual data" />

   You can embed HTML markup inside the content of custom fields, for
   improving the display inside result lists. In this case, add a (wildly
   non-standard) markup attribute to tell Recoll that the value is HTML and
   should not be escaped for display.

 <meta name="somefield" markup="html" content="Some <i>textual</i> data" />

   As written above, the processing of fields is described in a further
   section.

  4.1.5. Page numbers

   The indexer will interpret ^L characters in the filter output as
   indicating page breaks, and will record them. At query time, this allows
...
...
   Fields are named pieces of information in or about documents, like title,
   author, abstract.

   The field values for documents can appear in several ways during indexing:
   either output by filters as meta fields in the HTML header section, or
   extracted from file extended attributes, or added as attributes of the Doc
   object when using the API, or again synthesized internally by Recoll.

   The Recoll query language allows searching for text in a specific field.

   Recoll defines a number of default fields. Additional ones can be output
   by filters, and described in the fields configuration file.
...
...
    4.3.2.1. Introduction

   Recoll versions after 1.11 define a Python programming interface, both for
   searching and indexing.

   The API is inspired by the Python database API specification, version 1.0
   for Recoll versions up to 1.18, version 2.0 for Recoll versions 1.19 and
   later. The package structure changed with Recoll 1.19 too. We will mostly
   describe the new API and package structure here. A paragraph at the end of
   this section will explain a few differences and ways to write code
   compatible with both versions.

   The Python interface can be found in the source package, under
   python/recoll.

   The python/recoll/ directory contains the usual setup.py. After
   configuring the main Recoll code, you can use the script to build and
   install the Python module:

   cd recoll-xxx/python/recoll
   python setup.py build
   python setup.py install

    4.3.2.2. Recoll package

   The recoll package contains two modules:

     o The recoll module contains functions and classes used to query (or
       update) the index.

     o The rclextract module contains functions and classes used to access
       document data.

    4.3.2.3. The recoll module

      Functions

   connect(confdir=None, extra_dbs=None, writable = False)
           The connect() function connects to one or several Recoll indexes
           and returns a Db object.
              o confdir may specify a configuration directory. The usual
                defaults apply.
              o extra_dbs is a list of additional indexes (Xapian
                directories).
              o writable decides if we can index new data through this
                connection.
           This call initializes the recoll module, and it should always be
           performed before any other call or object creation.

      Classes

        The Db class

   A Db object is created by a connect() function and holds a connection to a
   Recoll index.

   Methods

   Db.close()
           Closes the connection. You can't do anything with the Db object
           after this.

   Db.query(), Db.cursor()
           These aliases return a blank Query object for this index.

   Db.setAbstractParams(maxchars, contextwords)
           Set the parameters used to build snippets.

        The Query class

   A Query object (equivalent to a cursor in the Python DB API) is created by
   a Db.query() call. It is used to execute index searches.

   Methods

   Query.sortby(fieldname, ascending=True)
           Sort results by fieldname, in ascending or descending order. Must
           be called before executing the search.

   Query.execute(query_string, stemming=1, stemlang="english")
           Starts a search for query_string, a Recoll search language string.

   Query.executesd(SearchData)
           Starts a search for the query defined by the SearchData object.

   Query.fetchmany(size=query.arraysize)
           Fetches the next Doc objects in the current search results, and
           returns them as an array of the required size, which is by default
           the value of the arraysize data member.

   Query.fetchone()
           Fetches the next Doc object from the current search results.

   Query.close()
           Closes the connection. The object is unusable after the call.

   Query.scroll(value, mode='relative')
           Adjusts the position in the current result set. mode can be
           relative or absolute.

   Query.getgroups()
           Retrieves the expanded query terms as a list of pairs. Meaningful
           only after execute() or executesd(). In each pair, the first entry
           is a list of user terms, the second a list of query terms as
           derived from the user terms and used in the Xapian Query. The size
           of each list is one for simple terms, or more for group and phrase
           clauses.

   Query.getxquery()
           Return the Xapian query description as a Unicode string.
           Meaningful only after execute() or executesd().

   Query.highlight(text, ishtml = 0, methods = object)
           Will insert <span class="rclmatch">, </span> tags around the match
           areas in the input text and return the modified text. ishtml can
           be set to indicate that the input text is HTML and that HTML
           special characters should not be escaped. methods if set should be
           an object with methods startMatch(i) and endMatch() which will be
           called for each match and should return a begin and end tag.

   Query.makedocabstract(doc, methods = object)
           Create a snippets abstract for doc (a Doc object) by selecting
           text around the match terms. If methods is set, will also perform
           highlighting. See the highlight method.

   Query.__iter__() and Query.next()
           So that things like for doc in query: will work.

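   The methods parameter of Query.highlight() and Query.makedocabstract()
   can be any object with the two methods named above. The class below is a
   minimal sketch of such an object. The class name and the tags it returns
   are made up for illustration, and the meaning given to the startMatch()
   argument (an index identifying the match) is an assumption.

```python
class TagHighlighter(object):
    """Hypothetical object for the methods parameter."""

    def startMatch(self, idx):
        # Called at the start of each match; returns the begin tag.
        # idx is assumed to be an index identifying the match.
        return '<span class="hit" id="match%d">' % idx

    def endMatch(self):
        # Called at the end of each match; returns the end tag.
        return "</span>"


# Intended use, not executed here because it needs a Recoll index:
#   marked = query.highlight(text, methods=TagHighlighter())
highlighter = TagHighlighter()
print(highlighter.startMatch(0) + "term" + highlighter.endMatch())
```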
   Data descriptors

   Query.arraysize
           Default number of records processed by fetchmany (r/w).

   Query.rowcount
           Number of records returned by the last execute.

   Query.rownumber
           Next index to be fetched from results. Normally increments after
           each fetchone() call, but can be set/reset before the call to
           effect seeking. Starts at 0.

        The Doc class

   A Doc object contains index data for a given document. The data is
   extracted from the index when searching, or set by the indexer program
   when updating. The Doc object has many attributes to be read or set by its
   user. It matches exactly the Rcl::Doc C++ object. Some of the attributes
   are predefined, but, especially when indexing, others can be set, the name
   of which will be processed as field names by the indexing configuration.
   Inputs can be specified as Unicode or strings. Outputs are Unicode
   objects. All dates are specified as Unix timestamps, printed as strings.
   Please refer to the rcldb/rcldoc.h C++ file for a description of the
   predefined attributes.

   At query time, only the fields that are defined as stored either by
   default or in the fields configuration file will be meaningful in the Doc
   object. Especially this will not be the case for the document text. See
   the rclextract module for accessing document contents.

   Methods

   get(key), [] operator
           Retrieve the named doc attribute

   getbinurl()
           Retrieve the URL in byte array format (no transcoding), for use as
           parameter to a system call.

   items()
           Return a dictionary of doc object keys/values

   keys()
           list of doc object keys (attribute names).

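   Because dates come back as Unix timestamps printed as strings, a date
   attribute read from a Doc object has to be converted before it can be
   displayed or compared. The sketch below shows one way to do it. The
   function name parse_doc_time is hypothetical, and it assumes that the
   string holds an integer number of seconds.

```python
from datetime import datetime, timezone


def parse_doc_time(value):
    """Convert a Unix timestamp printed as a string to a UTC datetime."""
    # Assumption: the string is an integer count of seconds since the
    # epoch, for example "1361728200".
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


print(parse_doc_time("1361728200").strftime("%Y-%m-%d %H:%M:%S"))
```

   Using an explicit UTC timezone keeps the result independent of the
   local timezone of the machine running the script.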
        The SearchData class

   A SearchData object allows building a query by combining clauses, for
   execution by Query.executesd(). It can be used instead of the query
   language approach. The interface is going to change a little, so no
   detailed doc for now...

   Methods

   addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub', qstring=string,
   slack=0, field='', stemming=1, subSearch=SearchData)

    4.3.2.4. The rclextract module

   Document content is not provided by an index query. To access it, the data
   extraction part of the indexing process must be performed (subdocument
   access and format translation). This is not trivial in general. The
   rclextract module currently provides a single class which can be used to
   access the data content for result documents.

      Classes

        The Extractor class

   Methods

   Extractor(doc)
           An Extractor object is built from a Doc object, output from a
           query.

   Extractor.textextract(ipath)
           Extract document defined by ipath and return a Doc object. The
           doc.text field has the document text as either text/plain or
           text/html according to doc.mimetype.

   Extractor.idoctofile()
           Extracts document into an output file, which can be given
           explicitly or will be created as a temporary file to be deleted by
           the caller.

2721
    4.3.2.3. Example code
2904
    4.3.2.5. Example code
2722
2905
2723
   The following sample would query the index with a user language string.
2906
   The following sample would query the index with a user language string.
2724
   See the python/samples directory inside the Recoll source for other
2907
   See the python/samples directory inside the Recoll source for other
2725
   examples.
2908
   examples. The recollgui subdirectory has a very embryonic GUI which
2909
   demonstrates the highlighting and data extraction functions.
2726
2910
2727
 #!/usr/bin/env python
2911
 #!/usr/bin/env python
2728
2912
2729
 import recoll
2913
 from recoll import recoll
2730
2914
2731
 db = recoll.connect()
2915
 db = recoll.connect()
2732
 db.setAbstractParams(maxchars=80, contextwords=2)
2916
 db.setAbstractParams(maxchars=80, contextwords=4)
2733
2917
2734
 query = db.query()
2918
 query = db.query()
2735
 nres = query.execute("some user question")
2919
 nres = query.execute("some user question")
2736
 print "Result count: ", nres
2920
 print "Result count: ", nres
2737
 if nres > 5:
2921
 if nres > 5:
2738
     nres = 5
2922
     nres = 5
2739
 while query.next >= 0 and query.next < nres:
2923
 for i in range(nres):
2740
     doc = query.fetchone()
2924
     doc = query.fetchone()
2741
     print query.next
2925
     print "Result #%d" % (query.rownumber,)
2742
     for k in ("title", "size"):
2926
     for k in ("title", "size"):
2743
         print k, ":", getattr(doc, k).encode('utf-8')
2927
         print k, ":", getattr(doc, k).encode('utf-8')
2744
     abs = db.makeDocAbstract(doc, query).encode('utf-8')
2928
     abs = db.makeDocAbstract(doc, query).encode('utf-8')
2745
     print abs
2929
     print abs
2746
     print
2930
     print
2747
2931
2748
2932
2933
2934
    4.3.2.6. Compatibility with the previous version
2935
2936
   The following code fragments can be used to ensure that code can run with
2937
   both the old and the new API (as long as it does not use the new abilities
2938
   of the new API of course).
2939
2940
   Adapting to the new package structure:
2941
2942
2943
 try:
2944
     from recoll import recoll
2945
     from recoll import rclextract
2946
     hasextract = True
2947
 except ImportError:
2948
     import recoll
2949
     hasextract = False
2950
2951
2952
   Adapting to the change of nature of the next Query member. The same test
2953
   can be used to choose to use the scroll() method (new) or set the next
2954
   value (old).
2955
2956
2957
        rownum = query.next if isinstance(query.next, int) else \
2958
                  query.rownumber
2749
2959
2750
2960
2751
Chapter 5. Installation and configuration
2961
Chapter 5. Installation and configuration
2752
2962
2753
5.1. Installing a binary copy
2963
5.1. Installing a binary copy
...
...
3357
   localfields
3567
   localfields
3358
3568
3359
           This allows setting fields for all documents under a given
3569
           This allows setting fields for all documents under a given
3360
           directory. Typical usage would be to set an "rclaptg" field, to be
3570
           directory. Typical usage would be to set an "rclaptg" field, to be
3361
           used in mimeview to select a specific viewer. If several fields
3571
           used in mimeview to select a specific viewer. If several fields
3362
           are to be set, they should be separated with a colon (':')
3572
           are to be set, they should be separated with a semi-colon (';')
3363
           character (which there is currently no way to escape). Ie:
3573
           character, which there is currently no way to escape. Also note
3364
           localfields= rclaptg=gnus:other = val, then select specific
3574
           the initial semi-colon. Example: localfields= ;rclaptg=gnus;other
3365
           viewer with mimetype|tag=... in mimeview.
3575
           = val, then select specific viewer with mimetype|tag=... in
3576
           mimeview.
3577
3578
   metadatacmds
3579
3580
           This allows executing external commands for each file and storing
3581
           the output in a Recoll field. This could be used for example to
3582
           index external tag data. The value is a list of field names and
3583
           commands; do not forget the initial semi-colon. Example:
3584
3585
 [/some/area/of/the/fs]
3586
 metadatacmds = ; tags = tmsu tags %f; otherfield = somecmd -xx %f
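   The localfields and metadatacmds values shown above share the same
   "; name = value; name = value" layout. The following Python 3 sketch
   only illustrates how such a value could be split into pairs and how a
   command could be prepared for one file. It is not Recoll's own parser:
   the function names are hypothetical, and the assumption that %f stands
   for the name of the file being indexed is inferred from the example.

```python
import shlex


def parse_field_list(value):
    """Split '; name = value; name2 = value2' into a list of (name, value) pairs."""
    pairs = []
    for item in value.split(";"):
        item = item.strip()
        if not item:
            # Skips the empty element found before the initial semi-colon.
            continue
        name, _, val = item.partition("=")
        pairs.append((name.strip(), val.strip()))
    return pairs


def expand_command(command, filename):
    """Replace %f with the shell-quoted file name (assumed meaning of %f)."""
    return command.replace("%f", shlex.quote(filename))


# Values copied from the examples in the text.
localfields = parse_field_list(";rclaptg=gnus;other = val")
metadatacmds = parse_field_list("; tags = tmsu tags %f; otherfield = somecmd -xx %f")

# Prepare (but do not run) the command for a hypothetical file name.
tags_command = expand_command(dict(metadatacmds)["tags"],
                              "/some/area/of/the/fs/my file.txt")
```

   Quoting the file name matters only if the resulting string is handed
   to a shell; whether Recoll does so is not stated here.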
3587
                
3366
3588
3367
    5.4.1.3. Parameters affecting where and how we store things:
3589
    5.4.1.3. Parameters affecting where and how we store things:
3368
3590
3369
   dbdir
3591
   dbdir
3370
3592
...
...
3590
 [mail]
3812
 [mail]
3591
 # Extract the X-My-Tag mail header, and use it internally with the
3813
 # Extract the X-My-Tag mail header, and use it internally with the
3592
 # mailmytag field name
3814
 # mailmytag field name
3593
 x-my-tag = mailmytag
3815
 x-my-tag = mailmytag
3594
3816
3817
    5.4.2.1. Extended attributes in the fields file
3818
3819
   Recoll versions 1.19 and later process user extended file attributes as
3820
   document fields by default.
3821
3822
   Attributes are processed as fields of the same name, after removing the
3823
   user prefix on Linux.
3824
3825
   The [xattrtofields] section of the fields file allows specifying
3826
   translations from extended attribute names to Recoll field names. An
3827
   empty translation disables use of the corresponding attribute data.
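   The rules above can be pictured with a small Python sketch. It is an
   illustration only, not Recoll code: the function name, the sample
   attribute names and the choice to key the translation table on the
   attribute name without its user prefix are all assumptions.

```python
def xattrs_to_fields(xattrs, translations):
    """Map extended attributes to field names following the rules in the text.

    xattrs: dict of attribute name -> value, as read from a file.
    translations: dict of attribute name -> field name, playing the role of
    an [xattrtofields] section. An empty field name disables the attribute.
    """
    fields = {}
    for name, value in xattrs.items():
        # Remove the "user." prefix used on Linux.
        if name.startswith("user."):
            name = name[len("user."):]
        # By default the field has the same name as the attribute.
        target = translations.get(name, name)
        if target == "":
            continue  # empty translation: the attribute data is not used
        fields[target] = value
    return fields


# Hypothetical attributes and translations.
sample_xattrs = {"user.tags": "holiday beach", "user.comment": "draft",
                 "user.secret": "do not index"}
sample_translations = {"tags": "keywords", "secret": ""}
fields = xattrs_to_fields(sample_xattrs, sample_translations)
```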
3828
3595
  5.4.3. The mimemap file
3829
  5.4.3. The mimemap file
3596
3830
3597
   mimemap specifies the file name extension to mime type mappings.
3831
   mimemap specifies the file name extension to mime type mappings.
3598
3832
3599
   For file names without an extension, or with an unknown one, the system's
3833
   For file names without an extension, or with an unknown one, the system's
...
...
3697
   In addition to the predefined values above, all strings like %(fieldname)
3931
   In addition to the predefined values above, all strings like %(fieldname)
3698
   will be replaced by the value of the field named fieldname for the
3932
   will be replaced by the value of the field named fieldname for the
3699
   document. This could be used in combination with field customisation to
3933
   document. This could be used in combination with field customisation to
3700
   help with opening the document.
3934
   help with opening the document.
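   As an illustration of the %(fieldname) replacement described above, a
   minimal Python sketch follows. It is not the Recoll implementation:
   the function name, the pagenum field and the --page option are
   hypothetical, and replacing an unknown field with an empty string is
   an assumption.

```python
import re


def expand_fields(template, doc_fields):
    """Replace every %(fieldname) in template with the document's field value."""
    def replace(match):
        # Assumption: a field missing from the document expands to nothing.
        return str(doc_fields.get(match.group(1), ""))
    return re.sub(r"%\(([^)]+)\)", replace, template)


# Predefined values such as %f are left alone by this sketch.
command = expand_fields("blobviewer --page %(pagenum) %f", {"pagenum": 3})
```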
3701
3935
3936
  5.4.6. The ptrans file
3937
3938
   ptrans specifies query-time path translations. These can be useful in
3939
   multiple cases.
3940
3941
   The file has a section for any index which needs translations, either the
3942
   main one or additional query indexes. The sections are named with the
3943
   Xapian index directory names. No slash character should exist at the end
3944
   of the paths (all comparisons are textual). An example should make things
3945
   sufficiently clear:
3946
3947
           [/home/me/.recoll/xapiandb]
3948
           /this/directory/moved = /to/this/place
3949
3950
           [/path/to/additional/xapiandb]
3951
           /server/volume1/docdir = /net/server/volume1/docdir
3952
           /server/volume2/docdir = /net/server/volume2/docdir
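   To make the effect of such a file concrete, here is a small Python
   sketch which reads text in the layout shown above and applies the
   translations to a path. It is not Recoll code. The function names are
   hypothetical, and treating a translation as a purely textual prefix
   replacement is an assumption based on the remark that all comparisons
   are textual.

```python
def parse_ptrans(text):
    """Return {index_directory: [(old_prefix, new_prefix), ...]}."""
    sections = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            # Sections are named with the Xapian index directory.
            current = sections.setdefault(line[1:-1], [])
        elif "=" in line and current is not None:
            old, new = (part.strip() for part in line.split("=", 1))
            current.append((old, new))
    return sections


def translate(path, rules):
    """Apply the first rule whose old prefix textually starts the path."""
    for old, new in rules:
        if path.startswith(old):
            return new + path[len(old):]
    return path


SAMPLE = """
[/home/me/.recoll/xapiandb]
/this/directory/moved = /to/this/place

[/path/to/additional/xapiandb]
/server/volume1/docdir = /net/server/volume1/docdir
/server/volume2/docdir = /net/server/volume2/docdir
"""

rules = parse_ptrans(SAMPLE)
moved = translate("/this/directory/moved/notes.txt",
                  rules["/home/me/.recoll/xapiandb"])
```

   Because the comparison is textual, a trailing slash on the left-hand
   side would stop the rule from matching a path to the directory itself,
   which is consistent with the advice not to end paths with a slash.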
3953
        
3954
3702
  5.4.6. Examples of configuration adjustments
3955
  5.4.7. Examples of configuration adjustments
3703
3956
3704
    5.4.6.1. Adding an external viewer for a non-indexed type
3957
    5.4.7.1. Adding an external viewer for a non-indexed type
3705
3958
3706
   Imagine that you have some kind of file which does not have indexable
3959
   Imagine that you have some kind of file which does not have indexable
3707
   content, but for which you would like to have a functional Open link in
3960
   content, but for which you would like to have a functional Open link in
3708
   the result list (when found by file name). The file names end in .blob and
3961
   the result list (when found by file name). The file names end in .blob and
3709
   can be displayed by application blobviewer.
3962
   can be displayed by application blobviewer.
...
...
3729
   mime type which it already knows, you would just need to edit mimeview.
3982
   mime type which it already knows, you would just need to edit mimeview.
3730
   The entries you add in your personal file override those in the central
3983
   The entries you add in your personal file override those in the central
3731
   configuration, which you do not need to alter. mimeview can also be
3984
   configuration, which you do not need to alter. mimeview can also be
3732
   modified from the GUI.
3985
   modified from the GUI.
3733
3986
3734
    5.4.6.2. Adding indexing support for a new file type
3987
    5.4.7.2. Adding indexing support for a new file type
3735
3988
3736
   Let us now imagine that the above .blob files actually contain indexable
3989
   Let us now imagine that the above .blob files actually contain indexable
3737
   text and that you know how to extract it with a command line program.
3990
   text and that you know how to extract it with a command line program.
3738
   Getting Recoll to index the files is easy. You need to perform the above
3991
   Getting Recoll to index the files is easy. You need to perform the above
3739
   alteration, and also to add data to the mimeconf file (typically in
3992
   alteration, and also to add data to the mimeconf file (typically in