a/src/README b/src/README
...
...
76
76
77
                3.11. Search tips, shortcuts
77
                3.11. Search tips, shortcuts
78
78
79
                3.12. Customizing the search interface
79
                3.12. Customizing the search interface
80
80
81
   4. Programming interface
82
83
                4.1. Writing a document filter
84
85
                             4.1.1. Filter HTML output
86
87
                4.2. Field data processing configuration
88
89
                4.3. API
90
91
                             4.3.1. Interface elements
92
93
                             4.3.2. Python interface
94
81
   4. Installation
95
   5. Installation
82
96
83
                4.1. Installing a prebuilt copy
97
                5.1. Installing a prebuilt copy
84
98
85
                             4.1.1. Installing through a package system
99
                             5.1.1. Installing through a package system
86
100
87
                             4.1.2. Installing a prebuilt Recoll
101
                             5.1.2. Installing a prebuilt Recoll
88
102
89
                4.2. Supporting packages
103
                5.2. Supporting packages
90
104
91
                4.3. Building from source
105
                5.3. Building from source
92
106
93
                             4.3.1. Prerequisites
107
                             5.3.1. Prerequisites
94
108
95
                             4.3.2. Building
109
                             5.3.2. Building
96
110
97
                             4.3.3. Installation
111
                             5.3.3. Installation
98
112
99
                4.4. Configuration overview
113
                5.4. Configuration overview
100
114
101
                             4.4.1. Main configuration file
115
                             5.4.1. Main configuration file
102
116
103
                             4.4.2. The mimemap file
117
                             5.4.2. The mimemap file
104
118
105
                             4.4.3. The mimeconf file
119
                             5.4.3. The mimeconf file
106
120
107
                             4.4.4. The mimeview file
121
                             5.4.4. The mimeview file
108
122
109
                             4.4.5. Examples of configuration adjustments
123
                             5.4.5. Examples of configuration adjustments
110
124
111
                4.5. The KDE Kicker Recoll applet
125
                5.5. The KDE Kicker Recoll applet
112
113
                4.6. Extending Recoll
114
115
                             4.6.1. Writing a document filter
116
126
117
     ----------------------------------------------------------------------
127
     ----------------------------------------------------------------------
118
128
119
                            Chapter 1. Introduction
129
                            Chapter 1. Introduction
120
130
...
...
254
   files Most file types, like HTML or word processing files, only hold one
264
   files Most file types, like HTML or word processing files, only hold one
255
   document. Some file types, like mail folder files can hold many
265
   document. Some file types, like mail folder files can hold many
256
   individually indexed documents.
266
   individually indexed documents.
257
267
258
   Recoll indexing processes plain text, HTML, openoffice and e-mail files
268
   Recoll indexing processes plain text, HTML, openoffice and e-mail files
269
   internally.
270
259
   internally. Other types (ie: postscript, pdf, ms-word, rtf) need external
271
   Other file types (ie: postscript, pdf, ms-word, rtf ...) need external
260
   applications for preprocessing. The list is in the installation section.
272
   applications for preprocessing. The list is in the installation section.
273
   After every indexing operation, Recoll updates a list of commands that
274
   would be needed for indexing existing file types. This list can be
275
   displayed from the recoll File menu. It is stored in the missing text file
276
   inside the configuration directory.
261
277
262
   Without further configuration, Recoll will index all appropriate files
278
   Without further configuration, Recoll will index all appropriate files
263
   from your home directory, with a reasonable set of defaults.
279
   from your home directory, with a reasonable set of defaults.
264
280
265
   In some cases, it may be interesting to index different areas of the file
281
   In some cases, it may be interesting to index different areas of the file
...
...
715
3.4. The query language
731
3.4. The query language
716
732
717
   The query language processor is activated on the simple search entry when
733
   The query language processor is activated on the simple search entry when
718
   the search mode selector is set to Query Language.
734
   the search mode selector is set to Query Language.
719
735
736
   The language is roughly based on the Xesam user search language
737
   specification.
738
720
   Here follows a sample request that we are going to explain:
739
   Here follows a sample request that we are going to explain:
721
740
722
           author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
741
           author:"john doe" Beatles OR Lennon Live OR Unplugged -potatoes
723
     
742
     
724
743
725
   This would search for all documents with John Doe appearing as a phrase in
744
   This would search for all documents with John Doe appearing as a phrase in
726
   the author field (exactly what this is would depend on the document type,
745
   the author field (exactly what this is would depend on the document type,
727
   ie: the From: header, for an email message), and containing either beatles
746
   ie: the From: header, for an email message), and containing either beatles
728
   or lennon and either live or unplugged but not potatoes (in any part of
747
   or lennon and either live or unplugged but not potatoes (in any part of
729
   the document).
748
   the document).
749
750
   An element is composed of an optional field specification, and a value,
751
   separated by a colon. Example: Beatles, author:balzac, dc:title:grandet
752
753
   The colon, if present, means "contains". Xesam defines other relations,
754
   which are not supported for now.
730
755
731
   All elements in the search entry are normally combined with an implicit
756
   All elements in the search entry are normally combined with an implicit
732
   AND. It is possible to specify that elements be OR'ed instead, as in
757
   AND. It is possible to specify that elements be OR'ed instead, as in
733
   Beatles OR Lennon. The OR must be entered literally (capitals), and it has
758
   Beatles OR Lennon. The OR must be entered literally (capitals), and it has
734
   priority over the AND associations: word1 word2 OR word3 means word1 AND
759
   priority over the AND associations: word1 word2 OR word3 means word1 AND
735
   (word2 OR word3) not (word1 AND word2) OR word3. Do not enter explicit
760
   (word2 OR word3) not (word1 AND word2) OR word3. Do not enter explicit
736
   parentheses; they are not supported for now.
761
   parentheses; they are not supported for now.
737
762
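   The precedence rule above can be illustrated with a small sketch. It is
   not Recoll's query parser: the function name group_elements is
   hypothetical, and quoted phrases, field prefixes and negation are ignored
   to keep it short. It only shows how OR binds more tightly than the
   implicit AND.

```python
def group_elements(query):
    """Group whitespace-separated elements: OR binds tighter than implicit AND.

    Returns a list of OR-groups; the groups themselves are AND'ed together.
    """
    groups = []
    join_next = False  # True right after an OR token
    for token in query.split():
        if token == "OR":  # only the upper-case form is an operator
            join_next = True
            continue
        if join_next and groups:
            groups[-1].append(token)  # extend the current OR-group
        else:
            groups.append([token])  # start a new AND'ed group
        join_next = False
    return groups


print(group_elements("word1 word2 OR word3"))
# -> [['word1'], ['word2', 'word3']], i.e. word1 AND (word2 OR word3)
```

   A lower-case "or" is treated as an ordinary search term, which matches
   the requirement that OR be entered in capitals.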
738
   An entry preceded by a - specifies a term that should not appear.
763
   An element preceded by a - specifies a term that should not appear. Pure
764
   negative queries are forbidden.
739
765
740
   The first element in the above example, author:"john doe" is a phrase
766
   As usual, words inside quotes define a phrase (the order of words is
741
   search limited to a specific field. Phrase searches are specified as usual
767
   significant), so that title:"prejudice pride" is not the same as
742
   by enclosing the words in double quotes. The field specification appears
768
   title:prejudice title:pride, and is unlikely to find a result.
743
   before the colon (of course this is not limited to phrases, author:Balzac
769
744
   would be ok too). Recoll currently manages the following fields:
770
   Recoll currently manages the following default fields:
745
771
746
     * title, subject or caption are synonyms which specify data to be
772
     * title, subject or caption are synonyms which specify data to be
747
       searched for in the document title or subject.
773
       searched for in the document title or subject.
748
774
749
     * author or from for searching the document's originators.
775
     * author or from for searching the document's originators.
750
776
777
     * recipient or to for searching the document's recipients.
778
751
     * keyword for searching the document specified keywords (few documents
779
     * keyword for searching the document-specified keywords (few documents
752
       actually have any).
780
       actually have any).
753
781
754
   As of release 1.9, the filters have the possibility to create other fields
782
     * filename for the document's file name.
755
   with arbitrary names. No standard filters use this possibility yet.
756
783
757
   There are two other elements which may be specified through the field
758
   syntax, but are somewhat special:
759
760
     * ext for specifying the file name extension (Ex: ext:html)
784
     * ext specifies the file name extension (Ex: ext:html)
761
785
762
     * dir for specifying the file location (Ex: dir:/home/me/somedir).
786
   The field syntax also supports a few field-like, but special, criteria:
763
       Please note that this is quite inefficient, that it may produce very
787
764
       slow searches, and that it may be worth in some cases to set up
788
     * dir for filtering the results on file location (Ex:
789
       dir:/home/me/somedir). Please note that this is quite inefficient,
790
       that it may produce very slow searches, and that it may be worth in
765
       separate databases instead.
791
       some cases to set up separate databases instead.
766
792
767
     * mime for specifying the mime type. This one is quite special because
793
     * mime or format for specifying the mime type. This one is quite special
768
       you can specify several values which will be OR'ed (the normal default
794
       because you can specify several values which will be OR'ed (the normal
769
       for the language is AND). Ex: mime:text/plain mime:text/html.
795
       default for the language is AND). Ex: mime:text/plain mime:text/html.
770
       Specifying an explicit boolean operator or negation (-) before a mime
796
       Specifying an explicit boolean operator or negation (-) before a mime
771
       specification is not supported and will produce strange results.
797
       specification is not supported and will produce strange results.
772
798
799
     * type or rclcat for specifying the category (as in
800
       text/media/presentation/etc.). The classification of mime types in
801
       categories is defined in the Recoll configuration (mimeconf), and can
802
       be modified or extended. The default category names are those which
803
       permit filtering results in the main GUI screen. Categories are OR'ed
804
       like mime types above.
805
806
   The document filters used while indexing have the possibility to create
807
   other fields with arbitrary names, and aliases may be defined in the
808
   configuration, so that the exact field search possibilities may be
809
   different for you if someone took care of the customisation.
810
773
   The query language is currently the only way to use the Recoll field
811
   The query language is currently the only way to use the Recoll field
774
   search capability.
812
   search capability.
775
813
776
   Words inside phrases and capitalized words are not stem-expanded.
814
   Words inside phrases and capitalized words are not stem-expanded.
777
   Wildcards may be used anywhere inside a term. Specifying a wild-card on
815
   Wildcards may be used anywhere inside a term. Specifying a wild-card on
778
   the left of a term can produce a very slow search.
816
   the left of a term can produce a very slow search (or even an incorrect
817
   one if the expansion is truncated because of excessive size).
779
818
780
   You can use the show query link at the top of the result list to check the
819
   You can use the show query link at the top of the result list to check the
781
   exact query which was finally executed by Xapian.
820
   exact query which was finally executed by Xapian.
821
822
   Most Xesam phrase modifiers are unsupported, except for l (small ell) to
823
   disable stemming, and p to turn a phrase into a NEAR (unordered) search.
824
   Example: "prejudice pride"p
782
825
783
     ----------------------------------------------------------------------
826
     ----------------------------------------------------------------------
784
827
785
3.5. Complex/advanced search
828
3.5. Complex/advanced search
786
829
...
...
1192
   you can choose which ones you want to use at any moment by checking or
1235
   you can choose which ones you want to use at any moment by checking or
1193
   unchecking their entries.
1236
   unchecking their entries.
1194
1237
1195
   Your main database (the one the current configuration indexes to), is
1238
   Your main database (the one the current configuration indexes to), is
1196
   always implicitly active. If this is not desirable, you can set up your
1239
   always implicitly active. If this is not desirable, you can set up your
1197
   configuration so that it indexes, for example, an empty directory.
1240
   configuration so that it indexes, for example, an empty directory. An
1241
   alternative indexer may also need to implement a way of purging the index
1242
   from stale data.
1198
1243
1244
     ----------------------------------------------------------------------
1245
1246
                        Chapter 4. Programming interface
1247
1248
   Recoll has an Application Programming Interface, usable both for indexing
1249
   and searching, currently accessible from the Python language.
1250
1251
   Another less radical way to extend the application is to write filters for
1252
   new types of documents.
1253
1254
   The processing of metadata attributes for documents (fields) is highly
1255
   configurable.
1256
1257
     ----------------------------------------------------------------------
1258
1259
4.1. Writing a document filter
1260
1261
   Recoll filters are executable programs which translate from a specific
1262
   format (ie: openoffice, acrobat, etc.) to the Recoll indexing input
1263
   format, which may be text/plain or text/html.
1264
1265
   Recoll filters are usually shell-scripts, but this is in no way necessary.
1266
   These programs are extremely simple and most of the difficulty lies in
1267
   extracting the text from the native format, not outputting what is
1268
   expected by Recoll. Happily enough, most document formats already have
1269
   translators or text extractors which handle the difficult part and can be
1270
   called from the filter. In some cases the output of the translating program
1271
   is appropriate, and no intermediate shell-script is needed.
1272
1273
   Filters are called with a single argument which is the source file name.
1274
   They should output the result to stdout.
1275
1276
   The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
1277
   the filter if the operation is for indexing or previewing. Some filters
1278
   use this to output a slightly different format. This is not essential.
1279
1280
   The association of file types to filters is performed in the mimeconf
1281
   file. A sample:
1282
1283
 
[index]
1284
 application/msword = exec antiword -t -i 1 -m UTF-8;\
1285
      mimetype=text/plain;charset=utf-8
1286
1287
 application/ogg = exec rclogg
1288
1289
 text/rtf = exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html
1290
1291
   The fragment specifies that:
1292
1293
     * application/msword files are processed by executing the antiword
1294
       program, which outputs text/plain encoded in utf-8.
1295
1296
     * application/ogg files are processed by the rclogg script, with default
1297
       output type (text/html, with encoding specified in the header, or
1298
       utf-8 by default).
1299
1300
     * text/rtf is processed by unrtf, which outputs text/html. The
1301
       iso-8859-1 encoding is specified because it is not the utf-8 default,
1302
       and not output by unrtf in the HTML header section.
1303
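   To make the structure of these entries concrete, here is a minimal
   sketch that splits one mimeconf-style value into the command to run and
   its attributes. It is an illustration only and not the parser used by
   Recoll; the function name parse_filter_spec is hypothetical, and
   backslash line continuations are assumed to have been joined already.

```python
def parse_filter_spec(value):
    """Split a mimeconf-style value into (command argv, attributes)."""
    parts = [p.strip() for p in value.split(";") if p.strip()]
    if not parts:
        return [], {}
    command = parts[0].split()
    if command and command[0] == "exec":
        command = command[1:]  # drop the leading "exec" keyword
    attrs = {}
    for part in parts[1:]:
        name, _, val = part.partition("=")
        attrs[name.strip()] = val.strip()
    return command, attrs


spec = "exec unrtf --nopict --html; charset=iso-8859-1; mimetype=text/html"
command, attrs = parse_filter_spec(spec)
print(command)  # -> ['unrtf', '--nopict', '--html']
print(attrs["mimetype"], attrs["charset"])  # -> text/html iso-8859-1
```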
1304
   The easiest way to write a new filter is probably to start from an
1305
   existing one.
1306
1307
   Filters which output text/plain text are generally simpler, but they
1308
   cannot specify the character set and other metadata, so they are limited
1309
   to cases where these elements are not needed.
1310
1311
     ----------------------------------------------------------------------
1312
1313
  4.1.1. Filter HTML output
1314
1315
   The output HTML could be very minimal like the following example:
1316
1317
 <html><head>
1318
 <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
1319
 </head>
1320
 <body>some text content</body></html>
1321
         
1322
1323
   You should take care to escape some characters inside the text by
1324
   transforming them into appropriate entities. "&" should be transformed
1325
   into "&amp;", "<" should be transformed into "&lt;". This is not always
1326
   properly done by translating programs which output HTML, and of course
1327
   never by those which output plain text.
1328
1329
   The character set needs to be specified in the header. It does not need to
1330
   be UTF-8 (Recoll will take care of translating it), but it must be
1331
   accurate for good results.
1332
1333
   Recoll will also make use of other header fields if they are present:
1334
   title, description, keywords.
1335
1336
   Filters also have the possibility to "invent" field names. These should be
1337
   output as meta tags:
1338
1339
 <meta name="somefield" content="Some textual data" />
1340
1341
   See the following section for details about configuring how field data is
1342
   processed by the indexer.
1343
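   As a rough sketch of the points in this section, the following Python
   filter wraps a text file in the minimal HTML described above. It is not
   one of the filters shipped with Recoll. The helper names escape_text and
   to_filter_html are hypothetical, the input file is assumed to be UTF-8
   text, and the way RECOLL_FILTER_FORPREVIEW is used here (skipping the
   extra field when previewing) is only one possible choice.

```python
#!/usr/bin/env python
import os
import sys


def escape_text(text):
    """Escape the characters named above; '&' must be replaced first."""
    return text.replace("&", "&amp;").replace("<", "&lt;")


def to_filter_html(text, title="", fields=None):
    """Wrap already-decoded text in minimal HTML with a charset header."""
    head = ['<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">']
    if title:
        head.append("<title>%s</title>" % escape_text(title))
    for name, value in sorted((fields or {}).items()):
        # Inside an attribute value the double quote must be escaped too.
        content = escape_text(value).replace('"', "&quot;")
        head.append('<meta name="%s" content="%s" />' % (name, content))
    return "<html><head>\n%s\n</head>\n<body>%s</body></html>\n" % (
        "\n".join(head), escape_text(text))


def main(argv):
    # Filters receive a single argument: the source file name.
    with open(argv[1], "rb") as f:
        text = f.read().decode("utf-8", "replace")
    for_preview = os.environ.get("RECOLL_FILTER_FORPREVIEW") == "yes"
    fields = {} if for_preview else {"somefield": "Some textual data"}
    html = to_filter_html(text, fields=fields)
    # Write bytes so that the output really is UTF-8, as declared in the header.
    out = getattr(sys.stdout, "buffer", sys.stdout)
    out.write(html.encode("utf-8"))


if __name__ == "__main__" and len(sys.argv) == 2:
    main(sys.argv)
```

   Replacing "&" before "<" matters: doing it the other way round would turn
   the "&" of a freshly produced "&lt;" into "&amp;".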
1344
     ----------------------------------------------------------------------
1345
1346
4.2. Field data processing configuration
1347
1348
   Fields are named pieces of information in or about documents, like title,
1349
   author, abstract.
1350
1351
   The field values for documents can appear in several ways during indexing:
1352
   either output by filters as meta fields in the HTML header section, or
1353
   added as attributes of the Doc object when using the API, or again
1354
   synthesized internally by Recoll.
1355
1356
   The Recoll query language allows searching for text in a specific field.
1357
1358
   Recoll defines a number of default fields. Additional ones can be output
1359
   by filters, and described in the fields configuration file.
1360
1361
   Fields can be:
1362
1363
     * indexed, meaning that their terms are separately stored in inverted
1364
       lists (with a specific prefix), and that a field-specific search is
1365
       possible.
1366
1367
     * stored, meaning that their value is recorded in the index data record
1368
       for the document, and can be returned and displayed with search
1369
       results.
1370
1371
   A field can be either or both indexed and stored.
1372
1373
   A field becomes indexed by having a prefix defined in the [prefixes]
1374
   section of the fields file. See the comments in that file for details.
1375
1376
   A field becomes stored by appearing in the [stored] section of the fields
1377
   file.
1378
1379
     ----------------------------------------------------------------------
1380
1381
4.3. API
1382
1383
  4.3.1. Interface elements
1384
1385
   A few elements in the interface are specific and need an explanation.
1386
1387
   udi
1388
1389
           A udi (unique document identifier) identifies a document. Because
1390
           of limitations inside the index engine, it is restricted in length
1391
           (to 200 bytes), which is why a regular URI cannot be used. The
1392
           structure and contents of the udi are defined by the application
1393
           and opaque to the index engine. For example, the internal file
1394
           system indexer uses the complete document path (file path +
1395
           internal path), truncated to length, the suppressed part being
1396
           replaced by a hash value.
1397
1398
   ipath
1399
1400
           This data value (set as a field in the Doc object) is stored,
1401
           along with the URL, but not indexed by Recoll. Its contents are
1402
           not interpreted, and its use is up to the application. For
1403
           example, the Recoll internal file system indexer stores the part
1404
           of the document access path internal to the container file (ipath
1405
           in this case is a list of subdocument sequential numbers). url and
1406
           ipath are returned in every search result and permit access to the
1407
           original document.
1408
1409
   Stored and indexed fields
1410
1411
           The fields file inside the Recoll configuration defines which
1412
           document fields are either "indexed" (searchable), "stored"
1413
           (retrievable with search results), or both.
1414
1415
   Data for an external indexer should be stored in a separate index, not
1416
   the one for the Recoll internal file system indexer (except if the latter
1417
   is not used at all). The reason is that the main document indexer purge
1418
   pass would remove all the other indexer's documents, as they were not seen
1419
   during indexing. The main indexer documents would also probably be a
1420
   problem for the external indexer purge operation.
1421
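   The udi construction described above (complete document path, truncated
   to length, with the suppressed part replaced by a hash) could look like
   the following sketch. This is not the code used by the Recoll file system
   indexer: the function name make_udi, the "|" separator and the choice of
   MD5 are all assumptions made for the illustration.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the text


def make_udi(file_path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build an identifier from file path + internal path, at most max_bytes.

    When the complete path is too long, its tail is dropped and replaced by
    a hash of the complete path, so distinct documents stay distinct.
    """
    full = ("%s|%s" % (file_path, ipath)).encode("utf-8")
    if len(full) <= max_bytes:
        return full
    digest = hashlib.md5(full).hexdigest().encode("ascii")  # 32 bytes
    return full[: max_bytes - len(digest)] + digest


print(make_udi("/home/me/mbox", "3"))
print(len(make_udi("/" + "a" * 300, "1")))  # never more than 200
```

   Hashing the complete path rather than only the dropped tail keeps the
   sketch short while still giving different identifiers to two long paths
   that share the same first bytes.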
1422
     ----------------------------------------------------------------------
1423
1424
  4.3.2. Python interface
1425
1426
    4.3.2.1. Introduction
1427
1428
   Recoll versions after 1.11 define a Python programming interface, both for
1429
   searching and indexing.
1430
1431
   The python interface is not built by default and can be found in the
1432
   source package, under python/recoll. The directory contains the usual
1433
   setup.py script which you can use to build and install the module:
1434
1435
         cd recoll-xxx/python/recoll
1436
         python setup.py build
1437
         python setup.py install
1438
     
1439
1440
     ----------------------------------------------------------------------
1441
1442
    4.3.2.2. Interface manual
1443
1444
   NAME
1445
       recoll - This is an interface to the Recoll full text indexer.
1446
1447
   FILE
1448
       /usr/local/lib/python2.5/site-packages/recoll.so
1449
1450
   CLASSES
1451
           Db
1452
           Doc
1453
           Query
1454
           SearchData
1455
       
1456
       class Db(__builtin__.object)
1457
        |  Db([confdir=None], [extra_dbs=None], [writable = False])
1458
        |  
1459
        |  A Db object holds a connection to a Recoll index. Use the connect()
1460
        |  function to create one.
1461
        |  confdir specifies a Recoll configuration directory (default: 
1462
        |   $RECOLL_CONFDIR or ~/.recoll).
1463
        |  extra_dbs is a list of external databases (xapian directories)
1464
        |  writable decides if we can index new data through this connection
1465
        |  
1466
        |  Methods defined here:
1467
        |  
1468
        |  
1469
        |  addOrUpdate(...)
1470
        |      addOrUpdate(udi, doc, parent_udi=None) -> None
1471
        |      Add or update index data for a given document
1472
        |      The udi string must define a unique id for the document. It is not
1473
        |      interpreted inside Recoll
1474
        |      doc is a Doc object
1475
        |      if parent_udi is set, this is a unique identifier for the
1476
        |      top-level container (ie mbox file)
1477
        |  
1478
        |  delete(...)
1479
        |      delete(udi) -> Bool.
1480
        |      Purge index from all data for udi. If udi matches a container
1481
        |      document, purge all subdocs (docs with a parent_udi matching udi).
1482
        |  
1483
        |  makeDocAbstract(...)
1484
        |      makeDocAbstract(Doc, Query) -> string
1485
        |      Build and return 'keyword-in-context' abstract for document
1486
        |      and query.
1487
        |  
1488
        |  needUpdate(...)
1489
        |      needUpdate(udi, sig) -> Bool.
1490
        |      Check if the index is up to date for the document defined by udi,
1491
        |      having the current signature sig.
1492
        |  
1493
        |  purge(...)
1494
        |      purge() -> Bool.
1495
        |      Delete all documents that were not touched during the just finished
1496
        |      indexing pass (since open-for-write). These are the documents for
1497
        |      which the needUpdate() call was not performed, indicating that they no
1498
        |      longer exist in the primary storage system.
1499
        |  
1500
        |  query(...)
1501
        |      query() -> Query. Return a new, blank query object for this index.
1502
        |  
1503
        |  setAbstractParams(...)
1504
        |      setAbstractParams(maxchars, contextwords).
1505
        |      Set the parameters used to build 'keyword-in-context' abstracts
1506
        |  
1199
     ----------------------------------------------------------------------
1507
        |  ----------------------------------------------------------------------
1508
        |  Data and other attributes defined here:
1509
        |  
1510
       
1511
       class Doc(__builtin__.object)
1512
        |  Doc()
1513
        |  
1514
        |  A Doc object contains index data for a given document.
1515
        |  The data is extracted from the index when searching, or set by the
1516
        |  indexer program when updating. The Doc object has no useful methods but
1517
        |  many attributes to be read or set by its user. It matches exactly the
1518
        |  Rcl::Doc c++ object. Some of the attributes are predefined, but, 
1519
        |  especially when indexing, others can be set, the name of which will be
1520
        |  processed as field names by the indexing configuration.
1521
        |  Inputs can be specified as unicode or strings.
1522
        |  Outputs are unicode objects.
1523
        |  All dates are specified as unix timestamps, printed as strings
1524
        |  Predefined attributes (index/query/both):
1525
        |   text (index): document plain text
1526
        |   url (both)
1527
        |   fbytes (both) optional file size in bytes
1528
        |   filename (both)
1529
        |   fmtime (both) optional file modification date. Unix time printed 
1530
        |      as string
1531
        |   dbytes (both) document text bytes
1532
        |   dmtime (both) document creation/modification date
1533
        |   ipath (both) value private to the app.: internal access path
1534
        |      inside file
1535
        |   mtype (both) mime type for original document
1536
        |   mtime (query) dmtime if set else fmtime
1537
        |   origcharset (both) charset the text was converted from
1538
        |   size (query) dbytes if set, else fbytes
1539
        |   sig (both) app-defined file modification signature. 
1540
        |      For up to date checks
1541
        |   relevancyrating (query)
1542
        |   abstract (both)
1543
        |   author (both)
1544
        |   title (both)
1545
        |   keywords (both)
1546
        |  
1547
        |  Methods defined here:
1548
        |  
1549
        |  
1550
        |  ----------------------------------------------------------------------
1551
        |  Data and other attributes defined here:
1552
        |  
1553
       
1554
       class Query(__builtin__.object)
1555
        |  Recoll Query objects are used to execute index searches. 
1556
        |  They must be created by the Db.query() method.
1557
        |  
1558
        |  Methods defined here:
1559
        |  
1560
        |  
1561
        |  execute(...)
1562
        |      execute(query_string, stemming=1|0)
1563
        |      
1564
        |      Starts a search for query_string, a Recoll search language string
1565
        |      (mostly Xesam-compatible).
1566
        |      The query can be a simple list of terms (and'ed by default), or more
1567
        |      complicated with field specs etc. See the Recoll manual.
1568
        |  
1569
        |  executesd(...)
1570
        |      executesd(SearchData)
1571
        |      
1572
        |      Starts a search for the query defined by the SearchData object.
1573
        |  
1574
        |  fetchone(...)
1575
        |      fetchone(None) -> Doc
1576
        |      
1577
        |      Fetches the next Doc object in the current search results.
1578
        |  
1579
        |  sortby(...)
1580
        |      sortby(field=fieldname, ascending=true)
1581
        |      Sort results by 'fieldname', in ascending or descending order.
1582
        |      Only one field can be used, no subsorts for now.
1583
        |      Must be called before executing the search
1584
        |  
1585
        |  ----------------------------------------------------------------------
1586
        |  Data descriptors defined here:
1587
        |  
1588
        |  next
1589
        |      Next index to be fetched from results. Normally increments after
1590
        |      each fetchone() call, but can be set/reset before the call to effect
1591
        |      seeking. Starts at 0
1592
        |  
1593
        |  ----------------------------------------------------------------------
1594
        |  Data and other attributes defined here:
1595
        |  
1596
       
1597
       class SearchData(__builtin__.object)
1598
        |  SearchData()
1599
        |  
1600
        |  A SearchData object describes a query. It has a number of global
1601
        |  parameters and a chain of search clauses.
1602
        |  
1603
        |  Methods defined here:
1604
        |  
1605
        |  
1606
        |  addclause(...)
1607
        |      addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
1608
        |                qstring=string, slack=int, field=string, stemming=1|0,
1609
        |                subSearch=SearchData)
1610
        |      Adds a simple clause to the SearchData And/Or chain, or a subquery
1611
        |      defined by another SearchData object
1612
        |  
1613
        |  ----------------------------------------------------------------------
1614
        |  Data and other attributes defined here:
1615
        |  
1200
1616
1617
   FUNCTIONS
1618
       connect(...)
1619
           connect([confdir=None], [extra_dbs=None], [writable = False])
1620
                    -> Db.
1621
           
1622
           Connects to a Recoll database and returns a Db object.
1623
           confdir specifies a Recoll configuration directory
1624
           (the default is built like for any Recoll program).
1625
           extra_dbs is a list of external databases (xapian directories)
1626
           writable decides if we can index new data through this connection
1627
1628
   
1629
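   The indexing methods listed above (needUpdate, addOrUpdate, purge) are
   meant to be used together during one indexing pass. The sketch below
   shows one way an external indexer might combine them. The function
   index_pass and its make_doc argument are hypothetical; db is only assumed
   to offer the three methods with the signatures given in the manual, so
   the function can be tried against a stand-in object without the recoll
   module being installed.

```python
def index_pass(db, make_doc, documents):
    """Run one indexing pass over (udi, sig, payload) tuples.

    db must behave like a writable recoll Db; make_doc(payload) must return
    a Doc-like object with its attributes (text, url, ...) already set.
    Returns the list of udi values that were added or updated.
    """
    updated = []
    for udi, sig, payload in documents:
        # Per the manual, documents for which needUpdate() is not called
        # during the pass are removed by the later purge().
        if db.needUpdate(udi, sig):
            doc = make_doc(payload)
            doc.sig = sig  # stored for the next up-to-date check
            db.addOrUpdate(udi, doc)
            updated.append(udi)
    db.purge()  # drop documents that were not seen during this pass
    return updated
```

   With the real module, db would presumably come from
   recoll.connect(writable=True) and make_doc would fill in a recoll.Doc()
   object; as explained in the interface elements section, such an indexer
   should write to its own index rather than to the one maintained by the
   Recoll file system indexer.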
1630
     ----------------------------------------------------------------------
1631
1632
    4.3.2.3. Example code
1633
1634
   The following sample would query the index with a user language string.
1635
   See the python/samples directory inside the Recoll source for other
1636
   examples.
1637
 #!/usr/bin/env python
 # Python 2 sample: runs a query and prints up to five results.

 import recoll

 # Connect with the default configuration and set up abstract generation.
 db = recoll.connect()
 db.setAbstractParams(maxchars=80, contextwords=2)

 query = db.query()
 nres = query.execute("some user question")
 print "Result count: ", nres
 # Show at most the first five results.
 if nres > 5:
     nres = 5
 # Loop while the next result index stays within the range to display.
 while query.next >= 0 and query.next < nres:
     doc = query.fetchone()
     print query.next
     for k in ("title", "size"):
         print k, ":", getattr(doc, k).encode('utf-8')
     # Named "abstract" to avoid shadowing the abs() builtin.
     abstract = db.makeDocAbstract(doc, query).encode('utf-8')
     print abstract
     print

     ----------------------------------------------------------------------

1201
                            Chapter 4. Installation
1663
                            Chapter 5. Installation
1202
1664
1203
4.1. Installing a prebuilt copy
1665
5.1. Installing a prebuilt copy
1204
1666
1205
   Recoll binary packages from the Recoll web site are always linked
1667
   Recoll binary packages from the Recoll web site are always linked
1206
   statically to the Xapian libraries, and have no other dependencies. You
1668
   statically to the Xapian libraries, and have no other dependencies. You
1207
   will only have to check or install supporting applications for the file
1669
   will only have to check or install supporting applications for the file
1208
   types that you want to index beyond text, HTML and mail files, and maybe
1670
   types that you want to index beyond text, HTML and mail files, and maybe
1209
   have a look at the configuration section (but this may not be necessary
1671
   have a look at the configuration section (but this may not be necessary
1210
   for a quick test with default parameters).
1672
   for a quick test with default parameters).
1211
1673
1212
     ----------------------------------------------------------------------
1674
     ----------------------------------------------------------------------
1213
1675
1214
  4.1.1. Installing through a package system
1676
  5.1.1. Installing through a package system
1215
1677
1216
   If you use a BSD-type port system or a prebuilt package (RPM or other),
1678
   If you use a BSD-type port system or a prebuilt package (RPM or other),
1217
   just follow the usual procedure for your system.
1679
   just follow the usual procedure for your system.
1218
1680
1219
     ----------------------------------------------------------------------
1681
     ----------------------------------------------------------------------
1220
1682
1221
  4.1.2. Installing a prebuilt Recoll
1683
  5.1.2. Installing a prebuilt Recoll
1222
1684
1223
   The unpackaged binary versions on the Recoll web site are just compressed
1685
   The unpackaged binary versions on the Recoll web site are just compressed
1224
   tar files of a build tree, where only the useful parts were kept
1686
   tar files of a build tree, where only the useful parts were kept
1225
   (executables and sample configuration).
1687
   (executables and sample configuration).
1226
1688
...
...
1231
   had built the package from source (that is, just type make install). The
1693
   had built the package from source (that is, just type make install). The
1232
   binary trees are built for installation to /usr/local.
1694
   binary trees are built for installation to /usr/local.
1233
1695
1234
     ----------------------------------------------------------------------
1696
     ----------------------------------------------------------------------
1235
1697
1236
4.2. Supporting packages
1698
5.2. Supporting packages
1237
1699
1238
   Recoll uses external applications to index some file types. You need to
1700
   Recoll uses external applications to index some file types. You need to
1239
   install them for the file types that you wish to have indexed (these are
1701
   install them for the file types that you wish to have indexed (these are
1240
   run-time dependencies. None is needed for building Recoll):
1702
   run-time dependencies. None is needed for building Recoll).
1703
1704
   After an indexing pass, the commands that were found missing can be
1705
   displayed from the recoll File menu. The list is stored in the missing
1706
   text file inside the configuration directory.
1707
1708
   A list of common file types which need external commands:
1241
1709
1242
     * Openoffice: supported natively, but needs the unzip command to be
1710
     * Openoffice: supported natively, but needs the unzip command to be
1243
       installed.
1711
       installed.
1244
1712
1245
     * PDF: pdftotext is part of the Xpdf package.
1713
     * PDF: pdftotext is part of the Xpdf package.
...
...
1273
   Text, HTML, mail folders, Openoffice and Scribus files are processed
1741
   Text, HTML, mail folders, Openoffice and Scribus files are processed
1274
   internally. Lyx is used to index Lyx files. Many filters need sed and awk.
1742
   internally. Lyx is used to index Lyx files. Many filters need sed and awk.
1275
1743
1276
     ----------------------------------------------------------------------
1744
     ----------------------------------------------------------------------
1277
1745
1278
4.3. Building from source
1746
5.3. Building from source
1279
1747
1280
  4.3.1. Prerequisites
1748
  5.3.1. Prerequisites
1281
1749
1282
   At the very least, you will need to download and install the xapian core
1750
   At the very least, you will need to download and install the xapian core
1283
   package (Recoll 1.9 normally uses version 1.0.2, but any 0.9 or 1.0.x
1751
   package (Recoll 1.9 normally uses version 1.0.2, but any 0.9 or 1.0.x
1284
   version will work too), and the qt run-time and development packages
1752
   version will work too), and the qt run-time and development packages
1285
   (Recoll development currently uses version 3.3.5, but any 3.3 version is
1753
   (Recoll development currently uses version 3.3.5, but any 3.3 version is
...
...
1293
   not be critical). On Linux systems, the iconv interface is part of libc
1761
   not be critical). On Linux systems, the iconv interface is part of libc
1294
   and you should not need to do anything special.
1762
   and you should not need to do anything special.
1295
1763
1296
     ----------------------------------------------------------------------
1764
     ----------------------------------------------------------------------
1297
1765
1298
  4.3.2. Building
1766
  5.3.2. Building
1299
1767
1300
   Recoll has been built on Linux (redhat7.3, mandriva 2005/6, Fedora Core
1768
   Recoll has been built on Linux (redhat7.3, mandriva 2005/6, Fedora Core
1301
   3/4/5/6), FreeBSD 5/6, macosx, and Solaris 8. If you build on another
1769
   3/4/5/6), FreeBSD 5/6, macosx, and Solaris 8. If you build on another
1302
   system, and need to modify things, I would very much welcome patches.
1770
   system, and need to modify things, I would very much welcome patches.
1303
1771
...
...
1333
   manually copy and modify one of the existing files (the new file name
1801
   manually copy and modify one of the existing files (the new file name
1334
   should be the output of uname -s).
1802
   should be the output of uname -s).
1335
1803
1336
     ----------------------------------------------------------------------
1804
     ----------------------------------------------------------------------
1337
1805
1338
  4.3.3. Installation
1806
  5.3.3. Installation
1339
1807
1340
   Either type make install or execute recollinstall prefix, in the root of
1808
   Either type make install or execute recollinstall prefix, in the root of
1341
   the source tree. This will copy the commands to prefix/bin and the sample
1809
   the source tree. This will copy the commands to prefix/bin and the sample
1342
   configuration files, scripts and other shared data to prefix/share/recoll.
1810
   configuration files, scripts and other shared data to prefix/share/recoll.
1343
1811
...
...
1348
1816
1349
   You can then proceed to configuration.
1817
   You can then proceed to configuration.
1350
1818
1351
     ----------------------------------------------------------------------
1819
     ----------------------------------------------------------------------
1352
1820
1353
4.4. Configuration overview
1821
5.4. Configuration overview
1354
1822
1355
   Most of the parameters specific to the recoll GUI are set through the
1823
   Most of the parameters specific to the recoll GUI are set through the
1356
   Preferences menu and stored in the standard QT place ($HOME/.qt/recollrc).
1824
   Preferences menu and stored in the standard QT place ($HOME/.qt/recollrc).
1357
   You probably do not want to edit this by hand.
1825
   You probably do not want to edit this by hand.
1358
1826
...
...
1408
   White space is used for separation inside lists. List elements with
1876
   White space is used for separation inside lists. List elements with
1409
   embedded spaces can be quoted using double-quotes.
1877
   embedded spaces can be quoted using double-quotes.
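
   The list syntax described above (white space as separator, double-quotes
   around elements with embedded spaces) can be illustrated with a short
   Python 3 sketch. This is not Recoll's own parser: the function name
   split_config_list and the sample paths are hypothetical, and details such
   as backslash handling are assumptions.

```python
import shlex


def split_config_list(value):
    """Split a white-space separated list, honouring double-quoted elements."""
    lexer = shlex.shlex(value, posix=True)
    lexer.whitespace_split = True  # split on white space only
    lexer.quotes = '"'             # assumption: only double quotes group words
    lexer.escape = ''              # assumption: backslashes are ordinary characters
    lexer.commenters = ''          # "#" is not treated as a comment inside a value
    return list(lexer)


if __name__ == "__main__":
    # Hypothetical value, as it might appear on the right-hand side of a
    # list-valued configuration parameter.
    for element in split_config_list('~/docs "/home/me/My Documents" /tmp'):
        print(element)
```

   The quoted element comes back as a single list entry with the quotes
   removed, while the unquoted entries are split on white space.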
1410
1878
1411
     ----------------------------------------------------------------------
1879
     ----------------------------------------------------------------------
1412
1880
1413
  4.4.1. Main configuration file
1881
  5.4.1. Main configuration file
1414
1882
1415
   recoll.conf is the main configuration file. It defines things like what to
1883
   recoll.conf is the main configuration file. It defines things like what to
1416
   index (top directories and things to ignore), and the default character
1884
   index (top directories and things to ignore), and the default character
1417
   set to use for document types which do not specify it internally.
1885
   set to use for document types which do not specify it internally.
1418
1886
...
...
1614
           cases. A value of 3 would allow more precision and efficiency on
2082
           cases. A value of 3 would allow more precision and efficiency on
1615
           longer words, but the index will be approximately twice as large.
2083
           longer words, but the index will be approximately twice as large.
1616
2084
1617
     ----------------------------------------------------------------------
2085
     ----------------------------------------------------------------------
1618
2086
1619
  4.4.2. The mimemap file
2087
  5.4.2. The mimemap file
1620
2088
1621
   mimemap specifies the file name extension to mime type mappings.
2089
   mimemap specifies the file name extension to mime type mappings.
1622
2090
1623
   For file names without an extension, or with an unknown one, the system's
2091
   For file names without an extension, or with an unknown one, the system's
1624
   file -i command will be executed to determine the mime type (this can be
2092
   file -i command will be executed to determine the mime type (this can be
...
...
1640
   there avoids cluttering the more user-oriented and locally customized
2108
   there avoids cluttering the more user-oriented and locally customized
1641
   skippedNames.
2109
   skippedNames.
1642
2110
1643
     ----------------------------------------------------------------------
2111
     ----------------------------------------------------------------------
1644
2112
1645
  4.4.3. The mimeconf file
2113
  5.4.3. The mimeconf file
1646
2114
1647
   mimeconf specifies how the different mime types are handled for indexing,
2115
   mimeconf specifies how the different mime types are handled for indexing,
1648
   and which icons are displayed in the recoll result lists.
2116
   and which icons are displayed in the recoll result lists.
1649
2117
1650
   Changing the parameters in the [index] section is probably not a good idea
2118
   Changing the parameters in the [index] section is probably not a good idea
...
...
1654
   recoll in the result lists (the values are the basenames of the png images
2122
   recoll in the result lists (the values are the basenames of the png images
1655
   inside the iconsdir directory, which is specified in recoll.conf).
2123
   inside the iconsdir directory, which is specified in recoll.conf).
1656
2124
1657
     ----------------------------------------------------------------------
2125
     ----------------------------------------------------------------------
1658
2126
1659
  4.4.4. The mimeview file
2127
  5.4.4. The mimeview file
1660
2128
1661
   mimeview specifies which programs are started when you click on an Edit
2129
   mimeview specifies which programs are started when you click on an Edit
1662
   link in a result list. For example, HTML is normally displayed using
2130
   link in a result list. For example, HTML is normally displayed using
1663
   firefox, but you may prefer Konqueror; your openoffice.org program might
2131
   firefox, but you may prefer Konqueror; your openoffice.org program might
1664
   be named ooffice instead of openoffice, etc.
2132
   be named ooffice instead of openoffice, etc.
...
...
1677
   user preferences, all mimeview entries will be ignored except the one
2145
   user preferences, all mimeview entries will be ignored except the one
1678
   labelled application/x-all (which is set to use xdg-open by default).
2146
   labelled application/x-all (which is set to use xdg-open by default).
1679
2147
1680
     ----------------------------------------------------------------------
2148
     ----------------------------------------------------------------------
1681
2149
1682
  4.4.5. Examples of configuration adjustments
2150
  5.4.5. Examples of configuration adjustments
1683
2151
1684
    4.4.5.1. Adding an external viewer for a non-indexed type
2152
    5.4.5.1. Adding an external viewer for a non-indexed type
1685
2153
1686
   Imagine that you have some kind of file which does not have indexable
2154
   Imagine that you have some kind of file which does not have indexable
1687
   content, but for which you would like to have a functional Edit link in
2155
   content, but for which you would like to have a functional Edit link in
1688
   the result list (when found by file name). The file names end in .blob and
2156
   the result list (when found by file name). The file names end in .blob and
1689
   can be displayed by application blobviewer.
2157
   can be displayed by application blobviewer.
...
...
1712
   The entries you add in your personal file override those in the central
2180
   The entries you add in your personal file override those in the central
1713
   configuration, which you do not need to alter.
2181
   configuration, which you do not need to alter.
1714
2182
1715
     ----------------------------------------------------------------------
2183
     ----------------------------------------------------------------------
1716
2184
1717
    4.4.5.2. Adding indexing support for a new file type
2185
    5.4.5.2. Adding indexing support for a new file type
1718
2186
1719
   Let us now imagine that the above .blob files actually contain indexable
2187
   Let us now imagine that the above .blob files actually contain indexable
1720
   text and that you know how to extract it with a command line program.
2188
   text and that you know how to extract it with a command line program.
1721
   Getting Recoll to index the files is easy. You need to perform the above
2189
   Getting Recoll to index the files is easy. You need to perform the above
1722
   alteration, and also to add data to the mimeconf file (typically in
2190
   alteration, and also to add data to the mimeconf file (typically in
...
...
1736
       makes sense (you can also create a category). Categories may be used
2204
       makes sense (you can also create a category). Categories may be used
1737
       for filtering in advanced search.
2205
       for filtering in advanced search.
1738
2206
1739
   The rclblob filter should be an executable program or script which exists
2207
   The rclblob filter should be an executable program or script which exists
1740
   inside /usr/[local/]share/recoll/filters. It will be given a file name as
2208
   inside /usr/[local/]share/recoll/filters. It will be given a file name as
1741
   argument and should output the text contents in html format on the
2209
   argument and should output the text contents on the standard output.
1742
   standard output.
1743
2210
1744
   You can find more details about writing a Recoll filter in the section
2211
   The filter programming section describes in more detail how to write a
1745
   about writing filters
2212
   filter.
1746
2213
1747
     ----------------------------------------------------------------------
2214
     ----------------------------------------------------------------------
1748
2215
1749
4.5. The KDE Kicker Recoll applet
2216
5.5. The KDE Kicker Recoll applet
1750
2217
1751
   The Recoll source tree contains the source code to the recoll_applet, a
2218
   The Recoll source tree contains the source code to the recoll_applet, a
1752
   small application derived from the find_applet. This can be used to add a
2219
   small application derived from the find_applet. This can be used to add a
1753
   small Recoll launcher to the KDE panel.
2220
   small Recoll launcher to the KDE panel.
1754
2221
1755
   The applet is not automatically built with the main Recoll programs. To
2222
   The applet is not automatically built with the main Recoll programs, nor
1756
   build it, you need to unpack the Recoll source code, then go to the
2223
   is it included with the main source distribution (because the KDE build
1757
   kde/recoll_applet/ directory, and type the usual configure;make;make
2224
   boilerplate makes it relatively big). You can download its source from the
1758
   install.
2225
   recoll.org download page. Use the omnipotent configure;make;make install
2226
   incantation to build and install.
1759
2227
1760
   You can then add the applet to the panel by right-clicking the panel and
2228
   You can then add the applet to the panel by right-clicking the panel and
1761
   choosing the Add applet entry.
2229
   choosing the Add applet entry.
1762
2230
1763
   The recoll_applet has a small text window where you can type a Recoll
2231
   The recoll_applet has a small text window where you can type a Recoll
1764
   query (in query language form), and an icon which can be used to restrict
2232
   query (in query language form), and an icon which can be used to restrict
1765
   the search to certain types of files.
2233
   the search to certain types of files. It is quite primitive, and launches
2234
   a new recoll GUI instance every time (even if it is already running). You
2235
   may find it useful anyway.
1766
2236
1767
     ----------------------------------------------------------------------
2237
     ----------------------------------------------------------------------

4.6. Extending Recoll

  4.6.1. Writing a document filter

   Recoll filters are executable programs which translate from a specific
   format (e.g. openoffice, acrobat) to the Recoll indexing input format,
   which was chosen to be HTML.

   Recoll filters are usually shell scripts, but this is not a requirement.
   These programs are extremely simple and most of the difficulty lies in
   extracting the text from the native format, not in outputting what is
   expected by Recoll. Happily enough, most document formats already have
   translators or text extractors which handle the difficult part and can be
   called from the filter.

   Filters are called with a single argument which is the source file name.
   They should output the result to stdout.

   The RECOLL_FILTER_FORPREVIEW environment variable (values yes, no) tells
   the filter whether the operation is for previewing or for indexing. Some
   filters use this to output a slightly different format. This is not
   essential.

   The output HTML could be very minimal like the following example:

 <html><head>
 <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
 </head>
 <body>some text content</body></html>

   You should take care to escape some characters inside the text by
   transforming them into appropriate entities. "&" should be transformed
   into "&amp;", "<" should be transformed into "&lt;".

   The character set needs to be specified in the header. It does not need to
   be UTF-8 (Recoll will take care of translating it), but it must be
   accurate for good results.
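
   The filter conventions described above (single file name argument, HTML on
   stdout, escaped "&" and "<", accurate character set in the header) can be
   sketched in Python 3 as follows. This is only an illustration for a
   hypothetical plain-text input format assumed to be UTF-8 encoded; the
   function names are invented for the example and it is not one of the
   filters shipped with Recoll.

```python
#!/usr/bin/env python3
import sys


def escape_text(text):
    # "&" is replaced first, so the "&" of a freshly produced "&lt;" is
    # not escaped a second time.
    return text.replace("&", "&amp;").replace("<", "&lt;")


def to_html(text, charset="UTF-8"):
    # The charset named in the header must match the bytes actually written.
    return (
        "<html><head>\n"
        '<meta http-equiv="Content-Type" content="text/html;charset=%s">\n'
        "</head>\n"
        "<body>%s</body></html>\n" % (charset, escape_text(text))
    )


def run_filter(filename):
    # Assumption: the input file is UTF-8 text; undecodable bytes are replaced.
    with open(filename, "rb") as source:
        text = source.read().decode("utf-8", "replace")
    # Write bytes so the output encoding is UTF-8 whatever the locale is.
    sys.stdout.buffer.write(to_html(text).encode("utf-8"))


# The filter receives the source file name as its single argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    run_filter(sys.argv[1])
```

   Keeping the escaping in its own function makes the replacement order easy
   to check: already escaped input such as "&lt;" becomes "&amp;lt;", so the
   text is indexed as it appears in the source file.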

   Recoll will also make use of other header fields if they are present:
   title, description, keywords.

   As of Recoll release 1.9, filters can also "invent" field names. These
   should be output as meta tags:

 <meta name="somefield" content="Some textual data" />

   In this case, a correspondence between field name and Xapian prefix should
   also be added to the mimeconf file. See the existing entries for
   inspiration. The field can then be used inside the query language to
   narrow searches.
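
   A filter written in Python 3 could build such a meta tag with a small
   helper like the one below. The helper name meta_tag is hypothetical, and
   escaping the attribute value with html.escape is an assumption based on
   ordinary HTML rules, not something stated in this manual.

```python
import html


def meta_tag(name, content):
    """Return a meta element carrying one field, with attribute values escaped."""
    # quote=True also escapes double quotes, which would otherwise end the
    # attribute value early.
    return '<meta name="%s" content="%s" />' % (
        html.escape(name, quote=True),
        html.escape(content, quote=True),
    )


if __name__ == "__main__":
    print(meta_tag("somefield", "Some textual data"))
```

   The field name still has to be declared in the mimeconf file as described
   above before it can be used in queries.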

   The easiest way to write a new filter is probably to start from an
   existing one.

     ----------------------------------------------------------------------