a/src/README b/src/README
...
...
195
195
196
Chapter 1. Introduction
196
Chapter 1. Introduction
197
197
198
1.1. Giving it a try
198
1.1. Giving it a try
199
199
200
   If you do not like reading manuals (who does?) and would like to give
200
   If you do not like reading manuals (who does?) but wish to give Recoll a
201
   Recoll a try, just install the application and start the recoll graphical
201
   try, just install the application and start the recoll graphical user
202
   user interface (GUI), which will ask to index your home directory by
202
   interface (GUI), which will ask permission to index your home directory by
203
   default, allowing you to search immediately after indexing completes.
203
   default, allowing you to search immediately after indexing completes.
204
204
205
   Do not do this if your home directory contains a huge number of documents
205
   Do not do this if your home directory contains a huge number of documents
206
   and you do not want to wait or are very short on disk space. In this case,
206
   and you do not want to wait or are very short on disk space. In this case,
207
   you may first want to customize the configuration to restrict the indexed
207
   you may first want to customize the configuration to restrict the indexed
208
   area.
208
   area (for the very impatient with a completed package install, from the
209
   recoll GUI: Preferences -> Indexing configuration, then adjust the Top
210
   directories section).
209
211
210
   Also be aware that you may need to install the appropriate supporting
212
   Also be aware that you may need to install the appropriate supporting
211
   applications for document types that need them (for example antiword for
213
   applications for document types that need them (for example antiword for
212
   Microsoft Word files).
214
   Microsoft Word files).
213
215
214
1.2. Full text search
216
1.2. Full text search
215
217
216
   Recoll is a full text search application. Full text search applications
218
   Recoll is a full text search application. Full text search finds your data
217
   let you find your data by content rather than by external attributes (like
219
   by content rather than by external attributes (like a file name). You
218
   a file name). More specifically, they will let you specify words (terms)
220
   specify words (terms) which should or should not appear in the text you
219
   that should or should not appear in the text you are looking for, and
221
   are looking for, and receive in return a list of matching documents,
220
   return a list of matching documents, ordered so that the most relevant
222
   ordered so that the most relevant documents will appear first.
221
   documents will appear first.
222
223
223
   You do not need to remember in what file or email message you stored a
224
   You do not need to remember in what file or email message you stored a
224
   given piece of information. You just ask for related terms, and the tool
225
   given piece of information. You just ask for related terms, and the tool
225
   will return a list of documents where these terms are prominent, in a
226
   will return a list of documents where these terms are prominent, in a
226
   similar way to Internet search engines.
227
   similar way to Internet search engines.
227
228
228
   A search application tries to determine which documents are most relevant
229
   Full text search applications try to determine which documents are most
229
   to the search terms you provide. Computer algorithms for determining
230
   relevant to the search terms you provide. Computer algorithms for
230
   relevance can be very complex, and in general are inferior to the power of
231
   determining relevance can be very complex, and in general are inferior to
231
   the human mind to rapidly determine relevance. The quality of relevance
232
   the power of the human mind to rapidly determine relevance. The quality of
232
   guessing is probably the most important aspect when evaluating a search
233
   relevance guessing is probably the most important aspect when evaluating a
233
   application.
234
   search application.
234
235
235
   In many cases, you are looking for all the forms of a word, not for a
236
   In many cases, you are looking for all the forms of a word, including
236
   specific form or spelling. These different forms may include plurals,
237
   different tenses for a verb, or terms derived from the same root or stem
237
   plurals, different tenses for a verb, or terms derived from the same root
238
   (example: floor, floors, floored, flooring...). Search applications
238
   or stem (example: floor, floors, floored, flooring...). Queries are
239
   usually expand queries to all such related terms (words that reduce to the
239
   usually automatically expanded to all such related terms (words that
240
   same stem) and also provide a way to disable this expansion if you are
240
   reduce to the same stem). This can be prevented when searching for a
241
   actually searching for a specific form.
241
   specific form.
242
242
243
   Stemming, by itself, does not accommodate for misspellings or phonetic
243
   Stemming, by itself, does not accommodate for misspellings or phonetic
244
   searches. Recoll supports these features through a specific tool (the term
244
   searches. A full text search application may also support this form of
245
   explorer) which will let you explore the set of index terms along
245
   approximation. For example, a search for aliterattion returning no result
246
   different modes.
246
   may propose, depending on index contents, alliteration alteration
247
   alterations altercation as possible replacement terms.
247
248
248
1.3. Recoll overview
249
1.3. Recoll overview
249
250
250
   Recoll uses the Xapian information retrieval library as its storage and
251
   Recoll uses the Xapian information retrieval library as its storage and
251
   retrieval engine. Xapian is a very mature package using a sophisticated
252
   retrieval engine. Xapian is a very mature package using a sophisticated
252
   probabilistic ranking model. Recoll provides the mechanisms and interface
253
   probabilistic ranking model.
253
   to get data into and out of the system.
254
254
255
   In practice, Xapian works by remembering where terms appear in your
255
   The Xapian library manages an index database which describes where terms
256
   document files. The acquisition process is called indexing.
256
   appear in your document files. It efficiently processes the complex
257
   queries which are produced by the Recoll query expansion mechanism, and is
258
   in charge of the all-important relevance computation task.
257
259
260
   Recoll provides the mechanisms and interface to get data into and out of
261
   the index. This includes translating the many possible document formats
262
   into pure text, handling term variations (using Xapian stemmers), and
263
   spelling approximations (using the aspell speller), interpreting user
264
   queries and presenting results.
265
266
   In short, Recoll does the dirty footwork, while Xapian deals with the
267
   intelligent parts of the process.
268
258
   The resulting index can be big (roughly the size of the original document
269
   The Xapian index can be big (roughly the size of the original document
259
   set), but it is not a document archive. Recoll can only display documents
270
   set), but it is not a document archive. Recoll can only display documents
260
   that still exist at the place from which they were indexed. (Actually,
271
   that still exist at the place from which they were indexed. (Actually,
261
   there is a way to reconstruct a document from the information in the
272
   there is a way to reconstruct a document from the information in the
262
   index, but the result is not nice, as all formatting, punctuation and
273
   index, but the result is not nice, as all formatting, punctuation and
263
   capitalization are lost).
274
   capitalization are lost).
264
275
265
   Recoll stores all internal data in Unicode UTF-8 format, and it can index
276
   Recoll stores all internal data in Unicode UTF-8 format, and it can index
266
   files with different character sets, encodings, and languages into the
277
   files of many types with different character sets, encodings, and
267
   same index. It has can process many document types.
278
   languages into the same index. It can process documents embedded inside
279
   other documents (for example a pdf document stored inside a Zip archive
280
   sent as an email attachment...), down to an arbitrary depth.
268
281
269
   Stemming is the process by which Recoll reduces words to their radicals so
282
   Stemming is the process by which Recoll reduces words to their radicals so
270
   that searching does not depend, for example, on a word being singular or
283
   that searching does not depend, for example, on a word being singular or
271
   plural (floor, floors), or on a verb tense (flooring, floored). Because
284
   plural (floor, floors), or on a verb tense (flooring, floored). Because
272
   the mechanisms used for stemming depend on the specific grammatical rules
285
   the mechanisms used for stemming depend on the specific grammatical rules
...
...
316
   parameters affecting only the recoll GUI are stored in the standard
329
   parameters affecting only the recoll GUI are stored in the standard
317
   location defined by Qt.
330
   location defined by Qt.
318
331
319
   The indexing process is started automatically the first time you execute
332
   The indexing process is started automatically the first time you execute
320
   the recoll GUI. Indexing can also be performed by executing the
333
   the recoll GUI. Indexing can also be performed by executing the
321
   recollindex command.
334
   recollindex command. Recoll indexing is multithreaded by default when
335
   appropriate hardware resources are available, and can perform in parallel
336
   multiple tasks among text extraction, segmentation and index updates.
322
337
323
   Searches are usually performed inside the recoll GUI, which has many
338
   Searches are usually performed inside the recoll GUI, which has many
324
   options to help you find what you are looking for. However, there are
339
   options to help you find what you are looking for. However, there are
325
   other ways to perform Recoll searches: mostly a command line interface, a
340
   other ways to perform Recoll searches: mostly a command line interface, a
326
   Python programming interface, a KDE KIO slave module, and a Ubuntu Unity
341
   Python programming interface, a KDE KIO slave module, and Ubuntu Unity
327
   Lens module.
342
   Lens (for older versions) or Scope (for current versions) modules.
328
343
329
Chapter 2. Indexing
344
Chapter 2. Indexing
330
345
331
2.1. Introduction
346
2.1. Introduction
332
347
333
   Indexing is the process by which the set of documents is analyzed and the
348
   Indexing is the process by which the set of documents is analyzed and the
334
   data entered into the database. Recoll indexing is normally incremental:
349
   data entered into the database. Recoll indexing is normally incremental:
335
   documents will only be processed if they have been modified. On the first
350
   documents will only be processed if they have been modified since the last
336
   execution, all documents will need processing. A full index build can be
351
   run. On the first execution, all documents will need processing. A full
337
   forced later by specifying an option to the indexing command (recollindex
352
   index build can be forced later by specifying an option to the indexing
338
   -z or -Z).
353
   command (recollindex -z or -Z).
339
354
340
   The following sections give an overview of different aspects of the
355
   The following sections give an overview of different aspects of the
341
   indexing processes and configuration, with links to detailed sections.
356
   indexing processes and configuration, with links to detailed sections.
342
357
343
  2.1.1. Indexing modes
358
  2.1.1. Indexing modes
...
...
1460
1475
1461
   Recoll automatically manages the expansion of search terms to their
1476
   Recoll automatically manages the expansion of search terms to their
1462
   derivatives (ie: plural/singular, verb inflections). But there are other
1477
   derivatives (ie: plural/singular, verb inflections). But there are other
1463
   cases where the exact search term is not known. For example, you may not
1478
   cases where the exact search term is not known. For example, you may not
1464
   remember the exact spelling, or only know the beginning of the name.
1479
   remember the exact spelling, or only know the beginning of the name.
1480
1481
   The search will only propose replacement terms with spelling variations
1482
   when no matching documents were found. In some cases, both proper spellings
1483
   and misspellings are present in the index, and it may be interesting to
1484
   look for them explicitly.
1465
1485
1466
   The term explorer tool (started from the toolbar icon or from the Term
1486
   The term explorer tool (started from the toolbar icon or from the Term
1467
   explorer entry of the Tools menu) can be used to search the full index
1487
   explorer entry of the Tools menu) can be used to search the full index
1468
   terms list. It has three modes of operations:
1488
   terms list. It has three modes of operations:
1469
1489
...
...
3300
3320
3301
   Now for the list:
3321
   Now for the list:
3302
3322
3303
     o Openoffice files need unzip and xsltproc.
3323
     o Openoffice files need unzip and xsltproc.
3304
3324
3305
     o PDF files need pdftotext which is part of the Xpdf or Poppler
3325
     o PDF files need pdftotext which is part of Poppler (usually comes with
3306
       packages.
3326
       the poppler-utils package). Avoid the original one from Xpdf.
3307
3327
3308
     o Postscript files need pstotext. The original version has an issue with
3328
     o Postscript files need pstotext. The original version has an issue with
3309
       shell character in file names, which is corrected in recent packages.
3329
       shell character in file names, which is corrected in recent packages.
3310
       See http://www.recoll.org/features.html for more detail.
3330
       See http://www.recoll.org/features.html for more detail.
3311
3331
...
...
3318
     o MS Open XML (docx) needs xsltproc.
3338
     o MS Open XML (docx) needs xsltproc.
3319
3339
3320
     o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
3340
     o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
3321
       Ubuntu) package.
3341
       Ubuntu) package.
3322
3342
3323
     o RTF files need unrtf, which, in its standard version, has much trouble
3343
     o RTF files need unrtf, which, in its older versions, has much trouble
3324
       with non-western character sets. Check
3344
       with non-western character sets. Many Linux distributions carry
3325
       http://www.recoll.org/features.html.
3345
       outdated unrtf versions. Check http://www.recoll.org/features.html for
3346
       details.
3326
3347
3327
     o TeX files need untex or detex. Check
3348
     o TeX files need untex or detex. Check
3328
       http://www.recoll.org/features.html for sources if it's not packaged
3349
       http://www.recoll.org/features.html for sources if it's not packaged
3329
       for your distribution.
3350
       for your distribution.
3330
3351