|
a/src/README |
|
b/src/README |
|
... |
|
... |
195 |
|
195 |
|
196 |
Chapter 1. Introduction
|
196 |
Chapter 1. Introduction
|
197 |
|
197 |
|
198 |
1.1. Giving it a try
|
198 |
1.1. Giving it a try
|
199 |
|
199 |
|
200 |
If you do not like reading manuals (who does?) and would like to give
|
200 |
If you do not like reading manuals (who does?) but wish to give Recoll a
|
201 |
Recoll a try, just install the application and start the recoll graphical
|
201 |
try, just install the application and start the recoll graphical user
|
202 |
user interface (GUI), which will ask to index your home directory by
|
202 |
interface (GUI), which will ask permission to index your home directory by
|
203 |
default, allowing you to search immediately after indexing completes.
|
203 |
default, allowing you to search immediately after indexing completes.
|
204 |
|
204 |
|
205 |
Do not do this if your home directory contains a huge number of documents
|
205 |
Do not do this if your home directory contains a huge number of documents
|
206 |
and you do not want to wait or are very short on disk space. In this case,
|
206 |
and you do not want to wait or are very short on disk space. In this case,
|
207 |
you may first want to customize the configuration to restrict the indexed
|
207 |
you may first want to customize the configuration to restrict the indexed
|
208 |
area.
|
208 |
area (for the very impatient with a completed package install, from the
|
|
|
209 |
recoll GUI: Preferences -> Indexing configuration, then adjust the Top
|
|
|
210 |
directories section).
|
209 |
|
211 |
|
210 |
Also be aware that you may need to install the appropriate supporting
|
212 |
Also be aware that you may need to install the appropriate supporting
|
211 |
applications for document types that need them (for example antiword for
|
213 |
applications for document types that need them (for example antiword for
|
212 |
Microsoft Word files).
|
214 |
Microsoft Word files).
|
213 |
|
215 |
|
214 |
1.2. Full text search
|
216 |
1.2. Full text search
|
215 |
|
217 |
|
216 |
Recoll is a full text search application. Full text search applications
|
218 |
Recoll is a full text search application. Full text search finds your data
|
217 |
let you find your data by content rather than by external attributes (like
|
219 |
by content rather than by external attributes (like a file name). You
|
218 |
a file name). More specifically, they will let you specify words (terms)
|
220 |
specify words (terms) which should or should not appear in the text you
|
219 |
that should or should not appear in the text you are looking for, and
|
221 |
are looking for, and receive in return a list of matching documents,
|
220 |
return a list of matching documents, ordered so that the most relevant
|
222 |
ordered so that the most relevant documents will appear first.
|
221 |
documents will appear first.
|
|
|
222 |
|
223 |
|
223 |
You do not need to remember in what file or email message you stored a
|
224 |
You do not need to remember in what file or email message you stored a
|
224 |
given piece of information. You just ask for related terms, and the tool
|
225 |
given piece of information. You just ask for related terms, and the tool
|
225 |
will return a list of documents where these terms are prominent, in a
|
226 |
will return a list of documents where these terms are prominent, in a
|
226 |
similar way to Internet search engines.
|
227 |
similar way to Internet search engines.
|
227 |
|
228 |
|
228 |
A search application tries to determine which documents are most relevant
|
229 |
Full text search applications try to determine which documents are most
|
229 |
to the search terms you provide. Computer algorithms for determining
|
230 |
relevant to the search terms you provide. Computer algorithms for
|
230 |
relevance can be very complex, and in general are inferior to the power of
|
231 |
determining relevance can be very complex, and in general are inferior to
|
231 |
the human mind to rapidly determine relevance. The quality of relevance
|
232 |
the power of the human mind to rapidly determine relevance. The quality of
|
232 |
guessing is probably the most important aspect when evaluating a search
|
233 |
relevance guessing is probably the most important aspect when evaluating a
|
233 |
application.
|
234 |
search application.
|
234 |
|
235 |
|
235 |
In many cases, you are looking for all the forms of a word, not for a
|
236 |
In many cases, you are looking for all the forms of a word, including
|
236 |
specific form or spelling. These different forms may include plurals,
|
|
|
237 |
different tenses for a verb, or terms derived from the same root or stem
|
237 |
plurals, different tenses for a verb, or terms derived from the same root
|
238 |
(example: floor, floors, floored, flooring...). Search applications
|
238 |
or stem (example: floor, floors, floored, flooring...). Queries are
|
239 |
usually expand queries to all such related terms (words that reduce to the
|
239 |
usually automatically expanded to all such related terms (words that
|
240 |
same stem) and also provide a way to disable this expansion if you are
|
240 |
reduce to the same stem). This can be disabled when searching for a
|
241 |
actually searching for a specific form.
|
241 |
specific form.
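
The expansion described above can be sketched in a few lines of Python. This is only a toy illustration: toy_stem and expand_query are hypothetical names, and the crude suffix stripping stands in for a real stemmer. It is not how Recoll or Xapian implement stemming.

```python
# Toy suffix-stripping stemmer: an illustration only, not a real stemmer.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query(term, index_terms):
    """Return every index term that reduces to the same stem as `term`."""
    stem = toy_stem(term)
    return sorted(t for t in index_terms if toy_stem(t) == stem)


# Hypothetical index contents, chosen to match the example in the text.
index_terms = ["floor", "floors", "floored", "flooring", "flour", "door"]
print(expand_query("floors", index_terms))
```

Disabling the expansion then simply means searching for the term as typed instead of calling expand_query.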
|
242 |
|
242 |
|
243 |
Stemming, by itself, does not accommodate misspellings or phonetic
|
243 |
Stemming, by itself, does not accommodate misspellings or phonetic
|
244 |
searches. Recoll supports these features through a specific tool (the term
|
244 |
searches. A full text search application may also support this form of
|
245 |
explorer) which will let you explore the set of index terms along
|
245 |
approximation. For example, a search for aliterattion returning no result
|
246 |
different modes.
|
246 |
may propose, depending on index contents, alliteration alteration
|
|
|
247 |
alterations altercation as possible replacement terms.
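
As a rough sketch of this kind of approximation, the following Python uses the standard difflib module to propose index terms close to a misspelled word. The function name suggest and the index contents are made up for the example, and difflib is only a stand-in: a real application would rely on a proper speller.

```python
import difflib


def suggest(term, index_terms, limit=4):
    """Propose replacement terms only when the term itself is not indexed."""
    if term in index_terms:
        return []  # the search would return results, nothing to propose
    return difflib.get_close_matches(term, index_terms, n=limit, cutoff=0.6)


# Hypothetical index contents.
index_terms = ["alliteration", "alteration", "alterations", "altercation",
               "floor", "index"]
print(suggest("aliterattion", index_terms))
```

The candidates come back ordered by similarity, so the closest spelling is listed first. Which terms are proposed depends entirely on what the index contains.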
|
247 |
|
248 |
|
248 |
1.3. Recoll overview
|
249 |
1.3. Recoll overview
|
249 |
|
250 |
|
250 |
Recoll uses the Xapian information retrieval library as its storage and
|
251 |
Recoll uses the Xapian information retrieval library as its storage and
|
251 |
retrieval engine. Xapian is a very mature package using a sophisticated
|
252 |
retrieval engine. Xapian is a very mature package using a sophisticated
|
252 |
probabilistic ranking model. Recoll provides the mechanisms and interface
|
253 |
probabilistic ranking model.
|
253 |
to get data into and out of the system.
|
|
|
254 |
|
254 |
|
255 |
In practice, Xapian works by remembering where terms appear in your
|
255 |
The Xapian library manages an index database which describes where terms
|
256 |
document files. The acquisition process is called indexing.
|
256 |
appear in your document files. It efficiently processes the complex
|
|
|
257 |
queries which are produced by the Recoll query expansion mechanism, and is
|
|
|
258 |
in charge of the all-important relevance computation task.
|
257 |
|
259 |
|
|
|
260 |
Recoll provides the mechanisms and interface to get data into and out of
|
|
|
261 |
the index. This includes translating the many possible document formats
|
|
|
262 |
into pure text, handling term variations (using Xapian stemmers), and
|
|
|
263 |
spelling approximations (using the aspell speller), interpreting user
|
|
|
264 |
queries and presenting results.
|
|
|
265 |
|
|
|
266 |
In short, Recoll does the dirty footwork, while Xapian deals with the
|
|
|
267 |
intelligent parts of the process.
|
|
|
268 |
|
258 |
The resulting index can be big (roughly the size of the original document
|
269 |
The Xapian index can be big (roughly the size of the original document
|
259 |
set), but it is not a document archive. Recoll can only display documents
|
270 |
set), but it is not a document archive. Recoll can only display documents
|
260 |
that still exist at the place from which they were indexed. (Actually,
|
271 |
that still exist at the place from which they were indexed. (Actually,
|
261 |
there is a way to reconstruct a document from the information in the
|
272 |
there is a way to reconstruct a document from the information in the
|
262 |
index, but the result is not nice, as all formatting, punctuation and
|
273 |
index, but the result is not nice, as all formatting, punctuation and
|
263 |
capitalization are lost).
|
274 |
capitalization are lost).
|
264 |
|
275 |
|
265 |
Recoll stores all internal data in Unicode UTF-8 format, and it can index
|
276 |
Recoll stores all internal data in Unicode UTF-8 format, and it can index
|
266 |
files with different character sets, encodings, and languages into the
|
277 |
files of many types with different character sets, encodings, and
|
267 |
same index. It has can process many document types.
|
278 |
languages into the same index. It can process documents embedded inside
|
|
|
279 |
other documents (for example a pdf document stored inside a Zip archive
|
|
|
280 |
sent as an email attachment...), down to an arbitrary depth.
|
268 |
|
281 |
|
269 |
Stemming is the process by which Recoll reduces words to their radicals so
|
282 |
Stemming is the process by which Recoll reduces words to their radicals so
|
270 |
that searching does not depend, for example, on a word being singular or
|
283 |
that searching does not depend, for example, on a word being singular or
|
271 |
plural (floor, floors), or on a verb tense (flooring, floored). Because
|
284 |
plural (floor, floors), or on a verb tense (flooring, floored). Because
|
272 |
the mechanisms used for stemming depend on the specific grammatical rules
|
285 |
the mechanisms used for stemming depend on the specific grammatical rules
|
|
... |
|
... |
316 |
parameters affecting only the recoll GUI are stored in the standard
|
329 |
parameters affecting only the recoll GUI are stored in the standard
|
317 |
location defined by Qt.
|
330 |
location defined by Qt.
|
318 |
|
331 |
|
319 |
The indexing process is started automatically the first time you execute
|
332 |
The indexing process is started automatically the first time you execute
|
320 |
the recoll GUI. Indexing can also be performed by executing the
|
333 |
the recoll GUI. Indexing can also be performed by executing the
|
321 |
recollindex command.
|
334 |
recollindex command. Recoll indexing is multithreaded by default when
|
|
|
335 |
appropriate hardware resources are available, and can run several of the
|


336 |
text extraction, segmentation and index update tasks in parallel.
|
322 |
|
337 |
|
323 |
Searches are usually performed inside the recoll GUI, which has many
|
338 |
Searches are usually performed inside the recoll GUI, which has many
|
324 |
options to help you find what you are looking for. However, there are
|
339 |
options to help you find what you are looking for. However, there are
|
325 |
other ways to perform Recoll searches: mostly a command line interface, a
|
340 |
other ways to perform Recoll searches: mostly a command line interface, a
|
326 |
Python programming interface, a KDE KIO slave module, and a Ubuntu Unity
|
341 |
Python programming interface, a KDE KIO slave module, and Ubuntu Unity
|
327 |
Lens module.
|
342 |
Lens (for older versions) or Scope (for current versions) modules.
|
328 |
|
343 |
|
329 |
Chapter 2. Indexing
|
344 |
Chapter 2. Indexing
|
330 |
|
345 |
|
331 |
2.1. Introduction
|
346 |
2.1. Introduction
|
332 |
|
347 |
|
333 |
Indexing is the process by which the set of documents is analyzed and the
|
348 |
Indexing is the process by which the set of documents is analyzed and the
|
334 |
data entered into the database. Recoll indexing is normally incremental:
|
349 |
data entered into the database. Recoll indexing is normally incremental:
|
335 |
documents will only be processed if they have been modified. On the first
|
350 |
documents will only be processed if they have been modified since the last
|
336 |
execution, all documents will need processing. A full index build can be
|
351 |
run. On the first execution, all documents will need processing. A full
|
337 |
forced later by specifying an option to the indexing command (recollindex
|
352 |
index build can be forced later by specifying an option to the indexing
|
338 |
-z or -Z).
|
353 |
command (recollindex -z or -Z).
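
The incremental behaviour can be pictured with the short Python sketch below. It assumes that a file's modification time is the change signal and keeps the recorded times in a plain dictionary. Both are simplifications made for the example, and files_needing_indexing and record_run are hypothetical helpers, not part of Recoll.

```python
import os
import tempfile


def files_needing_indexing(paths, indexed_mtimes, force_full=False):
    """Return the paths that must be (re)processed on this run."""
    todo = []
    for path in paths:
        mtime = os.stat(path).st_mtime
        # Process when forced, never seen, or changed since the last run.
        if force_full or indexed_mtimes.get(path) != mtime:
            todo.append(path)
    return todo


def record_run(paths, indexed_mtimes):
    """Remember the modification time seen for each processed path."""
    for path in paths:
        indexed_mtimes[path] = os.stat(path).st_mtime


with tempfile.TemporaryDirectory() as top:
    docs = []
    for name in ("a.txt", "b.txt"):
        path = os.path.join(top, name)
        with open(path, "w") as f:
            f.write("some text\n")
        docs.append(path)

    state = {}
    first = files_needing_indexing(docs, state)    # first run: everything
    record_run(first, state)
    second = files_needing_indexing(docs, state)   # nothing was modified
    os.utime(docs[0], (0, 12345))                  # simulate a modification
    third = files_needing_indexing(docs, state)    # only the touched file
    print(len(first), len(second), len(third))
```

Passing force_full=True plays the role of a forced full index build: every document is processed regardless of the recorded state.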
|
339 |
|
354 |
|
340 |
The following sections give an overview of different aspects of the
|
355 |
The following sections give an overview of different aspects of the
|
341 |
indexing processes and configuration, with links to detailed sections.
|
356 |
indexing processes and configuration, with links to detailed sections.
|
342 |
|
357 |
|
343 |
2.1.1. Indexing modes
|
358 |
2.1.1. Indexing modes
|
|
... |
|
... |
1460 |
|
1475 |
|
1461 |
Recoll automatically manages the expansion of search terms to their
|
1476 |
Recoll automatically manages the expansion of search terms to their
|
1462 |
derivatives (i.e. plural/singular, verb inflections). But there are other
|
1477 |
derivatives (i.e. plural/singular, verb inflections). But there are other
|
1463 |
cases where the exact search term is not known. For example, you may not
|
1478 |
cases where the exact search term is not known. For example, you may not
|
1464 |
remember the exact spelling, or only know the beginning of the name.
|
1479 |
remember the exact spelling, or only know the beginning of the name.
|
|
|
1480 |
|
|
|
1481 |
The search will only propose replacement terms with spelling variations
|
|
|
1482 |
when no matching documents were found. In some cases, both proper spellings
|
|
|
1483 |
and misspellings are present in the index, and it may be interesting to
|
|
|
1484 |
look for them explicitly.
|
1465 |
|
1485 |
|
1466 |
The term explorer tool (started from the toolbar icon or from the Term
|
1486 |
The term explorer tool (started from the toolbar icon or from the Term
|
1467 |
explorer entry of the Tools menu) can be used to search the full index
|
1487 |
explorer entry of the Tools menu) can be used to search the full index
|
1468 |
terms list. It has three modes of operation:
|
1488 |
terms list. It has three modes of operation:
|
1469 |
|
1489 |
|
|
... |
|
... |
3300 |
|
3320 |
|
3301 |
Now for the list:
|
3321 |
Now for the list:
|
3302 |
|
3322 |
|
3303 |
o Openoffice files need unzip and xsltproc.
|
3323 |
o Openoffice files need unzip and xsltproc.
|
3304 |
|
3324 |
|
3305 |
o PDF files need pdftotext which is part of the Xpdf or Poppler
|
3325 |
o PDF files need pdftotext which is part of Poppler (usually comes with
|
3306 |
packages.
|
3326 |
the poppler-utils package). Avoid the original one from Xpdf.
|
3307 |
|
3327 |
|
3308 |
o Postscript files need pstotext. The original version has an issue with
|
3328 |
o Postscript files need pstotext. The original version has an issue with
|
3309 |
shell characters in file names, which is corrected in recent packages.
|
3329 |
shell characters in file names, which is corrected in recent packages.
|
3310 |
See http://www.recoll.org/features.html for more detail.
|
3330 |
See http://www.recoll.org/features.html for more detail.
|
3311 |
|
3331 |
|
|
... |
|
... |
3318 |
o MS Open XML (docx) needs xsltproc.
|
3338 |
o MS Open XML (docx) needs xsltproc.
|
3319 |
|
3339 |
|
3320 |
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
|
3340 |
o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
|
3321 |
Ubuntu) package.
|
3341 |
Ubuntu) package.
|
3322 |
|
3342 |
|
3323 |
o RTF files need unrtf, which, in its standard version, has much trouble
|
3343 |
o RTF files need unrtf, which, in its older versions, has much trouble
|
3324 |
with non-western character sets. Check
|
3344 |
with non-western character sets. Many Linux distributions carry
|
3325 |
http://www.recoll.org/features.html.
|
3345 |
outdated unrtf versions. Check http://www.recoll.org/features.html for
|
|
|
3346 |
details.
|
3326 |
|
3347 |
|
3327 |
o TeX files need untex or detex. Check
|
3348 |
o TeX files need untex or detex. Check
|
3328 |
http://www.recoll.org/features.html for sources if it's not packaged
|
3349 |
http://www.recoll.org/features.html for sources if it's not packaged
|
3329 |
for your distribution.
|
3350 |
for your distribution.
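
A quick way to see which of the helpers listed above are missing is to look for them on the PATH. The Python sketch below uses the standard shutil.which function. The HELPERS table only copies the command names given in this list, and missing_helpers is a hypothetical helper written for the example: it says nothing about program versions, so it cannot detect an outdated unrtf or the Xpdf variant of pdftotext.

```python
import shutil

# Document type -> list of requirements; each requirement is a tuple of
# alternative commands, any one of which is enough.
HELPERS = {
    "OpenOffice": [("unzip",), ("xsltproc",)],
    "PDF": [("pdftotext",)],
    "Postscript": [("pstotext",)],
    "MS Open XML (docx)": [("xsltproc",)],
    "Wordperfect": [("wpd2html",)],
    "RTF": [("unrtf",)],
    "TeX": [("untex", "detex")],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return {document type: [unmet requirements]} for absent commands."""
    missing = {}
    for doc_type, requirements in helpers.items():
        unmet = [" or ".join(alternatives) for alternatives in requirements
                 if not any(which(cmd) for cmd in alternatives)]
        if unmet:
            missing[doc_type] = unmet
    return missing


for doc_type, unmet in missing_helpers().items():
    print(f"{doc_type}: missing {', '.join(unmet)}")
```

The which parameter exists only so that the lookup can be replaced in tests. In normal use the default is fine.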
|
3330 |
|
3351 |
|