Diff of /src/README [6d4e44] .. [88bccb]

...

Chapter 1. Introduction

1.1. Giving it a try

   If you do not like reading manuals (who does?) but wish to give Recoll a
   try, just install the application and start the recoll graphical user
   interface (GUI), which will ask permission to index your home directory by
   default, allowing you to search immediately after indexing completes.

   Do not do this if your home directory contains a huge number of documents
   and you do not want to wait or are very short on disk space. In this case,
   you may first want to customize the configuration to restrict the indexed
   area (for the very impatient with a completed package install, from the
   recoll GUI: Preferences -> Indexing configuration, then adjust the Top
   directories section).

   Also be aware that you may need to install the appropriate supporting
   applications for document types that need them (for example antiword for
   Microsoft Word files).

1.2. Full text search

   Recoll is a full text search application. Full text search finds your data
   by content rather than by external attributes (like a file name). You
   specify words (terms) which should or should not appear in the text you
   are looking for, and receive in return a list of matching documents,
   ordered so that the most relevant documents will appear first.

   You do not need to remember in what file or email message you stored a
   given piece of information. You just ask for related terms, and the tool
   will return a list of documents where these terms are prominent, in a
   similar way to Internet search engines.

   Full text search applications try to determine which documents are most
   relevant to the search terms you provide. Computer algorithms for
   determining relevance can be very complex, and in general are inferior to
   the power of the human mind to rapidly determine relevance. The quality of
   relevance guessing is probably the most important aspect when evaluating a
   search application.

   In many cases, you are looking for all the forms of a word, including
   plurals, different tenses for a verb, or terms derived from the same root
   or stem (example: floor, floors, floored, flooring...). Queries are
   usually automatically expanded to all such related terms (words that
   reduce to the same stem). This expansion can be disabled when searching
   for a specific form.
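
   The expansion described above can be sketched in a few lines of Python.
   This is a toy illustration only: toy_stem and expand_query_term are
   hypothetical names, and the crude suffix stripping shown here is an
   assumption standing in for a real stemmer.

```python
# Toy illustration only: a crude suffix stripper, not a real stemmer.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(word):
    """Strip one common English suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def expand_query_term(term, index_terms):
    """Return every index term which reduces to the same toy stem."""
    stem = toy_stem(term)
    return sorted(t for t in index_terms if toy_stem(t) == stem)


index_terms = ["floor", "floors", "floored", "flooring", "flood", "flour"]
print(expand_query_term("floors", index_terms))
```

   Searching for floors matches the four forms of floor, but not flood or
   flour, which reduce to different stems.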

   Stemming, by itself, does not accommodate misspellings or phonetic
   searches. A full text search application may also support this form of
   approximation. For example, a search for aliterattion which returns no
   result may propose, depending on index contents, alliteration,
   alteration, alterations and altercation as possible replacement terms.
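
   As a rough sketch of how such replacement terms might be proposed, the
   following uses difflib from the Python standard library to rank index
   terms by similarity to the misspelled word. This is an assumption made
   for illustration, not the speller used by Recoll; suggest_terms and the
   small term list are hypothetical.

```python
import difflib


def suggest_terms(word, index_terms, limit=4):
    """Return up to `limit` index terms which look similar to `word`."""
    return difflib.get_close_matches(word, index_terms, n=limit, cutoff=0.6)


# Hypothetical index contents, for illustration only.
index_terms = [
    "alliteration", "alteration", "alterations", "altercation",
    "floor", "index",
]
suggestions = suggest_terms("aliterattion", index_terms)
print(suggestions)
```

   The closest term is listed first, and unrelated terms such as floor fall
   below the similarity cutoff.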

1.3. Recoll overview

   Recoll uses the Xapian information retrieval library as its storage and
   retrieval engine. Xapian is a very mature package using a sophisticated
   probabilistic ranking model.

   The Xapian library manages an index database which describes where terms
   appear in your document files. It efficiently processes the complex
   queries which are produced by the Recoll query expansion mechanism, and is
   in charge of the all-important relevance computation task.
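
   The idea of an index which describes where terms appear can be
   illustrated with a minimal inverted index. This is a simplified sketch
   and not the Xapian data structure: build_index and the sample documents
   are hypothetical, and terms are split on white space only.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to {document name: [word positions]}."""
    index = defaultdict(dict)
    for name, text in documents.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(name, []).append(position)
    return index


# Hypothetical documents, for illustration only.
documents = {
    "notes.txt": "the floor plan for the new office",
    "mail.eml": "please sweep the floor",
}
index = build_index(documents)
print(index["floor"])
```

   Looking up a term returns the documents which contain it, with the
   positions of each occurrence, without reading the documents again.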

   Recoll provides the mechanisms and interface to get data into and out of
   the index. This includes translating the many possible document formats
   into pure text, handling term variations (using Xapian stemmers) and
   spelling approximations (using the aspell speller), interpreting user
   queries, and presenting results.

   In short, Recoll does the dirty footwork, while Xapian deals with the
   intelligent parts of the process.

   The Xapian index can be big (roughly the size of the original document
   set), but it is not a document archive. Recoll can only display documents
   that still exist at the place from which they were indexed. (Actually,
   there is a way to reconstruct a document from the information in the
   index, but the result is not nice, as all formatting, punctuation and
   capitalization are lost).

   Recoll stores all internal data in Unicode UTF-8 format, and it can index
   files of many types with different character sets, encodings, and
   languages into the same index. It can process documents embedded inside
   other documents (for example a pdf document stored inside a Zip archive
   sent as an email attachment...), down to an arbitrary depth.
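
   Processing embedded documents down to an arbitrary depth is naturally
   recursive. The sketch below, which only handles Zip archives held in
   memory, is an assumption made for illustration and not the Recoll
   implementation; walk_container, make_zip and the path notation are
   hypothetical.

```python
import io
import zipfile


def walk_container(data, path=""):
    """Yield (path, bytes) for each leaf document, descending into Zips."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                # Each member may itself be a container: recurse into it.
                yield from walk_container(archive.read(member), f"{path}/{member}")
    else:
        yield path, data


def make_zip(members):
    """Build an in-memory Zip archive from {name: bytes} (test helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


inner = make_zip({"report.txt": b"quarterly figures"})
outer = make_zip({"inner.zip": inner, "readme.txt": b"hello"})
leaves = list(walk_container(outer))
print(leaves)
```

   Each leaf is reported with the chain of containers which leads to it, so
   that the text inside can be indexed and later located again.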

   Stemming is the process by which Recoll reduces words to their radicals so
   that searching does not depend, for example, on a word being singular or
   plural (floor, floors), or on a verb tense (flooring, floored). Because
   the mechanisms used for stemming depend on the specific grammatical rules
...
   parameters affecting only the recoll GUI are stored in the standard
   location defined by Qt.

   The indexing process is started automatically the first time you execute
   the recoll GUI. Indexing can also be performed by executing the
   recollindex command. Recoll indexing is multithreaded by default when
   appropriate hardware resources are available, and can perform several
   tasks in parallel, among text extraction, segmentation and index updates.

   Searches are usually performed inside the recoll GUI, which has many
   options to help you find what you are looking for. However, there are
   other ways to perform Recoll searches: mostly a command line interface, a
   Python programming interface, a KDE KIO slave module, and Ubuntu Unity
   Lens (for older versions) or Scope (for current versions) modules.

Chapter 2. Indexing

2.1. Introduction

   Indexing is the process by which the set of documents is analyzed and the
   data entered into the database. Recoll indexing is normally incremental:
   documents will only be processed if they have been modified since the last
   run. On the first execution, all documents will need processing. A full
   index build can be forced later by specifying an option to the indexing
   command (recollindex -z or -Z).
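
   The principle of incremental indexing can be sketched as follows. This
   is a minimal illustration which assumes that a change is detected by
   comparing the file modification time and size with a stored signature;
   the actual test used by Recoll may differ, and files_needing_indexing
   and the signatures dictionary are hypothetical.

```python
import os
import tempfile


def files_needing_indexing(paths, signatures):
    """Return the paths whose (mtime, size) signature changed, and record it."""
    changed = []
    for path in paths:
        info = os.stat(path)
        signature = (info.st_mtime_ns, info.st_size)
        if signatures.get(path) != signature:
            changed.append(path)
            signatures[path] = signature
    return changed


with tempfile.TemporaryDirectory() as top:
    path = os.path.join(top, "note.txt")
    with open(path, "w") as handle:
        handle.write("first version")

    signatures = {}  # empty store: everything needs processing
    first_pass = files_needing_indexing([path], signatures)
    second_pass = files_needing_indexing([path], signatures)

    with open(path, "w") as handle:
        handle.write("second version, longer")
    third_pass = files_needing_indexing([path], signatures)

print(len(first_pass), len(second_pass), len(third_pass))
```

   The first pass processes the file, the second pass skips it, and the
   third pass processes it again because it was modified. In this sketch,
   emptying the signatures store has the effect of forcing a full rebuild.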

   The following sections give an overview of different aspects of the
   indexing processes and configuration, with links to detailed sections.

  2.1.1. Indexing modes
...

   Recoll automatically manages the expansion of search terms to their
   derivatives (i.e. plural/singular, verb inflections). But there are other
   cases where the exact search term is not known. For example, you may not
   remember the exact spelling, or only know the beginning of the name.

   The search will only propose replacement terms with spelling variations
   when no matching documents were found. In some cases, both proper
   spellings and misspellings are present in the index, and it may be
   interesting to look for them explicitly.

   The term explorer tool (started from the toolbar icon or from the Term
   explorer entry of the Tools menu) can be used to search the full index
   terms list. It has three modes of operations:

...

   Now for the list:

     o Openoffice files need unzip and xsltproc.

     o PDF files need pdftotext which is part of Poppler (usually comes with
       the poppler-utils package). Avoid the original one from Xpdf.

     o Postscript files need pstotext. The original version has an issue with
       shell characters in file names, which is corrected in recent packages.
       See http://www.recoll.org/features.html for more detail.

...
     o MS Open XML (docx) needs xsltproc.

     o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
       Ubuntu) package.

     o RTF files need unrtf, which, in its older versions, has much trouble
       with non-western character sets. Many Linux distributions carry
       outdated unrtf versions. Check http://www.recoll.org/features.html for
       details.

     o TeX files need untex or detex. Check
       http://www.recoll.org/features.html for sources if it's not packaged
       for your distribution.
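
   To check quickly which of the helper commands named in the list above
   are installed, a small script such as the following may be used. It is
   only a sketch: the HELPERS table and missing_helpers are hypothetical,
   the table covers only the entries shown in this excerpt, and TeX (which
   needs either untex or detex) is left out for simplicity.

```python
import shutil

# Helper commands named in the list above, keyed by document type.
HELPERS = {
    "OpenOffice": ["unzip", "xsltproc"],
    "PDF": ["pdftotext"],
    "PostScript": ["pstotext"],
    "MS Open XML (docx)": ["xsltproc"],
    "WordPerfect": ["wpd2html"],
    "RTF": ["unrtf"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return {document type: [commands not found on the PATH]}."""
    missing = {}
    for doc_type, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[doc_type] = absent
    return missing


for doc_type, commands in missing_helpers().items():
    print(f"{doc_type}: missing {', '.join(commands)}")
```

   The script only tests that a command is found on the PATH. It does not
   tell whether the installed version is a recent or suitable one.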


	a/src/README		b/src/README
	...		...
195		195
196	Chapter 1. Introduction	196	Chapter 1. Introduction
197		197
198	1.1. Giving it a try	198	1.1. Giving it a try
199		199
200	If you do not like reading manuals (who does?) and would like to give	200	If you do not like reading manuals (who does?) but wish to give Recoll a
201	Recoll a try, just install the application and start the recoll graphical	201	try, just install the application and start the recoll graphical user
202	user interface (GUI), which will ask to index your home directory by	202	interface (GUI), which will ask permission to index your home directory by
203	default, allowing you to search immediately after indexing completes.	203	default, allowing you to search immediately after indexing completes.
204		204
205	Do not do this if your home directory contains a huge number of documents	205	Do not do this if your home directory contains a huge number of documents
206	and you do not want to wait or are very short on disk space. In this case,	206	and you do not want to wait or are very short on disk space. In this case,
207	you may first want to customize the configuration to restrict the indexed	207	you may first want to customize the configuration to restrict the indexed
208	area.	208	area (for the very impatient with a completed package install, from the
		209	recoll GUI: Preferences -> Indexing configuration, then adjust the Top
		210	directories section).
209		211
210	Also be aware that you may need to install the appropriate supporting	212	Also be aware that you may need to install the appropriate supporting
211	applications for document types that need them (for example antiword for	213	applications for document types that need them (for example antiword for
212	Microsoft Word files).	214	Microsoft Word files).
213		215
214	1.2. Full text search	216	1.2. Full text search
215		217
216	Recoll is a full text search application. Full text search applications	218	Recoll is a full text search application. Full text search finds your data
217	let you find your data by content rather than by external attributes (like	219	by content rather than by external attributes (like a file name). You
218	a file name). More specifically, they will let you specify words (terms)	220	specify words (terms) which should or should not appear in the text you
219	that should or should not appear in the text you are looking for, and	221	are looking for, and receive in return a list of matching documents,
220	return a list of matching documents, ordered so that the most relevant	222	ordered so that the most relevant documents will appear first.
221	documents will appear first.
222		223
223	You do not need to remember in what file or email message you stored a	224	You do not need to remember in what file or email message you stored a
224	given piece of information. You just ask for related terms, and the tool	225	given piece of information. You just ask for related terms, and the tool
225	will return a list of documents where these terms are prominent, in a	226	will return a list of documents where these terms are prominent, in a
226	similar way to Internet search engines.	227	similar way to Internet search engines.
227		228
228	A search application tries to determine which documents are most relevant	229	Full text search applications try to determine which documents are most
229	to the search terms you provide. Computer algorithms for determining	230	relevant to the search terms you provide. Computer algorithms for
230	relevance can be very complex, and in general are inferior to the power of	231	determining relevance can be very complex, and in general are inferior to
231	the human mind to rapidly determine relevance. The quality of relevance	232	the power of the human mind to rapidly determine relevance. The quality of
232	guessing is probably the most important aspect when evaluating a search	233	relevance guessing is probably the most important aspect when evaluating a
233	application.	234	search application.
234		235
235	In many cases, you are looking for all the forms of a word, not for a	236	In many cases, you are looking for all the forms of a word, including
236	specific form or spelling. These different forms may include plurals,
237	different tenses for a verb, or terms derived from the same root or stem	237	plurals, different tenses for a verb, or terms derived from the same root
238	(example: floor, floors, floored, flooring...). Search applications	238	or stem (example: floor, floors, floored, flooring...). Queries are
239	usually expand queries to all such related terms (words that reduce to the	239	usually automatically expanded to all such related terms (words that
240	same stem) and also provide a way to disable this expansion if you are	240	reduce to the same stem). This can be prevented for searching for a
241	actually searching for a specific form.	241	specific form.
242		242
243	Stemming, by itself, does not accommodate for misspellings or phonetic	243	Stemming, by itself, does not accommodate for misspellings or phonetic
244	searches. Recoll supports these features through a specific tool (the term	244	searches. A full text search application may also support this form of
245	explorer) which will let you explore the set of index terms along	245	approximation. For example, a search for aliterattion returning no result
246	different modes.	246	may propose, depending on index contents, alliteration alteration
		247	alterations altercation as possible replacement terms.
247		248
248	1.3. Recoll overview	249	1.3. Recoll overview
249		250
250	Recoll uses the Xapian information retrieval library as its storage and	251	Recoll uses the Xapian information retrieval library as its storage and
251	retrieval engine. Xapian is a very mature package using a sophisticated	252	retrieval engine. Xapian is a very mature package using a sophisticated
252	probabilistic ranking model. Recoll provides the mechanisms and interface	253	probabilistic ranking model.
253	to get data into and out of the system.
254		254
255	In practice, Xapian works by remembering where terms appear in your	255	The Xapian library manages an index database which describes where terms
256	document files. The acquisition process is called indexing.	256	appear in your document files. It efficiently processes the complex
		257	queries which are produced by the Recoll query expansion mechanism, and is
		258	in charge of the all-important relevance computation task.
257		259
		260	Recoll provides the mechanisms and interface to get data into and out of
		261	the index. This includes translating the many possible document formats
		262	into pure text, handling term variations (using Xapian stemmers), and
		263	spelling approximations (using the aspell speller), interpreting user
		264	queries and presenting results.
		265
		266	In a shorter way, Recoll does the dirty footwork, Xapian deals with the
		267	intelligent parts of the process.
		268
258	The resulting index can be big (roughly the size of the original document	269	The Xapian index can be big (roughly the size of the original document
259	set), but it is not a document archive. Recoll can only display documents	270	set), but it is not a document archive. Recoll can only display documents
260	that still exist at the place from which they were indexed. (Actually,	271	that still exist at the place from which they were indexed. (Actually,
261	there is a way to reconstruct a document from the information in the	272	there is a way to reconstruct a document from the information in the
262	index, but the result is not nice, as all formatting, punctuation and	273	index, but the result is not nice, as all formatting, punctuation and
263	capitalization are lost).	274	capitalization are lost).
264		275
265	Recoll stores all internal data in Unicode UTF-8 format, and it can index	276	Recoll stores all internal data in Unicode UTF-8 format, and it can index
266	files with different character sets, encodings, and languages into the	277	files of many types with different character sets, encodings, and
267	same index. It has can process many document types.	278	languages into the same index. It can process documents embedded inside
		279	other documents (for example a pdf document stored inside a Zip archive
		280	sent as an email attachment...), down to an arbitrary depth.
268		281
269	Stemming is the process by which Recoll reduces words to their radicals so	282	Stemming is the process by which Recoll reduces words to their radicals so
270	that searching does not depend, for example, on a word being singular or	283	that searching does not depend, for example, on a word being singular or
271	plural (floor, floors), or on a verb tense (flooring, floored). Because	284	plural (floor, floors), or on a verb tense (flooring, floored). Because
272	the mechanisms used for stemming depend on the specific grammatical rules	285	the mechanisms used for stemming depend on the specific grammatical rules
	...		...
316	parameters affecting only the recoll GUI are stored in the standard	329	parameters affecting only the recoll GUI are stored in the standard
317	location defined by Qt.	330	location defined by Qt.
318		331
319	The indexing process is started automatically the first time you execute	332	The indexing process is started automatically the first time you execute
320	the recoll GUI. Indexing can also be performed by executing the	333	the recoll GUI. Indexing can also be performed by executing the
321	recollindex command.	334	recollindex command. Recoll indexing is multithreaded by default when
		335	appropriate hardware resources are available, and can perform in parallel
		336	multiple tasks among text extraction, segmentation and index updates.
322		337
323	Searches are usually performed inside the recoll GUI, which has many	338	Searches are usually performed inside the recoll GUI, which has many
324	options to help you find what you are looking for. However, there are	339	options to help you find what you are looking for. However, there are
325	other ways to perform Recoll searches: mostly a command line interface, a	340	other ways to perform Recoll searches: mostly a command line interface, a
326	Python programming interface, a KDE KIO slave module, and a Ubuntu Unity	341	Python programming interface, a KDE KIO slave module, and Ubuntu Unity
327	Lens module.	342	Lens (for older versions) or Scope (for current versions) modules.
328		343
329	Chapter 2. Indexing	344	Chapter 2. Indexing
330		345
331	2.1. Introduction	346	2.1. Introduction
332		347
333	Indexing is the process by which the set of documents is analyzed and the	348	Indexing is the process by which the set of documents is analyzed and the
334	data entered into the database. Recoll indexing is normally incremental:	349	data entered into the database. Recoll indexing is normally incremental:
335	documents will only be processed if they have been modified. On the first	350	documents will only be processed if they have been modified since the last
336	execution, all documents will need processing. A full index build can be	351	run. On the first execution, all documents will need processing. A full
337	forced later by specifying an option to the indexing command (recollindex	352	index build can be forced later by specifying an option to the indexing
338	-z or -Z).	353	command (recollindex -z or -Z).
339		354
340	The following sections give an overview of different aspects of the	355	The following sections give an overview of different aspects of the
341	indexing processes and configuration, with links to detailed sections.	356	indexing processes and configuration, with links to detailed sections.
342		357
343	2.1.1. Indexing modes	358	2.1.1. Indexing modes
	...		...
1460		1475
1461	Recoll automatically manages the expansion of search terms to their	1476	Recoll automatically manages the expansion of search terms to their
1462	derivatives (ie: plural/singular, verb inflections). But there are other	1477	derivatives (ie: plural/singular, verb inflections). But there are other
1463	cases where the exact search term is not known. For example, you may not	1478	cases where the exact search term is not known. For example, you may not
1464	remember the exact spelling, or only know the beginning of the name.	1479	remember the exact spelling, or only know the beginning of the name.
		1480
		1481	The search will only propose replacement terms with spelling variations
		1482	when no matching documents were found. In some cases, both proper
		1483	spellings and misspellings are present in the index, and it may be
		1484	interesting to look for them explicitly.
1465		1485
1466	The term explorer tool (started from the toolbar icon or from the Term	1486	The term explorer tool (started from the toolbar icon or from the Term
1467	explorer entry of the Tools menu) can be used to search the full index	1487	explorer entry of the Tools menu) can be used to search the full index
1468	terms list. It has three modes of operations:	1488	terms list. It has three modes of operations:
1469		1489
	...		...
3300		3320
3301	Now for the list:	3321	Now for the list:
3302		3322
3303	o Openoffice files need unzip and xsltproc.	3323	o Openoffice files need unzip and xsltproc.
3304		3324
3305	o PDF files need pdftotext which is part of the Xpdf or Poppler	3325	o PDF files need pdftotext which is part of Poppler (usually comes with
3306	packages.	3326	the poppler-utils package). Avoid the original one from Xpdf.
3307		3327
3308	o Postscript files need pstotext. The original version has an issue with	3328	o Postscript files need pstotext. The original version has an issue with
3309	shell character in file names, which is corrected in recent packages.	3329	shell character in file names, which is corrected in recent packages.
3310	See http://www.recoll.org/features.html for more detail.	3330	See http://www.recoll.org/features.html for more detail.
3311		3331
	...		...
3318	o MS Open XML (docx) needs xsltproc.	3338	o MS Open XML (docx) needs xsltproc.
3319		3339
3320	o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on	3340	o Wordperfect files need wpd2html from the libwpd (or libwpd-tools on
3321	Ubuntu) package.	3341	Ubuntu) package.
3322		3342
3323	o RTF files need unrtf, which, in its standard version, has much trouble	3343	o RTF files need unrtf, which, in its older versions, has much trouble
3324	with non-western character sets. Check	3344	with non-western character sets. Many Linux distributions carry
3325	http://www.recoll.org/features.html.	3345	outdated unrtf versions. Check http://www.recoll.org/features.html for
		3346	details.
3326		3347
3327	o TeX files need untex or detex. Check	3348	o TeX files need untex or detex. Check
3328	http://www.recoll.org/features.html for sources if it's not packaged	3349	http://www.recoll.org/features.html for sources if it's not packaged
3329	for your distribution.	3350	for your distribution.
3330		3351