Recoll uses the Xapian information retrieval library as its storage and retrieval engine. Xapian is a very mature package using a sophisticated probabilistic ranking model.
The Xapian library manages an index database which describes where terms appear in your document files. It efficiently processes the complex queries which are produced by the Recoll query expansion mechanism, and is in charge of the all-important relevance computation task.
Recoll provides the mechanisms and interface to get data into and out of the index. This includes translating the many possible document formats into pure text, handling term variations (using Xapian stemmers) and spelling approximations (using the aspell speller), interpreting user queries, and presenting results.
In short, Recoll does the dirty footwork, while Xapian deals with the intelligent parts of the process.
The Xapian index can be big (roughly the size of the original document set), but it is not a document archive. Recoll can only display documents that still exist at the place from which they were indexed. (Actually, there is a way to reconstruct a document from the information in the index, but the result is not nice, as all formatting, punctuation and capitalization are lost.)
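To illustrate why a document rebuilt from the index loses its formatting, here is a minimal sketch of a positional term index. It is a toy model only: the names `build_index` and `reconstruct` are made up for this example, and the structure does not reflect how Xapian actually stores its data.

```python
import re
from collections import defaultdict


def build_index(text):
    """Map each lowercased term to the list of positions where it occurs."""
    index = defaultdict(list)
    # Only word characters survive: punctuation and layout are discarded here.
    for position, word in enumerate(re.findall(r"\w+", text.lower())):
        index[word].append(position)
    return dict(index)


def reconstruct(index):
    """Rebuild the word sequence using nothing but the term positions."""
    by_position = {
        pos: term for term, positions in index.items() for pos in positions
    }
    return " ".join(by_position[pos] for pos in sorted(by_position))


original = "Hello, World! Recoll indexes the world."
index = build_index(original)
rebuilt = reconstruct(index)
print(rebuilt)
```

The rebuilt text contains the words in their original order, but the capitalization and punctuation of the original are gone, because they were never stored.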
Recoll stores all internal data in Unicode UTF-8 format, and it can index files of many types with different character sets, encodings, and languages into the same index. It can process documents embedded inside other documents (for example, a PDF document stored inside a Zip archive sent as an email attachment), down to an arbitrary depth.
Stemming is the process by which Recoll reduces words to their root forms so that searching does not depend, for example, on a word being singular or plural (floor, floors), or on a verb tense (flooring, floored). Because the mechanisms used for stemming depend on the specific grammatical rules for each language, there is a separate Xapian stemmer module for most common languages where stemming makes sense.
Recoll stores the unstemmed versions of terms in the main index and uses auxiliary databases for term expansion (one for each stemming language), which means that you can switch stemming languages between searches, or add a language without needing a full reindex.
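The idea of keeping unstemmed terms in the main index and expanding queries through a separate stem table can be sketched as follows. This is a simplified illustration under stated assumptions: `toy_stem` is a naive suffix stripper invented for this example (it is not a Xapian stemmer), and the table layout is hypothetical rather than Recoll's actual auxiliary database format.

```python
from collections import defaultdict

# Hypothetical, English-only toy rules; real stemmers are far more careful.
SUFFIXES = ("ing", "ed", "s")


def toy_stem(term):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if term.endswith(suffix) and len(term) - len(suffix) >= 3:
            return term[: -len(suffix)]
    return term


def build_expansion_table(index_terms):
    """Group the unstemmed index terms by their stem (one table per language)."""
    table = defaultdict(set)
    for term in index_terms:
        table[toy_stem(term)].add(term)
    return table


def expand(query_term, table):
    """Return all index terms sharing the query term's stem."""
    return sorted(table.get(toy_stem(query_term), {query_term}))


# The main index only ever holds the unstemmed terms.
index_terms = {"floor", "floors", "flooring", "floored", "door"}
table = build_expansion_table(index_terms)
print(expand("floors", table))
```

Because the table is derived from the terms already in the main index, a table for another language can be built or dropped without touching the indexed documents, which is why no full reindex is needed in this model.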
Storing documents written in different languages in the same index is possible, and commonly done. In this situation, you can specify several stemming languages for the index.
Recoll currently makes no attempt at automatic language recognition, which means that the stemmer will sometimes be applied to terms from other languages with potentially strange results. In practice, even if this introduces possibilities of confusion, this approach has proven quite useful, and it is much less cumbersome than separating your documents according to what language they are written in.
By default, Recoll strips most accents and diacritics from terms, and converts them to lower case before either storing them in the index or searching for them. As a consequence, it is impossible to search for a particular capitalization of a term (US / us), or to discriminate two terms based on diacritics (sake / saké, mate / maté).
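The effect of this default term folding can be approximated with the Python standard library. This is only a sketch: `fold_term` is a name chosen for this example, and it removes every combining mark, whereas Recoll strips most (not necessarily all) accents and diacritics using its own rules.

```python
import unicodedata


def fold_term(term):
    """Approximate default folding: drop combining marks, then lowercase."""
    # NFD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.lower()


for raw in ("US", "us", "sake", "saké", "Maté"):
    print(raw, "->", fold_term(raw))
```

After folding, "US" and "us" become the same term, as do "sake" and "saké", so an index that stores only folded terms cannot tell them apart.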
Recoll versions 1.18 and newer can optionally store the raw terms, without accent stripping or case conversion. In this configuration, default searches will behave as before, but it is possible to perform searches sensitive to case and diacritics. This is described in more detail in the section about index case and diacritics sensitivity.
Recoll has many parameters which define exactly what to index, and how to classify and decode the source documents. These are kept in configuration files. A default configuration is copied into a standard location (usually something like /usr/share/recoll/examples) during installation. The default values set by the configuration files in this directory may be overridden by values set inside your personal configuration, found by default in the .recoll sub-directory of your home directory. The default configuration will index your home directory with default parameters and should be sufficient for giving Recoll a try, but you may want to adjust it later, which can be done either by editing the text files or by using configuration menus in the recoll GUI. Some other parameters affecting only the recoll GUI are stored in the standard location defined by Qt.
The indexing process is started automatically the first time you execute the recoll GUI. Indexing can also be performed by executing the recollindex command. Recoll indexing is multithreaded by default when appropriate hardware resources are available, and can run several tasks in parallel, among text extraction, segmentation and index updates.
Searches are usually performed inside the recoll GUI, which has many options to help you find what you are looking for. However, there are other ways to perform Recoll searches: mostly a command line interface, a Python programming interface, a KDE KIO slave module, and Ubuntu Unity Lens (for older versions) or Scope (for current versions) modules.