Recoll features
- Supported systems
- Recoll has been compiled and tested on FreeBSD, Linux, Darwin and Solaris (FreeBSD 5.5, Redhat 7.3, Fedora Core 5, Suse 10.1, Gentoo, Debian 3.1, Solaris 8/9; other reasonably close releases should work too). The source code and some precompiled packages are available for download.
- Qt versions 3.1 and later
- Document types
- Supports the following document types (along with their compressed versions):
- Natively
- text.
- html.
- OpenOffice files (needs unzip command).
- maildir and mailbox (Mozilla, Thunderbird and Evolution mail ok).
- gaim log files.
- With external helpers
- pdf with xpdf.
- postscript with ghostscript and pstotext.
- msword with antiword.
- Powerpoint and Excel with the catdoc utilities.
- rtf with unrtf.
- dvi with dvips.
- djvu with DjVuLibre.
- mp3 tags support with id3info (id3lib).
- Other features
- Multiple selectable databases.
- Powerful query facilities, with boolean searches, phrases, filter on file types and directory tree.
- Specific file name searches with wildcards.
- Support for multiple charsets. Internal processing and storage uses Unicode UTF-8.
- Stemming performed at query time (can switch stemming language after indexing).
- Easy installation. No database daemon, web server or exotic language necessary.
- An indexer which runs either as a thread inside the GUI or as an external, cron'able program.
- Stem expansion can be selectively turned off for any query term by capitalizing it (Floor).
- The stemming language (e.g. English, French...) can be selected at query time (this supposes that several stemming databases have been built, which can be configured as part of the indexing, or done later, reasonably quickly).
Stemming
Stemming is a process which transforms inflected words into their most basic form. For example, flooring, floors, floored would probably all be transformed to floor by a stemmer for the English language.
In many search engines, the stemming process occurs during indexing. The index will only contain the stemmed form of words, with exceptions for terms which are detected as being probably proper nouns (i.e. capitalized). At query time, the terms entered by the user are stemmed, then matched against the index.
This process results in a smaller index, but it has the serious drawback of irrevocably losing information during indexing.
Recoll works in a different way. No stemming is performed at indexing time, so that all information gets into the index. The resulting index is bigger, but most people probably don't care much about this nowadays, because they have a 100 GB disk 95% full of binary data which does not get indexed.
At the end of an indexing pass, Recoll builds one or several stemming dictionaries, where all word stems are listed in correspondence to the list of their derivatives.
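As a rough illustration only, the idea of a stemming dictionary can be sketched in a few lines of Python. Everything here is an assumption made for the example: `toy_stem` is a crude suffix stripper standing in for a real stemmer, and `build_stem_dictionary` and `expand_term` are hypothetical names, not Recoll functions. The sketch maps each stem to the unstemmed terms found in the index, then expands a query term through that mapping, leaving capitalized terms unexpanded.

```python
def toy_stem(word):
    """Crude stand-in for a real English stemmer (assumption for this sketch)."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(indexed_terms):
    """Map each stem to the set of unstemmed terms that reduce to it."""
    stem_dict = {}
    for term in indexed_terms:
        stem_dict.setdefault(toy_stem(term), set()).add(term)
    return stem_dict


def expand_term(term, stem_dict):
    """Expand a query term to all derivatives sharing its stem.

    A capitalized term is matched as-is (lowercased), without expansion.
    """
    if term[:1].isupper():
        return {term.lower()}
    # Fall back to the term itself when its stem is unknown.
    return set(stem_dict.get(toy_stem(term), {term}))


# The index keeps every word form; the dictionary is built afterwards.
index_terms = ["floor", "floors", "floored", "flooring", "flood"]
stems = build_stem_dictionary(index_terms)
print(sorted(expand_term("floors", stems)))
print(sorted(expand_term("Floor", stems)))
```

Because the index itself keeps every word form, a dictionary of this kind could be rebuilt, or built for another language, without reindexing the documents.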
At query time, by default, user-entered terms are stemmed, then matched against the stem database, and the query is expanded to include all derivatives. This will yield search results analogous to those obtained by a classical engine. The benefit of this approach is that stem expansion can be controlled instantly at query time in several ways: