Recoll features

Supported systems
Recoll has been compiled and tested on FreeBSD, Linux, Darwin and Solaris (FreeBSD 5-7, Redhat 7/8/9, Fedora Core 5-13, Suse 10/11, Gentoo, Debian 3.1, Solaris 8/9/10). Other releases not too distant from these should be ok too.
Supported Qt versions: 3.1 to 4.5.
Document types
Recoll can index many document types (along with their compressed versions). Some types are handled internally (no external application needed). Other types need an application to be installed to extract the text. Types that only need very common utilities (awk/sed/groff etc.) are listed in the native section.
Natively
  • text.
  • html.
  • maildir and mailbox (Mozilla, Thunderbird and Evolution mail ok).
  • OpenOffice files (needs unzip command).
  • Abiword files.
  • Kword files.
  • gaim and purple log files.
  • Lyx files (needs Lyx to be installed).
  • Scribus files.
  • Man pages (need groff).
With external helpers
In addition to the applications listed below, many document types need the iconv command.
  • Microsoft Office Open XML files with the unzip and xsltproc commands.
  • pdf with the pdftotext command, which can be installed as part of xpdf or poppler, depending on your distribution.
  • msword with antiword.
  • Powerpoint and Excel with the catdoc utilities.
  • CHM (Microsoft help) files (needs Python, pychm or chmlib).
  • Zip archives (needs Python).
  • iCalendar(.ics) files (needs Python, icalendar).
  • Mozilla calendar data. See the wiki about this.
  • Wordperfect with libwpd.
  • postscript with ghostscript and pstotext. Note that pstotext 1.9 has a problem with file names using special shell characters. You should either use the version packaged for your system, which is probably patched, or apply the Debian patch, a copy of which is stored on this site for convenience. See http://packages.debian.org/squeeze/pstotext and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988 for references and explanations.
  • rtf with unrtf.
  • TeX with untex. If there is no untex package for your distribution, a source package is stored on this site (as untex has no obvious home). detex will also work if it is installed.
  • dvi with dvips.
  • djvu with DjVuLibre.
  • mp3/flac/ogg vorbis tag support with id3info (id3lib) or the ogg and flac tools. Compiling id3lib on recent systems may need a small patch. Release 1.14 and later use a Python filter based on mutagen for all audio tags.
  • Image file tag support with exiftool. This is a Perl program, so you also need Perl on the system. This works with just about any image file and tag format (jpg, png, tiff, gif etc.).
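Since several document types depend on external commands, it can help to check which ones are present before indexing. The following is a minimal sketch, not part of Recoll: the function name missing_helpers and the HELPERS table are hypothetical, and the table only covers some of the commands named in the list above.

```python
import shutil

# Illustrative subset of the helper commands named in the list above,
# keyed by the document type they serve. This table is an assumption
# made for the example, not a Recoll configuration format.
HELPERS = {
    "pdf": ["pdftotext"],
    "msword": ["antiword"],
    "Office Open XML": ["unzip", "xsltproc"],
    "postscript": ["pstotext"],
    "rtf": ["unrtf"],
    "dvi": ["dvips"],
    "image tags": ["exiftool"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return {doc_type: [commands not found on PATH]} for incomplete types."""
    missing = {}
    for doc_type, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[doc_type] = absent
    return missing


if __name__ == "__main__":
    for doc_type, commands in missing_helpers().items():
        print(f"{doc_type}: missing {', '.join(commands)}")
```

The which parameter defaults to shutil.which and exists only so the lookup can be replaced in tests.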
Other features
  • Can use Beagle browser plug-ins to index web history. See the Wiki for more detail.
  • Processes all email attachments.
  • Multiple selectable databases.
  • Powerful query facilities, with boolean searches, phrases, and filtering on file types and directory tree.
  • Xesam-compatible query language.
  • Wildcard searches (with a specific and faster function for file names).
  • Support for multiple charsets. Internal processing and storage uses Unicode UTF-8.
  • Stemming performed at query time (can switch stemming language after indexing).
  • Easy installation. No database daemon, web server or exotic language necessary.
  • An indexer which runs either as a thread inside the GUI, as an external, batch, cron'able program, or as a real-time indexing daemon.

Stemming

Stemming is a process which transforms inflected words into their most basic form. For example, flooring, floors, floored would probably all be transformed to floor by a stemmer for the English language.

In many search engines, the stemming process occurs during indexing. The index will only contain the stemmed form of words, with exceptions for terms which are detected as being probably proper nouns (i.e. capitalized). At query time, the terms entered by the user are stemmed, then matched against the index.

This process results in a smaller index, but it has the serious drawback of irrevocably losing information during indexing.

Recoll works in a different way. No stemming is performed at indexing time, so that all information gets into the index. The resulting index is bigger, but most people probably don't care much about this nowadays, because they have a 100 GB disk 95% full of binary data which does not get indexed.

At the end of an indexing pass, Recoll builds one or several stemming dictionaries, where all word stems are listed in correspondence to the list of their derivatives.
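
The idea of a stemming dictionary can be sketched as follows. This is an illustration only, not Recoll's actual code or data format: toy_stem is a hypothetical stand-in for a real English stemmer, and build_stem_dictionary and expand_term are names invented for the example.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_dictionary(index_terms, stem=toy_stem):
    """Map each stem to the set of unstemmed index terms that reduce to it."""
    stems = defaultdict(set)
    for term in index_terms:
        stems[stem(term)].add(term)
    return dict(stems)


def expand_term(term, stem_dictionary, stem=toy_stem, use_stemming=True):
    """Return the terms to search for: the term itself plus its derivatives."""
    if not use_stemming:
        return {term}
    return {term} | stem_dictionary.get(stem(term), set())


# The index itself keeps every word form; the dictionary is built afterwards.
index_terms = ["floor", "floors", "floored", "flooring", "door"]
stem_dictionary = build_stem_dictionary(index_terms)
print(sorted(expand_term("floors", stem_dictionary)))
# prints ['floor', 'floored', 'flooring', 'floors']
```

Because the index terms are stored unstemmed, switching expansion off (use_stemming=False here) or rebuilding the dictionary with a stemmer for another language does not require reindexing the documents.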

At query time, by default, user-entered terms are stemmed, then matched against the stem database, and the query is expanded to include all derivatives. This will yield search results analogous to those obtained by a classical engine. The benefit of this approach is that stem expansion can be controlled instantly at query time in several ways: