Recoll features
- Easy installation, few dependencies. No database daemon, web server, desktop environment or exotic language necessary.
- Will run on most Unix-based systems
- Qt 4 GUI, plus command line, KIO and krunner interfaces.
- Searches most common document types, emails and their attachments. Transparently handles decompression (gzip, bzip2).
- Powerful query facilities, with boolean searches, phrases, proximity, wildcards, filter on file types and directory tree.
- Multi-language and multi-character set with Unicode based internals.
- Extensive documentation, with a complete user manual and manual pages for each command.
Supported systems
Recoll has been compiled and tested on Linux, Darwin and Solaris (initial versions Redhat 7, Fedora Core 5, Suse 10, Gentoo, Debian 3.1, Solaris 8). It should compile and run on all subsequent releases of these systems and probably a few others too.
Recoll works with Qt versions from 3.1 to 4.7.
Document types
Recoll can index many document types (along with their compressed versions). Some types are handled internally (no external application needed). Other types need a separate application to be installed to extract the text. Types that only need very common utilities (awk/sed/groff etc.) are listed in the native section.
File types indexed natively
- text.
- html.
- maildir and mailbox (Mozilla, Thunderbird and Evolution mail ok).
- gaim and purple log files.
- Lyx files (needs Lyx to be installed).
- Scribus files.
- Man pages (need groff).
File types indexed with external helpers
Many document types need the iconv command in addition to the applications specifically listed.
The XML ones
The following types need xsltproc from the libxslt package. Quite a few also need unzip:
- Abiword files.
- Fb2 ebooks.
- Kword files.
- Microsoft Office Open XML files.
- OpenOffice files.
- SVG files.
- Gnumeric files.
- Okular annotations files.
Other formats
- pdf with the pdftotext command, which can be installed as part of xpdf or poppler, depending on your distribution.
- msword with antiword. It is also useful to have wvWare installed, as it may be used as a fallback for some files which antiword does not handle.
- Powerpoint and Excel with the catdoc utilities.
- CHM (Microsoft help) files with Python, pychm and chmlib.
- GNU info files with Python and the info command.
- Zip archives (needs Python).
- Rar archives (needs Python, the rarfile Python module and the unrar utility).
- iCalendar (.ics) files (needs Python, icalendar).
- Mozilla calendar data. See the wiki about this.
- Wordperfect with the wpd2html command from libwpd. On some distributions, the command may come with a package named libwpd-tools or similar, not the base libwpd package.
- postscript with ghostscript and pstotext. Pstotext 1.9 has a serious issue with special characters in file names, and you should either use the version packaged for your system, which is probably patched, or apply the Debian patch which is stored here for convenience. See http://packages.debian.org/squeeze/pstotext and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=356988 for references/explanations.
To make things a bit easier, I also store an already patched version. I added an install target to the Makefile... This installs to /usr/local, use make install PREFIX=/usr to change. So all you need is:
tar xvzf pstotext-1.9-patched.tar.gz
cd pstotext-1.9-patched
make
make install
- RTF files with unrtf. Please note that up to version 0.21, unrtf mostly does not work with non-Western-European character sets. If you need to index, e.g., Russian or Chinese RTF files, I have produced a modified version which works much better (as indicated by my tests and a few external ones). You can download the source here. The development is hosted on bitbucket.org.
- TeX with untex. If there is no untex package for your distribution, a source package is stored on this site (as untex has no obvious home). Will also work with detex if this is installed.
- dvi with dvips.
- djvu with DjVuLibre.
- Audio file tags: Recoll releases 1.13 and older use id3info (id3lib) (compiling id3lib on recent systems may need a small patch, see here) or the ogg and flac tools. Recoll releases 1.14 and later use a Python filter based on mutagen for all audio types.
- Image file tags with exiftool. This is a perl program, so you also need perl on the system. This works with about any possible image file and tag format (jpg, png, tiff, gif etc.).
- Midi karaoke files with Python, the midi module, and some help from chardet. There is probably a chardet package for your distribution, but you will quite probably need to build the midi package. This is easy, but see the notes here.
- Konqueror webarchive format with Python (uses the tarfile module).
- mimehtml web archive format (support based on the mail filter, which introduces some mild weirdness, but still usable).
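Since many of the types above depend on an external command being installed, it can be handy to check which ones are present before indexing. The following is only a minimal sketch, not part of Recoll: the HELPERS table and the missing_helpers function are hypothetical names, the table covers just a few of the commands named in the lists above, and the check is simply a PATH lookup with Python's shutil.which.

```python
import shutil

# Hypothetical table: a few of the helper commands named in the lists
# above, keyed by the document type they serve. Not a complete list.
HELPERS = {
    "pdf": ["pdftotext"],
    "msword": ["antiword"],
    "rtf": ["unrtf"],
    "dvi": ["dvips"],
    "wordperfect": ["wpd2html"],
    "postscript": ["pstotext"],
    "image tags": ["exiftool"],
    "xml formats": ["xsltproc", "unzip"],
}


def missing_helpers(helpers=HELPERS, which=shutil.which):
    """Return {document type: [commands not found on PATH]}."""
    missing = {}
    for doctype, commands in helpers.items():
        absent = [cmd for cmd in commands if which(cmd) is None]
        if absent:
            missing[doctype] = absent
    return missing


if __name__ == "__main__":
    for doctype, commands in sorted(missing_helpers().items()):
        print(f"{doctype}: missing {', '.join(commands)}")
```

Run as a script, this prints one line per document type that has at least one helper missing. It says nothing about whether the installed version of a helper is a suitable one (for example a patched pstotext).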
Other features
- Can use Beagle browser plug-ins to index web history. See the Wiki for more detail.
- Processes all email attachments, and more generally any realistic level of container nesting (the "msword attachment to a message inside a mailbox in a zip" thingy...).
- Multiple selectable databases.
- Powerful query facilities, with boolean searches, phrases, filter on file types and directory tree.
- Xesam-compatible query language.
- Wildcard searches (with a specific and faster function for file names).
- Support for multiple charsets. Internal processing and storage uses Unicode UTF-8.
- Stemming performed at query time (can switch stemming language after indexing).
- Easy installation. No database daemon, web server or exotic language necessary.
- An indexer which runs either as a thread inside the GUI, as an external, batch, cron'able program, or as a real-time indexing daemon.
Desktop and web integration
The Recoll GUI has many features that help to specify an efficient search and to manage the results. However, it may sometimes be preferable to use a simpler tool that is better integrated with your desktop interfaces. Several solutions exist, at the moment mostly for the KDE desktop:
- The Recoll KIO module allows starting queries and viewing results from the Konqueror browser or from the Open dialogs of KDE applications.
- The recollrunner krunner module allows integrating Recoll search results into a krunner query.
Recoll also has Python and PHP modules which can allow easy integration with web or other applications.
Stemming
Stemming is a process which transforms inflected words into their most basic form. For example, flooring, floors, floored would probably all be transformed to floor by a stemmer for the English language.
In many search engines, the stemming process occurs during indexing. The index will only contain the stemmed form of words, with exceptions for terms which are detected as being probably proper nouns (i.e., capitalized). At query time, the terms entered by the user are stemmed, then matched against the index.
This process results in a smaller index, but it has the serious drawback of irrevocably losing information during indexing.
Recoll works in a different way. No stemming is performed at indexing time, so that all information gets into the index. The resulting index is bigger, but most people probably don't care much about this nowadays, because they have a 100 GB disk 95% full of binary data which does not get indexed.
At the end of an indexing pass, Recoll builds one or several stemming dictionaries, where all word stems are listed in correspondence to the list of their derivatives.
At query time, by default, user-entered terms are stemmed, then matched against the stem database, and the query is expanded to include all derivatives. This will yield search results analogous to those obtained by a classical engine. The benefit of this approach is that stem expansion can be controlled instantly at query time in several ways:
- It can be selectively turned off for any query term by capitalizing it (Floor).
- The stemming language (e.g., English, French...) can be selected (this supposes that several stemming databases have been built, which can be configured as part of the indexing, or done later, in a reasonably fast way).
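The query-time expansion described above can be illustrated with a small sketch. This is not Recoll's code: toy_stem is a deliberately crude suffix stripper standing in for a real English stemmer, and build_stem_db and expand_term are hypothetical names. It assumes that a capitalized term is matched in lowercase without expansion, and it handles a single stemming language only.

```python
from collections import defaultdict


def toy_stem(word):
    """Very crude English suffix stripper, for illustration only."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_stem_db(index_terms, stem=toy_stem):
    """Map each stem to the set of unstemmed index terms that reduce to it."""
    stem_db = defaultdict(set)
    for term in index_terms:
        stem_db[stem(term)].add(term)
    return stem_db


def expand_term(term, stem_db, stem=toy_stem):
    """Expand one user-entered term into the index terms to search for."""
    if term[:1].isupper():
        # Capitalized term: stem expansion is turned off for it.
        return {term.lower()}
    # Otherwise stem the term and pull in every derivative of that stem.
    return set(stem_db.get(stem(term), ())) | {term.lower()}


# The index keeps the unstemmed words; the stem database is built afterwards.
index_terms = ["floor", "floors", "floored", "flooring", "flood"]
stem_db = build_stem_db(index_terms)
print(sorted(expand_term("floors", stem_db)))
print(sorted(expand_term("Floor", stem_db)))
```

Because the index itself holds the unstemmed words, a stem database for another language can be rebuilt from the same index terms with a different stem function, without re-reading the documents.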