Updated filters for Recoll

The following entries describe new and updated filters. They will be part of the next release, but can be installed on an older release if you need them.

For updated filters, you just need to copy the script to the filters directory, which is typically either /usr/share/recoll/filters or /usr/local/share/recoll/filters. Please check that the script is executable after copying it, and make it so if needed (chmod a+x scriptname).
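
The copy-and-chmod step can be sketched as a small shell function. This is only an illustration: install_filter is a hypothetical helper name, and the default directory is an assumption that you should replace with the filters directory of your own installation.

```shell
#!/bin/sh
# install_filter SCRIPT [FILTERS_DIR]
# Hypothetical helper: copy an updated filter script into the Recoll
# filters directory and make sure it is executable.
install_filter() {
    script=$1
    # Assumed default. Use /usr/local/share/recoll/filters if Recoll
    # was installed under /usr/local.
    dir=${2:-/usr/share/recoll/filters}
    [ -f "$script" ] || { echo "no such file: $script" >&2; return 1; }
    [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
    cp "$script" "$dir/" || return 1
    # Same effect as the manual "chmod a+x scriptname" step.
    chmod a+x "$dir/$(basename "$script")"
}
```

For example, install_filter ./rclpdf /usr/local/share/recoll/filters. Writing to the shared directory usually requires root privileges.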

For new filters, you'll need to copy the script file as above, possibly install the supporting application, and usually edit the mimemap, mimeview and mimeconf files, either in the shared directory (/usr[/local]/share/recoll/examples), or in your personal configuration directory ($HOME/.recoll or $RECOLL_CONFDIR).

Alternatively, you can replace your system files with these updated and complete versions: mimemap mimeconf mimeview.

There is a slightly more detailed description of the filter installation procedure on the Recoll Wiki.

The following entries are in reverse chronological order. Each lists the latest Recoll release on which the update makes sense (newer releases have an up-to-date version of the filter).

However, if you are running a Recoll version older than 1.17, you should really upgrade.

PDF documents

Fixed rclpdf filter, compatible with newer poppler pdftotext versions, which now properly escape text inside the HTML head section (but not the body, curiously).

Scribus documents

An improved rclscribus filter, thanks to Morten Langlo.

7zip archives

A new rcl7z filter by François Botha for 7zip archives. Needs the pylzma Python module.

Attachments to PDF documents (1.20 and older)

A new rclmpdf filter for processing PDF files with attachments. This replaces the old rclpdf filter. You need to add it to ~/.recoll/mimeconf until it is made standard (this is still a bit experimental, and a big change from the previous filter):


        [index]
        application/pdf = execm rclmpdf
        
Note the execm instead of exec.
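
If you prefer not to edit the file by hand, the same change can be scripted. The following is a minimal sketch, not part of Recoll: add_index_entry is a hypothetical helper, it assumes the personal mimeconf lives in $RECOLL_CONFDIR or $HOME/.recoll, and it deliberately refuses to modify a file that already mentions the MIME type anywhere.

```shell
#!/bin/sh
# add_index_entry MIMETYPE COMMAND [MIMECONF]
# Hypothetical helper: add "MIMETYPE = COMMAND" to the [index] section
# of a personal mimeconf file, creating the section if it is missing.
add_index_entry() {
    mtype=$1
    cmd=$2
    # Assumed location of the personal configuration file.
    conf=${3:-${RECOLL_CONFDIR:-$HOME/.recoll}/mimeconf}
    touch "$conf" || return 1
    # Be conservative: if the type is already mentioned, edit by hand.
    if grep -q "^[[:space:]]*$mtype[[:space:]]*=" "$conf"; then
        echo "$mtype is already present in $conf" >&2
        return 1
    fi
    if grep -q '^\[index\]' "$conf"; then
        # Insert the entry right after the first [index] header.
        awk -v line="$mtype = $cmd" '
            { print }
            /^\[index\]/ && !done { print line; done = 1 }
        ' "$conf" > "$conf.tmp" && mv "$conf.tmp" "$conf"
    else
        # No [index] section yet: append one at the end of the file.
        printf '[index]\n%s = %s\n' "$mtype" "$cmd" >> "$conf"
    fi
}
```

For the entry above, the call would be add_index_entry application/pdf "execm rclmpdf".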

Open/Libre-Office documents (1.19 and older)

rclsoff: the previous version did not produce white space between input tab-separated words, leading to search failures.

Purple logs (1.20 and older)

New rclpurple filter for Pidgin and other chat application log files. Handles newer log formats.

PowerPoint documents (1.19 and older)

The rclppt filter was based on catppt, but this seems to fail quite often on newer PPT documents. The new version is based on code from the libreoffice mso-dump project. It is both reasonably fast and quite thorough.

Installation:

EPUB documents (1.17 and older)

New rclepub filter for EPUB documents. This needs the python epub decoding module.

CHM files (1.17.1 and older)

rclchm. The previous version of the filter mishandled files which had encoded internal URLs (not very frequent, but it happens).

Updated Open Document filter (1.17 and older)

The new filter will correctly handle exported Google Docs documents and also Open/LibreOffice ones in some cases. The previous filters concatenated all the text inside the exported Google docs without any spacing...

TAR archives (1.17 and older)

New rcltar filter for tar archives. The indexing of tar archives is disabled by default in the sample configuration (stored here). This is an execm filter, not an exec one. To enable it, you'll need to add the following line to the [index] section of your $HOME/.recoll/mimeconf:
application/x-tar = execm rcltar
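
Since using exec instead of execm is an easy mistake to make, a quick check of the configuration file may help. This is a rough sketch only: uses_execm is a hypothetical helper, and the default file location ($RECOLL_CONFDIR or $HOME/.recoll) is an assumption. It succeeds only if the last entry for the given MIME type in the [index] section starts with execm.

```shell
#!/bin/sh
# uses_execm MIMETYPE [MIMECONF]
# Hypothetical helper: succeed if the [index] entry for MIMETYPE in the
# given mimeconf file uses execm (rather than exec, or nothing at all).
uses_execm() {
    mtype=$1
    # Assumed location of the personal configuration file.
    conf=${2:-${RECOLL_CONFDIR:-$HOME/.recoll}/mimeconf}
    awk -v key="$mtype" '
        # Track whether the current section is [index].
        /^\[/ { insec = ($0 == "[index]"); next }
        insec {
            split($0, kv, "=")
            k = kv[1]; gsub(/^[ \t]+|[ \t]+$/, "", k)
            # Print the value part of matching "key = value" lines.
            if (k == key) { v = $0; sub(/^[^=]*=[ \t]*/, "", v); print v }
        }
    ' "$conf" | tail -n 1 | grep -q '^execm[[:space:]]'
}
```

For example: uses_execm application/x-tar || echo "rcltar is not configured with execm".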

XML files (1.17 and older)

By default, the current Recoll version does not index XML content (except for known formats like dia, svg etc.). This new rclxml filter will extract the data from any XML file. Only text data is extracted, no attribute values. The other option is to treat XML files as plain text (see comment in mimeconf), and index everything, including a lot of garbage.

DIA files (1.16 and older)

rcldia is a new filter for Dia files, contributed by Stefan Friedel.

Okular annotations (1.16 and older)

rclokulnote. Okular lets you create annotations for PDF documents and stores them in XML format somewhere under ~/.kde. This filter does not do a nice job of formatting the data, but will at least let you find it...

Gnumeric (1.16 and older)

rclgnm. Needs xsltproc and gunzip. As .gnumeric was in the list of explicitly ignored suffixes, you can't just add the mime and indexer script lines to your local mimemap and mimeconf. You also need to define recoll_noindex in the local mimemap (to override the system one, which contains .gnumeric). The simplest approach may be to just replace the system files with those above.

Rar archive support (1.15 and older)

rclrar. This is up to date in Recoll 1.16.2 but may be added to Recoll 1.15. It needs the Python rarfile module.

Mimehtml support (1.15)

This is based on the internal mail filter. You just need to download and install the configuration files (mimemap and mimeconf). It will only work with 1.15 and later.

Konqueror webarchive (.war) filter (1.15)

rclwar

Updated zip archive filter (1.15)

The filter is corrected to handle utf-8 paths in zip archives: rclzip. It is up to date in Recoll 1.16, but may be useful with Recoll 1.15.

Updated audio tag filter (1.14)

The mutagen-based rclaudio filter delivered with Recoll 1.14.2 used a very recent mutagen interface which will only work with mutagen versions after 1.17 (probably: it at least works with 1.19, and does not with 1.15). You can download the corrected script here. Not useful with Recoll 1.5 or 1.6.