<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Recoll updated filters</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux
based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search, desktop search, unix, linux">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="../styles/style.css">
</head>
<body>
<div class="rightlinks">
<ul>
<li><a href="../index.html">Home</a></li>
<li><a href="../download.html">Downloads</a></li>
<li><a href="../usermanual/index.html">User manual</a></li>
<li><a href="../usermanual/RCL.INSTALL.html">Installation</a></li>
<li><a href="../index.html#support">Support</a></li>
</ul>
</div>
<div class="content">
<h1>Updated filters for Recoll</h1>
<p>This page describes new and updated filters, which will be
part of the next release but can be installed on an older
release if you need them.</p>
<p>For updated filters, you just need to copy the script to the
filters directory, which is typically either <span
class="filename">/usr/share/recoll/filters</span> or <span
class="filename">/usr/local/share/recoll/filters</span>. Please check
that the script is executable after copying it, and make it so if
needed (chmod a+x <i>scriptname</i>).</p>
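<p>As a minimal sketch of the update step above, assuming that the
filters directory is <span
class="filename">/usr/share/recoll/filters</span> and that the
downloaded script is <b>rclpdf</b> in the current directory (both
are only examples, and writing to the shared directory usually
requires root privileges):</p>
<pre>
cp rclpdf /usr/share/recoll/filters/
chmod a+x /usr/share/recoll/filters/rclpdf
</pre>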
<p>For new filters, you'll need to copy the script file as
above, possibly install the supporting application, and usually
edit the
<span class="filename">mimemap</span>,
<span class="filename">mimeview</span> and
<span class="filename">mimeconf</span> files, either in the
shared directory
(<span class="filename">
/usr[/local]/share/recoll/examples</span>), or
in your personal configuration directory
(<span class="filename">$HOME/.recoll</span> or
<span class="filename">$RECOLL_CONFDIR</span>).</p>
<p>Alternatively, you can replace your system files with
these updated and complete versions:
<a href="mimemap">mimemap</a>,
<a href="mimeconf">mimeconf</a>,
<a href="mimeview">mimeview</a>.</p>
<p>There is a slightly more detailed description of the filter
installation procedure on the
<a href="http://www.recoll.org/faqsandhowtos/FilterRetrofit.html">
Recoll Wiki</a>.</p>
<p>The following entries are in reverse chronological order. Each
lists the latest Recoll release on which the update makes sense
(newer releases have an up-to-date version of the filter).</p>
<p>However, if you are running a Recoll version older than 1.17,
you should really upgrade.</p>
<h2>PDF documents</h2>
<p>Fixed <a href="rclpdf">rclpdf</a> filter, compatible with
newer poppler pdftotext versions, which now properly escape
text inside the html &lt;head&gt; section (but not the body,
curiously).</p>
<h2>Scribus documents</h2>
<p>An improved <a href="rclscribus">rclscribus</a> filter,
thanks to Morten Langlo.</p>
<h2>7zip archives</h2>
<p>A new <a href="rcl7z">rcl7z</a> filter by Fran&ccedil;ois Botha
for 7zip archives. Needs the
<a href="https://pypi.python.org/pypi/pylzma">pylzma Python
module</a>. </p>
<h2>Attachments to PDF documents (1.20 and older)</h2>
<p>A new <a href="rclmpdf">rclmpdf</a> filter for processing
PDF files with attachments. This replaces the old <b>rclpdf</b>
filter. You need to add it to ~/.recoll/mimeconf until it is
made standard (this is still a bit experimental, and a big
change from the previous filter):
<pre><tt>
[index]
application/pdf = execm rclmpdf
</tt></pre>
Note the <tt>execm</tt> instead of <tt>exec</tt>. </p>
<h2><a name="soff1">Open/Libre-Office documents (1.19 and older)</a></h2>
<p><a href="rclsoff">rclsoff</a>: the previous version did not
produce white space between input tab-separated words, leading
to search failures.</p>
<h2>Purple logs (1.20 and older)</h2>
<p>New <a href="rclpurple">rclpurple</a> filter for log files
from Pidgin and other chat applications. Handles newer log
formats.</p>
<h2>PowerPoint documents (1.19 and older)</h2>
<p>The <b>rclppt</b> filter was based on <b>catppt</b>, but this
seems to fail quite often on newer PPT
documents. The new version is based on code from
the <b>libreoffice</b> <b>mso-dump</b> project. It is both
reasonably fast and quite thorough.
</p>
<p>Installation:<ul>
<li>As <tt>recollindex</tt> was executing <b>catppt</b>
directly in the default configuration, you will also need to add
the following to
the <tt>mimeconf</tt> file (e.g.: ~/.recoll/mimeconf):
<pre>
[index]
application/vnd.ms-powerpoint = exec rclppt
</pre>
</li>
<li>Copy the following 3 files to the Recoll filters directory (e.g.:
<i>/usr/share/recoll/filters</i>) and make sure
that <tt>ppt-dump.py</tt> and <tt>rclppt</tt> are executable.
<ul>
<li><a href="rclppt">rclppt</a></li>
<li><a href="ppt-dump.py">ppt-dump.py</a></li>
<li><a href="msodump.zip">msodump.zip</a></li>
</ul>
</li>
</ul>
</p>
<h2>EPUB documents (1.17 and older)</h2>
<p>New <a href="rclepub">rclepub</a> filter for EPUB documents.
This needs
the <a href="http://pypi.python.org/pypi/epub/0.5.0">
python epub decoding module</a>. </p>
<h2>CHM files (1.17.1 and older)</h2>
<p><a href="rclchm">rclchm</a>. The previous version of the
filter mishandled files which had encoded internal URLs (not
very frequent, but it happens).</p>
<h2>Updated Open Document filter (1.17 and older)</h2>
<p>The <a href="rclsoff">new filter</a> will correctly handle
exported Google Docs documents and also Open/LibreOffice ones in
some cases. The previous filters concatenated all the text
inside the exported Google docs without any spacing...</p>
<h2>TAR archives (1.17 and older)</h2>
<p>New <a href="rcltar">rcltar</a> filter for tar archives. The
indexing of tar archives is disabled by default in the sample
configuration (stored here). This is an <tt>execm</tt>
filter, not an <tt>exec</tt> one. To enable it, add the following
line to the [index] section of your
$HOME/.recoll/mimeconf:<br>
<tt>application/x-tar = execm rcltar</tt></p>
<h2>XML files (1.17 and older)</h2>
<p>By default, the current recoll version does not index xml
content (except for known formats like dia, svg etc.). This
new <a href="rclxml">rclxml</a> filter will extract the data
from any xml file. Only text data is extracted, no attribute
values. The other option is to treat xml files as plain text
(see the comment in mimeconf) and index everything, including
a lot of garbage.</p>
<h2>DIA files (1.16 and older)</h2>
<p><a href="rcldia">rcldia</a> is a new filter
for <a href="http://projects.gnome.org/dia/">Dia</a> files,
contributed by Stefan Friedel.</p>
<h2>Okular annotations (1.16 and older)</h2>
<p><a href="rclokulnote">rclokulnote</a>. Okular lets you create
annotations for PDF documents and stores them in xml format
somewhere under ~/.kde. This filter does not do a nice job of
formatting the data, but will at least let you find it...</p>
<h2>Gnumeric (1.16 and older)</h2>
<p><a href="rclgnm">rclgnm</a>. Needs xsltproc and
gunzip. As <tt>.gnumeric</tt> was in the list of
explicitly ignored suffixes, you can't just add the mime
and indexer script lines to your local mimemap and mimeconf: you
also need to define recoll_noindex in the local mimemap (to
override the system one which
contains <tt>.gnumeric</tt>). The simplest approach may be to
just replace the system files with those above.</p>
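<p>If you prefer local overrides, the changes described above could
look roughly like the sketch below. The
<tt>application/x-gnumeric</tt> MIME type is an assumption here:
check the updated mimemap and mimeconf files linked at the top of
this page for the actual values. The recoll_noindex value is
deliberately left out: copy it from your system mimemap and remove
<tt>.gnumeric</tt> from the list.</p>
<pre>
# ~/.recoll/mimemap (MIME type assumed, see above)
.gnumeric = application/x-gnumeric
# Add a recoll_noindex line copied from the system mimemap,
# with .gnumeric removed from it

# ~/.recoll/mimeconf
[index]
application/x-gnumeric = exec rclgnm
</pre>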
<h2>Rar archive support (1.15 and older)</h2>
<p><a href="rclrar">rclrar</a>. This is up to date in Recoll
1.16.2 but may be added to Recoll 1.15. It needs the Python
rarfile module. </p>
<h2>Mimehtml support (1.15)</h2>
<p>This is based on the internal mail filter, so you just need to
download and install the configuration files (mimemap and
mimeconf). It will only work with 1.15 and later.</p>
<h2>Konqueror webarchive (.war) filter (1.15)</h2>
<p><a href="rclwar">rclwar</a></p>
<h2>Updated zip archive filter (1.15)</h2>
<p>The filter is corrected to handle utf-8 paths in zip archives:
<a href="rclzip">rclzip</a>. It is up to date in Recoll 1.16, but
may be useful with Recoll 1.15.</p>
<h2>Updated audio tag filter (1.14)</h2>
<p>The mutagen-based rclaudio filter delivered with recoll 1.14.2
used a very recent mutagen interface which will only work with
mutagen versions after 1.17 (probably: it at least works with 1.19,
and doesn't with 1.15).
You can download the <a href="rclaudio">corrected script
here</a>. Not useful with Recoll 1.5 or 1.6.
</p>
</div>
</body>
</html>