<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>RECOLL indexing performance and index sizes</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>
<body>
<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="doc.html">Documentation</a></li>
</ul>
</div>
<div class="content">
<h1>Recoll: Indexing performance and index sizes</h1>
<p>The time needed to index a given set of documents, and the
resulting index size, depend on many factors.</p>
<p>The index size depends almost entirely on the size of the
uncompressed input text, and you can expect it to be roughly
of the same order of magnitude. Depending on the type of file,
the proportion of text to file size varies very widely, going
from close to 1 for pure text files to a very small factor
for, e.g., metadata tags in mp3 files.</p>
<p>Estimating indexing time is a much more complicated issue,
depending on the type and size of input and on system
performance. There is no general way to determine what part of
the hardware should be optimized. Depending on the type of
input, performance may be bound by I/O read or write
performance, CPU single-processing speed, or combined
multi-processing speed.</p>
<p>It should be noted that Recoll performance will not be an
issue for most people. The indexer can process 1000 typical
PDF files per minute, or 500 Wikipedia HTML pages per second
on medium-range hardware, meaning that the initial indexing of
a typical dataset will need a few dozen minutes at
most. Further incremental index updates will be much faster
because most files will not need to be processed again.</p>
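<p>As a rough illustration, the rates quoted above can be turned
into an estimate for the initial indexing pass. The function name
and the sample dataset below are hypothetical, and real rates vary
widely with the data and the hardware, so the result is an order of
magnitude at best.</p>
<pre>
```python
# Rough rates quoted in the text for medium-range hardware.
PDF_FILES_PER_MINUTE = 1000
HTML_PAGES_PER_SECOND = 500


def estimate_minutes(pdf_files=0, html_pages=0):
    """Return a rough initial indexing time in minutes."""
    pdf_minutes = pdf_files / PDF_FILES_PER_MINUTE
    html_minutes = html_pages / HTML_PAGES_PER_SECOND / 60
    return pdf_minutes + html_minutes


# Hypothetical dataset: 20,000 PDF files and 300,000 HTML pages.
print(round(estimate_minutes(pdf_files=20000, html_pages=300000)))  # 30
```
</pre>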
<p>However, there are Recoll installations with
terabyte-sized datasets, on which indexing can take days. For
such operations (or even much smaller ones), it is very
important to know what kind of performance can be expected,
and what aspects of the hardware should be optimized.</p>
<p>In order to provide some reference points, I have run a
number of benchmarks on medium-sized datasets, using typical
mid-range desktop hardware, and varying the indexing
configuration parameters to show how they affect the results.</p>
<p>The following may help you check that you are getting typical
performance for your indexing, and give some indications about
what to adjust to improve it.</p>
<p>From time to time, I receive a report about a system becoming
unusable during indexing. As far as I know, with the default
Recoll configuration, and barring an exceptional issue (bug),
this is always due to a system problem (typically bad hardware
such as a disk doing retries). The tests below were mostly run
while I was using the desktop, which never became
unusable. However, some tests rendered it less responsive and
this is noted with the results.</p>
<p>The following text refers to the indexing parameters without
further explanation. More details can be found in the descriptions of the
<a href="http://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html#recoll.idxthreads.multistage">processing
model</a> and
<a href="https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.RECOLLCONF.PERFS.html">configuration
parameters</a>.</p>
<p>All tests were run without generating the stemming database or
aspell dictionary. These phases are relatively short and there
is nothing which can be optimized about them.</p>
<h2>Hardware</h2>
<p>The tests were run on what could be considered a mid-range
desktop PC:
<ul>
<li>Intel Core i7-4770T CPU: 2.5 GHz, 4 physical cores, and
hyper-threading for a total of 8 hardware threads</li>
<li>8 GBytes of RAM</li>
<li>Asus H87I-Plus motherboard, Samsung 850 EVO SSD storage</li>
</ul>
</p>
<p>This is usually a fanless PC, but I did run a fan on the
external case fins during some of the tests (esp. PDF
indexing), because the CPU was running a bit too hot.</p>
<h2>Indexing PDF files</h2>
<p>The tests were run on 18000 random PDFs harvested on
Google, with a total size of around 30 GB, using Recoll 1.22.3
and Xapian 1.2.22. The resulting index size was 1.2 GB.</p>
<h3>PDF: storage</h3>
<p>Typical PDF files have a low text to file size ratio, and a
lot of data needs to be read for indexing. With the test
configuration, the indexer needs to read around 45 MBytes/s
from multiple files. This means that input storage makes a
difference and that you need an SSD or a fast array for
optimal performance.</p>
<table border=1>
<thead>
<tr>
<th>Storage</th>
<th>idxflushmb</th>
<th>thrTCounts</th>
<th>Real Time</th>
</tr>
<tbody>
<tr>
<td>NFS drive (gigabit)</td>
<td>200</td>
<td>6/4/1</td>
<td>24m40</td>
</tr>
<tr>
<td>local SSD</td>
<td>200</td>
<td>6/4/1</td>
<td>11m40</td>
</tr>
</tbody>
</table>
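<p>The average read rate implied by the table can be checked with a
little arithmetic. The sketch below assumes that the whole dataset
(around 30 GB, counted as 1000 MB per GB) is read once per run; the
helper names are made up for the example. It gives roughly 20 MB/s
for the NFS run and 43 MB/s for the local SSD run.</p>
<pre>
```python
def parse_minutes_seconds(text):
    """Convert a table time such as '11m40' or '11m' to seconds."""
    minutes, _, seconds = text.partition("m")
    return int(minutes) * 60 + int(seconds or 0)


def read_rate_mb_per_s(total_gb, real_time):
    # Assumes 1 GB = 1000 MB; the text only says "around 30 GB".
    return total_gb * 1000 / parse_minutes_seconds(real_time)


for storage, real_time in [("NFS drive (gigabit)", "24m40"),
                           ("local SSD", "11m40")]:
    print(storage, round(read_rate_mb_per_s(30, real_time), 1), "MB/s")
```
</pre>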
<h3>PDF: threading</h3>
<p>Because PDF files are bulky and complicated to process, the
dominant step for indexing them is input processing. PDF text
extraction is performed by multiple instances of
the <i>pdftotext</i> program, and parallelisation works very
well.</p>
<p>The following table shows the indexing times with a variety
of threading parameters.</p>
<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Real Time</th>
</tr>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>19m21</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>10/10/1</td>
<td>10m38</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>100/10/1</td>
<td>11m</td>
</tr>
</tbody>
</table>
<p>10/10/1 was the best value for thrTCounts for this test. The
total CPU time was around 78 mn.</p>
<p>The last line shows the effect of a ridiculously high thread
count value for the input step: it makes little difference. Using
slightly lower values than the optimum does not have much impact
either. The only thing which really degrades performance is
configuring fewer threads than the hardware provides.</p>
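<p>The figures above also give an idea of how well the hardware is
used. The following sketch assumes that the CPU time of around 78 mn
belongs to the 10/10/1 run, and compares it with the real times from
the table. The variable names are only illustrative.</p>
<pre>
```python
def to_seconds(minutes, seconds=0):
    return minutes * 60 + seconds


cpu_seconds = to_seconds(78)     # total CPU time, "around 78 mn"
best_real = to_seconds(10, 38)   # real time with thrTCounts 10/10/1
low_real = to_seconds(19, 21)    # real time with thrTCounts 2/1/1

# Average number of hardware threads kept busy during the best run.
average_busy_threads = cpu_seconds / best_real
# Gain of the best run over the low thread count run.
speedup = low_real / best_real

print(round(average_busy_threads, 1), round(speedup, 1))  # 7.3 1.8
```
</pre>
<p>An average of about 7.3 busy threads is close to the 8 hardware
threads of the test machine, which fits the observation that
parallelisation works well for this kind of data.</p>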
<p>With the optimal parameters above, the peak recollindex
resident memory size is around 930 MB, to which we should add
ten instances of pdftotext (10 MB typical) and of the
rclpdf.py Python input handler (around 15 MB each). This means
that the total resident memory used by indexing is around 1200
MB, quite a modest value in 2016.</p>
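<p>The memory estimate is a simple sum, spelled out below. The sketch
assumes one rclpdf.py handler for each pdftotext instance.</p>
<pre>
```python
recollindex_mb = 930   # peak resident size of recollindex
pdftotext_mb = 10      # typical size of one pdftotext instance
rclpdf_mb = 15         # approximate size of one rclpdf.py handler
instances = 10         # assumption: one handler per pdftotext instance

total_mb = recollindex_mb + instances * (pdftotext_mb + rclpdf_mb)
print(total_mb)  # 1180, i.e. around 1200 MB
```
</pre>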
<h3>PDF: Xapian flushes</h3>
<p>idxflushmb has practically no influence on the indexing time
(tested from 40 to 1000), which is not too surprising because
the Xapian index size is very small relative to the input
size, so that the cost of Xapian flushes to disk is not very
significant. The value of 200 used for the threading tests
could be lowered in practice, which would decrease memory
usage and not change the indexing time significantly.</p>
<h3>PDF: conclusion</h3>
<p>For indexing PDF files, you need many cores and a fast
input storage system. Neither single-thread performance nor
amount of memory will be critical aspects.</p>
<p>Running the PDF indexing tests had no influence on the system
"feel": I could work on it just as if it were quiescent.</p>
<h2>Indexing HTML files</h2>
<p>The tests were run on an (old) French Wikipedia dump: 2.9
million HTML files stored in 42000 directories, for an
approximate total size of 41 GB (average file size
14 KB).</p>
<p>The files are stored on a local SSD. Just reading them with
find+cpio takes close to 8 mn.</p>
<p>The resulting index has a size of around 30 GB.</p>
<p>I was too lazy to extract a tar file with 3 million entries
onto a spinning disk, so all tests were performed with the data
stored on a local SSD.</p>
<p>For this test, the indexing time is dominated by the Xapian
index updates. As these are single-threaded, only the flush
interval has a real influence.</p>
<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Real Time</th>
</tr>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>88m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>91m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>1/1/1</td>
<td>96m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>120m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>121m</td>
</tr>
<tr>
<td>40</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>173m</td>
</tr>
</tbody>
</table>
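<p>To make the pattern in the table easier to see, the sketch below
groups the times by idxflushmb value and derives a files-per-second
rate from the fastest run of each group. The rows are copied from the
table; the variable names are only illustrative.</p>
<pre>
```python
from collections import defaultdict

# (idxflushmb, thrTCounts, real time in minutes), copied from the table.
rows = [
    (200, "2/1/1", 88),
    (200, "6/4/1", 91),
    (200, "1/1/1", 96),
    (100, "1/2/1", 120),
    (100, "6/4/1", 121),
    (40, "1/2/1", 173),
]
HTML_FILES = 2900000  # "2.9 million HTML files"

times_by_flush = defaultdict(list)
for flush_mb, _thread_counts, minutes in rows:
    times_by_flush[flush_mb].append(minutes)

for flush_mb in sorted(times_by_flush, reverse=True):
    times = times_by_flush[flush_mb]
    files_per_second = HTML_FILES / (min(times) * 60)
    print(flush_mb, min(times), max(times), round(files_per_second))
```
</pre>
<p>Within one idxflushmb value the times differ by a few minutes
whatever the thread counts, while going from 200 to 40 roughly
doubles the indexing time.</p>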
<p>The indexing process becomes quite big (resident size around
4GB), and the combination of high I/O load and high memory
usage makes the system less responsive at times (but not
unusable). As this happens principally when switching
applications, my guess would be that some program pages
(e.g. from the window manager and X) get flushed out, and take
time to be read back in, during which time the display appears
frozen.</p>
<p>For this kind of data, single-threaded CPU performance and
storage write speed can make a difference. Multithreading does
not help.</p>
<h2>Adjusting hardware to improve indexing performance</h2>
<p>I think that the following multi-step approach has a good
chance of improving performance:
<ul>
<li>Check that multithreading is enabled (it is, by default
with recent Recoll versions).</li>
<li>Increase the flush threshold until the machine begins to
have memory issues. Maybe add memory.</li>
<li>Store the index on an SSD. If possible, also store the
data on an SSD. Actually, when using many threads, it is
arguably even more important to have the data on an
SSD.</li>
<li>If you have many files which will need temporary copies
(email attachments, archive members, compressed files): use
a memory-backed temporary directory. Add memory.</li>
<li>More CPUs...</li>
</ul>
</p>
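<p>The flush threshold and the threading parameters mentioned in the
list are set in the Recoll configuration file. As a hedged sketch,
the following Python helper renders the three parameters used in the
tables on this page. It assumes that recoll.conf takes the list
values space-separated (the tables show them slash-separated), so
check the configuration documentation before relying on it. The
function names are hypothetical.</p>
<pre>
```python
def parse_table_value(text):
    """Convert a table entry such as '6/4/1' to a list of integers."""
    return [int(part) for part in text.split("/")]


def render_recoll_conf(idxflushmb, thr_q_sizes, thr_t_counts):
    """Return recoll.conf lines for the parameters used in the tables.

    Assumption: list values are written space-separated in recoll.conf.
    """
    lines = [
        "idxflushmb = %d" % idxflushmb,
        "thrQSizes = " + " ".join(str(n) for n in thr_q_sizes),
        "thrTCounts = " + " ".join(str(n) for n in thr_t_counts),
    ]
    return "\n".join(lines) + "\n"


print(render_recoll_conf(200,
                         parse_table_value("2/2/2"),
                         parse_table_value("6/4/1")), end="")
```
</pre>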
<p>At some point, the index updating and writing may become the
bottleneck (when this happens depends on the data mix; it comes very
quickly with HTML or text files). As far as I can tell, the only possible
approach is then to partition the index. You can query the
multiple Xapian indices either by using the Recoll external
index capability, or by actually merging them with
xapian-compact.</p>
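<p>For the merging option, xapian-compact takes one or more source
databases followed by the destination database. The sketch below only
builds and prints such a command line. The index paths are
hypothetical, and the actual call is left commented out.</p>
<pre>
```python
import subprocess


def compact_command(partitions, destination):
    """Build the argument list: sources first, destination last."""
    if not partitions:
        raise ValueError("at least one source index is needed")
    return ["xapian-compact", *partitions, destination]


# Hypothetical index locations, one per partition.
command = compact_command(
    ["/data/idx/part1/xapiandb", "/data/idx/part2/xapiandb"],
    "/data/idx/merged/xapiandb",
)
print(" ".join(command))

# Uncomment to actually run the merge (needs xapian-compact in PATH):
# subprocess.run(command, check=True)
```
</pre>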
<h5>Old benchmarks</h5>
<p>To provide a point of comparison for the evolution of
hardware and software...</p>
<p>The following very old data was obtained (around 2007?) on a
machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a
7200 RPM 160 GBytes IDE disk, running Suse 10.1.</p>
<p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
executed with the default flush threshold value.
The process memory usage is the one given by <b>ps</b>.</p>
<table border=1>
<thead>
<tr>
<th>Data</th>
<th>Data size</th>
<th>Indexing time</th>
<th>Index size</th>
<th>Peak process memory usage</th>
</tr>
<tbody>
<tr>
<td>Random pdfs harvested on Google</td>
<td>1.7 GB, 3564 files</td>
<td>27 mn</td>
<td>230 MB</td>
<td>225 MB</td>
</tr>
<tr>
<td>IETF mailing list archive</td>
<td>211 MB, 44,000 messages</td>
<td>8 mn</td>
<td>350 MB</td>
<td>90 MB</td>
</tr>
<tr>
<td>Partial Wikipedia dump</td>
<td>15 GB, one million files</td>
<td>6H30</td>
<td>10 GB</td>
<td>324 MB</td>
</tr>
<tr>
<!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
<td>Random pdfs harvested on Google<br>
Recoll 1.9, <em>idxflushmb</em> set to 10</td>
<td>1.7 GB, 3564 files</td>
<td>25 mn</td>
<td>262 MB</td>
<td>65 MB</td>
</tr>
</tbody>
</table>
<p>Notice how the index size for the mail archive is bigger than
the data size. Myriads of small pure text documents will do
this. The factor of expansion would of course be even worse with
compressed folders (the test was on uncompressed
data).</p>
<p>The last test was performed with Recoll 1.9.0 which has an
adjustable flush threshold (<em>idxflushmb</em> parameter), here
set to 10 MB. Notice the much lower peak memory usage, with no
performance degradation. The resulting index is bigger, though.
I do not know the exact reason; it is possibly due to
additional fragmentation.</p>
</div>
</body>
</html>