
a/website/perfs.html b/website/perfs.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
  <head>
    <title>RECOLL indexing performance and index sizes</title>
    <meta name="generator" content="HTML Tidy, see www.w3.org">
    <meta name="Author" content="Jean-Francois Dockes">
    <meta name="Description" content=
    "recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
    <meta name="Keywords" content=
...
    <div class="content">

      <h1>Recoll: Indexing performance and index sizes</h1>

      <p>The time needed to index a given set of documents, and the
        resulting index size, depend on many factors.</p>

      <p>The index size depends almost only on the size of the
        uncompressed input text, and you can expect it to be roughly
        of the same order of magnitude. Depending on the type of file,
        the proportion of text to file size varies very widely, going
        from close to 1 for pure text files to a very small factor
        for, e.g., metadata tags in mp3 files.</p>

      <p>Estimating indexing time is a much more complicated issue,
        depending on the type and size of the input and on system
        performance. There is no general way to determine what part of
        the hardware should be optimized. Depending on the type of
        input, performance may be bound by I/O read or write
        performance, CPU single-thread speed, or combined
        multi-processing speed.</p>

      <p>It should be noted that Recoll performance will not be an
        issue for most people. The indexer can process 1000 typical
        PDF files per minute, or 500 Wikipedia HTML pages per second
        on medium-range hardware, meaning that the initial indexing of
        a typical dataset will need a few dozen minutes at
        most. Further incremental index updates will be much faster
        because most files will not need to be processed again.</p>

      <p>However, there are Recoll installations with
        terabyte-sized datasets, on which indexing can take days. For
        such operations (or even much smaller ones), it is very
        important to know what kind of performance can be expected,
        and what aspects of the hardware should be optimized.</p>

      <p>In order to provide some reference points, I have run a
        number of benchmarks on medium-sized datasets, using typical
        mid-range desktop hardware, and varying the indexing
        configuration parameters to show how they affect the results.</p>

      <p>The following may help you check that you are getting typical
        performance for your indexing, and give some indications about
        what to adjust to improve it.</p>

      <p>From time to time, I receive a report about a system becoming
        unusable during indexing. As far as I know, with the default
        Recoll configuration, and barring an exceptional issue (a bug),
        this is always due to a system problem (typically bad hardware
        such as a disk doing retries). The tests below were mostly run
        while I was using the desktop, which never became
        unusable. However, some tests rendered it less responsive, and
        this is noted with the results.</p>

      <p>The following text refers to the indexing parameters without
        further explanation. See these links for more detail about the
        <a href="http://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html#recoll.idxthreads.multistage">processing
        model</a> and the
        <a href="https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.RECOLLCONF.PERFS.html">configuration
          parameters</a>.</p>
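
      <p>For reference, these parameters are set in the index
        configuration file (<i>recoll.conf</i>). A minimal sketch
        using values which appear in the tests below (list values are
        space-separated):</p>

      <pre>
# recoll.conf excerpt (sketch)
idxflushmb = 200     # megabytes of indexed text between index flushes
thrQSizes = 2 2 2    # task queue depths for the three pipeline stages
thrTCounts = 6 4 1   # worker thread counts for each stage
      </pre>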
      <p>All tests were run without generating the stemming database or
        aspell dictionary. These phases are relatively short and there
        is nothing which can be optimized about them.</p>

      <h2>Hardware</h2>

      <p>The tests were run on what could be considered a mid-range
        desktop PC:
        <ul>
          <li>Intel Core i7-4770T CPU: 2.5 GHz, 4 physical cores, and
            hyper-threading for a total of 8 hardware threads</li>
          <li>8 GBytes of RAM</li>
          <li>Asus H87I-Plus motherboard, Samsung 850 EVO SSD storage</li>
        </ul>
      </p>

      <p>This is usually a fanless PC, but I did run a fan on the
        external case fins during some of the tests (esp. PDF
        indexing), because the CPU was running a bit too hot.</p>


      <h2>Indexing PDF files</h2>

      <p>The tests were run on 18000 random PDFs harvested on
        Google, with a total size of around 30 GB, using Recoll 1.22.3
        and Xapian 1.2.22. The resulting index size was 1.2 GB.</p>

      <h3>PDF: storage</h3>

      <p>Typical PDF files have a low text to file size ratio, and a
        lot of data needs to be read for indexing. With the test
        configuration, the indexer needs to read around 45 MBytes/s
        from multiple files. This means that input storage makes a
        difference and that you need an SSD or a fast array for
        optimal performance.</p>

      <table border=1>
        <thead>
          <tr>
            <th>Storage</th>
            <th>idxflushmb</th>
            <th>thrTCounts</th>
            <th>Real time</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>NFS drive (gigabit)</td>
            <td>200</td>
            <td>6/4/1</td>
            <td>24m40</td>
          </tr>
          <tr>
            <td>local SSD</td>
            <td>200</td>
            <td>6/4/1</td>
            <td>11m40</td>
          </tr>
        </tbody>
      </table>
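
      <p>(A quick cross-check of the read rate quoted above: the 30 GB
        dataset read in 11m40, about 700 seconds, works out to roughly
        30000 / 700 = 43 MBytes/s.)</p>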


      <h3>PDF: threading</h3>

      <p>Because PDF files are bulky and complicated to process, the
        dominant step in indexing them is input processing. PDF text
        extraction is performed by multiple instances of
        the <i>pdftotext</i> program, and parallelisation works very
        well.</p>

      <p>The following table shows the indexing times with a variety
        of threading parameters.</p>

      <table border=1>
        <thead>
          <tr>
            <th>idxflushmb</th>
            <th>thrQSizes</th>
            <th>thrTCounts</th>
            <th>Time R/U/S</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>200</td>
            <td>2/2/2</td>
            <td>2/1/1</td>
            <td>19m21</td>
          </tr>
          <tr>
            <td>200</td>
            <td>2/2/2</td>
            <td>10/10/1</td>
            <td>10m38</td>
          </tr>
          <tr>
            <td>200</td>
            <td>2/2/2</td>
            <td>100/10/1</td>
            <td>11m</td>
          </tr>
        </tbody>
      </table>

      <p>10/10/1 was the best value for thrTCounts in this test. The
        total CPU time was around 78 mn.</p>

      <p>The last line shows the effect of a ridiculously high thread
        count value for the input step, which is not much. Using
        slightly lower values than the optimum does not have much impact
        either. The only thing which really degrades performance is
        configuring fewer threads than the hardware supports.</p>
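
      <p>For reproducing such runs, each configuration can live in its
        own directory; a sketch of the commands (not the exact ones
        used for the tests):</p>

      <pre>
# Rebuild the index from scratch using a specific configuration directory
recollindex -c ~/bench-config -z
      </pre>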

      <p>With the optimal parameters above, the peak recollindex
        resident memory size is around 930 MB, to which we should add
        ten instances of pdftotext (10 MB each, typically) and of the
        rclpdf.py Python input handler (around 15 MB each). This means
        that the total resident memory used by indexing is around 1200
        MB, quite a modest value in 2016.</p>


      <h3>PDF: Xapian flushes</h3>

      <p>idxflushmb has practically no influence on the indexing time
        (tested from 40 to 1000), which is not too surprising because
        the Xapian index size is very small relative to the input
        size, so the cost of Xapian flushes to disk is not very
        significant. The value of 200 used for the threading tests
        could be lowered in practice, which would decrease memory
        usage without significantly changing the indexing time.</p>

      <h3>PDF: conclusion</h3>

      <p>For indexing PDF files, you need many cores and a fast
        input storage system. Neither single-thread performance nor
        the amount of memory will be a critical aspect.</p>

      <p>Running the PDF indexing tests had no influence on the system
        "feel": I could work on it just as if it were quiescent.</p>


      <h2>Indexing HTML files</h2>

      <p>The tests were run on an (old) French Wikipedia dump: 2.9
        million HTML files stored in 42000 directories, for an
        approximate total size of 41 GB (average file size
        14 KB).</p>

      <p>The files are stored on a local SSD. Just reading them with
        find+cpio takes close to 8 mn.</p>
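
        <p>The exact command is not recorded; something like the
          following would measure the raw read time (a hypothetical
          reconstruction):</p>

        <pre>
# Walk the tree and read every file, discarding the archive output
time find . -type f | cpio -o > /dev/null
        </pre>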

        <p>The resulting index has a size of around 30 GB.</p>

        <p>I was too lazy to extract the 3-million-entry tar file onto a
          spinning disk, so all tests were performed with the data
          stored on a local SSD.</p>

        <p>For this test, the indexing time is dominated by the Xapian
          index updates. As these are single-threaded, only the flush
          interval has a real influence.</p>

      <table border=1>
        <thead>
          <tr>
            <th>idxflushmb</th>
            <th>thrQSizes</th>
            <th>thrTCounts</th>
            <th>Time R/U/S</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>200</td>
            <td>2/2/2</td>
            <td>2/1/1</td>
            <td>88m</td>
          </tr>
          <tr>
            <td>200</td>
            <td>2/2/2</td>
            <td>6/4/1</td>
            <td>91m</td>
          </tr>
          <tr>
            <td>200</td>
            <td>2/2/2</td>
            <td>1/1/1</td>
            <td>96m</td>
          </tr>
          <tr>
            <td>100</td>
            <td>2/2/2</td>
            <td>1/2/1</td>
            <td>120m</td>
          </tr>
          <tr>
            <td>100</td>
            <td>2/2/2</td>
            <td>6/4/1</td>
            <td>121m</td>
          </tr>
          <tr>
            <td>40</td>
            <td>2/2/2</td>
            <td>1/2/1</td>
            <td>173m</td>
          </tr>
        </tbody>
      </table>
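
      <p>(For reference: 2.9 million files in 88 minutes, about 5300
        seconds, is roughly 550 files per second, consistent with the
        rate quoted in the introduction.)</p>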


      <p>The indexing process becomes quite big (resident size around
        4 GB), and the combination of high I/O load and high memory
        usage makes the system less responsive at times (but not
        unusable). As this happens principally when switching
        applications, my guess would be that some program pages
        (e.g. from the window manager and X) get flushed out, and take
        time to be read back in, during which the display appears
        frozen.</p>

      <p>For this kind of data, single-threaded CPU performance and
        storage write speed can make a difference. Multithreading does
        not help.</p>

      <h2>Adjusting hardware to improve indexing performance</h2>

      <p>I think that the following multi-step approach has a good
        chance to improve performance:
        <ul>
          <li>Check that multithreading is enabled (it is, by default
            with recent Recoll versions).</li>
          <li>Increase the flush threshold until the machine begins to
            have memory issues. Maybe add memory.</li>
...
            a memory temporary directory. Add memory.</li>
          <li>More CPUs...</li>
        </ul>
      </p>

      <p>At some point, the index updating and writing may become the
        bottleneck (this depends on the data mix; it happens very
        quickly with HTML or plain text files). As far as I can see,
        the only possible approach is then to partition the index. You
        can query the multiple Xapian indices either by using the
        Recoll external index capability, or by actually merging the
        indexes with xapian-compact.</p>
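
      <p>A merge could look like the following (the partition paths
        are hypothetical; xapian-compact takes a list of source
        databases followed by the destination):</p>

      <pre>
# Merge two partition indexes into a single database (sketch)
xapian-compact /data/idx1/xapiandb /data/idx2/xapiandb /data/merged/xapiandb
      </pre>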


      <h5>Old benchmarks</h5>

      <p>To provide a point of comparison for the evolution of
        hardware and software...</p>

      <p>The following very old data was obtained (around 2007?) on a
        machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a
        7200 RPM 160 GB IDE disk, running Suse 10.1.</p>

      <p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) was
        executed with the default flush threshold value.
        The process memory usage is the one given by <b>ps</b>.</p>

      <table border=1>
        <thead>
          <tr>
            <th>Data</th>
            <th>Data size</th>
            <th>Indexing time</th>
            <th>Index size</th>
            <th>Peak process memory usage</th>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td>Random pdfs harvested on Google</td>
            <td>1.7 GB, 3564 files</td>
            <td>27 mn</td>
            <td>230 MB</td>
            <td>225 MB</td>
          </tr>
          <tr>
            <td>Ietf mailing list archive</td>
            <td>211 MB, 44,000 messages</td>
            <td>8 mn</td>
            <td>350 MB</td>
            <td>90 MB</td>
          </tr>
          <tr>
            <td>Partial Wikipedia dump</td>
            <td>15 GB, one million files</td>
            <td>6H30</td>
            <td>10 GB</td>
            <td>324 MB</td>
          </tr>
          <tr>
            <!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
            <td>Random pdfs harvested on Google<br>
              Recoll 1.9, <em>idxflushmb</em> set to 10</td>
            <td>1.7 GB, 3564 files</td>
            <td>25 mn</td>
            <td>262 MB</td>
            <td>65 MB</td>
          </tr>
        </tbody>
      </table>

      <p>Notice how the index size for the mail archive is bigger than
        the data size. Myriads of small pure text documents will do
        this. The expansion factor would be even much worse with
        compressed folders of course (the test was on uncompressed
        data).</p>

      <p>The last test was performed with Recoll 1.9.0, which has an
        adjustable flush threshold (the <em>idxflushmb</em> parameter),
        here set to 10 MB. Notice the much lower peak memory usage,
        with no performance degradation. The resulting index is bigger
        though; the exact reason is not known to me, possibly
        additional fragmentation.</p>

    </div>
  </body>
</html>