--- a/website/perfs.html
+++ b/website/perfs.html
...
 <p>We try here to give a number of reference points which can
 be used to roughly estimate the resources needed to create and
 store an index. Obviously, your data set will never fit one of
 the samples, so the results cannot be exactly predicted.</p>
 
-<p>The following data was obtained on a machine with a 1800 Mhz
+<p>The following very old data was obtained on a machine with a
+1800 MHz
 AMD Duron CPU, 768Mb of Ram, and a 7200 RPM 160 GBytes IDE
-disk, running Suse 10.1.</p>
+disk, running Suse 10.1. More recent data follows.</p>
 
 <p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
 executed with the default flush threshold value.
 The process memory usage is the one given by <b>ps</b></p>
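+
+<p>A minimal way to take this measurement, assuming the usual
+procps <b>ps</b> output options (the exact command used for the
+original figures is not recorded here), would be:</p>
+<pre>
+# Show virtual size, resident size and elapsed time for recollindex
+ps -o vsz,rss,etime,comm -C recollindex
+</pre>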
 
...
 ajustable flush threshold (<em>idxflushmb</em> parameter), here
 set to 10 MB. Notice the much lower peak memory usage, with no
 performance degradation. The resulting index is bigger though,
 the exact reason is not known to me, possibly because of
 additional fragmentation </p>
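+
+<p>For reference, <em>idxflushmb</em> is set in the index
+configuration file (<b>recoll.conf</b> in the configuration
+directory, <b>~/.recoll</b> by default). A minimal sketch with the
+10 MB value discussed above:</p>
+<pre>
+# recoll.conf (sketch): flush index data to disk after
+# approximately this many megabytes of indexed text
+idxflushmb = 10
+</pre>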
+
+<p>There is more recent performance data (2012) at the end of
+the <a href="idxthreads/threadingRecoll.html">article about
+converting Recoll indexing to multithreading</a>.</p>
+
+<p>Update, March 2016: I took another sample of PDF performance
+data on a more modern machine, with Recoll multithreading turned
+on. The machine has an Intel Core i7-4770T CPU, which has 4
+physical cores and supports hyper-threading for a total of 8
+threads, 8 GBytes of RAM, and SSD storage (incidentally, the PC
+is fanless, so this is not a "beast" computer).</p>
+
+<table border=1>
+<thead>
+<tr>
+<th>Data</th>
+<th>Data size</th>
+<th>Indexing time</th>
+<th>Index size</th>
+<th>Peak process memory usage</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td>Random PDFs harvested on Google<br>
+Recoll 1.21.5, <em>idxflushmb</em> set to 200, thread
+parameters 6/4/1</td>
+<td>11 GB, 5320 files</td>
+<td>3 mn 15 s</td>
+<td>400 MB</td>
+<td>545 MB</td>
+</tr>
+</tbody>
+</table>
+
+<p>The indexing process used 21 mn of CPU during these 3 mn 15 s
+of real time, so we are not letting these cores stay idle
+much... The improvement compared to the numbers above is quite
+spectacular (a factor of 11, approximately), mostly due to the
+multiprocessing, but also to the faster CPU and the SSD
+storage. Note that the peak memory value is for the
+recollindex process, and does not take into account the
+multiple Python and pdftotext instances (which are relatively
+small, but things add up...).</p>
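+
+<p>For reference, a sketch of the indexing configuration used for
+this run. The <em>thrTCounts</em> parameter name is my reading of
+the "thread parameters 6/4/1" above; check the manual of your
+Recoll version for the exact multithreading settings:</p>
+<pre>
+# recoll.conf (sketch, not the literal file used for the test)
+# Large flush threshold: fewer, bigger index flushes
+idxflushmb = 200
+# Thread counts for the three indexing stages
+# (input conversion / term generation / index update)
+thrTCounts = 6 4 1
+</pre>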
+
+<h5>Improving indexing performance with hardware:</h5>
+<p>I think
+that the following multi-step approach has a good chance of
+improving performance:
+<ul>
+<li>Check that multithreading is enabled (it is, by default,
+with recent Recoll versions).</li>
+<li>Increase the flush threshold until the machine begins to
+have memory issues. Maybe add memory.</li>
+<li>Store the index on an SSD. If possible, also store the
+data on an SSD. Actually, when using many threads, having the
+data on an SSD is probably even more important.</li>
+<li>If you have many files which need temporary copies
+(email attachments, archive members, compressed files), use
+a memory temporary directory (see the sketch after this
+list). Add memory.</li>
+<li>More CPUs...</li>
+</ul>
 </p>
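+
+<p>A possible way to set up the memory temporary directory
+mentioned in the list above, assuming a tmpfs mount such as
+<b>/dev/shm</b> and the <b>RECOLL_TMPDIR</b> environment variable
+(the directory name is just an example):</p>
+<pre>
+# Put recollindex temporary copies on tmpfs (RAM-backed)
+mkdir -p /dev/shm/recoll-tmp
+RECOLL_TMPDIR=/dev/shm/recoll-tmp recollindex
+</pre>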
 
+<p>At some point, the index writing may become the
+bottleneck. As far as I can tell, the only possible approach
+then is to partition the index.</p>
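+
+<p>One possible way to partition, sketched here with hypothetical
+configuration directory names: give each partition its own Recoll
+configuration directory, each listing a different subset of the
+file tree in its <em>topdirs</em>, build them separately, and add
+them to each other as external indexes at query time:</p>
+<pre>
+# Each directory holds a recoll.conf with its own topdirs subset
+recollindex -c ~/.recoll-part1
+recollindex -c ~/.recoll-part2
+</pre>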
+
 </div>
 </body>
 </html>
 