<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<title>RECOLL indexing performance and index sizes</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=

...

<div class="content">

<h1>Recoll: Indexing performance and index sizes</h1>

<p>The time needed to index a given set of documents, and the
resulting index size, depend on many factors.</p>

<p>The index size depends almost only on the size of the
uncompressed input text, and you can expect it to be roughly
of the same order of magnitude. Depending on the type of file,
the proportion of text to file size varies very widely, going
from close to 1 for pure text files to a very small factor
for, e.g., metadata tags in mp3 files.</p>
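
<p>As rough reference points, here are a few input-to-index size
ratios taken from the measurements further down this page (they
illustrate orders of magnitude, not a general rule):</p>
<pre>
  30 GB of PDF files         -> 1.2 GB index   (text is a small fraction of each file)
  41 GB of HTML (Wikipedia)  -> ~30 GB index   (mostly text)
  211 MB of mail messages    -> 350 MB index   (myriads of small pure-text documents)
</pre>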

<p>Estimating indexing time is a much more complicated issue,
depending on the type and size of input and on system
performance. There is no general way to determine what part of
the hardware should be optimized. Depending on the type of
input, performance may be bound by I/O read or write
performance, CPU single-processing speed, or combined
multi-processing speed.</p>

<p>It should be noted that Recoll performance will not be an
issue for most people. The indexer can process 1000 typical
PDF files per minute, or 500 Wikipedia HTML pages per second
on medium-range hardware, meaning that the initial indexing of
a typical dataset will need a few dozen minutes at
most. Further incremental index updates will be much faster
because most files will not need to be processed again.</p>

<p>However, there are Recoll installations with
terabyte-sized datasets, on which indexing can take days. For
such operations (or even much smaller ones), it is very
important to know what kind of performance can be expected,
and what aspects of the hardware should be optimized.</p>

<p>In order to provide some reference points, I have run a
number of benchmarks on medium-sized datasets, using typical
mid-range desktop hardware, and varying the indexing
configuration parameters to show how they affect the results.</p>

<p>The following may help you check that you are getting typical
performance for your indexing, and give some indications about
what to adjust to improve it.</p>

<p>From time to time, I receive a report about a system becoming
unusable during indexing. As far as I know, with the default
Recoll configuration, and barring an exceptional issue (bug),
this is always due to a system problem (typically bad hardware
such as a disk doing retries). The tests below were mostly run
while I was using the desktop, which never became
unusable. However, some tests rendered it less responsive and
this is noted with the results.</p>

<p>The following text refers to the indexing parameters without
further explanation. Here are links to more detailed explanations of the
<a href="http://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html#recoll.idxthreads.multistage">processing
model</a> and
<a href="https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.RECOLLCONF.PERFS.html">configuration
parameters</a>.</p>
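
<p>For orientation, here is roughly what the parameters used for
most of the tests below look like in the index configuration file
(<i>recoll.conf</i>). This is an illustrative sketch; see the
configuration documentation linked above for the exact syntax and
defaults:</p>
<pre>
# Megabytes of indexed text between Xapian flushes to disk
idxflushmb = 200
# Queue depths for the three indexing pipeline stages
thrQSizes = 2 2 2
# Worker thread counts for the stages (file conversion / term generation / index update)
thrTCounts = 6 4 1
</pre>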


<p>All tests were run without generating the stemming database or
aspell dictionary. These phases are relatively short and there
is nothing which can be optimized about them.</p>

<h2>Hardware</h2>

<p>The tests were run on what could be considered a mid-range
desktop PC:
<ul>
<li>Intel Core I7-4770T CPU: 2.5 GHz, 4 physical cores, and
hyper-threading for a total of 8 hardware threads</li>
<li>8 GBytes of RAM</li>
<li>Asus H87I-Plus motherboard, Samsung 850 EVO SSD storage</li>
</ul>
</p>

<p>This is usually a fanless PC, but I did run a fan on the
external case fins during some of the tests (esp. PDF
indexing), because the CPU was running a bit too hot.</p>


<h2>Indexing PDF files</h2>

<p>The tests were run on 18000 random PDFs harvested on
Google, with a total size of around 30 GB, using Recoll 1.22.3
and Xapian 1.2.22. The resulting index size was 1.2 GB.</p>

<h3>PDF: storage</h3>

<p>Typical PDF files have a low text to file size ratio, and a
lot of data needs to be read for indexing. With the test
configuration, the indexer needs to read around 45 MBytes/s
from multiple files. This means that input storage makes a
difference and that you need an SSD or a fast array for
optimal performance.</p>
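
<p>If you want to check whether storage is the limiting factor on
your own system, watching disk utilization during an indexing run
is usually enough. For example (assuming the sysstat package is
installed):</p>
<pre>
# Extended device statistics every 5 seconds while recollindex runs
iostat -x 5
</pre>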

<table border=1>
<thead>
<tr>
<th>Storage</th>
<th>idxflushmb</th>
<th>thrTCounts</th>
<th>Real Time</th>
</tr>
<tbody>
<tr>
<td>NFS drive (gigabit)</td>
<td>200</td>
<td>6/4/1</td>
<td>24m40</td>
</tr>
<tr>
<td>local SSD</td>
<td>200</td>
<td>6/4/1</td>
<td>11m40</td>
</tr>
</tbody>
</table>

<h3>PDF: threading</h3>

<p>Because PDF files are bulky and complicated to process, the
dominant step for indexing them is input processing. PDF text
extraction is performed by multiple instances of
the <i>pdftotext</i> program, and parallelisation works very
well.</p>

<p>The following table shows the indexing times with a variety
of threading parameters.</p>

<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Time R/U/S</th>
</tr>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>19m21</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>10/10/1</td>
<td>10m38</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>100/10/1</td>
<td>11m</td>
</tr>
</tbody>
</table>

<p>10/10/1 was the best value for thrTCounts for this test. The
total CPU time was around 78 mn.</p>

<p>The last line shows the effect of a ridiculously high thread
count value for the input step, which is not much. Using
slightly lower values than the optimum does not have much impact
either. The only thing which really degrades performance is
configuring fewer threads than the hardware provides.</p>

<p>With the optimal parameters above, the peak recollindex
resident memory size is around 930 MB, to which we should add
ten instances of pdftotext (10 MB typical), and of the
rclpdf.py Python input handler (around 15 MB each). This means
that the total resident memory used by indexing is around 1200
MB, quite a modest value in 2016.</p>
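
<p>To check the equivalent numbers on your own system while an
indexing run is in progress, something along these lines will do
(a rough sketch; ps options and helper process names may differ
on your system):</p>
<pre>
# Resident memory (KB) and command name for the indexer and the PDF extractors
ps -C recollindex,pdftotext -o rss=,comm=
</pre>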


<h3>PDF: Xapian flushes</h3>

<p>idxflushmb has practically no influence on the indexing time
(tested from 40 to 1000), which is not too surprising because
the Xapian index size is very small relative to the input
size, so that the cost of Xapian flushes to disk is not very
significant. The value of 200 used for the threading tests
could be lowered in practice, which would decrease memory
usage and not change the indexing time significantly.</p>

<h3>PDF: conclusion</h3>

<p>For indexing PDF files, you need many cores and a fast
input storage system. Neither single-thread performance nor
amount of memory will be critical aspects.</p>

<p>Running the PDF indexing tests had no influence on the system
"feel"; I could work on it just as if it were quiescent.</p>


<h2>Indexing HTML files</h2>

<p>The tests were run on an (old) French Wikipedia dump: 2.9
million HTML files stored in 42000 directories, for an
approximate total size of 41 GB (average file size
14 KB).</p>

<p>The files are stored on a local SSD. Just reading them with
find+cpio takes close to 8 mn.</p>
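
<p>The raw read test was of this general kind (a sketch, with a
placeholder path; any command which reads every file once gives a
comparable baseline):</p>
<pre>
cd /path/to/wikipedia-dump
time find . -type f | cpio -o > /dev/null
</pre>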

<p>The resulting index has a size of around 30 GB.</p>

<p>I was too lazy to extract a 3-million-entry tar file onto a
spinning disk, so all tests were performed with the data
stored on a local SSD.</p>

<p>For this test, the indexing time is dominated by the Xapian
index updates. As these are single threaded, only the flush
interval has a real influence.</p>

<table border=1>
<thead>
<tr>
<th>idxflushmb</th>
<th>thrQSizes</th>
<th>thrTCounts</th>
<th>Time R/U/S</th>
</tr>
<tbody>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>2/1/1</td>
<td>88m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>91m</td>
</tr>
<tr>
<td>200</td>
<td>2/2/2</td>
<td>1/1/1</td>
<td>96m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>120m</td>
</tr>
<tr>
<td>100</td>
<td>2/2/2</td>
<td>6/4/1</td>
<td>121m</td>
</tr>
<tr>
<td>40</td>
<td>2/2/2</td>
<td>1/2/1</td>
<td>173m</td>
</tr>
</tbody>
</table>

<p>The indexing process becomes quite big (resident size around
4 GB), and the combination of high I/O load and high memory
usage makes the system less responsive at times (but not
unusable). As this happens principally when switching
applications, my guess would be that some program pages
(e.g. from the window manager and X) get flushed out, and take
time being read in, during which time the display appears
frozen.</p>

<p>For this kind of data, single-threaded CPU performance and
storage write speed can make a difference. Multithreading does
not help.</p>

<h2>Adjusting hardware to improve indexing performance</h2>

<p>I think that the following multi-step approach has a good
chance to improve performance:
<ul>
<li>Check that multithreading is enabled (it is, by default
with recent Recoll versions).</li>
<li>Increase the flush threshold until the machine begins to
have memory issues. Maybe add memory.</li>
...
a memory temporary directory. Add memory.</li>
<li>More CPUs...</li>
</ul>
</p>
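
<p>When trying these steps, it helps to time complete runs under
identical conditions, so that the effect of each change can be
compared. A simple way to do this (resetting the index each time)
is:</p>
<pre>
# Rebuild the index from scratch and report real/user/system times
time recollindex -z
</pre>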

<p>At some point, the index updating and writing may become the
bottleneck (this depends on the data mix, very quickly with
HTML or text files). As far as I can think, the only possible
approach is then to partition the index. You can then either
query the multiple Xapian indices together, using the Recoll
external index capability, or actually merge them into a single
index with xapian-compact.</p>
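
<p>As an illustration of the merging route, assuming two separate
configuration directories have each indexed part of the data (the
paths here are placeholders; by default the Xapian database lives
in the <i>xapiandb</i> subdirectory of each configuration
directory):</p>
<pre>
# Index the two halves of the dataset, possibly in parallel
recollindex -c ~/.recoll-part1
recollindex -c ~/.recoll-part2

# Merge the two Xapian databases into a single one
xapian-compact ~/.recoll-part1/xapiandb ~/.recoll-part2/xapiandb ~/.recoll-merged/xapiandb
</pre>
<p>The alternative, querying the partial indexes together without
merging, is what Recoll calls external indexes, and does not
require the merge step.</p>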


<h5>Old benchmarks</h5>

<p>To provide a point of comparison for the evolution of
hardware and software...</p>

<p>The following very old data was obtained (around 2007?) on a
machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a
7200 RPM 160 GBytes IDE disk, running Suse 10.1.</p>

<p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
executed with the default flush threshold value.
The process memory usage is the one given by <b>ps</b>.</p>

<table border=1>
<thead>
<tr>
<th>Data</th>
<th>Data size</th>
<th>Indexing time</th>
<th>Index size</th>
<th>Peak process memory usage</th>
</tr>
<tbody>
<tr>
<td>Random pdfs harvested on Google</td>
<td>1.7 GB, 3564 files</td>
<td>27 mn</td>
<td>230 MB</td>
<td>225 MB</td>
</tr>
<tr>
<td>Ietf mailing list archive</td>
<td>211 MB, 44,000 messages</td>
<td>8 mn</td>
<td>350 MB</td>
<td>90 MB</td>
</tr>
<tr>
<td>Partial Wikipedia dump</td>
<td>15 GB, one million files</td>
<td>6H30</td>
<td>10 GB</td>
<td>324 MB</td>
</tr>
<tr>
<!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
<td>Random pdfs harvested on Google<br>
Recoll 1.9, <em>idxflushmb</em> set to 10</td>
<td>1.7 GB, 3564 files</td>
<td>25 mn</td>
<td>262 MB</td>
<td>65 MB</td>
</tr>
</tbody>
</table>

<p>Notice how the index size for the mail archive is bigger than
the data size. Myriads of small pure text documents will do
this. The factor of expansion would be even much worse with
compressed folders of course (the test was on uncompressed
data).</p>

<p>The last test was performed with Recoll 1.9.0, which has an
adjustable flush threshold (<em>idxflushmb</em> parameter), here
set to 10 MB. Notice the much lower peak memory usage, with no
performance degradation. The resulting index is bigger though;
the exact reason is not known to me, possibly because of
additional fragmentation.</p>

</div>
</body>
</html>