|
a |
|
b/website/perfs.html |
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
<head>
<title>RECOLL: a personal text search system for
Unix/Linux</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
<meta name="Keywords" content=
"full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>

<body>

<div class="rightlinks">
<ul>
<li><a href="index.html">Home</a></li>
<li><a href="pics/index.html">Screenshots</a></li>
<li><a href="download.html">Downloads</a></li>
<li><a href="doc.html">Documentation</a></li>
</ul>
</div>

<div class="content">

<h1 class="intro">Recoll: Indexing performance and index sizes</h1>

<p>The time needed to index a given set of documents and the
resulting index size depend on many factors: file size and the
proportion of actual text content for the index size; CPU
speed, available memory, average file size and file format for
the indexing speed.</p>

<p>We try here to give a number of reference points which can
be used to roughly estimate the resources needed to create and
store an index. Obviously, your data set will never exactly
match one of the samples, so the results cannot be predicted
precisely.</p>

<p>The following data was obtained on a machine with an 1800 MHz
AMD Duron CPU, 768 MB of RAM, and a 7200 RPM 160 GB IDE
disk, running SUSE 10.1.</p>

<p><b>recollindex</b> (version 1.8.2 with Xapian 1.0.0) is
executed with the default flush threshold value.
The process memory usage is the one reported by <b>ps</b>.</p>

<table border=1>
<thead>
<tr>
<th>Data</th>
<th>Data size</th>
<th>Indexing time</th>
<th>Index size</th>
<th>Peak process memory usage</th>
</tr>
</thead>
<tbody>
<tr>
<td>Random PDFs harvested on Google</td>
<td>1.7 GB, 3564 files</td>
<td>27 min</td>
<td>230 MB</td>
<td>225 MB</td>
</tr>
<tr>
<td>IETF mailing list archive</td>
<td>211 MB, 44,000 messages</td>
<td>8 min</td>
<td>350 MB</td>
<td>90 MB</td>
</tr>
<tr>
<td>Partial Wikipedia dump</td>
<td>15 GB, one million files</td>
<td>6 h 30 min</td>
<td>10 GB</td>
<td>324 MB</td>
</tr>
<tr>
<!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
<td>Random PDFs harvested on Google<br>
Recoll 1.9, <em>idxflushmb</em> set to 10</td>
<td>1.7 GB, 3564 files</td>
<td>25 min</td>
<td>262 MB</td>
<td>65 MB</td>
</tr>
</tbody>
</table>

<p>Notice how the index size for the mail archive is bigger than
the data size. Large numbers of small, pure-text documents will
do this. The expansion factor would of course be even worse
with compressed folders (the test was on uncompressed
data).</p>
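
<p>As a rough illustration of how these reference points might
be used, the sketch below computes the indexing throughput and
the index/data size ratio for the first three samples in the
table, then scales one sample linearly to another data size. It
is written in Python and is not part of Recoll: the names
<em>SAMPLES</em>, <em>summarize</em> and <em>estimate</em> are
made up for this example, 1 GB is assumed to be 1000 MB, and
linear scaling is only a crude approximation.</p>

```python
# Figures copied from the table on this page, converted to MB and minutes.
# Assumption: 1 GB = 1000 MB; "6 h 30 min" = 390 minutes.
# (label, data size in MB, indexing time in minutes, index size in MB)
SAMPLES = [
    ("Random PDFs", 1700, 27, 230),
    ("IETF mailing list archive", 211, 8, 350),
    ("Partial Wikipedia dump", 15000, 390, 10000),
]


def summarize(data_mb, minutes, index_mb):
    """Return (throughput in MB per minute, index size / data size)."""
    return data_mb / minutes, index_mb / data_mb


def estimate(data_mb, sample):
    """Scale a reference sample linearly to a data set of data_mb MB.

    Returns (estimated indexing minutes, estimated index size in MB).
    This assumes the new data resembles the sample, which is rarely
    exactly true.
    """
    _, ref_mb, ref_minutes, ref_index_mb = sample
    throughput, ratio = summarize(ref_mb, ref_minutes, ref_index_mb)
    return data_mb / throughput, data_mb * ratio


if __name__ == "__main__":
    for label, data_mb, minutes, index_mb in SAMPLES:
        throughput, ratio = summarize(data_mb, minutes, index_mb)
        print(f"{label}: {throughput:.1f} MB/min, index/data = {ratio:.2f}")
```

<p>For the mail archive the ratio comes out above 1, which is
the expansion described above. Calling
<em>estimate(422, SAMPLES[1])</em> would simply double that
sample's time and index size; treat such numbers as an order of
magnitude at best.</p>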

<p>The last test was performed with Recoll 1.9.0, which has an
adjustable flush threshold (<em>idxflushmb</em> parameter), here
set to 10 MB. Notice the much lower peak memory usage, with no
performance degradation. The resulting index is bigger, though.
The exact reason is not known to me; possibly it is additional
fragmentation.</p>

</div>
</body>
</html>