a b/website/perfs.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">

<html>
  <head>
    <title>RECOLL: a personal text search system for
    Unix/Linux</title>
    <meta name="generator" content="HTML Tidy, see www.w3.org">
    <meta name="Author" content="Jean-Francois Dockes">
    <meta name="Description" content=
    "recoll is a simple full-text search system for unix and linux based on the powerful and mature xapian engine">
    <meta name="Keywords" content=
      "full text search,fulltext,desktop search,unix,linux,solaris,open source,free">
    <meta http-equiv="Content-language" content="en">
    <meta http-equiv="content-type" content=
    "text/html; charset=iso-8859-1">
    <meta name="robots" content="All,Index,Follow">
    <link type="text/css" rel="stylesheet" href="styles/style.css">
  </head>

  <body>

    <div class="rightlinks">
      <ul>
        <li><a href="index.html">Home</a></li>
        <li><a href="pics/index.html">Screenshots</a></li>
        <li><a href="download.html">Downloads</a></li>
        <li><a href="doc.html">Documentation</a></li>
      </ul>
    </div>

    <div class="content">

      <h1 class="intro">Recoll: Indexing performance and index sizes</h1>

      <p>The time needed to index a given set of documents and the
  resulting index size depend on many factors. The index size
  depends mostly on file size and the proportion of actual text
  content. The indexing speed depends on CPU speed, available
  memory, average file size, and file format.</p>

      <p>We try here to give a number of reference points which can
  be used to roughly estimate the resources needed to create and
  store an index. Obviously, your data set will never exactly match
  one of the samples, so the results cannot be predicted exactly.</p>

      <p>The following data was obtained on a machine with a 1800 MHz
  AMD Duron CPU, 768 MB of RAM, and a 7200 RPM 160 GB IDE
  disk, running SUSE 10.1.</p>

      <p><b>recollindex</b> (version 1.8.2 with Xapian 1.0.0) was
  executed with the default flush threshold value.
  The process memory usage is the one reported by <b>ps</b>.</p>

      <table border=1>
  <thead>
    <tr>
      <th>Data</th>
      <th>Data size</th>
      <th>Indexing time</th>
      <th>Index size</th>
      <th>Peak process memory usage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Random PDFs harvested on Google</td>
      <td>1.7 GB, 3564 files</td>
      <td>27 min</td>
      <td>230 MB</td>
      <td>225 MB</td>
    </tr>
    <tr>
      <td>IETF mailing list archive</td>
      <td>211 MB, 44,000 messages</td>
      <td>8 min</td>
      <td>350 MB</td>
      <td>90 MB</td>
    </tr>
    <tr>
      <td>Partial Wikipedia dump</td>
      <td>15 GB, one million files</td>
      <td>6 h 30 min</td>
      <td>10 GB</td>
      <td>324 MB</td>
    </tr>
    <tr>
      <!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
      <td>Random PDFs harvested on Google<br>
      Recoll 1.9, <em>idxflushmb</em> set to 10</td>
      <td>1.7 GB, 3564 files</td>
      <td>25 min</td>
      <td>262 MB</td>
      <td>65 MB</td>
    </tr>
  </tbody>
      </table>

      <p>Notice how the index size for the mail archive is bigger than
  the data size. Large numbers of small pure-text documents will do
  this. The expansion factor would of course be even worse with
  compressed folders (the test was run on uncompressed
  data).</p>
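
      <p>To make the comparison between samples concrete, the index
  expansion factor (index size divided by data size) and the indexing
  throughput can be computed from the table figures. The following is
  only a rough sketch: the function and variable names are
  hypothetical, and 1 GB is assumed to be 1000 MB, which is an
  approximation.</p>

```python
# Figures copied from the table above: (label, data MB, minutes, index MB).
# Assumption: 1 GB is treated as 1000 MB, so the results are approximate.
SAMPLES = [
    ("Random PDFs", 1700, 27, 230),
    ("IETF mailing list archive", 211, 8, 350),
    ("Partial Wikipedia dump", 15000, 390, 10000),
    ("Random PDFs, idxflushmb = 10", 1700, 25, 262),
]


def expansion_factor(data_mb, index_mb):
    """Index size relative to the size of the indexed data."""
    return index_mb / data_mb


def throughput_mb_per_min(data_mb, minutes):
    """Amount of source data indexed per minute."""
    return data_mb / minutes


def summarize(samples):
    """Return (label, expansion factor, MB per minute) for each sample."""
    return [
        (label, expansion_factor(data, index), throughput_mb_per_min(data, minutes))
        for label, data, minutes, index in samples
    ]


if __name__ == "__main__":
    for label, factor, rate in summarize(SAMPLES):
        print(f"{label}: index/data = {factor:.2f}, {rate:.1f} MB/min")
```

      <p>With these figures, only the mail archive has an expansion
  factor above 1; the other samples produce an index smaller than
  the data.</p>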

      <p>The last test was performed with Recoll 1.9.0, which has an
  adjustable flush threshold (<em>idxflushmb</em> parameter), here
  set to 10 MB. Notice the much lower peak memory usage, with no
  performance degradation. The resulting index is bigger, though.
  The exact reason is not known to me, but it is possibly due to
  additional fragmentation.</p>
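
      <p>As an illustration, the setting used for the last test would
  look like the line below in the Recoll configuration file. This is
  a sketch only: the file location (typically
  <em>~/.recoll/recoll.conf</em>) is an assumption not stated on this
  page; the parameter name and value are the ones from the test
  above.</p>

```
# Flush threshold in megabytes (value used for the last test above).
# Assumed location of this file: ~/.recoll/recoll.conf
idxflushmb = 10
```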

    </div>
  </body>
</html>