      <p>We try here to give a number of reference points which can
    be used to roughly estimate the resources needed to create and
    store an index. Obviously, your data set will never exactly match
    one of the samples, so the results cannot be precisely
    predicted.</p>

      <p>The following very old data was obtained on a machine with a
        1800 MHz AMD Duron CPU, 768 MB of RAM, and a 7200 RPM 160 GB
        IDE disk, running SUSE 10.1. More recent data follows.</p>

      <p><b>recollindex</b> (version 1.8.2 with Xapian 1.0.0) is
    executed with the default flush threshold value.
    The process memory usage is the one reported by <b>ps</b>.</p>
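As a sketch, the memory figures can be read with standard <b>ps</b> output selectors (VSZ and RSS, in kilobytes); the commented line assumes a <b>recollindex</b> process is actually running, and the last line demonstrates the same invocation on the current shell:

```shell
# Virtual and resident memory (KB) of a process, using standard ps
# output format selectors (the trailing "=" suppresses the header).
# For the indexer you would use:
# ps -o vsz=,rss= -p "$(pgrep -x recollindex)"
# Demonstrated here on the current shell process:
ps -o vsz=,rss= -p $$
```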

...
    adjustable flush threshold (<em>idxflushmb</em> parameter), here
    set to 10 MB. Notice the much lower peak memory usage, with no
    performance degradation. The resulting index is bigger though;
    the exact reason is not known to me, possibly because of
    additional fragmentation.</p>
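A minimal sketch of how the flush threshold is set, assuming the usual per-user configuration file location:

```
# ~/.recoll/recoll.conf (assumed default location)
# Flush the index to disk after roughly this many megabytes of
# indexed text, trading peak memory usage for, possibly, index size.
idxflushmb = 10
```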

      <p>There is more recent performance data (2012) at the end of
        the <a href="idxthreads/threadingRecoll.html">article about
          converting Recoll indexing to multithreading</a>.</p>
115
      <p>Update, March 2016: I took another sample of PDF performance
116
        data on a more modern machine, with Recoll multithreading turned
117
        on. The machine has an Intel Core I7-4770T Cpu, which has 4
118
        physical cores, and supports hyper-threading for a total of 8
119
        threads, 8 GBytes of RAM, and SSD storage (incidentally the PC is
120
        fanless, this is not a "beast" computer).</p>
121
        
      <table border=1>
  <thead>
    <tr>
      <th>Data</th>
      <th>Data size</th>
      <th>Indexing time</th>
      <th>Index size</th>
      <th>Peak process memory usage</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Random PDFs harvested on Google<br>
      Recoll 1.21.5, <em>idxflushmb</em> set to 200, thread
      parameters 6/4/1</td>
      <td>11 GB, 5320 files</td>
      <td>3 min 15 s</td>
      <td>400 MB</td>
      <td>545 MB</td>
    </tr>
  </tbody>
      </table>
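For reference, the test configuration would look roughly as follows in <em>recoll.conf</em>; the mapping of the 6/4/1 figures onto the <em>thrTCounts</em> stages is my assumption, so check the manual for your Recoll version:

```
# recoll.conf sketch of the test configuration above
# (assumed parameter names).
idxflushmb = 200
# Worker threads per indexing pipeline stage: the "6/4/1" above.
thrTCounts = 6 4 1
```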

      <p>The indexing process used 21 min of CPU during these 3 min 15 s
        of real time, so we are not letting the cores stay idle
        much... The improvement compared to the numbers above is quite
        spectacular (a factor of approximately 11), mostly due to
        multiprocessing, but also to the faster CPU and the SSD
        storage. Note that the peak memory value is for the
        recollindex process alone, and does not take into account the
        multiple Python and pdftotext instances (which are relatively
        small, but things add up...).</p>
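A back-of-envelope check of these figures (a sketch, using the 8 hardware threads stated above):

```python
# Back-of-envelope check of the parallelism implied by the numbers
# above: 21 min of CPU time packed into 3 min 15 s of real time.
real_s = 3 * 60 + 15          # wall-clock time: 3 min 15 s
cpu_s = 21 * 60               # total CPU time: 21 min
hw_threads = 8                # 4 cores with hyper-threading

parallelism = cpu_s / real_s            # average number of busy "CPUs"
utilization = parallelism / hw_threads  # fraction of hardware kept busy

print(f"average parallelism: {parallelism:.1f}")   # ~6.5
print(f"hardware utilization: {utilization:.0%}")  # ~81%
```

So on average about 6.5 of the 8 hardware threads were busy for the whole run, which is consistent with "not letting the cores stay idle much".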

      <h5>Improving indexing performance with hardware:</h5>
      <p>I think
      that the following multi-step approach has a good chance to
        improve performance:</p>
        <ul>
          <li>Check that multithreading is enabled (it is by default
            with recent Recoll versions).</li>
          <li>Increase the flush threshold until the machine begins to
            have memory issues. Maybe add memory.</li>
          <li>Store the index on an SSD. If possible, also store the
            data on an SSD. Actually, when using many threads, having
            the data on an SSD is probably at least as important.</li>
          <li>If you have many files which will need temporary copies
            (email attachments, archive members, compressed files), use
            a memory-backed temporary directory. Add memory.</li>
          <li>More CPUs...</li>
        </ul>
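One way to get such a memory-backed temporary directory, as a sketch: on Linux, <em>/dev/shm</em> is normally a tmpfs mount, and to my understanding recollindex honors the <em>RECOLL_TMPDIR</em> environment variable for its temporary copies:

```shell
# Use RAM-backed storage for temporary copies (attachments, archive
# members, decompressed files). Assumes /dev/shm is a tmpfs mount and
# that recollindex honors the RECOLL_TMPDIR environment variable.
export RECOLL_TMPDIR=/dev/shm/recoll-tmp
mkdir -p "$RECOLL_TMPDIR"
# then index as usual:
# recollindex
```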

      <p>At some point, the index writing may become the
        bottleneck. As far as I can tell, the only possible approach
        then is to partition the index.</p>
111
    </div>
178
    </div>
112
  </body>
179
  </body>
113
</html>
180
</html>
114
181