Diff of /website/perfs.html [42378d] .. [a905a9]

--- a/website/perfs.html
+++ b/website/perfs.html
@@ -43,9 +43,10 @@
 	store an index. Obviously, your data set will never fit one of
 	the samples, so the results cannot be exactly predicted.</p>
 
-      <p>The following data was obtained on a machine with a 1800 Mhz
+      <p>The following very old data was obtained on a machine with an
+        1800 MHz
 	AMD Duron CPU, 768Mb of Ram, and a 7200 RPM 160 GBytes IDE
-	disk, running Suse 10.1.</p>
+	disk, running Suse 10.1. More recent data follows.</p>
 
       <p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
 	executed with the default flush threshold value. 
@@ -106,8 +107,74 @@
 	performance degradation. The resulting index is bigger though,
 	the exact reason is not known to me, possibly because of
 	additional fragmentation </p>
+
+      <p>There is more recent performance data (2012) at the end of
+        the <a href="idxthreads/threadingRecoll.html">article about
+          converting Recoll indexing to multithreading</a>.</p>
+
+      <p>Update, March 2016: I took another sample of PDF performance
+        data on a more modern machine, with Recoll multithreading turned
+        on. The machine has an Intel Core i7-4770T CPU, which has 4
+        physical cores and supports hyper-threading for a total of 8
+        threads, 8 GBytes of RAM, and SSD storage (incidentally, the PC is
+        fanless: this is not a "beast" computer).</p>
+        
+      <table border=1>
+	<thead>
+	  <tr>
+	    <th>Data</th>
+	    <th>Data size</th>
+	    <th>Indexing time</th>
+	    <th>Index size</th>
+	    <th>Peak process memory usage</th>
+	  </tr>
+	<tbody>
+	  <tr>
+	    <td>Random pdfs harvested on Google<br>
+	    Recoll 1.21.5, <em>idxflushmb</em> set to 200, thread
+	    parameters 6/4/1</td>
+	    <td>11 GB, 5320 files</td>
+	    <td>3 min 15 s</td>
+	    <td>400 MB</td>
+	    <td>545 MB</td>
+	  </tr>
+	</tbody>
+      </table>
+        
+      <p>The indexing process used 21 min of CPU time during these
+        3 min 15 s of real time: we are not letting these cores stay idle
+        much... The improvement compared to the numbers above is quite
+        spectacular (a factor of 11, approximately), mostly due to the
+        multiprocessing, but also to the faster CPU and the SSD
+        storage. Note that the peak memory value is for the
+        recollindex process, and does not take into account the
+        multiple Python and pdftotext instances (which are relatively
+        small but things add up...).</p>
+      
+      <h5>Improving indexing performance with hardware:</h5>
+      <p>I think
+      that the following multi-step approach has a good chance of
+        improving performance:
+        <ul>
+          <li>Check that multithreading is enabled (it is, by default
+            with recent Recoll versions).</li>
+          <li>Increase the flush threshold until the machine begins to
+            have memory issues. Maybe add memory.</li>
+          <li>Store the index on an SSD. If possible, also store the
+            data on an SSD. Actually, when using many threads, it is
+            probably at least as important to have the data on an
+            SSD.</li>
+          <li>If you have many files which will need temporary copies
+            (email attachments, archive members, compressed files): use
+            a memory-backed temporary directory. Add memory.</li>
+          <li>More CPUs...</li>
+        </ul>
       </p>
 
+      <p>At some point, the index writing may become the
+        bottleneck. As far as I can tell, the only possible approach
+        then is to partition the index.</p>
+      
     </div>
   </body>
 </html>
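
The remark in the March 2016 hunk that the cores were not left idle can be checked with a little arithmetic. The following is a minimal sketch, not part of the patch: it only reuses the figures quoted in the added table and paragraph (21 min of CPU time, 3 min 15 s of real time, 11 GB, 5320 files). The function names are hypothetical, and 1 GB is assumed to mean 1000 MB.

```python
# Figures quoted in the March 2016 sample of the diff above.
CPU_SECONDS = 21 * 60        # 21 min of CPU time
REAL_SECONDS = 3 * 60 + 15   # 3 min 15 s of wall-clock time
DATA_MB = 11 * 1000          # 11 GB, assuming 1 GB = 1000 MB
FILE_COUNT = 5320


def average_busy_threads(cpu_seconds, real_seconds):
    """Average number of hardware threads kept busy over the run."""
    return cpu_seconds / real_seconds


def rate_per_second(amount, real_seconds):
    """Average amount (MB, files, ...) processed per wall-clock second."""
    return amount / real_seconds


busy = average_busy_threads(CPU_SECONDS, REAL_SECONDS)
mb_per_s = rate_per_second(DATA_MB, REAL_SECONDS)
files_per_s = rate_per_second(FILE_COUNT, REAL_SECONDS)

print(f"average busy threads: {busy:.2f} of 8")
print(f"throughput: {mb_per_s:.1f} MB/s, {files_per_s:.1f} files/s")
```

On these figures, roughly six and a half of the eight hardware threads were busy on average, which is consistent with the comment in the text. The CPU time presumably covers the whole indexing run, so the ratio says nothing about how the load was split between recollindex and the helper processes.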