--- a/website/perfs.html
+++ b/website/perfs.html
@@ -2,8 +2,7 @@
 
 <html>
   <head>
-    <title>RECOLL: a personal text search system for
-    Unix/Linux</title>
+    <title>RECOLL indexing performance and index sizes</title>
     <meta name="generator" content="HTML Tidy, see www.w3.org">
     <meta name="Author" content="Jean-Francois Dockes">
     <meta name="Description" content=
@@ -33,128 +32,289 @@
       <h1>Recoll: Indexing performance and index sizes</h1>
 
       <p>The time needed to index a given set of documents, and the
-	resulting index size depend of many factors, such as file size
-	and proportion of actual text content for the index size, cpu
-	speed, available memory, average file size and format for the
-	speed of indexing.</p>
-
-      <p>We try here to give a number of reference points which can
-	be used to roughly estimate the resources needed to create and
-	store an index. Obviously, your data set will never fit one of
-	the samples, so the results cannot be exactly predicted.</p>
-
-      <p>The following very old data was obtained on a machine with a
-        1800 Mhz
-	AMD Duron CPU, 768Mb of Ram, and a 7200 RPM 160 GBytes IDE
-	disk, running Suse 10.1. More recent data follows.</p>
-
-      <p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
-	executed with the default flush threshold value. 
-	The process memory usage is the one given by <b>ps</b></p>
+	resulting index size depend on many factors.</p>
+
+      <p>The index size depends almost only on the size of the
+        uncompressed input text, and you can expect it to be roughly
+        of the same order of magnitude. Depending on the type of file,
+        the proportion of text to file size varies very widely, going
+        from close to 1 for pure text files to a very small factor
+        for, e.g., metadata tags in mp3 files.</p>
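+
+      <p>If you want a feel for the text to file size ratio of a
+        sample of your own data, one rough way (assuming the
+        poppler <i>pdftotext</i> utility and a GNU userland; the file
+        name is an example) is to compare the extracted text size
+        with the file size:</p>
+
+      <pre>
+# Size of the extracted text, then size of the PDF itself
+pdftotext some.pdf - | wc -c
+stat -c %s some.pdf
+      </pre>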
+
+      <p>Estimating indexing time is a much more complicated issue,
+        depending on the type and size of input and on system
+        performance. There is no general way to determine what part of
+        the hardware should be optimized. Depending on the type of
+        input, performance may be bound by I/O read or write
+        performance, CPU single-processing speed, or combined
+        multi-processing speed.</p>
+
+      <p>It should be noted that Recoll performance will not be an
+        issue for most people. The indexer can process 1000 typical
+        PDF files per minute, or 500 Wikipedia HTML pages per second
+        on medium-range hardware, meaning that the initial indexing of
+        a typical dataset will need a few dozen minutes at
+        most. Further incremental index updates will be much faster
+        because most files will not need to be processed again.</p>
+
+      <p>However, there are Recoll installations with
+        terabyte-sized datasets, on which indexing can take days. For
+        such operations (or even much smaller ones), it is very
+        important to know what kind of performance can be expected,
+        and what aspects of the hardware should be optimized.</p>
+
+      <p>In order to provide some reference points, I have run a
+        number of benchmarks on medium-sized datasets, using typical
+        mid-range desktop hardware, and varying the indexing
+        configuration parameters to show how they affect the results.</p>
+
+      <p>The following may help you check that you are getting typical
+        performance for your indexing, and give some indications about
+        what to adjust to improve it.</p>
+        
+      <p>From time to time, I receive a report about a system becoming
+        unusable during indexing. As far as I know, with the default
+        Recoll configuration, and barring an exceptional issue (bug),
+        this is always due to a system problem (typically bad hardware
+        such as a disk doing retries). The tests below were mostly run
+        while I was using the desktop, which never became
+        unusable. However, some tests rendered it less responsive and
+        this is noted with the results.</p>
+
+      <p>The following text refers to the indexing parameters without
+        further explanation. More detailed explanations are available
+        about the
+        <a href="http://www.lesbonscomptes.com/recoll/idxthreads/threadingRecoll.html#recoll.idxthreads.multistage">processing
+        model</a> and the
+        <a href="https://www.lesbonscomptes.com/recoll/usermanual/webhelp/docs/RCL.INSTALL.CONFIG.RECOLLCONF.PERFS.html">configuration
+          parameters</a>.</p>
+      
+
+      <p>All tests were run without generating the stemming database or
+        aspell dictionary. These phases are relatively short and there
+        is nothing which can be optimized about them.</p>
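+
+      <p>For reference, all of the parameters discussed below live in
+        the index configuration file (typically
+        <i>~/.recoll/recoll.conf</i>). The following sketch shows the
+        kind of values used in these tests; the right numbers depend
+        on your own hardware:</p>
+
+      <pre>
+# Megabytes of indexed text between Xapian flushes
+idxflushmb = 200
+# Job queue depths for the three pipeline stages
+thrQSizes = 2 2 2
+# Worker thread counts for the three pipeline stages
+thrTCounts = 6 4 1
+      </pre>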
+      
+      <h2>Hardware</h2>
+
+      <p>The tests were run on what could be considered a mid-range
+        desktop PC:
+        <ul>
+          <li>Intel Core i7-4770T CPU: 2.5 GHz, 4 physical cores, and
+            hyper-threading for a total of 8 hardware threads</li>
+          <li>8 GBytes of RAM</li>
+          <li>Asus H87I-Plus motherboard, Samsung 850 EVO SSD storage</li>
+        </ul>
+      </p>
+
+      <p>This is usually a fanless PC, but I did run a fan on the
+        external case fins during some of the tests (esp. PDF
+        indexing), because the CPU was running a bit too hot.</p>
+
+
+      <h2>Indexing PDF files</h2>
+      
+
+      <p>The tests were run on 18000 random PDFs harvested on
+        Google, with a total size of around 30 GB, using Recoll 1.22.3
+        and Xapian 1.2.22. The resulting index size was 1.2 GB.</p>
+
+      <h3>PDF: storage</h3>
+
+      <p>Typical PDF files have a low text to file size ratio, and a
+        lot of data needs to be read for indexing. With the test
+        configuration, the indexer needs to read around 45 MB/s
+        from multiple files. This means that input storage makes a
+        difference and that you need an SSD or a fast array for
+        optimal performance.</p>
 
       <table border=1>
 	<thead>
 	  <tr>
-	    <th>Data</th>
-	    <th>Data size</th>
-	    <th>Indexing time</th>
-	    <th>Index size</th>
-	    <th>Peak process memory usage</th>
+	    <th>Storage</th>
+	    <th>idxflushmb</th>
+	    <th>thrTCounts</th>
+	    <th>Real Time</th>
 	  </tr>
 	<tbody>
 	  <tr>
-	    <td>Random pdfs harvested on Google</td>
-	    <td>1.7 GB, 3564 files</td>
-	    <td>27 mn</td>
-	    <td>230 MB</td>
-	    <td>225 MB</td>
-	  </tr>
-	  <tr>
-	    <td>Ietf mailing list archive</td>
-	    <td>211 MB, 44,000 messages</td>
-	    <td>8 mn</td>
-	    <td>350 MB</td>
-	    <td>90 MB</td>
-	  </tr>
-	  <tr>
-	    <td>Partial Wikipedia dump</td>
-	    <td>15 GB, one million files</td>
-	    <td>6H30</td>
-	    <td>10 GB</td>
-	    <td>324 MB</td>
-	  </tr>
-	  <tr>
-	    <!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
-	    <td>Random pdfs harvested on Google<br>
-	    Recoll 1.9, <em>idxflushmb</em> set to 10</td>
-	    <td>1.7 GB, 3564 files</td>
-	    <td>25 mn</td>
-	    <td>262 MB</td>
-	    <td>65 MB</td>
-	  </tr>
-	</tbody>
-      </table>
-
-      <p>Notice how the index size for the mail archive is bigger than
-	the data size. Myriads of small pure text documents will do
-	this. The factor of expansion would be even much worse with
-	compressed folders of course (the test was on uncompressed
-	data).</p>
-
-      <p>The last test was performed with Recoll 1.9.0 which has an
-	ajustable flush threshold (<em>idxflushmb</em> parameter), here
-	set to 10 MB. Notice the much lower peak memory usage, with no
-	performance degradation. The resulting index is bigger though,
-	the exact reason is not known to me, possibly because of
-	additional fragmentation </p>
-
-      <p>There is more recent performance data (2012) at the end of
-        the <a href="idxthreads/threadingRecoll.html">article about
-          converting Recoll indexing to multithreading</a></p>
-
-      <p>Update, March 2016: I took another sample of PDF performance
-        data on a more modern machine, with Recoll multithreading turned
-        on. The machine has an Intel Core I7-4770T Cpu, which has 4
-        physical cores, and supports hyper-threading for a total of 8
-        threads, 8 GBytes of RAM, and SSD storage (incidentally the PC is
-        fanless, this is not a "beast" computer).</p>
-        
-      <table border=1>
-	<thead>
-	  <tr>
-	    <th>Data</th>
-	    <th>Data size</th>
-	    <th>Indexing time</th>
-	    <th>Index size</th>
-	    <th>Peak process memory usage</th>
-	  </tr>
-	<tbody>
-	  <tr>
-	    <td>Random pdfs harvested on Google<br>
-	    Recoll 1.21.5, <em>idxflushmb</em> set to 200, thread
-	    parameters 6/4/1</td>
-	    <td>11 GB, 5320 files</td>
-	    <td>3 mn 15 S</td>
-	    <td>400 MB</td>
-	    <td>545 MB</td>
+	    <td>NFS drive (gigabit)</td>
+	    <td>200</td>
+	    <td>6/4/1</td>
+	    <td>24m40</td>
+	  </tr>
+	  <tr>
+	    <td>local SSD</td>
+	    <td>200</td>
+	    <td>6/4/1</td>
+	    <td>11m40</td>
 	  </tr>
 	</tbody>
       </table>
         
-      <p>The indexing process used 21 mn of CPU during these 3mn15 of
-        real time, we are not letting these cores stay idle
-        much... The improvement compared to the numbers above is quite
-        spectacular (a factor of 11, approximately), mostly due to the
-        multiprocessing, but also to the faster CPU and the SSD
-        storage. Note that the peak memory value is for the
-        recollindex process, and does not take into account the
-        multiple Python and pdftotext instances (which are relatively
-        small but things add up...).</p>
-      
-      <h5>Improving indexing performance with hardware:</h5>
-      <p>I think
-      that the following multi-step approach has a good chance to
-        improve performance:
+
+      <h3>PDF: threading</h3>
+
+      <p>Because PDF files are bulky and complicated to process, the
+        dominant step for indexing them is input processing. PDF text
+        extraction is performed by multiple instances of
+        the <i>pdftotext</i> program, and parallelisation works very
+        well.</p>
+
+      <p>The following table shows the indexing times with a variety
+        of threading parameters.</p>
+
+      <table border=1>
+	<thead>
+	  <tr>
+	    <th>idxflushmb</th>
+	    <th>thrQSizes</th>
+	    <th>thrTCounts</th>
+	    <th>Time R/U/S</th>
+	  </tr>
+          <tbody>
+	  <tr>
+	    <td>200</td>
+	    <td>2/2/2</td>
+	    <td>2/1/1</td>
+	    <td>19m21</td>
+	  </tr>
+	  <tr>
+	    <td>200</td>
+	    <td>2/2/2</td>
+	    <td>10/10/1</td>
+	    <td>10m38</td>
+	  </tr>
+	  <tr>
+	    <td>200</td>
+	    <td>2/2/2</td>
+	    <td>100/10/1</td>
+	    <td>11m</td>
+	  </tr>
+          </tbody>
+      </table>
+
+      <p>10/10/1 was the best value for thrTCounts for this test. The
+        total CPU time was around 78 mn.</p>
+
+      <p>The last line shows the effect of a ridiculously high thread
+        count value for the input step, which turns out to be
+        small. Using slightly lower values than the optimum does not
+        have much impact either. The only thing which really degrades
+        performance is configuring fewer threads than the hardware
+        provides.</p>
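+
+      <p>In configuration file terms, the best setting found for this
+        PDF test would be written as below (a sketch; scale the
+        counts to the number of hardware threads on your own
+        machine):</p>
+
+      <pre>
+thrQSizes = 2 2 2
+# 10 input conversion threads, 10 term generation threads, 1 index updater
+thrTCounts = 10 10 1
+      </pre>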
+
+      <p>With the optimal parameters above, the peak recollindex
+        resident memory size is around 930 MB, to which we should add
+        ten instances of pdftotext (typically 10 MB each) and of the
+        rclpdf.py Python input handler (around 15 MB each). This means
+        that the total resident memory used by indexing is around 1200
+        MB, quite a modest value in 2016.</p>
+
+
+      <h3>PDF: Xapian flushes</h3>
+
+      <p>idxflushmb has practically no influence on the indexing time
+        (tested from 40 to 1000), which is not too surprising because
+        the Xapian index size is very small relative to the input
+        size, so the cost of Xapian flushes to disk is not very
+        significant. The value of 200 used for the threading tests
+        could be lowered in practice, which would decrease memory
+        usage without significantly changing the indexing time.</p>
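+
+      <p>For instance, on a machine with less RAM, something like the
+        following should keep the indexing time unchanged while
+        lowering the memory peak (an untested extrapolation from the
+        measurements above, not a benchmarked value):</p>
+
+      <pre>
+# Flush the Xapian index every 100 MB of indexed text instead of 200
+idxflushmb = 100
+      </pre>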
+
+      <h3>PDF: conclusion</h3>
+
+      <p>For indexing PDF files, you need many cores and a fast
+        input storage system. Neither single-thread performance nor
+        amount of memory will be critical aspects.</p>
+
+      <p>Running the PDF indexing tests had no influence on the system
+        "feel", I could work on it just as if it were quiescent.</p>
+
+
+      <h2>Indexing HTML files</h2>
+
+      <p>The tests were run on an (old) French Wikipedia dump: 2.9
+        million HTML files stored in 42000 directories, for an
+        approximate total size of 41 GB (average file size
+        14 KB).</p>
+
+        <p>The files are stored on a local SSD. Just reading them with
+          find+cpio takes close to 8 mn.</p>
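+
+        <p>The timing corresponds to a command along the following
+          lines, which reads every file once and discards the output
+          (the exact invocation used is an assumption on my part):</p>
+
+        <pre>
+find . -depth -print | cpio -o > /dev/null
+        </pre>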
+
+        <p>The resulting index has a size of around 30 GB.</p>
+
+        <p>I was too lazy to extract the 3 million entry tar file onto
+          a spinning disk, so all tests were performed with the data
+          stored on a local SSD.</p>
+
+        <p>For this test, the indexing time is dominated by the Xapian
+          index updates. As these are single threaded, only the flush
+          interval has a real influence.</p>
+
+      <table border=1>
+	<thead>
+	  <tr>
+	    <th>idxflushmb</th>
+	    <th>thrQSizes</th>
+	    <th>thrTCounts</th>
+	    <th>Time R/U/S</th>
+	  </tr>
+          <tbody>
+	  <tr>
+	    <td>200</td>
+	    <td>2/2/2</td>
+	    <td>2/1/1</td>
+	    <td>88m</td>
+	  </tr>
+	  <tr>
+	    <td>200</td>
+	    <td>2/2/2</td>
+	    <td>6/4/1</td>
+	    <td>91m</td>
+	  </tr>
+	  <tr>
+	    <td>200</td>
+	    <td>2/2/2</td>
+	    <td>1/1/1</td>
+	    <td>96m</td>
+	  </tr>
+	  <tr>
+	    <td>100</td>
+	    <td>2/2/2</td>
+	    <td>1/2/1</td>
+	    <td>120m</td>
+	  </tr>
+	  <tr>
+	    <td>100</td>
+	    <td>2/2/2</td>
+	    <td>6/4/1</td>
+	    <td>121m</td>
+	  </tr>
+	  <tr>
+	    <td>40</td>
+	    <td>2/2/2</td>
+	    <td>1/2/1</td>
+	    <td>173m</td>
+	  </tr>
+          </tbody>
+      </table>
+
+
+      <p>The indexing process becomes quite big (resident size around
+        4GB), and the combination of high I/O load and high memory
+        usage makes the system less responsive at times (but not
+        unusable). As this happens principally when switching
+        applications, my guess would be that some program pages
+        (e.g. from the window manager and X) get flushed out and
+        take time to be read back in, during which the display
+        appears frozen.</p>
+
+      <p>For this kind of data, single-threaded CPU performance and
+        storage write speed can make a difference. Multithreading does
+        not help.</p>
+
+      <h2>Adjusting hardware to improve indexing performance</h2>
+
+      <p>I think that the following multi-step approach has a good
+        chance to improve performance:
         <ul>
           <li>Check that multithreading is enabled (it is, by default
             with recent Recoll versions).</li>
@@ -171,10 +331,85 @@
         </ul>
       </p>
 
-      <p>At some point, the index writing may become the
-        bottleneck. As far as I can think, the only possible approach
-        then is to partition the index.</p>
+      <p>At some point, the index updating and writing may become the
+        bottleneck (this depends on the data mix; it happens very
+        quickly with HTML or text files). As far as I can think, the
+        only possible approach is then to partition the index. You can
+        then either query the multiple Xapian indexes through the
+        Recoll external index capability, or actually merge them into
+        a single index with xapian-compact.</p>
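+
+      <p>As a sketch, merging two partial indexes would look like the
+        following (<i>xapian-compact</i> takes one or more source
+        databases followed by a destination; the paths here are
+        examples, assuming two separate Recoll configuration
+        directories):</p>
+
+      <pre>
+xapian-compact ~/.recoll-part1/xapiandb ~/.recoll-part2/xapiandb \
+    ~/.recoll/xapiandb
+      </pre>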
+
+
+
+      <h5>Old benchmarks</h5>
+
+      <p>To provide a point of comparison for the evolution of
+        hardware and software...</p>
       
+      <p>The following very old data was obtained (around 2007?) on a
+        machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a
+        7200 RPM 160 GB IDE disk, running Suse 10.1.</p>
+
+      <p><b>recollindex</b> (version 1.8.2 with xapian 1.0.0) is
+	executed with the default flush threshold value. 
+	The process memory usage is the one given by <b>ps</b>.</p>
+
+      <table border=1>
+	<thead>
+	  <tr>
+	    <th>Data</th>
+	    <th>Data size</th>
+	    <th>Indexing time</th>
+	    <th>Index size</th>
+	    <th>Peak process memory usage</th>
+	  </tr>
+	<tbody>
+	  <tr>
+	    <td>Random pdfs harvested on Google</td>
+	    <td>1.7 GB, 3564 files</td>
+	    <td>27 mn</td>
+	    <td>230 MB</td>
+	    <td>225 MB</td>
+	  </tr>
+	  <tr>
+	    <td>Ietf mailing list archive</td>
+	    <td>211 MB, 44,000 messages</td>
+	    <td>8 mn</td>
+	    <td>350 MB</td>
+	    <td>90 MB</td>
+	  </tr>
+	  <tr>
+	    <td>Partial Wikipedia dump</td>
+	    <td>15 GB, one million files</td>
+	    <td>6H30</td>
+	    <td>10 GB</td>
+	    <td>324 MB</td>
+	  </tr>
+	  <tr>
+	    <!-- DB: ndocs 3564 lastdocid 3564 avglength 6460.71 -->
+	    <td>Random pdfs harvested on Google<br>
+	    Recoll 1.9, <em>idxflushmb</em> set to 10</td>
+	    <td>1.7 GB, 3564 files</td>
+	    <td>25 mn</td>
+	    <td>262 MB</td>
+	    <td>65 MB</td>
+	  </tr>
+	</tbody>
+      </table>
+
+      <p>Notice how the index size for the mail archive is bigger than
+	the data size. Myriads of small pure text documents will do
+	this. The expansion factor would of course be even much worse
+	with compressed folders (the test was on uncompressed data).</p>
+
+      <p>The last test was performed with Recoll 1.9.0 which has an
+	adjustable flush threshold (<em>idxflushmb</em> parameter), here
+	set to 10 MB. Notice the much lower peak memory usage, with no
+	performance degradation. The resulting index is bigger though;
+	the exact reason is not known to me, possibly because of
+	additional fragmentation.</p>
+
     </div>
   </body>
 </html>