Recoll: Indexing performance and index sizes

The time needed to index a given set of documents, and the resulting index size, depend on many factors.

The index size depends almost entirely on the size of the uncompressed input text, and you can expect the two to be roughly of the same order of magnitude. The proportion of text to file size varies very widely with the type of file, from close to 1 for pure text files to a very small factor for, e.g., the metadata tags in mp3 files.
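
As a worked illustration, using the two test datasets described below:

    PDF test:  30 GB of files -> 1.2 GB index (text is a small fraction of file size)
    HTML test: 41 GB of files -> 30 GB index  (the files are mostly text)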

Estimating indexing time is a much more complicated issue, depending on the type and size of the input and on system performance. There is no general way to determine what part of the hardware should be optimized. Depending on the type of input, performance may be bound by storage read or write speed, single-thread CPU performance, or combined multi-core throughput.

It should be noted that Recoll performance will not be an issue for most people. The indexer can process 1000 typical PDF files per minute, or 500 Wikipedia HTML pages per second on medium-range hardware, meaning that the initial indexing of a typical dataset will need a few dozen minutes at most. Further incremental index updates will be much faster because most files will not need to be processed again.

However, there are Recoll installations with terabyte-sized datasets, on which indexing can take days. For such operations (or even much smaller ones), it is very important to know what kind of performance can be expected, and what aspects of the hardware should be optimized.

In order to provide some reference points, I have run a number of benchmarks on medium-sized datasets, using typical mid-range desktop hardware, and varying the indexing configuration parameters to show how they affect the results.

The following may help you check that you are getting typical performance for your indexing, and give some indications about what to adjust to improve it.

From time to time, I receive a report about a system becoming unusable during indexing. As far as I know, with the default Recoll configuration, and barring an exceptional issue (bug), this is always due to a system problem (typically bad hardware such as a disk doing retries). The tests below were mostly run while I was using the desktop, which never became unusable. However, some tests rendered it less responsive and this is noted with the results.

The following text refers to the indexing configuration parameters without further explanation; see the Recoll documentation for details about the processing model and the configuration parameters.

All tests were run without generating the stemming database or aspell dictionary. These phases are relatively short, and there is little to optimize about them.

Hardware

The tests were run on what could be considered a mid-range desktop PC.

This is normally a fanless PC, but I did run a fan on the external case fins during some of the tests (especially the PDF ones), because the CPU was running a bit hot.

Indexing PDF files

The tests were run on 18000 random PDFs harvested on Google, with a total size of around 30 GB, using Recoll 1.22.3 and Xapian 1.2.22. The resulting index size was 1.2 GB.

PDF: storage

Typical PDF files have a low text-to-file-size ratio, so a lot of data needs to be read for indexing. With the test configuration, the indexer reads around 45 MB/s from multiple files. This means that input storage performance makes a difference, and that you need an SSD or a fast disk array for optimal results.

Storage              idxflushmb  thrTCounts  Real time
NFS drive (gigabit)  200         6/4/1       24m40
Local SSD            200         6/4/1       11m40
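
The 45 MB/s figure can be checked against the SSD run, which reads the whole 30 GB dataset in 11m40 (700 seconds):

    30,000 MB / 700 s ≈ 43 MB/s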

PDF: threading

Because PDF files are bulky and complicated to process, the dominant step for indexing them is input processing. PDF text extraction is performed by multiple instances of the pdftotext program, and parallelisation works very well.

The following table shows the indexing times with a variety of threading parameters.

idxflushmb  thrQSizes  thrTCounts  Real time
200         2/2/2      2/1/1       19m21
200         2/2/2      10/10/1     10m38
200         2/2/2      100/10/1    11m

10/10/1 was the best value for thrTCounts for this test. The total CPU time was around 78 mn.

The last line shows that even a ridiculously high thread count for the input step does not hurt much. Using slightly lower values than the optimum has little impact either. The only thing which really degrades performance is configuring fewer threads than the hardware can run.

With the optimal parameters above, the peak recollindex resident memory size is around 930 MB, to which we should add ten instances of pdftotext (around 10 MB each) and of the rclpdf.py Python input handler (around 15 MB each). The total resident memory used by indexing is thus around 1200 MB, quite a modest value in 2016.
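
For reference, here is a minimal sketch of the best-performing configuration above, assuming the usual parameter syntax of the indexing configuration file (~/.recoll/recoll.conf):

    # Flush the Xapian index to disk every 200 MB of indexed text
    idxflushmb = 200
    # Depths of the queues between the three processing steps
    thrQSizes = 2 2 2
    # Thread counts: 10 input workers, 10 term generators, 1 index writer
    thrTCounts = 10 10 1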

PDF: Xapian flushes

idxflushmb has practically no influence on the indexing time (tested from 40 to 1000), which is not too surprising because the Xapian index is very small relative to the input size, so the cost of flushing it to disk is not very significant. The value of 200 used for the threading tests could be lowered in practice, which would decrease memory usage without significantly changing the indexing time.

PDF: conclusion

For indexing PDF files, you need many cores and a fast input storage system. Neither single-thread CPU performance nor the amount of memory is critical.

Running the PDF indexing tests had no influence on the system "feel": I could work on the machine just as if it were quiescent.

Indexing HTML files

The tests were run on an (old) French Wikipedia dump: 2.9 million HTML files stored in 42000 directories, for an approximate total size of 41 GB (average file size 14 KB).

The files are stored on a local SSD. Just reading them with find+cpio takes close to 8 mn.
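
For reference, the raw read time was measured with something like the following sketch (the dataset path is a placeholder):

    # Read every file once and discard the data
    cd /path/to/wikipedia-dump
    time find . -type f -print | cpio -o > /dev/null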

The resulting index has a size of around 30 GB.

I was too lazy to extract the 3-million-entry tar file onto a spinning disk, so all tests were performed with the data stored on a local SSD.

For this test, the indexing time is dominated by the Xapian index updates. As these are single-threaded, only the flush interval has a real influence.

idxflushmb  thrQSizes  thrTCounts  Real time
200         2/2/2      2/1/1       88m
200         2/2/2      6/4/1       91m
200         2/2/2      1/1/1       96m
100         2/2/2      1/2/1       120m
100         2/2/2      6/4/1       121m
40          2/2/2      1/2/1       173m
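
Some back-of-the-envelope figures for the fastest run above (41 GB read and a 30 GB index written in 88 minutes, or 5280 seconds):

    41,000 MB / 5280 s ≈ 8 MB/s sustained reads
    30,000 MB / 5280 s ≈ 6 MB/s sustained writes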

The indexing process becomes quite big (resident size around 4 GB), and the combination of high I/O load and high memory usage makes the system less responsive at times (but not unusable). As this happens mainly when switching applications, my guess is that some program pages (e.g. from the window manager and X) get flushed out and take time to be read back in, during which the display appears frozen.

For this kind of data, single-threaded CPU performance and storage write speed can make a difference. Multithreading does not help.

Adjusting hardware to improve indexing performance

I think that a multi-step approach has a good chance of improving performance: depending on the data mix, start with more cores and faster input storage (which dominate for bulky formats like PDF, as shown above), then look at single-thread CPU performance and storage write speed (which dominate for text and HTML).

At some point, index updating and writing may become the bottleneck (how quickly depends on the data mix; very quickly with HTML or plain text files). As far as I can see, the only possible approach is then to partition the index. You can then either query the multiple Xapian indexes together, using the Recoll external index capability, or actually merge them into one with xapian-compact.
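
Here is a minimal sketch of both options, with placeholder paths, assuming the recollq -i option for adding external indexes and the standard xapian-compact source/destination syntax:

    # Query the default index plus a second partition
    recollq -i ~/idx-part2/xapiandb 'some search terms'

    # Or merge two partitions into a single Xapian database
    xapian-compact ~/idx-part1/xapiandb ~/idx-part2/xapiandb ~/idx-merged/xapiandb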

Old benchmarks

To provide a point of comparison for the evolution of hardware and software...

The following very old data was obtained (around 2007?) on a machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a 160 GB 7200 RPM IDE disk, running Suse 10.1.

recollindex (version 1.8.2 with Xapian 1.0.0) was executed with the default flush threshold value. The process memory usage is the one reported by ps.

Data                                     Data size                 Indexing time  Index size  Peak memory
Random PDFs harvested on Google          1.7 GB, 3564 files        27 mn          230 MB      225 MB
IETF mailing list archive                211 MB, 44,000 messages   8 mn           350 MB      90 MB
Partial Wikipedia dump                   15 GB, one million files  6h30           10 GB       324 MB
Random PDFs (Recoll 1.9, idxflushmb=10)  1.7 GB, 3564 files        25 mn          262 MB      65 MB

Notice how the index size for the mail archive is bigger than the data size. Myriads of small pure-text documents will do this. The expansion factor would of course be even worse with compressed folders (the test was run on uncompressed data).
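
In numbers:

    350 MB index / 211 MB data ≈ 1.7x expansion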

The last test was performed with Recoll 1.9.0, which has an adjustable flush threshold (the idxflushmb parameter), here set to 10 MB. Notice the much lower peak memory usage, with no performance degradation. The resulting index is bigger, though; the exact reason is not known to me, possibly additional fragmentation.