Recoll: Indexing performance and index sizes

The time needed to index a given set of documents and the resulting index size depend on many factors. The index size depends mostly on the total file size and the proportion of actual text content in the files; indexing speed depends mostly on CPU speed, available memory, and the average file size and format.

We try here to give a number of reference points which can be used to roughly estimate the resources needed to create and store an index. Obviously, your data set will never exactly match one of these samples, so the results cannot be precisely predicted.

The following very old data was obtained on a machine with a 1800 MHz AMD Duron CPU, 768 MB of RAM, and a 7200 RPM, 160 GB IDE disk, running SUSE 10.1. More recent data follows.

recollindex (version 1.8.2 with Xapian 1.0.0) was executed with the default flush threshold value. The process memory usage is the one reported by ps.
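
As an illustration, here is one way such figures can be collected from a shell. This is only a sketch, not necessarily the procedure used for the tables below: the GNU time utility and the ps sampling loop are assumptions, and other tools would work as well.

    # Time a full reindex; recollindex -z erases and rebuilds the index.
    # GNU time -v also reports the peak resident set size.
    /usr/bin/time -v recollindex -z

    # Alternatively, sample the resident set size (in kilobytes) with ps
    # while indexing runs, and keep the largest value seen.
    while pidof recollindex > /dev/null; do
        ps -o rss= -p "$(pidof recollindex)"
        sleep 5
    done | sort -n | tail -1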

Data                                Data size                 Indexing time  Index size  Peak process memory usage
----------------------------------  ------------------------  -------------  ----------  -------------------------
Random PDFs harvested on Google     1.7 GB, 3564 files        27 mn          230 MB      225 MB
IETF mailing list archive           211 MB, 44,000 messages   8 mn           350 MB      90 MB
Partial Wikipedia dump              15 GB, one million files  6 h 30 mn      10 GB       324 MB
Random PDFs harvested on Google     1.7 GB, 3564 files        25 mn          262 MB      65 MB
(Recoll 1.9, idxflushmb set to 10)

Notice how the index size for the mail archive is bigger than the data size: about 1.7 times bigger (350 MB of index for 211 MB of data). Myriads of small pure-text documents will do this. The expansion factor would of course be much worse with compressed folders (the test was on uncompressed data).

The last test was performed with Recoll 1.9.0, which has an adjustable flush threshold (the idxflushmb parameter), here set to 10 MB. Notice the much lower peak memory usage, with no performance degradation. The resulting index is bigger, though; the exact reason is not known to me, possibly additional fragmentation.
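
For reference, this threshold lives in the index configuration file. A minimal sketch, assuming the default personal configuration directory:

    # ~/.recoll/recoll.conf
    # Flush the index to disk after every 10 MB of indexed text,
    # trading some index compactness for lower peak memory usage.
    idxflushmb = 10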

There is more recent performance data (2012) at the end of the article about converting Recoll indexing to multithreading.

Update, March 2016: I took another sample of PDF performance data on a more modern machine, with Recoll multithreading turned on. The machine has an Intel Core i7-4770T CPU (4 physical cores with hyper-threading, for a total of 8 threads), 8 GB of RAM, and SSD storage (incidentally, the PC is fanless; this is not a "beast" computer).

Data                             Data size          Indexing time  Index size  Peak process memory usage
-------------------------------  -----------------  -------------  ----------  -------------------------
Random PDFs harvested on Google  11 GB, 5320 files  3 mn 15 s      400 MB      545 MB
(Recoll 1.21.5, idxflushmb set to 200, thread parameters 6/4/1)

The indexing process used 21 mn of CPU during these 3 mn 15 s of real time, an effective parallelism of about 6.5 on 8 hardware threads, so we are not letting these cores stay idle much... The improvement compared to the numbers above is quite spectacular (a factor of 11, approximately), mostly due to the multiprocessing, but also to the faster CPU and the SSD storage. Note that the peak memory value is for the recollindex process alone and does not take into account the multiple Python and pdftotext instances (which are relatively small, but things add up...).
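
For reference, a configuration sketch for such a run; mapping the "6/4/1" thread parameters onto the thrTCounts variable is my interpretation of the per-stage thread counts, so treat this as an assumption:

    # ~/.recoll/recoll.conf
    idxflushmb = 200
    # Thread counts for the three indexing pipeline stages:
    # file intake/conversion, term generation, index update.
    thrTCounts = 6 4 1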

Improving indexing performance with hardware:

I think that the following multi-step approach has a good chance of improving performance:

At some point, the index writing may become the bottleneck. As far as I can see, the only possible approach then is to partition the index.
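
Recoll can query several indexes together, which suggests one possible partitioning sketch: maintain independent configurations, each indexing a subset of the document tree, and combine them at query time. The directory names below are made up for illustration:

    # Each configuration directory holds its own recoll.conf
    # (with a distinct topdirs value) and its own xapiandb.
    recollindex -c ~/.recoll-part1 &
    recollindex -c ~/.recoll-part2 &
    wait

    # Query both indexes at once by declaring the second one
    # as an external index.
    RECOLL_EXTRA_DBS=~/.recoll-part2/xapiandb recoll -c ~/.recoll-part1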