Fedora 29
Recoll 1.24.3
Xapian 1.4.9

Thanks for the great program.

When indexing, I noticed that far more data is written to disk than the final size of the index. My concern is unnecessary write wear on a solid-state disk (SSD).

Example 1
Creating a database of a few home folders using recollindex. Total bytes written were monitored with iostat and, separately, with GNOME System Monitor; the two values agree.
xapiandb folder size 3.7 GB
SSD disk write total 14.9 GB
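
For anyone who wants to reproduce these measurements without iostat, here is a minimal sketch that reads the kernel's cumulative write counter from /proc/diskstats on Linux. The function names and the device name `sda` are illustrative assumptions, not anything recollindex provides. The tenth column of each line is the number of sectors written, counted in 512-byte units.

```python
def bytes_written(diskstats_text, device):
    """Return the cumulative bytes written to `device`, given /proc/diskstats content."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Columns: major, minor, name, reads, reads merged, sectors read,
        # ms reading, writes, writes merged, sectors written, ...
        if len(fields) >= 10 and fields[2] == device:
            # /proc/diskstats always counts sectors in units of 512 bytes.
            return int(fields[9]) * 512
    raise KeyError(f"device {device!r} not found")


def read_bytes_written(device, path="/proc/diskstats"):
    """Read the live counter for `device` (hypothetical helper, Linux only)."""
    with open(path) as f:
        return bytes_written(f.read(), device)
```

Calling `read_bytes_written("sda")` before and after a recollindex run and subtracting the two values gives the bytes written during the run. The counter covers everything written to the device, not only the index.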

Example 2
Creating a database of a different folder using recollindex.
xapiandb folder size 10.1 GB
SSD disk write total 133.1 GB

Example 3
Creating a database of a lot of folders (from spinning HDDs) using recollindex. This indexing run was stopped before finishing after about 2 days.
xapiandb folder size 247 GB
SSD disk write total over 10 TB (approximate)

This 10 TB written is approx 1-2% of the SSD's rated lifetime writes, for a single recollindex run that didn’t finish.
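
To put the three examples side by side, the ratio of bytes written to final index size can be worked out as below. This is only arithmetic on the figures quoted above, assuming decimal units (1 TB = 1000 GB) and treating the Example 3 total as exactly 10 TB. The ratios come out at roughly 4x, 13x and 40x, and the 1-2% figure implies a rated endurance of about 500 to 1000 TB.

```python
GB = 10**9
TB = 10**12

# (xapiandb folder size, total bytes written to the SSD), taken from the examples.
runs = {
    "Example 1": (3.7 * GB, 14.9 * GB),
    "Example 2": (10.1 * GB, 133.1 * GB),
    "Example 3": (247 * GB, 10 * TB),  # run was stopped early; total is approximate
}


def write_ratio(index_bytes, written_bytes):
    """Bytes written to disk per byte of final index."""
    return written_bytes / index_bytes


def implied_endurance_tb(written_bytes, fraction_of_life):
    """Rated lifetime writes in TB if `written_bytes` is `fraction_of_life` of it."""
    return written_bytes / fraction_of_life / TB


for name, (index_bytes, written_bytes) in runs.items():
    print(f"{name}: {write_ratio(index_bytes, written_bytes):.1f}x")

for fraction in (0.01, 0.02):
    print(f"{fraction:.0%} of life -> {implied_endurance_tb(10 * TB, fraction):.0f} TB rated")
```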

Is this level of written bytes typical? And could anything be done to reduce it?

Writing the xapiandb folder to a spinning HDD would be possible, but very slow. I noticed that setting idxflushmb = 6000 lowered both the xapiandb folder size and the bytes written by about 20%.

Is the large amount of recollindex bytes written related to xapian using atomic commits? I noticed that xapian can use a ‘dangerous mode’ of updating the database in place.

Could recollindex optionally use this DB_DANGEROUS mode (off by default)? This could be useful for the first xapiandb creation, when no one will be querying the database. If the power fails or something similar happens (unlikely), the xapiandb folder can be deleted manually and recollindex started again. As a bonus, the final xapiandb folder will be smaller.

The Xapian documentation describes DB_DANGEROUS as follows:
Update the database in-place.
Xapian's disk-based backends use block-based storage, with copy-on-write to allow the previous revision to be searched while a new revision forms.
This option means changed blocks get written back over the top of the old version. The benefits of this are that less I/O is required during indexing, and the result of indexing is more compact. The downsides are that you can't concurrently search while indexing, transactions can't be cancelled, and if indexing ends uncleanly (i.e. without commit() or WritableDatabase's destructor being called) then the database won't be usable.
Currently all the base files will be removed upon the first modification, and new base files will be written upon commit. This prevents new readers from opening the database while it is unsafe to do so, but there's not currently a mechanism in Xapian to handle notifying existing readers.


  • medoc


    I just gave DB_DANGEROUS a try (and DB_NO_SYNC too), but neither seems to change the amount of disk writes much.

    This kind of makes sense: DB_DANGEROUS changes the place where writes are performed (the existing block rather than a copy), but probably not so much the amount of writes. DB_NO_SYNC works on small indexes, but once the index becomes much bigger than the buffer cache, it's not very effective any more.

    DB_DANGEROUS does result in a smaller index, so fewer writes overall, but the difference is not spectacular.

    It seems that the ratio between the amount of writes and the index size rises with the index size.
    Maybe it would make sense to create the index in several pieces and then merge them. I'm really not sure. The best place to ask about this writing problem would be the Xapian discuss mailing list.

    I'm subscribed, so if there is a need for recoll information, I can supply it there as needed.

    Also, in my experience, it's not necessarily a major performance issue to have the index on a spinning disk. When indexing many small files it's actually more important that the source is on SSD. Did you actually give it a try?

    Last, and mostly about your last try (I get approximately the same ratios as you for the smaller ones): sorry, but I have to ask: where is your swap partition?
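
The swap question presumably matters because, if swap sits on the same SSD and a large indexing run pushes the machine into swapping, that traffic would be included in the disk's write total. A minimal sketch for checking where swap lives on Linux is to parse /proc/swaps; the function name and the returned dictionary keys are illustrative assumptions.

```python
def swap_areas(swaps_text):
    """Parse /proc/swaps content into a list of active swap areas."""
    areas = []
    # The first line is the header: Filename Type Size Used Priority
    for line in swaps_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) >= 4:
            areas.append({
                "path": fields[0],           # device or swap file
                "type": fields[1],           # "partition" or "file"
                "size_kib": int(fields[2]),  # sizes are reported in KiB
                "used_kib": int(fields[3]),
            })
    return areas


def read_swap_areas(path="/proc/swaps"):
    """Read the live swap table (Linux only)."""
    with open(path) as f:
        return swap_areas(f.read())
```

If `read_swap_areas()` reports a device on the SSD with a non-zero `used_kib` during indexing, some of the measured writes may be swap rather than index data.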

