Thanks for the great program.
When indexing, I noticed that a large amount of data is written to disk compared to the final size of the index. My concern is unnecessary write wear on a solid-state disk (SSD).
Creating a database of a few home folders using recollindex. Total bytes written was monitored using iostat and, separately, GNOME System Monitor (the two values agree with each other).
xapiandb folder size 3.7 GB
SSD disk write total 14.9 GB
Creating a database of a different folder using recollindex.
xapiandb folder size 10.1 GB
SSD disk write total 133.1 GB
Creating a database of a lot of folders (from spinning HDDs) using recollindex. This run was stopped after about 2 days, before it finished.
xapiandb folder size 247 GB
SSD disk write total more than 10 TB (approximate)
These 10 TB are roughly 1-2% of the SSD's rated write endurance (total bytes written), for a single recollindex run that didn't finish.
Is this volume of writes typical, and could anything be done to reduce it?
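To put the three runs side by side, here is a rough sketch of the arithmetic: bytes written divided by final index size, plus the endurance rating that the 1-2% estimate would imply. The function names are made up for illustration, and the last run's "over 10 TB" is taken as exactly 10,000 GB, so its ratio is a lower bound.

```python
# (xapiandb size in GB, total SSD bytes written in GB) for each run above.
runs = {
    "few home folders": (3.7, 14.9),
    "different folder": (10.1, 133.1),
    # Assumption: "over 10 TB" is taken as 10 000 GB (decimal units).
    "many folders, unfinished": (247.0, 10_000.0),
}


def write_ratio(index_gb, written_gb):
    """Bytes written per byte of final index (hypothetical helper)."""
    return written_gb / index_gb


def implied_endurance_tb(written_tb, fraction_of_life):
    """Total endurance implied if written_tb is fraction_of_life of it."""
    return written_tb / fraction_of_life


for name, (index_gb, written_gb) in runs.items():
    print(f"{name}: {write_ratio(index_gb, written_gb):.1f}x the index size")

# 10 TB being 1% to 2% of lifespan implies a rating in this range.
low = implied_endurance_tb(10, 0.02)
high = implied_endurance_tb(10, 0.01)
print(f"implied endurance: {low:.0f} to {high:.0f} TB written")
```

The ratios come out at roughly 4x, 13x and 40x, so the amount written per byte of index grows as the index gets larger rather than staying constant.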
The xapiandb folder could be written to a spinning HDD instead, but that would be very slow. I noticed that setting idxflushmb = 6000 lowered both the xapiandb folder size and the bytes written by about 20%.
Is the large amount of data written by recollindex related to Xapian using atomic commits? I noticed that Xapian has a ‘dangerous mode’ that updates the database in place.
Could recollindex optionally use this DB_DANGEROUS mode (off by default)? It could be useful for the initial creation of the xapiandb, when no one is querying the database yet. If the power fails or the run is otherwise interrupted (unlikely), the xapiandb folder can be deleted manually and recollindex started again. As a bonus, the final xapiandb folder would be smaller.
The Xapian documentation describes DB_DANGEROUS as follows:
Update the database in-place.
Xapian's disk-based backends use block-based storage, with copy-on-write to allow the previous revision to be searched while a new revision forms.
This option means changed blocks get written back over the top of the old version. The benefits of this are that less I/O is required during indexing, and the result of indexing is more compact. The downsides are that you can't concurrently search while indexing, transactions can't be cancelled, and if indexing ends uncleanly (i.e. without commit() or WritableDatabase's destructor being called) then the database won't be usable.
Currently all the base files will be removed upon the first modification, and new base files will be written upon commit. This prevents new readers from opening the database while it is unsafe to do so, but there's not currently a mechanism in Xapian to handle notifying existing readers.
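For illustration only, this is roughly how the flag would be passed when opening a database through the Xapian Python bindings. It is a sketch, not recollindex code (recoll uses the C++ API): the helper names build_open_flags and open_index are hypothetical, and it assumes a Xapian version whose bindings expose DB_DANGEROUS.

```python
def build_open_flags(xapian_mod, dangerous=False):
    """Return flags for opening a writable database (hypothetical helper).

    With dangerous=True the in-place update flag is OR-ed in, so it stays
    off unless explicitly requested.
    """
    flags = xapian_mod.DB_CREATE_OR_OPEN
    if dangerous:
        flags |= xapian_mod.DB_DANGEROUS
    return flags


def open_index(path, dangerous=False):
    """Open (or create) the index at path, optionally in dangerous mode."""
    import xapian  # assumption: the Xapian Python bindings are installed

    return xapian.WritableDatabase(path, build_open_flags(xapian, dangerous))
```

A first-time build might call open_index(path, dangerous=True) and fall back to the default flags for later incremental updates, when searches may run concurrently.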