Diff of /src/rcldb/stemdb.h [0b981f] .. [35f7e6]

...
 * 
 * The map is then stored as a Xapian index where each stem is the
 * unique term indexing a document, and the list of expansions is stored
 * as the document data record. It would probably be possible to store
 * the expansions as the document term list instead (using a prefix to
 * distinguish the stem term). I tried this (chert, 08-2012) and the stem
 * db creation is very slightly slower than with the record approach, and
 * the result is 50% bigger.
 *
 * Another possible approach would be to update the stem map as we index. 
 * This would probably be be less efficient for a full index pass because
 * each term would be seen and stemmed many times, but it might be
 * more efficient for an incremental pass with a limited number of

	a/src/rcldb/stemdb.h		b/src/rcldb/stemdb.h
	...		...
32	*	32	*
33	* The map is then stored as a Xapian index where each stem is the	33	* The map is then stored as a Xapian index where each stem is the
34	* unique term indexing a document, and the list of expansions is stored	34	* unique term indexing a document, and the list of expansions is stored
35	* as the document data record. It would probably be possible to store	35	* as the document data record. It would probably be possible to store
36	* the expansions as the document term list instead (using a prefix to	36	* the expansions as the document term list instead (using a prefix to
37	* distinguish the stem term).	37	* distinguish the stem term). I tried this (chert, 08-2012) and the stem
		38	* db creation is very slightly slower than with the record approach, and
		39	* the result is 50% bigger.
38	*	40	*
39	* Another possible approach would be to update the stem map as we index.	41	* Another possible approach would be to update the stem map as we index.
40	* This would probably be be less efficient for a full index pass because	42	* This would probably be be less efficient for a full index pass because
41	* each term would be seen and stemmed many times, but it might be	43	* each term would be seen and stemmed many times, but it might be
42	* more efficient for an incremental pass with a limited number of	44	* more efficient for an incremental pass with a limited number of
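
The record approach described in the comment (one entry per stem, with the expansions packed into that entry's data record) can be sketched with standard containers only. This is an illustrative in-memory model, not the code in stemdb.h and not the Xapian API: `toyStem`, `StemIndex`, `buildStemIndex` and `expand` are hypothetical names, the space-separated record format is an assumption, and the toy stemmer only strips a trailing "ing" or "s".

```cpp
#include <cassert>
#include <map>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical toy stemmer: strips a trailing "ing" or "s".
// A real stem database would use a proper stemming algorithm.
std::string toyStem(const std::string& term) {
    if (term.size() > 4 && term.compare(term.size() - 3, 3, "ing") == 0)
        return term.substr(0, term.size() - 3);
    if (term.size() > 3 && term.back() == 's')
        return term.substr(0, term.size() - 1);
    return term;
}

// Stands in for the index: the key plays the role of the unique stem term,
// the value plays the role of the document data record.
using StemIndex = std::map<std::string, std::string>;

// Build the whole map in one pass over the term list, then pack each
// stem's expansions into a single space-separated record (assumed format).
StemIndex buildStemIndex(const std::vector<std::string>& terms) {
    std::map<std::string, std::set<std::string>> stemToTerms;
    for (const auto& term : terms)
        stemToTerms[toyStem(term)].insert(term);

    StemIndex index;
    for (const auto& entry : stemToTerms) {
        std::string record;
        for (const auto& expansion : entry.second) {
            // The separator must not occur inside a term.
            assert(expansion.find(' ') == std::string::npos);
            if (!record.empty())
                record += ' ';
            record += expansion;
        }
        index[entry.first] = record;
    }
    return index;
}

// Look up the record for the term's stem and split it back into terms.
std::vector<std::string> expand(const StemIndex& index, const std::string& term) {
    std::vector<std::string> expansions;
    auto it = index.find(toyStem(term));
    if (it == index.end())
        return expansions;
    std::istringstream in(it->second);
    for (std::string word; in >> word;)
        expansions.push_back(word);
    return expansions;
}
```

In this model, the alternative mentioned in the comment would correspond to storing each expansion as a separate prefixed entry instead of packing them into one record per stem.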