Diff of /src/rcldb/stemdb.h [ec7b40] .. [0b981f]

--- a/src/rcldb/stemdb.h
+++ b/src/rcldb/stemdb.h
 ...
  *   Free Software Foundation, Inc.,
  *   59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
  */
 #ifndef _STEMDB_H_INCLUDED_
 #define _STEMDB_H_INCLUDED_
-/// Stem database code
+/** Stem database code
-///
+ *
-/// Stem databases list stems and the set of index terms they expand to. They
+ * Stem databases list stems and the set of index terms they expand to. They
-/// are computed from index data by stemming each term and regrouping those
+ * are computed from index data by stemming each term and regrouping those
-/// that stem to the same value.
+ * that stem to the same value.
+ *
-/// Stem databases are stored as separate xapian databases (used as an
+ * Stem databases are stored as separate Xapian databases, in
-/// Isam method), in subdirectories of the index.
+ * subdirectories of the index (e.g. stem_french, stem_german2).
+ *
+ * The stem database is generated at the end of an indexing session by
+ * walking the whole index term list, computing the stem for each
+ * term, and building a stem->terms map.
+ *
+ * The map is then stored as a Xapian index where each stem is the
+ * unique term indexing a document, and the list of expansions is stored
+ * as the document data record. It would probably be possible to store
+ * the expansions as the document term list instead (using a prefix to
+ * distinguish the stem term).
+ *
+ * Another possible approach would be to update the stem map as we index.
+ * This would probably be less efficient for a full index pass because
+ * each term would be seen and stemmed many times, but it might be
+ * more efficient for an incremental pass with a limited number of
+ * updated documents. For a small update, the stem building part often
+ * dominates the indexing time.
+ *
+ * For future reference, I did try to store the map in a gdbm file and
+ * the result is bigger and takes more time to create than the Xapian version.
+ */
 #include <vector>
 #include <string>
 #include <xapian.h>
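
The stem->terms map described in the new header comment can be illustrated with a minimal sketch. This is not the code from stemdb.cpp: `toyStem`, `buildStemMap` and `packExpansions` are hypothetical names, the stemmer is a toy that only strips a trailing "s" (a real build would use a language-specific stemmer such as Xapian::Stem), and the space-separated record layout is an assumption, since the comment does not specify how the expansions are encoded in the document data record.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical toy stemmer: strips one trailing "s". Stands in for a real
// language-specific stemmer, which the actual code would use instead.
std::string toyStem(const std::string& term) {
    if (term.size() > 1 && term.back() == 's')
        return term.substr(0, term.size() - 1);
    return term;
}

// Walk the whole term list once, stem each term, and regroup the terms
// that stem to the same value.
std::map<std::string, std::set<std::string>>
buildStemMap(const std::vector<std::string>& terms) {
    std::map<std::string, std::set<std::string>> stemToTerms;
    for (const auto& term : terms)
        stemToTerms[toyStem(term)].insert(term);
    return stemToTerms;
}

// Join the expansions of one stem into a single string, the kind of value
// that could be stored as a document data record. The space-separated
// layout is an assumption made for this sketch.
std::string packExpansions(const std::set<std::string>& expansions) {
    std::string record;
    for (const auto& term : expansions) {
        if (!record.empty())
            record += ' ';
        record += term;
    }
    return record;
}
```

In this sketch each map key would become the unique term indexing one document, and the packed string would be that document's data record. Because every term is stemmed exactly once, a single walk over the term list suits a full index pass, which matches the trade-off discussed in the comment.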