<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Recoll Index format</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"Recoll is a personal full-text search tool for Unix and Linux, based on Xapian, a powerful and mature indexing engine.">
<meta name="Keywords" content=
"full text search,desktop,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>
<body>
<div class="content">
<h1>Recoll index format details</h1>
<p>A comparison of the index formats used by Recoll 1.8 and Omega
1.0.1.</p>
<p>Recoll terms are not stemmed before being stored. They are converted
to lowercase and stripped of accents. An auxiliary database
handles stem expansion. Omega stores both the raw
terms and their stemmed versions (with prefix Z).</p>
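<p>As an illustration of the normalization described above, here is a
minimal Python sketch that lowercases a term and strips its accents
through Unicode decomposition. The function name
<code>normalize_term</code> is hypothetical, and Recoll's own code may
handle more cases than this.</p>
<pre>
```python
import unicodedata


def normalize_term(term):
    """Lowercase a term and strip its accents (illustrative only)."""
    # Decompose accented characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", term)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    return stripped.lower()


# "\u00c9l\u00e9phant" is "Elephant" written with two accented e's.
print(normalize_term("\u00c9l\u00e9phant"))  # prints "elephant"
```
</pre>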
<h2>Special prefixed terms:</h2>
<p>A comparison of prefixed term usage between Recoll and
Omega/Xapian. <em>xapian-core</em> in the Omega column means
that the prefix is not used by Omega, but is mentioned as
allocated in the Xapian prefix definition document.</p>
<table border=1 cellspacing=0 width="90%">
<thead>
<tr><th>Pref.</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>T</td><td>mime type</td><td>Same</td>
</tr>
<tr><td>P</td><td>Truncated/hashed version of the file path. For
single-document files, and for the file part of a
multi-document file. Used for up-to-date checks and for
retrieving a document by path.</td><td>Path part of the URL (no
hashing). Omega uses the U term instead
for up-to-date checks.</td>
</tr>
<tr><td>Q</td><td>Path hash + ipath: same as P, plus the internal path
for documents inside multi-document files. Used to set the
existence flag for subdocuments when a multi-document file is found
to be up to date, for deleting all subdocuments of a file, or
for retrieving a document by path+ipath. Compatible
with the Q definition in xapian/termprefixes.txt: unique
identifier.</td><td>None</td>
</tr>
<tr><td>D</td><td>date: modification date of file, like
YYYYMMDD</td><td>Same</td>
</tr>
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td>
</tr>
<tr><td>Y</td><td>year YYYY</td><td>Same</td>
</tr>
<tr><td>XSFN</td><td>UTF-8 version of the file name. Used for specific
file name searches</td><td>None</td>
</tr>
<tr><td>U</td><td>None</td><td>URL term: truncated/hashed version
of the URL. Used for duplicate checks.</td>
</tr>
<tr><td>S</td><td>Subject/title</td><td>xapian-core</td>
</tr>
<tr><td>A</td><td>Author</td><td>xapian-core</td>
</tr>
<tr><td>K</td><td>Keyword</td><td>xapian-core</td>
</tr>
</tbody>
</table>
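<p>The table describes the P and U terms as truncated/hashed, without
giving the exact scheme. The Python sketch below shows one way such a
term could be built: values that fit are kept whole, and over-long
values are cut and completed with a digest so that they stay distinct.
The 240-byte limit, the use of MD5 and the name <code>path_term</code>
are all assumptions made for illustration, not the actual Recoll or
Omega algorithm.</p>
<pre>
```python
import hashlib

MAX_TERM_BYTES = 240  # assumed limit, not taken from Recoll or Omega


def path_term(path, prefix="P"):
    """Build a prefixed term, hashing the tail of over-long paths."""
    raw = path.encode("utf-8")
    budget = MAX_TERM_BYTES - len(prefix)
    if len(raw) > budget:
        # Replace the tail with a digest of the whole path (32 hex bytes).
        digest = hashlib.md5(raw).hexdigest().encode("ascii")
        raw = raw[:budget - len(digest)] + digest
    return prefix.encode("ascii") + raw


print(path_term("/home/me/notes.txt"))
print(len(path_term("/" + "x" * 1000)))
```
</pre>
<p>Short paths remain readable in the index, while two long paths that
share the same leading bytes still produce different terms.</p>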
<p>None of the "date" terms are currently used by Recoll queries.</p>
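<p>The D, M and Y terms listed in the table can be derived from a file
modification time as in the following Python sketch. The function name
<code>date_terms</code> and the choice of UTC are assumptions; an
indexer may well use local time instead.</p>
<pre>
```python
from datetime import datetime, timezone


def date_terms(mtime):
    """Return the D, M and Y terms for a Unix modification time."""
    # UTC is an assumption made here to keep the example deterministic.
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return [
        "D" + stamp.strftime("%Y%m%d"),  # day:   DYYYYMMDD
        "M" + stamp.strftime("%Y%m"),    # month: MYYYYMM
        "Y" + stamp.strftime("%Y"),      # year:  YYYYY
    ]


# 1181822400 is 2007-06-14 12:00:00 UTC.
print(date_terms(1181822400))  # ['D20070614', 'M200706', 'Y2007']
```
</pre>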
<h2>Values</h2>
<p>Recoll currently stores no document values.</p>
<p>Omega stores two values: the MD5 hash of the file and the
last modification date (as Unix time). The MD5 value does not
appear to be used currently.</p>
<h2>Document data record format</h2>
<p>Recoll uses the same line-based, prefixed data record format
as Omega (name=value\n).</p>
<table border=1 cellspacing=0 width="90%">
<thead>
<tr><th>Prefix</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>url=</td><td>Full URL. Always file://abspath. The path is not
encoded to UTF-8: it is the system file name, usable as an
argument to open()</td><td>Same</td>
</tr>
<tr><td>mtype=</td><td>mime type (omega: type)</td><td>type=</td>
</tr>
<tr><td>fmtime=</td><td>file modification date</td><td>modtime=</td>
</tr>
<tr><td>dmtime=</td><td> document modification date</td><td>None</td>
</tr>
<tr><td>origcharset=</td><td> character set the text was
converted from</td><td>None</td>
</tr>
<tr><td>fbytes=</td><td> file size in bytes</td><td>size=</td>
</tr>
<tr><td>dbytes=</td><td>document size in bytes</td><td>None</td>
</tr>
<tr><td>ipath=</td><td>internal path for docs in multidoc
files</td><td>None</td>
</tr>
<tr><td>caption=</td><td>title of document, UTF-8</td><td>Same</td>
</tr>
<tr><td>keywords=</td><td>key words, UTF-8</td><td>None</td>
</tr>
<tr><td>abstract=</td><td>document abstract, UTF-8</td><td>sample=</td>
</tr>
</tbody>
</table>
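<p>Since the data record is a sequence of name=value lines, it can be
split into fields with very little code. The Python sketch below shows
one possible reader; the function name <code>parse_record</code> and
the sample record contents are made up for illustration, and only the
field names come from the table above.</p>
<pre>
```python
def parse_record(data):
    """Split a name=value record into a dict, one field per line."""
    fields = {}
    for line in data.split("\n"):
        if "=" not in line:
            continue  # skip blank or malformed lines
        # Split on the first "=" only, since values may contain "=".
        name, value = line.split("=", 1)
        fields[name] = value
    return fields


# Hypothetical record using the Recoll field names.
record = (
    "url=file:///home/me/notes.txt\n"
    "mtype=text/plain\n"
    "fbytes=1024\n"
    "caption=Notes\n"
)
fields = parse_record(record)
print(fields["mtype"])   # prints "text/plain"
print(fields["fbytes"])  # prints "1024"
```
</pre>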
</div>
<hr>
<address><a href="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois Dockes</a></address>
<!-- Created: Thu Dec 7 13:07:40 CET 2006 -->
<!-- hhmts start -->
Last modified: Thu Jun 14 11:14:38 CEST 2007
<!-- hhmts end -->
</body>
</html>