|
a/website/rclidxfmt.html |
|
b/website/rclidxfmt.html |
|
... |
|
... |
17 |
|
17 |
|
18 |
<body>
|
18 |
<body>
|
19 |
<div class="content">
|
19 |
<div class="content">
|
20 |
<h1>Recoll index format details</h1>
|
20 |
<h1>Recoll index format details</h1>
|
21 |
|
21 |
|
22 |
<p>A comparison of index formats for recoll 1.8 and omega
|
22 |
<p>A comparison of index formats for recoll 1.17 and omega
|
23 |
1.0.1</p>
|
23 |
1.0.1</p>
|
24 |
|
24 |
|
25 |
<p>Recoll terms are not stemmed before being stored. They are converted to
|
25 |
<p>Recoll terms are not stemmed before being stored. They are converted to
|
26 |
lowercase with accents removed. An auxiliary database
|
26 |
lowercase with accents removed. An auxiliary database
|
27 |
handles stem expansion. Omega stores both raw
|
27 |
handles stem expansion. Omega stores both raw
|
28 |
terms and stemmed versions (with prefix Z)</p>
|
28 |
terms (with prefix R) and stemmed versions (with prefix Z).
|
|
|
29 |
The xapian-side of the information here comes from the relevant
|
|
|
30 |
xapian-omega <a
|
|
|
31 |
href="http://xapian.org/docs/omega/termprefixes.html">documentation
|
|
|
32 |
page</a>.
|
|
|
33 |
</p>
|
29 |
|
34 |
|
30 |
<h2>Special prefixed terms:</h2>
|
35 |
<h2>Special prefixed terms:</h2>
|
31 |
|
36 |
|
32 |
<p>A comparison of prefixed term usage between Recoll and
|
37 |
<p>A comparison of prefixed term usage between Recoll and
|
33 |
omega/xapian. <em>xapian-core</em> in the Omega column means
|
38 |
omega/xapian.</p>
|
34 |
that the prefix is not used by Omega, but mentioned as
|
|
|
35 |
allocated in the xapian prefix definition document.</p>
|
|
|
36 |
|
39 |
|
37 |
<table border=1 cellspacing=0 width="90%">
|
40 |
<table border=1 cellspacing=0 width="90%">
|
38 |
<thead>
|
41 |
<thead>
|
39 |
<tr><th>Pref.</th><th>Recoll use</th><th>Omega use</th>
|
42 |
<tr><th>Pref.</th><th>Recoll use</th><th>Omega use</th>
|
40 |
</tr>
|
43 |
</tr>
|
41 |
</thead>
|
44 |
</thead>
|
42 |
<tbody>
|
45 |
<tbody>
|
|
|
46 |
<tr><td>A</td><td>Author</td><td>Same</td></tr>
|
|
|
47 |
|
|
|
48 |
<tr><td>B</td><td>Unused</td><td>Reserved</td></tr>
|
|
|
49 |
<tr><td>C</td><td>Unused</td><td>Reserved</td></tr>
|
|
|
50 |
|
|
|
51 |
<tr><td>D</td><td>date: modification date of file, like
|
|
|
52 |
YYYYMMDD</td><td>Same</td></tr>
|
|
|
53 |
|
|
|
54 |
<tr><td>E</td><td>Unused. Recoll uses XE</td>
|
|
|
55 |
<td>file name extension folded to lowercase</td></tr>
|
|
|
56 |
|
|
|
57 |
|
|
|
58 |
<tr><td>F</td><td>Unused</td><td>Reserved</td></tr>
|
|
|
59 |
<tr><td>G</td><td>Unused</td><td>newsgroup / forum name</td></tr>
|
|
|
60 |
|
|
|
61 |
<tr><td>H</td><td>Unused</td><td>host name</td></tr>
|
|
|
62 |
|
|
|
63 |
<tr><td>I</td><td>Unused</td><td>"Can see"</td></tr>
|
|
|
64 |
|
|
|
65 |
<tr><td>J</td><td>Unused</td><td>Reserved</td></tr>
|
|
|
66 |
<tr><td>K</td><td>Keyword</td><td>Same</td></tr>
|
|
|
67 |
|
|
|
68 |
<tr><td>L</td><td>Unused</td><td>ISO language code</td></tr>
|
|
|
69 |
|
|
|
70 |
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td></tr>
|
|
|
71 |
|
|
|
72 |
<tr><td>N</td><td>Unused</td><td>ISO country code</td></tr>
|
|
|
73 |
|
|
|
74 |
<tr><td>O</td><td>Unused</td><td>Owner</td></tr>
|
|
|
75 |
|
|
|
76 |
<tr><td>P</td><td>Unused</td><td>Path part of URL</td></tr>
|
|
|
77 |
|
|
|
78 |
<tr><td>Q</td><td>Unique Id. fs backend: truncated/hashed path+ipath.
|
|
|
79 |
Other backends may use a different unique id.
|
|
|
80 |
</td><td>Unique Id</td></tr>
|
|
|
81 |
|
|
|
82 |
<tr><td>R</td><td>Unused</td><td>Raw (unstemmed) term</td></tr>
|
|
|
83 |
|
|
|
84 |
<tr><td>S</td><td>Subject/title</td><td>Same</td></tr>
|
|
|
85 |
|
43 |
<tr><td>T</td><td>mime type</td><td>Same</td>
|
86 |
<tr><td>T</td><td>mime type</td><td>Same</td></tr>
|
|
|
87 |
|
|
|
88 |
<tr><td>U</td><td>Unused</td><td>Full Url of indexed
|
|
|
89 |
document. Truncated/hashed version of URL. Used for
|
|
|
90 |
duplicate checks.</td></tr>
|
|
|
91 |
|
|
|
92 |
<tr><td>V</td><td>Unused</td><td>"Can't see"</td></tr>
|
|
|
93 |
|
|
|
94 |
<tr><td>W</td><td>Unused</td><td>Owner</td></tr>
|
|
|
95 |
|
|
|
96 |
<tr><td>X</td><td>Lead character for multi-character prefixes</td>
|
|
|
97 |
<td>Same</td></tr>
|
|
|
98 |
|
|
|
99 |
<tr><td>Y</td><td>year YYYY</td><td>Same</td></tr>
|
|
|
100 |
|
|
|
101 |
<tr><td>Z</td><td>Unused</td><td>Stemmed term</td></tr>
|
|
|
102 |
|
|
|
103 |
<tr><td>XE</td><td>File name extension folded to lowercase
|
|
|
104 |
(omega uses E)</td><td>Unused</td></tr>
|
|
|
105 |
|
|
|
106 |
<tr><td>XP</td><td>Path elements (for phrase-based directory filtering)
|
|
|
107 |
</td><td>Unused</td></tr>
|
|
|
108 |
|
|
|
109 |
<tr><td>XSFN</td><td>utf8 lowercased/unaccented version of
|
|
|
110 |
file name. Used for specific file name searches. NOT SPLIT
|
|
|
111 |
(spaces as normal chars).</td><td>None</td>
|
|
|
112 |
|
|
|
113 |
<tr><td>XTO</td><td>Recipient</td><td>None</td>
|
|
|
114 |
<tr><td>XXST</td><td>Not really a prefix: start of field
|
|
|
115 |
marker (for anchored phrase searches)</td><td>None</td>
|
|
|
116 |
<tr><td>XXND</td><td>Not really a prefix: end of field
|
|
|
117 |
marker (for anchored phrase searches)</td><td>None</td>
|
|
|
118 |
|
44 |
</tr>
|
119 |
</tr>
|
45 |
|
120 |
|
46 |
<tr><td>P</td><td>Truncated/hashed version of file path. For
|
|
|
47 |
single-document files, and for the file part of a
|
|
|
48 |
multi-document file. Used for up-to-date checks and for
|
|
|
49 |
retrieving a document by path. </td><td>Path part of URL (no
|
|
|
50 |
hashing). Uses U for the equivalent
|
|
|
51 |
term used for up to date checks.</td>
|
|
|
52 |
</tr>
|
|
|
53 |
|
|
|
54 |
<tr><td>Q</td><td>pathhash+ipath same + internal path for
|
|
|
55 |
documents inside multi-document files. Used to set the
|
|
|
56 |
existence flag for subdocs when a multi-document file is found
|
|
|
57 |
to be up to date, or for deleting all subdocs for a file, or
|
|
|
58 |
for retrieving a document by path+ipath. Compatible
|
|
|
59 |
with Q definition in xapian/termprefixes.txt: unique
|
|
|
60 |
identifier.</td><td>None</td>
|
|
|
61 |
</tr>
|
|
|
62 |
|
|
|
63 |
<tr><td>D</td><td>date: modification date of file, like
|
|
|
64 |
YYYYMMDD</td><td>Same</td>
|
|
|
65 |
</tr>
|
|
|
66 |
|
|
|
67 |
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td>
|
|
|
68 |
</tr>
|
|
|
69 |
<tr><td>Y</td><td>year YYYY</td><td>Same</td>
|
|
|
70 |
</tr>
|
|
|
71 |
|
|
|
72 |
<tr><td>XSFN</td><td>utf8 version of file name. Used for specific
|
|
|
73 |
file name searches</td><td>None</td>
|
|
|
74 |
</tr>
|
|
|
75 |
<tr><td>U</td><td>None</td><td>Url term. Truncated/hashed version
|
|
|
76 |
of URL. Used for duplicate checks.</td>
|
|
|
77 |
</tr>
|
|
|
78 |
|
|
|
79 |
<tr><td>S</td><td>Subject/title</td><td>xapian-core</td>
|
|
|
80 |
</tr>
|
|
|
81 |
<tr><td>A</td><td>Author</td><td>xapian-core</td>
|
|
|
82 |
</tr>
|
|
|
83 |
<tr><td>K</td><td>Keyword</td><td>xapian-core</td>
|
|
|
84 |
</tr>
|
|
|
85 |
|
121 |
|
86 |
</tbody>
|
122 |
</tbody>
|
87 |
</table>
|
123 |
</table>
|
88 |
|
124 |
|
89 |
<p>None of the "date" terms are currently used by recoll queries</p>
|
|
|
90 |
|
125 |
|
91 |
<h2>Values</h2>
|
126 |
<h2>Values</h2>
|
92 |
<p>Recoll currently stores no document values.</p>
|
127 |
|
93 |
<p>Omega stores 2 values, for the md5 hash of the file, and the
|
128 |
<table border=1 cellspacing=0 width="90%">
|
94 |
last modification date (as unix time). The md5 value doesn't
|
129 |
<thead>
|
95 |
appear to be currently used ?</p>
|
130 |
<tr><th>Value slot</th><th>Recoll use</th><th>Omega use</th>
|
|
|
131 |
</tr>
|
|
|
132 |
</thead>
|
|
|
133 |
<tbody>
|
|
|
134 |
<tr><td>0</td><td>Unused</td><td>Unix modification time</td></tr>
|
|
|
135 |
<tr><td>1</td><td>MD5</td><td>Same</td></tr>
|
|
|
136 |
<tr><td>2</td><td>Unused</td><td>Size</td></tr>
|
|
|
137 |
<tr><td>10</td><td>Signature: value to be checked for
|
|
|
138 |
up-to-dateness, i.e. mtime|size for the fs
|
|
|
139 |
backend</td><td>Unused</td></tr>
|
|
|
140 |
</tbody>
|
|
|
141 |
</table>
|
|
|
142 |
|
96 |
|
143 |
|
97 |
<h2>Document data record format</h2>
|
144 |
<h2>Document data record format</h2>
|
|
|
145 |
|
98 |
<p>Recoll has the same line based / prefixed data record format
|
146 |
<p>Recoll has the same line based / prefixed data record format
|
99 |
as omega (name=value\n).</p>
|
147 |
as omega (name=value\n). The Omega data below is quite out of
|
|
|
148 |
date.</p>
|
100 |
|
149 |
|
101 |
<table border=1 cellspacing=0 width="90%">
|
150 |
<table border=1 cellspacing=0 width="90%">
|
102 |
<thead>
|
151 |
<thead>
|
103 |
<tr><th>Prefix</th><th>Recoll use</th><th>Omega use</th>
|
152 |
<tr><th>Prefix</th><th>Recoll use</th><th>Omega use</th>
|
104 |
</tr>
|
153 |
</tr>
|
|
... |
|
... |
139 |
|
188 |
|
140 |
<hr>
|
189 |
<hr>
|
141 |
<address><a href="mailto:jfd@recoll.org">Jean-Francois Dockes</a></address>
|
190 |
<address><a href="mailto:jfd@recoll.org">Jean-Francois Dockes</a></address>
|
142 |
<!-- Created: Thu Dec 7 13:07:40 CET 2006 -->
|
191 |
<!-- Created: Thu Dec 7 13:07:40 CET 2006 -->
|
143 |
<!-- hhmts start -->
|
192 |
<!-- hhmts start -->
|
144 |
Last modified: Thu Jun 14 11:14:38 CEST 2007
|
193 |
Last modified: Sat Feb 25 09:14:38 CEST 2012
|
145 |
<!-- hhmts end -->
|
194 |
<!-- hhmts end -->
|
146 |
</body>
|
195 |
</body>
|
147 |
</html>
|
196 |
</html>
|
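As an illustration of the prefixed-term table in the new version of the page, the following is a minimal sketch of how the `D`, `M`, `Y` and `XE` terms could be derived for a file. It is not taken from the Recoll sources: the function names `date_terms` and `extension_term` are hypothetical, and the use of UTC for the date conversion is an assumption (the indexer may well use local time).

```python
import os
from datetime import datetime, timezone


def date_terms(mtime):
    """Return the D/M/Y prefixed terms for a Unix modification time.

    Hypothetical helper: the formats follow the table
    (D + YYYYMMDD, M + YYYYMM, Y + YYYY).
    """
    # Assumption: UTC is used here so the result is deterministic;
    # the real indexer may use local time instead.
    stamp = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return [
        "D" + stamp.strftime("%Y%m%d"),
        "M" + stamp.strftime("%Y%m"),
        "Y" + stamp.strftime("%Y"),
    ]


def extension_term(filename):
    """Return the XE term (extension folded to lowercase), or None."""
    ext = os.path.splitext(filename)[1].lstrip(".")
    return "XE" + ext.lower() if ext else None


if __name__ == "__main__":
    print(date_terms(0))
    print(extension_term("Report.PDF"))
```

Keeping the prefix as a plain string concatenated in front of the value mirrors the table layout, where each row is simply a prefix plus a description of the value that follows it.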