<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Recoll Index format</title>
<meta name="generator" content="HTML Tidy, see www.w3.org">
<meta name="Author" content="Jean-Francois Dockes">
<meta name="Description" content=
"Recoll is a personal full-text search tool for Unix and Linux, based on Xapian, a powerful and mature indexing engine.">
<meta name="Keywords" content=
"full text search,desktop,unix,linux,solaris,open source,free">
<meta http-equiv="Content-language" content="en">
<meta http-equiv="content-type" content=
"text/html; charset=iso-8859-1">
<meta name="robots" content="All,Index,Follow">
<link type="text/css" rel="stylesheet" href="styles/style.css">
</head>
<body>
<div class="content">
<h1>Recoll index format details</h1>
<p>A comparison of the index formats used by Recoll 1.17 and Omega
1.0.1.</p>
<p>Recoll does not stem terms before storing them. Terms are
converted to lowercase and stripped of accents, and an auxiliary
database handles stem expansion. Omega stores both raw
terms (with prefix R) and stemmed versions (with prefix Z).
The Xapian-side information here comes from the relevant
xapian-omega <a
href="http://xapian.org/docs/omega/termprefixes.html">documentation
page</a>.
</p>
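<p>The term normalization described above (lowercase, no accents, no
stemming) can be sketched as follows. This is only an illustration
using Python's standard <code>unicodedata</code> module: the function
name <code>normalize_term</code> is hypothetical, and Recoll's actual
unaccenting code may treat some characters differently.</p>

```python
import unicodedata


def normalize_term(term: str) -> str:
    """Lowercase a term and strip accents (illustrative sketch only)."""
    # Decompose accented characters into base letter + combining marks.
    decomposed = unicodedata.normalize("NFD", term.lower())
    # Drop the combining marks and keep the base letters. No stemming is done.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(normalize_term("Électricité"))
```

<p>Because no stemming happens at this stage, a plural such as
"Documents" would be stored as "documents", not "document".</p>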
<h2>Special prefixed terms:</h2>
<p>A comparison of prefixed term usage between Recoll and
omega/xapian.</p>
<table border=1 cellspacing=0 width="90%">
<thead>
<tr><th>Pref.</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>A</td><td>Author</td><td>Same</td></tr>
<tr><td>B</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>C</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>D</td><td>date: modification date of file, like
YYYYMMDD</td><td>Same</td></tr>
<tr><td>E</td><td>Unused. Recoll uses XE</td>
<td>file name extension folded to lowercase</td></tr>
<tr><td>F</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>G</td><td>Unused</td><td>Newsgroup / forum name</td></tr>
<tr><td>H</td><td>Unused</td><td>host name</td></tr>
<tr><td>I</td><td>Unused</td><td>"Can see"</td></tr>
<tr><td>J</td><td>Unused</td><td>Reserved</td></tr>
<tr><td>K</td><td>Keyword</td><td>Same</td></tr>
<tr><td>L</td><td>Unused</td><td>ISO language code</td></tr>
<tr><td>M</td><td>month: YYYYMM</td><td>Same</td></tr>
<tr><td>N</td><td>Unused</td><td>ISO country code</td></tr>
<tr><td>O</td><td>Unused</td><td>Owner</td></tr>
<tr><td>P</td><td>Unused</td><td>Path part of URL</td></tr>
<tr><td>Q</td><td>Unique Id. fs backend: truncated/hashed path+ipath.
Other backends may use a different unique id.
</td><td>Unique Id</td></tr>
<tr><td>R</td><td>Unused</td><td>Raw (unstemmed) term</td></tr>
<tr><td>S</td><td>Subject/title</td><td>Same</td></tr>
<tr><td>T</td><td>mime type</td><td>Same</td></tr>
<tr><td>U</td><td>Unused</td><td>Full URL of indexed
document. Truncated/hashed version of URL. Used for
duplicate checks.</td></tr>
<tr><td>V</td><td>Unused</td><td>"Can't see"</td></tr>
<tr><td>W</td><td>Unused</td><td>Owner</td></tr>
<tr><td>X</td><td>Lead-in character for multi-character prefixes</td>
<td>Same</td></tr>
<tr><td>Y</td><td>year YYYY</td><td>Same</td></tr>
<tr><td>Z</td><td>Unused</td><td>Stemmed term</td></tr>
<tr><td>XE</td><td>File name extension folded to lowercase
(omega uses E)</td><td>Unused</td></tr>
<tr><td>XP</td><td>Path elements (for phrase-based directory filtering)
</td><td>Unused</td></tr>
<tr><td>XSFN</td><td>utf8 lowercased/unaccented version of
file name. Used for specific file name searches. NOT SPLIT
(spaces as normal chars).</td><td>None</td></tr>
<tr><td>XTO</td><td>Recipient</td><td>None</td></tr>
<tr><td>XXST</td><td>Not really a prefix: start of field
marker (for anchored phrase searches)</td><td>None</td></tr>
<tr><td>XXND</td><td>Not really a prefix: end of field
marker (for anchored phrase searches)</td><td>None</td></tr>
</tbody>
</table>
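<p>As an illustration of the table above, the sketch below builds the
D, M and Y date terms and the file name extension term (XE for Recoll,
E for omega). It assumes that the prefix is simply prepended to the
value; the helper names <code>date_terms</code> and
<code>extension_term</code> are hypothetical and are not part of
Recoll or Xapian.</p>

```python
import os
from datetime import date


def date_terms(d: date) -> list:
    """Return the D (YYYYMMDD), M (YYYYMM) and Y (YYYY) terms for a date."""
    return [
        f"D{d.year:04d}{d.month:02d}{d.day:02d}",
        f"M{d.year:04d}{d.month:02d}",
        f"Y{d.year:04d}",
    ]


def extension_term(filename: str, prefix: str = "XE"):
    """Return the extension term, folded to lowercase, or None if no extension.

    Assumption: the prefix is prepended directly to the lowercased extension.
    Use prefix="XE" for Recoll and prefix="E" for omega.
    """
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    return prefix + ext if ext else None


if __name__ == "__main__":
    print(date_terms(date(2012, 2, 25)))
    print(extension_term("Report.PDF"))
    print(extension_term("Report.PDF", prefix="E"))
```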
<h2>Values</h2>
<table border=1 cellspacing=0 width="90%">
<thead>
<tr><th>Value slot</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>0</td><td>Unused</td><td>Unix modification time</td></tr>
<tr><td>1</td><td>MD5</td><td>Same</td></tr>
<tr><td>2</td><td>Unused</td><td>Size</td></tr>
<tr><td>10</td><td>Signature: value checked to decide whether the
document is up to date, i.e. mtime|size for the fs
backend</td><td>Unused</td></tr>
</tbody>
</table>
<h2>Document data record format</h2>
<p>Recoll uses the same line-based, prefixed data record format
as Omega (name=value\n). The Omega data below is quite out of
date.</p>
<table border=1 cellspacing=0 width="90%">
<thead>
<tr><th>Prefix</th><th>Recoll use</th><th>Omega use</th>
</tr>
</thead>
<tbody>
<tr><td>url=</td><td>Full URL. Always file://abspath. The path is not
encoded to utf-8: it is the system file name, usable as an
argument to open()</td><td>Same</td>
</tr>
<tr><td>mtype=</td><td>mime type (omega: type)</td><td>type=</td>
</tr>
<tr><td>fmtime=</td><td>file modification date</td><td>modtime=</td>
</tr>
<tr><td>dmtime=</td><td> document modification date</td><td>None</td>
</tr>
<tr><td>origcharset=</td><td> character set the text was
converted from</td><td>None</td>
</tr>
<tr><td>fbytes=</td><td> file size in bytes</td><td>size=</td>
</tr>
<tr><td>dbytes=</td><td>document size in bytes</td><td>None</td>
</tr>
<tr><td>ipath=</td><td>internal path for docs in multidoc
files</td><td>None</td>
</tr>
<tr><td>caption=</td><td>title of document, utf8</td><td>Same</td>
</tr>
<tr><td>keywords=</td><td>key words, utf8</td><td>None</td>
</tr>
<tr><td>abstract=</td><td>document abstract, utf8</td><td>sample=</td>
</tr>
</tbody>
</table>
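<p>A minimal sketch of reading such a data record, assuming one
name=value pair per line as described above. The function name
<code>parse_data_record</code> and the sample record are made up for
illustration. The sketch works on an already decoded string, whereas
a real <code>url=</code> value holds the raw system file name and may
not be valid utf-8.</p>

```python
def parse_data_record(data: str) -> dict:
    """Split a name=value\\n data record into a dict (illustrative sketch)."""
    fields = {}
    for line in data.split("\n"):
        if not line:
            continue  # skip the empty piece after the final newline
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value line, ignore it
        fields[name] = value
    return fields


# Hypothetical record using field names from the table above.
SAMPLE = (
    "url=file:///home/me/doc.txt\n"
    "mtype=text/plain\n"
    "fbytes=1234\n"
    "caption=My document\n"
)

if __name__ == "__main__":
    for name, value in parse_data_record(SAMPLE).items():
        print(name, value)
```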
</div>
<hr>
<address><a href="mailto:jfd@recoll.org">Jean-Francois Dockes</a></address>
<!-- Created: Thu Dec 7 13:07:40 CET 2006 -->
<!-- hhmts start -->
Last modified: Sat Feb 25 09:14:38 CEST 2012
<!-- hhmts end -->
</body>
</html>