
a/website/rclidxfmt.html b/website/rclidxfmt.html
...
...
17
17
18
  <body>
18
  <body>
19
    <div class="content">
19
    <div class="content">
20
    <h1>Recoll index format details</h1>
20
    <h1>Recoll index format details</h1>
21
21
22
    <p>A comparison of index formats for recoll 1.8 and omega
22
    <p>A comparison of index formats for recoll 1.17 and omega
23
    1.0.1</p>
23
      1.0.1</p>
24
24
25
    <p>Recoll terms are not stemmed before being stored. They are turned to
25
    <p>Recoll terms are not stemmed before being stored. They are turned to
26
      all minuscule letters with no accents. An auxiliary database
26
      all minuscule letters with no accents. An auxiliary database
27
      handles stem expansion. Omega stores both raw
27
      handles stem expansion. Omega stores both raw
28
      terms and stemmed versions (with prefix Z)</p>
28
      terms (with prefix R) and stemmed versions (with prefix Z).
29
      The xapian-side of the information here comes from the relevant
30
      xapian-omega <a
31
      href="http://xapian.org/docs/omega/termprefixes.html">documentation
32
      page</a>. 
33
    </p>
29
34
30
    <h2>Special prefixed terms:</h2>
35
    <h2>Special prefixed terms:</h2>
31
36
32
    <p>A comparison of prefixed term usage between Recoll and
37
    <p>A comparison of prefixed term usage between Recoll and
33
      omega/xapian. <em>xapian-core</em> in the Omega column means
38
      omega/xapian.</p>
34
      that the prefix is not used by Omega, but mentioned as
35
      allocated in the xapian prefix definition document.</p>
36
39
37
    <table border=1 cellspacing=0 width="90%">
40
    <table border=1 cellspacing=0 width="90%">
38
    <thead>
41
    <thead>
39
    <tr><th>Pref.</th><th>Recoll use</th><th>Omega use</th>
42
    <tr><th>Pref.</th><th>Recoll use</th><th>Omega use</th>
40
    </tr>
43
    </tr>
41
      </thead>
44
      </thead>
42
      <tbody>
45
      <tbody>
46
  <tr><td>A</td><td>Author</td><td>Same</td></tr>
47
48
  <tr><td>B</td><td>Unused</td><td>Reserved</td></tr>
49
  <tr><td>C</td><td>Unused</td><td>Reserved</td></tr>
50
51
  <tr><td>D</td><td>date: modification date of file, like
52
      YYYYMMDD</td><td>Same</td></tr>
53
54
        <tr><td>E</td><td>Unused. Recoll uses XE</td>
55
          <td>file name extension folded to lowercase</td></tr>
56
57
58
  <tr><td>F</td><td>Unused</td><td>Reserved</td></tr>
59
  <tr><td>G</td><td>Unused</td><td>newsgroup / forum name</td></tr>
60
61
  <tr><td>H</td><td>Unused</td><td>host name</td></tr>
62
63
  <tr><td>I</td><td>Unused</td><td>"Can see"</td></tr>
64
65
  <tr><td>J</td><td>Unused</td><td>Reserved</td></tr>
66
  <tr><td>K</td><td>Keyword</td><td>Same</td></tr>
67
68
  <tr><td>L</td><td>Unused</td><td>ISO language code</td></tr>
69
70
  <tr><td>M</td><td>month: YYYYMM</td><td>Same</td></tr>
71
72
  <tr><td>N</td><td>Unused</td><td>ISO country code</td></tr>
73
74
  <tr><td>O</td><td>Unused</td><td>Owner</td></tr>
75
76
  <tr><td>P</td><td>Unused</td><td>Path part of URL</td></tr>
77
78
  <tr><td>Q</td><td>Unique Id. fs backend: trunc-hashed path+ipath.
79
      Other backends may use a different unique id.
80
    </td><td>Unique Id</td></tr>
81
82
  <tr><td>R</td><td>Unused</td><td>Raw (unstemmed) term</td></tr>
83
84
  <tr><td>S</td><td>Subject/title</td><td>Same</td></tr>
85
43
    <tr><td>T</td><td>mime type</td><td>Same</td>
86
    <tr><td>T</td><td>mime type</td><td>Same</td></tr>
87
88
  <tr><td>U</td><td>Unused</td><td>Full Url of indexed
89
      document. Truncated/hashed version of URL. Used for
90
      duplicate checks.</td></tr> 
91
92
  <tr><td>V</td><td>Unused</td><td>"Can't see"</td></tr>
93
94
  <tr><td>W</td><td>Unused</td><td>Owner</td></tr>
95
96
  <tr><td>X</td><td>Prefix prefix for multichar prefixes</td>
97
          <td>Same</td></tr>
98
99
  <tr><td>Y</td><td>year YYYY</td><td>Same</td></tr>
100
101
  <tr><td>Z</td><td>Unused</td><td>Stemmed term</td></tr>
102
103
        <tr><td>XE</td><td>File name extension folded as lowercase
104
            (omega uses E)</td><td>Unused</td></tr>
105
106
        <tr><td>XP</td><td>Path elements (for phrase-based directory filtering)
107
          </td><td>Unused</td></tr>
108
109
  <tr><td>XSFN</td><td>utf8 lowercased/unaccented version of
110
      file name. Used for specific file name searches. NOT SPLIT
111
      (spaces as normal chars).</td><td>None</td>
112
113
  <tr><td>XTO</td><td>Recipient</td><td>None</td>
114
  <tr><td>XXST</td><td>Not really a prefix: start of field
115
      marker (for anchored phrase searches)</td><td>None</td>
116
  <tr><td>XXND</td><td>Not really a prefix: end of field
117
      marker (for anchored phrase searches)</td><td>None</td>
118
44
    </tr>
119
    </tr>
45
120
46
  <tr><td>P</td><td>Truncated/hashed version of file path. For
47
  single-document files, and for the file part of a
48
  multi-document file. Used for up-to-date checks and for
49
  retrieving a document by path. </td><td>Path part of URL (no
50
  hashing). Uses U for the equivalent
51
  term used for up to date checks.</td> 
52
  </tr>
53
54
  <tr><td>Q</td><td>pathhash+ipath same + internal path for
55
  documents inside multi-document files. Used to set the
56
  existence flag for subdocs when a multi-document file is found
57
  to be up to date, or for deleting all subdocs for a file, or
58
  for retrieving a document by path+ipath. Compatible
59
  with Q definition in xapian/termprefixes.txt: unique
60
  identifier.</td><td>None</td> 
61
  </tr>
62
63
  <tr><td>D</td><td>date: modification date of file, like
64
  YYYYMMDD</td><td>Same</td>
65
  </tr>
66
67
  <tr><td>M</td><td>month: YYYYMM</td><td>Same</td>
68
  </tr>
69
  <tr><td>Y</td><td>year YYYY</td><td>Same</td>
70
  </tr>
71
72
  <tr><td>XSFN</td><td>utf8 version of file name. Used for specific
73
  file name searches</td><td>None</td>
74
  </tr>
75
  <tr><td>U</td><td>None</td><td>Url term. Truncated/hashed version
76
      of URL. Used for duplicate checks.</td>
77
  </tr>
78
79
  <tr><td>S</td><td>Subject/title</td><td>xapian-core</td>
80
  </tr>
81
  <tr><td>A</td><td>Author</td><td>xapian-core</td>
82
  </tr>
83
  <tr><td>K</td><td>Keyword</td><td>xapian-core</td>
84
  </tr>
85
    
121
    
86
      </tbody>
122
      </tbody>
87
    </table>
123
    </table>
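
The prefix conventions in the table above can be illustrated with a small sketch. This is not Recoll or Omega code: the helper names `split_prefix` and `date_terms` are hypothetical, and the sketch assumes that term values never start with an uppercase ASCII letter (which holds for Recoll terms, since they are lowercased before storage) and that dates are taken in UTC.

```python
from datetime import datetime, timezone


def split_prefix(term: str) -> tuple[str, str]:
    """Split an index term into (prefix, value).

    Hypothetical helper. Single-letter prefixes are one uppercase ASCII
    letter; multichar prefixes start with X and run over the following
    uppercase letters (XE, XP, XSFN, XTO, ...).
    Assumes the value itself never begins with an uppercase letter.
    """
    if not term or not ("A" <= term[0] <= "Z"):
        return "", term  # plain, unprefixed term
    if term[0] != "X":
        return term[0], term[1:]  # single-letter prefix
    end = 1
    while end < len(term) and "A" <= term[end] <= "Z":
        end += 1
    return term[:end], term[end:]


def date_terms(mtime: float) -> list[str]:
    """Build the D (YYYYMMDD), M (YYYYMM) and Y (YYYY) terms for a
    modification time given as Unix seconds. Using UTC is an assumption."""
    dt = datetime.fromtimestamp(mtime, tz=timezone.utc)
    return [
        "D" + dt.strftime("%Y%m%d"),
        "M" + dt.strftime("%Y%m"),
        "Y" + dt.strftime("%Y"),
    ]
```

For instance, `split_prefix("XSFNmy file.txt")` keeps the space inside the value, matching the "NOT SPLIT" note for XSFN.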
88
124
89
    <p>None of the "date" terms are currently used by recoll queries</p>
90
125
91
    <h2>Values</h2>
126
    <h2>Values</h2>
92
    <p>Recoll currently stores no document values.</p>
127
93
    <p>Omega stores 2 values, for the md5 hash of the file, and the
128
    <table border=1 cellspacing=0 width="90%">
94
      last modification date (as unix time). The md5 value doesn't
129
  <thead>
95
      appear to be currently used ?</p>
130
  <tr><th>Value slot</th><th>Recoll use</th><th>Omega use</th>
131
  </tr>
132
      </thead>
133
      <tbody>
134
  <tr><td>0</td><td>Unused</td><td>Unix modification time</td></tr>
135
  <tr><td>1</td><td>MD5</td><td>Same</td></tr>
136
  <tr><td>2</td><td>Unused</td><td>Size</td></tr>
137
  <tr><td>10</td><td>Signature: value to be checked for
138
      up-to-dateness, i.e. mtime|size for the fs
139
      backend</td><td>Unused</td></tr> 
140
      </tbody>
141
    </table>
142
96
143
97
    <h2>Document data record format</h2>
144
    <h2>Document data record format</h2>
145
98
      <p>Recoll has the same line based / prefixed data record format
146
      <p>Recoll has the same line based / prefixed data record format
99
      as omega (name=value\n).</p>
147
      as omega (name=value\n). The Omega data below is quite out of
148
      date.</p>
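
As a rough illustration of the line based name=value\n layout, a data record could be parsed as sketched here. The function name `parse_data_record` and the field names in the usage note (`url`, `mtype`) are illustrative assumptions, not taken from this page; only the name=value-per-line structure comes from the text.

```python
def parse_data_record(data: str) -> dict[str, str]:
    """Parse a document data record made of name=value lines.

    Hypothetical helper: splits on newlines, then on the first '=' of
    each line, so values may themselves contain '='. Lines without '='
    and empty lines are skipped.
    """
    fields: dict[str, str] = {}
    for line in data.split("\n"):
        if not line:
            continue
        name, sep, value = line.partition("=")
        if not sep:
            continue  # not a name=value line
        fields[name] = value
    return fields
```

A record such as `"url=file:///tmp/a.txt\nmtype=text/plain\n"` would yield a dictionary with the two keys `url` and `mtype`.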
100
149
101
    <table border=1 cellspacing=0 width="90%">
150
    <table border=1 cellspacing=0 width="90%">
102
    <thead>
151
    <thead>
103
    <tr><th>Prefix</th><th>Recoll use</th><th>Omega use</th>
152
    <tr><th>Prefix</th><th>Recoll use</th><th>Omega use</th>
104
    </tr>
153
    </tr>
...
...
139
188
140
    <hr>
189
    <hr>
141
    <address><a href="mailto:jfd@recoll.org">Jean-Francois Dockes</a></address>
190
    <address><a href="mailto:jfd@recoll.org">Jean-Francois Dockes</a></address>
142
<!-- Created: Thu Dec  7 13:07:40 CET 2006 -->
191
<!-- Created: Thu Dec  7 13:07:40 CET 2006 -->
143
<!-- hhmts start -->
192
<!-- hhmts start -->
144
Last modified: Thu Jun 14 11:14:38 CEST 2007
193
Last modified: Sat Feb 25 09:14:38 CEST 2012
145
<!-- hhmts end -->
194
<!-- hhmts end -->
146
  </body>
195
  </body>
147
</html>
196
</html>