Recoll index format details
A comparison of index formats for Recoll 1.17 and Omega 1.0.1
Recoll does not stem terms before storing them. Terms are converted to lowercase and stripped of accents, and an auxiliary database handles stem expansion. Omega stores both raw terms (with prefix R) and stemmed versions (with prefix Z). The Xapian-side information here comes from the relevant xapian-omega documentation page.
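The term normalization described above can be approximated as follows. This is only a minimal sketch: Recoll uses its own unaccenting code, and the `fold_term` helper below is a hypothetical stand-in built on Python's standard `unicodedata` module.

```python
import unicodedata


def fold_term(term: str) -> str:
    """Lowercase a term and drop combining accents (illustrative only).

    Assumption: canonical decomposition (NFD) followed by removal of
    combining marks is close enough to Recoll's unaccenting for a demo.
    No stemming is applied, matching the description above.
    """
    decomposed = unicodedata.normalize("NFD", term)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.lower()


if __name__ == "__main__":
    for word in ("Élève", "Running", "naïve"):
        print(word, "->", fold_term(word))
```

Note that "Running" stays "running" rather than becoming "run": stem expansion is left to the auxiliary database at query time.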
Special prefixed terms:
A comparison of prefixed term usage between Recoll and omega/xapian.
Pref. | Recoll use | Omega use |
---|---|---|
A | Author | Same |
B | Unused | Reserved |
C | Unused | Reserved |
D | date: modification date of file, like YYYYMMDD | Same |
E | Unused. Recoll uses XE | file name extension folded to lowercase |
F | Unused | Reserved |
G | Unused | Newsgroup / forum name |
H | Unused | host name |
I | Unused | "Can see" |
J | Unused | Reserved |
K | Keyword | Same |
L | Unused | ISO language code |
M | month: YYYYMM | Same |
N | Unused | ISO country code |
O | Unused | Owner |
P | Unused | Path part of URL |
Q | Unique Id. fs backend: truncated/hashed path+ipath. Other backends may use a different unique id. | Unique Id |
R | Unused | Raw (unstemmed) term |
S | Subject/title | Same |
T | mime type | Same |
U | Unused | Full Url of indexed document. Truncated/hashed version of URL. Used for duplicate checks. |
V | Unused | "Can't see" |
W | Unused | Owner |
X | Lead character for multi-character prefixes | Same |
Y | year YYYY | Same |
Z | Unused | Stemmed term |
XE | File name extension folded to lowercase (omega uses E) | Unused |
XP | Path elements (for phrase-based directory filtering) | Unused |
XSFN | utf8 lowercased/unaccented version of file name. Used for specific file name searches. NOT SPLIT (spaces as normal chars). | None |
XTO | Recipient | None |
XXST | Not really a prefix: start of field marker (for anchored phrase searches) | None |
XXND | Not really a prefix: end of field marker (for anchored phrase searches) | None |
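As an illustration of how a few of the prefixes in the table combine with their values, here is a minimal sketch. It assumes that a prefixed term is simply the prefix followed by the value; the helper names `date_terms` and `extension_term` are hypothetical and are not part of Recoll.

```python
import os
from datetime import date
from typing import List, Optional


def date_terms(d: date) -> List[str]:
    """Build the D (YYYYMMDD), M (YYYYMM) and Y (YYYY) terms for a date."""
    ymd = f"{d.year:04d}{d.month:02d}{d.day:02d}"
    return ["D" + ymd, "M" + ymd[:6], "Y" + ymd[:4]]


def extension_term(filename: str) -> Optional[str]:
    """Build the XE term (lowercased file name extension), or None."""
    ext = os.path.splitext(filename)[1].lstrip(".")
    return "XE" + ext.lower() if ext else None


if __name__ == "__main__":
    print(date_terms(date(2012, 3, 5)))
    print(extension_term("Report.PDF"))
```

The multi-character XE prefix starts with X, which is what lets it coexist with the single-letter prefixes.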
Values
Value slot | Recoll use | Omega use |
---|---|---|
0 | Unused | Unix modification time |
1 | MD5 | Same |
2 | Unused | Size |
10 | Signature: value checked to determine whether the document is up to date, i.e. mtime and size for the fs backend | Unused |
Document data record format
Recoll uses the same line-based, prefixed data record format as Omega (name=value\n). The Omega data below is quite out of date.
Prefix | Recoll use | Omega use |
---|---|---|
url= | Full URL. Always file://abspath. The path is not converted to UTF-8: it is the system file name, usable as an argument to open() | Same |
mtype= | mime type (omega: type) | type= |
fmtime= | file modification date | modtime= |
dmtime= | document modification date | None |
origcharset= | character set the text was converted from | None |
fbytes= | file size in bytes | size= |
dbytes= | document size in bytes | None |
ipath= | internal path for docs in multidoc files | None |
caption= | title of document, utf8 | Same |
keywords= | key words, utf8 | None |
abstract= | document abstract, utf8 | sample= |
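A record in this name=value format can be read with a few lines of code. The sketch below is not Recoll's own parser. It keeps values as bytes because the url= field holds the raw system file name rather than UTF-8 text. The `parse_data_record` name and the sample record contents are invented for illustration.

```python
from typing import Dict


def parse_data_record(record: bytes) -> Dict[str, bytes]:
    """Split a name=value\\n data record into a dict of raw byte values."""
    fields: Dict[str, bytes] = {}
    for line in record.split(b"\n"):
        # Split on the first '=' only, so values may themselves contain '='.
        name, sep, value = line.partition(b"=")
        if not sep:
            continue  # skip empty or malformed lines
        fields[name.decode("ascii")] = value
    return fields


# Hypothetical sample record; the field names come from the table above.
sample = (
    b"url=file:///home/me/doc.txt\n"
    b"mtype=text/plain\n"
    b"fbytes=1234\n"
    b"caption=My doc\n"
)

if __name__ == "__main__":
    for name, value in parse_data_record(sample).items():
        print(name, value)
```

Text fields such as caption= can then be decoded as UTF-8 by the caller, while url= should be passed to open() undecoded.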