|
a/website/recoll_XMP/index.txt |
|
b/website/recoll_XMP/index.txt |
|
... |
|
... |
6 |
handler, which differs a lot from doing something equivalent with the
|
6 |
handler, which differs a lot from doing something equivalent with the
|
7 |
current Python-based one (for which XMP capability is available from
|
7 |
current Python-based one (for which XMP capability is available from
|
8 |
recoll 1.23.2, but the new handler can be used with previous Recoll
|
8 |
recoll 1.23.2, but the new handler can be used with previous Recoll
|
9 |
versions).
|
9 |
versions).
|
10 |
|
10 |
|
11 |
This page was adapted from the text by Jeffrey Dick, using input from
|
11 |
I based this page on the text by Jeffrey Dick, using input from Johannes
|
12 |
Johannes Menzel, (especially the result list paragraph format),
|
12 |
Menzel for all examples about the new features. The discussion which led to
|
13 |
adapting things for the new handler. The discussion which led to the
|
|
|
14 |
updated handler is a
|
13 |
the updated handler is a
|
15 |
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
|
14 |
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
|
16 |
Recoll issue].
|
15 |
Recoll issue].
|
17 |
|
16 |
|
18 |
== Introduction
|
17 |
== Introduction
|
19 |
|
18 |
|
|
... |
|
... |
40 |
field is a good place to tag the PDF with any words of your choosing
|
39 |
field is a good place to tag the PDF with any words of your choosing
|
41 |
to describe genre, topic, etc.
|
40 |
to describe genre, topic, etc.
|
42 |
|
41 |
|
43 |
image::jabref_metadata.png[Editing metadata with jabref]
|
42 |
image::jabref_metadata.png[Editing metadata with jabref]
|
44 |
|
43 |
|
45 |
== Custom indexing (fields file)
|
44 |
== Custom indexing short example (fields file)
|
46 |
|
45 |
|
47 |
Let's create two fields named "year" and "journal". The prefixes
|
46 |
The following example (an extract from the complete configuration shown later)
|
48 |
starting with "XY" are extension prefixes that are added to the terms
|
47 |
creates two fields named "refjournal" and "refpages", which are both stored
|
49 |
in the Xapian database (Recoll internally does not use prefixes
|
48 |
(so they can be displayed in result list entries), and indexed (you can
|
50 |
starting with XY). Additionally, the year and journal are stored so
|
49 |
specifically search them).
|
51 |
they can be displayed in the results list. Some other types of
|
50 |
|
52 |
metadata, such as title, author and keywords, are already indexed by
|
51 |
Some other types of metadata, such as title, author and keywords, are
|
53 |
Recoll (the default rclpdf finds them using the *pdftotext*
|
52 |
already indexed by Recoll (the default rclpdf finds them using the
|
54 |
command) so there is no need to add those to the [prefixes] section.
|
53 |
*pdftotext* command) so there is no need to add those to the [prefixes]
|
|
|
54 |
section.
|
55 |
|
55 |
|
56 |
Add this text to the fields file in your Recoll configuration
|
56 |
This is taken from the `fields` file inside the configuration directory
|
57 |
directory ('~/.recoll/fields').
|
57 |
(e.g. '~/.recoll/fields').
|
|
|
58 |
|
58 |
|
59 |
|
59 |
----
|
60 |
----
|
60 |
[prefixes]
|
61 |
[prefixes]
|
61 |
year = XYEAR
|
62 |
refjournal=RFJOURNAL
|
62 |
journal = XYJOUR
|
63 |
refpages=RFPAGES
|
63 |
|
64 |
|
64 |
[stored]
|
65 |
[stored]
|
65 |
bibtex:year =
|
66 |
refjournal =
|
66 |
bibtex:journal =
|
67 |
refpages =
|
|
|
68 |
|
|
|
69 |
[aliases]
|
|
|
70 |
refjournal = bibtex:journal bibtex:journaltitle
|
|
|
71 |
refpages = bibtex:pages
|
67 |
----
|
72 |
----
|
68 |
|
73 |
|
69 |
== Telling the handler what fields to extract
|
74 |
== Telling the handler what fields to extract
|
70 |
|
75 |
|
71 |
As of Recoll 1.23.2, the PDF handler has the capability to use
|
76 |
As of Recoll 1.23.2, the PDF handler has the capability to use *pdfinfo*
|
72 |
*pdfinfo* for extracting XMP metadata. The switch for executing *pdfinfo*
|
77 |
for extracting XMP metadata. The switch for executing *pdfinfo* is the
|
73 |
is the 'pdfextrameta' configuration parameter, and the value of the
|
78 |
'pdfextrameta' configuration parameter, and the value of the parameter is a
|
74 |
parameter is a list of XMP tags to extract, with optional conversion
|
79 |
list of XMP tags to extract, with optional conversion to Recoll field names
|
75 |
to Recoll field names (the XMP qualified tag name is kept by
|
80 |
(the XMP qualified tag name is kept by default; a translation, if any, is
|
76 |
default). Example:
|
81 |
separated by a '|' character). Example (without translations):
|
77 |
|
82 |
|
78 |
----
|
83 |
----
|
79 |
pdfextrameta = bibtex:year bibtex:journal bibtex:booktitle|title
|
84 |
pdfextrameta = bibtex:year bibtex:journal bibtex:journaltitle
|
80 |
----
|
85 |
----
|
81 |
|
86 |
|
82 |
Here, 'bibtex:year' and 'bibtex:journal' are used directly, and
|
87 |
Note that it is quite equivalent to translate a field name inside
|
83 |
'bibtex:booktitle' is translated to 'title' (the example is not
|
88 |
'pdfextrameta' or to use aliases inside the 'fields' file.
|
84 |
supposed to make sense)
|
|
|
85 |
|
89 |
|
86 |
== Editing the field values
|
90 |
== Editing the field values
|
87 |
|
91 |
|
88 |
Shortly after the 1.23.2 release, the new rclpdf.py was modified to
|
92 |
Shortly after the 1.23.2 release, the new rclpdf.py was modified to
|
89 |
enable calling external Python code for editing the values of the XMP
|
93 |
enable calling external Python code for editing the values of the XMP
|
|
... |
|
... |
125 |
pass
|
129 |
pass
|
126 |
|
130 |
|
127 |
return txt
|
131 |
return txt
|
128 |
----
|
132 |
----
|
129 |
|
133 |
|
|
|
134 |
|
|
|
135 |
The metadata-editing script can be modified to fill in the "journal" field for
|
|
|
136 |
BibTeX entries that aren't journal articles (e.g. bibtex:booktitle
|
|
|
137 |
for "InCollection" entries), by defining a 'wrapup()' method which will
|
|
|
138 |
be called with the whole metadata array (an array of '(nm,value)'
|
|
|
139 |
pairs) for global editing/removing/addition.
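
As a rough illustration of the 'wrapup()' idea, here is a minimal sketch. It
assumes that the handler passes the metadata as a mutable list of '(nm,value)'
pairs and that changes made in place are picked up afterwards; check this
against your rclpdf.py before relying on it. Falling back from
'bibtex:booktitle' to 'bibtex:journal' is only one possible policy.

```python
class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        # Per-field editing: nothing to do in this sketch.
        return txt

    def wrapup(self, metadata):
        # Assumption: metadata is a mutable list of (nm, value) pairs,
        # and in-place changes are seen by the handler.
        names = [nm for nm, value in metadata]
        if 'bibtex:journal' in names:
            # A journal is already present, leave everything alone.
            return
        for nm, value in list(metadata):
            if nm == 'bibtex:booktitle':
                # Reuse the book title as the "journal" value.
                metadata.append(('bibtex:journal', value))
                break
```

With this in place, an "InCollection" entry that only carries a book title
would also be found through the journal field.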
|
|
|
140 |
|
130 |
== Indexing
|
141 |
== Indexing
|
131 |
|
142 |
|
132 |
Then index away!
|
143 |
Then index away!
|
133 |
|
144 |
|
134 |
Note that you can also run the rclpdf.py script manually,
|
145 |
Note that you can also run the rclpdf.py script manually,
|
|
... |
|
... |
136 |
output. If things are working correctly, the <head> consists of the
|
147 |
output. If things are working correctly, the <head> consists of the
|
137 |
HTML meta elements, and the <body> contains the text of the PDF.
|
148 |
HTML meta elements, and the <body> contains the text of the PDF.
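
If the HTML output is long, a small script can list just the meta elements.
The following is a sketch using only the Python standard library; the sample
HTML is made up for illustration and is not real handler output, and the
names `MetaLister` and `list_meta` are hypothetical helpers.

```python
from html.parser import HTMLParser


class MetaLister(HTMLParser):
    """Collect (name, content) pairs from <meta> elements."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attributes = dict(attrs)
            # Skip meta elements without a name (e.g. http-equiv ones).
            if 'name' in attributes:
                self.fields.append(
                    (attributes['name'], attributes.get('content', '')))


def list_meta(html_text):
    parser = MetaLister()
    parser.feed(html_text)
    return parser.fields


# Made-up sample standing in for saved handler output.
SAMPLE = """<html><head>
<meta name="title" content="Some title">
<meta name="bibtex:year" content="2013">
</head><body><pre>text of the PDF</pre></body></html>"""

for name, content in list_meta(SAMPLE):
    print(name, '=', content)
```

Saving the handler output to a file and passing its contents to `list_meta()`
gives a quick view of which fields were actually extracted.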
|
138 |
|
149 |
|
139 |
== Result paragraph format
|
150 |
== Result paragraph format
|
140 |
|
151 |
|
141 |
Here, the result is formatted to show the title, which is a link
|
152 |
The result paragraph format defines which fields are displayed inside the Recoll
|
142 |
to open the document, in blue with underlining turned off. The next
|
153 |
result list, and how they are formatted.
|
143 |
two lines contain the authors, then the journal title in green
|
154 |
|
144 |
italicized text followed by year (in parentheses). The keywords are
|
|
|
145 |
listed in red after the abstract/text snippet.
|
|
|
146 |
|
|
|
147 |
Edit this using the Recoll GUI: Preferences > GUI configuration >
|
155 |
Edit this using the Recoll GUI: Preferences > GUI configuration >
|
148 |
Result List > Edit result paragraph format string.
|
156 |
Result List > Edit result paragraph format string.
|
149 |
|
157 |
|
150 |
----
|
158 |
----
|
151 |
<table class="respar" style="padding-bottom: 10px;" cellspacing="5" cellpadding="5">
|
159 |
<table class="respar" style="padding-bottom: 10px;" cellspacing="5" cellpadding="5">
|
|
... |
|
... |
175 |
|
183 |
|
176 |
</table>
|
184 |
</table>
|
177 |
|
185 |
|
178 |
----
|
186 |
----
|
179 |
|
187 |
|
180 |
The screenshot below also has the 'Highlight color for query terms'
|
188 |
There are
|
181 |
set to `black; font-weight:bold;` for bold, black text (instead
|
|
|
182 |
of the blue default). There
|
|
|
183 |
are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
|
189 |
link:https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
|
184 |
methods for creating the thumbnails]; the ones here were made by
|
190 |
methods for creating the thumbnails]; the ones here were made by opening
|
185 |
opening the directory containing the PDFs in the Dolphin file manager
|
191 |
the directory containing the PDFs in the Dolphin file manager (part of KDE)
|
186 |
(part of KDE) and selecting the Preview option.
|
192 |
and selecting the Preview option.
|
187 |
|
193 |
|
|
|
194 |
And the result:
|
188 |
|
195 |
|
189 |
== A search example
|
196 |
image::recoll_query.png[Result list display]
|
190 |
|
|
|
191 |
The simple query is `cerevisiae keyword:protein`. This
|
|
|
192 |
returns only PDFs that have the text "cerevisiae" and have been
|
|
|
193 |
tagged with the "protein" keyword. The LaTeX-style formatting from
|
|
|
194 |
the BibTeX database is displayed as HTML (note the italicized words
|
|
|
195 |
in article title, and umlaut in author's name). Other queries could
|
|
|
196 |
be made based on the PDF metadata, e.g. 'journal:plos'
|
|
|
197 |
or 'year:2013'.
|
|
|
198 |
|
|
|
199 |
image::recoll_query.png
|
|
|
200 |
|
197 |
|
201 |
== More possibilities
|
198 |
== More possibilities
|
202 |
|
199 |
|
203 |
- The sort buttons (up- and down-arrows) in Recoll sort the
|
200 |
- The sort buttons (up- and down-arrows) in Recoll sort the
|
204 |
results by the modified date on the file at the time of indexing. If
|
201 |
results by the modified date on the file at the time of indexing. If
|
|
... |
|
... |
214 |
Note that the publication year could then be shown in
|
211 |
Note that the publication year could then be shown in
|
215 |
the result list using the stored date of the file (using "%D" in the
|
212 |
the result list using the stored date of the file (using "%D" in the
|
216 |
result paragraph format, and date format "%Y") instead of having to
|
213 |
result paragraph format, and date format "%Y") instead of having to
|
217 |
add the year to the index as shown above.
|
214 |
add the year to the index as shown above.
|
218 |
|
215 |
|
219 |
The filter can be modified to fill in the "journal" field for
|
216 |
|
220 |
BibTex entries that aren't journal articles (e.g. bibtex:booktitle
|
217 |
== Complete example
|
221 |
for "InCollection" entries).
|
218 |
|
|
|
219 |
This was designed by Johannes Menzel, who kindly provided the data when we
|
|
|
220 |
worked on improving PDF XMP data extraction. The originals are listed in
|
|
|
221 |
this
|
|
|
222 |
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket issue].
|
|
|
223 |
|
|
|
224 |
The paragraph format is listed above.
|
|
|
225 |
|
|
|
226 |
=== 'recoll.conf' additions:
|
|
|
227 |
|
|
|
228 |
----
|
|
|
229 |
pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \
|
|
|
230 |
bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \
|
|
|
231 |
bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \
|
|
|
232 |
bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \
|
|
|
233 |
bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \
|
|
|
234 |
bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \
|
|
|
235 |
dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
|
|
|
236 |
|
|
|
237 |
defaultcharset = UTF-8//
|
|
|
238 |
|
|
|
239 |
pdfextrametafix = /home/hannes/.recoll/metafix.py
|
|
|
240 |
----
|
|
|
241 |
|
|
|
242 |
|
|
|
243 |
=== 'metafix.py' script:
|
|
|
244 |
|
|
|
245 |
----
|
|
|
246 |
import sys
|
|
|
247 |
import re
|
|
|
248 |
|
|
|
249 |
# This can be used for local XMP field editing.
|
|
|
250 |
#
|
|
|
251 |
# A new instance is created for each PDF document (so the object could
|
|
|
252 |
# keep state to avoid, e.g. duplicate values)
|
|
|
253 |
#
|
|
|
254 |
# The metafix method receives an (original) field name, and the text
|
|
|
255 |
# value, and should return the possibly modified text.
|
|
|
256 |
class MetaFixer(object):
|
|
|
257 |
    def __init__(self):
|


258 |
        pass
|


259 |
|


260 |
    def metafix(self, nm, txt):
|


261 |
        if nm == 'bibtex:pages':
|


262 |
            txt = re.sub(r'--', '-', txt)
|


263 |
            txt = re.sub(r'^', ', p. ', txt)
|


264 |
        elif nm == 'bibtex:author':
|


265 |
            txt = re.sub(r'$', ':\ ', txt)
|


266 |
            pass
|


267 |
        elif nm == 'bibtex:chapter':
|


268 |
            txt = re.sub(r'^', ', in: id.: ', txt)
|


269 |
            pass
|


270 |
        elif nm == 'bibtex:editor':
|


271 |
            txt = re.sub(r'^', ', in: ', txt)
|


272 |
            txt = re.sub(r'$', ' (ed.):\ ', txt)
|


273 |
            pass
|


274 |
        elif nm == 'bibtex:year':
|


275 |
            txt = re.sub(r'^', ', ', txt)
|


276 |
            pass
|


277 |
        elif nm == 'bibtex:date':
|


278 |
            txt = re.sub(r'^', ', ', txt)
|


279 |
            pass
|


280 |
        elif nm == 'bibtex:volume':
|


281 |
            txt = re.sub(r'^', ', vol. ', txt)
|


282 |
            pass
|


283 |
        elif nm == 'bibtex:number':
|


284 |
            txt = re.sub(r'^', ', no. ', txt)
|


285 |
            pass
|


286 |
        elif nm == 'bibtex:journaltitle':
|


287 |
            txt = re.sub(r'^', ', in: ', txt)
|


288 |
            pass
|


289 |
        elif nm == 'bibtex:journal':
|


290 |
            txt = re.sub(r'^', ', in: ', txt)
|


291 |
            pass
|


292 |
        elif nm == 'bibtex:title':
|


293 |
            txt = re.sub(r'^', '"', txt)
|


294 |
            txt = re.sub(r'$', '"', txt)
|


295 |
            pass
|


296 |
        elif nm == 'bibtex:location':
|


297 |
            txt = re.sub(r'^', ', ', txt)
|


298 |
            txt = re.sub(r'$', ':\ ', txt)
|


299 |
            pass
|


300 |
        elif nm == 'bibtex:address':
|


301 |
            txt = re.sub(r'^', ', ', txt)
|


302 |
            txt = re.sub(r'$', ':\ ', txt)
|


303 |
            pass
|


304 |
        elif nm == 'bibtex:isbn':
|


305 |
            txt = re.sub(r'^', 'ISBN: ', txt)
|


306 |
            pass
|


307 |
        elif nm == 'bibtex:issn':
|


308 |
            txt = re.sub(r'^', 'ISSN: ', txt)
|


309 |
            pass
|


310 |
        elif nm == 'bibtex:doi':
|


311 |
            txt = re.sub(r'^', 'DOI: ', txt)
|


312 |
            pass
|


313 |
        elif nm == 'bibtex:bibtexkey':
|


314 |
            txt = re.sub(r'^', 'Key: ', txt)
|


315 |
            pass
|


316 |
|


317 |
        return txt
|
|
|
318 |
----
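
Most branches of the script above only prepend or append a fixed string, so
the same idea can be sketched as a lookup table. This is an illustrative
variant and not part of the original configuration: it covers only a subset
of the fields, assumes values without a trailing newline, and the names
`DECORATIONS` and `TableMetaFixer` are hypothetical. To try it as a
replacement, the class would need to be named `MetaFixer`, as in the script
above.

```python
# (prefix, suffix) wrapped around each field value; a subset of the
# fields handled by the script above.
DECORATIONS = {
    'bibtex:pages': (', p. ', ''),
    'bibtex:year': (', ', ''),
    'bibtex:volume': (', vol. ', ''),
    'bibtex:number': (', no. ', ''),
    'bibtex:title': ('"', '"'),
    'bibtex:isbn': ('ISBN: ', ''),
    'bibtex:doi': ('DOI: ', ''),
}


class TableMetaFixer(object):
    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            # BibTeX page ranges use a double dash; show a single one.
            txt = txt.replace('--', '-')
        # Fields not in the table are returned unchanged.
        prefix, suffix = DECORATIONS.get(nm, ('', ''))
        return prefix + txt + suffix
```

Adding a field then means adding one table entry instead of a new `elif`
branch.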
|
|
|
319 |
|
|
|
320 |
|
|
|
321 |
=== 'fields' file:
|
|
|
322 |
|
|
|
323 |
----
|
|
|
324 |
[prefixes]
|
|
|
325 |
|
|
|
326 |
refjournal=RFJOURNAL
|
|
|
327 |
refpages=RFPAGES
|
|
|
328 |
reftitle=RFTTITLE
|
|
|
329 |
refvolume=RFVOLUME
|
|
|
330 |
refauthor=RFAUTHOR
|
|
|
331 |
refyear=RFYYEAR
|
|
|
332 |
refisbn=RFISBN
|
|
|
333 |
refissn=RFISSN
|
|
|
334 |
refdoi=RFDOI
|
|
|
335 |
refeditor=RFEDITOR
|
|
|
336 |
refpublisher=RFPUBLISHER
|
|
|
337 |
refaddress=RFADDRESS
|
|
|
338 |
reflocation=RFLOCATION
|
|
|
339 |
refbooktitle=RFBOOKTITLE
|
|
|
340 |
refurl=RFURL
|
|
|
341 |
reftype=RFTYPE
|
|
|
342 |
refkey=RFKEY
|
|
|
343 |
refabstract=RFABSTRACT
|
|
|
344 |
refkeywords=RFKEYWORDS
|
|
|
345 |
refcomment=RFCOMMENT
|
|
|
346 |
refedition=RFEDITION
|
|
|
347 |
reflanguage=RFLANGUAGE
|
|
|
348 |
|
|
|
349 |
[stored]
|
|
|
350 |
|
|
|
351 |
refjournal=
|
|
|
352 |
refpages=
|
|
|
353 |
reftitle=
|
|
|
354 |
refvolume=
|
|
|
355 |
refauthor=
|
|
|
356 |
refyear=
|
|
|
357 |
refisbn=
|
|
|
358 |
refissn=
|
|
|
359 |
refdoi=
|
|
|
360 |
refeditor=
|
|
|
361 |
refpublisher=
|
|
|
362 |
refaddress=
|
|
|
363 |
reflocation=
|
|
|
364 |
refbooktitle=
|
|
|
365 |
refurl=
|
|
|
366 |
reftype=
|
|
|
367 |
refkey=
|
|
|
368 |
refabstract=
|
|
|
369 |
refkeywords=
|
|
|
370 |
refcomment=
|
|
|
371 |
refedition=
|
|
|
372 |
reflanguage=
|
|
|
373 |
refid=
|
|
|
374 |
|
|
|
375 |
[aliases]
|
|
|
376 |
|
|
|
377 |
refjournal = bibtex:journal bibtex:journaltitle
|
|
|
378 |
refpages = bibtex:pages
|
|
|
379 |
reftitle = bibtex:title
|
|
|
380 |
refvolume = bibtex:volume
|
|
|
381 |
refauthor = bibtex:author
|
|
|
382 |
refyear = bibtex:year bibtex:date
|
|
|
383 |
refid = dc:identifier bibtex:isbn bibtex:issn
|
|
|
384 |
refisbn = bibtex:isbn
|
|
|
385 |
refissn = bibtex:issn
|
|
|
386 |
refdoi = bibtex:doi
|
|
|
387 |
refeditor = bibtex:editor
|
|
|
388 |
refpublisher = bibtex:publisher
|
|
|
389 |
refaddress = bibtex:address
|
|
|
390 |
reflocation = bibtex:location
|
|
|
391 |
refbooktitle = bibtex:booktitle
|
|
|
392 |
refurl = bibtex:url
|
|
|
393 |
reftype = bibtex:entrytype bibtex:type
|
|
|
394 |
refkey = bibtex:bibtexkey
|
|
|
395 |
refabstract = bibtex:abstract
|
|
|
396 |
refkeywords = bibtex:keywords
|
|
|
397 |
refcomment = bibtex:comment
|
|
|
398 |
refedition = bibtex:edition
|
|
|
399 |
reflanguage = bibtex:language
|
|
|
400 |
author = xesam:author
|
|
|
401 |
----
|
|
|
402 |
|