a/website/recoll_XMP/index.txt b/website/recoll_XMP/index.txt
...
...
6
handler, which differs a lot from doing something equivalent with the
6
handler, which differs a lot from doing something equivalent with the
7
current Python-based one (for which XMP capability is available from
7
current Python-based one (for which XMP capability is available from
8
recoll 1.23.2, but the new handler can be used with previous Recoll
8
recoll 1.23.2, but the new handler can be used with previous Recoll
9
versions).
9
versions).
10
10
11
This page was adapted from the text by Jeffrey Dick, using input from
11
I based this page on the text by Jeffrey Dick, using input from Johannes
12
Johannes Menzel, (especially the result list paragraph format),
12
Menzel for all examples about the new features. The discussion which led to
13
adapting things for the new handler. The discussion which led to the
14
updated handler is a
13
the updated handler is a
15
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
14
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
16
Recoll issue].
15
Recoll issue].
17
  
16
  
18
== Introduction
17
== Introduction
19
18
...
...
40
field is a good place to tag the PDF with any words of your choosing
39
field is a good place to tag the PDF with any words of your choosing
41
to describe genre, topic, etc. 
40
to describe genre, topic, etc. 
42
41
43
image::jabref_metadata.png[Editing metadata with jabref]
42
image::jabref_metadata.png[Editing metadata with jabref]
44
43
45
== Custom indexing (fields file)
44
== Custom indexing short example (fields file)
46
45
47
Let's create two fields named "year" and "journal". The prefixes
46
The following example (extract from a complete configuration shown later)
48
starting with "XY" are extension prefixes that are added to the terms
47
creates two fields named "refjournal" and "refpages", which are both stored
49
in the Xapian database (Recoll internally does not use prefixes
48
(so they can be displayed in result list entries), and indexed (you can
50
starting with XY). Additionally, the year and journal are stored so
49
specifically search them).
51
they can be displayed in the results list. Some other types of
50
52
metadata, such as title, author and keywords, are already indexed by
51
Some other types of metadata, such as title, author and keywords, are
53
Recoll (the default rclpdf finds them using the *pdftotext*
52
already indexed by Recoll (the default rclpdf finds them using the
54
command) so there is no need to add those to the [prefixes] section. 
53
*pdftotext* command) so there is no need to add those to the [prefixes]
54
section.
55
55
56
Add this text to the fields file in your Recoll configuration
56
This is taken from the `fields` file inside the configuration
57
directory ('~/.recoll/fields'). 
57
(e.g. '~/.recoll/fields').
58
58
59
59
----
60
----
60
[prefixes]
61
[prefixes]
61
year = XYEAR
62
refjournal=RFJOURNAL
62
journal = XYJOUR
63
refpages=RFPAGES
63
64
64
[stored]
65
[stored]
65
bibtex:year =
66
refjournal =
66
bibtex:journal =
67
refpages =
68
69
[aliases]
70
refjournal = bibtex:journal bibtex:journaltitle
71
refpages = bibtex:pages
67
----
72
----
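
To make the role of the `[aliases]` section more concrete, here is a minimal Python sketch of how such lines can be read as a mapping from XMP tag names to one canonical field name. It is only an illustration of the idea, not Recoll's actual implementation, and the helper name `build_alias_map` is hypothetical.

```python
def build_alias_map(alias_lines):
    """Map each alias (e.g. 'bibtex:journal') to its canonical field name."""
    alias_map = {}
    for line in alias_lines:
        line = line.strip()
        # Skip blank lines, comments and section headers.
        if not line or line.startswith('#') or line.startswith('['):
            continue
        canonical, _, aliases = line.partition('=')
        for alias in aliases.split():
            alias_map[alias] = canonical.strip()
    return alias_map


# The two alias lines from the example above.
ALIASES = """
refjournal = bibtex:journal bibtex:journaltitle
refpages = bibtex:pages
""".splitlines()

alias_map = build_alias_map(ALIASES)
for alias, canonical in sorted(alias_map.items()):
    print(alias, '->', canonical)
```

Both 'bibtex:journal' and 'bibtex:journaltitle' end up under 'refjournal', which is why a single field name can then be used for searching and for the result list.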
68
73
69
== Telling the handler what fields to extract
74
== Telling the handler what fields to extract
70
75
71
As of Recoll 1.23.2, the PDF handler has the capability to use
76
As of Recoll 1.23.2, the PDF handler has the capability to use *pdfinfo*
72
*pdfinfo* for extracting XMP metadata. The switch for executing *pdfinfo*
77
for extracting XMP metadata. The switch for executing *pdfinfo* is the
73
is the 'pdfextrameta' configuration parameter, and the value of the
78
'pdfextrameta' configuration parameter, and the value of the parameter is a
74
parameter is a list of XMP tags to extract, with optional conversion
79
list of XMP tags to extract, with optional conversion to Recoll field names
75
to Recoll field names (the XMP qualified tag name is kept by
80
(the XMP qualified tag name is kept by default, the translation is
76
default). Example:
81
separated by a '|' character). Example (without translations):
77
82
78
----
83
----
79
pdfextrameta =  bibtex:year bibtex:journal bibtex:booktitle|title
84
pdfextrameta =  bibtex:year bibtex:journal bibtex:journaltitle
80
----
85
----
81
86
82
Here, 'bibtex:year' and 'bibtex:journal' are used directly, and
87
Note that it is quite equivalent to translate a field name inside
83
'bibtex:booktitle' is translated to 'title' (the example is not
88
'pdfextrameta' or to use aliases inside the 'fields' file.
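
As an illustration of the value syntax described above, the following Python sketch splits a 'pdfextrameta' value into (XMP tag, Recoll field name) pairs, keeping the XMP name when no '|' translation is given. The function name `parse_pdfextrameta` is hypothetical, and this is not the handler's own parsing code.

```python
def parse_pdfextrameta(value):
    """Return (xmp_tag, recoll_field) pairs for a pdfextrameta value."""
    pairs = []
    for item in value.split():
        xmp_tag, sep, field = item.partition('|')
        # Without a '|' translation the qualified XMP tag name is kept.
        pairs.append((xmp_tag, field if sep and field else xmp_tag))
    return pairs


pairs = parse_pdfextrameta('bibtex:year bibtex:journal bibtex:booktitle|title')
for xmp_tag, field in pairs:
    print(xmp_tag, '->', field)
```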
84
supposed to make sense)
85
89
86
== Editing the field values
90
== Editing the field values
87
91
88
Shortly after the 1.23.2 release, the new rclpdf.py was modified to
92
Shortly after the 1.23.2 release, the new rclpdf.py was modified to
89
enable calling external Python code for editing the values of the XMP
93
enable calling external Python code for editing the values of the XMP
...
...
125
            pass
129
            pass
126
    
130
    
127
        return txt
131
        return txt
128
----
132
----
129
133
134
135
The metadata-editing script can be modified to fill in the "journal" field for
136
BibTeX entries that aren't journal articles (e.g. bibtex:booktitle
137
for "InCollection" entries), by defining a 'wrapup()' method which will
138
be called with the whole metadata array (an array of '(nm,value)'
139
pairs) for global editing/removing/addition.
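
A minimal sketch of such a 'wrapup()' method is shown here. It assumes that the handler passes a mutable list of '(nm, value)' pairs; whether the list is expected to be modified in place or returned is an assumption, so this sketch does both. Check the rclpdf.py version you are running before relying on it.

```python
class MetaFixer(object):
    def metafix(self, nm, txt):
        # No per-field editing in this sketch.
        return txt

    def wrapup(self, metaheaders):
        # metaheaders: list of (nm, value) pairs for one document.
        names = {nm for nm, _ in metaheaders}
        if 'bibtex:journal' not in names:
            # Reuse the book title as "journal" for non-article entries.
            for nm, value in list(metaheaders):
                if nm == 'bibtex:booktitle':
                    metaheaders.append(('bibtex:journal', value))
                    break
        return metaheaders


# Hypothetical metadata for an "InCollection" entry.
meta = [('bibtex:entrytype', 'InCollection'),
        ('bibtex:booktitle', 'Some Collected Volume')]
MetaFixer().wrapup(meta)
print(meta)
```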
140
130
== Indexing
141
== Indexing
131
142
132
Then index away!
143
Then index away!
133
144
134
Note that you can also run the rclpdf.py script manually,
145
Note that you can also run the rclpdf.py script manually,
...
...
136
output. If things are working correctly, the <head> consists of the
147
output. If things are working correctly, the <head> consists of the
137
HTML meta elements, and the <body> contains the text of the PDF.
148
HTML meta elements, and the <body> contains the text of the PDF.
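
When checking the handler output by hand, it can help to list only the meta elements. The sketch below does this with Python's standard `html.parser` module. The `SAMPLE` string is made-up data standing in for real handler output, and the class name `MetaCollector` is hypothetical.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.meta = []

    def handle_starttag(self, tag, attrs):
        if tag == 'meta':
            attrs = dict(attrs)
            # Ignore http-equiv style meta tags, which have no name.
            if 'name' in attrs:
                self.meta.append((attrs['name'], attrs.get('content', '')))


# Hypothetical handler output, for illustration only.
SAMPLE = '''<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="bibtex:year" content="2013">
<meta name="bibtex:journal" content="Example Journal">
</head><body><pre>PDF text here</pre></body></html>'''

collector = MetaCollector()
collector.feed(SAMPLE)
for name, content in collector.meta:
    print(name, '=', content)
```

If a tag listed in 'pdfextrameta' does not show up in this list, the PDF probably does not carry that XMP field.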
138
149
139
== Result paragraph format
150
== Result paragraph format
140
151
141
Here, the result is formatted to show the title, which is a link
152
The result paragraph format defines what fields are displayed inside Recoll
142
to open the document, in blue with underlining turned off. The next
153
result list, and how they are formatted.
143
two lines contain the authors, then the journal title in green
154
144
italicized text followed by year (in parentheses). The keywords are
145
listed in red after the abstract/text snippet.
146
    
147
Edit this using the Recoll GUI: Preferences > GUI configuration >
155
Edit this using the Recoll GUI: Preferences > GUI configuration >
148
    Result List > Edit result paragraph format string.
156
    Result List > Edit result paragraph format string.
149
157
150
----
158
----
151
<table class="respar" style="padding-bottom: 10px;" cellspacing="5" cellpadding="5">
159
<table class="respar" style="padding-bottom: 10px;" cellspacing="5" cellpadding="5">
...
...
175
183
176
</table>
184
</table>
177
185
178
----
186
----
179
187
180
The screenshot below also has the 'Highlight color for query terms'
188
There are
181
set to `black; font-weight:bold;` for bold, black text (instead
182
of the blue default). There
183
are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
189
link:https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
184
methods for creating the thumbnails]; the ones here were made by
190
methods for creating the thumbnails]; the ones here were made by opening
185
opening the directory containing the PDFs in the Dolphin file manager
191
the directory containing the PDFs in the Dolphin file manager (part of KDE)
186
(part of KDE) and selecting the Preview option.
192
and selecting the Preview option.
187
193
194
And the result:
188
195
189
== A search example
196
image::recoll_query.png[Result list display]
190
191
The simple query is `cerevisiae keyword:protein`. This
192
returns only PDFs that have the text "cerevisiae" and have been
193
tagged with the "protein" keyword. The LaTeX-style formatting from
194
the BibTeX database is displayed as HTML (note the italicized words
195
in article title, and umlaut in author's name). Other queries could
196
be made based on the PDF metadata, e.g. 'journal:plos'
197
r 'year:2013'. 
198
199
image::recoll_query.png
200
197
201
== More possibilities
198
== More possibilities
202
199
203
- The sort buttons (up- and down-arrows) in Recoll sort the
200
- The sort buttons (up- and down-arrows) in Recoll sort the
204
  results by the modified date on the file at the time of indexing. If
201
  results by the modified date on the file at the time of indexing. If
...
...
214
Note that the publication year could then be shown in
211
Note that the publication year could then be shown in
215
the result list using the stored date of the file (using "%D" in the
212
the result list using the stored date of the file (using "%D" in the
216
result paragraph format, and date format "%Y") instead of having to
213
result paragraph format, and date format "%Y") instead of having to
217
add the year to the index as shown above. 
214
add the year to the index as shown above. 
218
215
219
 The filter can be modified to fill in the "journal" field for
216
220
  BibTex entries that aren't journal articles (e.g. bibtex:booktitle
217
== Complete example
221
  for "InCollection" entries). 
218
219
This was designed by Johannes Menzel, who kindly provided the data when we
220
worked on improving PDF XMP data extraction. The originals are listed in
221
this
222
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket issue].
223
224
The paragraph format is listed above.
225
226
=== 'recoll.conf' additions:
227
228
----
229
pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \
230
  bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \
231
  bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \
232
  bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \
233
  bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \
234
  bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \
235
  dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
236
237
defaultcharset = UTF-8//
238
239
pdfextrametafix = /home/hannes/.recoll/metafix.py
240
----
241
242
243
=== 'metafix.py' script:
244
245
----
246
import sys
247
import re
248
249
# This can be used for local XMP field editing.
250
#
251
# A new instance is created for each PDF document (so the object could
252
# keep state to avoid, e.g. duplicate values)
253
#
254
# The metafix method receives an (original) field name, and the text
255
# value, and should return the possibly modified text.
256
class MetaFixer(object):
257
    def __init__(self):
258
        pass
259
260
    def metafix(self, nm, txt):
261
        if nm == 'bibtex:pages':
262
            txt = re.sub(r'--', '-', txt)
263
            txt = re.sub(r'^', ', p. ', txt)
264
        elif nm == 'bibtex:author':
265
            txt = re.sub(r'$', r':\ ', txt)
266
            pass
267
        elif nm == 'bibtex:chapter':
268
            txt = re.sub(r'^', ', in: id.: ', txt)
269
            pass
270
        elif nm == 'bibtex:editor':
271
            txt = re.sub(r'^', ', in: ', txt)
272
            txt = re.sub(r'$', r' (ed.):\ ', txt)
273
            pass
274
        elif nm == 'bibtex:year':
275
            txt = re.sub(r'^', ', ', txt)
276
            pass
277
        elif nm == 'bibtex:date':
278
            txt = re.sub(r'^', ', ', txt)
279
            pass            
280
        elif nm == 'bibtex:volume':
281
            txt = re.sub(r'^', ', vol. ', txt)
282
            pass
283
        elif nm == 'bibtex:number':
284
            txt = re.sub(r'^', ', no. ', txt)
285
            pass
286
        elif nm == 'bibtex:journaltitle':
287
            txt = re.sub(r'^', ', in: ', txt)
288
            pass
289
        elif nm == 'bibtex:journal':
290
            txt = re.sub(r'^', ', in: ', txt)
291
            pass
292
        elif nm == 'bibtex:title':
293
            txt = re.sub(r'^', '"', txt)
294
            txt = re.sub(r'$', '"', txt)            
295
            pass
296
        elif nm == 'bibtex:location':
297
            txt = re.sub(r'^', ', ', txt)
298
            txt = re.sub(r'$', r':\ ', txt)
299
            pass
300
        elif nm == 'bibtex:address':
301
            txt = re.sub(r'^', ', ', txt)
302
            txt = re.sub(r'$', r':\ ', txt)
303
            pass
304
        elif nm == 'bibtex:isbn':
305
            txt = re.sub(r'^', 'ISBN: ', txt)            
306
            pass
307
        elif nm == 'bibtex:issn':
308
            txt = re.sub(r'^', 'ISSN: ', txt)
309
            pass
310
        elif nm == 'bibtex:doi':
311
            txt = re.sub(r'^', 'DOI: ', txt)
312
            pass
313
        elif nm == 'bibtex:bibtexkey':
314
            txt = re.sub(r'^', 'Key: ', txt)
315
            pass
316
317
        return txt
318
----
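
Since most branches of the script above only add fixed text before or after the value, the same effect can be sketched with a lookup table instead of a chain of `re.sub()` calls. This is an alternative layout and not the script that was actually used; it covers only a few of the fields, and the `DECORATIONS` name is hypothetical.

```python
# (prefix, suffix) to put around the value of each field.
DECORATIONS = {
    'bibtex:pages': (', p. ', ''),
    'bibtex:volume': (', vol. ', ''),
    'bibtex:number': (', no. ', ''),
    'bibtex:title': ('"', '"'),
    'bibtex:isbn': ('ISBN: ', ''),
    'bibtex:doi': ('DOI: ', ''),
}


class MetaFixer(object):
    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            # BibTeX page ranges use a double dash.
            txt = txt.replace('--', '-')
        # Fields without an entry are returned unchanged.
        prefix, suffix = DECORATIONS.get(nm, ('', ''))
        return prefix + txt + suffix


fixer = MetaFixer()
print(fixer.metafix('bibtex:pages', '12--34'))
print(fixer.metafix('bibtex:title', 'A Title'))
```

Adding a field then means adding one table entry rather than another `elif` branch.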
319
320
321
=== 'fields' file:
322
323
----
324
[prefixes]
325
326
refjournal=RFJOURNAL
327
refpages=RFPAGES
328
reftitle=RFTTITLE
329
refvolume=RFVOLUME
330
refauthor=RFAUTHOR
331
refyear=RFYYEAR
332
refisbn=RFISBN
333
refissn=RFISSN
334
refdoi=RFDOI
335
refeditor=RFEDITOR
336
refpublisher=RFPUBLISHER
337
refaddress=RFADDRESS
338
reflocation=RFLOCATION
339
refbooktitle=RFBOOKTITLE
340
refurl=RFURL
341
reftype=RFTYPE
342
refkey=RFKEY
343
refabstract=RFABSTRACT
344
refkeywords=RFKEYWORDS
345
refcomment=RFCOMMENT
346
refedition=RFEDITION
347
reflanguage=RFLANGUAGE
348
349
[stored]
350
351
refjournal=
352
refpages=
353
reftitle=
354
refvolume=
355
refauthor=
356
refyear=
357
refisbn=
358
refissn=
359
refdoi=
360
refeditor=
361
refpublisher=
362
refaddress=
363
reflocation=
364
refbooktitle=
365
refurl=
366
reftype=
367
refkey=
368
refabstract=
369
refkeywords=
370
refcomment=
371
refedition=
372
reflanguage=
373
refid=
374
375
[aliases]
376
377
refjournal = bibtex:journal bibtex:journaltitle
378
refpages = bibtex:pages
379
reftitle = bibtex:title
380
refvolume = bibtex:volume
381
refauthor = bibtex:author
382
refyear = bibtex:year bibtex:date
383
refid = dc:identifier bibtex:isbn bibtex:issn
384
refisbn = bibtex:isbn
385
refissn = bibtex:issn
386
refdoi = bibtex:doi
387
refeditor = bibtex:editor
388
refpublisher = bibtex:publisher
389
refaddress = bibtex:address
390
reflocation = bibtex:location
391
refbooktitle = bibtex:booktitle
392
refurl = bibtex:url
393
reftype = bibtex:entrytype bibtex:type
394
refkey = bibtex:bibtexkey
395
refabstract = bibtex:abstract
396
refkeywords = bibtex:keywords
397
refcomment = bibtex:comment
398
refedition = bibtex:edition
399
reflanguage = bibtex:language
400
author = xesam:author
401
----
402