// website/recoll_XMP/index.txt
= Indexing PDF XMP-metadata with Recoll

The original document describing XMP metadata usage with Recoll was
written by Jeffrey Dick and is link:original-text.html[still available
here]. However, it described using the old shell-based PDF Recoll input
handler, which differs a lot from doing something equivalent with the
current Python-based one (XMP capability is available from
Recoll 1.23.2 onwards, but the new handler can also be used with earlier
Recoll versions).

This page was adapted from the text by Jeffrey Dick, using input from
Johannes Menzel (especially the result list paragraph format), and
adapting things for the new handler. The discussion which led to the
updated handler is a
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
Recoll issue].

== Introduction

Organizing and searching a large collection of PDFs as part of a
research project can be a demanding task.
link:http://en.wikipedia.org/wiki/Extensible_Metadata_Platform[XMP
metadata] stored in a PDF, such as journal title, publication year,
and user-added keywords, is often useful when searching for a
publication.

Here, we describe customizing Recoll to retrieve this metadata, store it,
and define a result paragraph format to display it. See also a related
wiki entry,
link:https://bitbucket.org/medoc/recoll/wiki/HandleCustomField.wiki[Generating
a custom field and using it to sort results], for sorting results on PDF
page count.

== Saving metadata to PDFs

Bibliographic metadata can be saved in the PDF file itself. In
the link:http://jabref.sourceforge.net[JabRef] bibliography
manager, this is done with the "Write XMP-metadata to PDFs" menu
item. Note the presence of the keywords in the screenshot below; this
field is a good place to tag the PDF with any words of your choosing
to describe genre, topic, etc.

image::jabref_metadata.png[Editing metadata with jabref]

== Custom indexing (fields file)

Let's create two fields named "year" and "journal". The prefixes
starting with "XY" are extension prefixes that are added to the terms
in the Xapian database (Recoll does not use prefixes starting with XY
internally). Additionally, the year and journal are stored so
they can be displayed in the results list. Some other types of
metadata, such as title, author and keywords, are already indexed by
Recoll (the default rclpdf finds them using the *pdftotext*
command), so there is no need to add those to the [prefixes] section.

Add this text to the fields file in your Recoll configuration
directory ('~/.recoll/fields').

----
[prefixes]
year = XYEAR
journal = XYJOUR

[stored]
bibtex:year =
bibtex:journal =
----

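As a quick sanity check before indexing, the fields fragment above can be read back with a few lines of Python. This is only a sketch: it assumes the fragment is simple enough for the standard `configparser` module (Recoll uses its own configuration reader, which is not reproduced here), and the names `FIELDS_TEXT`, `read_fields` and `non_extension_prefixes` are made up for the illustration.

```python
import configparser

# Copy of the fragment shown above (in practice, read ~/.recoll/fields).
FIELDS_TEXT = """\
[prefixes]
year = XYEAR
journal = XYJOUR

[stored]
bibtex:year =
bibtex:journal =
"""


def read_fields(text):
    """Return (prefixes dict, list of stored field names) from fields text."""
    # Only "=" is a delimiter, so that "bibtex:year" stays one key.
    parser = configparser.ConfigParser(delimiters=("=",), interpolation=None)
    parser.optionxform = str  # keep the case of field names
    parser.read_string(text)
    prefixes = dict(parser["prefixes"]) if parser.has_section("prefixes") else {}
    stored = list(parser["stored"]) if parser.has_section("stored") else []
    return prefixes, stored


def non_extension_prefixes(prefixes):
    """List the fields whose prefix does not start with the XY extension mark."""
    return [name for name, prefix in prefixes.items()
            if not prefix.startswith("XY")]


if __name__ == "__main__":
    prefixes, stored = read_fields(FIELDS_TEXT)
    print("prefixes:", prefixes)
    print("stored:", stored)
    print("suspicious prefixes:", non_extension_prefixes(prefixes))
```
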
== Telling the handler what fields to extract

As of Recoll 1.23.2, the PDF handler can use
*pdfinfo* to extract XMP metadata. Setting the 'pdfextrameta'
configuration parameter enables the *pdfinfo* call. The value of the
parameter is a list of XMP tags to extract, with optional conversion
to Recoll field names (the XMP qualified tag name is kept by
default). Example:

----
pdfextrameta = bibtex:year bibtex:journal bibtex:booktitle|title
----

Here, 'bibtex:year' and 'bibtex:journal' are used directly, and
'bibtex:booktitle' is translated to 'title' (the example is not
supposed to make sense).

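To make the value syntax concrete, here is a minimal sketch of how such a list could be split into a tag-to-field mapping. It only illustrates the format described above and is not the handler's actual code; the function name `parse_pdfextrameta` is hypothetical.

```python
def parse_pdfextrameta(value):
    """Map each XMP tag in the value to the Recoll field name it ends up in."""
    mapping = {}
    for item in value.split():
        # "tag|field" renames the tag; a bare "tag" keeps its qualified name.
        tag, _, field = item.partition("|")
        mapping[tag] = field or tag
    return mapping


if __name__ == "__main__":
    example = "bibtex:year bibtex:journal bibtex:booktitle|title"
    for tag, field in parse_pdfextrameta(example).items():
        print(f"{tag} -> {field}")
```
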
== Editing the field values

Shortly after the 1.23.2 release, the new rclpdf.py was modified to
enable calling external Python code for editing the values of the XMP
metadata fields. The name of the external script is defined by the
'pdfextrametafix' configuration variable, and it should define a
'MetaFixer' class with a 'metafix()' method.

In practice, add the following to recoll.conf:

----
pdfextrametafix = /path/to/my/script.py
----

The Python script could look like the following:

----
import sys
import re

# This can be used for local XMP field editing.
#
# A new instance is created for each PDF document (so the object could
# keep state to avoid, e.g. duplicate values)
#
# The metafix method receives an (original) field name, and the text
# value, and should return the possibly modified text.
class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            # Replace the BibTeX double dash in page ranges with a single one.
            txt = re.sub(r'--', '-', txt)
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt
----

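The comment in the script notes that one instance is created per PDF document, so the object can keep state. The variant below is a sketch of that idea: it remembers the values already seen for each field and blanks out repeats. Returning an empty string for a repeated value is an assumption made for this illustration; how the handler treats an empty value is not covered here and should be checked before relying on it.

```python
import re


class MetaFixer(object):
    def __init__(self):
        # Values already returned for this document, keyed by field name.
        self.seen = {}

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)

        values = self.seen.setdefault(nm, set())
        if txt in values:
            # Assumption: an empty string stands for "nothing new to add".
            return ''
        values.add(txt)
        return txt


if __name__ == "__main__":
    # Simulated calls for a single document (field name, raw value).
    fixer = MetaFixer()
    for name, value in [('bibtex:pages', '10--20'),
                        ('bibtex:pages', '10--20'),
                        ('bibtex:year', '2013')]:
        print(name, repr(fixer.metafix(name, value)))
```
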
== Indexing

Then index away!

Note that you can also run the rclpdf.py script manually,
e.g. `rclpdf.py -d /path/to/some.pdf`, to inspect the
output. If things are working correctly, the <head> consists of the
HTML meta elements, and the <body> contains the text of the PDF.

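When checking the handler output by eye gets tedious, the meta elements can be listed with Python's standard `html.parser`. The sketch below works on a hand-written sample: the `SAMPLE` text, its field values and the helper names are all hypothetical, and the real output of rclpdf.py may be laid out differently.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from <meta name=... content=...> tags."""

    def __init__(self):
        super().__init__()
        self.fields = []

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            # Skip http-equiv and other meta tags without a name attribute.
            if "name" in attrs:
                self.fields.append((attrs["name"], attrs.get("content", "")))


def list_meta_fields(html_text):
    collector = MetaCollector()
    collector.feed(html_text)
    return collector.fields


# Hypothetical handler output, made up for this example.
SAMPLE = """<html><head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta name="bibtex:year" content="2013">
<meta name="bibtex:journal" content="PLoS ONE">
</head><body><pre>Text of the PDF...</pre></body></html>"""

if __name__ == "__main__":
    for name, value in list_meta_fields(SAMPLE):
        print(f"{name}: {value}")
```

To check a real document, the text produced by the manual rclpdf.py run could be passed to `list_meta_fields` in place of `SAMPLE`.
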
== Result paragraph format

Here, the result is formatted to show the title, which is a link
to open the document, in blue with underlining turned off. The next
two lines contain the authors, then the journal title in green
italicized text followed by year (in parentheses). The keywords are
listed in red after the abstract/text snippet.

Edit this using the Recoll GUI: Preferences > GUI configuration >
Result List > Edit result paragraph format string.

----
<table class="respar" style="padding-bottom: 10px;" cellspacing="5" cellpadding="5">

<thead style="vertical-align: top;">
<tr>
<td colspan="3" style="border-bottom: 1pt dotted #004070; font-size: smaller;"><a href="E%N">%u</a> | %S | Relevance: %R</td>
</tr>
</thead>

<tbody style="vertical-align: top;">
<tr>
<td><a href="P%N"><img src="%I" alt="" width="64" height="auto" /></a></td>
<td style="width: 250px;"><span style="color: #004070;">
<div style="font-style: italic;">%(author)</div>
<div style="font-weight: bold;"><a href="E%N">»%T«</a></div>
<div style="text-transform: uppercase; margin-top: 5pt">%(reftype)</div></td>
<td>
<div style="font-size: smaller;">
%(refauthor)%(refchapter) %(reftitle)%(refeditor)%(refbooktitle)%(refjournal)%(refvolume)%(refnumber)%(refaddress)%(reflocation)%(refpublisher)%(refyear)%(refpages).</div>
<div style="text-align: justify; font-family: serif; margin-top: 5pt; margin-bottom: 5pt">»<a href="A%N">%A</a>«</div>
<div>%(refkeywords)</div>
<div style="font-size: smaller;"><a href="%(refurl)">%(refurl)</a></div>
<div style="font-size: smaller"> %(refkey) %(refisbn) %(refissn) %(refdoi)</div></td>
</tr>
</tbody>

</table>
----

The screenshot below also has the 'Highlight color for query terms'
set to `black; font-weight:bold;` for bold, black text (instead
of the blue default). There
are link:https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
methods for creating the thumbnails]; the ones here were made by
opening the directory containing the PDFs in the Dolphin file manager
(part of KDE) and selecting the Preview option.

== A search example

The simple query is `cerevisiae keyword:protein`. This
returns only PDFs that contain the text "cerevisiae" and have been
tagged with the "protein" keyword. The LaTeX-style formatting from
the BibTeX database is displayed as HTML (note the italicized words
in the article title and the umlaut in the author's name). Other queries
could be made based on the PDF metadata, e.g. 'journal:plos'
or 'year:2013'.

image::recoll_query.png

== More possibilities

- The sort buttons (up- and down-arrows) in Recoll sort the
results by the modified date on the file at the time of indexing. If
you want this sorting to reflect the publication year, then the
timestamp should be set accordingly. If names of the PDFs contain
the year (e.g. BZS2007.pdf, CKE+2011.pdf), the following one-liner
would set the modified date to January 1st of the year:

----
for i in *.pdf; do touch -d "$(echo "$i" | sed 's/[^0-9]*//g')-01-01" "$i"; done
----

Note that the publication year could then be shown in
the result list using the stored date of the file (using "%D" in the
result paragraph format, and date format "%Y") instead of having to
add the year to the index as shown above.

- The handler can be modified to fill in the "journal" field for
BibTeX entries that aren't journal articles (e.g. bibtex:booktitle
for "InCollection" entries).
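
The timestamp trick from the first item can also be sketched in Python, which avoids shell quoting problems with unusual file names. Note one deliberate difference from the one-liner: instead of stripping every non-digit, this version looks for a standalone four-digit year starting with 19 or 20, and skips files without one. The function names are hypothetical, and years before 1970 may not be accepted as timestamps on every platform.

```python
import os
import re
from datetime import datetime
from pathlib import Path


def year_from_name(name):
    """Return the first standalone 19xx/20xx year in a file name, or None."""
    match = re.search(r"(?<!\d)((?:19|20)\d\d)(?!\d)", name)
    return int(match.group(1)) if match else None


def set_mtime_to_year(directory):
    """Set each PDF's modification time to January 1st of its year."""
    changed = []
    for pdf in sorted(Path(directory).glob("*.pdf")):
        year = year_from_name(pdf.name)
        if year is None:
            continue  # no recognizable year: leave the file untouched
        stamp = datetime(year, 1, 1).timestamp()  # local midnight, Jan 1st
        os.utime(pdf, (stamp, stamp))
        changed.append((pdf.name, year))
    return changed


if __name__ == "__main__":
    for name, year in set_mtime_to_year("."):
        print(f"{name}: modification date set to {year}-01-01")
```
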