a/website/recoll_XMP/original-text.html b/website/recoll_XMP/original-text.html
...
...
4
</head>
4
</head>
5
5
6
<body>
6
<body>
7
7
8
<h2>Introduction</h2>
8
<h2>Introduction</h2>
9
9
<p>Organizing and searching a large collection of PDFs as part of a research project can be a demanding task.
10
<p>Organizing and searching a large collection of PDFs as part of a research project can be a demanding task.
10
<a href="http://en.wikipedia.org/wiki/Extensible_Metadata_Platform">XMP metadata</a> stored in a PDF, such as journal title, publication year, and user-added keywords, are often useful when searching for a publication.
11
<a href="http://en.wikipedia.org/wiki/Extensible_Metadata_Platform">XMP
11
Here, we describe the use of a custom Recoll filter to retrieve this metadata, an indexing configuration to store it, and result paragraph format to display it. See also a related wiki entry, <a href="https://bitbucket.org/medoc/recoll/wiki/HandleCustomField.wiki">Generating a custom field and using it to sort results</a>, for sorting results on PDF page count.
12
metadata</a> stored in a PDF, such as journal title, publication year,
13
and user-added keywords, are often useful when searching for a
14
publication.  Here, we describe the use of a custom Recoll filter to
15
retrieve this metadata, an indexing configuration to store it, and
16
a result paragraph format to display it. See also a related wiki
17
entry, <a href="http://www.recoll.org/faqsandhowtos/HandleCustomField.html">
18
  Generating a custom field and using it to sort results</a>, for
19
sorting results on PDF page count. </p>
12
20
13
<h2>Saving metadata to PDFs</h2>
21
<h2>Saving metadata to PDFs</h2>
14
<p>Bibliographic metadata can be saved in the PDF file itself. In the <a href="http://jabref.sourceforge.net">JabRef</a> bibliography manager, this is done with the "Write XMP-metadata to PDFs" menu item. Note the presence of the keywords in the screenshot below; this field is a good place to tag the PDF with any words of your choosing to describe genre, topic, etc.
22
<p>Bibliographic metadata can be saved in the PDF file itself. In the <a href="http://jabref.sourceforge.net">JabRef</a> bibliography manager, this is done with the "Write XMP-metadata to PDFs" menu item. Note the presence of the keywords in the screenshot below; this field is a good place to tag the PDF with any words of your choosing to describe genre, topic, etc.
15
<p><img src="jabref_metadata.png">
23
<p><img src="jabref_metadata.png">
16
24
...
...
108
&amp;nbsp;&lt;font color="#009000">&lt;i>%(journal)&lt;/i>&lt;/font>&amp;nbsp;(%(year))
116
&amp;nbsp;&lt;font color="#009000">&lt;i>%(journal)&lt;/i>&lt;/font>&amp;nbsp;(%(year))
109
&amp;nbsp;&lt;table bgcolor="#e0e0e0"> &lt;tr>&lt;td>&lt;div>%A&lt;/div>&lt;/td>&lt;/tr>
117
&amp;nbsp;&lt;table bgcolor="#e0e0e0"> &lt;tr>&lt;td>&lt;div>%A&lt;/div>&lt;/td>&lt;/tr>
110
&lt;/table>&lt;font color="#900000">%K&lt;/font>
118
&lt;/table>&lt;font color="#900000">%K&lt;/font>
111
&lt;br>&lt;br>
119
&lt;br>&lt;br>
112
</pre>
120
</pre>
113
The screenshot below also has the "Highlight color for query terms" set to <tt>black; font-weight:bold;</tt> for bold, black text (instead of the blue default). There are <a href="https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails">various methods for creating the thumbnails</a>; the ones here were made by opening the directory containing the PDFs in the Dolphin file manager (part of KDE) and selecting the Preview option.
121
The screenshot below also has the "Highlight color for query terms"
122
set to <tt>black; font-weight:bold;</tt> for bold, black text (instead
123
of the blue default). There
124
are <a href="http://www.recoll.org/faqsandhowtos/ResultsThumbnails.html">
125
  various methods for creating the thumbnails</a>; the ones here were
126
made by opening the directory containing the PDFs in the Dolphin file manager
127
(part of KDE) and selecting the Preview option. 
114
128
115
<h2>A search example</h2>
129
<h2>A search example</h2>
116
<p>The simple query is <tt>cerevisiae keyword:protein</tt>. This returns only PDFs that have the text "cerevisiae" and have been tagged with the "protein" keyword. The LaTeX-style formatting from the BibTeX database is displayed as HTML (note the italicized words in article title, and umlaut in author's name). Other queries could be made based on the PDF metadata, e.g. <tt>journal:plos</tt> or <tt>year:2013</tt> .
130
131
<p>The simple query is <tt>cerevisiae keyword:protein</tt>. This
132
returns only PDFs that have the text "cerevisiae" and have been tagged
133
with the "protein" keyword. The LaTeX-style formatting from the BibTeX
134
database is displayed as HTML (note the italicized words in the article
135
title and the umlaut in the author's name). Other queries could be made based
136
  on the PDF metadata, e.g. <tt>journal:plos</tt> or <tt>year:2013</tt>.</p>
117
<p><img src="recoll_query.png">
137
<p><img src="recoll_query.png"></p>
118
138
119
<h2>More possibilities</h2>
139
<h2>More possibilities</h2>
140
120
<ul>
141
<ul>
121
  <li>The sort buttons (up- and down-arrows) in Recoll sort the results by the modified date on the file at the time of indexing. If you want this sorting to reflect the publication year, then the timestamp should be set accordingly. If names of the PDFs contain the year (e.g. BZS2007.pdf, CKE+2011.pdf), the following one-liner would set the modified date to January 1st of the year: <tt>for i in `ls *.pdf`; do touch -d `echo $i | sed 's/[^0-9]*//g'`-01-01 $i; done</tt> . Note that the publication year could then be shown in the result list using the stored date of the file (using "%D" in the result paragraph format, and date format "%Y") instead of having to add the year to the index as shown above.
142
  <li>The sort buttons (up- and down-arrows) in Recoll sort the
122
  <li>The filter can be modified to fill in the "journal" field for BibTex entries that aren't journal articles (e.g. bibtex:booktitle for "InCollection" entries).
143
  results by the modified date on the file at the time of indexing. If
144
  you want this sorting to reflect the publication year, then the
145
  timestamp should be set accordingly. If names of the PDFs contain
146
  the year (e.g. BZS2007.pdf, CKE+2011.pdf), the following one-liner
147
  would set the modified date to January 1st of the year: <tt>for i in
148
  *.pdf; do touch -d "$(echo "$i" | sed 's/[^0-9]*//g')-01-01" "$i";
149
  done</tt>. Note that the publication year could then be shown in
150
  the result list using the stored date of the file (using "%D" in the
151
  result paragraph format, and date format "%Y") instead of having to
152
  add the year to the index as shown above. 
153
154
  <li>The filter can be modified to fill in the "journal" field for
155
  BibTeX entries that aren't journal articles (e.g. bibtex:booktitle
156
  for "InCollection" entries). 
157
123
</ul>
158
</ul>
124
159
125
</body>
160
</body>
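
The timestamp one-liner in the "More possibilities" list strips every non-digit from the file name, so it only works when the year is the sole number in the name. Below is a minimal sketch of a more defensive variant, not the page author's method. It assumes a POSIX shell and POSIX `touch -t`, and it assumes the publication year is the last four-digit run starting with 1 or 2 in the file name. The function name `set_mtime_from_year` is hypothetical.

```shell
#!/bin/sh
# Sketch: set each PDF's modification time to January 1st of the year
# found in its file name (e.g. BZS2007.pdf -> 2007-01-01 00:00 local time).
# Assumption: the year is the last run of four digits starting with 1 or 2.
# set_mtime_from_year is a hypothetical helper name, not part of Recoll.
set_mtime_from_year() {
    dir=$1
    for f in "$dir"/*.pdf; do
        [ -e "$f" ] || continue      # the glob matched nothing
        name=${f##*/}                # strip the directory part
        # Print the year only when the pattern matches; otherwise print nothing.
        year=$(printf '%s\n' "$name" |
            sed -n 's/.*\([12][0-9][0-9][0-9]\).*/\1/p')
        [ -n "$year" ] || continue   # no year in the name: leave the file alone
        touch -t "${year}01010000" -- "$f"
    done
}
```

Calling `set_mtime_from_year .` in the directory holding the PDFs would apply it to that directory. Globbing with `*.pdf` and quoting `"$f"` keeps file names containing spaces intact, and files without a recognizable year are skipped rather than given a malformed date.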