a/website/recoll_XMP/index.html b/website/recoll_XMP/index.html
...
...
743
here</a>. However it described using the old shell-based PDF Recoll input
743
here</a>. However it described using the old shell-based PDF Recoll input
744
handler, which differs a lot from doing something equivalent with the
744
handler, which differs a lot from doing something equivalent with the
745
current Python-based one (for which XMP capability is available from
745
current Python-based one (for which XMP capability is available from
746
recoll 1.23.2, but the new handler can be used with previous Recoll
746
recoll 1.23.2, but the new handler can be used with previous Recoll
747
versions).</p></div>
747
versions).</p></div>
748
<div class="paragraph"><p>This page was adapted from the text by Jeffrey Dick, using input from
748
<div class="paragraph"><p>I based this page on the text by Jeffrey Dick, using input from Johannes
749
Johannes Menzel, (especially the result list paragraph format),
749
Menzel for all examples about the new features. The discussion which led to
750
adapting things for the new handler. The discussion which led to the
751
updated handler is a
750
the updated handler is a
752
<a href="https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags">Bitbucket
751
<a href="https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags">Bitbucket
753
Recoll issue</a>.</p></div>
752
Recoll issue</a>.</p></div>
754
</div>
753
</div>
755
</div>
754
</div>
756
<div class="sect1">
755
<div class="sect1">
...
...
785
</div>
784
</div>
786
</div>
785
</div>
787
</div>
786
</div>
788
</div>
787
</div>
789
<div class="sect1">
788
<div class="sect1">
790
<h2 id="_custom_indexing_fields_file">Custom indexing (fields file)</h2>
789
<h2 id="_custom_indexing_short_example_fields_file">Custom indexing short example (fields file)</h2>
791
<div class="sectionbody">
790
<div class="sectionbody">
792
<div class="paragraph"><p>Let&#8217;s create two fields named "year" and "journal". The prefixes
791
<div class="paragraph"><p>The following example (an extract from the complete configuration shown later)
793
starting with "XY" are extension prefixes that are added to the terms
792
creates two fields named "refjournal" and "refpages", which are both stored
794
in the Xapian database (Recoll internally does not use prefixes
793
(so they can be displayed in result list entries), and indexed (you can
795
starting with XY). Additionally, the year and journal are stored so
794
specifically search them).</p></div>
796
they can be displayed in the results list. Some other types of
795
<div class="paragraph"><p>Some other types of metadata, such as title, author and keywords, are
797
metadata, such as title, author and keywords, are already indexed by
796
already indexed by Recoll (the default rclpdf finds them using the
798
Recoll (the default rclpdf finds them using the <strong>pdftotext</strong>
797
<strong>pdftotext</strong> command) so there is no need to add those to the [prefixes]
799
command) so there is no need to add those to the [prefixes] section.</p></div>
798
section.</p></div>
800
<div class="paragraph"><p>Add this text to the fields file in your Recoll configuration
799
<div class="paragraph"><p>This is taken from the <code>fields</code> file inside the configuration directory
801
directory (<em>~/.recoll/fields</em>).</p></div>
800
(e.g. <em>~/.recoll/fields</em>).</p></div>
802
<div class="listingblock">
801
<div class="listingblock">
803
<div class="content">
802
<div class="content">
804
<pre><code>[prefixes]
803
<pre><code>[prefixes]
805
year = XYEAR
804
refjournal=RFJOURNAL
806
journal = XYJOUR
805
refpages=RFPAGES
807
806
808
[stored]
807
[stored]
809
bibtex:year =
808
refjournal =
810
bibtex:journal =</code></pre>
809
refpages =
810
811
[aliases]
812
refjournal = bibtex:journal bibtex:journaltitle
813
refpages = bibtex:pages</code></pre>
811
</div></div>
814
</div></div>
812
</div>
815
</div>
813
</div>
816
</div>
814
<div class="sect1">
817
<div class="sect1">
815
<h2 id="_telling_the_handler_what_fields_to_extract">Telling the handler what fields to extract</h2>
818
<h2 id="_telling_the_handler_what_fields_to_extract">Telling the handler what fields to extract</h2>
816
<div class="sectionbody">
819
<div class="sectionbody">
817
<div class="paragraph"><p>As of Recoll 1.23.2, the PDF handler has the capability to use
820
<div class="paragraph"><p>As of Recoll 1.23.2, the PDF handler has the capability to use <strong>pdfinfo</strong>
818
<strong>pdfinfo</strong> for extracting XMP metadata. The switch for executing <strong>pdfinfo</strong>
821
for extracting XMP metadata. The switch for executing <strong>pdfinfo</strong> is the
819
is the <em>pdfextrameta</em> configuration parameter, and the value of the
822
<em>pdfextrameta</em> configuration parameter, and the value of the parameter is a
820
parameter is a list of XMP tags to extract, with optional conversion
823
list of XMP tags to extract, with optional conversion to Recoll field names
821
to Recoll field names (the XMP qualified tag name is kept by
824
(the XMP qualified tag name is kept by default; an optional translation is
822
default). Example:</p></div>
825
separated by a <em>|</em> character). Example (without translations):</p></div>
823
<div class="listingblock">
826
<div class="listingblock">
824
<div class="content">
827
<div class="content">
825
<pre><code>pdfextrameta =  bibtex:year bibtex:journal bibtex:booktitle|title</code></pre>
828
<pre><code>pdfextrameta =  bibtex:year bibtex:journal bibtex:journaltitle</code></pre>
826
</div></div>
829
</div></div>
827
<div class="paragraph"><p>Here, <em>bibtex:year</em> and <em>bibtex:journal</em> are used directly, and
830
<div class="paragraph"><p>Note that it is quite equivalent to translate a field name inside
828
<em>bibtex:booktitle</em> is translated to <em>title</em> (the example is not
831
<em>pdfextrameta</em> or to use aliases inside the <em>fields</em> file.</p></div>
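<div class="paragraph"><p>As a rough illustration of the <em>tag|field</em> syntax, the sketch below splits a <em>pdfextrameta</em> value into a mapping from XMP tag to Recoll field name. It is not the handler's actual parser: the function name <code>parse_pdfextrameta</code> is hypothetical, and the code only assumes that entries are separated by white space and that an optional translation follows a <em>|</em> character, as described above.</p></div>

```python
def parse_pdfextrameta(value):
    """Map each XMP tag to the Recoll field name it would be stored under.

    Hypothetical helper for illustration only, not part of Recoll.
    """
    mapping = {}
    for item in value.split():
        # "tag|field" translates the tag; a bare "tag" keeps its own name.
        tag, _, field = item.partition("|")
        mapping[tag] = field or tag
    return mapping


fields = parse_pdfextrameta("bibtex:year bibtex:journal bibtex:booktitle|title")
for tag, field in fields.items():
    print(tag, "->", field)
```

<div class="paragraph"><p>With this value, <em>bibtex:year</em> and <em>bibtex:journal</em> keep their qualified names, while <em>bibtex:booktitle</em> is stored as <em>title</em>.</p></div>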
829
supposed to make sense)</p></div>
830
</div>
832
</div>
831
</div>
833
</div>
832
<div class="sect1">
834
<div class="sect1">
833
<h2 id="_editing_the_field_values">Editing the field values</h2>
835
<h2 id="_editing_the_field_values">Editing the field values</h2>
834
<div class="sectionbody">
836
<div class="sectionbody">
...
...
869
            # etc.
871
            # etc.
870
            pass
872
            pass
871
873
872
        return txt</code></pre>
874
        return txt</code></pre>
873
</div></div>
875
</div></div>
876
<div class="paragraph"><p>The metadata-editing script can be modified to fill in the "journal" field for
877
BibTeX entries that aren&#8217;t journal articles (e.g. bibtex:booktitle
878
for "InCollection" entries), by defining a <em>wrapup()</em> method which will
879
be called with the whole metadata array (an array of <em>(nm,value)</em>
880
pairs) for global editing/removing/addition.</p></div>
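<div class="paragraph"><p>A minimal sketch of such a <em>wrapup()</em> method follows. It assumes that the metadata array is a mutable Python list of <em>(nm,value)</em> tuples which the handler reads back after the call, so the list is edited in place; that detail is an assumption and should be checked against the handler version in use.</p></div>

```python
class MetaFixer(object):
    def metafix(self, nm, txt):
        # Per-field editing is left untouched in this sketch.
        return txt

    def wrapup(self, metaheaders):
        # Assumption: metaheaders is a mutable list of (nm, value) pairs
        # and is modified in place.
        names = {nm for nm, _ in metaheaders}
        if "bibtex:journal" in names:
            return
        # No journal: reuse the first book title, e.g. for InCollection entries.
        for nm, value in list(metaheaders):
            if nm == "bibtex:booktitle":
                metaheaders.append(("bibtex:journal", value))
                break


meta = [("bibtex:title", "A chapter"), ("bibtex:booktitle", "Some collection")]
MetaFixer().wrapup(meta)
print(meta)
```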
874
</div>
881
</div>
875
</div>
882
</div>
876
<div class="sect1">
883
<div class="sect1">
877
<h2 id="_indexing">Indexing</h2>
884
<h2 id="_indexing">Indexing</h2>
878
<div class="sectionbody">
885
<div class="sectionbody">
...
...
884
</div>
891
</div>
885
</div>
892
</div>
886
<div class="sect1">
893
<div class="sect1">
887
<h2 id="_result_paragraph_format">Result paragraph format</h2>
894
<h2 id="_result_paragraph_format">Result paragraph format</h2>
888
<div class="sectionbody">
895
<div class="sectionbody">
889
<div class="paragraph"><p>Here, the result is formatted to show the title, which is a link
896
<div class="paragraph"><p>The result paragraph format defines what fields are displayed inside the Recoll
890
to open the document, in blue with underlining turned off. The next
897
result list, and how they are formatted.</p></div>
891
two lines contain the authors, then the journal title in green
892
italicized text followed by year (in parentheses). The keywords are
893
listed in red after the abstract/text snippet.</p></div>
894
<div class="paragraph"><p>Edit this using the Recoll GUI: Preferences &gt; GUI configuration &gt;
898
<div class="paragraph"><p>Edit this using the Recoll GUI: Preferences &gt; GUI configuration &gt;
895
    Result List &gt; Edit result paragraph format string.</p></div>
899
    Result List &gt; Edit result paragraph format string.</p></div>
896
<div class="listingblock">
900
<div class="listingblock">
897
<div class="content">
901
<div class="content">
898
<pre><code>&lt;table class="respar" style="padding-bottom: 10px;" cellspacing="5" cellpadding="5"&gt;
902
<pre><code>&lt;table class="respar" style="padding-bottom: 10px;" cellspacing="5" cellpadding="5"&gt;
...
...
920
&lt;/tr&gt;
924
&lt;/tr&gt;
921
&lt;/tbody&gt;
925
&lt;/tbody&gt;
922
926
923
&lt;/table&gt;</code></pre>
927
&lt;/table&gt;</code></pre>
924
</div></div>
928
</div></div>
925
<div class="paragraph"><p>The screenshot below also has the <em>Highlight color for query terms</em>
929
<div class="paragraph"><p>There are
926
set to <code>black; font-weight:bold;</code> for bold, black text (instead
927
of the blue default). There
928
are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
930
<a href="https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails">various
929
methods for creating the thumbnails]; the ones here were made by
931
methods for creating the thumbnails</a>; the ones here were made by opening
930
opening the directory containing the PDFs in the Dolphin file manager
932
the directory containing the PDFs in the Dolphin file manager (part of KDE)
931
(part of KDE) and selecting the Preview option.</p></div>
933
and selecting the Preview option.</p></div>
932
</div>
934
<div class="paragraph"><p>And the result:</p></div>
933
</div>
935
<div class="imageblock">
934
<div class="sect1">
936
<div class="content">
935
<h2 id="_a_search_example">A search example</h2>
937
<img src="recoll_query.png" alt="Result list display" />
936
<div class="sectionbody">
938
</div>
937
<div class="paragraph"><p>The simple query is <code>cerevisiae keyword:protein</code>. This
939
</div>
938
returns only PDFs that have the text "cerevisiae" and have been
939
tagged with the "protein" keyword. The LaTeX-style formatting from
940
the BibTeX database is displayed as HTML (note the italicized words
941
in article title, and umlaut in author&#8217;s name). Other queries could
942
be made based on the PDF metadata, e.g. <em>journal:plos</em>
943
r <em>year:2013</em>.</p></div>
944
<div class="paragraph"><p>image::recoll_query.png</p></div>
945
</div>
940
</div>
946
</div>
941
</div>
947
<div class="sect1">
942
<div class="sect1">
948
<h2 id="_more_possibilities">More possibilities</h2>
943
<h2 id="_more_possibilities">More possibilities</h2>
949
<div class="sectionbody">
944
<div class="sectionbody">
...
...
965
</div></div>
960
</div></div>
966
<div class="paragraph"><p>Note that the publication year could then be shown in
961
<div class="paragraph"><p>Note that the publication year could then be shown in
967
the result list using the stored date of the file (using "%D" in the
962
the result list using the stored date of the file (using "%D" in the
968
result paragraph format, and date format "%Y") instead of having to
963
result paragraph format, and date format "%Y") instead of having to
969
add the year to the index as shown above.</p></div>
964
add the year to the index as shown above.</p></div>
970
<div class="ulist"><ul>
971
<li>
972
<p>
973
The filter can be modified to fill in the "journal" field for
974
  BibTex entries that aren&#8217;t journal articles (e.g. bibtex:booktitle
975
  for "InCollection" entries).
976
</p>
977
</li>
965
</div>
966
</div>
967
<div class="sect1">
968
<h2 id="_complete_example">Complete example</h2>
969
<div class="sectionbody">
970
<div class="paragraph"><p>This was designed by Johannes Menzel, who kindly provided the data when we
971
worked on improving PDF XMP data extraction. The originals are listed in
972
this
973
<a href="https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags">Bitbucket issue</a>.</p></div>
974
<div class="paragraph"><p>The paragraph format is listed above.</p></div>
975
<div class="sect2">
976
<h3 id="_em_recoll_conf_em_additions"><em>recoll.conf</em> additions:</h3>
977
<div class="listingblock">
978
<div class="content">
979
<pre><code>pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \
980
  bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \
981
  bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \
982
  bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \
983
  bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \
984
  bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \
985
  dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
986
987
defaultcharset = UTF-8//
988
989
pdfextrametafix = /home/hannes/.recoll/metafix.py</code></pre>
978
</ul></div>
990
</div></div>
991
</div>
992
<div class="sect2">
993
<h3 id="_em_metafix_py_em_script"><em>metafix.py</em> script:</h3>
994
<div class="listingblock">
995
<div class="content">
996
<pre><code>import sys
997
import re
998
999
# This can be used for local XMP field editing.
1000
#
1001
# A new instance is created for each PDF document (so the object could
1002
# keep state to avoid, e.g. duplicate values)
1003
#
1004
# The metafix method receives an (original) field name, and the text
1005
# value, and should return the possibly modified text.
1006
class MetaFixer(object):
1007
    def __init__(self):
1008
        pass
1009
1010
    def metafix(self, nm, txt):
1011
        if nm == 'bibtex:pages':
1012
            txt = re.sub(r'--', '-', txt)
1013
            txt = re.sub(r'^', ', p. ', txt)
1014
        elif nm == 'bibtex:author':
1015
            txt = re.sub(r'$', r':\ ', txt)
1016
            pass
1017
        elif nm == 'bibtex:chapter':
1018
            txt = re.sub(r'^', ', in: id.: ', txt)
1019
            pass
1020
        elif nm == 'bibtex:editor':
1021
            txt = re.sub(r'^', ', in: ', txt)
1022
            txt = re.sub(r'$', r' (ed.):\ ', txt)
1023
            pass
1024
        elif nm == 'bibtex:year':
1025
            txt = re.sub(r'^', ', ', txt)
1026
            pass
1027
        elif nm == 'bibtex:date':
1028
            txt = re.sub(r'^', ', ', txt)
1029
            pass
1030
        elif nm == 'bibtex:volume':
1031
            txt = re.sub(r'^', ', vol. ', txt)
1032
            pass
1033
        elif nm == 'bibtex:number':
1034
            txt = re.sub(r'^', ', no. ', txt)
1035
            pass
1036
        elif nm == 'bibtex:journaltitle':
1037
            txt = re.sub(r'^', ', in: ', txt)
1038
            pass
1039
        elif nm == 'bibtex:journal':
1040
            txt = re.sub(r'^', ', in: ', txt)
1041
            pass
1042
        elif nm == 'bibtex:title':
1043
            txt = re.sub(r'^', '"', txt)
1044
            txt = re.sub(r'$', '"', txt)
1045
            pass
1046
        elif nm == 'bibtex:location':
1047
            txt = re.sub(r'^', ', ', txt)
1048
            txt = re.sub(r'$', r':\ ', txt)
1049
            pass
1050
        elif nm == 'bibtex:address':
1051
            txt = re.sub(r'^', ', ', txt)
1052
            txt = re.sub(r'$', r':\ ', txt)
1053
            pass
1054
        elif nm == 'bibtex:isbn':
1055
            txt = re.sub(r'^', 'ISBN: ', txt)
1056
            pass
1057
        elif nm == 'bibtex:issn':
1058
            txt = re.sub(r'^', 'ISSN: ', txt)
1059
            pass
1060
        elif nm == 'bibtex:doi':
1061
            txt = re.sub(r'^', 'DOI: ', txt)
1062
            pass
1063
        elif nm == 'bibtex:bibtexkey':
1064
            txt = re.sub(r'^', 'Key: ', txt)
1065
            pass
1066
1067
        return txt</code></pre>
1068
</div></div>
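<div class="paragraph"><p>Since most branches of the script above only add a fixed prefix or suffix, the same idea could also be written with a lookup table. The following is an illustrative variant under that assumption, not the script used by Johannes Menzel: the <code>DECORATIONS</code> name is hypothetical and only a few of the fields are covered.</p></div>

```python
# Hypothetical table: field name -> (prefix, suffix) added around the value.
DECORATIONS = {
    "bibtex:pages": (", p. ", ""),
    "bibtex:volume": (", vol. ", ""),
    "bibtex:number": (", no. ", ""),
    "bibtex:title": ('"', '"'),
    "bibtex:isbn": ("ISBN: ", ""),
}


class MetaFixer(object):
    def metafix(self, nm, txt):
        if nm == "bibtex:pages":
            # BibTeX page ranges use a double dash; display a single one.
            txt = txt.replace("--", "-")
        # Fields without a table entry are returned unchanged.
        prefix, suffix = DECORATIONS.get(nm, ("", ""))
        return prefix + txt + suffix


fixer = MetaFixer()
for nm, txt in [("bibtex:pages", "12--34"), ("bibtex:title", "Some title")]:
    print(nm, "=>", fixer.metafix(nm, txt))
```

<div class="paragraph"><p>Adding a field then means adding one table entry instead of a new <em>elif</em> branch.</p></div>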
1069
</div>
1070
<div class="sect2">
1071
<h3 id="_em_fields_em_file"><em>fields</em> file:</h3>
1072
<div class="listingblock">
1073
<div class="content">
1074
<pre><code>[prefixes]
1075
1076
refjournal=RFJOURNAL
1077
refpages=RFPAGES
1078
reftitle=RFTTITLE
1079
refvolume=RFVOLUME
1080
refauthor=RFAUTHOR
1081
refyear=RFYYEAR
1082
refisbn=RFISBN
1083
refissn=RFISSN
1084
refdoi=RFDOI
1085
refeditor=RFEDITOR
1086
refpublisher=RFPUBLISHER
1087
refaddress=RFADDRESS
1088
reflocation=RFLOCATION
1089
refbooktitle=RFBOOKTITLE
1090
refurl=RFURL
1091
reftype=RFTYPE
1092
refkey=RFKEY
1093
refabstract=RFABSTRACT
1094
refkeywords=RFKEYWORDS
1095
refcomment=RFCOMMENT
1096
refedition=RFEDITION
1097
reflanguage=RFLANGUAGE
1098
1099
[stored]
1100
1101
refjournal=
1102
refpages=
1103
reftitle=
1104
refvolume=
1105
refauthor=
1106
refyear=
1107
refisbn=
1108
refissn=
1109
refdoi=
1110
refeditor=
1111
refpublisher=
1112
refaddress=
1113
reflocation=
1114
refbooktitle=
1115
refurl=
1116
reftype=
1117
refkey=
1118
refabstract=
1119
refkeywords=
1120
refcomment=
1121
refedition=
1122
reflanguage=
1123
refid=
1124
1125
[aliases]
1126
1127
refjournal = bibtex:journal bibtex:journaltitle
1128
refpages = bibtex:pages
1129
reftitle = bibtex:title
1130
refvolume = bibtex:volume
1131
refauthor = bibtex:author
1132
refyear = bibtex:year bibtex:date
1133
refid = dc:identifier bibtex:isbn bibtex:issn
1134
refisbn = bibtex:isbn
1135
refissn = bibtex:issn
1136
refdoi = bibtex:doi
1137
refeditor = bibtex:editor
1138
refpublisher = bibtex:publisher
1139
refaddress = bibtex:address
1140
reflocation = bibtex:location
1141
refbooktitle = bibtex:booktitle
1142
refurl = bibtex:url
1143
reftype = bibtex:entrytype bibtex:type
1144
refkey = bibtex:bibtexkey
1145
refabstract = bibtex:abstract
1146
refkeywords = bibtex:keywords
1147
refcomment = bibtex:comment
1148
refedition = bibtex:edition
1149
reflanguage = bibtex:language
1150
author = xesam:author</code></pre>
1151
</div></div>
1152
</div>
979
</div>
1153
</div>
980
</div>
1154
</div>
981
</div>
1155
</div>
982
<div id="footnotes"><hr /></div>
1156
<div id="footnotes"><hr /></div>
983
<div id="footer">
1157
<div id="footer">
984
<div id="footer-text">
1158
<div id="footer-text">
985
Last updated
1159
Last updated
986
 2017-05-17 07:27:42 CEST
1160
 2017-05-23 09:26:52 CEST
987
</div>
1161
</div>
988
</div>
1162
</div>
989
</body>
1163
</body>
990
</html>
1164
</html>