--- a/website/recoll_XMP/index.txt
+++ b/website/recoll_XMP/index.txt
@@ -8,10 +8,9 @@
recoll 1.23.2, but the new handler can be used with previous Recoll
versions).
-This page was adapted from the text by Jeffrey Dick, using input from
-Johannes Menzel, (especially the result list paragraph format),
-adapting things for the new handler. The discussion which led to the
-updated handler is a
+I based this page on the text by Jeffrey Dick, using input from Johannes
+Menzel for all examples about the new features. The discussion which led to
+the updated handler is a
link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
Recoll issue].
@@ -42,46 +41,51 @@
image::jabref_metadata.png[Editing metadata with jabref]
-== Custom indexing (fields file)
-
-Let's create two fields named "year" and "journal". The prefixes
-starting with "XY" are extension prefixes that are added to the terms
-in the Xapian database (Recoll internally does not use prefixes
-starting with XY). Additionally, the year and journal are stored so
-they can be displayed in the results list. Some other types of
-metadata, such as title, author and keywords, are already indexed by
-Recoll (the default rclpdf finds them using the *pdftotext*
-command) so there is no need to add those to the [prefixes] section.
-
-Add this text to the fields file in your Recoll configuration
-directory ('~/.recoll/fields').
+== Custom indexing short example (fields file)
+
+The following example (an extract from the complete configuration shown
+later) creates two fields named "refjournal" and "refpages". Both are
+stored (so they can be displayed in result list entries) and indexed (so
+you can search them specifically).
+
+Some other types of metadata, such as title, author and keywords, are
+already indexed by Recoll (the default rclpdf finds them using the
+*pdftotext* command) so there is no need to add those to the [prefixes]
+section.
+
+This is taken from the `fields` file inside the configuration directory
+(e.g. '~/.recoll/fields').
+
----
[prefixes]
-year = XYEAR
-journal = XYJOUR
+refjournal=RFJOURNAL
+refpages=RFPAGES
[stored]
-bibtex:year =
-bibtex:journal =
+refjournal =
+refpages =
+
+[aliases]
+refjournal = bibtex:journal bibtex:journaltitle
+refpages = bibtex:pages
----
== Telling the handler what fields to extract
-As of Recoll 1.23.2, the PDF handler has the capability to use
-*pdfinfo* for extracting XMP metadata. The switch for executing *pdfinfo*
-is the 'pdfextrameta' configuration parameter, and the value of the
-parameter is a list of XMP tags to extract, with optional conversion
-to Recoll field names (the XMP qualified tag name is kept by
-default). Example:
-
-----
-pdfextrameta = bibtex:year bibtex:journal bibtex:booktitle|title
-----
-
-Here, 'bibtex:year' and 'bibtex:journal' are used directly, and
-'bibtex:booktitle' is translated to 'title' (the example is not
-supposed to make sense)
+As of Recoll 1.23.2, the PDF handler has the capability to use *pdfinfo*
+for extracting XMP metadata. The switch for executing *pdfinfo* is the
+'pdfextrameta' configuration parameter, and the value of the parameter is a
+list of XMP tags to extract, with optional conversion to Recoll field names
+(the XMP qualified tag name is kept by default; a translation, if any,
+follows the tag after a '|' character). Example (without translations):
+
+----
+pdfextrameta = bibtex:year bibtex:journal bibtex:journaltitle
+----
+
+Note that translating a field name inside 'pdfextrameta' is largely
+equivalent to using aliases inside the 'fields' file.
== Editing the field values
@@ -127,6 +131,13 @@
return txt
----
+
+The metadata-editing script can be modified to fill in the "journal" field for
+BibTeX entries that aren't journal articles (e.g. from bibtex:booktitle
+for "InCollection" entries), by defining a 'wrapup()' method, which will
+be called with the whole metadata array (an array of '(nm,value)'
+pairs) so that entries can be edited, removed or added globally.
+
== Indexing
Then index away!
@@ -138,12 +149,9 @@
== Result paragraph format
-Here, the result is formatted to show the title, which is a link
-to open the document, in blue with underlining turned off. The next
-two lines contain the authors, then the journal title in green
-italicized text followed by year (in parentheses). The keywords are
-listed in red after the abstract/text snippet.
-
+The result paragraph format defines which fields are displayed in the Recoll
+result list, and how they are formatted.
+
Edit this using the Recoll GUI: Preferences > GUI configuration >
Result List > Edit result paragraph format string.
@@ -177,26 +185,15 @@
----
-The screenshot below also has the 'Highlight color for query terms'
-set to `black; font-weight:bold;` for bold, black text (instead
-of the blue default). There
-are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
-methods for creating the thumbnails]; the ones here were made by
-opening the directory containing the PDFs in the Dolphin file manager
-(part of KDE) and selecting the Preview option.
-
-
-== A search example
-
-The simple query is `cerevisiae keyword:protein`. This
-returns only PDFs that have the text "cerevisiae" and have been
-tagged with the "protein" keyword. The LaTeX-style formatting from
-the BibTeX database is displayed as HTML (note the italicized words
-in article title, and umlaut in author's name). Other queries could
-be made based on the PDF metadata, e.g. 'journal:plos'
-r 'year:2013'.
-
-image::recoll_query.png
+There are
+link:https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
+methods for creating the thumbnails]; the ones here were made by opening
+the directory containing the PDFs in the Dolphin file manager (part of KDE)
+and selecting the Preview option.
+
+And the result:
+
+image::recoll_query.png[Result list display]
== More possibilities
@@ -216,6 +213,190 @@
result paragraph format, and date format "%Y") instead of having to
add the year to the index as shown above.
-- The filter can be modified to fill in the "journal" field for
- BibTex entries that aren't journal articles (e.g. bibtex:booktitle
- for "InCollection" entries).
+
+== Complete example
+
+This was designed by Johannes Menzel, who kindly provided the data when we
+worked on improving PDF XMP data extraction. The originals are listed in
+this
+link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket issue].
+
+The paragraph format is listed above.
+
+=== 'recoll.conf' additions:
+
+----
+pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \
+ bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \
+ bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \
+ bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \
+ bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \
+ bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \
+ dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
+
+defaultcharset = UTF-8//
+
+pdfextrametafix = /home/hannes/.recoll/metafix.py
+----
+
+
+=== 'metafix.py' script:
+
+----
+import sys
+import re
+
+# This can be used for local XMP field editing.
+#
+# A new instance is created for each PDF document (so the object could
+# keep state to avoid, e.g. duplicate values)
+#
+# The metafix method receives an (original) field name, and the text
+# value, and should return the possibly modified text.
+class MetaFixer(object):
+    def __init__(self):
+        pass
+
+    def metafix(self, nm, txt):
+        if nm == 'bibtex:pages':
+            txt = re.sub(r'--', '-', txt)
+            txt = re.sub(r'^', ', p. ', txt)
+        elif nm == 'bibtex:author':
+            txt = re.sub(r'$', r':\ ', txt)  # raw string keeps the literal "\ "
+            pass
+        elif nm == 'bibtex:chapter':
+            txt = re.sub(r'^', ', in: id.: ', txt)
+            pass
+        elif nm == 'bibtex:editor':
+            txt = re.sub(r'^', ', in: ', txt)
+            txt = re.sub(r'$', r' (ed.):\ ', txt)
+            pass
+        elif nm == 'bibtex:year':
+            txt = re.sub(r'^', ', ', txt)
+            pass
+        elif nm == 'bibtex:date':
+            txt = re.sub(r'^', ', ', txt)
+            pass
+        elif nm == 'bibtex:volume':
+            txt = re.sub(r'^', ', vol. ', txt)
+            pass
+        elif nm == 'bibtex:number':
+            txt = re.sub(r'^', ', no. ', txt)
+            pass
+        elif nm == 'bibtex:journaltitle':
+            txt = re.sub(r'^', ', in: ', txt)
+            pass
+        elif nm == 'bibtex:journal':
+            txt = re.sub(r'^', ', in: ', txt)
+            pass
+        elif nm == 'bibtex:title':
+            txt = re.sub(r'^', '"', txt)
+            txt = re.sub(r'$', '"', txt)
+            pass
+        elif nm == 'bibtex:location':
+            txt = re.sub(r'^', ', ', txt)
+            txt = re.sub(r'$', r':\ ', txt)
+            pass
+        elif nm == 'bibtex:address':
+            txt = re.sub(r'^', ', ', txt)
+            txt = re.sub(r'$', r':\ ', txt)
+            pass
+        elif nm == 'bibtex:isbn':
+            txt = re.sub(r'^', 'ISBN: ', txt)
+            pass
+        elif nm == 'bibtex:issn':
+            txt = re.sub(r'^', 'ISSN: ', txt)
+            pass
+        elif nm == 'bibtex:doi':
+            txt = re.sub(r'^', 'DOI: ', txt)
+            pass
+        elif nm == 'bibtex:bibtexkey':
+            txt = re.sub(r'^', 'Key: ', txt)
+            pass
+
+        return txt
+----
+
+
+=== 'fields' file:
+
+----
+[prefixes]
+
+refjournal=RFJOURNAL
+refpages=RFPAGES
+reftitle=RFTTITLE
+refvolume=RFVOLUME
+refauthor=RFAUTHOR
+refyear=RFYYEAR
+refisbn=RFISBN
+refissn=RFISSN
+refdoi=RFDOI
+refeditor=RFEDITOR
+refpublisher=RFPUBLISHER
+refaddress=RFADDRESS
+reflocation=RFLOCATION
+refbooktitle=RFBOOKTITLE
+refurl=RFURL
+reftype=RFTYPE
+refkey=RFKEY
+refabstract=RFABSTRACT
+refkeywords=RFKEYWORDS
+refcomment=RFCOMMENT
+refedition=RFEDITION
+reflanguage=RFLANGUAGE
+
+[stored]
+
+refjournal=
+refpages=
+reftitle=
+refvolume=
+refauthor=
+refyear=
+refisbn=
+refissn=
+refdoi=
+refeditor=
+refpublisher=
+refaddress=
+reflocation=
+refbooktitle=
+refurl=
+reftype=
+refkey=
+refabstract=
+refkeywords=
+refcomment=
+refedition=
+reflanguage=
+refid=
+
+[aliases]
+
+refjournal = bibtex:journal bibtex:journaltitle
+refpages = bibtex:pages
+reftitle = bibtex:title
+refvolume = bibtex:volume
+refauthor = bibtex:author
+refyear = bibtex:year bibtex:date
+refid = dc:identifier bibtex:isbn bibtex:issn
+refisbn = bibtex:isbn
+refissn = bibtex:issn
+refdoi = bibtex:doi
+refeditor = bibtex:editor
+refpublisher = bibtex:publisher
+refaddress = bibtex:address
+reflocation = bibtex:location
+refbooktitle = bibtex:booktitle
+refurl = bibtex:url
+reftype = bibtex:entrytype bibtex:type
+refkey = bibtex:bibtexkey
+refabstract = bibtex:abstract
+refkeywords = bibtex:keywords
+refcomment = bibtex:comment
+refedition = bibtex:edition
+reflanguage = bibtex:language
+author = xesam:author
+----
+