--- a/website/recoll_XMP/index.txt
+++ b/website/recoll_XMP/index.txt
@@ -8,10 +8,9 @@
 recoll 1.23.2, but the new handler can be used with previous Recoll
 versions).
 
-This page was adapted from the text by Jeffrey Dick, using input from
-Johannes Menzel, (especially the result list paragraph format),
-adapting things for the new handler. The discussion which led to the
-updated handler is a
+I based this page on the text by Jeffrey Dick, using input from Johannes
+Menzel for all examples about the new features. The discussion which led to
+the updated handler is a
 link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket
 Recoll issue].
   
@@ -42,46 +41,51 @@
 
 image::jabref_metadata.png[Editing metadata with jabref]
 
-== Custom indexing (fields file)
-
-Let's create two fields named "year" and "journal". The prefixes
-starting with "XY" are extension prefixes that are added to the terms
-in the Xapian database (Recoll internally does not use prefixes
-starting with XY). Additionally, the year and journal are stored so
-they can be displayed in the results list. Some other types of
-metadata, such as title, author and keywords, are already indexed by
-Recoll (the default rclpdf finds them using the *pdftotext*
-command) so there is no need to add those to the [prefixes] section. 
-
-Add this text to the fields file in your Recoll configuration
-directory ('~/.recoll/fields'). 
+== Custom indexing short example (fields file)
+
+The following example (an extract from the complete configuration shown
+later) creates two fields named "refjournal" and "refpages", which are both
+stored (so they can be displayed in result list entries) and indexed (so you
+can search them specifically).
+
+Some other types of metadata, such as title, author and keywords, are
+already indexed by Recoll (the default rclpdf finds them using the
+*pdftotext* command) so there is no need to add those to the [prefixes]
+section.
+
+This is taken from the `fields` file inside the configuration
+directory (e.g. '~/.recoll/fields').
+
 
 ----
 [prefixes]
-year = XYEAR
-journal = XYJOUR
+refjournal=RFJOURNAL
+refpages=RFPAGES
 
 [stored]
-bibtex:year =
-bibtex:journal =
+refjournal =
+refpages =
+
+[aliases]
+refjournal = bibtex:journal bibtex:journaltitle
+refpages = bibtex:pages
 ----
 
 == Telling the handler what fields to extract
 
-As of Recoll 1.23.2, the PDF handler has the capability to use
-*pdfinfo* for extracting XMP metadata. The switch for executing *pdfinfo*
-is the 'pdfextrameta' configuration parameter, and the value of the
-parameter is a list of XMP tags to extract, with optional conversion
-to Recoll field names (the XMP qualified tag name is kept by
-default). Example:
-
-----
-pdfextrameta =  bibtex:year bibtex:journal bibtex:booktitle|title
-----
-
-Here, 'bibtex:year' and 'bibtex:journal' are used directly, and
-'bibtex:booktitle' is translated to 'title' (the example is not
-supposed to make sense)
+As of Recoll 1.23.2, the PDF handler has the capability to use *pdfinfo*
+for extracting XMP metadata. The switch for executing *pdfinfo* is the
+'pdfextrameta' configuration parameter, and the value of the parameter is a
+list of XMP tags to extract, with optional conversion to Recoll field names
+(the XMP qualified tag name is kept by default; a translation is separated
+from the tag name by a '|' character). Example (without translations):
+
+----
+pdfextrameta =  bibtex:year bibtex:journal bibtex:journaltitle
+----
+
+Note that translating a field name inside 'pdfextrameta' is more or less
+equivalent to using aliases inside the 'fields' file.
 
 == Editing the field values
 
@@ -127,6 +131,13 @@
         return txt
 ----
 
+
+The metadata-editing script can be modified to fill in the "journal" field for
+BibTeX entries that aren't journal articles (e.g. from bibtex:booktitle
+for "InCollection" entries), by defining a 'wrapup()' method which will
+be called with the whole metadata array (an array of '(nm,value)'
+pairs) for global editing, removal or addition of entries.
+
 == Indexing
 
 Then index away!
@@ -138,12 +149,9 @@
 
 == Result paragraph format
 
-Here, the result is formatted to show the title, which is a link
-to open the document, in blue with underlining turned off. The next
-two lines contain the authors, then the journal title in green
-italicized text followed by year (in parentheses). The keywords are
-listed in red after the abstract/text snippet.
-    
+The result paragraph format defines which fields are displayed in the Recoll
+result list, and how they are formatted.
+
 Edit this using the Recoll GUI: Preferences > GUI configuration >
     Result List > Edit result paragraph format string.
 
@@ -177,26 +185,15 @@
 
 ----
 
-The screenshot below also has the 'Highlight color for query terms'
-set to `black; font-weight:bold;` for bold, black text (instead
-of the blue default). There
-are linkhttps://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
-methods for creating the thumbnails]; the ones here were made by
-opening the directory containing the PDFs in the Dolphin file manager
-(part of KDE) and selecting the Preview option.
-
-
-== A search example
-
-The simple query is `cerevisiae keyword:protein`. This
-returns only PDFs that have the text "cerevisiae" and have been
-tagged with the "protein" keyword. The LaTeX-style formatting from
-the BibTeX database is displayed as HTML (note the italicized words
-in article title, and umlaut in author's name). Other queries could
-be made based on the PDF metadata, e.g. 'journal:plos'
-r 'year:2013'. 
-
-image::recoll_query.png
+There are
+link:https://bitbucket.org/medoc/recoll/wiki/ResultsThumbnails[various
+methods for creating the thumbnails]; the ones here were made by opening
+the directory containing the PDFs in the Dolphin file manager (part of KDE)
+and selecting the Preview option.
+
+And the result:
+
+image::recoll_query.png[Result list display]
 
 == More possibilities
 
@@ -216,6 +213,190 @@
 result paragraph format, and date format "%Y") instead of having to
 add the year to the index as shown above. 
 
-- The filter can be modified to fill in the "journal" field for
-  BibTex entries that aren't journal articles (e.g. bibtex:booktitle
-  for "InCollection" entries). 
+
+== Complete example
+
+This was designed by Johannes Menzel, who kindly provided the data when we
+worked on improving PDF XMP data extraction. The originals are listed in
+this
+link:https://bitbucket.org/medoc/recoll/issues/300/extracting-xmp-metadata-and-tmsu-tags[Bitbucket issue].
+
+The paragraph format is listed above.
+
+=== 'recoll.conf' additions:
+
+----
+pdfextrameta = bibtex:journal bibtex:journaltitle bibtex:pages \
+  bibtex:volume bibtex:number bibtex:booktitle bibtex:year bibtex:author \
+  bibtex:title bibtex:isbn bibtex:issn bibtex:editor bibtex:address \
+  bibtex:location bibtex:doi bibtex:chapter bibtex:url bibtex:entrytype \
+  bibtex:bibtexkey bibtex:abstract bibtex:date bibtex:keywords \
+  bibtex:comment bibtex:language bibtex:edition bibtex:totalpages \
+  dc:creator dc:relation dc:publisher dc:title dc:type dc:identifier
+
+defaultcharset = UTF-8//
+
+pdfextrametafix = /home/hannes/.recoll/metafix.py
+----
+
+
+=== 'metafix.py' script:
+
+----
+import sys
+import re
+
+# This can be used for local XMP field editing.
+#
+# A new instance is created for each PDF document (so the object could
+# keep state to avoid, e.g. duplicate values)
+#
+# The metafix method receives an (original) field name, and the text
+# value, and should return the possibly modified text.
+class MetaFixer(object):
+    def __init__(self):
+        pass
+
+    def metafix(self, nm, txt):
+        if nm == 'bibtex:pages':
+            txt = re.sub(r'--', '-', txt)
+            txt = re.sub(r'^', ', p. ', txt)
+        elif nm == 'bibtex:author':
+            txt = re.sub(r'$', ':\ ', txt)
+            pass
+        elif nm == 'bibtex:chapter':
+            txt = re.sub(r'^', ', in: id.: ', txt)
+            pass
+        elif nm == 'bibtex:editor':
+            txt = re.sub(r'^', ', in: ', txt)
+            txt = re.sub(r'$', ' (ed.):\ ', txt)
+            pass
+        elif nm == 'bibtex:year':
+            txt = re.sub(r'^', ', ', txt)
+            pass
+        elif nm == 'bibtex:date':
+            txt = re.sub(r'^', ', ', txt)
+            pass            
+        elif nm == 'bibtex:volume':
+            txt = re.sub(r'^', ', vol. ', txt)
+            pass
+        elif nm == 'bibtex:number':
+            txt = re.sub(r'^', ', no. ', txt)
+            pass
+        elif nm == 'bibtex:journaltitle':
+            txt = re.sub(r'^', ', in: ', txt)
+            pass
+        elif nm == 'bibtex:journal':
+            txt = re.sub(r'^', ', in: ', txt)
+            pass
+        elif nm == 'bibtex:title':
+            txt = re.sub(r'^', '"', txt)
+            txt = re.sub(r'$', '"', txt)            
+            pass
+        elif nm == 'bibtex:location':
+            txt = re.sub(r'^', ', ', txt)
+            txt = re.sub(r'$', ':\ ', txt)            
+            pass
+        elif nm == 'bibtex:address':
+            txt = re.sub(r'^', ', ', txt)
+            txt = re.sub(r'$', ':\ ', txt)            
+            pass
+        elif nm == 'bibtex:isbn':
+            txt = re.sub(r'^', 'ISBN: ', txt)            
+            pass
+        elif nm == 'bibtex:issn':
+            txt = re.sub(r'^', 'ISSN: ', txt)
+            pass
+        elif nm == 'bibtex:doi':
+            txt = re.sub(r'^', 'DOI: ', txt)
+            pass
+        elif nm == 'bibtex:bibtexkey':
+            txt = re.sub(r'^', 'Key: ', txt)
+            pass
+
+        return txt
+----
+
+
+=== 'fields' file:
+
+----
+[prefixes]
+
+refjournal=RFJOURNAL
+refpages=RFPAGES
+reftitle=RFTTITLE
+refvolume=RFVOLUME
+refauthor=RFAUTHOR
+refyear=RFYYEAR
+refisbn=RFISBN
+refissn=RFISSN
+refdoi=RFDOI
+refeditor=RFEDITOR
+refpublisher=RFPUBLISHER
+refaddress=RFADDRESS
+reflocation=RFLOCATION
+refbooktitle=RFBOOKTITLE
+refurl=RFURL
+reftype=RFTYPE
+refkey=RFKEY
+refabstract=RFABSTRACT
+refkeywords=RFKEYWORDS
+refcomment=RFCOMMENT
+refedition=RFEDITION
+reflanguage=RFLANGUAGE
+
+[stored]
+
+refjournal=
+refpages=
+reftitle=
+refvolume=
+refauthor=
+refyear=
+refisbn=
+refissn=
+refdoi=
+refeditor=
+refpublisher=
+refaddress=
+reflocation=
+refbooktitle=
+refurl=
+reftype=
+refkey=
+refabstract=
+refkeywords=
+refcomment=
+refedition=
+reflanguage=
+refid=
+
+[aliases]
+
+refjournal = bibtex:journal bibtex:journaltitle
+refpages = bibtex:pages
+reftitle = bibtex:title
+refvolume = bibtex:volume
+refauthor = bibtex:author
+refyear = bibtex:year bibtex:date
+refid = dc:identifier bibtex:isbn bibtex:issn
+refisbn = bibtex:isbn
+refissn = bibtex:issn
+refdoi = bibtex:doi
+refeditor = bibtex:editor
+refpublisher = bibtex:publisher
+refaddress = bibtex:address
+reflocation = bibtex:location
+refbooktitle = bibtex:booktitle
+refurl = bibtex:url
+reftype = bibtex:entrytype bibtex:type
+refkey = bibtex:bibtexkey
+refabstract = bibtex:abstract
+refkeywords = bibtex:keywords
+refcomment = bibtex:comment
+refedition = bibtex:edition
+reflanguage = bibtex:language
+author = xesam:author
+----
+
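Note on the 'wrapup()' method described in the patch: the text says it is called with the whole metadata array, an array of '(nm,value)' pairs, but shows no example. The following is only a minimal sketch of what such a method might look like, not the handler's documented interface. It assumes that the array is a Python list of tuples, and that the handler accepts either an in-place change or the returned list (the sketch does both, because the patch does not say which applies). The choice of copying 'bibtex:booktitle' into 'bibtex:journal' follows the "InCollection" example from the text.

```python
class MetaFixer(object):
    """Hypothetical variant of the MetaFixer class from 'metafix.py'."""

    def metafix(self, nm, txt):
        # Per-field editing is left out of this sketch: values pass through.
        return txt

    def wrapup(self, metadata):
        # Assumption: 'metadata' is a list of (nm, value) tuples.
        names = [nm for nm, _ in metadata]
        if 'bibtex:journal' not in names:
            for nm, value in metadata:
                if nm == 'bibtex:booktitle':
                    # Reuse the book title as the "journal" value.
                    metadata.append(('bibtex:journal', value))
                    break
        # Return the list as well, in case the caller uses the return value.
        return metadata
```

With the 'fields' aliases from the complete example, a 'bibtex:journal' entry added this way would end up in the 'refjournal' field, provided the handler processes the array after 'wrapup()' returns.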