
a/src/doc/user/usermanual.xml b/src/doc/user/usermanual.xml
...
...
      couple the tag update with a <literal>recollindex -e -i
      filename.</literal></para>

</sect1>

    <sect1 id="RCL.INDEXING.PDF">
      <title>The PDF input handler</title>

      <para>The PDF format is very important for scientific and technical
      documentation, and for document archival. It has extensive
      facilities for storing metadata along with the document, and these
      facilities are actually used in the real world.</para>

      <para>In consequence, the <filename>rclpdf.py</filename> PDF input
      handler has more complex capabilities than most others, and it is
      also more configurable. Specifically, <filename>rclpdf.py</filename>
      can automatically use <application>tesseract</application> to perform
      OCR if the document text is empty, and it can be configured to
      extract specific metadata tags from an XMP packet and to extract
      PDF attachments.</para>

      <sect2 id="RCL.INDEXING.PDF.OCR">
        <title>OCR with Tesseract</title>

        <para>If both <application>tesseract</application> and
        <command>pdftoppm</command> (generally from the
        <application>poppler-utils</application> package) are installed,
        the PDF handler may attempt OCR on PDF files with no text
        content. This is controlled by the <link
        linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
        configuration variable, which is false by default because
        OCR is very slow.</para>

        <para>The choice of language is very important for successful
        OCR. &RCL; currently has no way to determine this from the
        document itself. You can set the language to use through the
        contents of a <filename>.ocrpdflang</filename> text file in the
        same directory as the PDF document, or through the
        <envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
        through the contents of an <filename>ocrpdf</filename> text file
        inside the configuration directory. If none of these is used,
        &RCL; will try to guess the language from the NLS
        environment.</para>
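
        <para>The language lookup described above can be sketched in
        Python as follows. This is only an illustration, not the code
        used by <filename>rclpdf.py</filename>: the function name
        <literal>pick_ocr_language</literal> and its parameters are
        hypothetical, and the precedence (per-directory file, then
        environment variable, then configuration directory file) is an
        assumption based on the order in which the methods are listed.
        A return value of <literal>None</literal> stands for the
        fallback to the NLS environment.</para>
        <programlisting>import os

def pick_ocr_language(pdf_path, config_dir, environ=None):
    # Hypothetical helper: return the first OCR language found, or None.
    # Assumed precedence: .ocrpdflang, then the environment variable,
    # then the ocrpdf file in the configuration directory.
    environ = os.environ if environ is None else environ

    def read_text(path):
        # Return the stripped file contents, or None if missing or empty.
        try:
            with open(path, 'r', encoding='utf-8') as f:
                return f.read().strip() or None
        except OSError:
            return None

    pdf_dir = os.path.dirname(os.path.abspath(pdf_path))
    return (read_text(os.path.join(pdf_dir, '.ocrpdflang'))
            or environ.get('RECOLL_TESSERACT_LANG')
            or read_text(os.path.join(config_dir, 'ocrpdf'))
            or None)
        </programlisting>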

      </sect2>
      
      <sect2 id="RCL.INDEXING.PDF.XMP">
        <title>XMP fields extraction</title>

        <para>The <filename>rclpdf.py</filename> script in &RCL; version
        1.23.2 and later can extract XMP metadata fields by executing the
        <command>pdfinfo</command> command (usually part of
        <application>poppler-utils</application>). This is controlled by
        the <link
        linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA">pdfextrameta</link>
        configuration variable, which specifies which tags to extract and,
        optionally, how to rename them.</para>

        <para>The <link
        linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX">pdfextrametafix</link>
        variable can be used to designate a file with Python code to edit
        the metadata fields. This is available in &RCL; 1.23.3 and later;
        version 1.23.2 has equivalent code inside the handler script.
        Example:</para>
        <programlisting>import sys
import re

class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        # nm is the field name, txt the field text. Return the edited text.
        if nm == 'bibtex:pages':
            # Turn the double dash of a page range into a single dash.
            txt = re.sub(r'--', '-', txt)
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt
        </programlisting>
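
        <para>The effect of such a fixer can be checked outside of &RCL;
        by calling <literal>metafix</literal> directly. The sketch below
        repeats only the <literal>bibtex:pages</literal> rule from the
        example above. The sample values and the
        <literal>dc:title</literal> field name are made up for
        illustration, and the way the handler actually calls the method
        is not shown here.</para>
        <programlisting>import re

class MetaFixer(object):
    def metafix(self, nm, txt):
        # Same rule as in the example above, other branches omitted.
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        return txt

fixer = MetaFixer()
# A page range gets its double dash collapsed to a single dash.
fixed_pages = fixer.metafix('bibtex:pages', '117--132')
# A field with no matching rule is returned unchanged.
untouched = fixer.metafix('dc:title', 'Notes -- draft')
        </programlisting>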

        <!-- <para> There is a <ulink url="&WIKI;PDFXMP.wiki">complete example of XMP
        tags setup</ulink>, including a  nice result list paragraph format in the 
        &RCL; Wiki </para> -->

      </sect2>

      <sect2 id="RCL.INDEXING.PDF.ATTACH">
        <title>PDF attachment indexing</title>

        <para>If <application>pdftk</application> is installed, and if
        the <link
        linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
        configuration variable is set, the PDF input handler will try to
        extract PDF attachments for indexing as sub-documents of the PDF
        file. This is disabled by default, because it slows down PDF
        indexing a bit even if no attachment is ever found (PDF
        attachments are uncommon in my experience).</para>

      </sect2>
      
    </sect1>

    <sect1 id="RCL.INDEXING.PERIODIC">
      <title>Periodic indexing</title>

      <sect2 id="RCL.INDEXING.PERIODIC.EXEC">