|
a/src/doc/user/usermanual.xml |
|
b/src/doc/user/usermanual.xml |
|
... |
|
... |
1095 |
couple the tag update with a <literal>recollindex -e -i
|
1095 |
couple the tag update with a <literal>recollindex -e -i
|
1096 |
filename.</literal></para>
|
1096 |
filename.</literal></para>
|
1097 |
|
1097 |
|
1098 |
</sect1>
|
1098 |
</sect1>
|
1099 |
|
1099 |
|
|
|
1100 |
|
|
|
1101 |
<sect1 id="RCL.INDEXING.PDF">
|
|
|
1102 |
<title>The PDF input handler</title>
|
|
|
1103 |
|
|
|
1104 |
<para>The PDF format is very important for scientific and technical
|
|
|
1105 |
documentation, and document archival. It has extensive
|
|
|
1106 |
facilities for storing metadata along with the document, and these
|
|
|
1107 |
facilities are actually used in the real world.</para>
|
|
|
1108 |
|
|
|
1109 |
<para>In consequence, the <filename>rclpdf.py</filename> PDF input
|
|
|
1110 |
handler has more complex capabilities than most others, and it is
|
|
|
1111 |
also more configurable. Specifically, <filename>rclpdf.py</filename>
|
|
|
1112 |
can automatically use <application>tesseract</application> to perform
|
|
|
1113 |
OCR if the document text is empty, it can be configured to extract
|
|
|
1114 |
specific metadata tags from an XMP packet, and to extract PDF
|
|
|
1115 |
attachments.</para>
|
|
|
1116 |
|
|
|
1117 |
<sect2 id="RCL.INDEXING.PDF.OCR">
|
|
|
1118 |
<title>OCR with Tesseract</title>
|
|
|
1119 |
|
|
|
1120 |
<para>If both <application>tesseract</application> and
|
|
|
1121 |
<command>pdftoppm</command> (generally from the
|
|
|
1122 |
<application>poppler-utils</application> package) are installed,
|
|
|
1123 |
the PDF handler may attempt OCR on PDF files with no text
|
|
|
1124 |
content. This is controlled by the <link
|
|
|
1125 |
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFOCR">pdfocr</link>
|
|
|
1126 |
configuration variable, which is false by default because
|
|
|
1127 |
OCR is very slow.</para>
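<para>As a rough illustration of the precondition described above, a script
could check that both programs can be found on the <envar>PATH</envar>
before OCR is enabled. This is a minimal sketch and not the handler's
actual code: the <literal>ocr_tools_available</literal> name and its
<literal>which</literal> parameter are hypothetical.</para>
<programlisting>import shutil

def ocr_tools_available(which=shutil.which):
    """Return True if both tesseract and pdftoppm can be located."""
    # Hypothetical helper: the real handler may perform this check differently.
    # The lookup function is a parameter so that it can be replaced in tests.
    return all(which(cmd) is not None for cmd in ("tesseract", "pdftoppm"))

if __name__ == "__main__":
    print("OCR possible:", ocr_tools_available())
</programlisting>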
|
|
|
1128 |
|
|
|
1129 |
<para>The choice of language is very important for successful
|
|
|
1130 |
OCR. Recoll currently has no way to determine this from the
|
|
|
1131 |
document itself. You can set the language to use through the
|
|
|
1132 |
contents of a <filename>.ocrpdflang</filename> text file in the
|
|
|
1133 |
same directory as the PDF document, or through the
|
|
|
1134 |
<envar>RECOLL_TESSERACT_LANG</envar> environment variable, or
|
|
|
1135 |
through the contents of an <filename>ocrpdf</filename> text file
|
|
|
1136 |
inside the configuration directory. If none of the above are used,
|
|
|
1137 |
&RCL; will try to guess the language from the NLS
|
|
|
1138 |
environment.</para>
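<para>The lookup described above could be sketched as follows. This is
only an illustration under stated assumptions: the function name
<literal>pick_ocr_language</literal> is hypothetical, and the precedence
(per-directory file, then environment variable, then configuration
directory file) simply follows the order in which the sources are listed
here, which may not match the handler's real behaviour.</para>
<programlisting>import os

def pick_ocr_language(pdf_path, confdir, environ=os.environ):
    """Return the configured OCR language, or None to fall back to the NLS guess."""
    def read_text(path):
        # Return the stripped file contents, or None if missing or empty.
        try:
            with open(path, "r") as f:
                return f.read().strip() or None
        except OSError:
            return None

    # 1. A .ocrpdflang file in the same directory as the PDF document.
    pdfdir = os.path.dirname(os.path.abspath(pdf_path))
    lang = read_text(os.path.join(pdfdir, ".ocrpdflang"))
    if lang:
        return lang

    # 2. The RECOLL_TESSERACT_LANG environment variable.
    lang = environ.get("RECOLL_TESSERACT_LANG", "").strip()
    if lang:
        return lang

    # 3. An ocrpdf file inside the configuration directory (may yield None).
    return read_text(os.path.join(confdir, "ocrpdf"))
</programlisting>
<para>A caller receiving <literal>None</literal> would then fall back to
guessing the language from the NLS environment.</para>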
|
|
|
1139 |
|
|
|
1140 |
</sect2>
|
|
|
1141 |
|
|
|
1142 |
<sect2 id="RCL.INDEXING.PDF.XMP">
|
|
|
1143 |
<title>XMP field extraction</title>
|
|
|
1144 |
|
|
|
1145 |
<para>The <filename>rclpdf.py</filename> script in &RCL; version
|
|
|
1146 |
1.23.2 and later can extract XMP metadata fields by executing the
|
|
|
1147 |
<command>pdfinfo</command> command (usually found with
|
|
|
1148 |
<application>poppler-utils</application>). This is controlled by
|
|
|
1149 |
the <link
|
|
|
1150 |
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETA">pdfextrameta</link>
|
|
|
1151 |
configuration variable, which specifies which tags to extract and,
|
|
|
1152 |
possibly, how to rename them.</para>
|
|
|
1153 |
|
|
|
1154 |
<para>The <link
|
|
|
1155 |
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFEXTRAMETAFIX">pdfextrametafix</link>
|
|
|
1156 |
variable can be used to designate a file with Python code to edit
|
|
|
1157 |
the metadata fields (available for &RCL; 1.23.3 and later; 1.23.2
|
|
|
1158 |
has equivalent code inside the handler script). Example:</para>
|
|
|
1159 |
<programlisting>import sys
import re

class MetaFixer(object):
    def __init__(self):
        pass

    def metafix(self, nm, txt):
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        elif nm == 'someothername':
            # do something else
            pass
        elif nm == 'stillanother':
            # etc.
            pass

        return txt
</programlisting>
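<para>To illustrate how such a class behaves, the sketch below runs a
small dictionary of extracted fields through a fixer. The
<literal>apply_fixes</literal> helper and the sample values are
hypothetical and only serve the illustration; the way the handler
actually calls <literal>metafix</literal> may differ.</para>
<programlisting>import re

class MetaFixer(object):
    def metafix(self, nm, txt):
        # Turn the BibTeX page-range double dash into a single dash.
        if nm == 'bibtex:pages':
            txt = re.sub(r'--', '-', txt)
        return txt

def apply_fixes(fields, fixer):
    """Hypothetical helper: pass every (name, value) pair through metafix."""
    return {nm: fixer.metafix(nm, txt) for nm, txt in fields.items()}

if __name__ == "__main__":
    sample = {'bibtex:pages': '101--110', 'bibtex:booktitle': 'Proceedings'}
    print(apply_fixes(sample, MetaFixer()))
</programlisting>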
|
|
|
1178 |
|
|
|
1179 |
|
|
|
1180 |
<!-- <para> There is a <ulink url="&WIKI;PDFXMP.wiki">complete example of XMP
|
|
|
1181 |
tags setup</ulink>, including a nice result list paragraph format in the
|
|
|
1182 |
&RCL; Wiki </para> -->
|
|
|
1183 |
|
|
|
1184 |
|
|
|
1185 |
</sect2>
|
|
|
1186 |
|
|
|
1187 |
<sect2 id="RCL.INDEXING.PDF.ATTACH">
|
|
|
1188 |
<title>PDF attachment indexing</title>
|
|
|
1189 |
|
|
|
1190 |
<para>If <application>pdftk</application> is installed, and if the
|
|
|
1191 |
<link
|
|
|
1192 |
linkend="RCL.INSTALL.CONFIG.RECOLLCONF.PDFATTACH">pdfattach</link>
|
|
|
1193 |
configuration variable is set, the PDF input handler will try to
|
|
|
1194 |
extract PDF attachments for indexing as sub-documents of the PDF
|
|
|
1195 |
file. This is disabled by default, because it slows down PDF
|
|
|
1196 |
indexing a bit even if no attachment is ever found (PDF
|
|
|
1197 |
attachments are uncommon in my experience).</para>
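<para>For illustration, attachments can be unpacked by hand with the
<literal>unpack_files</literal> operation of
<application>pdftk</application>. The sketch below only builds such a
command line; the <literal>unpack_command</literal> name is hypothetical,
and the handler's own invocation may differ.</para>
<programlisting>import shutil

def unpack_command(pdf_path, outdir, which=shutil.which):
    """Return a pdftk command line for unpacking attachments, or None."""
    # Without pdftk on the PATH there is nothing to run.
    if which("pdftk") is None:
        return None
    # unpack_files writes the files attached to the PDF into outdir.
    return ["pdftk", pdf_path, "unpack_files", "output", outdir]
</programlisting>
<para>The returned list can be passed to
<literal>subprocess.run</literal>; a <literal>None</literal> result means
that attachment extraction should be skipped.</para>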
|
|
|
1198 |
|
|
|
1199 |
</sect2>
|
|
|
1200 |
|
|
|
1201 |
</sect1>
|
1100 |
|
1202 |
|
1101 |
<sect1 id="RCL.INDEXING.PERIODIC">
|
1203 |
<sect1 id="RCL.INDEXING.PERIODIC">
|
1102 |
<title>Periodic indexing</title>
|
1204 |
<title>Periodic indexing</title>
|
1103 |
|
1205 |
|
1104 |
<sect2 id="RCL.INDEXING.PERIODIC.EXEC">
|
1206 |
<sect2 id="RCL.INDEXING.PERIODIC.EXEC">
|