|
a/src/doc/user/usermanual.xml |
|
b/src/doc/user/usermanual.xml |
|
... |
|
... |
5194 |
<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
|
5194 |
<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
|
5195 |
<title>The rclextract module</title>
|
5195 |
<title>The rclextract module</title>
|
5196 |
|
5196 |
|
5197 |
<para>Index queries do not provide document content (only a
|
5197 |
<para>Index queries do not provide document content (only a
|
5198 |
partial and imprecise reconstruction is performed to show the
|
5198 |
partial and imprecise reconstruction is performed to show the
|
5199 |
snippets text). In order to access the actual document data,
|
5199 |
snippets text). In order to access the actual document data, the
|
5200 |
the data extraction part of the indexing process
|
5200 |
data extraction part of the indexing process must be performed
|
5201 |
must be performed (subdocument access and format
|
5201 |
(subdocument access and format translation). This is not trivial
|
5202 |
translation). This is not trivial in
|
5202 |
in the case of embedded documents. The
|
5203 |
general. The <literal>rclextract</literal> module currently
|
5203 |
<literal>rclextract</literal> module provides a single class
|
5204 |
provides a single class which can be used to access the data
|
5204 |
which can be used to access the data content for result
|
5205 |
content for result documents.</para>
|
5205 |
documents.</para>
|
5206 |
|
5206 |
|
5207 |
<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">
|
5207 |
<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">
|
5208 |
<title>Classes</title>
|
5208 |
<title>Classes</title>
|
5209 |
|
5209 |
|
5210 |
<sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
|
5210 |
<sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
|
|
... |
|
... |
5218 |
built from a <literal>Doc</literal> object, output
|
5218 |
built from a <literal>Doc</literal> object, output
|
5219 |
from a query.</para></listitem>
|
5219 |
from a query.</para></listitem>
|
5220 |
</varlistentry>
|
5220 |
</varlistentry>
|
5221 |
<varlistentry>
|
5221 |
<varlistentry>
|
5222 |
<term>Extractor.textextract(ipath)</term>
|
5222 |
<term>Extractor.textextract(ipath)</term>
|
5223 |
<listitem><para>Extract document defined
|
5223 |
<listitem><para>Extract document defined by
|
5224 |
by <replaceable>ipath</replaceable> and return
|
5224 |
<replaceable>ipath</replaceable> and return a
|
5225 |
a <literal>Doc</literal> object. The doc.text field
|
5225 |
<literal>Doc</literal> object. The
|
5226 |
has the document text converted to either text/plain or
|
5226 |
<literal>doc.text</literal> field has the document text
|
5227 |
text/html according to doc.mimetype. The typical use
|
5227 |
converted to either text/plain or text/html according to
|
|
|
5228 |
<literal>doc.mimetype</literal>. The typical use would be
|
5228 |
would be as follows:
|
5229 |
as follows:</para>
|
5229 |
<programlisting>
|
5230 |
<programlisting>
|
5230 |
qdoc = query.fetchone()
|
5231 |
qdoc = query.fetchone()
|
5231 |
extractor = recoll.Extractor(qdoc)
|
5232 |
extractor = recoll.Extractor(qdoc)
|
5232 |
doc = extractor.textextract(qdoc.ipath)
|
5233 |
doc = extractor.textextract(qdoc.ipath)
|
5233 |
# use doc.text, e.g. for previewing
|
5234 |
# use doc.text, e.g. for previewing</programlisting>
|
5234 |
</programlisting>
|
5235 |
<para>Passing <literal>qdoc.ipath</literal> to
|
|
|
5236 |
<literal>textextract()</literal> is redundant, but
|
|
|
5237 |
reflects the fact that the <literal>Extractor</literal>
|
|
|
5238 |
object actually has the capability to access the other
|
|
|
5239 |
entries in a compound document.</para>
|
5235 |
</para></listitem>
|
5240 |
</listitem>
|
5236 |
</varlistentry>
|
5241 |
</varlistentry>
|
5237 |
<varlistentry>
|
5242 |
<varlistentry>
|
5238 |
<term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>
|
5243 |
<term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>
|
5239 |
<listitem><para>Extracts document into an output file,
|
5244 |
<listitem><para>Extracts document into an output file,
|
5240 |
which can be given explicitly or will be created as a
|
5245 |
which can be given explicitly or will be created as a
|
5241 |
temporary file to be deleted by the caller. Typical use:
|
5246 |
temporary file to be deleted by the caller. Typical
|
5242 |
<programlisting>
|
5247 |
use:</para>
|
5243 |
qdoc = query.fetchone()
|
5248 |
<programlisting>
|
|
|
5249 |
qdoc = query.fetchone()
|
5244 |
extractor = recoll.Extractor(qdoc)
|
5250 |
extractor = recoll.Extractor(qdoc)
|
5245 |
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
|
5251 |
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
|
5246 |
|
5252 |
|
|
|
5253 |
<para>In all cases the output is a copy, even if the
|
|
|
5254 |
requested document is a regular system file, which may be
|
|
|
5255 |
wasteful in some cases. If you want to avoid this, you
|
|
|
5256 |
can test for a simple file document as follows:
|
|
|
5257 |
<programlisting>
|
|
|
5258 |
not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")
|
|
|
5259 |
</programlisting>
|
5247 |
</para></listitem>
|
5260 |
</para></listitem>
|
5248 |
</varlistentry>
|
5261 |
</varlistentry>
|
5249 |
|
5262 |
|
5250 |
</variablelist>
|
5263 |
</variablelist>
|
5251 |
|
5264 |
|
5252 |
</sect5> <!-- Extractor class -->
|
5265 |
</sect5> <!-- Extractor class -->
|
5253 |
</sect4> <!-- rclextract classes -->
|
5266 |
</sect4> <!-- rclextract classes -->
|
5254 |
</sect3> <!-- rclextract module -->
|
5267 |
</sect3> <!-- rclextract module -->
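
The "simple file document" test quoted in the new text can be wrapped in a small helper so that callers skip the copy made by `idoctofile()` when it is not needed. The following is only a sketch: `is_plain_file` is a hypothetical helper name, and `FakeDoc` is a hypothetical stand-in that mimics just the three things the expression relies on (`ipath`, `keys()`, and item access). It is not the real `Doc` class.

```python
class FakeDoc:
    """Hypothetical stand-in for a query result document.

    It only imitates what the manual's test expression touches:
    the ipath attribute, keys(), and doc["name"] item access.
    """

    def __init__(self, ipath="", **fields):
        self.ipath = ipath
        self._fields = fields

    def keys(self):
        return list(self._fields)

    def __getitem__(self, key):
        return self._fields[key]


def is_plain_file(doc):
    """Return True if doc looks like a regular file system document.

    Same logic as the expression in the manual: no ipath (so not an
    entry inside a compound document), and either no "rclbes" field
    or an "rclbes" field equal to "FS".
    """
    if doc.ipath:
        return False
    return "rclbes" not in doc.keys() or doc["rclbes"] == "FS"


if __name__ == "__main__":
    # "XX" is an arbitrary placeholder value, not a real backend name.
    for d in (FakeDoc(), FakeDoc(ipath="1"), FakeDoc(rclbes="FS"), FakeDoc(rclbes="XX")):
        print(repr(d.ipath), d.keys(), "->", is_plain_file(d))
```

With a real result document, a caller would presumably check `is_plain_file(qdoc)` first and only fall back to `extractor.idoctofile(qdoc.ipath, qdoc.mimetype)` when it returns false.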
|
|
|
5268 |
|
5255 |
|
5269 |
|
5256 |
<sect3 id="RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">
|
5270 |
<sect3 id="RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">
|
5257 |
<title>Search API usage example</title>
|
5271 |
<title>Search API usage example</title>
|
5258 |
|
5272 |
|
5259 |
<para>The following sample would query the index with a user
|
5273 |
<para>The following sample would query the index with a user
|
|
... |
|
... |
5261 |
directory inside the &RCL; source for other
|
5275 |
directory inside the &RCL; source for other
|
5262 |
examples. The <filename>recollgui</filename> subdirectory
|
5276 |
examples. The <filename>recollgui</filename> subdirectory
|
5263 |
has a very embryonic GUI which demonstrates the
|
5277 |
has a very embryonic GUI which demonstrates the
|
5264 |
highlighting and data extraction functions.</para>
|
5278 |
highlighting and data extraction functions.</para>
|
5265 |
|
5279 |
|
5266 |
<programlisting>
|
5280 |
<programlisting><![CDATA[
|
5267 |
#!/usr/bin/env python
|
5281 |
#!/usr/bin/env python
|
5268 |
<![CDATA[
|
5282 |
|
5269 |
from recoll import recoll
|
5283 |
from recoll import recoll
|
5270 |
|
5284 |
|
5271 |
db = recoll.connect()
|
5285 |
db = recoll.connect()
|
5272 |
db.setAbstractParams(maxchars=80, contextwords=4)
|
5286 |
db.setAbstractParams(maxchars=80, contextwords=4)
|
5273 |
|
5287 |
|
5274 |
query = db.query()
|
5288 |
query = db.query()
|
5275 |
nres = query.execute("some user question")
|
5289 |
nres = query.execute("some user question")
|
5276 |
print "Result count: ", nres
|
5290 |
print "Result count: ", nres
|
5277 |
if nres > 5:
|
5291 |
if nres > 5:
|
5278 |
nres = 5
|
5292 |
nres = 5
|
5279 |
for i in range(nres):
|
5293 |
for i in range(nres):
|
5280 |
doc = query.fetchone()
|
5294 |
doc = query.fetchone()
|
5281 |
print "Result #%d" % (query.rownumber,)
|
5295 |
print "Result #%d" % (query.rownumber,)
|
5282 |
for k in ("title", "size"):
|
5296 |
for k in ("title", "size"):
|
5283 |
print k, ":", getattr(doc, k).encode('utf-8')
|
5297 |
print k, ":", getattr(doc, k).encode('utf-8')
|
5284 |
abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
5298 |
abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
5285 |
print abs
|
5299 |
print abs
|
5286 |
print
|
5300 |
print
|
5287 |
|
5301 |
]]></programlisting>
|
5288 |
]]>
|
|
|
5289 |
</programlisting>
|
|
|
5290 |
|
5302 |
|
5291 |
</sect3>
|
5303 |
</sect3>
|
5292 |
</sect2>
|
5304 |
</sect2>
|
5293 |
|
5305 |
|
5294 |
|
5306 |
|
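
The search sample in this diff uses Python 2 `print` statements. A possible Python 3 rendering of the same loop is sketched below, using only the calls that appear in the sample (`setAbstractParams`, `query`, `execute`, `fetchone`, `rownumber`, `makeDocAbstract`). The function name `summarize_results` and its `limit` parameter are hypothetical, and the sketch assumes that under Python 3 the document attributes and the abstract come back as `str`, so the `.encode('utf-8')` calls are dropped. The database object is passed in rather than created, so the function can be tried without an index.

```python
def summarize_results(db, question, limit=5):
    """Run a query and return the report lines the sample prints.

    db is assumed to be a connected database object, for example
    the value returned by recoll.connect() in the sample above.
    """
    db.setAbstractParams(maxchars=80, contextwords=4)

    query = db.query()
    nres = query.execute(question)

    lines = ["Result count: %d" % nres]
    # Show at most `limit` results, as the sample does with 5.
    for _ in range(min(nres, limit)):
        doc = query.fetchone()
        lines.append("Result #%d" % (query.rownumber,))
        for k in ("title", "size"):
            lines.append("%s : %s" % (k, getattr(doc, k)))
        lines.append(db.makeDocAbstract(doc, query))
        lines.append("")  # blank separator line between results
    return lines
```

Usage would then look like `print("\n".join(summarize_results(recoll.connect(), "some user question")))`, after `from recoll import recoll` as in the sample.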