|
a/src/doc/user/usermanual.html |
|
b/src/doc/user/usermanual.html |
|
... |
|
... |
6665 |
<p>Index queries do not provide document content (only
|
6665 |
<p>Index queries do not provide document content (only
|
6666 |
a partial and imprecise reconstruction is performed to
|
6666 |
a partial and imprecise reconstruction is performed to
|
6667 |
show the snippets text). In order to access the actual
|
6667 |
show the snippets text). In order to access the actual
|
6668 |
document data, the data extraction part of the indexing
|
6668 |
document data, the data extraction part of the indexing
|
6669 |
process must be performed (subdocument access and
|
6669 |
process must be performed (subdocument access and
|
6670 |
format translation). This is not trivial in general.
|
6670 |
format translation). This is not trivial in the case of
|
6671 |
The <code class="literal">rclextract</code> module
|
6671 |
embedded documents. The <code class=
|
6672 |
currently provides a single class which can be used to
|
6672 |
"literal">rclextract</code> module provides a single
|
6673 |
access the data content for result documents.</p>
|
6673 |
class which can be used to access the data content for
|
|
|
6674 |
result documents.</p>
|
6674 |
<div class="sect4">
|
6675 |
<div class="sect4">
|
6675 |
<div class="titlepage">
|
6676 |
<div class="titlepage">
|
6676 |
<div>
|
6677 |
<div>
|
6677 |
<div>
|
6678 |
<div>
|
6678 |
<h5 class="title"><a name=
|
6679 |
<h5 class="title"><a name=
|
|
... |
|
... |
6707 |
"term">Extractor.textextract(ipath)</span></dt>
|
6708 |
"term">Extractor.textextract(ipath)</span></dt>
|
6708 |
<dd>
|
6709 |
<dd>
|
6709 |
<p>Extract document defined by <em class=
|
6710 |
<p>Extract document defined by <em class=
|
6710 |
"replaceable"><code>ipath</code></em> and
|
6711 |
"replaceable"><code>ipath</code></em> and
|
6711 |
return a <code class="literal">Doc</code>
|
6712 |
return a <code class="literal">Doc</code>
|
6712 |
object. The doc.text field has the document
|
6713 |
object. The <code class=
|
|
|
6714 |
"literal">doc.text</code> field has the
|
6713 |
text converted to either text/plain or
|
6715 |
document text converted to either text/plain
|
6714 |
text/html according to doc.mimetype. The
|
6716 |
or text/html according to <code class=
|
|
|
6717 |
"literal">doc.mimetype</code>. The typical
|
6715 |
typical use would be as follows:</p>
|
6718 |
use would be as follows:</p>
|
6716 |
<pre class="programlisting">
|
6719 |
<pre class="programlisting">
|
6717 |
qdoc = query.fetchone()
|
6720 |
qdoc = query.fetchone()
|
6718 |
extractor = recoll.Extractor(qdoc)
|
6721 |
extractor = recoll.Extractor(qdoc)
|
6719 |
doc = extractor.textextract(qdoc.ipath)
|
6722 |
doc = extractor.textextract(qdoc.ipath)
|
6720 |
# use doc.text, e.g. for previewing
|
6723 |
# use doc.text, e.g. for previewing</pre>
|
6721 |
</pre>
|
6724 |
<p>Passing <code class=
|
|
|
6725 |
"literal">qdoc.ipath</code> to <code class=
|
|
|
6726 |
"literal">textextract()</code> is redundant,
|
|
|
6727 |
but reflects the fact that the <code class=
|
|
|
6728 |
"literal">Extractor</code> object actually
|
|
|
6729 |
has the capability to access the other
|
|
|
6730 |
entries in a compound document.</p>
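Since `doc.text` holds either text/plain or text/html depending on `doc.mimetype`, a previewer would probably branch on the MIME type. The following is a minimal sketch of that idea, not part of the Recoll API: `preview_text` and the stand-in `FakeDoc` class are hypothetical names, and only the `text` and `mimetype` attributes come from the description above. It targets Python 3 and the standard library only.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect the character data of an HTML document, dropping the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags (script/style included).
        self.chunks.append(data)


def preview_text(doc):
    """Return plain text for a Doc-like object (hypothetical helper).

    Assumes doc.text is a str and doc.mimetype is "text/plain" or
    "text/html", as described for Extractor.textextract().
    """
    if doc.mimetype == "text/html":
        collector = _TextCollector()
        collector.feed(doc.text)
        collector.close()  # flush any text still buffered by the parser
        return "".join(collector.chunks)
    return doc.text


class FakeDoc:
    """Stand-in for the Doc returned by textextract(), for illustration only."""

    def __init__(self, text, mimetype):
        self.text = text
        self.mimetype = mimetype
```

With a real index, the object returned by `extractor.textextract(qdoc.ipath)` would be passed to `preview_text()` in place of `FakeDoc`.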
|
6722 |
</dd>
|
6731 |
</dd>
|
6723 |
<dt><span class=
|
6732 |
<dt><span class=
|
6724 |
"term">Extractor.idoctofile(ipath, targetmtype,
|
6733 |
"term">Extractor.idoctofile(ipath, targetmtype,
|
6725 |
outfile='')</span></dt>
|
6734 |
outfile='')</span></dt>
|
6726 |
<dd>
|
6735 |
<dd>
|
6727 |
<p>Extracts document into an output file,
|
6736 |
<p>Extracts document into an output file,
|
6728 |
which can be given explicitly or will be
|
6737 |
which can be given explicitly or will be
|
6729 |
created as a temporary file to be deleted by
|
6738 |
created as a temporary file to be deleted by
|
6730 |
the caller. Typical use:</p>
|
6739 |
the caller. Typical use:</p>
|
6731 |
<pre class="programlisting">
|
6740 |
<pre class="programlisting">
|
6732 |
qdoc = query.fetchone()
|
6741 |
qdoc = query.fetchone()
|
6733 |
extractor = recoll.Extractor(qdoc)
|
6742 |
extractor = recoll.Extractor(qdoc)
|
6734 |
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre>
|
6743 |
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre>
|
|
|
6744 |
<p>In all cases the output is a copy, even if
|
|
|
6745 |
the requested document is a regular system
|
|
|
6746 |
file, which may be wasteful in some cases. If
|
|
|
6747 |
you want to avoid this, you can test for a
|
|
|
6748 |
simple file document as follows:</p>
|
|
|
6749 |
<pre class="programlisting">
|
|
|
6750 |
not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")
|
|
|
6751 |
</pre>
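The test above can be wrapped in a small helper, alongside cleanup of the temporary file that the caller is expected to delete. This is a minimal sketch under stated assumptions: `is_simple_file` and `extracted_file` are hypothetical names, and the doc and extractor objects are only assumed to behave as described in this section (an `ipath` attribute, `keys()` and item access on the doc, `idoctofile()` returning the output file name). `FakeDoc` and `FakeExtractor` are stand-ins so the sketch runs without an index.

```python
import contextlib
import os
import tempfile


def is_simple_file(doc):
    """True if doc is a regular file system file (no ipath, FS backend)."""
    return not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")


@contextlib.contextmanager
def extracted_file(extractor, doc):
    """Yield the name of an extracted copy of doc and delete it afterwards."""
    filename = extractor.idoctofile(doc.ipath, doc.mimetype)
    try:
        yield filename
    finally:
        # The copy belongs to the caller, so remove it once done.
        if os.path.exists(filename):
            os.remove(filename)


class FakeDoc(dict):
    """Stand-in for a recoll Doc, for illustration only."""

    def __init__(self, ipath="", mimetype="text/plain", **fields):
        super().__init__(**fields)
        self.ipath = ipath
        self.mimetype = mimetype


class FakeExtractor:
    """Stand-in for recoll's Extractor: creates an empty temporary file."""

    def idoctofile(self, ipath, targetmtype, outfile=""):
        fd, name = tempfile.mkstemp()
        os.close(fd)
        return name
```

A caller could check `is_simple_file(doc)` first and open the original file directly, falling back to `with extracted_file(extractor, doc) as filename:` for embedded documents or those from other backends.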
|
6735 |
</dd>
|
6752 |
</dd>
|
6736 |
</dl>
|
6753 |
</dl>
|
6737 |
</div>
|
6754 |
</div>
|
6738 |
</div>
|
6755 |
</div>
|
6739 |
</div>
|
6756 |
</div>
|
|
... |
|
... |
6756 |
other examples. The <code class=
|
6773 |
other examples. The <code class=
|
6757 |
"filename">recollgui</code> subdirectory has a very
|
6774 |
"filename">recollgui</code> subdirectory has a very
|
6758 |
embryonic GUI which demonstrates the highlighting and
|
6775 |
embryonic GUI which demonstrates the highlighting and
|
6759 |
data extraction functions.</p>
|
6776 |
data extraction functions.</p>
|
6760 |
<pre class="programlisting">
|
6777 |
<pre class="programlisting">
|
6761 |
#!/usr/bin/env python
|
6778 |
#!/usr/bin/env python
|
6762 |
|
6779 |
|
6763 |
from recoll import recoll
|
6780 |
from recoll import recoll
|
6764 |
|
6781 |
|
6765 |
db = recoll.connect()
|
6782 |
db = recoll.connect()
|
6766 |
db.setAbstractParams(maxchars=80, contextwords=4)
|
6783 |
db.setAbstractParams(maxchars=80, contextwords=4)
|
6767 |
|
6784 |
|
6768 |
query = db.query()
|
6785 |
query = db.query()
|
6769 |
nres = query.execute("some user question")
|
6786 |
nres = query.execute("some user question")
|
6770 |
print "Result count: ", nres
|
6787 |
print "Result count: ", nres
|
6771 |
if nres > 5:
|
6788 |
if nres > 5:
|
6772 |
nres = 5
|
6789 |
nres = 5
|
6773 |
for i in range(nres):
|
6790 |
for i in range(nres):
|
6774 |
doc = query.fetchone()
|
6791 |
doc = query.fetchone()
|
6775 |
print "Result #%d" % (query.rownumber,)
|
6792 |
print "Result #%d" % (query.rownumber,)
|
6776 |
for k in ("title", "size"):
|
6793 |
for k in ("title", "size"):
|
6777 |
print k, ":", getattr(doc, k).encode('utf-8')
|
6794 |
print k, ":", getattr(doc, k).encode('utf-8')
|
6778 |
abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
6795 |
abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
6779 |
print abs
|
6796 |
print abs
|
6780 |
print
|
6797 |
print
|
6781 |
|
6798 |
</pre>
|
6782 |
|
|
|
6783 |
</pre>
|
|
|
6784 |
</div>
|
6799 |
</div>
|
6785 |
</div>
|
6800 |
</div>
|
6786 |
<div class="sect2">
|
6801 |
<div class="sect2">
|
6787 |
<div class="titlepage">
|
6802 |
<div class="titlepage">
|
6788 |
<div>
|
6803 |
<div>
|