Diff of /src/doc/user/usermanual.html [216c69] .. [fe2eb1]


...
            <p>Index queries do not provide document content (only
            a partial and imprecise reconstruction is performed to
            show the snippet text). In order to access the actual
            document data, the data extraction part of the indexing
            process must be performed (subdocument access and
            format translation). This is not trivial in the case of
            embedded documents. The <code class=
            "literal">rclextract</code> module provides a single
            class which can be used to access the data content for
            result documents.</p>
            <div class="sect4">
              <div class="titlepage">
                <div>
                  <div>
                    <h5 class="title"><a name=
...
                    "term">Extractor.textextract(ipath)</span></dt>
                    <dd>
                      <p>Extracts the document identified by <em class=
                      "replaceable"><code>ipath</code></em> and
                      returns a <code class="literal">Doc</code>
                      object. The <code class=
                      "literal">doc.text</code> field has the
                      document text converted to either text/plain
                      or text/html according to <code class=
                      "literal">doc.mimetype</code>. The typical
                      use would be as follows:</p>
                      <pre class="programlisting">
qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing</pre>
                      <p>Passing <code class=
                      "literal">qdoc.ipath</code> to <code class=
                      "literal">textextract()</code> is redundant
                      here, but reflects the fact that the <code class=
                      "literal">Extractor</code> object can also
                      access the other
                      entries in a compound document.</p>
                    </dd>
                    <dt><span class=
                    "term">Extractor.idoctofile(ipath, targetmtype,
                    outfile='')</span></dt>
                    <dd>
                      <p>Extracts the document into an output file.
                      The file name can be given explicitly; otherwise
                      a temporary file is created, and the caller is
                      responsible for deleting it. Typical use:</p>
                      <pre class="programlisting">
qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre>
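                      <p>Since the caller has to delete the temporary
                      file, one possible pattern is to wrap the call in
                      a helper that always cleans up. The following is
                      a minimal sketch, not part of the Recoll API: the
                      names <code class=
                      "literal">with_extracted_copy</code>, <code class=
                      "literal">StubExtractor</code>, <code class=
                      "literal">StubDoc</code> and <code class=
                      "literal">read_all</code> are hypothetical, and
                      the stubs only stand in for a real <code class=
                      "literal">Extractor</code> and result document so
                      that the example runs without an index.</p>

```python
import os
import tempfile


def with_extracted_copy(extractor, qdoc, handler):
    """Extract qdoc to a temporary file, run handler on it, then delete it."""
    # With no outfile argument, idoctofile() creates the file itself and
    # leaves its removal to the caller.
    filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)
    try:
        return handler(filename)
    finally:
        os.remove(filename)


class StubExtractor:
    """Hypothetical stand-in for recoll.Extractor, for illustration only."""

    def idoctofile(self, ipath, targetmtype, outfile=""):
        fd, name = tempfile.mkstemp()
        with os.fdopen(fd, "w") as out:
            out.write("extracted data")
        return name


class StubDoc:
    """Hypothetical stand-in for a result Doc."""
    ipath = ""
    mimetype = "text/plain"


def read_all(filename):
    """Example handler: return the whole content of the extracted copy."""
    with open(filename) as src:
        return src.read()


if __name__ == "__main__":
    print(with_extracted_copy(StubExtractor(), StubDoc(), read_all))
```

                      <p>With a real index, the stubs would be replaced
                      by <code class=
                      "literal">recoll.Extractor(qdoc)</code> and the
                      <code class="literal">qdoc</code> returned by
                      <code class="literal">query.fetchone()</code>.</p>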
                      <p>The output is always a copy, even if the
                      requested document is a regular file in the
                      file system, which may be wasteful. To avoid
                      the copy, you can test for a
                      simple file document as follows:</p>
                      <pre class="programlisting">
not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")
</pre>
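                      <p>As an illustration, the test above can be
                      wrapped in a small helper. This is only a sketch:
                      <code class="literal">is_simple_file</code> and
                      <code class="literal">StubDoc</code> are
                      hypothetical names, not part of the Recoll API,
                      and <code class="literal">StubDoc</code> only
                      mimics the <code class="literal">ipath</code>
                      attribute, <code class="literal">keys()</code>
                      and item access that the expression relies
                      on.</p>

```python
def is_simple_file(doc):
    """Return True if doc looks like a plain file-system file.

    Same logic as the expression above: no ipath (so not an embedded
    entry) and either no "rclbes" key or a value of "FS".
    """
    if doc.ipath:
        return False
    return "rclbes" not in doc.keys() or doc["rclbes"] == "FS"


class StubDoc:
    """Hypothetical stand-in for a Recoll Doc, for illustration only."""

    def __init__(self, ipath="", **fields):
        self.ipath = ipath
        self._fields = fields

    def keys(self):
        return list(self._fields)

    def __getitem__(self, key):
        return self._fields[key]


if __name__ == "__main__":
    print(is_simple_file(StubDoc()))             # no ipath, no backend key
    print(is_simple_file(StubDoc(ipath="3")))    # entry inside a container
    print(is_simple_file(StubDoc(rclbes="FS")))  # explicit file-system backend
```

                      <p>When the helper returns true, the original
                      file can be used directly instead of calling
                      <code class="literal">idoctofile()</code>.</p>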
                    </dd>
                  </dl>
                </div>
              </div>
            </div>
...
            other examples. The <code class=
            "filename">recollgui</code> subdirectory has a very
            embryonic GUI which demonstrates the highlighting and
            data extraction functions.</p>
            <pre class="programlisting">
#!/usr/bin/env python

from recoll import recoll

db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=4)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
if nres &gt; 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print "Result #%d" % (query.rownumber,)
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
    print abs
    print
</pre>
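            <p>The listing above uses Python 2 print statements. A
            possible Python 3 rendering of the same loop is sketched
            below. It assumes that the Python 3 bindings return
            <code class="literal">str</code> values for document
            fields and abstracts, which should be checked against the
            installed version. The function name <code class=
            "literal">summarize_results</code> and its <code class=
            "literal">limit</code> parameter are hypothetical; only
            the calls already shown above are used.</p>

```python
def summarize_results(db, question, limit=5):
    """Return printable lines for the first `limit` results of a query."""
    query = db.query()
    nres = query.execute(question)
    lines = ["Result count: %d" % nres]
    for _ in range(min(nres, limit)):
        doc = query.fetchone()
        lines.append("Result #%d" % (query.rownumber,))
        for key in ("title", "size"):
            # Assumption: field values are already str, so no encode() call.
            lines.append("%s : %s" % (key, getattr(doc, key)))
        lines.append(db.makeDocAbstract(doc, query))
        lines.append("")
    return lines
```

            <p>With the Recoll module installed, <code class=
            "literal">db</code> would come from <code class=
            "literal">recoll.connect()</code>, followed by the same
            <code class="literal">setAbstractParams()</code> call as
            above, and the returned lines can be printed one by
            one.</p>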


          </div>
        </div>
        <div class="sect2">
          <div class="titlepage">
            <div>

	a/src/doc/user/usermanual.html		b/src/doc/user/usermanual.html
	...		...
6665	<p>Index queries do not provide document content (only	6665	<p>Index queries do not provide document content (only
6666	a partial and imprecise reconstruction is performed to	6666	a partial and imprecise reconstruction is performed to
6667	show the snippet text). In order to access the actual	6667	show the snippet text). In order to access the actual
6668	document data, the data extraction part of the indexing	6668	document data, the data extraction part of the indexing
6669	process must be performed (subdocument access and	6669	process must be performed (subdocument access and
6670	format translation). This is not trivial in general.	6670	format translation). This is not trivial in the case of
6671	The <code class="literal">rclextract</code> module	6671	embedded documents. The <code class=
6672	currently provides a single class which can be used to	6672	"literal">rclextract</code> module provides a single
6673	access the data content for result documents.</p>	6673	class which can be used to access the data content for
		6674	result documents.</p>
6674	<div class="sect4">	6675	<div class="sect4">
6675	<div class="titlepage">	6676	<div class="titlepage">
6676	<div>	6677	<div>
6677	<div>	6678	<div>
6678	<h5 class="title"><a name=	6679	<h5 class="title"><a name=
	...		...
6707	"term">Extractor.textextract(ipath)</span></dt>	6708	"term">Extractor.textextract(ipath)</span></dt>
6708	<dd>	6709	<dd>
6709	<p>Extract document defined by <em class=	6710	<p>Extract document defined by <em class=
6710	"replaceable"><code>ipath</code></em> and	6711	"replaceable"><code>ipath</code></em> and
6711	return a <code class="literal">Doc</code>	6712	return a <code class="literal">Doc</code>
6712	object. The doc.text field has the document	6713	object. The <code class=
		6714	"literal">doc.text</code> field has the
6713	text converted to either text/plain or	6715	document text converted to either text/plain
6714	text/html according to doc.mimetype. The	6716	or text/html according to <code class=
		6717	"literal">doc.mimetype</code>. The typical
6715	typical use would be as follows:</p>	6718	use would be as follows:</p>
6716	<pre class="programlisting">	6719	<pre class="programlisting">
6717	qdoc = query.fetchone()	6720	qdoc = query.fetchone()
6718	extractor = recoll.Extractor(qdoc)	6721	extractor = recoll.Extractor(qdoc)
6719	doc = extractor.textextract(qdoc.ipath)	6722	doc = extractor.textextract(qdoc.ipath)
6720	# use doc.text, e.g. for previewing	6723	# use doc.text, e.g. for previewing</pre>
6721	</pre>	6724	<p>Passing <code class=
		6725	"literal">qdoc.ipath</code> to <code class=
		6726	"literal">textextract()</code> is redundant,
		6727	but reflects the fact that the <code class=
		6728	"literal">Extractor</code> object actually
		6729	has the capability to access the other
		6730	entries in a compound document.</p>
6722	</dd>	6731	</dd>
6723	<dt><span class=	6732	<dt><span class=
6724	"term">Extractor.idoctofile(ipath, targetmtype,	6733	"term">Extractor.idoctofile(ipath, targetmtype,
6725	outfile='')</span></dt>	6734	outfile='')</span></dt>
6726	<dd>	6735	<dd>
6727	<p>Extracts document into an output file,	6736	<p>Extracts document into an output file,
6728	which can be given explicitly or will be	6737	which can be given explicitly or will be
6729	created as a temporary file to be deleted by	6738	created as a temporary file to be deleted by
6730	the caller. Typical use:</p>	6739	the caller. Typical use:</p>
6731	<pre class="programlisting">	6740	<pre class="programlisting">
6732	qdoc = query.fetchone()	6741	qdoc = query.fetchone()
6733	extractor = recoll.Extractor(qdoc)	6742	extractor = recoll.Extractor(qdoc)
6734	filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre>	6743	filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</pre>
		6744	<p>In all cases the output is a copy, even if
		6745	the requested document is a regular system
		6746	file, which may be wasteful in some cases. If
		6747	you want to avoid this, you can test for a
		6748	simple file document as follows:</p>
		6749	<pre class="programlisting">
		6750	not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")
		6751	</pre>
6735	</dd>	6752	</dd>
6736	</dl>	6753	</dl>
6737	</div>	6754	</div>
6738	</div>	6755	</div>
6739	</div>	6756	</div>
	...		...
6756	other examples. The <code class=	6773	other examples. The <code class=
6757	"filename">recollgui</code> subdirectory has a very	6774	"filename">recollgui</code> subdirectory has a very
6758	embryonic GUI which demonstrates the highlighting and	6775	embryonic GUI which demonstrates the highlighting and
6759	data extraction functions.</p>	6776	data extraction functions.</p>
6760	<pre class="programlisting">	6777	<pre class="programlisting">
6761	#!/usr/bin/env python	6778	#!/usr/bin/env python
6762		6779
6763	from recoll import recoll	6780	from recoll import recoll
6764		6781
6765	db = recoll.connect()	6782	db = recoll.connect()
6766	db.setAbstractParams(maxchars=80, contextwords=4)	6783	db.setAbstractParams(maxchars=80, contextwords=4)
6767		6784
6768	query = db.query()	6785	query = db.query()
6769	nres = query.execute("some user question")	6786	nres = query.execute("some user question")
6770	print "Result count: ", nres	6787	print "Result count: ", nres
6771	if nres > 5:	6788	if nres > 5:
6772	nres = 5	6789	nres = 5
6773	for i in range(nres):	6790	for i in range(nres):
6774	doc = query.fetchone()	6791	doc = query.fetchone()
6775	print "Result #%d" % (query.rownumber,)	6792	print "Result #%d" % (query.rownumber,)
6776	for k in ("title", "size"):	6793	for k in ("title", "size"):
6777	print k, ":", getattr(doc, k).encode('utf-8')	6794	print k, ":", getattr(doc, k).encode('utf-8')
6778	abs = db.makeDocAbstract(doc, query).encode('utf-8')	6795	abs = db.makeDocAbstract(doc, query).encode('utf-8')
6779	print abs	6796	print abs
6780	print	6797	print
6781		6798	</pre>
6782
6783	</pre>
6784	</div>	6799	</div>
6785	</div>	6800	</div>
6786	<div class="sect2">	6801	<div class="sect2">
6787	<div class="titlepage">	6802	<div class="titlepage">
6788	<div>	6803	<div>