
Diff of /src/doc/user/usermanual.xml [216c69] .. [fe2eb1]



...
        <sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
          <title>The rclextract module</title>

          <para>Index queries do not provide document content (only a
          partial and imprecise reconstruction is performed to show the
          snippet text). In order to access the actual document data, the
          data extraction part of the indexing process must be performed
          (subdocument access and format translation). This is not trivial
          in the case of embedded documents. The
          <literal>rclextract</literal> module provides a single class
          which can be used to access the data content for result
          documents.</para>

          <sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">
            <title>Classes</title>
            
            <sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
...
                  built from a <literal>Doc</literal> object, output
                  from a query.</para></listitem>
                </varlistentry>
                <varlistentry>
                  <term>Extractor.textextract(ipath)</term>
                  <listitem><para>Extracts the document defined by
                  <replaceable>ipath</replaceable> and returns a
                  <literal>Doc</literal> object. The
                  <literal>doc.text</literal> field has the document text
                  converted to either text/plain or text/html according to
                  <literal>doc.mimetype</literal>. The typical use would be
                  as follows:</para>
<programlisting>
qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
doc = extractor.textextract(qdoc.ipath)
# use doc.text, e.g. for previewing</programlisting>
                  <para>Passing <literal>qdoc.ipath</literal> to
                  <literal>textextract()</literal> is redundant, but
                  reflects the fact that the <literal>Extractor</literal>
                  object actually has the capability to access the other
                  entries in a compound document.</para>
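                  <para>As a hedged illustration of how the result of
                  <literal>textextract()</literal> might be consumed, the
                  sketch below reduces a <literal>Doc</literal>-like object
                  to plain text whatever its <literal>mimetype</literal>.
                  The <literal>preview_text()</literal> helper and the
                  <literal>SimpleNamespace</literal> stand-in are
                  hypothetical and not part of the module. The tag
                  stripping is deliberately naive: it would also keep the
                  content of script and style elements.</para>

```python
from html.parser import HTMLParser
from types import SimpleNamespace


class _TextCollector(HTMLParser):
    """Accumulate the character data of an HTML document, ignoring tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def preview_text(doc):
    """Return plain text for a Doc-like object (hypothetical helper).

    Only the two fields described above are used: doc.mimetype and doc.text.
    """
    if doc.mimetype == "text/html":
        collector = _TextCollector()
        collector.feed(doc.text)
        collector.close()  # flush any buffered character data
        return "".join(collector.chunks)
    # text/plain: nothing to strip
    return doc.text


# Stand-in for an extracted Doc; a real one would come from textextract().
fake = SimpleNamespace(mimetype="text/html", text="<p>Hello <b>world</b></p>")
print(preview_text(fake))
```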
                  </listitem>
                </varlistentry>
                <varlistentry>
                  <term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>
                  <listitem><para>Extracts the document into an output file.
                  The file can be named explicitly with
                  <replaceable>outfile</replaceable>; otherwise a temporary
                  file is created, which the caller must delete. Typical
                  use:</para>
<programlisting>
qdoc = query.fetchone()
extractor = recoll.Extractor(qdoc)
filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
 
                  <para>In all cases the output is a copy, even if the
                  requested document is a regular file in the file system,
                  which may be wasteful in some cases. If you want to avoid
                  this, you can test for a simple file document as follows:
<programlisting>
not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")
</programlisting>
                  </para></listitem>
                </varlistentry>

              </variablelist>
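              <para>The simple-file test given for
              <literal>idoctofile()</literal> could be wrapped as in the
              following sketch, which only asks the extractor for a copy
              when the document is not a plain file. The names
              <literal>is_plain_file()</literal>,
              <literal>materialize()</literal> and
              <literal>FakeDoc</literal> are hypothetical. The sketch also
              assumes that <literal>doc.url</literal> is a
              <literal>file://</literal> URL which maps to a local path
              once the scheme prefix is dropped; check this against your
              own index before relying on it.</para>

```python
def is_plain_file(doc):
    """True when doc is a regular file rather than an embedded subdocument."""
    return not doc.ipath and ("rclbes" not in doc.keys() or doc["rclbes"] == "FS")


def materialize(extractor, doc):
    """Return (path, is_temporary) for the document data.

    Assumption: for a plain file, doc.url is a file:// URL whose remainder
    is directly usable as a local path.
    """
    if is_plain_file(doc):
        return doc.url[len("file://"):], False
    # Embedded or non-file-system document: extract a copy. The caller is
    # responsible for deleting it, as no outfile is given.
    return extractor.idoctofile(doc.ipath, doc.mimetype), True


class FakeDoc:
    """Stand-in with just the parts of a Doc that the helpers touch."""

    def __init__(self, ipath="", url="", mimetype="", **fields):
        self.ipath, self.url, self.mimetype = ipath, url, mimetype
        self._fields = fields

    def keys(self):
        return list(self._fields)

    def __getitem__(self, key):
        return self._fields[key]


plain = FakeDoc(url="file:///tmp/a.txt", mimetype="text/plain")
embedded = FakeDoc(ipath="3", url="file:///tmp/m.mbox", rclbes="FS")
print(is_plain_file(plain), is_plain_file(embedded))
```

              <para>With a real index, <literal>extractor</literal> would
              be a <literal>recoll.Extractor(qdoc)</literal> instance as in
              the listings above, and a path returned with
              <literal>is_temporary</literal> set would be removed by the
              caller once used.</para>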

            </sect5> <!-- Extractor class -->
          </sect4> <!-- rclextract classes -->
        </sect3> <!-- rclextract module -->


        <sect3 id="RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">
          <title>Search API usage example</title>

          <para>The following sample would query the index with a user
...
          directory inside the &RCL; source for other
          examples. The <filename>recollgui</filename> subdirectory
          has a very embryonic GUI which demonstrates the
          highlighting and data extraction functions.</para>

<programlisting><![CDATA[
#!/usr/bin/env python

from recoll import recoll

db = recoll.connect()
db.setAbstractParams(maxchars=80, contextwords=4)

query = db.query()
nres = query.execute("some user question")
print "Result count: ", nres
if nres > 5:
    nres = 5
for i in range(nres):
    doc = query.fetchone()
    print "Result #%d" % (query.rownumber,)
    for k in ("title", "size"):
        print k, ":", getattr(doc, k).encode('utf-8')
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
    print abs
    print
]]></programlisting>
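
          <para>The listing above uses Python 2 print statements. A
          possible Python 3 rendering of the same loop is sketched below.
          It assumes that the <literal>recoll</literal> module is
          available for Python 3 and that field values and abstracts come
          back as <literal>str</literal>, so that no explicit encoding is
          needed. The <literal>format_results()</literal> and
          <literal>main()</literal> names are hypothetical; only the calls
          already shown in the listing are used.</para>

```python
def format_results(db, query, question, limit=5):
    """Run question and return printable lines for at most limit results."""
    nres = query.execute(question)
    lines = ["Result count: %d" % nres]
    for _ in range(min(nres, limit)):
        doc = query.fetchone()
        lines.append("Result #%d" % (query.rownumber,))
        for k in ("title", "size"):
            lines.append("%s : %s" % (k, getattr(doc, k)))
        lines.append(db.makeDocAbstract(doc, query))
        lines.append("")  # blank separator between results
    return lines


def main():
    # Imported here so that format_results() can be exercised without Recoll.
    from recoll import recoll

    db = recoll.connect()
    db.setAbstractParams(maxchars=80, contextwords=4)
    for line in format_results(db, db.query(), "some user question"):
        print(line)
```

          <para>Calling <literal>main()</literal> runs the query against
          the default index. Returning the lines instead of printing them
          is only a design convenience that keeps the loop testable
          without an index.</para>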



        </sect3>
      </sect2>



	a/src/doc/user/usermanual.xml		b/src/doc/user/usermanual.xml
	...		...
5194	<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">	5194	<sect3 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT">
5195	<title>The rclextract module</title>	5195	<title>The rclextract module</title>
5196		5196
5197	<para>Index queries do not provide document content (only a	5197	<para>Index queries do not provide document content (only a
5198	partial and unprecise reconstruction is performed to show the	5198	partial and unprecise reconstruction is performed to show the
5199	snippets text). In order to access the actual document data,	5199	snippets text). In order to access the actual document data, the
5200	the data extraction part of the indexing process	5200	data extraction part of the indexing process must be performed
5201	must be performed (subdocument access and format	5201	(subdocument access and format translation). This is not trivial
5202	translation). This is not trivial in	5202	in the case of embedded documents. The
5203	general. The <literal>rclextract</literal> module currently	5203	<literal>rclextract</literal> module provides a single class
5204	provides a single class which can be used to access the data	5204	which can be used to access the data content for result
5205	content for result documents.</para>	5205	documents.</para>
5206		5206
5207	<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">	5207	<sect4 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES">
5208	<title>Classes</title>	5208	<title>Classes</title>
5209		5209
5210	<sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">	5210	<sect5 id="RCL.PROGRAM.PYTHONAPI.RCLEXTRACT.CLASSES.EXTRACTOR">
	...		...
5218	built from a <literal>Doc</literal> object, output	5218	built from a <literal>Doc</literal> object, output
5219	from a query.</para></listitem>	5219	from a query.</para></listitem>
5220	</varlistentry>	5220	</varlistentry>
5221	<varlistentry>	5221	<varlistentry>
5222	<term>Extractor.textextract(ipath)</term>	5222	<term>Extractor.textextract(ipath)</term>
5223	<listitem><para>Extract document defined	5223	<listitem><para>Extract document defined by
5224	by <replaceable>ipath</replaceable> and return	5224	<replaceable>ipath</replaceable> and return a
5225	a <literal>Doc</literal> object. The doc.text field	5225	<literal>Doc</literal> object. The
5226	has the document text converted to either text/plain or	5226	<literal>doc.text</literal> field has the document text
5227	text/html according to doc.mimetype. The typical use	5227	converted to either text/plain or text/html according to
		5228	<literal>doc.mimetype</literal>. The typical use would be
5228	would be as follows:	5229	as follows:</para>
5229	<programlisting>	5230	<programlisting>
5230	qdoc = query.fetchone()	5231	qdoc = query.fetchone()
5231	extractor = recoll.Extractor(qdoc)	5232	extractor = recoll.Extractor(qdoc)
5232	doc = extractor.textextract(qdoc.ipath)	5233	doc = extractor.textextract(qdoc.ipath)
5233	# use doc.text, e.g. for previewing	5234	# use doc.text, e.g. for previewing</programlisting>
5234	</programlisting>	5235	<para>Passing <literal>qdoc.ipath</literal> to
		5236	<literal>textextract()</literal> is redundant, but
		5237	reflects the fact that the <literal>Extractor</literal>
		5238	object actually has the capability to access the other
		5239	entries in a compound document.</para>
5235	</para></listitem>	5240	</listitem>
5236	</varlistentry>	5241	</varlistentry>
5237	<varlistentry>	5242	<varlistentry>
5238	<term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>	5243	<term>Extractor.idoctofile(ipath, targetmtype, outfile='')</term>
5239	<listitem><para>Extracts document into an output file,	5244	<listitem><para>Extracts document into an output file,
5240	which can be given explicitly or will be created as a	5245	which can be given explicitly or will be created as a
5241	temporary file to be deleted by the caller. Typical use:	5246	temporary file to be deleted by the caller. Typical
5242	<programlisting>	5247	use:</para>
5243	qdoc = query.fetchone()	5248	<programlisting>
		5249	qdoc = query.fetchone()
5244	extractor = recoll.Extractor(qdoc)	5250	extractor = recoll.Extractor(qdoc)
5245	filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>	5251	filename = extractor.idoctofile(qdoc.ipath, qdoc.mimetype)</programlisting>
5246		5252
		5253	<para>In all cases the output is a copy, even if the
		5254	requested document is a regular system file, which may be
		5255	wasteful in some cases. If you want to avoid this, you
		5256	can test for a simple file document as follows:
		5257	<programlisting>
		5258	not doc.ipath and (not "rclbes" in doc.keys() or doc["rclbes"] == "FS")
		5259	</programlisting>
5247	</para></listitem>	5260	</para></listitem>
5248	</varlistentry>	5261	</varlistentry>
5249		5262
5250	</variablelist>	5263	</variablelist>
5251		5264
5252	</sect5> <!-- Extractor class -->	5265	</sect5> <!-- Extractor class -->
5253	</sect4> <!-- rclextract classes -->	5266	</sect4> <!-- rclextract classes -->
5254	</sect3> <!-- rclextract module -->	5267	</sect3> <!-- rclextract module -->
		5268
5255		5269
5256	<sect3 id="RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">	5270	<sect3 id="RCL.PROGRAM.PYTHONAPI.SEARCH.EXAMPLE">
5257	<title>Search API usage example</title>	5271	<title>Search API usage example</title>
5258		5272
5259	<para>The following sample would query the index with a user	5273	<para>The following sample would query the index with a user
	...		...
5261	directory inside the &RCL; source for other	5275	directory inside the &RCL; source for other
5262	examples. The <filename>recollgui</filename> subdirectory	5276	examples. The <filename>recollgui</filename> subdirectory
5263	has a very embryonic GUI which demonstrates the	5277	has a very embryonic GUI which demonstrates the
5264	highlighting and data extraction functions.</para>	5278	highlighting and data extraction functions.</para>
5265		5279
5266	<programlisting>	5280	<programlisting><![CDATA[
5267	#!/usr/bin/env python	5281	#!/usr/bin/env python
5268	<![CDATA[	5282
5269	from recoll import recoll	5283	from recoll import recoll
5270		5284
5271	db = recoll.connect()	5285	db = recoll.connect()
5272	db.setAbstractParams(maxchars=80, contextwords=4)	5286	db.setAbstractParams(maxchars=80, contextwords=4)
5273		5287
5274	query = db.query()	5288	query = db.query()
5275	nres = query.execute("some user question")	5289	nres = query.execute("some user question")
5276	print "Result count: ", nres	5290	print "Result count: ", nres
5277	if nres > 5:	5291	if nres > 5:
5278	nres = 5	5292	nres = 5
5279	for i in range(nres):	5293	for i in range(nres):
5280	doc = query.fetchone()	5294	doc = query.fetchone()
5281	print "Result #%d" % (query.rownumber,)	5295	print "Result #%d" % (query.rownumber,)
5282	for k in ("title", "size"):	5296	for k in ("title", "size"):
5283	print k, ":", getattr(doc, k).encode('utf-8')	5297	print k, ":", getattr(doc, k).encode('utf-8')
5284	abs = db.makeDocAbstract(doc, query).encode('utf-8')	5298	abs = db.makeDocAbstract(doc, query).encode('utf-8')
5285	print abs	5299	print abs
5286	print	5300	print
5287		5301	]]></programlisting>
5288	]]>
5289	</programlisting>
5290		5302
5291	</sect3>	5303	</sect3>
5292	</sect2>	5304	</sect2>
5293		5305
5294		5306