Diff of /src/doc/user/usermanual.sgml [4a2938] .. [583877]

--- a/src/doc/user/usermanual.sgml
+++ b/src/doc/user/usermanual.sgml
@@ -24,7 +24,7 @@
       Dockes</holder>
     </copyright>
 
-    <releaseinfo>$Id: usermanual.sgml,v 1.66 2008-10-08 16:12:36 dockes Exp $</releaseinfo>
+    <releaseinfo>$Id: usermanual.sgml,v 1.67 2008-10-10 08:19:12 dockes Exp $</releaseinfo>
 
     <abstract>
       <para>This document introduces full text search notions
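Note on the udi element introduced in the next hunk: the manual text says a udi is limited to 200 bytes and that the file system indexer truncates the full document path, replacing the suppressed part with a hash. The sketch below illustrates that idea only. The function name make_udi, the "|" separator and the choice of SHA-1 are assumptions made for this example, not Recoll's actual scheme.

```python
import hashlib

MAX_UDI_BYTES = 200  # length limit stated in the manual text


def make_udi(file_path, ipath="", max_bytes=MAX_UDI_BYTES):
    """Build a bounded-length identifier from a file path and an internal path.

    Hypothetical scheme for illustration; not Recoll's real algorithm.
    """
    full = (file_path + "|" + ipath).encode("utf-8")
    if len(full) <= max_bytes:
        return full
    digest_len = hashlib.sha1().digest_size * 2  # hex digest length (40)
    if max_bytes <= digest_len:
        raise ValueError("max_bytes too small to hold the hash")
    keep = max_bytes - digest_len
    # Hash only the part that does not fit, then append it to the kept prefix.
    digest = hashlib.sha1(full[keep:]).hexdigest().encode("ascii")
    return full[:keep] + digest
```

Because the udi is opaque to the index engine, cutting the UTF-8 byte string in the middle of a character is harmless here; the only requirements are that the result stays within the limit and that distinct documents get distinct values.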
@@ -1575,12 +1575,329 @@
       <para>Your main database (the one the current configuration
       indexes to), is always implicitly active. If this is not
       desirable, you can set up your configuration so that it indexes,
-      for example, an empty directory.</para>
+      for example, an empty directory. An alternative indexer may also
+      need to implement a way of purging the index from stale data.
+      </para>
 
     </sect1>
 
   </chapter>
 
+  <chapter id="rcl.program">
+    <title>Programming interface</title>
+
+    <sect1 id="rcl.program.elements">
+      <title>Interface elements</title>
+
+      <para>A few elements in the interface are specific and need
+      an explanation.</para>
+
+      <variablelist>
+
+	<varlistentry>
+	  <term>udi</term> <listitem><para>A udi (unique document
+            identifier) identifies a document. Because of limitations
+            inside the index engine, it is restricted in length (to
+            200 bytes), which is why a regular URI cannot be used. The
+            structure and contents of the udi are defined by the
+            application and opaque to the index engine. For example,
+            the internal file system indexer uses the complete
+            document path (file path + internal path), truncated to
+            length, the suppressed part being replaced by a hash
+            value.</para> </listitem>
+	</varlistentry>
+
+	<varlistentry> 
+	  <term>ipath</term> 
+	  
+	  <listitem><para>This data value (set as a field in the Doc
+	  object) is stored, along with the URL, but not indexed by
+	  &RCL;. Its contents are not interpreted, and its use is up
+	  to the application. For example, the &RCL; internal file
+	  system indexer stores the part of the document access path
+	  internal to the container file (<literal>ipath</literal> in
+	  this case is a list of subdocument sequential numbers). url
+	  and ipath are returned in every search result and permit
+	  access to the original document.</para>
+	  </listitem>
+	</varlistentry>
+
+	<varlistentry> 
+	  <term>Stored and indexed fields</term> 
+	  
+	  <listitem><para>The <filename>fields</filename> file inside
+	  the &RCL; configuration defines which document fields are
+	  either "indexed" (searchable), "stored" (retrievable with
+	  search results), or both.</para>
+	  </listitem>
+	</varlistentry>
+
+	</variablelist>
+
+      <para>Data for an external indexer should be stored in a
+      separate index, not the one for the &RCL; internal file system
+      indexer (except if the latter is not used at all). The reason
+      is that the main document indexer purge pass would remove all
+      the other indexer's documents, as they were not seen during
+      indexing. The main indexer documents would also probably be a
+      problem for the external indexer purge operation.</para>
+
+    </sect1>
+
+    <sect1 id="rcl.program.python">
+      <title>Python interface</title>
+
+      <sect2 id="rcl.program.python.intro">
+	<title>Introduction</title>
+
+	  <para>&RCL; versions after 1.11 define a Python programming
+	  interface, both for searching and indexing.</para> 
+
+	<para>The Python interface is not built by default and can be
+	found in the source package, under python/recoll. The
+	directory contains the usual <filename>setup.py</filename>
+	script which you can use to build and install the
+	module:
+
+	  <screen>
+        <userinput>cd recoll-xxx/python/recoll</userinput>
+        <userinput>python setup.py build</userinput>
+        <userinput>python setup.py install</userinput>
+      </screen>
+          </para> 
+
+      </sect2>
+
+
+      <sect2 id="rcl.program.python.manual">
+	<title>Interface manual</title>
+
+      <literalLayout>
+NAME
+    recoll - This is an interface to the Recoll full text indexer.
+
+FILE
+    /usr/local/lib/python2.5/site-packages/recoll.so
+
+CLASSES
+        Db
+        Doc
+        Query
+        SearchData
+    
+    class Db(__builtin__.object)
+     |  Db([confdir=None], [extra_dbs=None], [writable = False])
+     |  
+     |  A Db object holds a connection to a Recoll index. Use the connect()
+     |  function to create one.
+     |  confdir specifies a Recoll configuration directory (default: 
+     |   $RECOLL_CONFDIR or ~/.recoll).
+     |  extra_dbs is a list of external databases (xapian directories)
+     |  writable decides if we can index new data through this connection
+     |  
+     |  Methods defined here:
+     |  
+     |  
+     |  addOrUpdate(...)
+     |      addOrUpdate(udi, doc, parent_udi=None) -> None
+     |      Add or update index data for a given document
+     |      The udi string must define a unique id for the document. It is not
+     |      interpreted inside Recoll
+     |      doc is a Doc object
+     |      if parent_udi is set, this is a unique identifier for the
+     |      top-level container (e.g. an mbox file)
+     |  
+     |  delete(...)
+     |      delete(udi) -> Bool.
+     |      Purge index from all data for udi. If udi matches a container
+     |      document, purge all subdocs (docs with a parent_udi matching udi).
+     |  
+     |  makeDocAbstract(...)
+     |      makeDocAbstract(Doc, Query) -> string
+     |      Build and return 'keyword-in-context' abstract for document
+     |      and query.
+     |  
+     |  needUpdate(...)
+     |      needUpdate(udi, sig) -> Bool.
+     |      Check if the index is up to date for the document defined by udi,
+     |      having the current signature sig.
+     |  
+     |  purge(...)
+     |      purge() -> Bool.
+     |      Delete all documents that were not touched during the just finished
+     |      indexing pass (since open-for-write). These are the documents for
+     |      which the needUpdate() call was not performed, indicating that they
+     |      no longer exist in the primary storage system.
+     |  
+     |  query(...)
+     |      query() -> Query. Return a new, blank query object for this index.
+     |  
+     |  setAbstractParams(...)
+     |      setAbstractParams(maxchars, contextwords).
+     |      Set the parameters used to build 'keyword-in-context' abstracts
+     |  
+     |  ----------------------------------------------------------------------
+     |  Data and other attributes defined here:
+     |  
+    
+    class Doc(__builtin__.object)
+     |  Doc()
+     |  
+     |  A Doc object contains index data for a given document.
+     |  The data is extracted from the index when searching, or set by the
+     |  indexer program when updating. The Doc object has no useful methods but
+     |  many attributes to be read or set by its user. It matches exactly the
+     |  Rcl::Doc C++ object. Some of the attributes are predefined, but, 
+     |  especially when indexing, others can be set, the name of which will be
+     |  processed as field names by the indexing configuration.
+     |  Inputs can be specified as unicode or strings.
+     |  Outputs are unicode objects.
+     |  All dates are specified as unix timestamps, printed as strings
+     |  Predefined attributes (index/query/both):
+     |   text (index): document plain text
+     |   url (both)
+     |   fbytes (both) optional file size in bytes
+     |   filename (both)
+     |   fmtime (both) optional file modification date. Unix time printed 
+     |      as string
+     |   dbytes (both) document text bytes
+     |   dmtime (both) document creation/modification date
+     |   ipath (both) value private to the app.: internal access path
+     |      inside file
+     |   mtype (both) mime type for original document
+     |   mtime (query) dmtime if set else fmtime
+     |   origcharset (both) charset the text was converted from
+     |   size (query) dbytes if set, else fbytes
+     |   sig (both) app-defined file modification signature. 
+     |      For up to date checks
+     |   relevancyrating (query)
+     |   abstract (both)
+     |   author (both)
+     |   title (both)
+     |   keywords (both)
+     |  
+     |  Methods defined here:
+     |  
+     |  
+     |  ----------------------------------------------------------------------
+     |  Data and other attributes defined here:
+     |  
+    
+    class Query(__builtin__.object)
+     |  Recoll Query objects are used to execute index searches. 
+     |  They must be created by the Db.query() method.
+     |  
+     |  Methods defined here:
+     |  
+     |  
+     |  execute(...)
+     |      execute(query_string, stemming=1|0)
+     |      
+     |      Starts a search for query_string, a Recoll search language string
+     |      (mostly Xesam-compatible).
+     |      The query can be a simple list of terms (and'ed by default), or more
+     |      complicated with field specs etc. See the Recoll manual.
+     |  
+     |  executesd(...)
+     |      executesd(SearchData)
+     |      
+     |      Starts a search for the query defined by the SearchData object.
+     |  
+     |  fetchone(...)
+     |      fetchone(None) -> Doc
+     |      
+     |      Fetches the next Doc object in the current search results.
+     |  
+     |  sortby(...)
+     |      sortby(field=fieldname, ascending=true)
+     |      Sort results by 'fieldname', in ascending or descending order.
+     |      Only one field can be used, no subsorts for now.
+     |      Must be called before executing the search
+     |  
+     |  ----------------------------------------------------------------------
+     |  Data descriptors defined here:
+     |  
+     |  next
+     |      Next index to be fetched from results. Normally increments after
+     |      each fetchone() call, but can be set/reset before the call to
+     |      effect seeking. Starts at 0.
+     |  
+     |  ----------------------------------------------------------------------
+     |  Data and other attributes defined here:
+     |  
+    
+    class SearchData(__builtin__.object)
+     |  SearchData()
+     |  
+     |  A SearchData object describes a query. It has a number of global
+     |  parameters and a chain of search clauses.
+     |  
+     |  Methods defined here:
+     |  
+     |  
+     |  addclause(...)
+     |      addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
+     |                qstring=string, slack=int, field=string, stemming=1|0,
+     |                subSearch=SearchData)
+     |      Adds a simple clause to the SearchData And/Or chain, or a subquery
+     |      defined by another SearchData object
+     |  
+     |  ----------------------------------------------------------------------
+     |  Data and other attributes defined here:
+     |  
+
+FUNCTIONS
+    connect(...)
+        connect([confdir=None], [extra_dbs=None], [writable = False])
+                 -> Db.
+        
+        Connects to a Recoll database and returns a Db object.
+        confdir specifies a Recoll configuration directory
+        (the default is built like for any Recoll program).
+        extra_dbs is a list of external databases (xapian directories)
+        writable decides if we can index new data through this connection
+
+
+</literalLayout>
+
+
+      <sect2 id="rcl.program.python.examples">
+	<title>Example code</title>
+
+	<para>The following sample would query the index with a user
+	language string. See the <filename>python/samples</filename>
+	directory inside the &RCL; source for other examples.</para>
+
+	<programlisting>
+#!/usr/bin/env python
+
+import recoll
+
+db = recoll.connect()
+db.setAbstractParams(maxchars=80, contextwords=2)
+
+query = db.query()
+nres = query.execute("some user question")
+print "Result count: ", nres
+if nres > 5:
+    nres = 5
+while query.next >= 0 and query.next < nres: 
+    doc = query.fetchone()
+    print query.next
+    for k in ("title", "size"):
+        print k, ":", getattr(doc, k).encode('utf-8')
+    abstract = db.makeDocAbstract(doc, query).encode('utf-8')
+    print abstract
+    print
+
+
+
+</programlisting>
+
+      </sect2>
+
+    </sect1>
+  </chapter>
 
   <chapter id="rcl.install">
     <title>Installation</title>
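Note on the hunk above: the Db methods it documents (needUpdate, addOrUpdate, purge) suggest the following shape for one pass of an external indexer. This is a minimal sketch under stated assumptions, not code from the Recoll sources. FakeDb and FakeDoc are hypothetical in-memory stand-ins so that the example runs without the recoll module; in real use db would presumably come from recoll.connect(writable=True) and make_doc would be recoll.Doc. The index_pass helper and the shape of its source argument are also invented for the example.

```python
class FakeDoc(object):
    """Stand-in for recoll.Doc: a plain attribute holder (hypothetical)."""
    pass


class FakeDb(object):
    """In-memory stand-in for a writable Db (hypothetical, illustration only)."""

    def __init__(self):
        self.sigs = {}     # udi -> signature currently stored in the "index"
        self.seen = set()  # udis checked during the current pass

    def needUpdate(self, udi, sig):
        # Checking a document also marks it as still existing.
        self.seen.add(udi)
        return self.sigs.get(udi) != sig

    def addOrUpdate(self, udi, doc, parent_udi=None):
        self.seen.add(udi)
        self.sigs[udi] = doc.sig

    def purge(self):
        # Drop every document that was not checked during this pass.
        for udi in list(self.sigs):
            if udi not in self.seen:
                del self.sigs[udi]
        self.seen.clear()
        return True


def index_pass(db, source, make_doc):
    """Run one indexing pass; source maps udi -> (sig, text)."""
    updated = []
    for udi, (sig, text) in sorted(source.items()):
        if db.needUpdate(udi, sig):
            doc = make_doc()
            doc.sig = sig
            doc.text = text
            db.addOrUpdate(udi, doc)
            updated.append(udi)
    db.purge()  # remove documents that vanished from the source
    return updated
```

Calling needUpdate() for every document that still exists matters even when nothing has changed: according to the purge() description, documents that were not checked during the pass are treated as deleted from the primary storage.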