a/src/doc/user/usermanual.sgml b/src/doc/user/usermanual.sgml
...
...
22
      <year>2005</year>
22
      <year>2005</year>
23
      <holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
23
      <holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
24
      Dockes</holder>
24
      Dockes</holder>
25
    </copyright>
25
    </copyright>
26
26
27
    <releaseinfo>$Id: usermanual.sgml,v 1.66 2008-10-08 16:12:36 dockes Exp $</releaseinfo>
27
    <releaseinfo>$Id: usermanual.sgml,v 1.67 2008-10-10 08:19:12 dockes Exp $</releaseinfo>
28
28
29
    <abstract>
29
    <abstract>
30
      <para>This document introduces full text search notions
30
      <para>This document introduces full text search notions
31
      and describes the installation and use of the &RCL;
31
      and describes the installation and use of the &RCL;
32
      application. It currently describes &RCL; 1.9.</para>
32
      application. It currently describes &RCL; 1.9.</para>
...
...
1573
    unchecking their entries.</para> 
1573
    unchecking their entries.</para> 
1574
1574
1575
      <para>Your main database (the one the current configuration
1575
      <para>Your main database (the one the current configuration
1576
      indexes to), is always implicitly active. If this is not
1576
      indexes to), is always implicitly active. If this is not
1577
      desirable, you can set up your configuration so that it indexes,
1577
      desirable, you can set up your configuration so that it indexes,
1578
      for example, an empty directory.</para>
1578
      for example, an empty directory. An alternative indexer may also
1579
      need to implement a way of purging the index from stale data.
1580
      </para>
1579
1581
1580
    </sect1>
1582
    </sect1>
1581
1583
1582
  </chapter>
1584
  </chapter>
1583
1585
1586
  <chapter id="rcl.program">
1587
    <title>Programming interface</title>
1588
1589
    <sect1 id="rcl.program.elements">
1590
      <title>Interface elements</title>
1591
1592
      <para>A few elements in the interface are specific and need
1593
      an explanation.</para>
1594
1595
      <variablelist>
1596
1597
  <varlistentry>
1598
    <term>udi</term> <listitem><para>A udi (unique document
1599
            identifier) identifies a document. Because of limitations
1600
            inside the index engine, it is restricted in length (to
1601
            200 bytes), which is why a regular URI cannot be used. The
1602
            structure and contents of the udi are defined by the
1603
            application and opaque to the index engine. For example,
1604
            the internal file system indexer uses the complete
1605
            document path (file path + internal path), truncated to
1606
            length, the suppressed part being replaced by a hash
1607
            value.</para> </listitem>
1608
  </varlistentry>
1609
1610
  <varlistentry> 
1611
    <term>ipath</term> 
1612
    
1613
    <listitem><para>This data value (set as a field in the Doc
1614
    object) is stored, along with the URL, but not indexed by
1615
    &RCL;. Its contents are not interpreted, and its use is up
1616
    to the application. For example, the &RCL; internal file
1617
    system indexer stores the part of the document access path
1618
    internal to the container file (<literal>ipath</literal> in
1619
    this case is a list of subdocument sequential numbers). url
1620
    and ipath are returned in every search result and permit
1621
    access to the original document.</para>
1622
    </listitem>
1623
  </varlistentry>
1624
1625
  <varlistentry> 
1626
    <term>Stored and indexed fields</term> 
1627
    
1628
    <listitem><para>The <filename>fields</filename> file inside
1629
    the &RCL; configuration defines which document fields are
1630
    either "indexed" (searchable), "stored" (retrievable with
1631
    search results), or both.</para>
1632
    </listitem>
1633
  </varlistentry>
1634
1635
  </variablelist>
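The udi construction described above (complete document path, truncated to length, with the suppressed part replaced by a hash) can be sketched as follows. This is only an illustration of the idea, not the code used by the internal file system indexer: the function name make_udi, the "|" separator between file path and internal path, and the choice of SHA-1 are all assumptions made for the example. Only the 200-byte limit comes from the text.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit stated in the manual


def make_udi(file_path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build a udi from file path + internal path, at most max_bytes long.

    Hypothetical helper: the separator and the hash function are
    illustrative choices, not what Recoll itself uses.
    """
    full = file_path.encode("utf-8")
    if ipath:
        full += b"|" + ipath.encode("utf-8")
    if len(full) <= max_bytes:
        return full
    # A SHA-1 hex digest is always 40 ASCII bytes long.
    digest_len = 40
    keep = max_bytes - digest_len
    # Keep the head of the path and replace the suppressed tail by its hash,
    # so that two long paths differing only in the tail get different udis.
    suppressed = full[keep:]
    digest = hashlib.sha1(suppressed).hexdigest().encode("ascii")
    return full[:keep] + digest
```

The result is a byte string, so truncation never has to worry about splitting a multi-byte character. Because the index engine treats the udi as opaque, any scheme works as long as it is unique and stable for a given document.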
1636
1637
      <para>Data for an external indexer should be stored in a
1638
      separate index, not the one for the &RCL; internal file system
1639
      indexer (except if the latter is not used at all). The reason
1640
      is that the main document indexer purge pass would remove all
1641
      the other indexer's documents, as they were not seen during
1642
      indexing. The main indexer documents would also probably be a
1643
      problem for the external indexer purge operation.</para>
1644
1645
    </sect1>
1646
1647
    <sect1 id="rcl.program.python">
1648
      <title>Python interface</title>
1649
1650
      <sect2 id="rcl.program.python.intro">
1651
  <title>Introduction</title>
1652
1653
    <para>&RCL; versions after 1.11 define a Python programming
1654
    interface, both for searching and indexing.</para> 
1655
1656
  <para>The Python interface is not built by default and can be
1657
  found in the source package, under python/recoll. The
1658
  directory contains the usual <filename>setup.py</filename>
1659
  script which you can use to build and install the
1660
  module:
1661
1662
    <screen>
1663
        <userinput>cd recoll-xxx/python/recoll</userinput>
1664
        <userinput>python setup.py build</userinput>
1665
        <userinput>python setup.py install</userinput>
1666
      </screen>
1667
          </para> 
1668
1669
      </sect2>
1670
1671
1672
      <sect2 id="rcl.program.python.manual">
1673
  <title>Interface manual</title>
1674
1675
      <literalLayout>
1676
NAME
1677
    recoll - This is an interface to the Recoll full text indexer.
1678
1679
FILE
1680
    /usr/local/lib/python2.5/site-packages/recoll.so
1681
1682
CLASSES
1683
        Db
1684
        Doc
1685
        Query
1686
        SearchData
1687
    
1688
    class Db(__builtin__.object)
1689
     |  Db([confdir=None], [extra_dbs=None], [writable = False])
1690
     |  
1691
     |  A Db object holds a connection to a Recoll index. Use the connect()
1692
     |  function to create one.
1693
     |  confdir specifies a Recoll configuration directory (default: 
1694
     |   $RECOLL_CONFDIR or ~/.recoll).
1695
     |  extra_dbs is a list of external databases (xapian directories)
1696
     |  writable decides if we can index new data through this connection
1697
     |  
1698
     |  Methods defined here:
1699
     |  
1700
     |  
1701
     |  addOrUpdate(...)
1702
     |      addOrUpdate(udi, doc, parent_udi=None) -> None
1703
     |      Add or update index data for a given document
1704
     |      The udi string must define a unique id for the document. It is not
1705
     |      interpreted inside Recoll
1706
     |      doc is a Doc object
1707
     |      if parent_udi is set, this is a unique identifier for the
1708
     |      top-level container (e.g. an mbox file)
1709
     |  
1710
     |  delete(...)
1711
     |      delete(udi) -> Bool.
1712
     |      Purge index from all data for udi. If udi matches a container
1713
     |      document, purge all subdocs (docs with a parent_udi matching udi).
1714
     |  
1715
     |  makeDocAbstract(...)
1716
     |      makeDocAbstract(Doc, Query) -> string
1717
     |      Build and return 'keyword-in-context' abstract for document
1718
     |      and query.
1719
     |  
1720
     |  needUpdate(...)
1721
     |      needUpdate(udi, sig) -> Bool.
1722
     |      Check if the index is up to date for the document defined by udi,
1723
     |      having the current signature sig.
1724
     |  
1725
     |  purge(...)
1726
     |      purge() -> Bool.
1727
     |      Delete all documents that were not touched during the just finished
1728
     |      indexing pass (since open-for-write). These are the documents for
1729
     |      which the needUpdate() call was not performed, indicating that they no
1730
     |      longer exist in the primary storage system.
1731
     |  
1732
     |  query(...)
1733
     |      query() -> Query. Return a new, blank query object for this index.
1734
     |  
1735
     |  setAbstractParams(...)
1736
     |      setAbstractParams(maxchars, contextwords).
1737
     |      Set the parameters used to build 'keyword-in-context' abstracts
1738
     |  
1739
     |  ----------------------------------------------------------------------
1740
     |  Data and other attributes defined here:
1741
     |  
1742
    
1743
    class Doc(__builtin__.object)
1744
     |  Doc()
1745
     |  
1746
     |  A Doc object contains index data for a given document.
1747
     |  The data is extracted from the index when searching, or set by the
1748
     |  indexer program when updating. The Doc object has no useful methods but
1749
     |  many attributes to be read or set by its user. It matches exactly the
1750
     |  Rcl::Doc c++ object. Some of the attributes are predefined, but, 
1751
     |  especially when indexing, others can be set, the name of which will be
1752
     |  processed as field names by the indexing configuration.
1753
     |  Inputs can be specified as unicode or strings.
1754
     |  Outputs are unicode objects.
1755
     |  All dates are specified as unix timestamps, printed as strings
1756
     |  Predefined attributes (index/query/both):
1757
     |   text (index): document plain text
1758
     |   url (both)
1759
     |   fbytes (both) optional file size in bytes
1760
     |   filename (both)
1761
     |   fmtime (both) optional file modification date. Unix time printed 
1762
     |      as string
1763
     |   dbytes (both) document text bytes
1764
     |   dmtime (both) document creation/modification date
1765
     |   ipath (both) value private to the app.: internal access path
1766
     |      inside file
1767
     |   mtype (both) mime type for original document
1768
     |   mtime (query) dmtime if set else fmtime
1769
     |   origcharset (both) charset the text was converted from
1770
     |   size (query) dbytes if set, else fbytes
1771
     |   sig (both) app-defined file modification signature. 
1772
     |      For up to date checks
1773
     |   relevancyrating (query)
1774
     |   abstract (both)
1775
     |   author (both)
1776
     |   title (both)
1777
     |   keywords (both)
1778
     |  
1779
     |  Methods defined here:
1780
     |  
1781
     |  
1782
     |  ----------------------------------------------------------------------
1783
     |  Data and other attributes defined here:
1784
     |  
1785
    
1786
    class Query(__builtin__.object)
1787
     |  Recoll Query objects are used to execute index searches. 
1788
     |  They must be created by the Db.query() method.
1789
     |  
1790
     |  Methods defined here:
1791
     |  
1792
     |  
1793
     |  execute(...)
1794
     |      execute(query_string, stemming=1|0)
1795
     |      
1796
     |      Starts a search for query_string, a Recoll search language string
1797
     |      (mostly Xesam-compatible).
1798
     |      The query can be a simple list of terms (and'ed by default), or more
1799
     |      complicated with field specs etc. See the Recoll manual.
1800
     |  
1801
     |  executesd(...)
1802
     |      executesd(SearchData)
1803
     |      
1804
     |      Starts a search for the query defined by the SearchData object.
1805
     |  
1806
     |  fetchone(...)
1807
     |      fetchone(None) -> Doc
1808
     |      
1809
     |      Fetches the next Doc object in the current search results.
1810
     |  
1811
     |  sortby(...)
1812
     |      sortby(field=fieldname, ascending=true)
1813
     |      Sort results by 'fieldname', in ascending or descending order.
1814
     |      Only one field can be used, no subsorts for now.
1815
     |      Must be called before executing the search
1816
     |  
1817
     |  ----------------------------------------------------------------------
1818
     |  Data descriptors defined here:
1819
     |  
1820
     |  next
1821
     |      Next index to be fetched from results. Normally increments after
1822
     |      each fetchone() call, but can be set/reset before the call to effect
1823
     |      seeking. Starts at 0
1824
     |  
1825
     |  ----------------------------------------------------------------------
1826
     |  Data and other attributes defined here:
1827
     |  
1828
    
1829
    class SearchData(__builtin__.object)
1830
     |  SearchData()
1831
     |  
1832
     |  A SearchData object describes a query. It has a number of global
1833
     |  parameters and a chain of search clauses.
1834
     |  
1835
     |  Methods defined here:
1836
     |  
1837
     |  
1838
     |  addclause(...)
1839
     |      addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
1840
     |                qstring=string, slack=int, field=string, stemming=1|0,
1841
     |                subSearch=SearchData)
1842
     |      Adds a simple clause to the SearchData And/Or chain, or a subquery
1843
     |      defined by another SearchData object
1844
     |  
1845
     |  ----------------------------------------------------------------------
1846
     |  Data and other attributes defined here:
1847
     |  
1848
1849
FUNCTIONS
1850
    connect(...)
1851
        connect([confdir=None], [extra_dbs=None], [writable = False])
1852
                 -> Db.
1853
        
1854
        Connects to a Recoll database and returns a Db object.
1855
        confdir specifies a Recoll configuration directory
1856
        (the default is built like for any Recoll program).
1857
        extra_dbs is a list of external databases (xapian directories)
1858
        writable decides if we can index new data through this connection
1859
1860
1861
</literalLayout>
      </sect2>
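The Doc attribute list above says that dates are Unix timestamps printed as strings, and that the query-side mtime is dmtime if set, else fmtime. The sketch below shows one way a caller might handle such values. Both helper names are hypothetical and not part of the recoll module; the code is written for a current Python 3 interpreter and works on plain strings rather than on a real Doc object.

```python
from datetime import datetime, timezone


def effective_mtime(dmtime, fmtime):
    """Mirror the documented fallback rule: dmtime if set, else fmtime."""
    return dmtime if dmtime else fmtime


def parse_timestamp(value):
    """Convert a Unix timestamp printed as a string to an aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)
```

With a real search result, the input would presumably be an attribute such as doc.mtime; that usage is an assumption based on the attribute list, not something tested here.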
1862
1863
1864
      <sect2 id="rcl.program.python.examples">
1865
  <title>Example code</title>
1866
1867
  <para>The following sample would query the index with a user
1868
  language string. See the <filename>python/samples</filename>
1869
  directory inside the &RCL; source for other examples.</para>
1870
1871
  <programlisting>
1872
#!/usr/bin/env python
1873
1874
import recoll
1875
1876
db = recoll.connect()
1877
db.setAbstractParams(maxchars=80, contextwords=2)
1878
1879
query = db.query()
1880
nres = query.execute("some user question")
1881
print "Result count: ", nres
1882
if nres > 5:
1883
    nres = 5
1884
while query.next >= 0 and query.next < nres: 
1885
    doc = query.fetchone()
1886
    print query.next
1887
    for k in ("title", "size"):
1888
        print k, ":", getattr(doc, k).encode('utf-8')
1889
    abstract = db.makeDocAbstract(doc, query).encode('utf-8')
1890
    print abstract
1891
    print
1892
1893
1894
1895
</programlisting>
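For the indexing side, the needUpdate(), addOrUpdate() and purge() methods documented for Db suggest an update loop like the one sketched below. To keep the sketch runnable without an index, FakeDb and FakeDoc are hypothetical in-memory stand-ins that only imitate the documented behaviour. They are not part of the recoll module, and run_index_pass is an illustrative name. With the real module, the db object would presumably come from recoll.connect() with writable set to true, and doc would be a recoll Doc.

```python
class FakeDoc(object):
    """Hypothetical stand-in for a Doc: a plain attribute holder."""
    pass


class FakeDb(object):
    """Hypothetical in-memory stand-in imitating the documented Db methods."""

    def __init__(self):
        self.sigs = {}     # udi -> signature stored at the last update
        self.seen = set()  # udis checked during the current pass

    def needUpdate(self, udi, sig):
        # Checking a document also marks it as still existing.
        self.seen.add(udi)
        return self.sigs.get(udi) != sig

    def addOrUpdate(self, udi, doc, parent_udi=None):
        self.sigs[udi] = doc.sig

    def purge(self):
        # Drop every document that was not checked during this pass.
        for udi in list(self.sigs):
            if udi not in self.seen:
                del self.sigs[udi]
        self.seen = set()
        return True


def run_index_pass(db, documents):
    """Index (udi, sig, doc) tuples; return how many were (re)indexed."""
    updated = 0
    for udi, sig, doc in documents:
        # Skip documents whose signature matches what the index already has.
        if db.needUpdate(udi, sig):
            doc.sig = sig
            db.addOrUpdate(udi, doc)
            updated += 1
    # Remove documents that no longer exist in the primary storage.
    db.purge()
    return updated
```

Calling needUpdate() for every document that still exists, even unchanged ones, matters here: according to the purge() description, documents that were not checked during the pass are the ones that get deleted.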
1896
1897
      </sect2>
1898
1899
    </sect1>
1900
  </chapter>
1584
1901
1585
  <chapter id="rcl.install">
1902
  <chapter id="rcl.install">
1586
    <title>Installation</title>
1903
    <title>Installation</title>
1587
1904
1588
    <sect1 id="rcl.install.binary">
1905
    <sect1 id="rcl.install.binary">