|
a/src/doc/user/usermanual.sgml |
|
b/src/doc/user/usermanual.sgml |
|
... |
|
... |
22 |
<year>2005</year>
|
22 |
<year>2005</year>
|
23 |
<holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
|
23 |
<holder role="mailto:jean-francois.dockes@wanadoo.fr">Jean-Francois
|
24 |
Dockes</holder>
|
24 |
Dockes</holder>
|
25 |
</copyright>
|
25 |
</copyright>
|
26 |
|
26 |
|
27 |
<releaseinfo>$Id: usermanual.sgml,v 1.66 2008-10-08 16:12:36 dockes Exp $</releaseinfo>
|
27 |
<releaseinfo>$Id: usermanual.sgml,v 1.67 2008-10-10 08:19:12 dockes Exp $</releaseinfo>
|
28 |
|
28 |
|
29 |
<abstract>
|
29 |
<abstract>
|
30 |
<para>This document introduces full text search notions
|
30 |
<para>This document introduces full text search notions
|
31 |
and describes the installation and use of the &RCL;
|
31 |
and describes the installation and use of the &RCL;
|
32 |
application. It currently describes &RCL; 1.9.</para>
|
32 |
application. It currently describes &RCL; 1.9.</para>
|
|
... |
|
... |
1573 |
unchecking their entries.</para>
|
1573 |
unchecking their entries.</para>
|
1574 |
|
1574 |
|
1575 |
<para>Your main database (the one the current configuration
|
1575 |
<para>Your main database (the one the current configuration
|
1576 |
indexes to), is always implicitly active. If this is not
|
1576 |
indexes to), is always implicitly active. If this is not
|
1577 |
desirable, you can set up your configuration so that it indexes,
|
1577 |
desirable, you can set up your configuration so that it indexes,
|
1578 |
for example, an empty directory.</para>
|
1578 |
for example, an empty directory. An alternative indexer may also
|
|
|
1579 |
need to implement a way of purging the index from stale data.
|
|
|
1580 |
</para>
|
1579 |
|
1581 |
|
1580 |
</sect1>
|
1582 |
</sect1>
|
1581 |
|
1583 |
|
1582 |
</chapter>
|
1584 |
</chapter>
|
1583 |
|
1585 |
|
|
|
1586 |
<chapter id="rcl.program">
|
|
|
1587 |
<title>Programming interface</title>
|
|
|
1588 |
|
|
|
1589 |
<sect1 id="rcl.program.elements">
|
|
|
1590 |
<title>Interface elements</title>
|
|
|
1591 |
|
|
|
1592 |
<para>A few elements in the interface are specific and need
|
|
|
1593 |
an explanation.</para>
|
|
|
1594 |
|
|
|
1595 |
<variablelist>
|
|
|
1596 |
|
|
|
1597 |
<varlistentry>
|
|
|
1598 |
<term>udi</term> <listitem><para>An udi (unique document
|
|
|
1599 |
identifier) identifies a document. Because of limitations
|
|
|
1600 |
inside the index engine, it is restricted in length (to
|
|
|
1601 |
200 bytes), which is why a regular URI cannot be used. The
|
|
|
1602 |
structure and contents of the udi are defined by the
|
|
|
1603 |
application and opaque to the index engine. For example,
|
|
|
1604 |
the internal file system indexer uses the complete
|
|
|
1605 |
document path (file path + internal path), truncated to
|
|
|
1606 |
length, the suppressed part being replaced by a hash
|
|
|
1607 |
value.</para> </listitem>
|
|
|
1608 |
</varlistentry>
|
|
|
1609 |
|
|
|
1610 |
<varlistentry>
|
|
|
1611 |
<term>ipath</term>
|
|
|
1612 |
|
|
|
1613 |
<listitem><para>This data value (set as a field in the Doc
|
|
|
1614 |
object) is stored, along with the URL, but not indexed by
|
|
|
1615 |
&RCL;. Its contents are not interpreted, and its use is up
|
|
|
1616 |
to the application. For example, the &RCL; internal file
|
|
|
1617 |
system indexer stores the part of the document access path
|
|
|
1618 |
internal to the container file (<literal>ipath</literal> in
|
|
|
1619 |
this case is a list of subdocument sequential numbers). url
|
|
|
1620 |
and ipath are returned in every search result and permit
|
|
|
1621 |
access to the original document.</para>
|
|
|
1622 |
</listitem>
|
|
|
1623 |
</varlistentry>
|
|
|
1624 |
|
|
|
1625 |
<varlistentry>
|
|
|
1626 |
<term>Stored and indexed fields</term>
|
|
|
1627 |
|
|
|
1628 |
<listitem><para>The <filename>fields</filename> file inside
|
|
|
1629 |
the &RCL; configuration defines which document fields are
|
|
|
1630 |
either "indexed" (searchable), "stored" (retrievable with
|
|
|
1631 |
search results), or both.</para>
|
|
|
1632 |
</listitem>
|
|
|
1633 |
</varlistentry>
|
|
|
1634 |
|
|
|
1635 |
</variablelist>
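The udi construction described above (complete document path, truncated to length, with the suppressed part replaced by a hash value) could be sketched as follows. This is only an illustration: the function name make_udi, the '|' separator between file path and internal path, and the use of an MD5 hex digest are assumptions made for the example, not the scheme actually used by the Recoll file system indexer.

```python
import hashlib

UDI_MAX_BYTES = 200  # length limit quoted in the text above


def make_udi(file_path, ipath="", max_bytes=UDI_MAX_BYTES):
    """Build a length-limited identifier from file path + internal path.

    Hypothetical scheme: the '|' separator and the MD5 hex digest are
    illustration choices, not Recoll's actual format.
    """
    full = (file_path + "|" + ipath).encode("utf-8")
    if len(full) <= max_bytes:
        return full
    # Keep the head of the path, and replace the suppressed tail with a
    # 32-character hex digest of that tail, so the result is exactly
    # max_bytes long and still differs when only the tail differs.
    keep = max_bytes - 32
    digest = hashlib.md5(full[keep:]).hexdigest().encode("ascii")
    return full[:keep] + digest


short_udi = make_udi("/home/me/mail/inbox", "3")
long_udi = make_udi("/" + "a" * 300, "1")
print(len(short_udi), len(long_udi))
```

Because the udi is opaque to the index engine, the byte-level truncation may cut a multi-byte character in half without harm, as long as the application always builds the identifier the same way.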
|
|
|
1636 |
|
|
|
1637 |
<para>Data for an external indexer should be stored in a
|
|
|
1638 |
separate index, not the one for the &RCL; internal file system
|
|
|
1639 |
indexer (except if the latter is not used at all). The reason
|
|
|
1640 |
is that the main document indexer purge pass would remove all
|
|
|
1641 |
the other indexer's documents, as they were not seen during
|
|
|
1642 |
indexing. The main indexer documents would also probably be a
|
|
|
1643 |
problem for the external indexer purge operation.</para>
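The purge problem can be shown with a small model. The FakeIndex class below is a hypothetical in-memory stand-in, not the recoll module: its need_update, add_or_update and purge methods only imitate the behaviour documented for needUpdate(), addOrUpdate() and purge() (documents that were not checked during a pass are deleted by the purge), and it stores a bare signature where the real interface takes a Doc object.

```python
class FakeIndex(object):
    """In-memory model of the documented update/purge semantics.

    Hypothetical: this is NOT the recoll module, only an illustration.
    """

    def __init__(self):
        self.docs = {}     # udi -> signature
        self.seen = set()  # udis touched during the current pass

    def need_update(self, udi, sig):
        self.seen.add(udi)  # checking a document marks it as still existing
        return self.docs.get(udi) != sig

    def add_or_update(self, udi, sig):
        self.seen.add(udi)
        self.docs[udi] = sig

    def purge(self):
        # Remove every document that was not seen during this pass.
        stale = sorted(udi for udi in self.docs if udi not in self.seen)
        for udi in stale:
            del self.docs[udi]
        self.seen.clear()
        return stale


def index_pass(index, source):
    """Run one indexing pass over source, a dict mapping udi -> signature."""
    for udi, sig in sorted(source.items()):
        if index.need_update(udi, sig):
            index.add_or_update(udi, sig)
    return index.purge()


files = {"file:/a.txt": "sig1"}
mails = {"mail:1": "sigA", "mail:2": "sigB"}

# Shared index: the file indexer pass purges the external indexer's data.
shared = FakeIndex()
index_pass(shared, mails)
removed = index_pass(shared, files)
print(removed)

# Separate indexes: each purge only sees its own documents.
file_index, mail_index = FakeIndex(), FakeIndex()
index_pass(mail_index, mails)
index_pass(file_index, files)
print(sorted(mail_index.docs))
```

With the shared index, the second pass removes both mail entries because the file indexer never checked them. With one index per indexer, nothing is lost.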
|
|
|
1644 |
|
|
|
1645 |
</sect1>
|
|
|
1646 |
|
|
|
1647 |
<sect1 id="rcl.program.python">
|
|
|
1648 |
<title>Python interface</title>
|
|
|
1649 |
|
|
|
1650 |
<sect2 id="rcl.program.python.intro">
|
|
|
1651 |
<title>Introduction</title>
|
|
|
1652 |
|
|
|
1653 |
<para>&RCL; versions after 1.11 define a Python programming
|
|
|
1654 |
interface, both for searching and indexing.</para>
|
|
|
1655 |
|
|
|
1656 |
<para>The Python interface is not built by default and can be
|
|
|
1657 |
found in the source package, under python/recoll. The
|
|
|
1658 |
directory contains the usual <filename>setup.py</filename>
|
|
|
1659 |
script which you can use to build and install the
|
|
|
1660 |
module:
|
|
|
1661 |
|
|
|
1662 |
<screen>
|
|
|
1663 |
<userinput>cd recoll-xxx/python/recoll</userinput>
|
|
|
1664 |
<userinput>python setup.py build</userinput>
|
|
|
1665 |
<userinput>python setup.py install</userinput>
|
|
|
1666 |
</screen>
|
|
|
1667 |
</para>
|
|
|
1668 |
|
|
|
1669 |
</sect2>
|
|
|
1670 |
|
|
|
1671 |
|
|
|
1672 |
<sect2 id="rcl.program.python.manual">
|
|
|
1673 |
<title>Interface manual</title>
|
|
|
1674 |
|
|
|
1675 |
<literalLayout>
|
|
|
1676 |
NAME
|
|
|
1677 |
recoll - This is an interface to the Recoll full text indexer.
|
|
|
1678 |
|
|
|
1679 |
FILE
|
|
|
1680 |
/usr/local/lib/python2.5/site-packages/recoll.so
|
|
|
1681 |
|
|
|
1682 |
CLASSES
|
|
|
1683 |
Db
|
|
|
1684 |
Doc
|
|
|
1685 |
Query
|
|
|
1686 |
SearchData
|
|
|
1687 |
|
|
|
1688 |
class Db(__builtin__.object)
|
|
|
1689 |
| Db([confdir=None], [extra_dbs=None], [writable = False])
|
|
|
1690 |
|
|
|
|
1691 |
| A Db object holds a connection to a Recoll index. Use the connect()
|
|
|
1692 |
| function to create one.
|
|
|
1693 |
| confdir specifies a Recoll configuration directory (default:
|
|
|
1694 |
| $RECOLL_CONFDIR or ~/.recoll).
|
|
|
1695 |
| extra_dbs is a list of external databases (xapian directories)
|
|
|
1696 |
| writable decides if we can index new data through this connection
|
|
|
1697 |
|
|
|
|
1698 |
| Methods defined here:
|
|
|
1699 |
|
|
|
|
1700 |
|
|
|
|
1701 |
| addOrUpdate(...)
|
|
|
1702 |
| addOrUpdate(udi, doc, parent_udi=None) -> None
|
|
|
1703 |
| Add or update index data for a given document
|
|
|
1704 |
| The udi string must define a unique id for the document. It is not
|
|
|
1705 |
| interpreted inside Recoll
|
|
|
1706 |
| doc is a Doc object
|
|
|
1707 |
| if parent_udi is set, this is a unique identifier for the
|
|
|
1708 |
| top-level container (i.e. an mbox file)
|
|
|
1709 |
|
|
|
|
1710 |
| delete(...)
|
|
|
1711 |
| delete(udi) -> Bool.
|
|
|
1712 |
| Purge index from all data for udi. If udi matches a container
|
|
|
1713 |
| document, purge all subdocs (docs with a parent_udi matching udi).
|
|
|
1714 |
|
|
|
|
1715 |
| makeDocAbstract(...)
|
|
|
1716 |
| makeDocAbstract(Doc, Query) -> string
|
|
|
1717 |
| Build and return 'keyword-in-context' abstract for document
|
|
|
1718 |
| and query.
|
|
|
1719 |
|
|
|
|
1720 |
| needUpdate(...)
|
|
|
1721 |
| needUpdate(udi, sig) -> Bool.
|
|
|
1722 |
| Check if the index is up to date for the document defined by udi,
|
|
|
1723 |
| having the current signature sig.
|
|
|
1724 |
|
|
|
|
1725 |
| purge(...)
|
|
|
1726 |
| purge() -> Bool.
|
|
|
1727 |
| Delete all documents that were not touched during the just finished
|
|
|
1728 |
| indexing pass (since open-for-write). These are the documents for
|
|
|
1729 |
| which the needUpdate() call was not performed, indicating that they no
|
|
|
1730 |
| longer exist in the primary storage system.
|
|
|
1731 |
|
|
|
|
1732 |
| query(...)
|
|
|
1733 |
| query() -> Query. Return a new, blank query object for this index.
|
|
|
1734 |
|
|
|
|
1735 |
| setAbstractParams(...)
|
|
|
1736 |
| setAbstractParams(maxchars, contextwords).
|
|
|
1737 |
| Set the parameters used to build 'keyword-in-context' abstracts
|
|
|
1738 |
|
|
|
|
1739 |
| ----------------------------------------------------------------------
|
|
|
1740 |
| Data and other attributes defined here:
|
|
|
1741 |
|
|
|
|
1742 |
|
|
|
1743 |
class Doc(__builtin__.object)
|
|
|
1744 |
| Doc()
|
|
|
1745 |
|
|
|
|
1746 |
| A Doc object contains index data for a given document.
|
|
|
1747 |
| The data is extracted from the index when searching, or set by the
|
|
|
1748 |
| indexer program when updating. The Doc object has no useful methods but
|
|
|
1749 |
| many attributes to be read or set by its user. It matches exactly the
|
|
|
1750 |
| Rcl::Doc c++ object. Some of the attributes are predefined, but,
|
|
|
1751 |
| especially when indexing, others can be set, the name of which will be
|
|
|
1752 |
| processed as field names by the indexing configuration.
|
|
|
1753 |
| Inputs can be specified as unicode or strings.
|
|
|
1754 |
| Outputs are unicode objects.
|
|
|
1755 |
| All dates are specified as unix timestamps, printed as strings
|
|
|
1756 |
| Predefined attributes (index/query/both):
|
|
|
1757 |
| text (index): document plain text
|
|
|
1758 |
| url (both)
|
|
|
1759 |
| fbytes (both) optional file size in bytes
|
|
|
1760 |
| filename (both)
|
|
|
1761 |
| fmtime (both) optional file modification date. Unix time printed
|
|
|
1762 |
| as string
|
|
|
1763 |
| dbytes (both) document text bytes
|
|
|
1764 |
| dmtime (both) document creation/modification date
|
|
|
1765 |
| ipath (both) value private to the app.: internal access path
|
|
|
1766 |
| inside file
|
|
|
1767 |
| mtype (both) mime type for original document
|
|
|
1768 |
| mtime (query) dmtime if set else fmtime
|
|
|
1769 |
| origcharset (both) charset the text was converted from
|
|
|
1770 |
| size (query) dbytes if set, else fbytes
|
|
|
1771 |
| sig (both) app-defined file modification signature.
|
|
|
1772 |
| For up to date checks
|
|
|
1773 |
| relevancyrating (query)
|
|
|
1774 |
| abstract (both)
|
|
|
1775 |
| author (both)
|
|
|
1776 |
| title (both)
|
|
|
1777 |
| keywords (both)
|
|
|
1778 |
|
|
|
|
1779 |
| Methods defined here:
|
|
|
1780 |
|
|
|
|
1781 |
|
|
|
|
1782 |
| ----------------------------------------------------------------------
|
|
|
1783 |
| Data and other attributes defined here:
|
|
|
1784 |
|
|
|
|
1785 |
|
|
|
1786 |
class Query(__builtin__.object)
|
|
|
1787 |
| Recoll Query objects are used to execute index searches.
|
|
|
1788 |
| They must be created by the Db.query() method.
|
|
|
1789 |
|
|
|
|
1790 |
| Methods defined here:
|
|
|
1791 |
|
|
|
|
1792 |
|
|
|
|
1793 |
| execute(...)
|
|
|
1794 |
| execute(query_string, stemming=1|0)
|
|
|
1795 |
|
|
|
|
1796 |
| Starts a search for query_string, a Recoll search language string
|
|
|
1797 |
| (mostly Xesam-compatible).
|
|
|
1798 |
| The query can be a simple list of terms (and'ed by default), or more
|
|
|
1799 |
| complicated with field specs etc. See the Recoll manual.
|
|
|
1800 |
|
|
|
|
1801 |
| executesd(...)
|
|
|
1802 |
| executesd(SearchData)
|
|
|
1803 |
|
|
|
|
1804 |
| Starts a search for the query defined by the SearchData object.
|
|
|
1805 |
|
|
|
|
1806 |
| fetchone(...)
|
|
|
1807 |
| fetchone(None) -> Doc
|
|
|
1808 |
|
|
|
|
1809 |
| Fetches the next Doc object in the current search results.
|
|
|
1810 |
|
|
|
|
1811 |
| sortby(...)
|
|
|
1812 |
| sortby(field=fieldname, ascending=True)
|
|
|
1813 |
| Sort results by 'fieldname', in ascending or descending order.
|
|
|
1814 |
| Only one field can be used, no subsorts for now.
|
|
|
1815 |
| Must be called before executing the search
|
|
|
1816 |
|
|
|
|
1817 |
| ----------------------------------------------------------------------
|
|
|
1818 |
| Data descriptors defined here:
|
|
|
1819 |
|
|
|
|
1820 |
| next
|
|
|
1821 |
| Next index to be fetched from results. Normally increments after
|
|
|
1822 |
| each fetchone() call, but can be set/reset before the call to effect
|
|
|
1823 |
| seeking. Starts at 0
|
|
|
1824 |
|
|
|
|
1825 |
| ----------------------------------------------------------------------
|
|
|
1826 |
| Data and other attributes defined here:
|
|
|
1827 |
|
|
|
|
1828 |
|
|
|
1829 |
class SearchData(__builtin__.object)
|
|
|
1830 |
| SearchData()
|
|
|
1831 |
|
|
|
|
1832 |
| A SearchData object describes a query. It has a number of global
|
|
|
1833 |
| parameters and a chain of search clauses.
|
|
|
1834 |
|
|
|
|
1835 |
| Methods defined here:
|
|
|
1836 |
|
|
|
|
1837 |
|
|
|
|
1838 |
| addclause(...)
|
|
|
1839 |
| addclause(type='and'|'or'|'excl'|'phrase'|'near'|'sub',
|
|
|
1840 |
| qstring=string, slack=int, field=string, stemming=1|0,
|
|
|
1841 |
| subSearch=SearchData)
|
|
|
1842 |
| Adds a simple clause to the SearchData And/Or chain, or a subquery
|
|
|
1843 |
| defined by another SearchData object
|
|
|
1844 |
|
|
|
|
1845 |
| ----------------------------------------------------------------------
|
|
|
1846 |
| Data and other attributes defined here:
|
|
|
1847 |
|
|
|
|
1848 |
|
|
|
1849 |
FUNCTIONS
|
|
|
1850 |
connect(...)
|
|
|
1851 |
connect([confdir=None], [extra_dbs=None], [writable = False])
|
|
|
1852 |
-> Db.
|
|
|
1853 |
|
|
|
1854 |
Connects to a Recoll database and returns a Db object.
|
|
|
1855 |
confdir specifies a Recoll configuration directory
|
|
|
1856 |
(the default is determined as for any Recoll program).
|
|
|
1857 |
extra_dbs is a list of external databases (xapian directories)
|
|
|
1858 |
writable decides if we can index new data through this connection
|
|
|
1859 |
|
|
|
1860 |
|
|
|
1861 |
</literalLayout>
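The Doc help text above states that dates are unix timestamps printed as strings. A minimal sketch of turning such a value into a readable date, using only the standard library, is given here. The helper name format_index_date and the sample value are assumptions for the illustration. No Recoll call is involved.

```python
import time


def format_index_date(stamp):
    """Convert a unix timestamp given as a string into a UTC date string."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(int(stamp)))


# Sample value standing in for an attribute such as fmtime or dmtime.
print(format_index_date("1223626752"))  # 2008-10-10 08:19:12
```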
|
|
|
1862 |
|
|
|
1863 |
|
|
|
1864 |
<sect2 id="rcl.program.python.examples">
|
|
|
1865 |
<title>Example code</title>
|
|
|
1866 |
|
|
|
1867 |
<para>The following sample would query the index with a user
|
|
|
1868 |
language string. See the <filename>python/samples</filename>
|
|
|
1869 |
directory inside the &RCL; source for other examples.</para>
|
|
|
1870 |
|
|
|
1871 |
<programlisting>
|
|
|
1872 |
#!/usr/bin/env python
|
|
|
1873 |
|
|
|
1874 |
import recoll
|
|
|
1875 |
|
|
|
1876 |
db = recoll.connect()
|
|
|
1877 |
db.setAbstractParams(maxchars=80, contextwords=2)
|
|
|
1878 |
|
|
|
1879 |
query = db.query()
|
|
|
1880 |
nres = query.execute("some user question")
|
|
|
1881 |
print "Result count: ", nres
|
|
|
1882 |
if nres > 5:
|
|
|
1883 |
    nres = 5
|
|
|
1884 |
while query.next >= 0 and query.next < nres:
|
|
|
1885 |
    doc = query.fetchone()
|
|
|
1886 |
    print query.next
|
|
|
1887 |
    for k in ("title", "size"):
|
|
|
1888 |
        print k, ":", getattr(doc, k).encode('utf-8')
|
|
|
1889 |
    abs = db.makeDocAbstract(doc, query).encode('utf-8')
|
|
|
1890 |
    print abs
|
|
|
1891 |
    print
|
|
|
1892 |
|
|
|
1893 |
|
|
|
1894 |
|
|
|
1895 |
</programlisting>
|
|
|
1896 |
|
|
|
1897 |
</sect2>
|
|
|
1898 |
|
|
|
1899 |
</sect1>
|
|
|
1900 |
</chapter>
|
1584 |
|
1901 |
|
1585 |
<chapter id="rcl.install">
|
1902 |
<chapter id="rcl.install">
|
1586 |
<title>Installation</title>
|
1903 |
<title>Installation</title>
|
1587 |
|
1904 |
|
1588 |
<sect1 id="rcl.install.binary">
|
1905 |
<sect1 id="rcl.install.binary">
|