Sophie

Sophie

distrib > Fedora > 13 > i386 > by-pkgid > 2dc7ae7102ce788eb8a15dec0caf7708 > files > 349

xapian-core-devel-1.0.21-1.fc13.i686.rpm

<?xml version="1.0" encoding="utf-8" ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
<meta name="generator" content="Docutils 0.5: http://docutils.sourceforge.net/" />
<title>Xapian 1.0 Term Indexing/Querying Scheme</title>
<style type="text/css">

/*
:Author: David Goodger (goodger@python.org)
:Id: $Id: html4css1.css 5196 2007-06-03 20:25:28Z wiemann $
:Copyright: This stylesheet has been placed in the public domain.

Default cascading style sheet for the HTML output of Docutils.

See http://docutils.sf.net/docs/howto/html-stylesheets.html for how to
customize this style sheet.
*/

/* used to remove borders from tables and images */
.borderless, table.borderless td, table.borderless th {
  border: 0 }

table.borderless td, table.borderless th {
  /* Override padding for "table.docutils td" with "! important".
     The right padding separates the table cells. */
  padding: 0 0.5em 0 0 ! important }

.first {
  /* Override more specific margin styles with "! important". */
  margin-top: 0 ! important }

.last, .with-subtitle {
  margin-bottom: 0 ! important }

.hidden {
  display: none }

a.toc-backref {
  text-decoration: none ;
  color: black }

blockquote.epigraph {
  margin: 2em 5em ; }

dl.docutils dd {
  margin-bottom: 0.5em }

/* Uncomment (and remove this text!) to get bold-faced definition list terms
dl.docutils dt {
  font-weight: bold }
*/

div.abstract {
  margin: 2em 5em }

div.abstract p.topic-title {
  font-weight: bold ;
  text-align: center }

div.admonition, div.attention, div.caution, div.danger, div.error,
div.hint, div.important, div.note, div.tip, div.warning {
  margin: 2em ;
  border: medium outset ;
  padding: 1em }

div.admonition p.admonition-title, div.hint p.admonition-title,
div.important p.admonition-title, div.note p.admonition-title,
div.tip p.admonition-title {
  font-weight: bold ;
  font-family: sans-serif }

div.attention p.admonition-title, div.caution p.admonition-title,
div.danger p.admonition-title, div.error p.admonition-title,
div.warning p.admonition-title {
  color: red ;
  font-weight: bold ;
  font-family: sans-serif }

/* Uncomment (and remove this text!) to get reduced vertical space in
   compound paragraphs.
div.compound .compound-first, div.compound .compound-middle {
  margin-bottom: 0.5em }

div.compound .compound-last, div.compound .compound-middle {
  margin-top: 0.5em }
*/

div.dedication {
  margin: 2em 5em ;
  text-align: center ;
  font-style: italic }

div.dedication p.topic-title {
  font-weight: bold ;
  font-style: normal }

div.figure {
  margin-left: 2em ;
  margin-right: 2em }

div.footer, div.header {
  clear: both;
  font-size: smaller }

div.line-block {
  display: block ;
  margin-top: 1em ;
  margin-bottom: 1em }

div.line-block div.line-block {
  margin-top: 0 ;
  margin-bottom: 0 ;
  margin-left: 1.5em }

div.sidebar {
  margin: 0 0 0.5em 1em ;
  border: medium outset ;
  padding: 1em ;
  background-color: #ffffee ;
  width: 40% ;
  float: right ;
  clear: right }

div.sidebar p.rubric {
  font-family: sans-serif ;
  font-size: medium }

div.system-messages {
  margin: 5em }

div.system-messages h1 {
  color: red }

div.system-message {
  border: medium outset ;
  padding: 1em }

div.system-message p.system-message-title {
  color: red ;
  font-weight: bold }

div.topic {
  margin: 2em }

h1.section-subtitle, h2.section-subtitle, h3.section-subtitle,
h4.section-subtitle, h5.section-subtitle, h6.section-subtitle {
  margin-top: 0.4em }

h1.title {
  text-align: center }

h2.subtitle {
  text-align: center }

hr.docutils {
  width: 75% }

img.align-left {
  clear: left }

img.align-right {
  clear: right }

ol.simple, ul.simple {
  margin-bottom: 1em }

ol.arabic {
  list-style: decimal }

ol.loweralpha {
  list-style: lower-alpha }

ol.upperalpha {
  list-style: upper-alpha }

ol.lowerroman {
  list-style: lower-roman }

ol.upperroman {
  list-style: upper-roman }

p.attribution {
  text-align: right ;
  margin-left: 50% }

p.caption {
  font-style: italic }

p.credits {
  font-style: italic ;
  font-size: smaller }

p.label {
  white-space: nowrap }

p.rubric {
  font-weight: bold ;
  font-size: larger ;
  color: maroon ;
  text-align: center }

p.sidebar-title {
  font-family: sans-serif ;
  font-weight: bold ;
  font-size: larger }

p.sidebar-subtitle {
  font-family: sans-serif ;
  font-weight: bold }

p.topic-title {
  font-weight: bold }

pre.address {
  margin-bottom: 0 ;
  margin-top: 0 ;
  font-family: serif ;
  font-size: 100% }

pre.literal-block, pre.doctest-block {
  margin-left: 2em ;
  margin-right: 2em }

span.classifier {
  font-family: sans-serif ;
  font-style: oblique }

span.classifier-delimiter {
  font-family: sans-serif ;
  font-weight: bold }

span.interpreted {
  font-family: sans-serif }

span.option {
  white-space: nowrap }

span.pre {
  white-space: pre }

span.problematic {
  color: red }

span.section-subtitle {
  /* font-size relative to parent (h1..h6 element) */
  font-size: 80% }

table.citation {
  border-left: solid 1px gray;
  margin-left: 1px }

table.docinfo {
  margin: 2em 4em }

table.docutils {
  margin-top: 0.5em ;
  margin-bottom: 0.5em }

table.footnote {
  border-left: solid 1px black;
  margin-left: 1px }

table.docutils td, table.docutils th,
table.docinfo td, table.docinfo th {
  padding-left: 0.5em ;
  padding-right: 0.5em ;
  vertical-align: top }

table.docutils th.field-name, table.docinfo th.docinfo-name {
  font-weight: bold ;
  text-align: left ;
  white-space: nowrap ;
  padding-left: 0 }

h1 tt.docutils, h2 tt.docutils, h3 tt.docutils,
h4 tt.docutils, h5 tt.docutils, h6 tt.docutils {
  font-size: 100% }

ul.auto-toc {
  list-style-type: none }

</style>
</head>
<body>
<div class="document" id="xapian-1-0-term-indexing-querying-scheme">
<h1 class="title">Xapian 1.0 Term Indexing/Querying Scheme</h1>

<!-- Copyright (C) 2007 Olly Betts -->
<div class="contents topic" id="table-of-contents">
<p class="topic-title first">Table of contents</p>
<ul class="simple">
<li><a class="reference internal" href="#introduction" id="id1">Introduction</a></li>
<li><a class="reference internal" href="#stemming" id="id2">Stemming</a></li>
<li><a class="reference internal" href="#word-characters" id="id3">Word Characters</a></li>
</ul>
</div>
<div class="section" id="introduction">
<h1><a class="toc-backref" href="#id1">Introduction</a></h1>
<p>In Xapian 1.0, the default indexing scheme has been changed significantly, to address
lessons learned from observing the old scheme in real world use.  This document
describes the new scheme, with references to differences from the old.</p>
</div>
<div class="section" id="stemming">
<h1><a class="toc-backref" href="#id2">Stemming</a></h1>
<p>The most obvious difference is the handling of stemmed forms.</p>
<p>Previously all words were indexed stemmed without a prefix, and capitalised words were
indexed unstemmed (but lower cased) with an 'R' prefix.  The rationale for doing this was
that people want to be able to search for exact proper nouns (e.g. the English stemmer
conflates <tt class="docutils literal"><span class="pre">Tony</span></tt> and <tt class="docutils literal"><span class="pre">Toni</span></tt>).  But of course this also indexes words at the start
of sentences, words in titles, and in German all nouns are capitalised so will be indexed.
Both the normal and R-prefixed terms were indexed with positional information.</p>
<p>Now we index all words lowercased with positional information, and also stemmed with a
'Z' prefix (unless they start with a digit), but without positional information.  By default
a Xapian::Stopper is used to avoid indexed stemmed forms of stopwords (tests show this shaves
around 1% off the database size).</p>
<p>The new scheme allows exact phrase searching (which the old scheme didn't).  <tt class="docutils literal"><span class="pre">NEAR</span></tt>
now has to operate on unstemmed forms, but that's reasonable enough.  We can also disable
stemming of words which are capitalised in the query, to achieve good results for
proper nouns.  And Omega's $topterms will now always suggest unstemmed forms!</p>
<p>The main rationale for prefixing the stemmed forms is that there are simply fewer of
them!  As a side benefit, it opens the way for storing stemmed forms for multiple
languages (e.g. Z:en:, Z:fr: or something like that).</p>
<p>The special handling of a trailing <tt class="docutils literal"><span class="pre">.</span></tt> in the QueryParser (which would often
mistakenly trigger for pasted text) has been removed.  This feature was there to
support Omega's topterms adding stemmed forms, but Omega no longer needs to do this
as it can suggest unstemmed forms instead.</p>
</div>
<div class="section" id="word-characters">
<h1><a class="toc-backref" href="#id3">Word Characters</a></h1>
<p>By default, Unicode characters of category CONNECTOR_PUNCTUATION (<tt class="docutils literal"><span class="pre">_</span></tt> and a
handful of others) are now word characters, which provides better indexing of
identifiers, without much degradation of other cases.  Previously cases like
<tt class="docutils literal"><span class="pre">time_t</span></tt> required a phrase search.</p>
<p>Trailing <tt class="docutils literal"><span class="pre">+</span></tt> and <tt class="docutils literal"><span class="pre">#</span></tt> are still included on terms (up to 3 characters at most), but
<tt class="docutils literal"><span class="pre">-</span></tt> no longer is by default.  The examples it benefits aren't compelling
(<tt class="docutils literal"><span class="pre">nethack--</span></tt>, <tt class="docutils literal"><span class="pre">Cl-</span></tt>) and it tends to glue hyphens on to terms.</p>
<p>A single embedded <tt class="docutils literal"><span class="pre">'</span></tt> (apostrophe) is now included in a term.
Previously this caused a slow phrase search, and added junk terms to the index
(<tt class="docutils literal"><span class="pre">didn't</span></tt> -&gt; <tt class="docutils literal"><span class="pre">didn</span></tt> and <tt class="docutils literal"><span class="pre">t</span></tt>, etc).  Various Unicode characters used for apostrophes
are all mapped to the ASCII representation.</p>
<p>A few other characters (taken from the Unicode definition of a word) are included
in terms if they occur between two word characters, and <tt class="docutils literal"><span class="pre">.</span></tt>, <tt class="docutils literal"><span class="pre">,</span></tt> and a
few others are included in terms if they occur between two decimal digit characters.</p>
</div>
</div>
</body>
</html>