<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE>Xapian: Matcher Design Notes</TITLE>
</HEAD>
<BODY BGCOLOR="white" TEXT="black">

<H1>Matcher Design Notes</H1>

<P>This document is incomplete at present.  It lacks an explanation of the min-heap
used to keep the best N M-set items (the book Managing Gigabytes describes this
technique well).

<h2>The PostList Tree</h2>

<P>The matcher builds a tree structure from the query.  This tree is a binary
tree, and each node is a PostList sub-class object.  This is more efficient than
an n-ary tree in terms of the number of comparisons which need to be performed:
&lt;insert proof&gt; (but this proof may only be valid for equal-sized posting
lists without optimisations, in which case there may be a more efficient way to
do this - investigate!)

<P>To build the tree, a PostList object is
created for each term, and pairs of PostLists are combined using
2-way branching tree elements for AND, OR, etc - these are virtual
PostLists whose class names reflect the operation (AndPostList,
OrPostList, etc).  See below for a full list.

<P>The tree is deliberately built in an uneven way, so as to minimise the
likely number of times a posting has to be passed up a level.  For a group of
OR operations, the PostLists with fewest entries are furthest down the group's
subtree, minimising the amount of information needing to be passed up the tree.
For a group of AND operations the PostLists with most entries are furthest down,
so we look at the least frequent terms first, and skip the posting lists of
the others.  This will generally minimise the number of posting list entries
we read and maximise the size of each skip_to.  The OR tree is built up
in a similar way to how an optimal Huffman code is constructed.  This is
provably optimal, but with the assumption that the tree structure is
immutable once created.  If term distribution is uneven, rebalancing this
tree during the match might be more efficient.
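
<P>The Huffman-style construction of an OR group can be sketched as follows.
This is a minimal illustration in Python, not the matcher's actual code: the
function name build_or_tree, the term_freqs input and the nested-tuple tree
representation are all assumptions made for the example.  It repeatedly
combines the two subtrees with the fewest estimated postings, so the rarest
terms end up furthest down.

```python
import heapq
from itertools import count


def build_or_tree(term_freqs):
    """Combine terms into a nested ("OR", left, right) tuple, Huffman-style.

    term_freqs maps term -> estimated number of postings (hypothetical input).
    Assumes at least one term is given.
    """
    tiebreak = count()  # unique counter so heap entries never compare subtrees
    heap = [(freq, next(tiebreak), term) for term, freq in term_freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Take the two smallest subtrees; they end up deepest in the tree.
        size_a, _, a = heapq.heappop(heap)
        size_b, _, b = heapq.heappop(heap)
        # An OR matches at most the sum of its branches' postings.
        # The smaller subtree (a) is placed on the right.
        heapq.heappush(heap, (size_a + size_b, next(tiebreak), ("OR", b, a)))
    return heap[0][2]


tree = build_or_tree({"the": 1000, "quick": 50, "fox": 10})
```

<P>Here the two rarest terms are paired first and the most frequent term joins
at the root.  Putting the smaller subtree on the right is a choice made for
this sketch only.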

<h2>Running the match</h2>

<P>
Once the tree is built, the matcher repeatedly asks
the root of the tree for the next matching document and compares it to
those in the proto-mset it maintains.  If the next matching document scores
more highly than the lowest scoring document in the proto-mset (either by
weight, or in sort order if sorting is used) then the matcher
adds it and discards that lowest scoring document.
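
<P>One way to keep the best N items is a min-heap whose root is the weakest item
currently kept.  The sketch below assumes candidates arrive as (weight, docid)
pairs and that ties are broken by docid.  The function name best_n and both of
those conventions are illustrative assumptions, not taken from the matcher.

```python
import heapq


def best_n(candidates, n):
    """Return the n highest-weighted (weight, docid) pairs, best first."""
    if n <= 0:
        return []
    heap = []  # min-heap: heap[0] is the weakest item currently kept
    for weight, docid in candidates:
        if len(heap) < n:
            heapq.heappush(heap, (weight, docid))
        elif (weight, docid) > heap[0]:
            # The new document beats the weakest kept one: swap them.
            heapq.heapreplace(heap, (weight, docid))
    return sorted(heap, reverse=True)


top = best_n([(0.5, 1), (2.0, 2), (1.0, 3), (3.0, 4)], 2)
```

<P>Once the heap is full, heap[0] is also the lowest weight a new document must
beat to be kept.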

<P>
When one branch in a sub-tree of AND operations runs out, it signals "end of list",
and each AND above it passes this signal on.

<P>
When an OR gets end of list, it autoprunes, replacing itself with the
branch that still has postings - see below for full details.  If the matcher
itself gets "end of list", the match is complete.

<P>
The other operations also handle end of list in one of these two ways
(for asymmetric operations, which one happens may depend on which branch has run out).

<P>
The matcher also passes the lowest weight currently needed to make the proto-mset
into the tree, and each node may adjust this weight and pass it on to its
subtrees.  Each PostList can report the maximum weight it could contribute - so if
the left branch of an AND will never return a weight of more than 2,
then if the whole AND needs to return at least 6, the right branch is
told it needs to return at least 4.
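
<P>The arithmetic in the AND example can be written down directly.  This sketch
assumes the figure reported for the left branch is an upper bound on what it
can contribute, since that is the bound under which raising the right branch's
target cannot discard a qualifying document.  The function name is
hypothetical.

```python
def min_weight_for_right(min_weight_needed, left_max_weight):
    """Weight the right branch of an AND must reach on its own.

    left_max_weight is assumed to be an upper bound on the left branch's
    contribution; the result is never negative.
    """
    return max(0, min_weight_needed - left_max_weight)


needed = min_weight_for_right(6, 2)
```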

<P>
For example, an OR knows that if its left branch can
contribute at most a weight of 4 and its right branch at most 7, then if
the minimum weight is 8, only documents matching both branches are now
of interest so it mutates into an AND.  If the minimum weight is 6 it
changes into an AND_MAYBE with the right branch required, since the left
branch alone can no longer reach 6 (A AND_MAYBE B matches documents which
match A, but B contributes to the weight - in most search engines' query
syntax, that's expressed as `+A B').  See the "Operator Decay" section
below for full details of these mutations.  If the minimum weight needed is
12, no document is good enough, and the OR returns "end of list".
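
<P>The decisions in this example can be expressed as a small function.  This is
only a sketch of the reasoning: the function name decay_or and the string
results are invented for illustration, and the real matcher replaces PostList
objects instead of returning labels.

```python
def decay_or(left_max, right_max, min_weight):
    """Say what "A OR B" can be replaced with, given weight upper bounds.

    left_max and right_max are the most A and B can each contribute.
    """
    if left_max + right_max < min_weight:
        return "END_OF_LIST"    # even matching both branches is not enough
    left_alone_ok = left_max >= min_weight
    right_alone_ok = right_max >= min_weight
    if left_alone_ok and right_alone_ok:
        return "A OR B"         # either branch alone can still qualify
    if left_alone_ok:
        return "A AND_MAYBE B"  # A is required, B only adds weight
    if right_alone_ok:
        return "B AND_MAYBE A"  # B is required, A only adds weight
    return "A AND B"            # only documents matching both can qualify


results = [decay_or(4, 7, w) for w in (3, 6, 8, 12)]
```

<P>With the figures from the text (4 for the left branch, 7 for the right), a
minimum weight of 6 gives the AND_MAYBE form with B required, 8 gives the AND,
and 12 gives end of list.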

<h2>Phrase and near matching</h2>

<P>
The way phrase and near matching works is to perform an AND query for all the
terms, with a filter node in front which only returns documents whose
positional information fulfils the phrase requirements.
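
<P>A minimal sketch of such a filter for exact phrases is shown below.  It
assumes positional information is available as one list of word positions per
phrase term for each document.  That data layout and the names matches_phrase
and phrase_filter are assumptions made for the example.

```python
def matches_phrase(positions_per_term):
    """True if the terms occur at consecutive positions, in the given order.

    positions_per_term holds one list of word positions per phrase term.
    """
    if not positions_per_term:
        return False
    later_terms = [set(p) for p in positions_per_term[1:]]
    return any(
        all(start + offset in positions
            for offset, positions in enumerate(later_terms, start=1))
        for start in positions_per_term[0]
    )


def phrase_filter(and_matches, positions):
    """Keep only the docids from the AND whose positions satisfy the phrase.

    positions maps docid -> list of position lists (hypothetical layout).
    """
    return [d for d in and_matches if matches_phrase(positions[d])]


positions = {1: [[3, 10], [4]], 2: [[3], [7]]}
kept = phrase_filter([1, 2], positions)
```

<P>Both documents contain both terms, so both pass the AND, but only document 1
has them adjacent.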

<P>Unfortunately this creates a
bad case where a lot of documents contain the words of the phrase
but few match the actual phrase - the filter does a lot of work, but
the matcher can't stop early.  We should look at hoisting the filtering
part higher up the tree, but note that this may not always be a win.
Some heuristics are probably required.

<h2>Virtual PostList types</h2>

<P>There are several types of virtual PostList.  Each type can be treated as
boolean or probabilistic - the only difference is whether the weights are
ignored or not.  The types are:

<ul>
<li> OrPostList: returns documents which match either branch

<li> AndPostList: returns documents which match both branches

<li> XorPostList: returns documents which match one branch or the other but not both

<li> AndNotPostList: returns documents which match the left branch, but not the
 right (the weights of documents from the right branch are ignored).

<li> AndMaybePostList: returns documents which match the left branch - weights from
 documents also in the right branch are added in for the probabilistic case
 ("X ANDMAYBE Y" can be expressed as "+X Y" in Altavista).

<li> FilterPostList: applies the right branch as a boolean filter to the left
 branch (which is typically a probabilistic query).  Note: this is the same as
 AndPostList with the right branch weights ignored.
</ul>
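
<P>The semantics of the types listed above can be illustrated on small
in-memory data.  The sketch below treats each branch as a mapping from
document id to weight and computes the whole result at once, whereas the real
PostLists stream through postings in document id order.  The function name
combine and the operator labels are assumptions made for the example.

```python
def combine(op, left, right):
    """Combine two {docid: weight} dicts as a virtual PostList would."""
    if op == "OR":
        docids = left.keys() | right.keys()
    elif op in ("AND", "FILTER"):
        docids = left.keys() & right.keys()
    elif op == "XOR":
        docids = left.keys() ^ right.keys()
    elif op == "AND_NOT":
        docids = left.keys() - right.keys()
    elif op == "AND_MAYBE":
        docids = set(left)
    else:
        raise ValueError(f"unknown operator: {op}")
    # AND_NOT and FILTER ignore weights from the right branch.
    use_right_weight = op not in ("AND_NOT", "FILTER")
    return {
        docid: left.get(docid, 0.0)
        + (right.get(docid, 0.0) if use_right_weight else 0.0)
        for docid in docids
    }


left = {1: 1.0, 2: 2.0}
right = {2: 0.5, 3: 4.0}
```

<P>For the boolean case the weights in the result are simply ignored.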

<p>[Note: You can use AndNotPostList to apply an inverted boolean filter to a
probabilistic query]

<p>All the symmetric operators (i.e. OR, AND, XOR) are coded for maximum
efficiency when the right branch has fewer postings than the left branch.

<p>There are 2 main optimisations which the matcher performs: autoprune and
operator decay.

<h2>Autoprune</h2>

<P>Autoprune replaces a node with one of its branches once the other branch has
run out of postings.  For example, if a branch in the match tree is "A OR B",
when A runs out then
"A OR B" is replaced by "B".  Similar reductions occur for XOR, ANDNOT, and
ANDMAYBE (if the right branch runs out).  Other operators (AND, FILTER, and
ANDMAYBE when the left branch runs out) simply return "at_end" and this is
dealt with somewhere further up the tree as appropriate.

<P>An autoprune is indicated by the next or skip_to method returning a pointer
to the PostList object which should replace the postlist being read.
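
<P>The replacement mechanism can be sketched with two toy classes.  The method
names next, at_end and docid here only mimic the idea described above, using
None in place of a null pointer.  The classes, the advance helper and the run
driver are illustrative assumptions and not the actual Xapian interface.

```python
class TermPostList:
    """Leaf posting list over a sorted list of docids (toy stand-in)."""

    def __init__(self, docids):
        self.docids = list(docids)
        self.pos = -1

    def next(self):
        self.pos += 1
        return None  # a leaf never asks to be replaced

    def at_end(self):
        return self.pos >= len(self.docids)

    def docid(self):
        return self.docids[self.pos]


def advance(postlist):
    """Call next() and follow any replacement the node hands back."""
    replacement = postlist.next()
    return postlist if replacement is None else replacement


class OrPostList:
    def __init__(self, left, right):
        self.left, self.right = left, right
        self.started = False

    def next(self):
        """Advance; return the surviving branch if this node autoprunes."""
        if not self.started:
            self.started = True
            move_left = move_right = True
        else:
            l, r = self.left.docid(), self.right.docid()
            move_left, move_right = l <= r, r <= l
        if move_left:
            self.left = advance(self.left)
        if move_right:
            self.right = advance(self.right)
        if self.left.at_end():
            return self.right  # autoprune: only the right branch remains
        if self.right.at_end():
            return self.left   # autoprune: only the left branch remains
        return None

    def at_end(self):
        return False  # the node is replaced as soon as a branch runs out

    def docid(self):
        return min(self.left.docid(), self.right.docid())


def run(root):
    """Collect every docid the tree yields, following replacements."""
    docids = []
    root = advance(root)
    while not root.at_end():
        docids.append(root.docid())
        root = advance(root)
    return docids


merged = run(OrPostList(TermPostList([1, 3]), TermPostList([2, 3, 5, 8])))
```

<P>In this run the OR is replaced by its right branch once the left branch is
exhausted after document 3, so documents 5 and 8 are read directly from the
leaf.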

<h2>Operator decay</h2>

<P>The matcher tracks the minimum weight needed for a document to make it into
the m-set (this increases monotonically as the m-set forms).  This can be
used to replace one boolean operator with a stricter one.  E.g. consider A OR
B - when maxweight(A) &lt; minweight and maxweight(B) &lt; minweight then only
documents matching both A and B can make it into the m-set so we can replace
the OR with an AND.  Operator decay is flagged using the same mechanism as
autoprune, by returning the replacement operator from next or skip_to.

<p>Possible decays:

<ul>
<li> OR -&gt; AND
<li> OR -&gt; ANDMAYBE
<li> ANDMAYBE -&gt; AND
<li> XOR -&gt; ANDNOT
</ul>

<p>A related optimisation is that the Match object may terminate early if
maxweight for the whole tree is less than the smallest weight in the mset.

<!-- FOOTER $Author: olly $ $Date: 2007-04-09 15:09:11 +0100 (Mon, 09 Apr 2007) $ $Id: matcherdesign.html 8160 2007-04-09 14:09:11Z olly $ -->
</BODY>
</HTML>