<HTML>
<HEAD>
<TITLE>The Quartz Database Backend</TITLE>
</HEAD>
<BODY BGCOLOR="white" TEXT="black">

<h1><center>The Quartz database backend</center></h1>

<p>
Note: use of Quartz is now deprecated in favour of the newer Flint disk-based
format.  We will remove the quartz backend entirely in Xapian 1.1.0.
However, much of this document is also relevant to Flint.
</p>

<h2>Why Quartz?</h2>

<em>
What is this thing called Quartz?  How does it fit in with the Xapian
library?</em>
<p>
Xapian can access information stored in several different formats.
Generally these are disk-based, but there's also the InMemory format, which is
stored entirely in memory.

<p>
Each of these formats is implemented by a set of classes providing an interface
to a Database object and several other related objects (PostList, TermList,
etc.).

<p>
Quartz is simply the name of Xapian's first high-performance backend.
The design of Quartz draws on all our past experience to satisfy the following
criteria:
<ul>
<li>
  Fast and scalable searches.
</li>
<li>
  May be updated (i.e., the database doesn't have to be built from scratch in
  order to make a single change).
</li>
<li>
  May be modified whilst searches are in progress.
</li>
<li>
  Provides atomic updates in the face of interruption at any point.
</li>
<li>
  Provides a single-writer, multiple-reader environment.
</li>
</ul>

<p>
Different backends can be optionally compiled into the Xapian library
(by specifying appropriate options to the configure script).  Quartz
is compiled by default.

<p>
<em>
Why do we call it Quartz - where does the name come from?
</em>
<p>Well, we had to call it something, and Quartz was
simply the first name we came up with which we thought we could live with...

<h2>Tables</h2>

A Quartz database consists of several tables, each of which stores a
different type of information: for example, one table stores the user-defined
data associated with each document, and another table stores the posting
lists (the lists of documents which particular terms occur in).
<p>
These tables consist of a set of key-tag pairs, which I shall often
refer to as <em>items</em> or <em>entries</em>.  Items may
be accessed randomly by specifying a key and reading the item
pointed to, or in sorted order by creating a cursor pointing to a
particular item.  The sort order is a lexicographical ordering based
on the contents of the keys.  Only one instance of a key may exist in
a single table - inserting a second item with the same key as an existing
item will overwrite the existing item.
<p>
Cursors may be positioned even when a full key isn't known, by
attempting to access an item which doesn't exist: the cursor will then
be set to point to the nearest item with a key before the one requested.
<p>
The Btree class defines the standard interface to a table.  This has
a subclass for each table - QuartzRecordTable, QuartzValueTable,
QuartzPostListTable, QuartzPositionListTable, and QuartzTermListTable.
Apart from QuartzPostListTable, these are fairly thin wrappers.
QuartzPostListTable buffers up the inverted changes internally to
allow fast updating.
<p>
Changes are made to the Btree by calling add() and del(), but they will not be
seen by readers until commit() is called.  Alternatively, calling cancel() will
abandon changes.  This allows atomic transactions to be implemented.
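<p>
For illustration, here is a minimal sketch of a single-table update using these
methods.  The exact constructor and commit() arguments here are assumptions
based on fragments elsewhere in this document, not checked against the real
headers:

<pre>
    // Hedged sketch: open a table for writing, stage some changes, and
    // either commit them (making them visible to readers) or abandon them.
    Btree table("database/", false);   // false = writable (assumed); path is a placeholder
    table.open();
    uint4 new_revision = table.get_open_revision_number() + 1;

    table.add("key1", "tag1");    // stage an insertion (replaces any existing "key1")
    table.del("key2");            // stage a deletion

    table.commit(new_revision);   // atomically publish the staged changes ...
    // ... or table.cancel();     // ... or abandon them instead
</pre>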
<p>
The Btree class is optimised to be fast when changes are applied in sorted
order.  For most tables, this means indexing documents in docid order.
QuartzPostListTable takes care of this as part of the inversion process.

<h2>The contents of the tables</h2>

We shall worry about the implementation of the tables later, but first
we shall look at what is stored within each table.
<p>
There are five tables comprising a quartz database.
<ul>
<li>
<B>Record</B>.
This stores the arbitrary chunk of data associated with each document.
<p>
Key: lsb ... msb of the docid, until all remaining bytes are zero
<p>
The record table also holds a couple of special values, stored under the key
consisting of a single zero byte (this isn't a valid encoded docid).  The first
value is the next
document ID to use when adding a document (document IDs are allocated in
increasing order, starting at 1, and are currently never reused).  The other
value is the total length of the documents in the database, which
is used to calculate the average document length, needed in turn to
calculate normalised document lengths.
<p>
</li><li>
<B>Value</B>.
This stores a set of values for each document.
<p>
Key: lsb ... msb of the docid, until all remaining bytes are zero
<p>
Currently, there is one B-tree entry for each document in the database
that has one or more values associated with it.  This entry
consists of a list of value_no-s and values for that document.
<p>
An
alternative implementation would be to store an item for each value, whose
key is a combination of the document ID and the value number, and whose tag is
the value.
Which implementation is better depends on the access pattern: if a document
is being passed across a network link, all the values for a document
are read, whereas if a document is being dealt with locally, usually only some
of the values will be read.
<p>
Documents will usually have very few values, so the current
implementation may actually be the most suitable.
<p>
</li>
<li>
<b>TermList</b>.
This stores the list of terms which appear in a document.
<p>
Key: lsb ... msb of the docid, until all remaining bytes are zero
<p>
The list first stores the document length, and the number of entries in the
termlist (this latter value is stored for quick access - it could also be
determined by running through the termlist).  It then stores a set of
entries: each entry in the list consists of a term (as a string), and the
wdf (within document frequency - how many times the term appears in the
document) of that term.
<p>
In a non-modifiable database, the term frequency could be stored in the
termlist for each entry in each list.  This would enable query expansion
operations to run significantly faster by avoiding the need for a large
number of extra lookups - however, this cannot be implemented in a
writable database without each modification touching a very large
proportion of the database.
<p>
</li><li>
<b>PositionList</b>.
For each (term,&nbsp;document) pair, this stores the list of positions in the
document at which the term occurs.
<p>
Key: pack_uint(did) + tname
<p>
</li><li>
<b>PostList</b>.
This stores the list of documents in which each term appears.  Each entry
in the table is a chunk of entries in the posting list for the term.
<p>
Key: pack_string_preserving_sort(tname) [first chunk]
<br>
Key: pack_string_preserving_sort(tname) + pack_uint_preserving_sort(first_did_in_chunk) [second and subsequent chunks]
<p>
</li>
</ul>
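<p>
As a rough illustration of the docid key encoding used by the Record, Value and
TermList tables above ("lsb ... msb of the docid, until all remaining bytes are
zero"), here is a hedged sketch - this is not the actual Quartz packing code:

<pre>
// Encode a document ID as a table key: emit bytes from least significant
// to most significant, dropping the trailing zero bytes.  A valid docid
// (&gt;= 1) never encodes to a single zero byte, which is why that key is
// free for the Record table's special entry.
static string docid_to_key(unsigned int did)
{
    string key;
    while (did) {
        key += static_cast&lt;char&gt;(did &amp; 0xff);
        did &gt;&gt;= 8;
    }
    return key;
}
</pre>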

<h2>Representation of integers, strings, etc</h2>

It is well known that in modern computers there are many, many CPU cycles
for each disk read, or even memory read.  It is therefore important to
minimise disk reads, and can be advantageous to do so even at the expense of
a large amount of computation.  In other words, <em>Compression is
good</em>.
<p>
The current implementation uses the following simple compression schemes -
we're investigating more effective ones (FIXME: this is slightly out of date
now):
<ul>
<li>
In posting lists, successive document IDs are stored as differences, each
compressed using a byte-wise variable-length encoding (so a value is stored
in 7, 14, 21, 28, ... bits; see the sketch after this list):
<ol>
<li>
First byte: if integer is &lt; 128, store integer, otherwise store
integer modulo 128, but with top bit set.
</li><li>
Shift integer right 7 places.
</li><li>
Second byte: if integer is &lt; 128, store integer, otherwise store
integer modulo 128, but with top bit set.
</li><li>
Shift integer right 7 places.
</li><li>
etc...
</li>
</ol>
<p>
</li><li>
In position lists, successive positions are encoded similarly.
<p>
</li><li>
In termlists, terms are stored as string values in sorted order.  The term
names are compressed by storing differences between consecutive terms.
<p>
</li>
</ul>
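<p>
As a sketch of the byte-wise encoding in the first item above (illustrative
only - the real pack/unpack routines in the Quartz code differ in detail):

<pre>
// Store an unsigned integer 7 bits at a time, least significant first.
// Every byte except the last has its top bit set as a continuation flag.
static void encode_uint(string &amp; out, unsigned int n)
{
    while (n &gt;= 128) {
        out += static_cast&lt;char&gt;((n % 128) | 0x80);  // low 7 bits, top bit set
        n &gt;&gt;= 7;
    }
    out += static_cast&lt;char&gt;(n);  // final byte: top bit clear
}

// Reverse the process, advancing offset i past the encoded value.
static unsigned int decode_uint(const byte * p, int &amp; i)
{
    unsigned int n = 0;
    int shift = 0;
    while (p[i] &amp; 0x80) {
        n |= (unsigned int)(p[i++] &amp; 0x7f) &lt;&lt; shift;
        shift += 7;
    }
    n |= (unsigned int)p[i++] &lt;&lt; shift;
    return n;
}
</pre>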

<h2>PostLists and chunks</h2>

Posting lists can grow to be very large - some terms occur in a very large
proportion of the documents, and their posting lists can represent a
significant fraction of the size of the whole database.  Therefore, we do
not wish to read an entire posting list into memory at once.  (Indeed, we'd
rather only read a small portion of it at all, but that's a different
story - see the documentation on <A HREF="matcherdesign.html">optimisations
performed by the matcher</A>).
<p>
To deal with this, we store posting lists in small chunks, each the right
size to be stored in a single B-tree block, and hence to be accessed with a
minimal amount of disk latency.
<p>
The key for the first chunk in a posting list is the term name of the term
whose posting list it is.  The key in subsequent chunks is the term name
followed by the document ID of the first document in the chunk.  This
allows the cursor methods to be used to scan through the chunks in order,
and also to jump to the chunk containing a particular document ID.
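<p>
For example, to locate the posting list chunk which could contain document ID
<em>did</em> for term <em>tname</em>, something along these lines can be done.
This is a hedged sketch built from the key layout and cursor behaviour
described above, not the actual Quartz code:

<pre>
// Build the key a chunk starting exactly at `did' would have, then let the
// cursor fall back to the nearest key at or before it - i.e. the chunk
// whose first document ID is &lt;= did.
string key = pack_string_preserving_sort(tname)
           + pack_uint_preserving_sort(did);
Bcursor cursor(&amp;postlist_table);   // postlist_table: the PostList Btree
cursor.find_key(key);              // positions on the candidate chunk
string tag;
cursor.get_tag(&amp;tag);              // tag now holds that chunk's postings
</pre>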

<p>
It is quite possible that the termlists and
position lists would benefit from being split into chunks in this way.

<h2>All document lists</h2>

It is possible to use the Xapian API to obtain a list of all documents in the
database.  This is done by creating a special postinglist.  This functionality
was added after the file structure in use by Quartz was frozen, and it is
unfortunately impossible to implement efficiently for Quartz.

<p>
The problem is that it is not possible to read the list of documents in sorted
order direct from disk - instead, the list is read into memory to be sorted.
For databases which do not have sparse document IDs, this should not use much
memory since the list is kept in memory in a range-compressed form (but does
require an iteration over the entirety of one of the tables of the Quartz
database - no skipping can be done in this case.)  This is unlikely to be
fixed, since we don't believe it can be without changing Quartz's structure.
In any case, it is not a priority since Quartz is no longer the default
backend.

<h2>Btree implementation</h2>

The tables are currently all implemented as B-trees (actually a form of
B-tree sometimes known as a B+ tree).
(For some tables, the use of a different structure could be more appropriate -
perhaps a hashing scheme might provide more space and time efficient access.
This is an area for future investigation).
<p>
A B-tree is a fairly standard structure for storing this kind of data, so I
will not describe it in detail - see a reference book on database design and
algorithms for that.  The essential points are that it is a block-based
multiply branching tree structure, storing keys in the internal blocks and
key-tag pairs in the leaf blocks.
<p>
Our implementation is fairly standard, except for its revision scheme,
which allows modifications to be applied atomically whilst other processes
are reading the database.  This scheme involves copying each block in the
tree which is involved in a modification, rather than modifying it in
place, so that a complete new tree structure is built up whilst the old
structure is unmodified (although this new structure will typically share a
large number of blocks with the old structure).  The modifications can then
be atomically applied by writing the new root block and making it active.
<p>
After a modification is applied successfully, the old version of the
table is still fully intact, and can be accessed.  The old version only
becomes invalid when a second modification is attempted (and it becomes
invalid whether or not that second modification succeeds).
<p>
There is no need for a process which is writing the database to know
whether any processes are reading previous versions of the database.  As long
as only one update is performed before the reader closes (or reopens) the
database, no problem will occur.  If more than one update occurs whilst
the table is still open, the reader will notice that the database has been
changed whilst it has been reading it by comparing a revision number stored
at the start of each block with the revision number it was expecting.  An
appropriate action can then be taken (for example, to reopen the database
and repeat the operation).
<p>
An alternative approach would be to obtain a read-lock on the revision
being accessed.  A write would then have to wait until no read-locks
existed on the old revision before modifying the database.

<h2>Applying changes to all the tables simultaneously</h2>

To recap, we have tables storing key/tag pairs, we can update these,
and we can then call a
method and have all the modifications applied to the table atomically.
Unfortunately, we need more than that - we need to be able to apply
modifications as a single atomic transaction across multiple tables, so
that the tables are always accessed in a mutually consistent state.
<p>
The revisioning scheme described earlier comes to the rescue!  By carefully
making sure that we open all the tables at the same revision, and by ensuring
that at least one such consistent revision always exists, we can extend the
scope of atomicity to cover all the tables.  In detail:

<ul>
<li>
When opening a database, we open each table in a specified order: let's call
the tables <em>A</em>, <em>B</em>, <em>C</em>, <em>D</em>, <em>E</em>,
and say we open them in alphabetical order.
</li><li>
When opening a database, after opening the first table, <em>A</em>, at the
newest available revision, we read its revision number.  We then open all
the other tables at the same revision number.
</li><li>
When we commit changes to a database, we commit the tables in reverse order
- <em>E</em> first, and <em>A</em> last.
</li>
</ul>

This scheme guarantees that modifications are atomic across all the tables
- essentially we have made the modification get committed only when the
final table is committed.
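
<p>
Here is a minimal sketch of this ordering, using a hypothetical Table interface
rather than the real Btree API (the method names below are purely
illustrative):

<pre>
// Hypothetical interface, just to make the ordering concrete.
struct Table {
    uint4 latest_revision();        // newest revision on disk
    void  open_at(uint4 revision);  // open a specific revision
    void  commit(uint4 revision);   // atomically publish staged changes
};

// Opening: read table A's newest revision, then open B..E at that revision.
Table A, B, C, D, E;
uint4 rev = A.latest_revision();
A.open_at(rev);
B.open_at(rev); C.open_at(rev); D.open_at(rev); E.open_at(rev);

// ... make changes to all the tables ...

// Committing: reverse order, all with the same new revision number, so the
// change only "takes" once A - the table readers open first - commits.
uint4 new_rev = rev + 1;
E.commit(new_rev); D.commit(new_rev); C.commit(new_rev);
B.commit(new_rev); A.commit(new_rev);
</pre>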

<h2>Items to be added to this document</h2>

<ul>
<li>
Describe that postlists must be stored in sorted document ID order (needed for
boolean queries), so they cannot be stored in reverse wdf order for efficiency.
A possible workaround is to store the postlists in two or more chunks, ordered
by wdf, and to access them in that order.
</li>
<li>
A better explanation of why there will always be a consistent set of table
versions using the scheme described above.
</li>
<li>
Mention that future versions will allow the database creator to decide
whether to store certain levels of detail in the database - eg, whether to
store document lengths, term frequencies in the termlists, document lengths
in the posting lists, etc.
</li>
<li>
Add comment about the inversion process - we are essentially doing a text
based partitioning scheme.
</li>
<li>
Mention adding documents: usual to add with a new document ID which is
greater than any currently in the system.  This means that new postings
get added to the end of posting lists (so we make it easy to get to the
end of a posting list quickly).  Also, we key position lists by document
ID first and termname second, so that new positionlists get added to the
end of the positionlist database, meaning that hardly any blocks will need
to be altered in this database: data just gets added to the end.
</li>
</ul>

<h2>Endnote</h2>

The system as described could, no doubt, be improved in several ways.
If you can think of such ways then suggest them to us,
so we can discuss the improvement and see whether it would
help: if it would, we will add it to the design (and eventually the
code) - if not, we'll add a discussion about it to this document.

<h1><center>The Btree Implementation</center></h1>

I'm not sure about the name 'Btree' that runs through all this, since the fact
that it is all implemented as a B-tree is surely irrelevant. I have not been
able to think of a better name though ...

<P>
Some of the constants mentioned below depend upon a byte being 8 bits, but this
assumption is not built into the code.

<H2>Keys and tags</H2>

Thinking of 'byte' as having type 'unsigned char', a key and a tag are both
sequences of bytes. The B-tree is a repository for key-tag pairs. A key can be
looked up to find its corresponding tag. If a key is deleted, the corresponding
tag is deleted. Keys in the B-tree are unique, so if a key-tag pair is added
and the key is already in the B-tree, the new tag replaces the existing one.

<P>In the B-tree key-tag pairs are ordered, and the order is the ASCII collating
order of the keys. Very precisely, if key1 and key2 point to keys with lengths
key1_len, key2_len, key1 is before/equal/after key2 according as the following
procedure returns a value less than, equal to or greater than 0,

<pre>
static int compare_keys(const byte * key1, int key1_len,
			const byte * key2, int key2_len)
{
    int smaller = key1_len &lt; key2_len ? key1_len : key2_len;
    for (int i = 0; i &lt; smaller; i++) {
        int diff = (int) key1[i] - key2[i];
        if (diff != 0) return diff;
    }
    return key1_len - key2_len;
}
</pre>

<P>[This is okay, but none of the code fragments below have been checked.]

<P>
Any large-scale operation on the B-tree will run very much faster when the keys
have been sorted into ASCII collating order. This fact is critical to the
performance of the B-tree software.

<P>
A key-tag pair is called an 'item'. The B-tree consists therefore of a list of
items, ordered by their keys:

<pre>
    I<sub>1</sub>  I<sub>2</sub>  ...  I<sub>j-1</sub>  I<sub>j</sub>  I<sub>j+1</sub>  ...  I<sub>n-1</sub>  I<sub>n</sub>
</pre>

<P>
Item I<sub>j</sub> has a 'previous' item, I<sub>j-1</sub>, and a 'next' item, I<sub>j+1</sub>.

<P>
When the B-tree is created, a single item is added in with null key and null
tag. This is the 'null item'. The null item may be searched for, and it's
possible, although perhaps not useful, to replace the tag part of the null
item. But the null item cannot be deleted, and an attempt to do so is merely
ignored.

<P>
A key must not exceed 252 bytes in length.

<P>
A tag may have length zero. There is an upper limit on the length of a tag, but
it is quite high. Roughly, the tag is divided into items of size L - kl, where
L is a few bytes less than a quarter of the block size, and kl is the length of
its key. You can then have 64K such items. So even with a block size as low as
2K and a key length as large as 100, you could have a tag of about 25 megabytes
(roughly (2048/4 - 100) * 65536 bytes). More realistically, with a 16K block
size, the upper limit on the tag size is about 256 megabytes.

<H2>Revision numbers</H2>

<P>The B-tree has a revision number, and each time it is updated, the revision
number increases. In a single transaction on the B-tree, it is first opened,
its revision number, R, is found, updates are made, and then the B-tree is
closed with a supplied revision number. The supplied revision number will
typically be R+1, but any R+k is possible, where k &gt; 0.

<P>
If this sequence fails to complete for some reason, revision R+k of the B-tree
will not, of course, be brought into existence. But revision R will still
exist, and it is that version of the B-tree that will be the starting point for
later revisions.

<P>
If this sequence runs to a successful termination, the new revision, R+k,
supplants the old revision, R. But it is still possible to open the B-tree at
revision R. After a successful revision of the B-tree, in fact, it will have
two valid versions: the current one, revision R+k, and the old one, revision R.

<P>
You might want to go back to the old revision of a B-tree if it is being
updated in tandem with a second B-tree, and the update on the second B-tree
fails. Suppose B1 and B2 are two such B-trees. B1 is opened and its latest
revision number is found to be R1. B2 is opened and its latest revision number
is found to be R2. If R1 &gt; R2, it must be the case that the previous
transaction on B1 succeeded and the previous transaction on B2 failed. Then B1
needs to be opened at its previous revision number, which must be R2.

<P>
The calls using revision numbers described below are intended to handle this
type of contingency.

<H2>The files</H2>

The B-tree has three associated files. DB contains the data proper of the
B-tree. The revision numbers, other administrative information, and a bitmap
are held in two files, baseA and baseB.

<P>
When the B-tree is opened without any particular
revision number being specified, the later of baseA and baseB is chosen as the
opening base, and as soon as a write to the file DB occurs, the earlier of
baseA or baseB is deleted. On closure, the new revision number is written to
baseB if baseA was the opening base, and to baseA if baseB was the opening
base. If the B-tree update fails for some reason, only one base will usually
survive.

<P>The bitmap stored in each base file will have bit n set if block n is in use
in the corresponding revision of the B-tree.

<H2>The API</H2>

See the <a href="http://www.xapian.org/docs/sourcedoc/html/classBtree.html">doxygen generated documentation</a> for a description of the API of the Btree
class and related classes.

<h2>Checking the B-tree</h2>

The following static method is provided in btreecheck.h:

<pre>
void BtreeCheck::check(const string &amp; name, int opts);
</pre>
<!-- FIXME there's an optional ostream & argument (default output is to cout) -->

<blockquote>
    BtreeCheck::check(s, opts) is essentially equivalent to

<pre>
        Btree B(s, false);
        B.open();
        {
            // do a complete integrity check of the B-tree,
            // reporting according to the bitmask opts
        }
</pre>

    The option bitmask may consist of any of the following values |-ed together:
<ul>
  <li> OPT_SHORT_TREE - short summary of entire B-tree
  <li> OPT_FULL_TREE - full summary of entire B-tree
  <li> OPT_SHOW_BITMAP - print the bitmap
  <li> OPT_SHOW_STATS - print the basic information (revision number, blocksize etc.)
</ul>

    The options control what is reported - the entire B-tree is always checked
    as well as reporting the information.
    <!-- keep this for quartzcheck docs:
    if non-null, causes information to go to stdout. The
    following characters may appear in the option string:

<pre>
        t   - short summary of entire B-tree
        f   - full summary of entire B-tree
        b   - print the bitmap
        v   - print the basic information (revision number, blocksize etc.)
        +   - equivalent to tbv
        ?   - lists currently available options
</pre>

    The options cause a side-effect of printing, so Btree_check(s, "v") checks
    the entire B-tree and reports basic information, rather than merely
    reporting the basic information.
    -->
</blockquote>

<h2>Full compaction</h2>

As the B-tree grows, items are added into blocks. When a block is full, it
splits into two (amoeba-like) and one of the new blocks accommodates the new
entry. Blocks are therefore between 50% and 100% full during growth, or 75% full
on average.

<p>
Let us say an item is 'new' if it is presented for addition to the B-tree and
its key is not already in the B-tree. Then presenting a long run of new items
ordered by key causes the B-tree updating process to switch into a mode where
much higher compaction than 75% is achieved - about 90%. This is called
'sequential' mode. It is possible to force an even higher compaction rate with
the procedure


<pre>
void Btree::full_compaction(bool parity);
</pre>

So

<pre>
    B.full_compaction(true);
</pre>

switches full compaction on, and

<pre>
    B.full_compaction(false);
</pre>

switches it off. Full compaction may be switched on or off at any time, but
it only affects the compaction rate of sequential mode. In sequential mode, full
compaction gives around 98-99% block usage - it is not quite 100% because keys
are not split across blocks.

<p>
The downside of full compaction is that block splitting will be heavy on the
next update. However, if a B-tree is created with no intention of being updated,
full compaction is very desirable.

<h2>Full compaction with revision 1</h2>

Retrieval mode is faster when the B-tree has revision number 1 than for higher
revision numbers. This is because there are no unused blocks in the B-tree and
the blocks are in a special order, and this enables the Bcursor::prev and
Bcursor::next procedures, and the other procedures which use them implicitly,
to have more efficient forms.

<p>
To make a really fast structure for retrieval therefore, create a new B-tree,
open it for updating, set full compaction mode, and add all the items in a
single transaction, sorted on keys. After closing, do not update further.
<!--
Further updates can be prevented quite easily by deleting (or moving) the bitmap
files. These are required in update mode but ignored in retrieval mode.
-->

<p>
Xapian includes a utility which performs this process on all the Btrees
in a quartz database - it's called quartzcompact. You can
refer to the <a href="http://www.xapian.org/docs/sourcedoc/html/quartzcompact_8cc-source.html">source code of the quartzcompact utility</a> to see how this
is implemented.

<h2>quartzcompact</h2>

quartzcompact takes two arguments - the path of the database to compact, and
a path to write the compacted version to.
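<p>
For example (the paths here are just placeholders):

<pre>
    quartzcompact /path/to/source-quartz-db /path/to/compacted-quartz-db
</pre>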

<p>
Only the Btree structure is changed - all the keys and tags are unaltered, so
the database is the same as far as an application using Xapian is concerned.
In particular, all the document ids are the same.

<h2>Notes on space requirements</h2>

The level of the B-tree is the distance of the root block from a leaf block. At
minimum this is zero. If a B-tree has level L and block size B, then update
mode requires space for 2(LB + b<sub>1</sub> + b<sub>2</sub>) bytes,
where b<sub>1</sub> and b<sub>2</sub> are the sizes of
the two bitmap files. Of course, L, b<sub>1</sub> and b<sub>2</sub>
may grow during an update on the
B-tree. If the revision number is greater than one, then retrieval mode
requires (L - 2 + 2c)B bytes, where c is the number of active cursors. If
however the revision number is one, it only requires (L - 2 + c)B bytes.
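
<p>
For illustration (with made-up figures): a level 2 B-tree with a 16K block size
and 4K bitmap files needs about 2(2*16K + 4K + 4K) = 80K bytes in update mode,
and, with two active cursors, about (2 - 2 + 2*2)*16K = 64K bytes in retrieval
mode.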

<p>
This may change in the future with code redesign, but meanwhile note that a
K-term query which needs k &lt;= K cursors open at once to process will demand
2*K*B bytes of memory in the B-tree manager.

<h2>Updating during retrieval</h2>

The B-tree cannot be updated by two separate processes at the same time. The
user of the B-tree software should establish a locking mechanism to ensure that
this never happens.

<p>
It is possible to do retrieval while the B-tree is being updated. If the
updating process overwrites a part of the B-tree required by the retrieval
process, then a Xapian::DatabaseModifiedError exception is thrown.

<p>
This should be handled, and suitable action taken - either the operation
aborted, or the Btree reopened at the latest revision and the operation
retried.  Here is a model scheme:

<pre>
static Btree * reopen(Btree * B, const string &amp; s)
{
    // Get the revision number. This will return the correct value, even when
    // B-&gt;overwritten is detected during opening.
    uint4 revision = B-&gt;get_open_revision_number();

    while (true) {
        try {
            delete B;  /* close the B-tree */
            B = new Btree(s, true);
            B-&gt;open(); /* and reopen */
            break;
        } catch (const Xapian::DatabaseModifiedError &amp;) {
        }
    }

    if (revision == B-&gt;get_open_revision_number()) {
        // The revision number ought to have gone up from last time,
        // so if we arrive here, something has gone badly wrong ...
        printf("Possible database corruption!\n");
        exit(1);
    }
    return B;
}


    ....

    const string s = "database/";
    Btree * B = 0;

    /* open the B-tree */
    while (true) {
        try {
            delete B;  /* close the B-tree */
            B = new Btree(s, true);
            B-&gt;open(); /* and reopen */
            break;
        } catch (const Xapian::DatabaseModifiedError &amp;) {
        }
    }

    string t;
    while (true) {
        try {
            B-&gt;find_tag("brunel", &amp;t); /* look up some keyword */
            break;
        } catch (const Xapian::DatabaseModifiedError &amp;) {
            B = reopen(B, s);
        }
    }

    ...
</pre>

If the overwritten condition were detected in updating mode, this would mean
that there were two updating processes at work, or the database has become
corrupted somehow.  If this is detected, a Xapian::DatabaseCorruptError is
thrown.  There's not much which can usefully be done to automatically handle
this condition.

<p>
In retrieval mode, the following can cause Xapian::DatabaseModifiedError to
be thrown:

<pre>
    Btree::open_to_read(name);
    Btree::open_to_read(name, revision);
    Bcursor::next();
    Bcursor::prev();
    Bcursor::find_key(const string &amp;key);
    Bcursor::get_tag(string * tag);
</pre>

The following cannot:

<pre>
   Bcursor::Bcursor(Btree * B);
   Bcursor::get_key(string * key);
</pre>

Note particularly that opening the B-tree can cause it, but
Bcursor::get_key(..) can't.

<!--
<h2>Error conditions</h2>

Note: this section is out of date - many methods now return errors in different
ways to those this section indicates.

<P>
The procedures described above report errors in two ways. (A) A non-zero
result. Btree::close() returns an int result which is 0 if successful,
otherwise an error number. (B) The error is placed in B-&gt;error, where B is the
Btree structure used in the call, or the Btree structure from which the Bcursor
structure used in the call derives. Then B-&gt;error == 0 means no error,
otherwise it is a positive number (greater than 2) giving the error number.

<p>
Some procedures cannot give an error.  Here is a summary:

<pre>
    Error method  procedure
    (A)(B)          error condition given by:
    ===================================================
        *  n = Btree::find_key(key)
        *  n = Btree::find_tag(key, kt)
        *  n = Btree::add(key, tag)
        *  n = Btree::del(key)
        *  B = Btree::open_to_write(s)
        *  B = Btree::open_to_write(s, rev)
     *     n = Btree::close(B, rev)
     *     n = Btree::create(s, block_size)   [throws exception]
        *  B = Btree::open_to_read(s)
        *  B = Btree::open_to_read(s, rev)
               Bcursor::Bcursor()
        *  n = Bcursor::find_key(key)
        *  n = Bcursor::next()
        *  n = Bcursor::prev()
        *  n = Bcursor::get_key(kt)
        *  n = Bcursor::get_tag(kt)
               Btree::full_compaction(parity)

    (A) non-zero result (B) B.error == true
</pre>
-->
<!-- FOOTER $Author: olly $ $Date: 2009-03-18 06:36:15 +0000 (Wed, 18 Mar 2009) $ $Id: quartzdesign.html 12217 2009-03-18 06:36:15Z olly $ -->
</BODY>
</HTML>