Sophie

Sophie

distrib > CentOS > 5 > i386 > by-pkgid > 90dba77ca23efa667b541b5c0dd77497 > files > 80

python-lxml-2.0.11-2.el5.i386.rpm

=====================================
lxml FAQ - Frequently Asked Questions
=====================================

.. meta::
  :description: Frequently Asked Questions about lxml (FAQ)
  :keywords: lxml, lxml.etree, FAQ, frequently asked questions

Frequently asked questions on lxml.  See also the notes on compatibility_ to
ElementTree_.

.. _compatibility: compatibility.html
.. _ElementTree:   http://effbot.org/zone/element-index.htm

.. contents::
..
   1  General Questions
     1.1  Is there a tutorial?
     1.2  Where can I find more documentation about lxml?
     1.3  What standards does lxml implement?
     1.4  Who uses lxml?
     1.5  What is the difference between lxml.etree and lxml.objectify?
     1.6  How can I make my application run faster?
     1.7  What about that trailing text on serialised Elements?
   2  Installation
     2.1  Which version of libxml2 and libxslt should I use or require?
     2.2  Where are the Windows binaries?
     2.3  Why do I get errors about missing UCS4 symbols when installing lxml?
   3  Contributing
     3.1  Why is lxml not written in Python?
     3.2  How can I contribute?
   4  Bugs
     4.1  My application crashes!
     4.2  I think I have found a bug in lxml. What should I do?
   5  Threading
     5.1  Can I use threads to concurrently access the lxml API?
     5.2  Does my program run faster if I use threads?
     5.3  Would my single-threaded program run faster if I turned off threading?
     5.4  Why can't I reuse XSLT stylesheets in other threads?
     5.5  My program crashes when run with mod_python/Pyro/Zope/Plone/...
   6  Parsing and Serialisation
     6.1  Why doesn't the ``pretty_print`` option reformat my XML output?
     6.2  Why can't lxml parse my XML from unicode strings?
     6.3  What is the difference between str(xslt(doc)) and xslt(doc).write() ?
     6.4  Why can't I just delete parents or clear the root node in iterparse()?
     6.5  How do I output null characters in XML text?
   7  XPath and Document Traversal
     7.1  What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?
     7.2  Why doesn't ``findall()`` support full XPath expressions?
     7.3  How can I find out which namespace prefixes are used in a document?
     7.4  How can I specify a default namespace for XPath expressions?


General Questions
=================

Is there a tutorial?
--------------------

Read the `lxml.etree Tutorial`_.  While this is still work in progress
(just as any good documentation), it provides an overview of the most
important concepts in ``lxml.etree``.  If you want to help out,
improving the tutorial is a very good place to start.

There is also a `tutorial for ElementTree`_ which works for ``lxml.etree``.
The `API documentation`_ also contains many examples for ``lxml.etree``.  To
learn using ``lxml.objectify``, read the `objectify documentation`_.

John Shipman has written another tutorial called `Python XML
processing with lxml`_ that contains lots of examples.

.. _`lxml.etree Tutorial`:      tutorial.html
.. _`tutorial for ElementTree`: http://effbot.org/zone/element.htm
.. _`API documentation`:        api.html
.. _`objectify documentation`:  objectify.html
.. _`Python XML processing with lxml`: http://www.nmt.edu/tcc/help/pubs/pylxml/


Where can I find more documentation about lxml?
-----------------------------------------------

There is a lot of documentation as lxml implements the well-known `ElementTree
API`_ and tries to follow its documentation as closely as possible.  There are
a couple of issues where lxml cannot keep up compatibility.  They are
described in the compatibility_ documentation.  The lxml specific extensions
to the API are described by individual files in the ``doc`` directory of the
distribution and on `the web page`_.

.. _`ElementTree API`: http://effbot.org/zone/element-index.htm
.. _`the web page`:    http://codespeak.net/lxml/#documentation


What standards does lxml implement?
-----------------------------------

The compliance to XML Standards depends on the support in libxml2 and libxslt.
Here is a quote from `http://xmlsoft.org/`:

  In most cases libxml2 tries to implement the specifications in a relatively
  strictly compliant way. As of release 2.4.16, libxml2 passed all 1800+ tests
  from the OASIS XML Tests Suite.

lxml currently supports libxml2 2.6.20 or later, which has even better support
for various XML standards.  Some of the more important ones are: HTML, XML
namespaces, XPath, XInclude, XSLT, XML catalogs, canonical XML, RelaxNG,
XML:ID.  Support for XML Schema and especially Schematron is currently
incomplete in libxml2, but is definitely usable and actively being worked on.
libxml2 also supports loading documents through HTTP and FTP.


Who uses lxml?
--------------

As an XML library, lxml is often used under the hood of in-house
server applications, such as web servers or applications that
facilitate some kind of document management.  Therefore, it is hard to
get an idea of who uses it, and the following list of 'users and
projects we know of' is definitely not a complete list of lxml's
users.

Also note that the compatibility to the ElementTree library does not
require projects to set a hard dependency on lxml - as long as they do
not take advantage of lxml's enhanced feature set.

* cssutils_, a CSS parser and toolkit, can be used with ``lxml.cssselect``
* Deliverance_, a content theming tool
* `Enfold Proxy 4`_, a web server accelerator with on-the-fly XSLT processing
* Inteproxy_, a secure HTTP proxy
* lwebstring_, an XML template engine
* OpenXMLlib_, a library for handling OpenXML document meta data
* Pycoon_, a WSGI web development framework based on XML pipelines
* rfadict_, an RDFa parser with a simple dictionary-like interface.

Zope3 and some of its extensions have good support for lxml:

* gocept.lxml_, Zope3 interface bindings for lxml
* z3c.rml_, an implementation of ReportLab's RML format

And don't miss the quotes by our generally happy_ users_, and other
`sites that link to lxml`_.

.. _cssutils: http://code.google.com/p/cssutils/source/browse/trunk/examples/style.py?r=917
.. _Deliverance: http://www.openplans.org/projects/deliverance/project-home
.. _`Enfold Proxy 4`: http://www.enfoldsystems.com/Products/Proxy/4
.. _gocept.lxml: http://pypi.python.org/pypi/gocept.lxml
.. _Inteproxy: http://lists.wald.intevation.org/pipermail/inteproxy-devel/2007-February/000000.html
.. _lwebstring: http://pypi.python.org/pypi/lwebstring
.. _OpenXMLlib: http://permalink.gmane.org/gmane.comp.python.lxml.devel/3250
.. _Pycoon: http://pypi.python.org/pypi/pycoon
.. _rfadict: http://pypi.python.org/pypi/rdfadict
.. _z3c.rml: http://pypi.python.org/pypi/z3c.rml

.. _happy: http://thread.gmane.org/gmane.comp.python.lxml.devel/3244/focus=3244
.. _users: http://article.gmane.org/gmane.comp.python.lxml.devel/3246
.. _`sites that link to lxml`: http://www.google.com/search?as_lq=http%3A%2F%2Fcodespeak.net%2Flxml


What is the difference between lxml.etree and lxml.objectify?
-------------------------------------------------------------

The two modules provide different ways of handling XML.  However, objectify
builds on top of lxml.etree and therefore inherits most of its capabilities
and a large portion of its API.

* lxml.etree is a generic API for XML and HTML handling.  It aims for
  ElementTree compatibility_ and supports the entire XML infoset.  It is well
  suited for both mixed content and data centric XML.  Its generality makes it
  the best choice for most applications.

* lxml.objectify is a specialized API for XML data handling in a Python object
  syntax.  It provides a very natural way to deal with data fields stored in a
  structurally well defined XML format.  Data is automatically converted to
  Python data types and can be manipulated with normal Python operators.  Look
  at the examples in the `objectify documentation`_ to see what it feels like
  to use it.

  Objectify is not well suited for mixed contents or HTML documents.  As it is
  built on top of lxml.etree, however, it inherits the normal support for
  XPath, XSLT or validation.


How can I make my application run faster?
-----------------------------------------

lxml.etree is a very fast library for processing XML.  There are, however, `a
few caveats`_ involved in the mapping of the powerful libxml2 library to the
simple and convenient ElementTree API.  Not all operations are as fast as the
simplicity of the API might suggest, while some use cases can heavily benefit
from finding the right way of doing them.  The `benchmark page`_ has a
comparison to other ElementTree implementations and a number of tips for
performance tweaking.  As with any Python application, the rule of thumb is:
the more of your processing runs in C, the faster your application gets.  See
also the section on threading_.

.. _`a few caveats`:  performance.html#the-elementtree-api
.. _`benchmark page`: performance.html
.. _threading:        #threading


What about that trailing text on serialised Elements?
-----------------------------------------------------

The ElementTree tree model defines an Element as a container with a tag name,
contained text, child Elements and a tail text.  This means that whenever you
serialise an Element, you will get all parts of that Element::

    >>> from lxml import etree
    >>> root = etree.XML("<root><tag>text<child/></tag>tail</root>")
    >>> print etree.tostring(root[0])
    <tag>text<child/></tag>tail

Here is an example that shows why not serialising the tail would be
even more surprising from an object point of view::

    >>> root = etree.Element("test")

    >>> root.text = "TEXT"
    >>> etree.tostring(root)
    <test>TEXT</test>

    >>> etree.tail = "TAIL"
    >>> etree.tostring(root)
    <test>TEXT</test>TAIL

    >>> etree.tail = None
    >>> etree.tostring(root)
    <test>TEXT</test>

Just imagine a Python list where you append an item and it doesn't
show up when you look at the list.

The ``.tail`` property is a huge simplification for the tree model as
it avoids text nodes to appear in the list of children and makes
access to them quick and simple.  So this is a benefit in most
applications and simplifies many, many XML tree algorithms.

However, in document-like XML (and especially HTML), the above result can be
unexpected to new users and can sometimes require a bit more overhead.  A good
way to deal with this is to use helper functions that copy the Element without
its tail.  The ``lxml.html`` package also deals with this in a couple of
places, as most HTML algorithms benefit from a tail-free behaviour.


Installation
============

Which version of libxml2 and libxslt should I use or require?
-------------------------------------------------------------

It really depends on your application, but the rule of thumb is: more recent
versions contain less bugs and provide more features.

* Do not use libxml2 2.6.27 if you want to use XPath (including XSLT).  You
  will get crashes when XPath errors occur during the evaluation (e.g. for
  unknown functions).  This happens inside the evaluation call to libxml2, so
  there is nothing that lxml can do about it.

* Try to use versions of both libraries that were released together.  At least
  the libxml2 version should not be older than the libxslt version.

* If you use XML Schema or Schematron which are still under development, the
  most recent version of libxml2 is usually a good bet.

* The same applies to XPath, where a substantial number of bugs and memory
  leaks were fixed over time.  If you encounter crashes or memory leaks in
  XPath applications, try a more recent version of libxml2.

* For parsing and fixing broken HTML, lxml requires at least libxml2 2.6.21.

* For the normal tree handling, however, any libxml2 version starting with
  2.6.20 should do.

Read the `release notes of libxml2`_ and the `release notes of libxslt`_ to
see when (or if) a specific bug has been fixed.

.. _`release notes of libxml2`: http://xmlsoft.org/news.html
.. _`release notes of libxslt`: http://xmlsoft.org/XSLT/news.html


Where are the Windows binaries?
-------------------------------

Short answer: If you want to contribute a binary build, we are happy to put it
up on the Cheeseshop.

Long answer: Two of the bigger problems with the Windows system are the lack
of a pre-installed standard compiler and the missing package management.  Both
make it non-trivial to build lxml on this platform.  We are trying hard to
make lxml as platform-independent as possible and it is regularly tested on
Windows systems.  However, we currently cannot provide Windows binary
distributions ourselves.

From time to time, users of different environments kindly contribute binary
builds of lxml, most frequently for Windows or Mac-OS X.  We put these on the
Cheeseshop to make it as easy as possible for others to use lxml on their
platform.

If there is not currently a binary distribution of the most recent lxml
release for your platform available from the Cheeseshop, please look through
the older versions to see if they provide a binary build.  This is done by
appending the version number to the cheeseshop URL, e.g.:

	  http://cheeseshop.python.org/pypi/lxml/1.1.2


Why do I get errors about missing UCS4 symbols when installing lxml?
--------------------------------------------------------------------

Most likely, you use a Python installation that was configured for internal
use of UCS2 unicode, meaning 16-bit unicode.  The lxml egg distributions are
generally compiled on platforms that use UCS4, a 32-bit unicode encoding, as
this is used on the majority of platforms.  Sadly, both are not compatible, so
the eggs can only support the one they were compiled with.

This means that you have to compile lxml from sources for your system.  Note
that you do not need Cython for this, the lxml source distribution is directly
compilable on both platform types.  See the `build instructions`_ on how to do
this.

.. _`build instructions`: build.html


Contributing
============

Why is lxml not written in Python?
----------------------------------

It *almost* is.

lxml is not written in plain Python, because it interfaces with two C
libraries: libxml2 and libxslt.  Accessing them at the C-level is
required for performance reasons.

However, to avoid writing plain C-code and caring too much about the
details of built-in types and reference counting, lxml is written in
Cython_, a Python-like language that is translated into C-code.
Chances are that if you know Python, you can write `code that Cython
accepts`_.  Again, the C-ish style used in the lxml code is just for
performance optimisations.  If you want to contribute, don't bother
with the details, a Python implementation of your contribution is
better than none.  And keep in mind that lxml's flexible API often
favours an implementation of features in pure Python, without
bothering with C-code at all.  For example, the ``lxml.html`` package
is entirely written in Python.

Please contact the `mailing list`_ if you need any help.

.. _Cython: http://www.cython.org/
.. _`code that Cython accepts`: http://www.cosc.canterbury.ac.nz/greg.ewing/python/Pyrex/version/Doc/overview.html


How can I contribute?
---------------------

Besides enhancing the code, there are a lot of places where you can help the
project and its user base.  You can

* spread the word and write about lxml.  Many users (especially new Python
  users) have not yet heared about lxml, although our user base is constantly
  growing.  If you write your own blog and feel like saying something about
  lxml, go ahead and do so.  If we think your contribution or criticism is
  valuable to other users, we may even put a link or a quote on the project
  page.

* provide code examples for the general usage of lxml or specific problems
  solved with lxml.  Readable code is a very good way of showing how a library
  can be used and what great things you can do with it.  Again, if we hear
  about it, we can set a link on the project page.

* work on the documentation.  The web page is generated from a set of ReST_
  `text files`_.  It is meant both as a representative project page for lxml
  and as a site for documenting lxml's API and usage.  If you have questions
  or an idea how to make it more readable and accessible while you are reading
  it, please send a comment to the `mailing list`_.

.. _ReST: http://docutils.sourceforge.net/rst.html
.. _`text files`: http://codespeak.net/svn/lxml/trunk/doc/

* help with the tutorial.  A tutorial is the most important stating point for
  new users, so it is important for us to provide an easy to understand guide
  into lxml.  As allo documentation, the tutorial is work in progress, so we
  appreciate every helping hand.

* improve the docstrings.  lxml uses docstrings to support Python's integrated
  online ``help()`` function.  However, sometimes these are not sufficient to
  grasp the details of the function in question.  If you find such a place,
  you can try to write up a better description and send it to the `mailing
  list`_.


Bugs
====

My application crashes!
-----------------------

One of the goals of lxml is "no segfaults", so if there is no clear warning in
the documentation that you were doing something potentially harmful, you have
found a bug and we would like to hear about it.  Please report this bug to the
`mailing list`_.  See the next section on how to do that.

However, there are a few things to try first, to make sure the problem is
really within lxml (or libxml2 or libxslt):

a) If your application (or e.g. your web container) uses threads, please see
   the FAQ section on threading_ to check if you touch on one of the
   potential pitfalls.

b) If you are on Mac-OS X, make sure lxml uses the correct libraries.  If you
   have updated the old system libraries (e.g. through fink), this is best
   achieved by building lxml statically to prevent the different library
   versions from interfering.  If you choose to use a dynamically linked
   version, make sure the ``DYLD_LIBRARY_PATH`` environment variable contains
   the directory where you installed the libraries.  To make sure the correct
   libraries are used, print the module level version numbers that
   ``lxml.etree`` provides from *within* your application rather than relying
   on what your operating system tells you.

In any case, try to reproduce the problem with the latest versions of
libxml2 and libxslt.  From time to time, bugs and race conditions are found
in these libraries, so a more recent version might already contain a fix for
your problem.


I think I have found a bug in lxml. What should I do?
-----------------------------------------------------

First, you should look at the `current developer changelog`_ to see if this
is a known problem that has already been fixed in the SVN trunk since the
release you are using.

.. _`current developer changelog`: http://codespeak.net/svn/lxml/trunk/CHANGES.txt

Also, the 'crash' section above has a few good advices what to try to see if
the problem is really in lxml - and not in your setup.  Believe it or not,
that happens more often than you might think, especially when old libraries
or even multiple library versions are installed.

You should always try to reproduce the problem with the latest versions of
libxml2 and libxslt - and make sure they are used (``lxml.etree`` can tell
you what it runs with, see below).

Otherwise, we would really like to hear about it.  Please report it to the
`mailing list`_ so that we can fix it.  It is very helpful in this case if
you can come up with a short code snippet that demonstrates your problem.
If others can reproduce and see the problem, it is much easier for them to
fix it - and maybe even easier for you to describe it and get people
convinced that it really is a problem to fix.  Please also report the
version of lxml, libxml2 and libxslt that you are using by calling this::

   from lxml import etree
   print "lxml.etree:       ", etree.LXML_VERSION
   print "libxml used:      ", etree.LIBXML_VERSION
   print "libxml compiled:  ", etree.LIBXML_COMPILED_VERSION
   print "libxslt used:     ", etree.LIBXSLT_VERSION
   print "libxslt compiled: ", etree.LIBXSLT_COMPILED_VERSION

.. _`mailing list`: http://codespeak.net/mailman/listinfo/lxml-dev

Since as a user of lxml you are likely a programmer, you might find
`this article on bug reports`_ an interesting read.

.. _`this article on bug reports`: http://www.chiark.greenend.org.uk/~sgtatham/bugs.html


Threading
=========

Can I use threads to concurrently access the lxml API?
------------------------------------------------------

Yes, although not carelessly.

lxml frees the GIL (Python's global interpreter lock) internally when parsing
from disk and memory, as long as you use either the default parser (which is
replicated for each thread) or create a parser for each thread yourself.  lxml
also allows concurrency during validation (RelaxNG and XMLSchema) and XSL
transformation.  You can share RelaxNG, XMLSchema and (with restrictions) XSLT
objects between threads.  While you can also share parsers between threads,
this will serialize the access to each of them, so it is better to ``copy()``
parsers or to just use the default parser (which is automatically copied for
each thread).

Due to the way libxslt handles threading, concurrent access to stylesheets is
currently only possible if it was parsed in the main thread.  Parsing and
applying a stylesheet inside one thread also works.

Warning: You should generally avoid modifying trees in other threads than the
one it was generated in.  Although this should work in many cases, there are
certain scenarios where the termination of a thread that parsed a tree can
crash the application if subtrees of this tree were moved to other documents.
You should be on the safe side when passing trees between threads if you
either

a) do not modify these trees and do not move their elements to other trees, or
b) do not terminate threads while the trees they parsed are still in use
   (e.g. by using a fixed size thread-pool or long-running threads in
   processing chains)


Does my program run faster if I use threads?
--------------------------------------------

Depends.  The best way to answer this is timing and profiling.

The global interpreter lock (GIL) in Python serializes access to the
interpreter, so if the majority of your processing is done in Python code
(walking trees, modifying elements, etc.), your gain will be close to 0.  The
more of your XML processing moves into lxml, however, the higher your gain.
If your application is bound by XML parsing and serialisation, or by complex
XSLTs, your speedup on multi-processor machines can be substantial.

See the question above to learn which operations free the GIL to support
multi-threading.


Would my single-threaded program run faster if I turned off threading?
----------------------------------------------------------------------

Can be.  You can see for yourself by compiling lxml entirely without threading
support.  Pass the ``--without-threading`` option to setup.py when building
lxml from source.


Why can't I reuse XSLT stylesheets in other threads?
----------------------------------------------------

lxml currently has the restriction that an XSLT object can only be
used in a thread if it was created either in the thread itself or in
the main thread.  This is due to some interfering optimisations in
libxslt and lxml.etree.  To work around this, you can do a couple of
things:

* create all XSLT objects in the main program and reuse them wherever
  you want.

* create them in the thread where you use them and maybe cache them in
  thread local storage (see the threading module).

If your stylesheets are diverse and status specific, you can still
prepare them in advance if you:

* use XSLT parameters that you pass at call time to configure the
  stylesheets

* create the stylesheets (partially) programmatically in the main
  program, e.g. by adding ``xsl:output`` tags, ``xsl:include`` tags or
  templates (be careful with the order here) to the XSL tree, and then
  create the ``XSLT`` objects and store them in a read-only
  dictionary.  That way, you can access and use them in any thread.

Note that passing the same XSL tree into multiple ``XSLT()`` instances
will create independent stylesheets, so later modifications of the
tree will not be reflected in the already created stylesheets.  Also,
since lxml 2.0, you can deep copy XSLT objects using the ``copy``
module from the standard library.


My program crashes when run with mod_python/Pyro/Zope/Plone/...
---------------------------------------------------------------

These environments can use threads in a way that may not make it obvious when
threads are created and what happens in which thread.  This makes it hard to
ensure lxml's threading support is used in a reliable way.  Sadly, if problems
arise, they are as diverse as the applications, so it is difficult to provide
any generally applicable solution.  Also, these environments are so complex
that problems become hard to debug and even harder to reproduce in a
predictable way.  If you encounter crashes in one of these systems, but your
code runs perfectly when started by hand, the following gives you a few hints
for possible approaches to solve your specific problem:

* make sure you use recent versions of libxml2, libxslt and lxml.  The libxml2
  developers keep fixing bugs in each release, and lxml also tries to become
  more robust against possible pitfalls.  So newer versions might already fix
  your problem in a reliable way.

* make sure the library versions you installed are really used.  Do not rely
  on what your operating system tells you!  Print the version constants in
  ``lxml.etree`` from within your runtime environment to make sure it is the
  case.  This is especially a problem under MacOS-X when newer library
  versions were installed in addition to the outdated system libraries.

* if you use ``mod_python``, try setting this option:

      PythonInterpreter main_interpreter

  There was a discussion on the mailing list about this problem:

      http://comments.gmane.org/gmane.comp.python.lxml.devel/2942

* compile lxml without threading support by running ``setup.py`` with the
  ``--without-threading`` option.  While this might be slower in certain
  scenarios on multi-processor systems, it *might* also keep your application
  from crashing, which should be worth more to you than peek performance.
  Remember that lxml is fast anyway, so concurrency may not even be worth it.

* avoid doing fancy XSLT stuff like foreign document access or passing in
  subtrees trough XSLT variables.  This might or might not work, depending on
  your specific usage.

* try copying trees at suspicious places and working with those instead of a
  tree shared between threads.  A good candidate might be the result of an
  XSLT or the stylesheet itself.

* try keeping thread-local copies of XSLT stylesheets, i.e. one per thread,
  instead of sharing one.  Also see the question above.

* you can try to serialise suspicious parts of your code with explicit thread
  locks, thus disabling the concurrency of the runtime system.

* report back on the mailing list to see if there are other ways to work
  around your specific problems.  Do not forget to report the version numbers
  of lxml, libxml2 and libxslt you are using (see the question on reporting
  a bug).


Parsing and Serialisation
=========================

Why doesn't the ``pretty_print`` option reformat my XML output?
---------------------------------------------------------------

Pretty printing (or formatting) an XML document means adding white space to
the content.  These modifications are harmless if they only impact elements in
the document that do not carry (text) data.  They corrupt your data if they
impact elements that contain data.  If lxml cannot distinguish between
whitespace and data, it will not alter your data.  Whitespace is therefore
only added between nodes that do not contain data.  This is always the case
for trees constructed element-by-element, so no problems should be expected
here.  For parsed trees, a good way to assure that no conflicting whitespace
is left in the tree is the ``remove_blank_text`` option::

   >>> parser = etree.XMLParser(remove_blank_text=True)
   >>> tree = etree.parse(file, parser)

This will allow the parser to drop blank text nodes when constructing the
tree.  If you now call a serialization function to pretty print this tree,
lxml can add fresh whitespace to the XML tree to indent it.


Why can't lxml parse my XML from unicode strings?
-------------------------------------------------

lxml can read Python unicode strings and even tries to support them if libxml2
does not.  However, if the unicode string declares an XML encoding internally
(``<?xml encoding="..."?>``), parsing is bound to fail, as this encoding is
most likely not the real encoding used in Python unicode.  The same is true
for HTML unicode strings that contain charset meta tags, although the problems
may be more subtle here.  The libxml2 HTML parser may not be able to parse the
meta tags in broken HTML and may end up ignoring them, so even if parsing
succeeds, later handling may still fail with character encoding errors.

Note that Python uses different encodings for unicode on different platforms,
so even specifying the real internal unicode encoding is not portable between
Python interpreters.  Don't do it.

Python unicode strings with XML data or HTML data that carry encoding
information are broken.  lxml will not parse them.  You must provide parsable
data in a valid encoding.


What is the difference between str(xslt(doc)) and xslt(doc).write() ?
---------------------------------------------------------------------

The str() implementation of the XSLTResultTree class (a subclass of the
ElementTree class) knows about the output method chosen in the stylesheet
(xsl:output), write() doesn't.  If you call write(), the result will be a
normal XML tree serialization in the requested encoding.  Calling this method
may also fail for XSLT results that are not XML trees (e.g. string results).

If you call str(), it will return the serialized result as specified by the
XSL transform.  This correctly serializes string results to encoded Python
strings and honours ``xsl:output`` options like ``indent``.  This almost
certainly does what you want, so you should only use ``write()`` if you are
sure that the XSLT result is an XML tree and you want to override the encoding
and indentation options requested by the stylesheet.


Why can't I just delete parents or clear the root node in iterparse()?
----------------------------------------------------------------------

The ``iterparse()`` implementation is based on the libxml2 parser.  It
requires the tree to be intact to finish parsing.  If you delete or modify
parents of the current node, chances are you modify the structure in a way
that breaks the parser.  Normally, this will result in a segfault.  Please
refer to the `iterparse section`_ of the lxml API documentation to find out
what you can do and what you can't do.

.. _`iterparse section`: api.html#iterparse-and-iterwalk


How do I output null characters in XML text?
--------------------------------------------

Don't.  What you would produce is not well-formed XML.  XML parsers
will refuse to parse a document that contains null characters.  The
right way to embed binary data in XML is using a text encoding such as
uuencode or base64.


XPath and Document Traversal
============================

What are the ``findall()`` and ``xpath()`` methods on Element(Tree)?
--------------------------------------------------------------------

``findall()`` is part of the original `ElementTree API`_.  It supports a
`simple subset of the XPath language`_, without predicates, conditions and
other advanced features.  It is very handy for finding specific tags in a
tree.  Another important difference is namespace handling, which uses the
``{namespace}tagname`` notation.  This is not supported by XPath.  The
findall, find and findtext methods are compatible with other ElementTree
implementations and allow writing portable code that runs on ElementTree,
cElementTree and lxml.etree.

``xpath()``, on the other hand, supports the complete power of the XPath
language, including predicates, XPath functions and Python extension
functions.  The syntax is defined by the `XPath specification`_.  If you need
the expressiveness and selectivity of XPath, the ``xpath()`` method, the
``XPath`` class and the ``XPathEvaluator`` are the best choice_.

.. _`simple subset of the XPath language`: http://effbot.org/zone/element-xpath.htm
.. _`XPath specification`:                 http://www.w3.org/TR/xpath
.. _choice:                                performance.html#xpath


Why doesn't ``findall()`` support full XPath expressions?
---------------------------------------------------------

It was decided that it is more important to keep compatibility with
ElementTree_ to simplify code migration between the libraries.  The main
difference compared to XPath is the ``{namespace}tagname`` notation used in
``findall()``, which is not valid XPath.

ElementTree and lxml.etree use the same implementation, which assures 100%
compatibility.  Note that ``findall()`` is `so fast`_ in lxml that a native
implementation would not bring any performance benefits.

.. _`so fast`: performance.html#tree-traversal


How can I find out which namespace prefixes are used in a document?
-------------------------------------------------------------------

You can traverse the document (``getiterator()``) and collect the prefix
attributes from all Elements into a set.  However, it is unlikely that you
really want to do that.  You do not need these prefixes, honestly.  You only
need the namespace URIs.  All namespace comparisons use these, so feel free to
make up your own prefixes when you use XPath expressions or extension
functions.

The only place where you might consider specifying prefixes is the
serialization of Elements that were created through the API.  Here, you can
specify a prefix mapping through the ``nsmap`` argument when creating the root
Element.  Its children will then inherit this prefix for serialization.


How can I specify a default namespace for XPath expressions?
------------------------------------------------------------

You can't.  In XPath, there is no such thing as a default namespace.  Just use
an arbitrary prefix and let the namespace dictionary of the XPath evaluators
map it to your namespace.  See also the question above.