====================
BeautifulSoup Parser
====================

BeautifulSoup_ is a Python package that parses broken HTML.  While libxml2
(and thus lxml) can also parse broken HTML, BeautifulSoup is a bit more
forgiving and has superior `support for encoding detection`_.

.. _BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/
.. _`support for encoding detection`: http://www.crummy.com/software/BeautifulSoup/documentation.html#Beautiful%20Soup%20Gives%20You%20Unicode,%20Dammit
.. _ElementSoup: http://effbot.org/zone/element-soup.htm

lxml can benefit from the parsing capabilities of BeautifulSoup
through the ``lxml.html.soupparser`` module.  It provides three main
functions: ``fromstring()`` and ``parse()`` to parse a string or file
using BeautifulSoup, and ``convert_tree()`` to convert an existing
BeautifulSoup tree into a list of top-level Elements.
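
For instance, if you already have a BeautifulSoup tree at hand,
``convert_tree()`` turns it into lxml Elements.  A minimal sketch,
assuming the BeautifulSoup 3 import and some made-up sample markup::

    >>> from BeautifulSoup import BeautifulSoup
    >>> from lxml.html.soupparser import convert_tree

    >>> soup = BeautifulSoup('<html><body>Hi all</body></html>')
    >>> roots = convert_tree(soup)
    >>> [el.tag for el in roots]
    ['html']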

The functions ``fromstring()`` and ``parse()`` behave like their
ElementTree counterparts: the former returns a root Element, the
latter an ElementTree.
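
As a quick sketch of the difference, with ``StringIO`` standing in
for a real file and the markup made up for illustration::

    >>> from StringIO import StringIO
    >>> from lxml.html.soupparser import fromstring, parse

    >>> fromstring('<html><body>Hi all</body></html>').tag
    'html'
    >>> tree = parse(StringIO('<html><body>Hi all</body></html>'))
    >>> tree.getroot().tag
    'html'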

There is also a legacy module called ``lxml.html.ElementSoup``, which
mimics the interface provided by ElementTree's own ElementSoup_
module.  Note that the ``soupparser`` module was added in lxml 2.0.3.
Previous versions of lxml 2.0.x only have the ``ElementSoup`` module.
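
Code written against that interface should keep working unchanged; a
sketch, assuming the effbot-style ``parse()`` entry point that returns
the root Element directly::

    >>> from lxml.html import ElementSoup
    >>> root = ElementSoup.parse(StringIO('<html><body>Hi all</body></html>'))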

Here is a document full of tag soup, similar to, but not quite like, HTML::

    >>> tag_soup = '<meta><head><title>Hello</head<body onload=crash()>Hi all<p>'

All you need to do is pass it to the ``fromstring()`` function::

    >>> from lxml.html.soupparser import fromstring
    >>> root = fromstring(tag_soup)

To see what we have here, you can serialise it::

    >>> from lxml.etree import tostring
    >>> print tostring(root, pretty_print=True),
    <html>
      <meta/>
      <head>
        <title>Hello</title>
      </head>
      <body onload="crash()">Hi all<p/></body>
    </html>

Not quite what you'd expect from an HTML page, but, well, it was broken
already, right?  BeautifulSoup did its best, and so now it's a tree.

To control which Element implementation is used, you can pass a
``makeelement`` factory function to ``parse()`` and ``fromstring()``.
By default, this is based on the HTML parser defined in ``lxml.html``.
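
For instance, to get a tree of plain ``lxml.etree`` Elements instead
of the HTML elements from ``lxml.html``, you could pass
``etree.Element`` as the factory; a sketch::

    >>> from lxml import etree
    >>> plain_root = fromstring(tag_soup, makeelement=etree.Element)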


Entity handling
===============

By default, the BeautifulSoup parser also replaces the entities it
finds with their character equivalents::

    >>> tag_soup = '<body>&copy;&euro;&#45;&#245;&#445;<p>'
    >>> body = fromstring(tag_soup).find('.//body')
    >>> body.text
    u'\xa9\u20ac-\xf5\u01bd'

If you want them back on the way out, you can serialise with the
'html' method, which will always use escaping for safety reasons::

    >>> tostring(body, method="html")
    '<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

    >>> tostring(body, method="html", encoding="utf-8")
    '<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

    >>> tostring(body, method="html", encoding=unicode)
    u'<body>&#xA9;&#x20AC;-&#xF5;&#x1BD;<p></p></body>'

When serialising to XML instead, only the default plain ASCII
encoding escapes the non-ASCII characters; other encodings represent
them directly::

    >>> tostring(body)
    '<body>&#169;&#8364;-&#245;&#445;<p/></body>'

    >>> tostring(body, encoding="utf-8")
    '<body>\xc2\xa9\xe2\x82\xac-\xc3\xb5\xc6\xbd<p/></body>'

    >>> tostring(body, encoding=unicode)
    u'<body>\xa9\u20ac-\xf5\u01bd<p/></body>'


Using soupparser as a fallback
==============================

The downside of using this parser is that it is `much slower`_ than
the HTML parser of lxml.  So if performance matters, you might want to
consider using ``soupparser`` only as a fallback for certain cases.

.. _`much slower`: http://blog.ianbicking.org/2008/03/30/python-html-parser-performance/

A common problem with lxml's parser is that it may not get the
encoding right when the document contains a ``<meta>`` tag in the
wrong place.  In that case, you can exploit the fact that lxml
serialises much faster than most other HTML libraries for Python.
Just serialise the document to unicode, and if that raises an
exception, re-parse it with BeautifulSoup to see if that works
better::

    >>> tag_soup = '''\
    ... <meta http-equiv="Content-Type"
    ...       content="text/html;charset=utf-8" />
    ... <html>
    ...   <head>
    ...     <title>Hello W\xc3\xb6rld!</title>
    ...   </head>
    ...   <body>Hi all</body>
    ... </html>'''

    >>> import lxml.html
    >>> import lxml.html.soupparser

    >>> root = lxml.html.fromstring(tag_soup)
    >>> try:
    ...     ignore = tostring(root, encoding=unicode)
    ... except UnicodeDecodeError:
    ...     root = lxml.html.soupparser.fromstring(tag_soup)
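
If BeautifulSoup picked up the encoding from the ``<meta>`` tag, the
re-parsed tree should now serialise without complaint; a quick sanity
check, assuming the detection worked::

    >>> u'W\xf6rld' in tostring(root, encoding=unicode)
    True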