<?xml version="1.0" encoding="ascii"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <title>lxml.html.clean</title> <link rel="stylesheet" href="epydoc.css" type="text/css" /> <script type="text/javascript" src="epydoc.js"></script> </head> <body bgcolor="white" text="black" link="blue" vlink="#204080" alink="#204080"> <!-- ==================== NAVIGATION BAR ==================== --> <table class="navbar" border="0" width="100%" cellpadding="0" bgcolor="#a0c0ff" cellspacing="0"> <tr valign="middle"> <!-- Home link --> <th> <a href="lxml-module.html">Home</a> </th> <!-- Tree link --> <th> <a href="module-tree.html">Trees</a> </th> <!-- Index link --> <th> <a href="identifier-index.html">Indices</a> </th> <!-- Help link --> <th> <a href="help.html">Help</a> </th> <!-- Project homepage --> <th class="navbar" align="right" width="100%"> <table border="0" cellpadding="0" cellspacing="0"> <tr><th class="navbar" align="center" ><a class="navbar" target="_top" href="http://codespeak.net/lxml/">lxml API</a></th> </tr></table></th> </tr> </table> <table width="100%" cellpadding="0" cellspacing="0"> <tr valign="top"> <td width="100%"> <span class="breadcrumbs"> <a href="lxml-module.html">Package lxml</a> :: <a href="lxml.html-module.html">Package html</a> :: Module clean </span> </td> <td> <table cellpadding="0" cellspacing="0"> <!-- hide/show private --> <tr><td align="right"><span class="options" >[<a href="frames.html" target="_top">frames</a >] | <a href="lxml.html.clean-module.html" target="_top">no frames</a>]</span></td></tr> </table> </td> </tr> </table> <!-- ==================== MODULE DESCRIPTION ==================== --> <h1 class="epydoc">Module clean</h1><p class="nomargin-top"><span class="codelink"><a href="lxml.html.clean-pysrc.html">source code</a></span></p> <p>A cleanup tool for HTML.</p> <p>Removes unwanted tags and content. 
See the <a href="lxml.html.clean.Cleaner-class.html" class="link">Cleaner</a> class for details.</p> <!-- ==================== CLASSES ==================== --> <a name="section-Classes"></a> <table class="summary" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr bgcolor="#70b0f0" class="table-header"> <td align="left" colspan="2" class="table-header"> <span class="table-header">Classes</span></td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type">&nbsp;</span> </td><td class="summary"> <a href="lxml.html.clean.Cleaner-class.html" class="summary-name">Cleaner</a><br /> Instances clean the document of each of the possible offending elements. </td> </tr> </table> <!-- ==================== FUNCTIONS ==================== --> <a name="section-Functions"></a> <table class="summary" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr bgcolor="#70b0f0" class="table-header"> <td align="left" colspan="2" class="table-header"> <span class="table-header">Functions</span></td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type">&nbsp;</span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a name="clean_html"></a><span class="summary-sig-name">clean_html</span>(<span class="summary-sig-arg">html</span>)</span></td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#clean_html">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type">&nbsp;</span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a href="lxml.html.clean-module.html#autolink" class="summary-sig-name">autolink</a>(<span class="summary-sig-arg">el</span>, <span 
class="summary-sig-arg">link_regexes</span>=<span class="summary-sig-default"><code class="variable-group">[</code>re.compile(r'<code class="re-flags">(?i)</code><code class="re-group">(?P&lt;</code><code class="re-ref">body</code><code class="re-group">&gt;</code>https<code class="re-op">?</code>://<code class="re-group">(?P&lt;</code><code class="re-ref">host</code><code class="re-group">&gt;</code><code class="re-group">[</code>a<code class="re-op">-</code>z0<code class="re-op">-</code>9\._-<code class="re-group">]</code><code class="re-op">+</code><code class="re-group">)</code><code class="re-group">(?:</code><code class="variable-ellipsis">...</code></span>, <span class="summary-sig-arg">avoid_elements</span>=<span class="summary-sig-default"><code class="variable-group">[</code><code class="variable-quote">'</code><code class="variable-string">textarea</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">pre</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">code</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">head</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">select</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">a</code><code class="variable-quote">'</code><code class="variable-group">]</code></span>, <span class="summary-sig-arg">avoid_hosts</span>=<span class="summary-sig-default"><code class="variable-group">[</code>re.compile(r'<code class="re-flags">(?i)</code>^localhost')<code class="variable-op">, </code>re.compile(r'<code class="re-flags">(?i)</code>\bexample\.<code 
class="re-group">(?</code><code class="variable-ellipsis">...</code></span>, <span class="summary-sig-arg">avoid_classes</span>=<span class="summary-sig-default"><code class="variable-group">[</code><code class="variable-quote">'</code><code class="variable-string">nolink</code><code class="variable-quote">'</code><code class="variable-group">]</code></span>)</span><br /> Turn any URLs into links.</td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#autolink">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a href="lxml.html.clean-module.html#autolink_html" class="summary-sig-name">autolink_html</a>(<span class="summary-sig-arg">html</span>, <span class="summary-sig-arg">*args</span>, <span class="summary-sig-arg">**kw</span>)</span><br /> Turn any URLs into links.</td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#autolink_html">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a href="lxml.html.clean-module.html#word_break" class="summary-sig-name">word_break</a>(<span class="summary-sig-arg">el</span>, <span class="summary-sig-arg">max_width</span>=<span class="summary-sig-default">40</span>, <span class="summary-sig-arg">avoid_elements</span>=<span class="summary-sig-default"><code class="variable-group">[</code><code class="variable-quote">'</code><code class="variable-string">pre</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code 
class="variable-string">textarea</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">code</code><code class="variable-quote">'</code><code class="variable-group">]</code></span>, <span class="summary-sig-arg">avoid_classes</span>=<span class="summary-sig-default"><code class="variable-group">[</code><code class="variable-quote">'</code><code class="variable-string">nobreak</code><code class="variable-quote">'</code><code class="variable-group">]</code></span>, <span class="summary-sig-arg">break_character</span>=<span class="summary-sig-default"><code class="variable-quote">u'</code><code class="variable-string">​</code><code class="variable-quote">'</code></span>)</span><br /> Breaks any long words found in the body of the text (not attributes).</td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#word_break">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a name="word_break_html"></a><span class="summary-sig-name">word_break_html</span>(<span class="summary-sig-arg">html</span>, <span class="summary-sig-arg">*args</span>, <span class="summary-sig-arg">**kw</span>)</span></td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#word_break_html">source code</a></span> </td> </tr> </table> </td> </tr> </table> <!-- ==================== VARIABLES ==================== --> <a name="section-Variables"></a> <table class="summary" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr bgcolor="#70b0f0" class="table-header"> <td align="left" colspan="2" class="table-header"> <span class="table-header">Variables</span></td> </tr> <tr> <td width="15%" 
align="right" valign="top" class="summary"> <span class="summary-type">&nbsp;</span> </td><td class="summary"> <a name="clean"></a><span class="summary-name">clean</span> = <code title="Cleaner()">Cleaner()</code> </td> </tr> </table> <!-- ==================== FUNCTION DETAILS ==================== --> <a name="section-FunctionDetails"></a> <table class="details" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr bgcolor="#70b0f0" class="table-header"> <td align="left" colspan="2" class="table-header"> <span class="table-header">Function Details</span></td> </tr> </table> <a name="autolink"></a> <div> <table class="details" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr><td> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr valign="top"><td> <h3 class="epydoc"><span class="sig"><span class="sig-name">autolink</span>(<span class="sig-arg">el</span>, <span class="sig-arg">link_regexes</span>=<span class="sig-default"><code class="variable-group">[</code>re.compile(r'<code class="re-flags">(?i)</code><code class="re-group">(?P&lt;</code><code class="re-ref">body</code><code class="re-group">&gt;</code>https<code class="re-op">?</code>://<code class="re-group">(?P&lt;</code><code class="re-ref">host</code><code class="re-group">&gt;</code><code class="re-group">[</code>a<code class="re-op">-</code>z0<code class="re-op">-</code>9\._-<code class="re-group">]</code><code class="re-op">+</code><code class="re-group">)</code><code class="re-group">(?:</code><code class="variable-ellipsis">...</code></span>, <span class="sig-arg">avoid_elements</span>=<span class="sig-default"><code class="variable-group">[</code><code class="variable-quote">'</code><code class="variable-string">textarea</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">pre</code><code class="variable-quote">'</code><code class="variable-op">, </code><code 
class="variable-quote">'</code><code class="variable-string">code</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">head</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">select</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">a</code><code class="variable-quote">'</code><code class="variable-group">]</code></span>, <span class="sig-arg">avoid_hosts</span>=<span class="sig-default"><code class="variable-group">[</code>re.compile(r'<code class="re-flags">(?i)</code>^localhost')<code class="variable-op">, </code>re.compile(r'<code class="re-flags">(?i)</code>\bexample\.<code class="re-group">(?</code><code class="variable-ellipsis">...</code></span>, <span class="sig-arg">avoid_classes</span>=<span class="sig-default"><code class="variable-group">[</code><code class="variable-quote">'</code><code class="variable-string">nolink</code><code class="variable-quote">'</code><code class="variable-group">]</code></span>)</span> </h3> </td><td align="right" valign="top" ><span class="codelink"><a href="lxml.html.clean-pysrc.html#autolink">source code</a></span> </td> </tr></table> <p>Turn any URLs into links.</p> <p>It will search for links identified by the given regular expressions (by default mailto and http(s) links).</p> <p>It won't link text in an element in avoid_elements, or an element with a class in avoid_classes. 
It won't link to anything with a host that matches one of the regular expressions in avoid_hosts (by default, localhost and 127.0.0.1).</p> <p>If you pass in an element, only the contents of the element are substituted; the element's tail is left unchanged.</p> <dl class="fields"> </dl> </td></tr></table> </div> <a name="autolink_html"></a> <div> <table class="details" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr><td> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr valign="top"><td> <h3 class="epydoc"><span class="sig"><span class="sig-name">autolink_html</span>(<span class="sig-arg">html</span>, <span class="sig-arg">*args</span>, <span class="sig-arg">**kw</span>)</span> </h3> </td><td align="right" valign="top" ><span class="codelink"><a href="lxml.html.clean-pysrc.html#autolink_html">source code</a></span> </td> </tr></table> <p>Turn any URLs into links.</p> <p>It will search for links identified by the given regular expressions (by default mailto and http(s) links).</p> <p>It won't link text in an element in avoid_elements, or an element with a class in avoid_classes. 
It won't link to anything with a host that matches one of the regular expressions in avoid_hosts (by default, localhost and 127.0.0.1).</p> <p>If you pass in an element, only the contents of the element are substituted; the element's tail is left unchanged.</p> <dl class="fields"> </dl> </td></tr></table> </div> <a name="word_break"></a> <div> <table class="details" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr><td> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr valign="top"><td> <h3 class="epydoc"><span class="sig"><span class="sig-name">word_break</span>(<span class="sig-arg">el</span>, <span class="sig-arg">max_width</span>=<span class="sig-default">40</span>, <span class="sig-arg">avoid_elements</span>=<span class="sig-default"><code class="variable-group">[</code><code class="variable-quote">'</code><code class="variable-string">pre</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">textarea</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">code</code><code class="variable-quote">'</code><code class="variable-group">]</code></span>, <span class="sig-arg">avoid_classes</span>=<span class="sig-default"><code class="variable-group">[</code><code class="variable-quote">'</code><code class="variable-string">nobreak</code><code class="variable-quote">'</code><code class="variable-group">]</code></span>, <span class="sig-arg">break_character</span>=<span class="sig-default"><code class="variable-quote">u'</code><code class="variable-string">&#8203;</code><code class="variable-quote">'</code></span>)</span> </h3> </td><td align="right" valign="top" ><span class="codelink"><a href="lxml.html.clean-pysrc.html#word_break">source code</a></span> </td> </tr></table> <p>Breaks any long words found in the body of the text (not attributes).</p> <p>Doesn't 
affect any of the tags in avoid_elements, by default <tt class="rst-docutils literal"><span class="pre">&lt;pre&gt;</span></tt>, <tt class="rst-docutils literal"><span class="pre">&lt;textarea&gt;</span></tt> and <tt class="rst-docutils literal"><span class="pre">&lt;code&gt;</span></tt>.</p> <p>Breaks words by inserting &amp;#8203;, the Unicode zero-width space character. This generally takes up no space in rendering, but does copy as a space, and in monospace contexts usually takes up space.</p> <p>See <a class="rst-reference" href="http://www.cs.tut.fi/~jkorpela/html/nobr.html" target="_top">http://www.cs.tut.fi/~jkorpela/html/nobr.html</a> for a discussion.</p> <dl class="fields"> </dl> </td></tr></table> </div> <br /> <!-- ==================== NAVIGATION BAR ==================== --> <table class="navbar" border="0" width="100%" cellpadding="0" bgcolor="#a0c0ff" cellspacing="0"> <tr valign="middle"> <!-- Home link --> <th> <a href="lxml-module.html">Home</a> </th> <!-- Tree link --> <th> <a href="module-tree.html">Trees</a> </th> <!-- Index link --> <th> <a href="identifier-index.html">Indices</a> </th> <!-- Help link --> <th> <a href="help.html">Help</a> </th> <!-- Project homepage --> <th class="navbar" align="right" width="100%"> <table border="0" cellpadding="0" cellspacing="0"> <tr><th class="navbar" align="center" ><a class="navbar" target="_top" href="http://codespeak.net/lxml/">lxml API</a></th> </tr></table></th> </tr> </table> <table border="0" cellpadding="0" cellspacing="0" width="100%"> <tr> <td align="left" class="footer"> Generated by Epydoc 3.0 on Fri Dec 12 22:40:29 2008 </td> <td align="right" class="footer"> <a target="mainFrame" href="http://epydoc.sourceforge.net" >http://epydoc.sourceforge.net</a> </td> </tr> </table> <script type="text/javascript"> <!-- // Private objects are initially displayed (because if // javascript is turned off then we want them to be // visible); but by default, we want to hide them. So hide // them unless we have a cookie that says to show them. 
checkCookie(); // --> </script> </body> </html>