<?xml version="1.0" encoding="ascii"?> <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "DTD/xhtml1-transitional.dtd"> <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en"> <head> <title>lxml.html.clean.Cleaner</title> <link rel="stylesheet" href="epydoc.css" type="text/css" /> <script type="text/javascript" src="epydoc.js"></script> </head> <body bgcolor="white" text="black" link="blue" vlink="#204080" alink="#204080"> <!-- ==================== NAVIGATION BAR ==================== --> <table class="navbar" border="0" width="100%" cellpadding="0" bgcolor="#a0c0ff" cellspacing="0"> <tr valign="middle"> <!-- Home link --> <th> <a href="lxml-module.html">Home</a> </th> <!-- Tree link --> <th> <a href="module-tree.html">Trees</a> </th> <!-- Index link --> <th> <a href="identifier-index.html">Indices</a> </th> <!-- Help link --> <th> <a href="help.html">Help</a> </th> <!-- Project homepage --> <th class="navbar" align="right" width="100%"> <table border="0" cellpadding="0" cellspacing="0"> <tr><th class="navbar" align="center" ><a class="navbar" target="_top" href="http://codespeak.net/lxml/">lxml API</a></th> </tr></table></th> </tr> </table> <table width="100%" cellpadding="0" cellspacing="0"> <tr valign="top"> <td width="100%"> <span class="breadcrumbs"> <a href="lxml-module.html">Package lxml</a> :: <a href="lxml.html-module.html">Package html</a> :: <a href="lxml.html.clean-module.html">Module clean</a> :: Class Cleaner </span> </td> <td> <table cellpadding="0" cellspacing="0"> <!-- hide/show private --> <tr><td align="right"><span class="options" >[<a href="frames.html" target="_top">frames</a >] | <a href="lxml.html.clean.Cleaner-class.html" target="_top">no frames</a>]</span></td></tr> </table> </td> </tr> </table> <!-- ==================== CLASS DESCRIPTION ==================== --> <h1 class="epydoc">Class Cleaner</h1><p class="nomargin-top"><span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner">source code</a></span></p> 
<pre class="base-tree"> object --+ | <strong class="uidshort">Cleaner</strong> </pre> <hr />
<p>Instances clean a document of each of the possible offending elements. The cleaning is controlled by attributes; you can override the attributes in a subclass or set them in the constructor.</p>
<dl class="rst-docutils">
<dt><tt class="rst-docutils literal"><span class="pre">scripts</span></tt>:</dt> <dd>Removes any <tt class="rst-docutils literal"><span class="pre">&lt;script&gt;</span></tt> tags.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">javascript</span></tt>:</dt> <dd>Removes any JavaScript, such as an <tt class="rst-docutils literal"><span class="pre">onclick</span></tt> attribute.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">comments</span></tt>:</dt> <dd>Removes any comments.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">style</span></tt>:</dt> <dd>Removes any style tags or attributes.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">links</span></tt>:</dt> <dd>Removes any <tt class="rst-docutils literal"><span class="pre">&lt;link&gt;</span></tt> tags.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">meta</span></tt>:</dt> <dd>Removes any <tt class="rst-docutils literal"><span class="pre">&lt;meta&gt;</span></tt> tags.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">page_structure</span></tt>:</dt> <dd>Removes structural parts of a page: <tt class="rst-docutils literal"><span class="pre">&lt;head&gt;</span></tt>, <tt class="rst-docutils literal"><span class="pre">&lt;html&gt;</span></tt>, <tt class="rst-docutils literal"><span class="pre">&lt;title&gt;</span></tt>.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">processing_instructions</span></tt>:</dt> <dd>Removes any processing instructions.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">embedded</span></tt>:</dt> <dd>Removes any embedded objects (Flash, iframes).</dd>
<dt><tt class="rst-docutils literal"><span class="pre">frames</span></tt>:</dt>
<dd>Removes any frame-related tags.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">forms</span></tt>:</dt> <dd>Removes any form tags.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">annoying_tags</span></tt>:</dt> <dd>Removes tags that aren't <em>wrong</em>, but are annoying: <tt class="rst-docutils literal"><span class="pre">&lt;blink&gt;</span></tt> and <tt class="rst-docutils literal"><span class="pre">&lt;marquee&gt;</span></tt>.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">remove_tags</span></tt>:</dt> <dd>A list of tags to remove.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">allow_tags</span></tt>:</dt> <dd>A list of tags to include (by default, all tags are included).</dd>
<dt><tt class="rst-docutils literal"><span class="pre">remove_unknown_tags</span></tt>:</dt> <dd>Remove any tags that aren't standard parts of HTML.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">safe_attrs_only</span></tt>:</dt> <dd>If true, only include 'safe' attributes (specifically the list from <a class="rst-reference" href="http://feedparser.org/docs/html-sanitization.html" target="_top">feedparser</a>).</dd>
<dt><tt class="rst-docutils literal"><span class="pre">add_nofollow</span></tt>:</dt> <dd>If true, then any <tt class="rst-docutils literal"><span class="pre">&lt;a&gt;</span></tt> tags will have <tt class="rst-docutils literal"><span class="pre">rel="nofollow"</span></tt> added to them.</dd>
<dt><tt class="rst-docutils literal"><span class="pre">host_whitelist</span></tt>:</dt> <dd><p class="rst-first">A list or set of hosts that you can use for embedded content (for content like <tt class="rst-docutils literal"><span class="pre">&lt;object&gt;</span></tt>, <tt class="rst-docutils literal"><span class="pre">&lt;link</span> <span class="pre">rel="stylesheet"&gt;</span></tt>, etc.).
You can also implement/override the method <tt class="rst-docutils literal"><span class="pre">allow_embedded_url(el,</span> <span class="pre">url)</span></tt> or <tt class="rst-docutils literal"><span class="pre">allow_element(el)</span></tt> to implement more complex rules for what can be embedded. Anything that passes this test will be shown, regardless of the value of (for instance) <tt class="rst-docutils literal"><span class="pre">embedded</span></tt>.</p> <p class="rst-last">Note that this parameter might not work as intended if you do not make the links absolute before doing the cleaning.</p> </dd> <dt><tt class="rst-docutils literal"><span class="pre">whitelist_tags</span></tt>:</dt> <dd>A set of tags that can be included with <tt class="rst-docutils literal"><span class="pre">host_whitelist</span></tt>. The default is <tt class="rst-docutils literal"><span class="pre">iframe</span></tt> and <tt class="rst-docutils literal"><span class="pre">embed</span></tt>; you may wish to include other tags like <tt class="rst-docutils literal"><span class="pre">script</span></tt>, or you may want to implement <tt class="rst-docutils literal"><span class="pre">allow_embedded_url</span></tt> for more control. 
Set to None to include all tags.</dd> </dl> <p>This modifies the document <em>in place</em>.</p> <!-- ==================== INSTANCE METHODS ==================== --> <a name="section-InstanceMethods"></a> <table class="summary" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr bgcolor="#70b0f0" class="table-header"> <td align="left" colspan="2" class="table-header"> <span class="table-header">Instance Methods</span></td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a href="lxml.html.clean.Cleaner-class.html#__init__" class="summary-sig-name">__init__</a>(<span class="summary-sig-arg">self</span>, <span class="summary-sig-arg">**kw</span>)</span><br /> x.__init__(...) initializes x; see x.__class__.__doc__ for signature</td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.__init__">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a name="__call__"></a><span class="summary-sig-name">__call__</span>(<span class="summary-sig-arg">self</span>, <span class="summary-sig-arg">doc</span>)</span><br /> Cleans the document.</td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.__call__">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a name="allow_follow"></a><span 
class="summary-sig-name">allow_follow</span>(<span class="summary-sig-arg">self</span>, <span class="summary-sig-arg">anchor</span>)</span><br /> Override to suppress rel="nofollow" on some anchors.</td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.allow_follow">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a name="allow_element"></a><span class="summary-sig-name">allow_element</span>(<span class="summary-sig-arg">self</span>, <span class="summary-sig-arg">el</span>)</span></td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.allow_element">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a name="allow_embedded_url"></a><span class="summary-sig-name">allow_embedded_url</span>(<span class="summary-sig-arg">self</span>, <span class="summary-sig-arg">el</span>, <span class="summary-sig-arg">url</span>)</span></td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.allow_embedded_url">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a href="lxml.html.clean.Cleaner-class.html#kill_conditional_comments" class="summary-sig-name">kill_conditional_comments</a>(<span class="summary-sig-arg">self</span>, <span 
class="summary-sig-arg">doc</span>)</span><br /> IE conditional comments basically embed HTML that the parser doesn't normally see.</td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.kill_conditional_comments">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr> <td><span class="summary-sig"><a name="clean_html"></a><span class="summary-sig-name">clean_html</span>(<span class="summary-sig-arg">self</span>, <span class="summary-sig-arg">html</span>)</span></td> <td align="right" valign="top"> <span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.clean_html">source code</a></span> </td> </tr> </table> </td> </tr> <tr> <td colspan="2" class="summary"> <p class="indent-wrapped-lines"><b>Inherited from <code>object</code></b>: <code>__delattr__</code>, <code>__getattribute__</code>, <code>__hash__</code>, <code>__new__</code>, <code>__reduce__</code>, <code>__reduce_ex__</code>, <code>__repr__</code>, <code>__setattr__</code>, <code>__str__</code> </p> </td> </tr> </table> <!-- ==================== CLASS VARIABLES ==================== --> <a name="section-ClassVariables"></a> <table class="summary" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr bgcolor="#70b0f0" class="table-header"> <td align="left" colspan="2" class="table-header"> <span class="table-header">Class Variables</span></td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="scripts"></a><span class="summary-name">scripts</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="javascript"></a><span 
class="summary-name">javascript</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="comments"></a><span class="summary-name">comments</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="style"></a><span class="summary-name">style</span> = <code title="False">False</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="links"></a><span class="summary-name">links</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="meta"></a><span class="summary-name">meta</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="page_structure"></a><span class="summary-name">page_structure</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="processing_instructions"></a><span class="summary-name">processing_instructions</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="embedded"></a><span class="summary-name">embedded</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="frames"></a><span class="summary-name">frames</span> = <code title="True">True</code> 
</td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="forms"></a><span class="summary-name">forms</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="annoying_tags"></a><span class="summary-name">annoying_tags</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="remove_tags"></a><span class="summary-name">remove_tags</span> = <code title="None">None</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="allow_tags"></a><span class="summary-name">allow_tags</span> = <code title="None">None</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="remove_unknown_tags"></a><span class="summary-name">remove_unknown_tags</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="safe_attrs_only"></a><span class="summary-name">safe_attrs_only</span> = <code title="True">True</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="add_nofollow"></a><span class="summary-name">add_nofollow</span> = <code title="False">False</code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="host_whitelist"></a><span class="summary-name">host_whitelist</span> = <code title="()"><code 
class="variable-group">(</code><code class="variable-group">)</code></code> </td> </tr> <tr> <td width="15%" align="right" valign="top" class="summary"> <span class="summary-type"> </span> </td><td class="summary"> <a name="whitelist_tags"></a><span class="summary-name">whitelist_tags</span> = <code title="set(['embed', 'iframe'])"><code class="variable-group">set([</code><code class="variable-quote">'</code><code class="variable-string">embed</code><code class="variable-quote">'</code><code class="variable-op">, </code><code class="variable-quote">'</code><code class="variable-string">iframe</code><code class="variable-quote">'</code><code class="variable-group">])</code></code> </td> </tr> </table> <!-- ==================== PROPERTIES ==================== --> <a name="section-Properties"></a> <table class="summary" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr bgcolor="#70b0f0" class="table-header"> <td align="left" colspan="2" class="table-header"> <span class="table-header">Properties</span></td> </tr> <tr> <td colspan="2" class="summary"> <p class="indent-wrapped-lines"><b>Inherited from <code>object</code></b>: <code>__class__</code> </p> </td> </tr> </table> <!-- ==================== METHOD DETAILS ==================== --> <a name="section-MethodDetails"></a> <table class="details" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr bgcolor="#70b0f0" class="table-header"> <td align="left" colspan="2" class="table-header"> <span class="table-header">Method Details</span></td> </tr> </table> <a name="__init__"></a> <div> <table class="details" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr><td> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr valign="top"><td> <h3 class="epydoc"><span class="sig"><span class="sig-name">__init__</span>(<span class="sig-arg">self</span>, <span class="sig-arg">**kw</span>)</span> <br /><em class="fname">(Constructor)</em> 
</h3> </td><td align="right" valign="top" ><span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.__init__">source code</a></span> </td> </tr></table> x.__init__(...) initializes x; see x.__class__.__doc__ for signature <dl class="fields"> <dt>Overrides: object.__init__</dt> <dd><em class="note">(inherited documentation)</em></dd> </dl> </td></tr></table> </div> <a name="kill_conditional_comments"></a> <div> <table class="details" border="1" cellpadding="3" cellspacing="0" width="100%" bgcolor="white"> <tr><td> <table width="100%" cellpadding="0" cellspacing="0" border="0"> <tr valign="top"><td> <h3 class="epydoc"><span class="sig"><span class="sig-name">kill_conditional_comments</span>(<span class="sig-arg">self</span>, <span class="sig-arg">doc</span>)</span> </h3> </td><td align="right" valign="top" ><span class="codelink"><a href="lxml.html.clean-pysrc.html#Cleaner.kill_conditional_comments">source code</a></span> </td> </tr></table> Internet Explorer conditional comments embed HTML that the parser does not normally see. Such hidden markup cannot be allowed, so this method removes any comment that could be conditional.
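<p>A minimal sketch of how a comment might be judged "conditional", working on the text of each comment rather than on a parsed tree. It uses only the Python standard library. The pattern and the names <tt>CONDITIONAL_RE</tt>, <tt>could_be_conditional</tt> and <tt>drop_conditional</tt> are assumptions made for illustration, not lxml's own code.</p>

```python
import re

# Assumed heuristic: comment text that starts an "[if ...]>" clause is
# treated as a possible IE conditional comment. Case-insensitive, and
# "." may span newlines so multi-line conditions are still caught.
CONDITIONAL_RE = re.compile(r"\[if\s+.*?\]\s*>", re.I | re.S)


def could_be_conditional(comment_text):
    """Return True if the text of a comment looks like a conditional comment."""
    return bool(CONDITIONAL_RE.search(comment_text))


def drop_conditional(comment_texts):
    """Keep only the comment texts that do not look conditional."""
    return [text for text in comment_texts if not could_be_conditional(text)]


if __name__ == "__main__":
    samples = [" an ordinary note ", "[if IE]> markup only IE would render"]
    print(drop_conditional(samples))
```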
<dl class="fields"> </dl> </td></tr></table> </div> <br />
<!-- ==================== NAVIGATION BAR ==================== -->
<table class="navbar" border="0" width="100%" cellpadding="0" bgcolor="#a0c0ff" cellspacing="0"> <tr valign="middle"> <!-- Home link --> <th> <a href="lxml-module.html">Home</a> </th> <!-- Tree link --> <th> <a href="module-tree.html">Trees</a> </th> <!-- Index link --> <th> <a href="identifier-index.html">Indices</a> </th> <!-- Help link --> <th> <a href="help.html">Help</a> </th> <!-- Project homepage --> <th class="navbar" align="right" width="100%"> <table border="0" cellpadding="0" cellspacing="0"> <tr><th class="navbar" align="center" ><a class="navbar" target="_top" href="http://codespeak.net/lxml/">lxml API</a></th> </tr></table></th> </tr> </table>
<table border="0" cellpadding="0" cellspacing="0" width="100%"> <tr> <td align="left" class="footer"> Generated by Epydoc 3.0 on Fri Dec 12 22:40:31 2008 </td> <td align="right" class="footer"> <a target="mainFrame" href="http://epydoc.sourceforge.net" >http://epydoc.sourceforge.net</a> </td> </tr> </table>
<script type="text/javascript">
  <!--
  // Private objects are initially displayed (because if
  // javascript is turned off then we want them to be
  // visible); but by default, we want to hide them. So hide
  // them unless we have a cookie that says to show them.
  checkCookie();
  // -->
</script>
</body> </html>