<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"> <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"> <meta name="author" content="John J. Lee <jjl@pobox.com>"> <meta name="date" content="2006-01-05"> <meta name="keywords" content="FAQ,cookie,HTTP,HTML,form,table,Python,web,client,client-side,testing,sniffer,https,script,embedded"> <title>Python web-client programming general FAQs</title> <style type="text/css" media="screen">@import "../styles/style.css";</style> <base href="http://wwwsearch.sourceforge.net/bits/clientx.html"> </head> <body> <div id="sf"><a href="http://sourceforge.net"> <img src="http://sourceforge.net/sflogo.php?group_id=48205&type=2" width="125" height="37" alt="SourceForge.net Logo"></a></div> <!--<img src="../images/sflogo.png"--> <h1>Python web-client programming general FAQs</h1> <div id="Content"> <ul> <li>Is there any example code? <p>There's (still!) a bit of a shortage of example code for ClientCookie and ClientForm &co., because the stuff I've written tends to either require access to restricted-access sites, or is proprietary code (and the same goes for other people's code). <li>HTTPS on Windows? <p>Use this <a href="http://pypgsql.sourceforge.net/misc/python22-win32-ssl.zip"> _socket.pyd</a>, or use Python 2.3. <li>I want to see what my web browser is doing, but standard network sniffers like <a href="http://www.ethereal.com/">ethereal</a> or netcat (nc) don't work for HTTPS. How do I sniff HTTPS traffic? <p>Three good options: <ul> <li>Mozilla plugin: <a href="http://livehttpheaders.mozdev.org/"> livehttpheaders</a>. <li><a href="http://www.blunck.info/iehttpheaders.html">ieHTTPHeaders</a> does the same for MSIE. <li>Use <a href="http://lynx.browser.org/">lynx</a> <code>-trace</code>, and filter out the junk with a script. </ul> <p>I'm told you can also use a proxy like <a href="http://www.proxomitron.info/">proxomitron</a> (never tried it myself). There's also a commercial <a href="http://www.simtec.ltd.uk/">MSIE plugin</a>. <li>Embedded script is messing up my web-scraping. What do I do? <p>It is possible to embed script in HTML pages (sandwiched between <code><SCRIPT>here</SCRIPT></code> tags, and in <code>javascript:</code> URLs) - JavaScript / ECMAScript, VBScript, or even Python. These scripts can do all sorts of things, including causing cookies to be set in a browser, submitting or filling in parts of forms in response to user actions, changing link colours as the mouse moves over a link, etc. <p>If you come across this in a page you want to automate, you have four options. Here they are, roughly in order of simplicity. <ul> <li>Simply figure out what the embedded script is doing and emulate it in your Python code: for example, by manually adding cookies to your <code>CookieJar</code> instance, calling methods on <code>HTMLForm</code>s, calling <code>urlopen</code>, etc. <li>Dump ClientCookie and ClientForm and automate a browser instead. For example use MS Internet Explorer via its COM automation interfaces, using the <a href="http://starship.python.net/crew/mhammond/">Python for Windows extensions</a>, aka pywin32, aka win32all (eg. <a href="http://vsbabu.org/mt/archives/2003/06/13/ie_automation.html">simple function</a>, <a href="http://pamie.sourceforge.net/">pamie</a>; <a href="http://www.oreilly.com/catalog/pythonwin32/chapter/ch12.html"> pywin32 chapter from the O'Reilly book</a>) or <a href="http://starship.python.net/crew/theller/ctypes/">ctypes</a> (<a href="http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/305273"> example</a>: may be out of date, since <code>ctypes</code>' COM support is still evolving). <a href="http://www.brunningonline.net/simon/blog/archives/winGuiAuto.py.html">This</a> kind of thing may also come in useful on Windows for cases where the automation API is lacking. <a href="http://ftp.acc.umu.se/pub/GNOME/sources/pyphany/">pyphany</a> is a binding to the <a href="http://www.gnome.org/projects/epiphany/"> epiphany web browser</a>, allowing both plugins and automation code to be written in Python. XXX Mozilla automation & XPCOM / PyXPCOM, Konqueror & DCOP / KParts / PyKDE). <li>Use Java's <a href="httpunit.sourceforge.net">httpunit</a> from Jython, since it knows some JavaScript. <li>Get ambitious and automatically delegate the work to an appropriate interpreter (Mozilla's JavaScript interpreter, for instance). This approach is the one taken by <a href="../DOMForm">DOMForm</a> (the JavaScript support is "very alpha", though!). </ul> <li>Misc links <ul> <li><a href="http://www.crummy.com/software/BeautifulSoup/">Beautiful Soup</a> is a widely recommended HTML-parsing module. <li><a href="http://linux.duke.edu/projects/urlgrabber/">urlgrabber</a> contains useful stuff like persistent connections, mirroring and throttling, and it looks like most or all of it is well-integrated with <code>urllib2</code> (originally part of the yum package manager, but now becoming a separate project). <li>Another Java thing: <a href="http://maxq.tigris.org/">maxq</a>, which provides a proxy to aid automatic generation of functional tests written in Jython using the standard library unittest module (PyUnit) and the "Jakarta Commons" HttpClient library. <li>A useful set Zope-oriented links on <a href="http://viii.dclxvi.org/bookmarks/tech/zope/test">tools for testing web applications</a>. <li>O'Reilly book: <a href="">Spidering Hacks</a>. Very Perl-oriented. <li>Useful <a href="http://chrispederick.com/work/webdeveloper/"> Firefox extension </a> which, amongst other things, can display HTML form information and HTML table structure(thanks to Erno Kuusela for this link). <li> <a href="http://www.openqa.org/selenium/">Selenium</a>: In-browser web functional testing. <li><a href="http://www.opensourcetesting.org/functional.php">Open source functional testing tools</a>. A nice list. <li><a href="http://www.rexx.com/~dkuhlman/quixote_htmlscraping.html"> A HOWTO on web scraping</a> from Dave Kuhlman. </ul> <li>Will any of this code make its way into the Python standard library? <p>The request / response processing extensions to <code>urllib2</code> from ClientCookie have been merged into <code>urllib2</code> for Python 2.4. The cookie processing has been added, as module <code>cookielib</code>. Eventually, I'll submit patches to get the http-equiv, refresh, and robots.txt code in there too, and maybe <code>mechanize.UserAgent</code> too (but <em>not</em> <code>mechanize.Browser</code>). The rest, probably not. </ul> <p>I prefer questions and comments to be sent to the <a href="http://lists.sourceforge.net/lists/listinfo/wwwsearch-general"> mailing list</a> rather than direct to me. <p><a href="mailto:jjl@pobox.com">John J. Lee</a>, January 2006. </div> <!--id="Content"--> <div id="Menu"> <a href="..">Home</a><br> <br> <span class="thispage">General FAQs</span><br> <br> <a href="../mechanize/">mechanize</a><br> <a href="../pullparser/">pullparser</a><br> <a href="../ClientCookie/">ClientCookie</a><br> <a href="../ClientCookie/doc.html"><span class="subpage">ClientCookie docs</span></a><br> <a href="../ClientForm/">ClientForm</a><br> <br> <a href="../DOMForm/">DOMForm</a><br> <a href="../python-spidermonkey/">python-spidermonkey</a><br> <a href="../ClientTable/">ClientTable</a><br> <a href="../bits/urllib2_152.py">1.5.2 urllib2.py</a><br> <a href="../bits/urllib_152.py">1.5.2 urllib.py</a><br> <br> </body> </html>