<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
    <title>PyZMQ and Unicode &mdash; PyZMQ v2.2.0.1 documentation</title>
    <link rel="stylesheet" href="_static/default.css" type="text/css" />
    <link rel="stylesheet" href="_static/pygments.css" type="text/css" />
    <script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    '',
        VERSION:     '2.2.0.1',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
    <script type="text/javascript" src="_static/jquery.js"></script>
    <script type="text/javascript" src="_static/underscore.js"></script>
    <script type="text/javascript" src="_static/doctools.js"></script>
    <link rel="shortcut icon" href="_static/zeromq.ico"/>
    <link rel="top" title="PyZMQ v2.2.0.1 documentation" href="index.html" />
    <link rel="next" title="More Than Just Bindings" href="morethanbindings.html" />
    <link rel="prev" title="PyZMQ, Python2.5, and Python3" href="pyversions.html" /> 
  </head>
  <body>

<div style="background-color: white; text-align: left; padding: 10px 10px 15px 15px">
<a href="index.html"><img src="_static/logo.png" border="0" alt="PyZMQ Documentation"/></a>
</div>

    <div class="related">
      <h3>Navigation</h3>
      <ul>
        <li class="right" style="margin-right: 10px">
          <a href="genindex.html" title="General Index"
             accesskey="I">index</a></li>
        <li class="right" >
          <a href="py-modindex.html" title="Python Module Index"
             >modules</a> |</li>
        <li class="right" >
          <a href="morethanbindings.html" title="More Than Just Bindings"
             accesskey="N">next</a> |</li>
        <li class="right" >
          <a href="pyversions.html" title="PyZMQ, Python2.5, and Python3"
             accesskey="P">previous</a> |</li>
        <li><a href="index.html">home</a>|&nbsp;</li>
        <li><a href="search.html">search</a>|&nbsp;</li>
       <li><a href="api/index.html">API</a> &raquo;</li>
 
      </ul>
    </div>

      <div class="sphinxsidebar">
        <div class="sphinxsidebarwrapper">
  <h3><a href="index.html">Table Of Contents</a></h3>
  <ul>
<li><a class="reference internal" href="#">PyZMQ and Unicode</a><ul>
<li><a class="reference internal" href="#first-unicode-in-python-2-and-3">First, Unicode in Python 2 and 3</a><ul>
<li><a class="reference internal" href="#unicode-buffers">Unicode Buffers</a></li>
</ul>
</li>
<li><a class="reference internal" href="#what-this-means-for-pyzmq">What This Means for PyZMQ</a><ul>
<li><a class="reference internal" href="#the-methods">The Methods</a></li>
</ul>
</li>
</ul>
</li>
</ul>

  <h4>Previous topic</h4>
  <p class="topless"><a href="pyversions.html"
                        title="previous chapter">PyZMQ, Python2.5, and Python3</a></p>
  <h4>Next topic</h4>
  <p class="topless"><a href="morethanbindings.html"
                        title="next chapter">More Than Just Bindings</a></p>
  <h3>This Page</h3>
  <ul class="this-page-menu">
    <li><a href="_sources/unicode.txt"
           rel="nofollow">Show Source</a></li>
  </ul>
<div id="searchbox" style="display: none">
  <h3>Quick search</h3>
    <form class="search" action="search.html" method="get">
      <input type="text" name="q" size="18" />
      <input type="submit" value="Go" />
      <input type="hidden" name="check_keywords" value="yes" />
      <input type="hidden" name="area" value="default" />
    </form>
    <p class="searchtip" style="font-size: 90%">
    Enter search terms or a module, class or function name.
    </p>
</div>
<script type="text/javascript">$('#searchbox').show(0);</script>
        </div>
      </div>

    <div class="document">
      <div class="documentwrapper">
        <div class="bodywrapper">
          <div class="body">
            
  <div class="section" id="pyzmq-and-unicode">
<span id="unicode"></span><h1>PyZMQ and Unicode<a class="headerlink" href="#pyzmq-and-unicode" title="Permalink to this headline">¶</a></h1>
<p>PyZMQ is built with an eye towards an easy transition to Python 3, and part of
that is dealing with unicode strings. This is an overview of some of what we
found, and what it means for PyZMQ.</p>
<div class="section" id="first-unicode-in-python-2-and-3">
<h2>First, Unicode in Python 2 and 3<a class="headerlink" href="#first-unicode-in-python-2-and-3" title="Permalink to this headline">¶</a></h2>
<p>In Python &lt; 3, a <tt class="docutils literal"><span class="pre">str</span></tt> object is really a C string with some sugar - a
specific series of bytes with some fun methods like <tt class="docutils literal"><span class="pre">endswith()</span></tt> and
<tt class="docutils literal"><span class="pre">split()</span></tt>. In 2.0, the <tt class="docutils literal"><span class="pre">unicode</span></tt> object was added, which handles different
methods of encoding. In Python 3, however, the meaning of <tt class="docutils literal"><span class="pre">str</span></tt> changes. A
<tt class="docutils literal"><span class="pre">str</span></tt> in Python 3 is a full unicode object, with encoding and everything. If
you want a C string with some sugar, there is a new object called <tt class="docutils literal"><span class="pre">bytes</span></tt>,
that behaves much like the 2.x <tt class="docutils literal"><span class="pre">str</span></tt>. The idea is that for a user, a string is
a series of <em>characters</em>, not a series of bytes. For simple ASCII, the two are
interchangeable, but if you consider accents and non-Latin characters, then the
character meaning of byte sequences can be ambiguous, since it depends on the
encoding scheme. The Python 3 designers decided to avoid the ambiguity by forcing users who want
the actual bytes to specify the encoding every time they want to convert a
string to bytes. That way, users are aware of the difference between a series of
bytes and a collection of characters, and don&#8217;t confuse the two, as happens in
Python 2.x.</p>
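<p>As a minimal Python 3 sketch of that ambiguity (the sample word is arbitrary and
chosen only for illustration), the same string produces different bytes under
different encodings, which is why the conversion always has to name one:</p>

```python
# Python 3: str is a sequence of characters, bytes is a sequence of byte values.
text = "caf\u00e9"  # four characters, the last one is an accented 'e'

utf8 = text.encode("utf-8")      # the accented character becomes two bytes
latin1 = text.encode("latin-1")  # the same character becomes one byte

print(len(text), len(utf8), len(latin1))

# Decoding with the codec that produced the bytes restores the text;
# decoding with a different codec yields different characters.
round_trip_ok = utf8.decode("utf-8") == text
wrong_codec_ok = utf8.decode("latin-1") == text
print(round_trip_ok, wrong_codec_ok)
```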
<p>The problems (on both sides) come from the fact that regardless of the language
design, users are mostly going to use <tt class="docutils literal"><span class="pre">str</span></tt> objects to represent collections
of characters, and the behavior of that object is dramatically different in
certain aspects between the 2.x <tt class="docutils literal"><span class="pre">bytes</span></tt> approach and the 3.x <tt class="docutils literal"><span class="pre">unicode</span></tt>
approach. The <tt class="docutils literal"><span class="pre">unicode</span></tt> approach has the advantage of removing byte ambiguity
- it&#8217;s a list of characters, not bytes. However, if you really do want the
bytes, it&#8217;s very inefficient to get them. The <tt class="docutils literal"><span class="pre">bytes</span></tt> approach has the
advantage of efficiency. A <tt class="docutils literal"><span class="pre">bytes</span></tt> object really is just a char* pointer with
some methods to be used on it, so interacting with C
code is highly efficient and straightforward. However, understanding a
bytes object as a string with extended characters introduces ambiguity and
possibly confusion.</p>
<p>To avoid ambiguity, hereafter we will refer to encoded C arrays as &#8216;bytes&#8217; and
abstract unicode objects as &#8216;strings&#8217;.</p>
<div class="section" id="unicode-buffers">
<h3>Unicode Buffers<a class="headerlink" href="#unicode-buffers" title="Permalink to this headline">¶</a></h3>
<p>Since unicode objects have a wide range of representations, they are not stored
as the bytes according to their encoding, but rather in a format called UCS (an
older fixed-width Unicode format). On some platforms (OS X, Windows), the storage
is UCS-2, which is 2 bytes per character. On most *ix systems, it is UCS-4, or
4 bytes per character. The contents of the <em>buffer</em> of a <tt class="docutils literal"><span class="pre">unicode</span></tt> object are
not encoding dependent (always UCS-2 or UCS-4), but they are <em>platform</em>
dependent. As a result of this, and the further insistence on not interpreting
<tt class="docutils literal"><span class="pre">unicode</span></tt> objects as bytes without specifying encoding, <tt class="docutils literal"><span class="pre">str</span></tt> objects in
Python 3 don&#8217;t even provide the buffer interface. You simply cannot get the raw
bytes of a <tt class="docutils literal"><span class="pre">unicode</span></tt> object without specifying the encoding for the bytes. In
Python 2.x, you can get to the raw buffer, but the platform dependence and the
fact that the encoding of the buffer is not the encoding of the object makes it
very confusing, so this is probably a good move.</p>
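<p>The Python 3 behaviour described above can be checked with the built-in
<tt class="docutils literal"><span class="pre">memoryview</span></tt>. This is only a small sketch; the variable names are
illustrative and nothing here is specific to PyZMQ:</p>

```python
# Python 3: str does not expose the buffer interface, but bytes does.
text = "hello"

try:
    memoryview(text)
    str_has_buffer = True
except TypeError:
    # A bytes-like object is required, so a str is refused.
    str_has_buffer = False

# Encoding first yields bytes, which can be viewed as a raw buffer.
view = memoryview(text.encode("utf-8"))

print(str_has_buffer)
print(view.nbytes)
```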
<p>The efficiency problem here comes from the fact that simple ASCII strings are 4x
as big in memory as they need to be on most Linux systems, and 2x on other platforms.
Also, to translate to/from C code that works with char*, you always have to copy
data and encode/decode the bytes. This really is horribly inefficient from a
memory standpoint. Essentially, where memory efficiency matters to you, you
should never use strings; use bytes. The problem is that users will almost
always use <tt class="docutils literal"><span class="pre">str</span></tt>, and in 2.x they are efficient, but in 3.x they are not. We
want to make sure that we don&#8217;t help the user make this mistake, so we ensure
that zmq methods don&#8217;t try to hide what strings really are.</p>
</div>
</div>
<div class="section" id="what-this-means-for-pyzmq">
<h2>What This Means for PyZMQ<a class="headerlink" href="#what-this-means-for-pyzmq" title="Permalink to this headline">¶</a></h2>
<p>PyZMQ is a wrapper for a C library, so it really should use bytes, since a
string is not a simple wrapper for <tt class="docutils literal"><span class="pre">char</span> <span class="pre">*</span></tt> like it used to be, but an
abstract sequence of characters. The representations of bytes in Python are
either the <tt class="docutils literal"><span class="pre">bytes</span></tt> object itself, or any object that provides the buffer
interface (aka memoryview). In Python 2.x, unicode objects do provide the buffer
interface, but in Python 3 they do not, so wherever pyzmq requires bytes, we
specifically reject unicode objects.</p>
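<p>To illustrate what "provides the buffer interface" buys you, here is a short
sketch using only built-in types. The <tt class="docutils literal"><span class="pre">bytearray</span></tt> is an assumed stand-in for
any buffer-providing object; the point is that a view shares memory with its
source, while converting to <tt class="docutils literal"><span class="pre">bytes</span></tt> makes a copy:</p>

```python
# A bytearray stands in for any object that provides the buffer interface.
data = bytearray(b"hello world")

view = memoryview(data)  # a view over the same memory, no copy is made
head = view[:5]          # slicing a memoryview also avoids copying

data[0] = ord("J")               # change the underlying buffer...
seen_through_view = bytes(head)  # ...the view reflects it; bytes() copies

data[0] = ord("h")  # later changes do not affect the earlier copy
print(seen_through_view, bytes(head))
```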
<p>The relevant methods here are <tt class="docutils literal"><span class="pre">socket.send/recv</span></tt>, <tt class="docutils literal"><span class="pre">socket.get/setsockopt</span></tt>,
<tt class="docutils literal"><span class="pre">socket.bind/connect</span></tt>. The important consideration for send/recv and
set/getsockopt is that when you put in something, you really should get the same
object back with its partner method. We can easily coerce unicode objects to
bytes with send/setsockopt, but the problem is that the paired
recv/getsockopt methods always return bytes, and there should be symmetry. We certainly
shouldn&#8217;t try to always decode on the retrieval side, because if users just want
bytes, then we are potentially using up enormous amounts of excess memory
unnecessarily, due to copying and the larger memory footprint of unicode strings.</p>
<p>Still, we recognize the fact that users will quite frequently have unicode
strings that they want to send, so we have added <tt class="docutils literal"><span class="pre">socket.&lt;method&gt;_string()</span></tt>
wrappers. These methods simply wrap their bytes counterpart by encoding
to/decoding from bytes around them, and they all take an <cite>encoding</cite> keyword
argument that defaults to utf-8. Since encoding and decoding are necessary to
translate between unicode and bytes, it is impossible to perform non-copying
actions with these wrappers.</p>
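<p>The wrapper pattern can be sketched without a real socket. The
<tt class="docutils literal"><span class="pre">BytesChannel</span></tt> class below is a hypothetical in-memory stand-in written for
Python 3; it is not part of pyzmq and is not how pyzmq is implemented. It only
mirrors the rule described here: the plain methods deal in bytes, and the
<tt class="docutils literal"><span class="pre">_string</span></tt> methods encode and decode around them:</p>

```python
from collections import deque


class BytesChannel:
    """Hypothetical in-memory stand-in for a socket; not part of pyzmq."""

    def __init__(self):
        self._queue = deque()

    def send(self, data):
        # Text is rejected outright; anything bytes-like is stored as bytes.
        if isinstance(data, str):
            raise TypeError("str not allowed, use send_string")
        self._queue.append(bytes(data))

    def recv(self):
        # The retrieval side always hands back bytes.
        return self._queue.popleft()

    def send_string(self, s, encoding="utf-8"):
        # Encoding necessarily creates a new bytes object.
        self.send(s.encode(encoding))

    def recv_string(self, encoding="utf-8"):
        # Decoding necessarily creates a new str object.
        return self.recv().decode(encoding)


chan = BytesChannel()
chan.send_string("na\u00efve")
chan.send_string("na\u00efve")
as_bytes = chan.recv()        # the raw UTF-8 encoded bytes
as_text = chan.recv_string()  # the same message decoded back to text

try:
    chan.send("text")
    rejected = False
except TypeError:
    rejected = True
```

<p>Keeping the two families separate means a caller who only wants bytes never
pays for a decode, and a caller who wants text has to say so explicitly.</p>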
<p><tt class="docutils literal"><span class="pre">socket.bind/connect</span></tt> methods are different from these, in that they are
strictly setters and there is no corresponding getter method. As a result, we
feel that we can safely coerce unicode objects to bytes (always to utf-8) in
these methods.</p>
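<p>A rough Python 3 sketch of that coercion follows. The helper name
<tt class="docutils literal"><span class="pre">coerce_address</span></tt> and the example address are assumptions made for
illustration, not pyzmq functions:</p>

```python
def coerce_address(addr):
    """Hypothetical helper: accept str or bytes, always return bytes."""
    if isinstance(addr, str):
        # Text addresses are always encoded as UTF-8.
        return addr.encode("utf-8")
    if isinstance(addr, bytes):
        return addr
    raise TypeError("expected str or bytes, got %r" % type(addr))


from_text = coerce_address("tcp://127.0.0.1:5555")
from_bytes = coerce_address(b"tcp://127.0.0.1:5555")
print(from_text == from_bytes)
```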
<div class="admonition note">
<p class="first admonition-title">Note</p>
<p class="last">For cross-language symmetry (including Python 3), the <tt class="docutils literal"><span class="pre">_unicode</span></tt> methods
are now <tt class="docutils literal"><span class="pre">_string</span></tt>. Many languages have a notion of native strings, and
the use of <tt class="docutils literal"><span class="pre">_unicode</span></tt> was wedded too closely to the name of such objects
in Python 2.  For the time being, anywhere you see <tt class="docutils literal"><span class="pre">_string</span></tt>, <tt class="docutils literal"><span class="pre">_unicode</span></tt>
also works, and is the only option in pyzmq ≤ 2.1.11.</p>
</div>
<div class="section" id="the-methods">
<h3>The Methods<a class="headerlink" href="#the-methods" title="Permalink to this headline">¶</a></h3>
<p>Overview of the relevant methods:</p>
<dl class="function">
<dt id="socket.bind">
<tt class="descclassname">socket.</tt><tt class="descname">bind</tt><big>(</big><em>self</em>, <em>addr</em><big>)</big><a class="headerlink" href="#socket.bind" title="Permalink to this definition">¶</a></dt>
<dd><p><cite>addr</cite> is <tt class="docutils literal"><span class="pre">bytes</span></tt> or <tt class="docutils literal"><span class="pre">unicode</span></tt>. If <tt class="docutils literal"><span class="pre">unicode</span></tt>,
encoded to utf-8 <tt class="docutils literal"><span class="pre">bytes</span></tt></p>
</dd></dl>

<dl class="function">
<dt id="socket.connect">
<tt class="descclassname">socket.</tt><tt class="descname">connect</tt><big>(</big><em>self</em>, <em>addr</em><big>)</big><a class="headerlink" href="#socket.connect" title="Permalink to this definition">¶</a></dt>
<dd><p><cite>addr</cite> is <tt class="docutils literal"><span class="pre">bytes</span></tt> or <tt class="docutils literal"><span class="pre">unicode</span></tt>. If <tt class="docutils literal"><span class="pre">unicode</span></tt>,
encoded to utf-8 <tt class="docutils literal"><span class="pre">bytes</span></tt></p>
</dd></dl>

<dl class="function">
<dt id="socket.send">
<tt class="descclassname">socket.</tt><tt class="descname">send</tt><big>(</big><em>self</em>, <em>object obj</em>, <em>flags=0</em>, <em>copy=True</em><big>)</big><a class="headerlink" href="#socket.send" title="Permalink to this definition">¶</a></dt>
<dd><p><cite>obj</cite> is <tt class="docutils literal"><span class="pre">bytes</span></tt> or provides buffer interface.</p>
<p>if <cite>obj</cite> is <tt class="docutils literal"><span class="pre">unicode</span></tt>, raise <tt class="docutils literal"><span class="pre">TypeError</span></tt></p>
</dd></dl>

<dl class="function">
<dt id="socket.recv">
<tt class="descclassname">socket.</tt><tt class="descname">recv</tt><big>(</big><em>self</em>, <em>flags=0</em>, <em>copy=True</em><big>)</big><a class="headerlink" href="#socket.recv" title="Permalink to this definition">¶</a></dt>
<dd><p>returns <tt class="docutils literal"><span class="pre">bytes</span></tt> if <cite>copy=True</cite></p>
<p>returns <tt class="docutils literal"><span class="pre">zmq.Message</span></tt> if <cite>copy=False</cite>:</p>
<blockquote>
<div><p><cite>message.buffer</cite> is a buffer view of the <tt class="docutils literal"><span class="pre">bytes</span></tt></p>
<p><cite>str(message)</cite> provides the <tt class="docutils literal"><span class="pre">bytes</span></tt></p>
<p><cite>unicode(message)</cite> decodes <cite>message.buffer</cite> with utf-8</p>
</div></blockquote>
</dd></dl>

<dl class="function">
<dt id="socket.send_string">
<tt class="descclassname">socket.</tt><tt class="descname">send_string</tt><big>(</big><em>self</em>, <em>unicode s</em>, <em>flags=0</em>, <em>encoding='utf-8'</em><big>)</big><a class="headerlink" href="#socket.send_string" title="Permalink to this definition">¶</a></dt>
<dd><p>takes a <tt class="docutils literal"><span class="pre">unicode</span></tt> string <cite>s</cite>, and sends the <tt class="docutils literal"><span class="pre">bytes</span></tt>
after encoding without an extra copy, via:</p>
<p><cite>socket.send(s.encode(encoding), flags, copy=False)</cite></p>
</dd></dl>

<dl class="function">
<dt id="socket.recv_string">
<tt class="descclassname">socket.</tt><tt class="descname">recv_string</tt><big>(</big><em>self</em>, <em>flags=0</em>, <em>encoding='utf-8'</em><big>)</big><a class="headerlink" href="#socket.recv_string" title="Permalink to this definition">¶</a></dt>
<dd><p>always returns <tt class="docutils literal"><span class="pre">unicode</span></tt> string</p>
<p>there will be a <tt class="docutils literal"><span class="pre">UnicodeError</span></tt> if it cannot decode the buffer</p>
<p>performs non-copying <cite>recv</cite>, and decodes the buffer with <cite>encoding</cite></p>
</dd></dl>

<dl class="function">
<dt id="socket.setsockopt">
<tt class="descclassname">socket.</tt><tt class="descname">setsockopt</tt><big>(</big><em>self</em>, <em>opt</em>, <em>optval</em><big>)</big><a class="headerlink" href="#socket.setsockopt" title="Permalink to this definition">¶</a></dt>
<dd><p>only accepts <tt class="docutils literal"><span class="pre">bytes</span></tt>  for <cite>optval</cite> (or <tt class="docutils literal"><span class="pre">int</span></tt>, depending on <cite>opt</cite>)</p>
<p><tt class="docutils literal"><span class="pre">TypeError</span></tt> if <tt class="docutils literal"><span class="pre">unicode</span></tt> or anything else</p>
</dd></dl>

<dl class="function">
<dt id="socket.getsockopt">
<tt class="descclassname">socket.</tt><tt class="descname">getsockopt</tt><big>(</big><em>self</em>, <em>opt</em><big>)</big><a class="headerlink" href="#socket.getsockopt" title="Permalink to this definition">¶</a></dt>
<dd><p>returns <tt class="docutils literal"><span class="pre">bytes</span></tt> (or <tt class="docutils literal"><span class="pre">int</span></tt>), never <tt class="docutils literal"><span class="pre">unicode</span></tt></p>
</dd></dl>

<dl class="function">
<dt id="socket.setsockopt_string">
<tt class="descclassname">socket.</tt><tt class="descname">setsockopt_string</tt><big>(</big><em>self</em>, <em>opt</em>, <em>unicode optval</em>, <em>encoding='utf-8'</em><big>)</big><a class="headerlink" href="#socket.setsockopt_string" title="Permalink to this definition">¶</a></dt>
<dd><p>accepts <tt class="docutils literal"><span class="pre">unicode</span></tt> string for <cite>optval</cite></p>
<p>encodes <cite>optval</cite> with <cite>encoding</cite> before passing the <tt class="docutils literal"><span class="pre">bytes</span></tt> to
<cite>setsockopt</cite></p>
</dd></dl>

<dl class="function">
<dt id="socket.getsockopt_string">
<tt class="descclassname">socket.</tt><tt class="descname">getsockopt_string</tt><big>(</big><em>self</em>, <em>opt</em>, <em>encoding='utf-8'</em><big>)</big><a class="headerlink" href="#socket.getsockopt_string" title="Permalink to this definition">¶</a></dt>
<dd><p>always returns <tt class="docutils literal"><span class="pre">unicode</span></tt> string, after decoding with <cite>encoding</cite></p>
<p>note that <cite>zmq.IDENTITY</cite> is the only <cite>sockopt</cite> with a string value
that can be queried with <cite>getsockopt</cite></p>
</dd></dl>

</div>
</div>
</div>


          </div>
        </div>
      </div>
      <div class="clearer"></div>
    </div>
    <div class="footer">
        &copy; Copyright 2010-2011, Brian E. Granger &amp; Min Ragan-Kelley.  
ØMQ logo © iMatix Corporation, used under the Creative Commons Attribution-Share Alike 3.0 License.  
Python logo ™ of the Python Software Foundation, used by Min RK with permission from the Foundation.
      Created using <a href="http://sphinx.pocoo.org/">Sphinx</a> 1.0.7.
    </div>
  </body>
</html>