
SECTION: 900-Content
TITLE: International
QUESTION: How do I work with international characters?
<small>
<I>by chris&#64;harvington.org.uk</I><BR>
</small>
<p>Internet standards support for international character sets was historically 
  weak, is now improving, but is still not perfect. The current standards situation 
  is confusing, and has led to false expectations about what can be reliably achieved 
  using today's browsers, protocols and servers.</p>
<p>Jetty fully supports and implements the current relevant standards and specifications, 
  but this alone is not sufficient to make working with international characters 
  easy or reliable in all situations. This FAQ explains the current standards, 
  provides hints and tips for building and decoding Internationalised web pages 
  (including ones with dynamic data) and explains how Jetty has anticipated the 
  probable future direction of the standards.</p>
<p>The intended readership is people developing Servlet applications and their 
  associated web pages. A basic knowledge of Java, HTML and of hexadecimal notation 
  is assumed.</p>
<p>Unless otherwise stated, the information below applies to all current (August 
  2002) standards-conformant Web Servers, not just to Jetty.</p>
<h3>Primer and Terminology</h3>
<p>There are four groups whose standards and specifications affect character handling 
  in web pages:</p>
<ul>
  <li>The Internet Engineering Task Force (<a href="http://www.ietf.org/">IETF</a>) 
    manages the Requests for Comments (RFCs) which define the basic transport 
    protocols, including the HyperText Transfer Protocol (HTTP) and Uniform Resource 
    Locators (URLs) with which we are concerned.</li>
  <li>The World Wide Web Consortium (<a href="http://www.w3.org/">W3C</a>) manages 
    the HTML and XHTML standards, with which web pages are built, and the XML 
    standard, used in Jetty configuration.</li>
  <li>The Internet Assigned Numbers Authority (<a href="http://www.iana.org">IANA</a>) 
    maintains, among other duties, a list of <a href="http://www.iana.org/assignments/character-sets">character 
    encodings</a>.</li>
  <li>Sun Microsystems, Inc. owns the <a href="http://java.sun.com/products/servlet/"> 
    Servlet</a> specification, which is expressed in <a href="http://java.sun.com">Java</a> 
    APIs.</li>
</ul>
<p>The Internet was initially designed and constructed using basic English characters 
  encoded in the 7-bit US-ASCII character set. This provides 26 upper-case and 
  26 lower-case letters, 10 digits, and a variety of punctuation and other symbols - 
  95 'printing' characters in all. Today, these are still the only printable 
  characters which can be used in HTTP.</p>
<p>A single byte (8 bits) has the capacity to represent up to 256 characters. 
  There are several widely-used encodings which give the US-ASCII characters their 
  normal values in the range 0-127, and assign a selection of other characters 
  code values in the range 128-255. In many web browsers the encoding to use can 
  either be specified by the web page designer or selected by the user. Some of 
  these encodings are proprietary; others are specified by consortia or international 
  standards bodies.</p>
<p>The first approaches to supporting international characters involved selecting 
  one of these 8-bit character sets, and attempting to ensure that the web page 
  source, the browser, and any server using data from that browser were using 
  the same character encoding.</p>
<p>There is a default character set ISO-8859-1, which supports most western European 
  languages, and is currently the official 'default' content encoding for content 
  carried by HTTP (but <i>not</i> for data included in URLs - see below). This 
  is also the default for all Jetty versions <i>except</i> 4.0.x. Alternative 
  encodings may be specified by defining the HTTP <code>Content-Type</code> header 
  of the page to be something like &quot;<code>text/html; charset=ISO-8859-4</code>&quot;. 
  From a Servlet you can use <code>response.setContentType(&quot;text/html; charset=ISO-8859-4&quot;);</code>. 
  Pages can then be composed using the desired literal characters (e.g. 
  accented letters from Europe or, with a different encoding selected, Japanese 
  characters). This mechanism can be unreliable; the browser's user can select 
  the encoding to be applied, which may be different from that intended by the 
  servlet designer. </p>
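<p>As a concrete sketch (the servlet class name and page text below are invented 
  for illustration, not taken from Jetty), a Servlet might declare the encoding 
  like this:</p>
<pre><code>// A minimal sketch of declaring an alternative page encoding.
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class BalticPageServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // Declare the charset before obtaining the writer, so that the
        // container converts the output characters using that encoding.
        response.setContentType("text/html; charset=ISO-8859-4");
        PrintWriter out = response.getWriter();
        out.println("&lt;html&gt;&lt;body&gt;&lt;h3&gt;Labdien!&lt;/h3&gt;&lt;/body&gt;&lt;/html&gt;");
    }
}</code></pre>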
<p>Today the Internet is converging on a single, common encoding - <a href="http://www.unicode.org">Unicode</a> 
  - which can represent all known written languages, as well as a wide 
  range of symbols (e.g. mathematical symbols and decorative marks). Unicode requires 
  a 16-bit integer value to represent each character. By design, the 95 printable 
  US-ASCII characters have the same code values in Unicode; US-ASCII is a subset 
  of Unicode. Most modern browsers can decode and display a wide range of the 
  characters represented by Unicode - but it would be rare to find a browser capable 
  of displaying <i>all</i> the Unicode characters.</p>
<p>Unicode is the only character encoding used in XML and is now the default in 
  HTML, XHTML and in most Java implementations.</p>
<p>The Internet transmits data in groups of 8 bits, which the IETF usually call 
  'octets', but everyone else calls 'bytes'. When larger values have to be sent, 
  such as the 16 bits needed for Unicode and some other international character 
  encodings, there has to be a convention on how the 16 bits are packed into one 
  or more octets. There are two standards commonly used to encode Unicode: UTF-8 
  and UTF-16.</p>
<p>UTF-16 is the 'brute-force' encoding for data transmission. The 16 bits representing 
  the character value are placed in adjacent octets. Variants of UTF-16 place 
  the octets in different orders.</p>
<p>UTF-8 is more common, and is recommended for most purposes. Characters with 
  values less than 0080 (hexadecimal notation) are transmitted as a single octet whose 
  value is that of the character. Characters with values in the range 0080 to 
  07FF are encoded as two octets, whilst the (infrequently-used) Unicode characters 
  with values between 0800 and FFFF are encoded into three octets. This encoding 
  has the really useful property that a sequence of (7-bit) US-ASCII characters 
  sent as bytes and then sent as Unicode UTF-8 octets produces identical octet 
  streams - a US-ASCII byte stream is a valid UTF-8 octet stream and represents 
  the same printing characters. This is <i>not</i> the case if other characters 
  are being sent, or if UTF-16 is in use.</p>
<p>As well as having US-ASCII compatibility, UTF-8 is preferred because, in the 
  majority of situations, it results in shorter messages than UTF-16.</p>
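<p>The packing is easy to observe from Java, where <code>String.getBytes(encoding)</code> 
  produces the octets. A small stand-alone demonstration (the class name is 
  invented for illustration):</p>
<pre><code>// How one string maps to different octet sequences under UTF-8 and UTF-16.
import java.io.UnsupportedEncodingException;

public class EncodingDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String name = "D\u00FCrst";                // u-umlaut is Unicode 00FC

        byte[] utf8 = name.getBytes("UTF-8");      // 6 octets: 44 C3 BC 72 73 74
        byte[] utf16 = name.getBytes("UTF-16BE");  // 10 octets: two per character

        print("UTF-8   ", utf8);
        print("UTF-16BE", utf16);
    }

    private static void print(String label, byte[] octets) {
        StringBuffer sb = new StringBuffer(label + ": ");
        for (int i = 0; i &lt; octets.length; i++) {
            int value = octets[i] &amp; 0xFF;       // treat the octet as unsigned
            if (value &lt; 0x10) sb.append('0');   // zero-pad to two hex digits
            sb.append(Integer.toHexString(value)).append(' ');
        }
        System.out.println(sb);
    }
}</code></pre>
<p>Note that the four US-ASCII letters occupy one octet each in UTF-8, so only 
  the u-umlaut costs an extra octet; under UTF-16 every character costs two.</p>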
<p>Note that, when UTF-8 is specified, it not only defines the way in which the 
  character code values are packed into octets, it also implicitly defines the 
  character encoding in use as Unicode. </p>
<p>There is an international standard - ISO-10646 - which defines an identical 
  character set to Unicode - but omits much essential additional information. 
  For most purposes refer to Unicode rather than ISO-10646.</p>
<h3>Characters included in HTTP requests</h3>
<p>There are two places in which <a href="http://www.w3.org/Protocols/rfc2068/rfc2068">HTTP</a> 
  requests (from browsers to web servers) may include character data:</p>
<ol>
  <li>In the URL part of the first line of the HTTP request,</li>
  <li>In the HTTP content area at the end of the HTTP message, resulting from 
    an <a href="http://www.w3.org/TR/html4/interact/forms.html#h-17.3">HTML</a> 
    <code><br>
    &lt;form method='post'...&gt; ... &lt;/form&gt;<br>
    </code> submission</li>
</ol>
<h4>Characters in the HTTP content part</h4>
<p>Wherever possible, a POST method should be used when international characters 
  are involved. </p>
<p>This is because:</p>
<ol>
  <li>The web browser records the way it has encoded the content data in the MIME 
    headers that precede the content itself,</li>
  <li>The web server uses this MIME header to apply the correct decoding to the 
    character content,</li>
  <li>There are no (practical) limits to the amount of data which can be sent in 
    this way (URL lengths may be limited by the buffers in some web/proxy servers). 
  </li>
</ol>
<p>The HTTP <code>Content-Type</code> header is used to tell the web server's 
  Servlet classes which encoding was used. By the time the characters are made 
  available to the Servlet as a String, they are in the Unicode encoding used by Java. 
</p>
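<p>A sketch of such a POST handler (the servlet and parameter names are invented 
  for illustration):</p>
<pre><code>// The container uses the charset named in the request's Content-Type header
// to decode the form content; getParameter() then returns an ordinary
// (Unicode) Java String.
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class NameServlet extends HttpServlet {
    protected void doPost(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        String name = request.getParameter("name"); // already decoded to Unicode
        response.setContentType("text/html; charset=UTF-8");
        response.getWriter().println("&lt;p&gt;Hello, " + name + "&lt;/p&gt;");
    }
}</code></pre>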
<h4>International characters in URLs</h4>
<p>A typical URL looks like:</p>
<p><code>http://www.my.site/dir/page.html</code></p>
<p>When a form is sent, using the default GET method, the data values from the 
  form are included in the URL, e.g.:</p>
<p><code>http://www.my.site/dir/page.html?name=Durst&amp;age=18</code></p>
<p>It is important to note that only a very restricted sub-set of the US-ASCII 
  printing characters are permitted in URLs. </p>
<p>Something like <code>name=D&uuml;rst</code> (with an umlaut) is illegal. It 
  might work with some browser/server combinations (and might even deliver the 
  expected value), but it should never be used.</p>
<p>The HTTP Specification provides an 'escape' mechanism, which permits arbitrary 
  octet values to be included in the URL. The three-character sequence %HH - where 
  HH is a pair of hexadecimal digits - inserts the specified octet value into the URL. 
  This has to be used for the US-ASCII characters which may not appear literally 
  in URLs, and can be used for other octet values.</p>
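<p>From Java, <code>java.net.URLEncoder</code> produces these escapes, but note 
  that the caller must choose a character encoding: the escape mechanism itself 
  does not record which one was used. A small demonstration (the class name is 
  invented):</p>
<pre><code>// %HH escaping transmits octets, not characters: the same name yields
// different escape sequences under different encodings.
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class EscapeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String name = "D\u00FCrst";
        // UTF-8 encodes u-umlaut as the octets C3 BC:  prints D%C3%BCrst
        System.out.println(URLEncoder.encode(name, "UTF-8"));
        // ISO-8859-1 uses the single octet FC:         prints D%FCrst
        System.out.println(URLEncoder.encode(name, "ISO-8859-1"));
    }
}</code></pre>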
<p>It is a common fallacy that this permits international characters to be reliably 
  transmitted. This is wrong.</p>
<p><b><i>Current standards (July 2002) provide no basis for the reliable transmission 
  of international characters using the URL %HH escape mechanism</i>.</b></p>
<p>This is because the %HH escape permits the transmission of a sequence of octets, 
  but has nothing to say about what character encoding is in use.</p>
<p>There is no provision in the HTTP protocol for the sender (the browser) to 
  tell the receiver (the web server) what encoding has been used in the URI, and 
  none of the specifications related to HTTP/HTML define a default encoding.</p>
<p>Thus, although any desired octet sequence can be placed in a URL, none of the 
  standards tell the web server how to interpret that octet sequence.</p>
<p>The designers of web servers with Servlet APIs currently have a problem. They 
  are presented with an octet stream of unspecified encoding, and yet have to 
  deliver a Java String (a sequence of decoded characters) to the Servlet API.</p>
<p>There is an <a href="http://www.ietf.org/internet-drafts/draft-duerst-iri-01.txt">Internet 
  draft </a>(which expires Oct. 2002) which proposes that URLs should evolve into 
  Internationalized Resource Identifiers (IRIs). This work-in-progress proposes 
  that an octet sequence (generated by a combination of US-ASCII printing characters 
  and %HH escapes) should always represent a UTF-8 character sequence.</p>
<p>The 4.0.x release of Jetty made this assumption of UTF-8, in anticipation of 
  the draft (or something like it) becoming the formal specification. Earlier 
  releases of Jetty (and other web servers) made different assumptions about the 
  encoding of octets outside the 'core' US-ASCII range.</p>
<p>During July 2002 the likely RFC direction seems to have evolved again. It now 
  appears that the character encoding in URIs will always remain 'undefined'. 
  A totally new RFC for IRIs is likely to be proposed.</p>
<p>Accordingly, Jetty 4.1 has reverted to a default encoding of ISO-8859-1.</p>
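<p>The server's dilemma can be made concrete with <code>java.net.URLDecoder</code>: 
  the same escaped octets yield different characters depending on the charset 
  assumed when decoding. A stand-alone sketch (not Jetty code):</p>
<pre><code>// %C3%BC is one u-umlaut character if the octets are assumed to be UTF-8,
// but two (different) characters under ISO-8859-1.
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class QueryDecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String raw = "D%C3%BCrst";
        System.out.println(URLDecoder.decode(raw, "UTF-8"));      // 5 characters
        System.out.println(URLDecoder.decode(raw, "ISO-8859-1")); // 6 characters
    }
}</code></pre>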
<h3>Handling of International characters by browsers</h3>
<p>There are many ways in which international characters can be displayed or placed 
  into browsers for inclusion in HTTP requests. Some examples are: </p>
<ul>
  <li>&lt;h3&gt;D&uuml;rst&lt;/h3&gt; &lt;!-- Assumes specific character encoding 
    --&gt; </li>
  <li>&lt;h3&gt;D&amp;uuml;rst&lt;/h3&gt; </li>
  <li>&lt;h3&gt;D&amp;#252;rst&lt;/h3&gt;</li>
  <li>&lt;h3&gt;D&amp;#xFC;rst&lt;/h3&gt;</li>
  <li>&lt;input type='text' value='D&amp;uuml;rst'&gt;</li>
  <li>&lt;script language='javascript'&gt;name=&quot;D\u00FCrst&quot;;&lt;/script&gt;</li>
  <li>&lt;form action='<code>/servlet?name=D%C3%BCrst</code>'&gt;...&lt;/form&gt;</li>
  <li>&lt;a href='<code>/servlet?name=D%C3%BCrst</code>'&gt;...&lt;/a&gt;</li>
</ul>
<p>It is also possible to manipulate document text using the <a href="http://www.w3.org/DOM">DOM</a> 
  APIs. </p>
<p> It is believed that, in all the above examples, all modern browsers (those 
  supporting HTML 4) will treat the &amp;...; encoding as representing Unicode 
  characters. Earlier ones may not understand this encoding.</p>
<p>The first example, with the literal &uuml;, should only be used if the character 
  encoding can be relied upon, and if support for 'legacy' browsers (those not 
  understanding the &amp;...; encoding) is essential.</p>
<p>It is, of course, possible for users to enter characters using &lt;input..&gt; 
  and &lt;textarea...&gt; elements via the operating system. Text can come from 
  keyboards, and also from 'cut and paste' mechanisms. It appears that most browsers 
  use their current (user-selectable) 'Encoding' setting (e.g. in MSIE: View..Encoding) 
  to encode such characters. After the user has selected the encoding to use, 
  it appears that many browsers will transmit the data characters in the request 
  in that locally-defined encoding, rather than the one specified with the page. 
</p>
<h3>Techniques for working with international characters</h3>
<p>The only reliable, standards-supported way to handle international characters 
  in a browser- and server-neutral way appears to be:</p>
<ol>
  <li>To specify the contents of the HTML page using &amp;...; encoding and</li>
  <li>To use the <code>&lt;form method='post' ...&gt;</code> method of submission 
    (a sketch combining both points follows this list).</li>
</ol>
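<p>A minimal sketch combining both points (the servlet class and form action 
  are invented for illustration):</p>
<pre><code>// Generate the page using only US-ASCII text plus &amp;...; entities, and
// submit it with POST so that the browser declares its content encoding.
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class FormPageServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        response.setContentType("text/html; charset=ISO-8859-1");
        PrintWriter out = response.getWriter();
        out.println("&lt;form method='post' action='/servlet/name'&gt;");
        out.println("  &lt;input type='text' name='name' value='D&amp;uuml;rst'&gt;");
        out.println("  &lt;input type='submit'&gt;");
        out.println("&lt;/form&gt;");
    }
}</code></pre>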
<h3>Need for GET-method support</h3>
<p>It is sometimes suggested that all forms can and should be submitted using 
  the above POST method. There is, in fact, a valid need to use the default GET 
  method. </p>
<p>To appreciate this need one must consider carefully a significant difference 
  between submitting a form using POST and GET. When using GET the data values 
  from the form are appended to the URI and form part of the visible 'address' 
  seen at the top of the browser. It is also this entire string that is saved 
  if the page is bookmarked by the browser, and it may be moved using 'cut-and-paste' 
  into an eMail, another web page, etc.</p>
<p>It is possible that the dynamic data from the form is an integral part of the 
  semantics of the 'page address' to be stored. </p>
<p>The address may be part of a map; one of the data values from the map may define 
  the town on which the map view is to be centered - and this name may only be 
  expressible in, say, Thai characters. The town name may have been entered or 
  selected by the user - it was not a 'literal' in the HTML defining the page 
  or in the URL.</p>
<p>Another common need is to 'bookmark' or send by eMail the request string from 
  a search engine request which has non-ASCII characters in the search.</p>
<p>There is not yet any standards-based way of constructing this dynamically-defined 
  URL - there is no direct way to be certain what character encodings have been 
  applied in constructing the URI-with-query string that the browser generates.</p>
<p>A work-around which has been suggested is to provide additional, known text 
  fields alongside the unknown text. In the example above, the form with the dynamically-defined 
  town name could also have a hidden field into which the generated page inserts 
  'known' text from the same character set (using the &amp;...; encoding). When 
  the request is eventually received by a servlet the octet contents of the known 
  field are inspected (typically by using <code>request.getQueryString()</code> 
  ). If the characters of the 'tracer' field are the same as those injected into 
  the page when it was generated (and the character code values encompass those 
  of the unknown town) then there is an assumption that the encoding used was 
  Unicode and that the town name as present in Java is accurate.</p>
<p>If the actual encoding can be deduced from the 'tracer' octets, the Servlet 
  API <code>request.setCharacterEncoding()</code> can be called (before calling 
  any of the <code>getParameter()</code> methods) to tell the web server which 
  encoding to assume when decoding the query.</p>
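<p>A sketch of how these two steps might be combined (the field names and the 
  expected octet sequence are invented; real code must match whatever tracer 
  text was actually injected into the generated page):</p>
<pre><code>// The hidden 'tracer' field was written into the generated page as
// &amp;#xFC; (u-umlaut). If the browser returned it as the UTF-8 octets
// C3 BC, assume the whole query is UTF-8; otherwise fall back.
import java.io.IOException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class TracerServlet extends HttpServlet {
    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // Inspect the raw, still-escaped query string before any decoding.
        String rawQuery = request.getQueryString();

        String encoding = "ISO-8859-1";            // fall-back assumption
        if (rawQuery != null &amp;&amp; rawQuery.indexOf("tracer=%C3%BC") &gt;= 0) {
            encoding = "UTF-8";
        }

        // Must be called before the first getParameter() call.
        request.setCharacterEncoding(encoding);
        String town = request.getParameter("town");
        response.setContentType("text/html; charset=UTF-8");
        response.getWriter().println("&lt;p&gt;Town: " + town + "&lt;/p&gt;");
    }
}</code></pre>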
<p>There is an obvious potential flaw in this 'tracer' technique - the browser 
  may represent &amp;...;-specified 'tracer' text with its Unicode values, yet 
  may use the local keyboard/operating system encoding for locally-entered data. 
  The author is not aware of any conclusive knowledge or evidence in this area.</p>
<p>An alternative work-around, which is more complex but might be more certain 
  if GET <i>has</i> to be used, would be to use Javascript or the DOM interfaces 
  to transfer the characters from the input fields to the query part of the URL 
  string. </p>
<h3>International characters in Jetty configuration files</h3>
<p>Jetty configuration files use XML. If international characters need to be included 
  in these files, the standard XML character encoding method should be used. Where 
  the character has a defined abbreviation (such as <code>&amp;uuml;</code> for 
  u-umlaut), that should be used. In other cases the hexadecimal description of 
  the character's Unicode code value should be used. For example <code>&amp;#x0391;</code> 
  defines the Greek capital letter Alpha. Use of the decimal form (<code>&amp;#913;</code>) 
  seems now to be unfashionable in W3C circles.</p>
<h3>Future possibilities</h3>
<p>It is to be hoped that something like the IRI scheme described in the Internet 
  Draft will evolve into a standard that will be adopted by suppliers of web servers 
  and browsers. It will probably also need changes to HTTP and/or the use of internationalised 
  versions of the <code>http</code> and <code>https</code> protocols. As currently 
  drafted, such a scheme would not, on its own, solve the problem of dynamic data 
  derived from form GET submissions. This will require changes to HTML 4 or, more 
  likely, extensions to a future version of XHTML. </p>
<p>The whole area of form data handling may be radically improved if the <a href="http://www.w3.org/MarkUp/Forms/">XForms</a> 
  program is successful. This has defined an XML-based approach to forms and associated 
  data and event handling, and uses Unicode throughout. The XForms 1.0 specification 
  is currently (August 2002) at 'last call working draft' status, and a number 
  of experimental implementations, some using browser applets or plug-ins, have 
  been announced. </p>
<p>Neither of these likely developments will improve the handling of international 
  characters by today's browsers, so designers of web services for the 'open' 
  market seem likely to have to work within today's constraints for a long time. 
</p>
<p>Anyone interested in the full complexity of handling international characters 
  and languages might like to read the W3C's <a href="http://www.w3.org/TR/charmod/">Character 
  Model</a> (currently a working draft) and follow the W3C's <a href="http://www.w3c.org/International/Activity.html">International 
  Activity</a>.</p>