Monday, 19 April 2004
これは日本語のテキストです。読めますか? (This is Japanese text. Can you read it?)
Posted by at 4:05 PM in Internationalization
All internationalization tests pass.
Let's see how a link back to Sam Ruby's Unicode and weblogs goes.
For reference, here is my original frustration with URI encoding in Tomcat 5.
<Connector port="8009" enableLookups="false" redirectPort="8443" debug="0" protocol="AJP/1.3" URIEncoding="UTF-8"/>
The default for this option, IMHO, should be UTF-8 and not ISO-8859-1. Did I dream that there was a relevant W3C specification saying UTF-8 should be the default encoding for URIs? Maybe. I'm looking now, but if you know of one in particular, point me at it and I'll update this entry appropriately.
Update: Character Encoding in URI references. So, you still get only the restricted US-ASCII subset allowed in URIs, but the encoding of characters to bytes (before they are escaped) is done using UTF-8.
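To see why the default matters, here is a minimal sketch that decodes the same escaped bytes with both charsets. It uses the JDK's java.net.URLDecoder purely to illustrate the mismatch; it is not Tomcat's own decoding code, and the class and helper names are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

class UriDecodeMismatch {
    // Hypothetical helper: decode %HH escapes, interpreting the bytes in the given charset.
    static String decode(String escaped, String charsetName) {
        try {
            return URLDecoder.decode(escaped, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        // %C3%A9 is the UTF-8 byte sequence for U+00E9 (e with acute accent).
        String escaped = "%C3%A9";
        // Read as UTF-8, the two bytes form one character.
        System.out.println(decode(escaped, "UTF-8"));
        // Read as ISO-8859-1, each byte becomes its own character (U+00C3, U+00A9).
        System.out.println(decode(escaped, "ISO-8859-1"));
    }
}
```

If the client escapes with UTF-8 and the server decodes with ISO-8859-1, every non-ASCII character comes back as two or more wrong characters, which is the kind of breakage a mismatched default produces.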
And wouldn't you know it, the reference I was originally looking for was in the javadocs for java.net.URLEncoder#encode(String s, String enc). The specific reference is Non-ASCII characters in URI attribute values, which gives these steps:
1. Each disallowed character is converted to UTF-8, resulting in one or more bytes.
2. The resulting bytes are escaped using the URI escaping mechanism (that is, each byte is converted to %HH, where HH is the byte value expressed using hexadecimal notation).
3. The original character is replaced by the resulting character sequence.
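The three steps above can be sketched by hand in a few lines of Java. This is only an illustration under a simplifying assumption: it treats every non-ASCII code point as "disallowed" and leaves all ASCII characters alone, whereas the real set of disallowed characters is broader. The class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class UriEscapeSketch {
    // Hypothetical helper: escapes non-ASCII code points only (a simplifying assumption).
    static String escapeNonAscii(String input) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < input.length()) {
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (codePoint < 0x80) {
                // ASCII is passed through untouched in this sketch.
                out.appendCodePoint(codePoint);
                continue;
            }
            // Step 1: convert the character to UTF-8, giving one or more bytes.
            byte[] utf8 = new String(Character.toChars(codePoint)).getBytes(StandardCharsets.UTF_8);
            // Steps 2 and 3: write each byte as %HH in place of the original character.
            for (byte b : utf8) {
                out.append('%').append(String.format("%02X", b & 0xFF));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // "\u65E5\u672C\u8A9E" is the word for "Japanese (language)", written with escapes
        // so the source file itself stays ASCII.
        System.out.println(escapeNonAscii("/blog/\u65E5\u672C\u8A9E"));
        // prints /blog/%E6%97%A5%E6%9C%AC%E8%AA%9E
    }
}
```

For real use, java.net.URLEncoder#encode(s, "UTF-8") does the UTF-8 conversion and %HH escaping for you, but note that it targets form encoding and turns spaces into "+".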