by Rob Mayoff
Note that this document applies only to the Tcl 8 version of AOLserver, also known as nsd8x, because Tcl 7 has no internationalization support. This document is also mainly concerned with the AOLserver Tcl API, because that is what we use at ArsDigita. There are probably problems in the C API as well that are not covered here.
- application/x-www-form-urlencoded Format
- multipart/form-data Format
- ns_httpopen / ns_httpget
The second case is different. It turns out that Tcl 8.1 and later use Unicode. The interpreter normally stores strings using the UTF-8 encoding (which uses a variable number of bytes per character), and sometimes converts them to UCS-2 encoding (which uses 16-bit "wide characters"). The regsub command is one of those cases where conversions are involved. First, regsub converted the string to UCS-2. Tcl's UTF-8 parser is lenient, so the transformation ended up translating the byte 0xFC into the wide character 0x00FC. (This happens to be the correct translation because UCS-2 is a superset of ISO-8859-1.) Then regsub did its matching and substitution. Then it converted the UCS-2 representation back to UTF-8. The UTF-8 encoding of 0x00FC is 0xC3 0xBC. AOLserver does not know anything about UTF-8; it just sends whatever bytes you give it. In ISO-8859-1, 0xC3 means Ã and 0xBC means ¼.
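To make the byte-level round trip concrete, here is the same translation sketched in Python (an illustration only; the document's own examples are Tcl):

```python
# The byte 0xFC is U+00FC in ISO-8859-1.
raw = b"\xfc"
char = raw.decode("iso-8859-1")   # the single character U+00FC
utf8 = char.encode("utf-8")       # its UTF-8 encoding: the bytes 0xC3 0xBC
print(utf8)                       # b'\xc3\xbc'

# A client that interprets those two bytes as ISO-8859-1 sees two
# characters instead of one -- the mojibake described above.
print(utf8.decode("iso-8859-1"))  # 'Ã¼'
```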
So regsub didn't do anything wrong. We gave it garbage (a non-UTF-8 string), so it gave us garbage. How do we solve this problem? We need to make sure that all of AOLserver's textual input is translated to its UTF-8 representation and that the UTF-8 is translated to the appropriate character encoding on output.
"Charset" is synonymous with "character encoding"; Internet standards use this term. Tcl 8.1 and later use Unicode and UTF-8 internally and include support for converting between character encodings. The Tcl names for various encodings are different from the Internet standard names. So, in this document, I typically use the term "encoding" when I am referring to Tcl, and "charset" when I am referring to an Internet protocol feature.
AOLserver has several APIs for sending the contents of a file directly to the client. All of them send the contents of the file back to the client unmodified - no character encoding translation is performed. This means that it is up to you to ensure that the file's encoding is the same as the encoding the client expects.
The safest thing is to use only US-ASCII bytes in your text files - bytes with the high bit clear. Just about every character encoding you're likely to run across on the Web will be a superset of US-ASCII, so no matter what charset the client is expecting, your content will probably be displayed correctly. If you are sending an HTML (or XML) file, it can still access any Unicode character using the &#nnn; notation. However, if you have non-HTML files, or you don't want to deal with all those character reference entities, you'll have to make sure your client knows what character set you're sending.
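As a quick illustration of the &#nnn; escape hatch (in Python rather than the document's Tcl), a numeric character reference lets a pure-ASCII file carry any Unicode character:

```python
import html

# U+00FC can be written in ASCII-only HTML as the reference &#252;.
ref = "&#%d;" % ord("\u00fc")       # '&#252;'
assert ref.isascii()                # the reference itself is pure US-ASCII
assert html.unescape(ref) == "\u00fc"
```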
The client knows what character set to expect from the Content-Type header. You're probably used to seeing a header like this:

    Content-Type: text/html
Typically, you determine the content-type to send for a file by calling ns_guesstype on it. ns_guesstype looks up the file extension in AOLserver's file extension table to pick the content-type. The default table is in the AOLserver manual. Some of the default mappings are:
| Extension | Type |
|---|---|
| .html | text/html |
| .txt | text/plain |
| .jpg | image/jpeg |
You can override these mappings so that the content-type includes a charset parameter. In nsd.ini:

    [ns/mimetypes]
    .html=text/html; charset=iso-8859-1
    .txt=text/plain; charset=iso-8859-1

Or in nsd.tcl:

    ns_section ns/mimetypes
    ns_param .html "text/html; charset=iso-8859-1"
    ns_param .txt "text/plain; charset=iso-8859-1"
You can also add new extensions for files stored in other charsets. In nsd.ini:

    [ns/mimetypes]
    .html=text/html; charset=iso-8859-1
    .txt=text/plain; charset=iso-8859-1
    .html_sj=text/html; charset=shift_jis
    .txt_sj=text/plain; charset=shift_jis
    .html_ej=text/html; charset=euc-jp
    .txt_ej=text/plain; charset=euc-jp

Or in nsd.tcl:

    ns_section ns/mimetypes
    ns_param .html "text/html; charset=iso-8859-1"
    ns_param .txt "text/plain; charset=iso-8859-1"
    ns_param .html_sj "text/html; charset=shift_jis"
    ns_param .txt_sj "text/plain; charset=shift_jis"
    ns_param .html_ej "text/html; charset=euc-jp"
    ns_param .txt_ej "text/plain; charset=euc-jp"
If you want to return a file's contents in a charset other than the one it is stored in, you can do the translation yourself by reading the file in Tcl with the right encoding:

    set fd [open somefile.html_sj r]
    fconfigure $fd -encoding shiftjis
    set html [read $fd [file size somefile.html_sj]]
    close $fd
    ns_return 200 "text/html; charset=euc-jp" $html
XXX ACS: ad_serve_html_file
- ns_writefp
- ns_connsendfp
- ns_returnfp
- ns_respond
- ns_returnfile
- ns_return (and variants like ns_returnerror)
- ns_write
Tcl stores strings in memory using UTF-8. However, when you send content to the client from Tcl, you may not want the client to receive UTF-8; he may not support it. So AOLserver can translate UTF-8 to a different charset.
If you use ns_return or ns_respond to send a Tcl string to the client, AOLserver determines what character set to use by examining the content type you specify:

- If the content type includes an explicit charset parameter, AOLserver translates the string to that charset.
- Otherwise, if the content type is text/anything, then AOLserver translates the string to the charset specified in the config file by ns/parameters/OutputCharset (iso-8859-1 by default).
- Otherwise, no translation is performed.
In the second instance, where AOLserver uses ns/parameters/OutputCharset, if ns/parameters/HackContentType is also set to true, then AOLserver will modify the Content-Type header to include the charset parameter. HackContentType is set by default, and I strongly recommend leaving it set, because it's always safer to tell the client explicitly what charset you are sending.
For example, the default configuration is equivalent to this:
    [ns/parameters]
    OutputCharset=iso-8859-1
    HackContentType=true
With this configuration, Tcl strings such as $html above will be converted to the ISO-8859-1 encoding as they are sent to the client.
If you write the headers to the client with ns_write instead of letting AOLserver do it (via ns_return or ns_respond), then AOLserver does not parse the content-type. You must explicitly tell it what charset to use immediately after you write the headers, by calling ns_startcontent in one of these forms:

- ns_startcontent -type content-type, where content-type should be the same value you sent to the client in the Content-Type header. If content-type does not contain a charset parameter, AOLserver translates to ISO-8859-1.
- ns_startcontent -charset charset, naming the charset directly.

The ns_choosecharset command will return the best charset to use, taking into account the Accept-Charset header and the charsets supported by AOLserver. The syntax is:

    ns_choosecharset ?-preference charset-list?
The ns_choosecharset algorithm:

1. Set preferred-charsets to the list of charsets specified by the -preference flag. If that flag was not given, use the config parameter ns/parameters/PreferredCharsets. If the config parameter is missing, use {utf-8 iso-8859-1}. The list order is significant.
2. Set acceptable-charsets to the intersection of the Accept-Charset charsets and the charsets supported by AOLserver.
3. If acceptable-charsets is empty, return the charset specified by config parameter ns/parameters/DefaultCharset, or iso-8859-1 by default.
4. Find the first charset in preferred-charsets that also appears in acceptable-charsets. Return that charset.
5. If no charset in preferred-charsets also appears in acceptable-charsets, then choose the first charset listed in Accept-Charset that also appears in acceptable-charsets. Return that charset.

(Note: the last step will always return a charset because acceptable-charsets can only contain charsets listed by Accept-Charset.)
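The steps above can be sketched as a plain function (a Python paraphrase of the algorithm, not AOLserver source; the names are illustrative):

```python
def choose_charset(accept_charsets, supported,
                   preferred=("utf-8", "iso-8859-1"),
                   default="iso-8859-1"):
    # acceptable-charsets: intersection of Accept-Charset and the server's
    # supported charsets, kept in Accept-Charset order.
    acceptable = [c for c in accept_charsets if c in supported]
    if not acceptable:
        return default          # ns/parameters/DefaultCharset fallback
    for c in preferred:         # first preferred charset that is acceptable
        if c in acceptable:
            return c
    return acceptable[0]        # first acceptable Accept-Charset entry
```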
Example:
    # Assume japanesetext.html_sj is stored in Shift-JIS encoding.
    set fd [open japanesetext.html_sj r]
    fconfigure $fd -encoding shiftjis
    set html [read $fd [file size japanesetext.html_sj]]
    close $fd
    set charset [ns_choosecharset -preference {utf-8 shift-jis euc-jp iso-2022-jp}]
    set type "text/html; charset=$charset"
    ns_write "HTTP/1.0 200 OK\nContent-Type: $type\n\n"
    ns_startcontent -type $type
    ns_write $html
In URL encoding, one byte may be encoded as three bytes, which in US-ASCII represent a percent character ("%") followed by two hexadecimal digits. After a URL is decoded, any byte less than 0x80 represents a US-ASCII character. The problem with URLs and URL encoding is that, historically, no standard defined what bytes 0x80 and above represent. Various proposals, such as the IURI Internet-Draft, propose using UTF-8 exclusively as the character encoding in URLs, but existing software does not work that way.
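The ambiguity is language-independent; for illustration, here is how the same character URL-encodes differently depending on the charset chosen (Python, not AOLserver's implementation):

```python
from urllib.parse import quote, unquote

# U+00FC is one byte in ISO-8859-1 but two bytes in UTF-8, so the
# percent-encoded forms differ -- and a decoder must know which was used.
assert quote("\u00fc", encoding="iso-8859-1") == "%FC"
assert quote("\u00fc", encoding="utf-8") == "%C3%BC"
assert unquote("%C3%BC", encoding="utf-8") == "\u00fc"
```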
AOLserver's ns_urlencode and ns_urldecode choose the character encoding to use in one of three ways:

1. If you pass the -charset flag, they use that charset. For example:

       ns_urlencode -charset shift_jis $string

2. If no -charset flag was given, then the ns_urlcharset command determines what encoding is used. ns_urlcharset sets the default charset for the ns_urlencode and ns_urldecode commands for one connection. For example, these commands have the same result as the preceding example:

       ns_urlcharset shift_jis
       ns_urlencode $string

   The ns_urlcharset command is only valid when called from a connection thread. Do not call it from an ns_schedule_proc thread.

3. Otherwise, ns/parameters/URLCharset determines the charset. The default value for the parameter is "iso-8859-1".
A URL, as seen by AOLserver in an HTTP request, consists of two parts, the path and the query. For example:

    /register/user-new.tcl?first_names=Rob&last_name=Mayoff

Here the path is /register/user-new.tcl and the query is first_names=Rob&last_name=Mayoff. We will consider the path part and the query part separately.
AOLserver uses the charset named by ns/parameters/URLCharset to decode the path. You must use the same charset to encode URLs you send out, or you will have problems. However, other people might link to you from their servers and might be careless about the character encodings. So the safest practice is to use only US-ASCII characters in your URL paths if you possibly can.
application/x-www-form-urlencoded Format

The query part of the URL is in application/x-www-form-urlencoded format. (Okay, it could be raw data from an <ISINDEX> page, but that tag is deprecated in HTML 4.0. Let's simplify our lives by pretending it doesn't exist.)

POST form data also usually arrives in application/x-www-form-urlencoded format. The other format is covered under POST Data in multipart/form-data Format.
If you always send data in a single charset, and you always specify the charset in the Content-Type header, then it is safe to assume that form data is always encoded using that charset. Just make that your ns/parameters/URLCharset and don't worry about it.
If you cannot limit yourself to a single charset, then you need to use some other technique. No matter how you do it, you must call ns_urlcharset before calling ns_conn form or ns_getform. If you call ns_urlcharset after you've asked AOLserver for the form, it will not work retroactively. Here are two ways you could determine the charset:
One way is to send the charset in a hidden form field:

    # myform.tcl
    set _charset [ns_choosecharset]
    ns_return 200 "text/html; charset=$_charset" "
    <form action='myform-2.tcl'>
    <input type='hidden' name='_charset' value='$_charset'>
    First Names: <input type='text' name='first_names'><br>
    Last Name: <input type='text' name='last_name'><br>
    <input type='submit' name='submit' value='Submit'>
    </form>
    "

The chicken-and-egg problem here is that you need the contents of a form field in order to decode the form. Fortunately, all charset names use only US-ASCII characters, so you can extract the _charset field from the query string without decoding it. The predefined command ns_formfieldcharset will do this for you:

    # myform-2.tcl
    ns_formfieldcharset _charset
    set form [ns_conn form]
    set first_names [ns_set get $form first_names]
    set last_name [ns_set get $form last_name]
    # etc.
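The trick ns_formfieldcharset relies on can be sketched in Python (an illustration of the idea, not AOLserver's implementation; the function name is hypothetical):

```python
def extract_charset(raw_query: bytes, field: bytes = b"_charset"):
    # Charset names are pure US-ASCII, so the field can be pulled out of
    # the still-encoded query string with byte-level parsing only.
    for pair in raw_query.split(b"&"):
        name, _, value = pair.partition(b"=")
        if name == field:
            return value.decode("ascii")
    return None
```

Once the charset is known, the rest of the query can be decoded with it.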
ns_formfieldcharset calls ns_urlcharset, so this will affect all further use of ns_urlencode and ns_urldecode for that connection, unless you call ns_urlcharset again.
The other way is to send the charset in a cookie:

    # anotherform.tcl
    set _charset [ns_choosecharset]
    ns_set put [ns_conn outputheaders] Set-Cookie _charset=$_charset
    ns_return 200 "text/html; charset=$_charset" "
    <form action='anotherform-2.tcl'>
    First Names: <input type='text' name='first_names'><br>
    Last Name: <input type='text' name='last_name'><br>
    <input type='submit' name='submit' value='Submit'>
    </form>
    "

There is no chicken-and-egg problem here, but AOLserver still provides the predefined command ns_cookiecharset to set the URL encoding from a cookie:

    # anotherform-2.tcl
    ns_cookiecharset _charset
    set form [ns_conn form]
    set first_names [ns_set get $form first_names]
    set last_name [ns_set get $form last_name]
    # etc.

Using a cookie has the big drawback that a cookie is not associated with a single web page. So if the user uses his back button, or has a page cached, or has multiple windows open, the wrong cookie value might be sent back to us.
multipart/form-data Format

The client sends POST data in multipart/form-data format when the FORM tag says enctype='multipart/form-data'. This format is based on the MIME standard and allows file upload (which application/x-www-form-urlencoded does not).
Alas, multipart/form-data format is no better than application/x-www-form-urlencoded format as far as character encoding issues are concerned. The MIME multipart format allows each form field to include its own Content-Type header with a charset parameter, but in practice clients do not send any indication of the charset used. So we must resort to the same tricks to decide what charset the data is in: always use the same charset, or use a hidden field or a cookie to determine the charset.
The ns_formfieldcharset and ns_cookiecharset commands work for fields in multipart/form-data format except file upload fields. We cannot know what character set the user stores his files in, so we don't know how to translate an uploaded file to UTF-8 (assuming the uploaded file is even a text file). So the temporary files created by ns_getform contain the exact bytes sent by the client.
If you hand non-UTF-8 data to the Oracle client library when it thinks you are handing it UTF-8 data, it may crash. So when you are inserting an uploaded file into a CLOB, it is imperative that you run the file contents through Tcl's encoder first. I have not figured out a satisfactory way to automate this yet.
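A minimal sketch of that translation step, assuming you know (or have asked the user for) the file's charset; Python for illustration, with a hypothetical helper name:

```python
def uploaded_file_to_utf8(data: bytes, source_charset: str) -> bytes:
    # Decoding validates the bytes against the claimed charset; an
    # exception here means the data was not really in that charset.
    text = data.decode(source_charset)
    return text.encode("utf-8")
```

The analogous step in Tcl would use encoding convertfrom with the file's charset before handing the string to the database driver.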
ns_httpopen / ns_httpget
The ns_httpopen command now parses the Content-Type header from the remote server and sets the encoding on the read file descriptor appropriately. If the content from the remote server is a text type but no charset was specified, then ns_httpopen uses the config parameter ns/parameters/HttpOpenCharset, which specifies the charset to assume the remote server is sending (iso-8859-1 by default).