Now that the Tcl core has a fairly comprehensive set of native translations among character sets, it's time to flesh out this great beginning. Many translations are not yet available for a simple

 encoding convertfrom ?encoding_name? string

Yet the worldwide documentation of these encodings is expanding rapidly. Recently (Dec 1999), an [ebcdic].enc[http://members1.chello.nl/~j.nijtmans/ebcdic.enc] was posted by Jan Nijtmans [http://members1.chello.nl/~j.nijtmans/ebcdic.txt] based on a web table [http://www.synkronix.com/programmers_guide/ebcdic.html], but there are ''so'' many more[http://www.egt.ie/standards/iso10646/pdf/] still missing. Mark Leisher has a compendium at his homepage [http://crl.nmsu.edu/~mleisher/csets.html], for example.

Tcl can be a powerful tool for standardization, and can lead the way in automatic compatibility[http://www.w3.org/International/O-charset.html] for data exchange. A mapping of Tcl encoding names to IANA's list is available at [http://mimersbrunn.sourceforge.net/tcl_charset_iana].

----
'''UTF-8'''[http://info.internet.isi.edu:80/in-notes/rfc/files/rfc2279.txt] and other transformations [http://czyborra.com/utf/]

Ref: (Compuserve)[http://ourworld.compuserve.com/homepages/elbrecht/utf8site.htm] (Wyoming)[http://gtcs.com/cgi-bin/charsets] (Germany)[http://www.elbrecht.de/utf8site.htm] (Wiki)[http://purl.org/thecliff/tcl/wiki/515.html]

The UTF-8 encoding (e.g. Unicode-like encoding for the web - Netscape/IE 4+ support[http://msdn.microsoft.com/workshop/management/intl/unicode.asp]) is an alternative encoding to Unicode-16. It encodes the same characters one for one, but uses 'escape' values (lead bytes) to introduce multi-byte sequences. You cannot mix Unicode-16 with UTF-8 in the same data, but you can convert losslessly between them, so long as you're not off into the Unicode-32 encodings.
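The lossless round trip described above can be sketched with the built-in [[encoding]] command. This is only an illustration: the sample string and the variable names are arbitrary choices, not anything taken from the references.

```tcl
# Sketch only: round-trip a string through UTF-8 with the built-in
# [encoding] command. The sample string is an arbitrary example.
set original "caf\u00e9 \u56fd"

# Unicode string -> UTF-8 byte sequence (one "character" per byte)
set bytes [encoding convertto utf-8 $original]

# UTF-8 byte sequence -> Unicode string
set restored [encoding convertfrom utf-8 $bytes]

puts "characters: [string length $original]"
puts "bytes:      [string length $bytes]"
puts "lossless:   [string equal $original $restored]"
```

The six characters become nine bytes (two for \u00e9, three for \u56fd), and converting back yields the original string.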
The largest value that each UTF-8 sequence length can carry:

 \xFD\xBF\xBF\xBF\xBF\xBF translates to U+7FFFFFFF (Unicode-32)
 \xFB\xBF\xBF\xBF\xBF     translates to U+03FFFFFF (Unicode-32)
 \xF7\xBF\xBF\xBF         translates to U+001FFFFF (Unicode-32)
 \xEF\xBF\xBF             translates to U+0000FFFF or ''\uFFFF'' in Tcl
 \xDF\xBF                 translates to U+000007FF or ''\u07FF'' in Tcl
 \x7F                     is the highest single-byte code in UTF-8

----
Although there are Unicode ''escaped glyphs'', similar to the familiar '''&#160;''' for the ISO8859-1 non-breaking space, you cannot (yet) count on browsers interpreting them properly within a page, especially when the page itself has not been tagged as using '''charset=utf-8'''.

[BR] - Do you have specific experience here? In theory the numeric character entities (what you call "escaped glyphs", like &#160;) do not depend on the charset that a file is tagged with. The charset tag is explicitly for the characters ''not'' represented by entities. I have experience with IE, Netscape 4, Mozilla and Opera. Among those, I remember having problems with hexadecimal character entities like &#xA0; and with UTF-8 display support, especially outside of Latin-1 (as expected for non-Unicode apps). Other than that, things have worked fine.

----
The following should be considered ''alpha'' quality. For Tcl 8.x with built-in encodings, merely use utf-8 as the encoding for convertto/convertfrom:

 #
 # Converts a Unicode string into an array of 16-bit values, for which
 # the low 8 bits of each character should be emitted to give the true
 # UTF-8 value (e.g.
 # [encoding convertto iso8859-1] in most cases)
 # Equivalent: [encoding convertto utf-8 $string]
 #
 proc {unicode_to_utf8} {string} {
     set rv {}
     foreach c [split $string {}] {
         scan $c %c i
         if {$i < 128} {
             append rv $c
         } elseif {$i < 2048} {
             # Two bytes: 110xxxxx 10xxxxxx
             append rv [format %c%c [expr {(($i & 1984) >> 6) | 192}] \
                                    [expr {($i & 63) | 128}]]
         } elseif {$i < 65536} {
             # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
             append rv [format %c%c%c [expr {(($i & 61440) >> 12) | 224}] \
                                      [expr {(($i & 4032) >> 6) | 128}] \
                                      [expr {($i & 63) | 128}]]
         } elseif {$i < 2097152} {
             # Can't happen in Tcl 8.3.x and below
         } elseif {$i < 4294967296} {
             # Can't happen in Tcl 8.3.x and below
         }
     }
     return $rv; # to be interpreted as a byte array
 }

 #
 # Converts a "string" of 16-bit UTF-8 entities into true unicode-16 where
 # values of \u0000-\uFFFF are specified in the UTF-8.  Source data
 # likely was read in as [encoding convertfrom iso8859-1].  When the
 # second parameter (uescape) is specified as a non-zero (TRUE) value,
 # any UTF-8 value above U+0000FFFF will be inserted as a pseudo \u
 # escaped ASCII-hex value.  When it is not specified, any values above
 # U+0000FFFF will be replaced with \uFFFC, which is officially called
 # the "Object Replacement Character"
 # Equivalent: [encoding convertfrom utf-8 $hextetarray]
 #
 proc {utf8_to_unicode} {hextetarray {uescape 0}} {
     set rv {}
     # Split into a list of single characters, so that [llength] and
     # [lindex] below address one byte at a time
     set hextetarray [split $hextetarray {}]
     for {set x 0} {$x < [llength $hextetarray]} {incr x} {
         scan [lindex $hextetarray $x] %c i
         if {$i > 253} {
             # Cannot be handled in 31 bits, let alone 16-bit Unicode-16
             # Most likely an error - absorb ONE byte
             append rv ?
         } elseif {$i >= 252} {
             # Cannot be handled in 16-bit Unicode, this is a 32-bit value
             # in the range U+04000000..U+7FFFFFFF
             if {$uescape} {
                 # Lead byte carries 1 bit, then five bytes of 6 bits each
                 set iiiiii [expr {($i & 1) << 30}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set iiiii [expr {($i & 63) << 24}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set iiii [expr {($i & 63) << 18}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set iii [expr {($i & 63) << 12}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set ii [expr {($i & 63) << 6}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set i [expr {$i & 63}]
                 append rv {\u}
                 append rv [format %X [expr {$i | $ii | $iii | $iiii | $iiiii | $iiiiii}]]
             } else {
                 append rv "\uFFFC"
                 incr x 5
             }
         } elseif {$i >= 248} {
             # Cannot be handled in 16-bit Unicode, this is a 32-bit value
             # in the range U+00200000..U+03FFFFFF
             if {$uescape} {
                 # Lead byte carries 2 bits, then four bytes of 6 bits each
                 set iiiii [expr {($i & 3) << 24}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set iiii [expr {($i & 63) << 18}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set iii [expr {($i & 63) << 12}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set ii [expr {($i & 63) << 6}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set i [expr {$i & 63}]
                 append rv {\u}
                 append rv [format %X [expr {$i | $ii | $iii | $iiii | $iiiii}]]
             } else {
                 append rv "\uFFFC"
                 incr x 4
             }
         } elseif {$i >= 240} {
             # Cannot be handled in 16-bit Unicode, this is a 32-bit value
             # in the range U+00010000..U+001FFFFF
             if {$uescape} {
                 set iiii [expr {($i & 7) << 18}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set iii [expr {($i & 63) << 12}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set ii [expr {($i & 63) << 6}]
                 incr x
                 scan [lindex $hextetarray $x] %c i
                 set i [expr {$i & 63}]
                 append rv {\u}
                 append rv [format %X [expr {$i | $ii | $iii | $iiii}]]
             } else {
                 append rv "\uFFFC"
                 incr x 3
             }
         } elseif {$i >= 224} {
             # Three bytes: U+0800..U+FFFF
             set iii [expr {($i & 15) << 12}]
             incr x
             scan [lindex $hextetarray $x] %c i
             set ii [expr {($i & 63) << 6}]
             incr x
             scan [lindex $hextetarray $x] %c i
             set i [expr {$i & 63}]
             append rv [format %c [expr {$i | $ii | $iii}]]
         } elseif {$i >= 192} {
             # Two bytes: U+0080..U+07FF
             set ii [expr {($i & 31) << 6}]
             incr x
             scan [lindex $hextetarray $x] %c i
             set i [expr {$i & 63}]
             append rv [format %c [expr {$i | $ii}]]
         } elseif {$i < 128} {
             append rv [lindex $hextetarray $x]
         }
         # Note: a stray continuation byte (128..191) in lead position
         # matches no branch above and is silently dropped
     }
     return $rv; # as a Unicode string
 }

----
'''Byte-Order Mark'''

Also, there has been no strong push (yet) for use of the Unicode "introducer" in the Tcl community. It's wise to put '''\uFEFF''' at the beginning of any Unicode-16 encoded file. This gives insurance about byte order: a reader that sees the two bytes swapped gets '''\uFFFE''', which is guaranteed to never be a true Unicode[http://www.unicode.org/textonly.html] character. In UTF-8, the ''BOM'' is '''\xEF\xBB\xBF'''.

----
'''Slash-U format'''

In Tcl (and/or Tcl/Tk) source files, we 'must' use slash-u format for Unicode characters which are beyond the basic ASCII encoding, in order to preserve values across different system encodings. This proc is provided as an easy way to turn Unicode characters into strings which the interpreter will later decode into the desired values.

 proc {unicode_to_slashu} {string} {
     set rv {}
     foreach c [split $string {}] {
         scan $c %c c
         append rv {\u}
         append rv [format %.4X $c]
     }
     return $rv
 }

Note: Java calls this format "Unicode escapes"; C and C++ talk about UCNs, Universal Character Names.

----
'''Unicodes to HTML format'''

Here's a similar helper that converts all characters above 127 in a string to the decimal entity format in HTML (e.g. &#22269; for \u56fd, i.e. 国):

 proc u2html {s} {
     set res ""
     foreach u [split $s ""] {
         scan $u %c t
         if {$t>127} {
             append res "&#$t;"
         } else {
             append res $u
         }
     }
     set res
 } ;# RS

See also the [Drag and Drop] page on the Wiki, and [The Lish family] - [The i18n package] for other ways to get Unicodes from 7-bit ASCII.
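----
Going the other way from the Slash-U format section above, slash-u text can be turned back into real characters at run time. The following is a minimal sketch, not a tested utility: the name ''slashu_to_unicode'' is made up for this example, and it relies on [[subst]] doing backslash substitution only when both -nocommands and -novariables are given.

```tcl
# Hypothetical helper (the name is invented for this sketch): turn
# slash-u text back into Unicode characters.  With -nocommands and
# -novariables, [subst] performs backslash substitution only.
proc slashu_to_unicode {string} {
    return [subst -nocommands -novariables $string]
}

# Braces keep the backslashes literal until the helper is called
set escaped {\u0041\u56FD}
set chars [slashu_to_unicode $escaped]
puts [string length $chars]
```

Here the twelve-character escaped text becomes two characters. Be aware that any other backslash sequences in the input (\n, \t and so on) are substituted as well, so this suits only text that was produced purely in slash-u form.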