Version 0 of Encoding Translations and i18n

Updated 2000-03-15 14:04:08

Now that we have a fairly comprehensive start on native translations among character sets in the core Tcl, it's time to flesh out this great beginning. Many translations are not yet available for a simple

 encoding convertfrom ?encoding? data

Yet, the world-wide documentation of these encodings is expanding rapidly. Recently (Dec 1999), an ebcdic.enc [L1 ] was posted by Jan Nijtmans [L2 ] based on a web table [L3 ], but there are so many more [L4 ] still missing. Mark Leisher has a compendium at his homepage [L5 ], for example. Tcl can be a powerful tool for standardization and automatic compatibility [L6 ], and can take a leading role in data exchange.
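
Whether a particular translation is already available in a given installation can be checked with [encoding names] before converting. The following is only a minimal sketch: the proc name have_encoding and the sample byte are assumptions made for illustration, not part of Tcl.

```tcl
# Hypothetical helper: report whether this Tcl build knows an encoding.
proc have_encoding {name} {
    expr {[lsearch -exact [encoding names] $name] >= 0}
}

# iso8859-1 is compiled into the core, so this conversion should work
# everywhere: the single byte 0xE9 becomes the character U+00E9.
if {[have_encoding iso8859-1]} {
    set ch [encoding convertfrom iso8859-1 "\xE9"]
}

# A name that has no .enc file installed is simply absent from the list.
set missing [expr {![have_encoding no-such-encoding]}]
```

An encoding such as ebcdic only shows up in the list once its .enc file has been placed in the encoding directory of the installation.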


UTF-8[L7 ] and other transformations [L8 ]

Ref: (Compuserve)[L9 ] (Wyoming)[L10 ] (Germany)[L11 ] (Wiki)[L12 ]

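The UTF-8 encoding (a Unicode encoding for the web, supported by Netscape and IE 4+ [L13 ]) is an alternative to Unicode-16. It still encodes character for character, but each character takes a variable number of bytes: ASCII stays as a single byte, and everything else becomes a lead byte (an 'escape' value that announces the length of the sequence) followed by continuation bytes in the range \x80-\xBF. You cannot mix Unicode-16 with UTF-8 in one stream, but you can convert losslessly between them, so long as you're not off into the Unicode-32 encodings. The highest value for each sequence length is: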

  \xFD\xBF\xBF\xBF\xBF\xBF translates to U+7FFFFFFF (Unicode-32)
  \xFB\xBF\xBF\xBF\xBF     translates to U+03FFFFFF (Unicode-32)
  \xF7\xBF\xBF\xBF         translates to U+001FFFFF (Unicode-32)
  \xEF\xBF\xBF             translates to U+0000FFFF or ''\uFFFF'' in Tcl
  \xDF\xBF                 translates to U+000007FF or ''\u07FF'' in Tcl
  \x7F                     is the highest single-byte code in UTF-8
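
The rows of this table that fit in 16 bits can be checked against the built-in converter. This is a small sketch, assuming Tcl 8.1 or later; the helper name utf8_hex is made up for illustration.

```tcl
# Hypothetical helper: UTF-8 encode a string and return the bytes as hex.
proc utf8_hex {s} {
    binary scan [encoding convertto utf-8 $s] H* hex
    return $hex
}

set one   [utf8_hex "\x7F"]     ;# highest single-byte code
set two   [utf8_hex "\u07FF"]   ;# highest two-byte code
set three [utf8_hex "\u56FD"]   ;# an ordinary three-byte code
```

The three results are 7f, dfbf and e59bbd, which matches the lead-byte and continuation-byte patterns shown above.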

Although there are Unicode escaped glyphs (numeric character references such as &#22269;), similar to the named entities so often used like &nbsp; for the ISO8859-1 non-breaking space, you cannot (yet) count on browsers interpreting them properly within a page, especially when the page itself has not been tagged as using charset=utf-8.

The following procs should be considered alpha quality; they mainly show how the transformation works. In Tcl 8.1 and later, with built-in encodings, merely use utf-8 as the encoding for convertto/convertfrom:

 #
 # Converts a Unicode string into a string of characters in the range
 # \x00-\xFF, one per UTF-8 byte.  The low 8 bits of each character
 # should be emitted to give the true UTF-8 value (e.g.
 # [encoding convertto iso8859-1] in most cases)
 # Equivalent: [encoding convertto utf-8 $string]
 #
 proc {unicode_to_utf8} {string} {
   set rv {}
   foreach c [split $string {}] {
     scan $c %c i
     if {$i < 128} {
       append rv $c
     } elseif {$i < 2048} {
       append rv [format %c%c [expr {(($i & 1984) >> 6) | 192}] \
                              [expr {($i & 63) | 128}]]
     } elseif {$i < 65536} {
       append rv [format %c%c%c [expr {(($i & 61440) >> 12) | 224}] \
                                [expr {(($i & 4032) >> 6) | 128}] \
                                [expr {($i & 63) | 128}]]
     } elseif {$i < 2097152} {
 #       Can't happen in Tcl 8.3.x and below
     } elseif {$i < 4294967296} {
 #       Can't happen in Tcl 8.3.x and below
     }
   }
   return $rv; # to be interpreted as a byte array
 }


 #
 # Converts a "string" of 8-bit UTF-8 entities (one byte per character)
 # into true unicode-16 where values of \u0000-\uFFFF are specified in
 # the UTF-8.  Source data likely was read in as
 # [encoding convertfrom iso8859-1].  When the second parameter (uescape)
 # is specified as a non-zero (TRUE) value, any UTF-8 value above
 # U+0000FFFF will be inserted as a pseudo \u escaped ASCII-hex value.
 # When it is not specified, any value above U+0000FFFF will be replaced
 # with \uFFFC, which is officially called the "Object Replacement
 # Character".
 # Equivalent: [encoding convertfrom utf-8 $hextetarray]
 #
 proc {utf8_to_unicode} {hextetarray {uescape 0}} {
   set rv {}
   # Split into a list of single characters for the llength/lindex calls below
   set hextetarray [split $hextetarray {}]
   for {set x 0} {$x < [llength $hextetarray]} {incr x} {
     scan [lindex $hextetarray $x] %c i
     if {$i > 253} {
 #       Cannot be handled in 31 bits, let alone 16-bit Unicode-16
 #       Most likely an error - absorb ONE byte
       append rv ?
     } elseif {$i >= 252} {
 #       Cannot be handled in 16-bit Unicode, this is a 32-bit value
 #       in the range U+04000000..U+7FFFFFFF
       if {$uescape} {
         set iiiiii [expr {($i & 1) << 30}]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iiiii [expr ($i & 63) << 24]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iiii [expr ($i & 63) << 18]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iii [expr ($i & 63) << 12]
         incr x
         scan [lindex $hextetarray $x] %c i
         set ii  [expr ($i & 63) << 6]
         incr x 
         scan [lindex $hextetarray $x] %c i
         set i [expr $i & 63]
         append rv {\u}
         append rv [format %X [expr $i | $ii | $iii | $iiii | $iiiii | $iiiiii]]
       } else {
         append rv "\uFFFC"
         incr x 5
       }
     } elseif {$i >= 248} {
 #       Cannot be handled in 16-bit Unicode, this is a 32-bit value
 #       in the range U+00200000..U+03FFFFFF
       if {$uescape} {
         set iiiii [expr ($i & 3) << 24]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iiii [expr ($i & 63) << 18]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iii [expr ($i & 63) << 12]
         incr x
         scan [lindex $hextetarray $x] %c i
         set ii  [expr ($i & 63) << 6]
         incr x
         scan [lindex $hextetarray $x] %c i
         set i [expr {$i & 63}]
         append rv {\u}
         append rv [format %X [expr $i | $ii | $iii | $iiii | $iiiii]]
       } else {
         append rv "\uFFFC"
         incr x 4
       }
     } elseif {$i >= 240} {
 #       Cannot be handled in 16-bit Unicode, this is a 32-bit value
 #       in the range U+00010000..U+001FFFFF
       if {$uescape} {
         set iiii [expr ($i & 7) << 18]
         incr x
         scan [lindex $hextetarray $x] %c i
         set iii [expr ($i & 63) << 12]
         incr x
         scan [lindex $hextetarray $x] %c i
         set ii  [expr ($i & 63) << 6]
         incr x
         scan [lindex $hextetarray $x] %c i
         set i [expr $i & 63]
         append rv {\u}
         append rv [format %X [expr $i | $ii | $iii | $iiii]]
       } else {
         append rv "\uFFFC"
         incr x 3
       }
     } elseif {$i >= 224} {
       set iii [expr ($i & 15) << 12]
       incr x
       scan [lindex $hextetarray $x] %c i
       set ii  [expr ($i & 63) << 6]
       incr x
       scan [lindex $hextetarray $x] %c i
       set i [expr $i & 63]
       append rv [format %c [expr $i | $ii | $iii]]
     } elseif {$i >= 192} {
       set ii [expr ($i & 31) << 6]
       incr x
       scan [lindex $hextetarray $x] %c i
       set i [expr ($i & 63)]
       append rv [format %c [expr $i | $ii]]
     } elseif {$i < 128} {
       append rv [lindex $hextetarray $x]
     }
   }
   return $rv; # as a Unicode string
 }

Byte-Order Mark

Also, there has been no strong push for use of the Unicode "introducer" in the Tcl community (yet). It's wise to use \uFEFF at the beginning of any Unicode-16 encoded file. This gives insurance about byte order: a reader that assumes the wrong byte order sees \uFFFE instead, which is guaranteed never to be a true Unicode [L14 ] character. In UTF-8, the BOM is \xEF\xBB\xBF.
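
As a rough sketch of how a reader might use the introducer, the proc below looks at the first bytes of raw file data (read with -translation binary) and reports which BOM it found. The proc name bom_encoding is hypothetical, and the returned words are descriptive labels only, not necessarily names that [encoding names] lists in a given Tcl version.

```tcl
# Hypothetical helper: guess the encoding of raw file data from its BOM.
# Returns utf-8, utf-16be, utf-16le, or an empty string when no BOM is found.
proc bom_encoding {bytes} {
    # UTF-8 form of \uFEFF
    if {[string equal [string range $bytes 0 2] "\xEF\xBB\xBF"]} {
        return utf-8
    }
    set two [string range $bytes 0 1]
    # \uFEFF stored high byte first
    if {[string equal $two "\xFE\xFF"]} { return utf-16be }
    # \uFEFF stored low byte first (reads as \uFFFE in the other order)
    if {[string equal $two "\xFF\xFE"]} { return utf-16le }
    return ""
}

set guess [bom_encoding "\xFF\xFEa\x00"]
```

Here guess ends up as utf-16le, because the two BOM bytes arrive low byte first.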


Slash-U format

In Tcl (and/or Tcl/Tk) source files, we 'must' use slash-u format for Unicode characters beyond basic ASCII, in order to preserve their values across different system encodings. This proc is provided as an easy way to turn Unicode characters into slash-u strings, which the interpreter will later decode back into the desired values.

  proc {unicode_to_slashu} {string} {
    set rv {}
    foreach c [split $string {}] {
      scan $c %c c
      append rv {\u}
      append rv [format %.4X $c]
    }
    return $rv
  }
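
Going the other way, slash-u text that arrives as data (rather than as part of a script) can be turned back into characters with [subst]. This is a minimal sketch with a sample string chosen for illustration; -nocommands and -novariables keep [subst] from evaluating anything except backslash sequences.

```tcl
# Two characters in slash-u form, held as plain ASCII data
set escaped {\u56FD\u00E9}

# Only backslash substitution is performed on the data
set chars [subst -nocommands -novariables $escaped]

# Number of characters after decoding
set count [string length $chars]
```

After this, chars holds the two characters \u56FD and \u00E9, so count is 2.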

Unicodes to HTML format

Here's a similar helper that converts all characters above 127 in a string to the decimal entity format in HTML (e.g. &#22269; for \u56fd):

 proc u2html {s} {
    set res ""
    foreach u [split $s ""] {
        scan $u %c t
        if {$t>127} {
            append res "&#$t;"
        } else {
            append res $u
        }
    }
    set res
 } ;# RS
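
For completeness, the reverse direction might look like the sketch below. The proc name html2u is hypothetical, and it handles only decimal entities of the form &#NNN;, leaving named and hexadecimal entities untouched.

```tcl
# Hypothetical inverse of u2html: decode decimal entities like &#22269;
proc html2u {s} {
    set res ""
    set pos 0
    # Find each entity in turn, starting the search after the previous one
    while {[regexp -indices -start $pos {&#([0-9]+);} $s whole digits]} {
        foreach {first last} $whole break
        foreach {dfirst dlast} $digits break
        # Copy the literal text that precedes the entity
        append res [string range $s $pos [expr {$first - 1}]]
        # Convert the decimal digits to the character with that code
        scan [string range $s $dfirst $dlast] %d code
        append res [format %c $code]
        set pos [expr {$last + 1}]
    }
    # Copy whatever follows the last entity
    append res [string range $s $pos end]
    return $res
}

set back [html2u "a&#22269;b"]
```

Scanning with -indices rather than building a script for [subst] avoids evaluating any brackets or dollar signs that happen to be in the input.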

See also the Drag and Drop page on the Wiki, and The Lish family for other ways to get Unicodes from 7-bit ASCII.