Unicode and UTF-8

Unicode and its primary transformation format, UTF-8, provide a way to represent the characters found in the writing systems of the world.

See Also

  • A simple Arabic renderer
  • Keyboard widget
  • A little Unicode editor
  • Working with exotic characters in Unicode
  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky: a historical perspective and beginner's technical introduction

Description

The goal of Unicode is to define all the characters in all the writing systems that have ever existed. As of version 15.0 it defines 149,186 distinct coded characters drawn from 161 scripts. As of version 8.1, Tcl supports the first 65,536 characters of Unicode, known as the Basic Multilingual Plane (BMP), and uses a modified version of UTF-8 as its internal representation for strings.

UTF-8 is an extension of ASCII in which each Unicode character beyond ASCII is represented by two to six bytes, as follows (a short Tcl sketch after this list shows how to inspect the byte sequences):

  • ASCII 0x00..0x7F (Unicode page 0, left half): 0x00..0x7F. Nothing changed.
  • Unicode, pages 00..07: 2 bytes, 110aaabb 10bbbbbb, where aaa are the rightmost bits of the page number and bb.. are the bits of the second Unicode byte. These pages cover European/extended Latin, Greek, Cyrillic, Armenian, Hebrew, and Arabic.
  • Unicode, pages 08..FE: 3 bytes, 1110aaaa 10aaaabb 10bbbbbb. These cover all the rest of the BMP, including Hangul, Kanji, and so on. This means that East Asian texts are 50% longer in UTF-8 than in pure 16-bit Unicode.
  • ISO 10646 codes beyond the BMP: four to six bytes in the original design (later standards cap UTF-8 at four bytes, ending at U+10FFFF). Many of the newer emoji characters fall in this range.
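These byte lengths can be observed from within Tcl. A minimal sketch (assuming a build that accepts the characters in question):

foreach char [list A \u00E9 \u20AC] {
    set bytes [encoding convertto utf-8 $char]
    binary scan $bytes H* hex
    puts "U+[format %04X [scan $char %c]] -> $hex ([string length $bytes] bytes)"
}

This prints U+0041 -> 41 (1 bytes), U+00E9 -> c3a9 (2 bytes), and U+20AC -> e282ac (3 bytes).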

The full list of encoded characters is available in the Unicode Consortium's code charts.


LV: Just this week I had a developer ask me how to handle characters in the four-to-six byte range. How does that work in Tcl? Right now, their Tcl application has a problem when encountering the 𝒜 character (MATHEMATICAL SCRIPT CAPITAL A), which has the Unicode value 0x1D49C. tdom says that only UTF-8 chars up to 3 bytes in length can be handled. Is this just a tdom limitation, or is it also a Tcl limitation?

J'raxis 2016-03-23: I just ran into the same issue working on a Tcl script for an IRC app, dealing with the increasingly popular emoji characters. Above it says Tcl represents Unicode internally with 16 bits, which means U+FFFF is the highest code point it can support. For example, Tcl will output U+1F4A9 as "ð\x9f\x92©". In the same line of text I was testing with, other UTF-8-encoded input characters in the U+0000..FFFF range came through just fine.

wiwo 2016-12-07: There is a preprocessor directive in generic/tcl.h:

#define TCL_UTF_MAX 3

The comment there says that with UCS-2, values of 3 or 4 are safe. When compiling from source, this value is set to 3. Does anybody know what the drawbacks would be of setting it to 4?

APN: I believe AndroWish and related builds bump it up to 4, so it should work, though it will probably lose compatibility with standard Tcl extensions unless they are also rebuilt. Not sure what other changes chw had to make in addition to changing that directive.

chw PYK: In AndroWish TCL_UTF_MAX is set to 6, which turns out to use UTF-8 sequences at most 4 bytes long, but represents one Tcl_UniChar as a 32-bit item. Although this doubles memory requirements, it keeps counting of codepoints consistent, unlike TCL_UTF_MAX set to 3. In AndroWish some additional changes were required, too, in order to properly support non-Android platforms which have a UTF-16 OS interface (Windows; see undroidwish). One of your options is to search through the AndroWish source tree for places similar to #if TCL_UTF_MAX == 4 and #if TCL_UTF_MAX > 4. TCL_UTF_MAX set to 4 was deliberately not chosen, since it represents codepoints larger than 0xFFFF as surrogate pairs, which makes character counting expensive and which are not fully handled by the Tcl core.

How UTF-8 Extends ASCII

Richard Suchenwirth, PYK:

In Unicode, each code point corresponds to a character or some other item defined in the specification. In UTF-8, a byte whose top bit is 0 is a single-byte ASCII character: its value is 0x7F or below. If the top two bits are 11, the byte begins a multibyte sequence, and the length of the run of consecutive 1 bits at the top is the number of bytes in the sequence. If the top two bits are 10, the byte is a continuation byte. The remaining bits of the lead byte and the low six bits of each continuation byte are concatenated to form the binary representation of the code point. See also UTF-8 bit by bit. A sequence of n bytes can hold

 b = 5n + 1  (1 < n < 7)

bits "payload", so the maximum is 31 bits for a 6-byte sequence.

It follows from this that bytes in UTF-8 encoding fall in distinct ranges:

  • 00..7F: ASCII characters.
  • 80..BF: continuation bytes in a multibyte sequence.
  • C2..FD: initial bytes of a multibyte sequence (C0 and C1 are not allowed).
  • FE, FF: never used (to avoid confusion with byte-order marks).

The distinction between initial and continuation bytes helps in plausibility checks, and makes it possible to re-synchronize after missing data. Besides, UTF-8 is independent of byte order (16-bit Unicode inherits the platform's byte order, so it has to express that with the magic FEFF byte-order mark; should you read FFFE, you're to swap). Tcl, however, shields these UTF-8 details from us: characters are just characters, no matter whether 7 bit, 16 bit, or more. I liked Tcl since 7.4, and I love it since 8.1.
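The ranges above translate directly into a byte classifier, sketched here (utf8ByteClass is a name made up for this example):

proc utf8ByteClass b {
    if {$b < 0x80} {return ascii}         ;# 00..7F
    if {$b < 0xC0} {return continuation}  ;# 80..BF
    if {$b < 0xC2} {return invalid}       ;# C0, C1: would be overlong
    if {$b < 0xFE} {return initial}       ;# C2..FD
    return invalid                        ;# FE, FF
}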

Converting Unicode to Escape Sequences

Richard Suchenwirth:

Here's a little helper that reconverts a real Unicode string (e.g. as pasted to a widget) to \u-escaped ASCII:

proc u2a s {
    set res {} 
    foreach i [split $s {}] {
        scan $i %c c
        if {$c < 128} {append res $i} else {append res \\u[format %04.4X $c]}
    }
    set res
}
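For example:

u2a "Déjà vu"   ;# -> D\u00E9j\u00E0 vu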

I found your u2a nice and handy, but added a mapping, since for some reason it gave me 'wrong' numbers (apparently the input was CP1252-encoded):

## CP1252.TXT says these code points should map to these characters
variable myuscanlist 
set myuscanlist [eval list {*}{
    \\u0080 \\u20AC \\u0081 {}
    \\u0082 \\u201A \\u0083 \\u0192
    \\u0084 \\u201E \\u0085 \\u2026
    \\u0086 \\u2020 \\u0087 \\u2021
    \\u0088 \\u02C6 \\u0089 \\u2030
    \\u008A \\u0160 \\u008B \\u2039
    \\u008C \\u0152 \\u008D {}
    \\u008E \\u017D \\u008F {}
    \\u0090 {}        \\u0091 \\u2018
    \\u0092 \\u2019 \\u0093 \\u201C
    \\u0094 \\u201D \\u0095 \\u2022
    \\u0096 \\u2013 \\u0097 \\u2014
    \\u0098 \\u02DC \\u0099 \\u2122
    \\u009A \\u0161 \\u009B \\u203A
    \\u009C \\u0153 \\u009D {}
    \\u009E \\u017E \\u009F \\u0178 
}]


proc u2a s {
   variable myuscanlist 
   set res {} 
   foreach i [split $s {}] {
       scan $i %c c
       if {$c < 128} {
           append res $i
       } else {
           append res [::string map $myuscanlist \\u[format %04.4X $c] ] ;# koyama
       }
   }
   set res
} ;#RS

RS: All is not that well: Unicode text files cannot, so far, be sourced or auto-indexed by Tcl. As a preliminary fix, see my Unicode file reader.


The Unicode character \u0000, which is also the ASCII NUL, is represented internally using a two-byte sequence that is illegal in standard UTF-8. This prevents C code from interpreting the NUL character as the end of the string.

DKF - RS: the two bytes are C0 80; see UTF-8 bit by bit for why. The Unicode consortium has "outlawed" the use of non-shortest byte sequences for security reasons, with the practical consequence that the byte values C0 and C1 must never occur in UTF-8 data. So Tcl has to make sure that such characters (especially the NUL byte Donal mentioned) are converted to legal forms when exported.


LV: Can anyone provide a simple Tk example which shows the display of the various characters available with this new support? In particular, my users need to display Kanji, Greek, and English (at least) on a single page. - RS: see i18n tester (which also became a Tk widget demo in 8.4).

RS:

pack [text .t]
.t insert end "Japan  \u65E5\u672C\n"
.t insert end "Athens \u0391\u03B8\u03AE\u03BD\u03B1\u03B9"

Dirt simple. To enter Unicode characters in a more ergonomic way, use e.g. the Keyboard widget (that page has an example of a Russian typewriter), or a transliteration (see The Lish family, for instance Greeklish to Greek, Heblish to Hebrew), so you can write

 Athens [greeklish Aqh'nai]

for the same result as above.


RS:

Here's a simple wrapper for "encoding convertfrom"

proc <- {enco data} {encoding convertfrom $enco $data}

so you can just say:

<- utf-8 [binary format c* {208 184 208 183 208 187 208 190 208 182 208 181 208 189 208 184 208 181}] ;# -> изложение

DKF: It is more efficient to use the alias mechanism for this:

interp alias {} <- {} encoding convertfrom
interp alias {} -> {} encoding convertto
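With the aliases in place, for example:

string length [-> utf-8 \u20AC]   ;# -> 3: the bytes E2 82 AC
<- utf-8 [-> utf-8 \u20AC]        ;# -> the Euro sign again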

LV: Has anyone done any work reading and writing Unicode via Tcl pipes? How about using Unicode in Tk send commands to and from both older (pre-Tcl 8) and newer (Tk 8.3) applications? The reason I ask is that my users are reporting problems in both these areas: data loss in both cases (the eighth bit lost in the case of pipes; incorrect encoding in the latter).

KBK: For pipes, it depends on what the target application generates or expects to see. Once you've opened a pipe, you should be able to [fconfigure $pipeID -encoding whatever] to inform Tcl of the encoding to be used on the pipe.

Not sure about send - I work mostly on platforms where it isn't available.
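A minimal sketch of the pipe case (assuming Tcl 8.6 for the half-close; sort merely stands in for the target application):

set pipe [open |sort r+]
fconfigure $pipe -encoding utf-8
puts $pipe "b\u00E9ta"
puts $pipe "\u00E1lpha"
close $pipe write   ;# send EOF so sort can produce its output
puts [read $pipe]   ;# álpha and béta, decoded from UTF-8 again
close $pipe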


Latest news (Feb 7, 2001): From http://www.unicode.org/unicode/standard/versions/beta-ucd31.html :

Unicode 3.1 Beta has another 1,024 "Mathematical Alphanumerics", and of course roughly 40,000 more Kanji (CJK) characters listed. These pages are all beyond 16 bits (starting at U+10300 for "Old Italic").

With UTF-8, we can handle such character codes with no changes at all; users of UTF-16 or a short wchar_t would be excluded. (RS)

Produce a pure 1-byte encoding from Tcl: you can turn a UTF-8 encoded string into a strict 1-byte (and not proper UTF-8) string with [encoding convertfrom identity], as ::tcltest::bytestring demonstrates. This can be used if a C extension expects single-byte codes. The safer, but more complex, way would be to use Tcl_UtfToExternalDString on the C side.


International Components for Unicode (ICU), http://www.icu-project.org/, is a C/C++ library for handling many of the issues that developers using Unicode encounter. Has anyone written a Tcl binding for it yet?

See: ICU For Tcl



Mick O'Donnell: When you don't know which encoding to use for a language, below is code which displays a Unicode file in all available encodings:

proc display-all-encodings {} {
    set file [tk_getOpenFile]
    text .t -bg white -font Times
    pack .t -fill both -expand t
    foreach encoding [lsort -ascii [encoding names]] {
        set str [open $file r]
        fconfigure $str -encoding $encoding
        .t insert end "\nEncoding: $encoding\n[read $str]"
        close $str
    }
}

GPS: In the Tcl chatroom "The encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike's UTF-8 history)...." Plan 9 from Bell Labs FAQ , and UTF-8 history .


Java also has internal support for UTF-8. The Java UTF Socket Communication page provides examples for socket communication between Tcl and Java through the readUTF() and writeUTF() methods.


does anyone know anything about "gb2321"? (Perhaps gb2312, the simplified Chinese encoding, is meant; Tcl ships an encoding by that name.)


HaO: When receiving UTF-8 data byte by byte, one may convert the data with

 encoding convertfrom utf-8 $Data

If Data is not a complete UTF-8 sequence (e.g. the last character sequence is incomplete), the above command does not fail with an error; it just gives an (in this sense) unexpected result.

To separate complete sequences from any remaining bytes, the following procedure UTF8FullCodes may be used. It converts and returns all complete sequences in a buffer; any trailing incomplete sequence is left in the buffer. Usage example:

set f [open file_utf8.txt r]
fconfigure $f -encoding binary
set InBuffer {} 
while {![eof $f]} {
    append InBuffer [read $f 100]
    # do something with raw bytes like CRC evaluation etc.
    set NewData [UTF8FullCodes InBuffer]
    # do something with NewData
}
close $f

Remark: of course, the file might be opened using

fconfigure $f -encoding utf-8

and Tcl takes care of everything. This issue only matters when the raw bytes do not originate from a channel (in my case they came from a DLL), so that a channel encoding cannot be used.

proc UTF8FullCodes pBuffer {
    upvar $pBuffer Buffer
    # > Get last position index
    set LastPos [string length $Buffer]
    incr LastPos -1
    # > Current bytecount of a multi byte character includes the start character
    set nBytes 1
    # Loop over bytes from the end
    for {set Pos $LastPos} {$Pos >= 0} {incr Pos -1} {
        set Code [scan [string index $Buffer $Pos] %c]
        if { $Code < 0x80 } {
            # > ASCII
            break
        }
        if { $Code < 0xc0 } {
            # > Continuation byte of a multi-byte sequence (0x80..0xBF)
            incr nBytes
        } else {
            # > First byte of a multi-byte sequence
            # Find number of required bytes
            for {set Bytes 2} {$Bytes <= 6} {incr Bytes} {
                # > Check for zero at Position (7 - Bytes)
                if {0 == (( 1 << (7 - $Bytes)) & $Code)} {
                    break
                }
            }
            puts "Bytes=$Bytes"
            if { $Bytes == $nBytes } {
                # > Complete code
                set Pos $LastPos
                break
            } else {
                # > Incomplete code
                incr Pos -1
                break
            }
        }
    }
    # > Convert complete sequence until Pos
    set Res [encoding convertfrom utf-8 [string range $Buffer 0 $Pos]]
    # > Put incomplete character in Buffer
    incr Pos
    set Buffer [string range $Buffer $Pos end]
    return $Res
}
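A quick check of the buffering behaviour (0xC3 is the lead byte of a two-byte sequence, e.g. of é):

set buf "A[format %c 0xC3]"   ;# a complete 'A' plus a dangling lead byte
puts [UTF8FullCodes buf]      ;# -> A
puts [string length $buf]     ;# -> 1: the incomplete sequence stays in buf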

Lars H: A more compact way of checking that a UTF-8 octet sequence is a complete character is to use a regexp. This procedure checks whether char is (or contains) a complete UTF-8 character:

proc utf8 {char} {
    regexp {(?x)   # Expanded regexp syntax, so I can put in comments :-)
      [\x00-\x7F] |                # Single-byte chars (ASCII range)
      [\xC0-\xDF] [\x80-\xBF] |    # Two-byte chars (\u0080-\u07FF)
      [\xE0-\xEF] [\x80-\xBF]{2} | # Three-byte chars (\u0800-\uFFFF)
      [\xF0-\xF4] [\x80-\xBF]{3}   # Four-byte chars (U+10000-U+10FFFF, not supported by Tcl 8.5)
    } $char
}

(This regexp can be tightened a bit if one wishes to exclude ill-formed UTF-8; see Section 3.9 ("Unicode Encoding Forms") of the Unicode standard.) See UTF-8 bit by bit for an explanation of how UTF-8 is constructed.
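For example:

utf8 [format %c%c 0xE2 0x82]         ;# -> 0: only two bytes of the Euro sign
utf8 [format %c%c%c 0xE2 0x82 0xAC]  ;# -> 1: the complete three-byte character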

To match all complete UTF-8 chars at the beginning of a buffer use

regexp {(?x)   # Expanded regexp syntax, so I can put in comments :-)
    \A (
    [\x00-\x7F] |                # Single-byte chars (ASCII range)
    [\xC0-\xDF] [\x80-\xBF] |    # Two-byte chars (\u0080-\u07FF)
    [\xE0-\xEF] [\x80-\xBF]{2} | # Three-byte chars (\u0800-\uFFFF)
    [\xF0-\xF4] [\x80-\xBF]{3}   # Four-byte chars (U+10000-U+10FFFF, not supported by Tcl 8.5)
    ) +
} $buffer completeChars

telgo 2010-07-23 21:23:34:

Could I get one of you knowledgeable people to look at my question at Utf-8 difference between Windows and Mac? I am quite perplexed.


crusherjoe - 2013-10-31 02:33:34

If you need to include UTF-8 characters in your script, then follow the instructions given in https://www.tcl-lang.org/doc/howto/i18n.html under "Sourcing Scripts in Different Encodings":

set fd [open "yourcode.tcl" r]
fconfigure $fd -encoding utf-8
set script [read $fd]
close $fd
eval $script

RS 2013-11-01 Easier: since Tcl 8.5, you can just write

source -encoding utf-8 yourcode.tcl

If you are using some other scripting language and communicating with tclsh/wish via stdio, then include the following at the start of the text sent to tclsh/wish:

fconfigure stdin -encoding utf-8
fconfigure stdout -encoding utf-8

From then on you can display UTF-8 encoded text in Tk widgets.


wiwo 2016-12-09:

I wrote a little script for translating 4-byte UTF-8 chars to HTML entities.

proc utf8_encode_4byte_chars s {
    # info taken from https://de.wikipedia.org/wiki/UTF-8
    set result {} 
    set chars_left 0;

    foreach i [split $s {}] {
        # get the decimal representation
        scan $i %c c

        # If the binary representation starts with 11110, this is a 4-byte char.
        # The number of leading 1 bits before the first 0 gives the byte count.
        if {$c>=240 && $c<=247} {
          # start of 4 byte entity
          set chars_left 4
        }
        if {$chars_left > 0} {
          if {$chars_left == 4} {
            # This is the first byte, which always starts with "11110".
            # Its low 3 bits form the top bits of the entity value.
            set bnum2 [expr { $c & 7 }]
          } else {
            # Following bytes always start with "10".
            # Their low 6 bits are appended to the entity value.
            set bnum2 [expr { ($bnum2 << 6) + ( $c & 63 ) }]
          }

          if {$chars_left == 1} {
            # This is the last byte. We have gathered the full information.
            # format as hex html entity
            set entity_code "&#x[format %04.4X $bnum2];"
            
            append result $entity_code
          }

          incr chars_left -1
        } else {
          append result $i
        }
    }
    return $result
}

Usage:

set x "... string with 4 byte entities, can't be entered in this wiki ..."
puts [utf8_encode_4byte_chars $x]

Sample input (the 4-byte sequences are shown percent-encoded, since they can't be entered in this wiki; the proc turns each of them into an entity such as &#x1F600;):

# This is a smiley: %F0%9F%98%80! Of course there are more of them, e.g. %F0%9F%98%8D %F0%9F%98%8F! Still more: %F0%9F%99%83 %F0%9F%99%80 %F0%9F%98%BE %F0%9F%98%BD %F0%9F%98%BC %F0%9F%98%BB %F0%9F%98%BA %F0%9F%98%B9 %F0%9F%98%B8 %F0%9F%98%B7 %F0%9F%98%B6 %F0%9F%98%B5

Page Authors

Richard Suchenwirth
Original author