Unicode and UTF-8

'''[Richard Suchenwirth] 1999-08-11''':

Global players need many languages. And writing systems. For Chinese, Korean, or just Greek, we need a way to encode such non-ASCII characters.



** See Also **

   [A simple Arabic renderer], [Keyboard widget], [A little Unicode editor]:   Working with exotic Unicode characters.



** Description **

For a historical perspective and beginner's technical introduction, see Joel Spolsky's missive at
http://www.joelonsoftware.com/articles/Unicode.html

The encoding standard covering all these writing systems is [Unicode]
( http://www.unicode.org/ ), a 16 (or more) bit-wide encoding for presently '''94,140 distinct coded characters''' derived from more than 25 supported scripts (as of Unicode 3.1). Tcl/Tk has supported Unicode since version 8.1, using 16-bit chars (Tcl_UniChar) or the UTF-8 encoding as the internal representation for strings.

'''[UTF-8]''' is made to cover 7-bit ASCII, Unicode, and its superset ISO 10646 (which offers 31 bits of width, but seems to be overkill for most practical purposes). Characters are represented as
sequences of 1..6 eight-bit bytes - ''termed octets in the character set business'' - (for ASCII: 1, for Unicode: 2..3) as follows:

   * ASCII 0x00..0x7F (Unicode page 0, left half): 0x00..0x7F. Nothing changed.

   * Unicode, pages 00..07: 2 bytes, 110aaabb 10bbbbbb, where aaa are the rightmost bits of page#, bb.. are the bits of the second Unicode byte. These pages cover European/Extended Latin, Greek, Cyrillic, Armenian, [Hebrew], Arabic.

   * Unicode, pages 08..FE: 3 bytes, 1110aaaa 10aaaabb 10bbbbbb. These cover all the rest of Unicode, including Hangul, Kanji, and whatever else. This means that East Asian texts are 50% longer in UTF-8 than in pure 16-bit Unicode.

   * ISO 10646 codes beyond Unicode: 4..6 bytes. Many of the newer [emoji] characters fall in this range.

See the [http://www.fileformat.info/info/charset/UTF-8/list.htm%|%full list of encoded characters].
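You can watch these byte lengths from Tcl itself by converting a character with [encoding] convertto and dumping the result (a minimal sketch; ''utf8bytes'' is just an illustrative name):

======
# Show the UTF-8 byte sequence of a single character as hex digits:
proc utf8bytes {char} {
    binary scan [encoding convertto utf-8 $char] H* hex
    return $hex
}
utf8bytes A        ;# -> 41      (1 byte:  ASCII)
utf8bytes \u03B1   ;# -> ceb1    (2 bytes: Greek alpha, page 03)
utf8bytes \u20AC   ;# -> e282ac  (3 bytes: Euro sign, page 20)
======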

----

[LV] Just this week I had a developer ask me how to handle characters in the 4-6 byte range. How does that work in Tcl? Right now, their Tcl application is having a problem when encountering the 𝒜 character (a script A), which has the Unicode value 0x1D49C. [tdom] says that only UTF-8 chars up to 3 bytes in length can be handled. Is this just a tdom limitation, or is it also a Tcl limitation?

J'raxis 2016-03-23: I just ran into the same issue working on a Tcl script for an IRC app, dealing with the increasingly popular emoji characters. Above it says Tcl represents Unicode internally with 16 bits, which means U+FFFF is the highest it can support. For example Tcl will output U+1F4A9 as "ð\x9f\x92©". In the same line of text I was testing with, other UTF-8-encoded input characters in the U+0000..FFFF range came through just fine.

wiwo 2016-12-07: There is a preprocessor directive in generic/tcl.h: 

 #define TCL_UTF_MAX   3

The comment says UCS-2 (max 3 and 4 bytes) is safe. When compiling from source, this value is set to 3. Does anybody know what the drawbacks would be of setting this value to 4?

[APN]: I believe [AndroWish] and related builds bump it up to 4, so it should work, though it will probably lose compatibility with standard Tcl extensions unless they are also rebuilt. Not sure what other changes [chw] had to make in addition to changing that directive.

[chw]: In [AndroWish], TCL_UTF_MAX is set to 6, which turns out to use at most 4-byte-long UTF-8 sequences but represents one Tcl_UniChar as a 32-bit item. Although this doubles memory requirements, it keeps consistency w.r.t. counting codepoints compared to TCL_UTF_MAX set to 3. In [AndroWish] some additional changes were required, too, in order to properly support non-Android platforms which have a UTF-16 OS interface (Windows, see [undroidwish]). One of your options is to search through the AndroWish source tree for places similar to ''#if TCL_UTF_MAX == 4'' and ''#if TCL_UTF_MAX > 4''. TCL_UTF_MAX set to 4 was deliberately not chosen, since it requires representing codepoints larger than 0xFFFF as surrogate pairs, which makes character counting expensive, and surrogate pairs aren't fully handled by the Tcl core.



** A general principle of UTF-8 **

'''[Richard Suchenwirth]''':

A general principle of UTF-8 is that the first byte either is a single-byte character (if below 0x80), or
indicates the length of a multi-byte code by the number of 1-bits before the first 0, and is then filled
up with data bits. All other bytes start with the bits 10 and are then filled up with 6 data bits. See also [UTF-8 bit by bit]. A sequence of ''n'' bytes can hold

 b = 5n + 1  (1 < n < 7)

bits "payload", so the maximum is 31 bits for a 6-byte sequence.

It follows from this that bytes in UTF-8 encoding fall in distinct ranges:


 00..7F - plain old ASCII
 80..BF - non-initial bytes of multibyte code
 C2..FD - initial bytes of multibyte code (C0, C1 are not legal!)
 FE, FF - never used (so, free for byte-order marks).

The distinction between initial and non-initial bytes helps in plausibility checks, or to re-synchronize after missing data. Besides, UTF-8 is independent of byte order (16-bit Unicode inherits byte order, so it has to express that with the magic number FEFF; should you read FFFE, you're to swap - a small scripted check of this follows below).
Tcl however shields these UTF-8 details from us: characters are just characters, no matter whether 7 bit, 16 bit, or more. I have liked Tcl since 7.4, and I have loved it since 8.1.
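A minimal sketch of such a byte-order-mark check (''data.bin'' is a hypothetical input file):

======
set f [open data.bin rb]
set bom {}
binary scan [read $f 2] H4 bom
close $f
switch -- $bom {
    feff    {puts "big-endian UTF-16"}
    fffe    {puts "little-endian UTF-16 - swap bytes"}
    default {puts "no UTF-16 byte-order mark"}
}
======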

----

'''[Richard Suchenwirth]''':

Here's a little helper that reconverts a real Unicode string (e.g. as pasted to a widget) to \u-escaped ASCII:

======
proc u2a {s} {
    set res {} 
    foreach i [split $s {}] {
        scan $i %c c
        if {$c < 128} {append res $i} else {append res \\u[format %04.4X $c]}
    }
    set res
}
======
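For example, with Greek input (assuming the string reaches the proc intact):

======
puts [u2a "Αθήναι"]   ;# -> \u0391\u03B8\u03AE\u03BD\u03B1\u03B9
======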

----

I found your u2a nice and handy, but added a mapping,
since for some reason it gave me 'wrong' numbers.

======
## CP1252.TXT says these chars should map like this
variable myuscanlist 
set myuscanlist [list {*}{
    \\u0080 \\u20AC \\u0081 {}
    \\u0082 \\u201A \\u0083 \\u0192
    \\u0084 \\u201E \\u0085 \\u2026
    \\u0086 \\u2020 \\u0087 \\u2021
    \\u0088 \\u02C6 \\u0089 \\u2030
    \\u008A \\u0160 \\u008B \\u2039
    \\u008C \\u0152 \\u008D {}
    \\u008E \\u017D \\u008F {}
    \\u0090 {}        \\u0091 \\u2018
    \\u0092 \\u2019 \\u0093 \\u201C
    \\u0094 \\u201D \\u0095 \\u2022
    \\u0096 \\u2013 \\u0097 \\u2014
    \\u0098 \\u02DC \\u0099 \\u2122
    \\u009A \\u0161 \\u009B \\u203A
    \\u009C \\u0153 \\u009D {}
    \\u009E \\u017E \\u009F \\u0178 
}]


proc u2a {s} {
   variable myuscanlist 
   set res {} 
   foreach i [split $s {}] {
       scan $i %c c
       if {$c < 128} {
           append res $i
       } else {
           append res [::string map $myuscanlist \\u[format %04.4X $c] ] ;# koyama
       }
   }
   set res
} ;#RS
======

----

[RS]:  All is not that well. Unicode text files can so far not be sourced, or autoindexed, by Tcl.
As a preliminary fix, see my [Unicode file reader].

----

The Unicode character `\u0000`, which is also the [ascii%|%ASCII NUL], is represented in Tcl's internal modified UTF-8 by a two-byte sequence that is illegal in strict [UTF-8].  This avoids interpretation in [C] of the NUL character as the end of a string.

'''DKF''' - ''RS:'' the two bytes are C0 80. See [UTF-8 bit by bit] for why. The Unicode consortium has "outlawed" the use of non-shortest byte sequences for security reasons, with the practical consequence that byte values C0 and C1 must never occur in Unicode data. So Tcl will have to make sure that such characters (especially the NUL byte Donal mentioned) are converted to legal forms when exported.

----

LV: Can anyone provide a simple Tk example which shows display of the various characters available with this new support?  In particular, my users have a need to display Kanji, Greek, and English (at least) on a single page. - [RS]: see [i18n tester] (which also became a Tk widget demo in 8.4).

'''[RS]:'''

======
pack [text .t]
.t insert end "Japan  \u65E5\u672C\n"
.t insert end "Athens \u0391\u03B8\u03AE\u03BD\u03B1\u03B9"
======

Dirt simple. To get Unicodes in a more ergonomic way, use e.g. the [Keyboard widget] (that page has an example for a Russian typewriter), or a transliteration (see [The Lish family], for instance [Greeklish] to Greek, [Heblish] to Hebrew), so you can write
 Athens [greeklish Aqh'nai]
for the same result as above.

----

'''[RS]''':

Here's a simple wrapper for "encoding convertfrom"

======
proc <- {enco data} {encoding convertfrom $enco $data}
======

so you can just say:

======
<- utf-8 [binary format c* {208 184 208 183 208 187 208 190 208 182 208 181 208 189 208 184 208 181}]
======


''DKF:''  It is more efficient to use the '''alias''' mechanism for this:

======
interp alias {} <- {} encoding convertfrom
interp alias {} -> {} encoding convertto
======
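Either way, a round trip then looks like this:

======
set bytes [-> utf-8 "Αθήναι"]   ;# string to UTF-8 byte sequence
set back  [<- utf-8 $bytes]     ;# ... and back to the original string
======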

----

[LV]: Has anyone done any work reading and writing Unicode via Tcl pipes?
How about using Unicode in Tk [send] commands to and from both older (pre-Tcl 8)
and newer (Tk 8.3) applications?  The reason I ask is that my users are
reporting problems in both these areas - data loss in both cases
(the eighth bit is lost in the case of pipes; incorrect encoding in the latter).

''KBK'': For pipes, it depends on what the target application generates
or expects to see.  Once you've opened a pipe, you should be able
to [[fconfigure $pipeID -encoding whatever]] to inform Tcl of the
encoding to be used on the pipe.
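A minimal sketch (''somefilter'' is a stand-in for whatever program sits at the other end of the pipe):

======
set pipe [open "|somefilter" r+]
fconfigure $pipe -encoding utf-8 -buffering line
puts $pipe "Grüße"
set reply [gets $pipe]
close $pipe
======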

Not sure about `[send]` - I work mostly on platforms where it doesn't work.

----

'''Latest news (Feb 7, 2001):''' From
http://www.unicode.org/unicode/standard/versions/beta-ucd31.html :

Unicode 3.1 Beta has another 1024 "Mathematical Alphanumerics", and
of course roughly 40,000 more Kanji (CJK) characters listed. These pages are all
beyond 16 bits (starting at U+10300 for "Old Italic").
 
With UTF-8, we can handle such character codes with no changes at all;
users of UTF-16 or a short ''wchar_t'' would be excluded. (RS)

'''Produce pure 1-byte encoding from Tcl:''' You can turn a UTF-8 encoded string into a strict 1-byte (and not proper UTF-8) string with [[encoding convertfrom identity]], as ''::tcltest::bytestring'' demonstrates. This can be used if a C extension expects single-byte codes. 
The safer, but more complex way would be to use ''Tcl_UtfToExternalDString'' on the C side...

----

International Components for Unicode (ICU), http://www.icu-project.org/, is a C/C++ library for handling many of the issues that developers using Unicode encounter.  Has anyone written a Tcl binding for it yet?
See: [https://github.com/shawnw/icu4tcl%|%ICU For Tcl]

----

Additional reading to consider regarding Unicode:

   * [https://www.schneier.com/crypto-gram/archives/2000/0715.html#9%|%Schneier's Crypto-Gram on Unicode security]
   * [http://www.cl.cam.ac.uk/~mgk25/unicode.html%|%the Unix/Linux Unicode FAQ]
   * [http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt%|%UTF-8 torture test] (Tcl should be able to pass this)
   * [http://www.rfc-editor.org/rfc/rfc2279.txt%|%RFC 2279]
   * [http://freecode.com/projects/efont-unicode-bdf/%|%a collection of Unicode fonts]

----

Mick O'Donnell: When you don't know which encoding to use for a language,
the code below displays a Unicode file in all available encodings:

======
proc display-all-encodings {} {
    set file [tk_getOpenFile]
    text .t -bg white -font Times
    pack .t -fill both -expand t
    foreach encoding [lsort -ascii [encoding names]] {
        set str [open $file r]
        fconfigure $str -encoding $encoding
        .t insert end "\nEncoding: $encoding\n[read $str]"
        close $str
    }
}
======

----

[George Peter Staplin%|%GPS]:  In the [Tcl chatroom] "The encoding known today as UTF-8 was invented by Ken Thompson. It was born during the evening hours of 1992-09-02 in a New Jersey diner, where he designed it in the presence of Rob Pike on a placemat (see Rob Pike's UTF-8 history)...." [https://web.archive.org/web/20060328080939/http://ask.km.ru/3p/plan9faq.html%|%Plan 9 from Bell Labs FAQ], and [http://www.cl.cam.ac.uk/~mgk25/ucs/utf-8-history.txt%|%UTF-8 history].

----

Java also has internal support for UTF-8. The [Java UTF Socket Communication] page provides examples 
for socket communication between Tcl and Java through the readUTF() and writeUTF() methods.

----

does anyone know anything about "gb2312"?

----

[HaO]:  When receiving UTF-8 data byte by byte, one may convert the data by
 encoding convertfrom utf-8 $Data
If Data is not a complete UTF-8 sequence (e.g. the last character sequence is incomplete), the above command does not fail with an error.
It will just give an (in this sense) unexpected result.

To separate complete sequences from any remaining bytes, the following procedure UTF8FullCodes may be used.
It takes all complete sequences from a buffer and returns them converted. Any incomplete trailing sequence is left in the buffer.
Usage example:

======
set f [open file_utf8.txt r]
fconfigure $f -encoding binary
set InBuffer {} 
while {![eof $f]} {
    append InBuffer [read $f 100]
    # do something with raw bytes like CRC evaluation etc.
    set NewData [UTF8FullCodes InBuffer]
    # do something with NewData
}
close $f
======

Remark: of course, the file might be opened using

======
fconfigure $f -encoding utf-8
======

and Tcl takes care of everything. This issue only matters if the raw bytes do not originate from a stream device (in my case, a DLL), and thus the encoding cannot be used.

======
proc UTF8FullCodes pBuffer {
    upvar $pBuffer Buffer
    # > Get last position index
    set LastPos [string length $Buffer]
    incr LastPos -1
    # > Current bytecount of a multi byte character includes the start character
    set nBytes 1
    # Loop over bytes from the end
    for {set Pos $LastPos} {$Pos >= 0} {incr Pos -1} {
        set Code [scan [string index $Buffer $Pos] %c]
        if { $Code < 0x80 } {
            # > ASCII
            break
        }
        if { $Code < 0xc0 } {
            # > Component of a multi-byte sequence
            incr nBytes
        } else {
            # > First byte of a multi-byte sequence
            # Find number of required bytes
            for {set Bytes 2} {$Bytes <= 6} {incr Bytes} {
                # > Check for zero at Position (7 - Bytes)
                if {0 == (( 1 << (7 - $Bytes)) & $Code)} {
                    break
                }
            }
            puts "Bytes=$Bytes"
            if { $Bytes == $nBytes } {
                # > Complete code
                set Pos $LastPos
                break
            } else {
                # > Incomplete code
                incr Pos -1
                break
            }
        }
    }
    # > Convert complete sequence until Pos
    set Res [encoding convertfrom utf-8 [string range $Buffer 0 $Pos]]
    # > Put incomplete character in Buffer
    incr Pos
    set Buffer [string range $Buffer $Pos end]
    return $Res
}
======

[Lars H]: A more compact way of checking that an UTF-8 octet sequence is a complete character is to use a regexp. This procedure checks whether a char is UTF-8:

======
proc utf8 {char} {
    regexp {(?x)   # Expanded regexp syntax, so I can put in comments :-)
      [\x00-\x7F] |                # Single-byte chars (ASCII range)
      [\xC0-\xDF] [\x80-\xBF] |    # Two-byte chars (\u0080-\u07FF)
      [\xE0-\xEF] [\x80-\xBF]{2} | # Three-byte chars (\u0800-\uFFFF)
      [\xF0-\xF4] [\x80-\xBF]{3}   # Four-byte chars (U+10000-U+10FFFF, not supported by Tcl 8.5)
    } $char
}
======

(This regexp can be tightened a bit if one wishes to exclude ill-formed UTF-8; see Section 3.9 ("Unicode Encoding Forms") of the Unicode standard.)
See [UTF-8 bit by bit] for an explanation of how UTF-8 is constructed.
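For reference, here is a sketch of such a tightened matcher, following Table 3-7 ("Well-Formed UTF-8 Byte Sequences") of the standard; ''utf8strict'' is just an illustrative name:

======
proc utf8strict {char} {
    regexp {(?x)
      [\x00-\x7F] |                        # ASCII
      [\xC2-\xDF] [\x80-\xBF] |            # 2 bytes (no overlongs)
      \xE0 [\xA0-\xBF] [\x80-\xBF] |       # 3 bytes (no overlongs)
      [\xE1-\xEC\xEE\xEF] [\x80-\xBF]{2} | # 3 bytes
      \xED [\x80-\x9F] [\x80-\xBF] |       # 3 bytes (no surrogates)
      \xF0 [\x90-\xBF] [\x80-\xBF]{2} |    # 4 bytes (no overlongs)
      [\xF1-\xF3] [\x80-\xBF]{3} |         # 4 bytes
      \xF4 [\x80-\x8F] [\x80-\xBF]{2}      # 4 bytes (up to U+10FFFF)
    } $char
}
======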

To match all complete UTF-8 chars at the beginning of a buffer, use:

======
regexp {(?x)   # Expanded regexp syntax, so I can put in comments :-)
    \A (
    [\x00-\x7F] |                # Single-byte chars (ASCII range)
    [\xC0-\xDF] [\x80-\xBF] |    # Two-byte chars (\u0080-\u07FF)
    [\xE0-\xEF] [\x80-\xBF]{2} | # Three-byte chars (\u0800-\uFFFF)
    [\xF0-\xF4] [\x80-\xBF]{3}   # Four-byte chars (U+10000-U+10FFFF, not supported by Tcl 8.5)
    ) +
} $buffer completeChars
======
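If the match succeeds, the decoded text and the incomplete tail can then be split apart like this (a sketch continuing the example above):

======
set text   [encoding convertfrom utf-8 $completeChars]
set buffer [string range $buffer [string length $completeChars] end]
======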


----

'''[telgo] 2010-07-23 21:23:34''':

Could I get one of you knowledgeable people to look at my question at [Utf-8 difference between Windows and Mac?]?
I am quite perplexed.


----
'''[crusherjoe] - 2013-10-31 02:33:34'''

If you need to include UTF-8 characters in your script, then follow the instructions given in http://www.tcl.tk/doc/howto/i18n.html under "Sourcing Scripts in Different Encodings":

======
set fd [open "yourcode.tcl" r]
fconfigure $fd -encoding utf-8
set script [read $fd]
close $fd
eval $script
======

[RS] 2013-11-01 Easier: since [Changes in Tcl/Tk 8.5%|%Tcl 8.5], you can just write

======
source -encoding utf-8 yourcode.tcl
======

If you are using some other scripting language and communicating with tclsh/wish via stdio, then
include the following at the start of the text sent to tclsh/wish:

======
fconfigure stdin -encoding utf-8
fconfigure stdout -encoding utf-8
======

From then on you can display UTF-8 encoded text in Tk widgets.

----

'''wiwo 2016-12-09''':

I wrote a little script for translating 4-byte UTF-8 chars to HTML entities.

======
proc utf8_encode_4byte_chars s {
    # info taken from https://de.wikipedia.org/wiki/UTF-8
    set result {} 
    set chars_left 0

    foreach i [split $s {}] {
        # get the decimal representation
        scan $i %c c

        # If the binary representation starts with 11110, this is the lead
        # byte of a 4-byte char; the number of leading 1 bits gives the
        # total number of bytes.
        if {$c >= 240 && $c <= 247} {
          # lead byte of a 4-byte sequence
          set chars_left 4
        }
        if {$chars_left > 0} {
          if {$chars_left == 4} {
            # This is the lead byte, which always starts with "11110".
            # Its low 3 bits go into the entity value.
            set bnum2 [expr { $c & 7 }]
          } else {
            # Continuation bytes always start with "10".
            # Their low 6 bits are appended to the entity value.
            set bnum2 [expr { ($bnum2 << 6) + ( $c & 63 ) }]
          }

          if {$chars_left == 1} {
            # This is the last byte. We have gathered the full information.
            # format as hex html entity
            set entity_code "&#x[format %04.4X $bnum2];"
            
            append result $entity_code
          }

          incr chars_left -1
        } else {
          append result $i
        }
    }
    return $result
}
======

Usage:

======
set x "... string with 4 byte entities, can't be entered in this wiki ..."
puts [utf8_encode_4byte_chars $x]
======

Sample output

======
# This is a smiley: &#x1F600;! Of course there are more of them, e.g. &#x1F60D; &#x1F60F;! Still more: &#x1F643; &#x1F640; &#x1F63E; &#x1F63D; &#x1F63C; &#x1F63B; &#x1F63A; &#x1F639; &#x1F638; &#x1F637; &#x1F636; &#x1F635;
======



<<categories>> Characters | String Processing | i18n - writing for the world | Arts and crafts of Tcl-Tk programming