Version 60 of Unicode and UTF-8

Richard Suchenwirth 1999-08-11 - Global players need many languages. And writing systems. For Chinese, Korean, or just Greek, we need a way to code such non-ASCII characters.

For a historical perspective and beginner's technical introduction, see Joel Spolky's missive at http://www.joelonsoftware.com/articles/Unicode.html

The encoding standard to cover all these writing systems is the Unicode ( http://www.unicode.org/ ), a 16 (or more) bit-wide encoding for presently 94,140 distinct coded characters derived from more than 25 supported scripts (as of Unicode 3.1). Tcl/Tk supports the Unicode from version 8.1 as 16-bit chars or in the UTF-8 encoding as the internal representation for strings.

UTF-8 is made to cover 7-bit ASCII, Unicode, and its superset ISO 10646 (which offers 31 bits width, but seems to be an overkill for most practical purposes). Characters are represented as sequences of 1..6 eight-bit bytes - termed octets in the character set business - (for ASCII: 1, for Unicode: 2..3) as follows:

ASCII 0x00..0x7F (Unicode page 0, left half): 0x00..0x7F. Nothing changed.
Unicode, pages 00..07: 2 bytes, 110aaabb 10bbbbbb, where aaa are the rightmost bits of page#, bb.. are the bits of the second Unicode byte. These pages cover European/Extended Latin, Greek, Cyrillic, Armenian, Hebrew, Arabic.
Unicode, pages 08..FE: 3 bytes, 1110aaaa 10aaaabb 10bbbbbb. These cover all the rest of Unicode, including Hangul, Kanji, and what else. This means that East Asian texts are 50% longer in UTF-8 than in pure 16 bit Unicode.
ISO 10646 codes beyond Unicode: 4..6 bytes. (Never seen one yet).

The full list of encoded characters .

LV Just this week I had a developer ask me how to handle characters in the 4-6 byte range. How does that work in Tcl? Right now, their tcl application is having a problem when encountering the &Ascr; character (which is a script-A), which has the unicode value of 0x1D49C. tdom says that only UTF-8 chars up to 3 bytes in length can be handled. Is this just a tdom limitation, or is it also a Tcl limitation?

tvideo.ge