Richard Suchenwirth 1999-08-11 - Global players need many languages. And writing systems. For Chinese, Korean, or just Greek, we need a way to code such non-ASCII characters.
For a historical perspective and a beginner's technical introduction, see Joel Spolsky's missive at http://www.joelonsoftware.com/articles/Unicode.html
The encoding standard that covers all these writing systems is Unicode ( http://www.unicode.org/ ), a 16 (or more) bit-wide encoding for presently 94,140 distinct coded characters derived from more than 25 supported scripts (as of Unicode 3.1). Tcl/Tk has supported Unicode since version 8.1, using either 16-bit chars or the UTF-8 encoding as the internal representation for strings.
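UTF-8 is designed to cover 7-bit ASCII, Unicode, and its superset ISO 10646 (which offers 31 bits of width, but seems overkill for most practical purposes). Characters are represented as sequences of 1..6 eight-bit bytes, termed octets in the character set business (for ASCII: 1, for Unicode: 2..3), as follows:
 0x00000000..0x0000007F (7 bits): 0xxxxxxx
 0x00000080..0x000007FF (11 bits): 110xxxxx 10xxxxxx
 0x00000800..0x0000FFFF (16 bits): 1110xxxx 10xxxxxx 10xxxxxx
 0x00010000..0x001FFFFF (21 bits): 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
 0x00200000..0x03FFFFFF (26 bits): 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
 0x04000000..0x7FFFFFFF (31 bits): 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx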
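The 1..6 octet scheme described above can be sketched in a few lines. This is an illustration in Python, not how Tcl implements it internally; the function name `utf8_encode` is made up for this example. The lead octet carries a length marker (110, 1110, 11110, ...) and every following octet starts with the bits 10 and carries six payload bits. Note that the current UTF-8 definition stops at 4 octets (U+10FFFF), so the comparison against Python's built-in codec only covers that range.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point using the original 1..6 octet UTF-8 layout."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    if code_point < 0x80:
        return bytes([code_point])              # 0xxxxxxx: plain ASCII
    # (exclusive upper limit, number of octets, lead-octet marker)
    for limit, length, marker in (
        (0x800, 2, 0xC0),        # 110xxxxx
        (0x10000, 3, 0xE0),      # 1110xxxx
        (0x200000, 4, 0xF0),     # 11110xxx
        (0x4000000, 5, 0xF8),    # 111110xx
        (0x80000000, 6, 0xFC),   # 1111110x
    ):
        if code_point < limit:
            break
    else:
        raise ValueError("code point needs more than 31 bits")
    octets = []
    for _ in range(length - 1):
        # Continuation octet: 10xxxxxx, taking the lowest six bits.
        octets.append(0x80 | (code_point & 0x3F))
        code_point >>= 6
    octets.append(marker | code_point)          # lead octet with remaining bits
    return bytes(reversed(octets))


if __name__ == "__main__":
    # A (ASCII), e-acute (2 octets), euro sign (3 octets)
    for cp in (0x41, 0xE9, 0x20AC):
        encoded = utf8_encode(cp)
        # Cross-check against Python's own UTF-8 codec.
        assert encoded == chr(cp).encode("utf-8")
        print(f"U+{cp:04X}: {len(encoded)} octet(s), hex {encoded.hex()}")
```

The table-driven loop mirrors the layout directly: the only things that change from row to row are the range limit, the octet count, and the marker bits in the lead octet.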
The full list of encoded characters .
LV Just this week I had a developer ask me how to handle characters in the 4-6 byte range. How does that work in Tcl? Right now, their Tcl application is having a problem when it encounters the 𝒜 character (a script A), which has the Unicode value 0x1D49C. tdom says that only UTF-8 chars up to 3 bytes in length can be handled. Is this just a tdom limitation, or is it also a Tcl limitation?
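As background to the question, here is a small sketch (in Python, purely for illustration) of what U+1D49C looks like on the wire. It does not settle whether the limit lies in tdom or in Tcl itself. It only shows why the character cannot fit into a 3-octet UTF-8 sequence or a single 16-bit unit: it lies above 0xFFFF, so UTF-8 needs 4 octets and UTF-16 needs a surrogate pair. The helper name `surrogate_pair` is made up for this example.

```python
def surrogate_pair(code_point: int) -> tuple:
    """Split a code point above 0xFFFF into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points U+10000..U+10FFFF use surrogates")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


script_a = chr(0x1D49C)                  # MATHEMATICAL SCRIPT CAPITAL A

utf8 = script_a.encode("utf-8")
utf16 = script_a.encode("utf-16-be")

if __name__ == "__main__":
    print(len(utf8), utf8.hex())         # 4 f09d929c
    print(len(utf16), utf16.hex())       # 4 d835dc9c
    print([hex(u) for u in surrogate_pair(0x1D49C)])  # ['0xd835', '0xdc9c']
```

Any component that stores characters as single 16-bit values, or that accepts UTF-8 sequences of at most 3 octets, is limited to U+0000..U+FFFF and would have to reject or split this character.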