Version 4 of utf-8

Updated 2008-03-10 13:01:06 by LV

A method of encoding Unicode characters. It takes a variable number of bytes per character (1 to 4 in standard UTF-8; characters in the Basic Multilingual Plane need at most 3), but has the good property that characters from the ASCII subset (a majority of those found in most Tcl programs and much other text) are encoded as single bytes with their ASCII values.
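The variable width is easy to see by encoding a few characters and counting bytes. The following is a minimal sketch in Python rather than Tcl; the helper name utf8_length and the sample characters are chosen here for illustration only.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes standard UTF-8 uses for one character."""
    return len(ch.encode("utf-8"))

# An ASCII letter, a Latin-1 letter, the euro sign (still inside the BMP),
# and an emoji from outside the BMP.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")
```

Only the last sample, which lies outside the Basic Multilingual Plane, needs a fourth byte.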

Internally, Tcl uses a pseudo-UTF-8 encoding for most of its strings. This differs from the standard encoding in exactly one way: the NUL character (\u0000) is encoded using two bytes, 0xC0 0x80 (an overlong form that standard UTF-8 does not permit). Because no encoded character then contains a zero byte, strings can serve as binary-safe containers while still keeping the C-string property of having a zero byte terminate the string.
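The effect of the two-byte NUL can be imitated outside Tcl. This is only a rough sketch in Python, not Tcl's actual implementation: to_tcl_internal is a hypothetical helper that takes standard UTF-8 and substitutes the overlong pair for every zero byte, which is safe because in UTF-8 a zero byte can only ever represent NUL.

```python
def to_tcl_internal(text: str) -> bytes:
    """Approximate the internal form: standard UTF-8, except that
    NUL (U+0000) becomes the overlong two-byte sequence C0 80."""
    return text.encode("utf-8").replace(b"\x00", b"\xc0\x80")

encoded = to_tcl_internal("a\x00b")
print(encoded.hex(" "))      # bytes for 'a', the two-byte NUL, and 'b'
print(b"\x00" in encoded)    # no zero byte is left inside the data

# A strict UTF-8 decoder refuses the overlong form, which is why this
# representation must be converted before it leaves the program.
try:
    b"\xc0\x80".decode("utf-8")
    strict_decoder_accepts = True
except UnicodeDecodeError:
    strict_decoder_accepts = False
print(strict_decoder_accepts)
```

With no zero byte inside the data, a single trailing zero byte can terminate the buffer unambiguously even when the string itself contains NUL characters.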

See also Unicode and UTF-8.