Version 10 of utf-8

Updated 2010-01-09 11:25:52 by jima

A method of encoding Unicode characters. It takes a variable number of bytes per character (1 to 4 in standard UTF-8, and at most 3 for characters in the Basic Multilingual Plane), and has the useful property that characters from the ASCII subset (the majority of those found in most Tcl programs and much other text) are encoded as single bytes.
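As a rough illustration of the variable width, the following sketch counts the bytes that [encoding convertto utf-8] produces for a few sample characters. The helper name utf8len is made up for this example and is not part of Tcl.

```tcl
# Hypothetical helper: number of bytes in the UTF-8 encoding of a string.
proc utf8len str {
    # encoding convertto returns a byte string, so its length is the byte count.
    string length [encoding convertto utf-8 $str]
}

puts [utf8len A]        ;# ASCII letter: 1 byte
puts [utf8len \u00e9]   ;# e with acute accent: 2 bytes
puts [utf8len \u20ac]   ;# euro sign: 3 bytes
```

The characters are written as \u escapes so the result does not depend on the encoding of the script file.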

Internally, Tcl uses a pseudo-UTF-8 encoding for most of its strings. This differs from the standard encoding in exactly one way: the NUL character (\u0000) is encoded using two bytes (0xC0 0x80, an overlong form that standard UTF-8 does not allow). This means that strings can be used as binary-safe containers while still keeping the C-string property that a zero byte terminates the string.
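The two-byte NUL exists only in the internal representation, so it cannot be observed directly from a script. What a script can check is the effect: an embedded NUL does not truncate the string, and the external utf-8 encoding still emits a plain zero byte. A minimal sketch:

```tcl
# A string with an embedded NUL keeps its full length at the script level.
set s "a\u0000b"
puts [string length $s]          ;# 3 characters, NUL included

# The external UTF-8 encoding uses a single zero byte for NUL.
binary scan [encoding convertto utf-8 $s] H* hex
puts $hex                        ;# 610062
```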

See also Unicode and UTF-8.


DKF: Here's a little utility procedure I wrote today when I needed to convert a Unicode character into a sequence of UTF-8 encoded hex escapes (for a C string literal):

proc toutf8 c {
    set s [encoding convertto utf-8 $c]
    binary scan $s cu* x
    format [string repeat \\x%02x [string length $s]] {*}$x
}

Demonstrating:

% toutf8 \u1234
\xe1\x88\xb4
% toutf8 \u0000
\x00

ferrieux: May I suggest a slight readability improvement (and possibly a performance one, though I have not measured it and do not expect much):

proc toutf8 c {
    set s [encoding convertto utf-8 $c]
    binary scan $s H* x
    regsub -all -expanded {..} $x {\x&}
}

The demonstrated output continues to be the same as shown above, as expected.


jima (2010-01-09)

Does this work for Unicode code points in the range U+010000 to U+10FFFF?

According to [L1 ], U+010000 is encoded as xF0 x90 x80 x80.

On my machine

 toutf8 \u10000

Produces

 \xe1\x80\x80\x30

And (notice the extra leading 0 here)

 toutf8 \u010000

Produces

 \xc4\x80\x30\x30

I have tested some code points in the other ranges defined in [L2 ] and everything seems fine as long as no extra zeroes are put at the beginning:

 toutf8 \u20ac

Correctly produces

 \xe2\x82\xac
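
The odd results above are consistent with the \u escape reading at most four hex digits: \u10000 is then U+1000 followed by a literal "0" (byte x30), and \u010000 is U+0100 followed by "00". One way to sidestep escape parsing is to start from the integer code point and build the bytes by hand. The sketch below assumes a hypothetical helper named cp2utf8 (not part of Tcl) and does no special handling of surrogate code points:

```tcl
# Hypothetical helper: UTF-8 hex escapes for an integer code point,
# computed with bit arithmetic so it does not depend on \u parsing.
proc cp2utf8 cp {
    if {$cp < 0 || $cp > 0x10FFFF} {
        error "code point out of range: $cp"
    } elseif {$cp < 0x80} {
        # 1 byte: 0xxxxxxx
        set bytes [list $cp]
    } elseif {$cp < 0x800} {
        # 2 bytes: 110xxxxx 10xxxxxx
        set bytes [list [expr {0xC0 | ($cp >> 6)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    } elseif {$cp < 0x10000} {
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        set bytes [list [expr {0xE0 | ($cp >> 12)}] \
                        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    } else {
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        set bytes [list [expr {0xF0 | ($cp >> 18)}] \
                        [expr {0x80 | (($cp >> 12) & 0x3F)}] \
                        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    }
    set out ""
    foreach b $bytes {
        append out [format {\x%02x} $b]
    }
    return $out
}

puts [cp2utf8 0x10000]   ;# \xf0\x90\x80\x80
puts [cp2utf8 0x20ac]    ;# \xe2\x82\xac
```

Unlike toutf8, this takes a number rather than a character, so it gives the same answer whether or not the running Tcl can represent characters beyond U+FFFF.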