utf-8

A method of encoding Unicode characters. It takes a variable number of bytes per character (1..4 in general, and 1..3 for characters in the Basic Multilingual Plane, which is all that Tcl currently handles), but has the good property of encoding characters from the ASCII subset (a majority of those found in most Tcl programs and much other text) as single bytes.

Internally, Tcl uses a pseudo-UTF-8 encoding for most of its strings. This differs from the standard encoding in exactly one way: the NUL character (\u0000) is encoded using two bytes (i.e. in denormalized form). This means that we can use strings as binary-safe containers while still maintaining the C-string property of having a zero byte terminate the string.
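
The internal representation itself isn't visible from the script level, but a quick demonstration shows the two behaviours that matter: the standard utf-8 encoding turns NUL into a single zero byte, and the decoder also accepts the denormalized two-byte form (how strict the utf-8 encoding is about such forms may vary between Tcl versions):

% binary scan [encoding convertto utf-8 \u0000] H* hex; set hex
00
% string length [encoding convertfrom utf-8 \xc0\x80]
1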

See also Unicode and UTF-8.


DKF: Here's a little utility procedure I wrote today when I needed to convert a Unicode character into a set of UTF-8 encoded hex digits (for a C string literal):

proc toutf8 c {
    # Encode the character as (standard) UTF-8 bytes
    set s [encoding convertto utf-8 $c]
    # Extract the individual byte values as unsigned integers
    binary scan $s cu* x
    # Render each byte as a \xNN escape, suitable for a C string literal
    format [string repeat \\x%02x [string length $s]] {*}$x
}

Demonstrating:

% toutf8 \u1234
\xe1\x88\xb4
% toutf8 \u0000
\x00

ferrieux: May I suggest a slight enhancement of readability (and possibly performance, though I haven't measured it and don't expect much):

proc toutf8 c {
    # Encode the character as UTF-8 bytes, then hex-dump them
    set s [encoding convertto utf-8 $c]
    binary scan $s H* x
    # Prefix each pair of hex digits with \x
    regsub -all -expanded {..} $x {\x&}
}

The output is the same as shown above, as expected.


jima (2010-01-09): Does this work for code points in the range U+010000 to U+10FFFF?

U+010000 is \xF0 \x90 \x80 \x80

According to [L1 ]

On my box

 toutf8 \u10000

Produces

 \xe1\x80\x80\x30

And (notice the extra 0 introduced here)

 toutf8 \u010000

Produces

 \xc4\x80\x30\x30

I have tested some code points in the other ranges defined in [L2 ] and everything seems fine as long as we don't put any extra zeroes at the beginning:

 toutf8 \u20ac

Correctly produces

 \xe2\x82\xac

Lars H, 2010-01-12: No, Tcl can (currently) only represent characters within the Basic Multilingual Plane of Unicode, so there's no way that you can even feed a U+10000 into encoding convertto :-(. Fixing that is non-trivial, since some parts of Tcl (the C library) require a representation of strings where all characters take up the same number of bytes. It is possible to compile Tcl with that "number of bytes" set to 4 (meaning 32 bits per character), but it's rather wasteful, and has been reported to be not entirely compatible with Tk.

What one can often make do with is using surrogate pairs for characters beyond the BMP, thus treating Tcl's strings as being the UTF-16 representations of the strings proper. This doesn't play well with encoding convertto utf-8 though, as that will reencode each surrogate in the pair as a separate character. Perhaps I should get around to doing something about that…
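
As a sketch of that surrogate-pair idea (the proc name surrogatepair is made up for illustration, and no validation is done), a supplementary code point can be split into a pair stored as two BMP characters:

proc surrogatepair code {
    # code is assumed to lie in the range 0x10000..0x10FFFF
    set code [expr {$code - 0x10000}]
    # The high surrogate carries the top 10 bits, the low surrogate the bottom 10
    format %c%c [expr {0xD800 | ($code >> 10)}] [expr {0xDC00 | ($code & 0x3FF)}]
}

So surrogatepair 0x10000 gives \uD800\uDC00; running that through encoding convertto utf-8 gives, as described above, one three-byte sequence per surrogate (\xed\xa0\x80\xed\xb0\x80) instead of the single four-byte sequence \xf0\x90\x80\x80 that a strict UTF-8 consumer expects.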

\u by design grabs at most four hexadecimal digits (thus leaving any further digits alone), and would continue to do so even after Tcl is extended to support full Unicode; this is so that you can put a hex digit immediately after a \u escape, which is not possible with \x (that will grab any number of hex digits). Possibly there would be a \U escape for the full range (regexp already implements that, at least syntactically).
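
This is easy to see at the script level: \u10000 parses as the single character \u1000 followed by a literal "0", which is where the trailing \x30 in the output above comes from:

% string length \u10000
2
% string index \u10000 end
0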

jima, 2010-01-13: Thanks Lars for the explanation.

Perhaps another side of this problem is that even if one is able to generate the right UTF-8 (by coding the Unicode-to-UTF-8 conversion algorithm oneself), one won't get the proper graphical output unless the instructions to produce it are somewhere in the belly of Tcl.

So (as I understand it), to display an image of the Unicode character U+10000, properly encoded in UTF-8 as \xF0\x90\x80\x80, we would need extra information besides the algorithm depicted in the Wikipedia page mentioned earlier.
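
For reference, a hand-coded version of that conversion algorithm might look roughly like this (a sketch only; utf8bytes is a made-up name and no validation of the code point is done):

proc utf8bytes code {
    # Select the UTF-8 byte pattern according to the size of the code point
    if {$code < 0x80} {
        set bytes [list $code]
    } elseif {$code < 0x800} {
        set bytes [list [expr {0xC0 | ($code >> 6)}] [expr {0x80 | ($code & 0x3F)}]]
    } elseif {$code < 0x10000} {
        set bytes [list [expr {0xE0 | ($code >> 12)}] \
                        [expr {0x80 | (($code >> 6) & 0x3F)}] \
                        [expr {0x80 | ($code & 0x3F)}]]
    } else {
        set bytes [list [expr {0xF0 | ($code >> 18)}] \
                        [expr {0x80 | (($code >> 12) & 0x3F)}] \
                        [expr {0x80 | (($code >> 6) & 0x3F)}] \
                        [expr {0x80 | ($code & 0x3F)}]]
    }
    # Render the bytes as \xNN escapes, as in the procs above
    format [string repeat \\x%02x [llength $bytes]] {*}$bytes
}

With that, utf8bytes 0x10000 gives \xf0\x90\x80\x80 and utf8bytes 0x20ac gives \xe2\x82\xac, matching the values quoted above; but as noted, generating the bytes is a separate matter from getting them displayed.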

Lars H, 2010-01-19: The graphical output always depends on what is supposed to be supplying it. Tk is probably difficult, but if you are instead generating text that some other program (e.g. a web browser) is supposed to render, then 4-byte UTF-8 sequences may be fine.

For what it's worth, I went ahead with the "surrogate pairs inside Tcl — 4-byte sequences outside" idea; the result so far can be found in the Half Bakery at http://wiki.tcl.tk/_repo/UTF/ . This is a C-coded extension (well, those of the files needed for one that are not the same as in the sampleextension), whose package name is UTF, and which defines a new encoding UTF-8 (upper case, whereas the built-in one is utf-8). Upon encoding convertfrom, this converts 4-byte sequences (code points U+10000 through U+10FFFF) to surrogate pairs, and upon encoding convertto it converts surrogate pairs to 4-byte sequences. There are even tests, which it passes!
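
Based on that description, usage would presumably look something like this (an untested sketch, assuming the package loads and registers the encoding as described):

package require UTF
set bytes [binary format H* f0908080]        ;# the 4-byte UTF-8 sequence for U+10000
set s [encoding convertfrom UTF-8 $bytes]    ;# should give the surrogate pair \ud800\udc00
encoding convertto UTF-8 $s                  ;# and back to the 4-byte sequence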

The only catch is that it doesn't seem to work at all when used as a channel encoding: it gets stuck in a loop. I don't know what's wrong, but then this is my first attempt ever to use Tcl_CreateEncoding, so it's probably just some fine detail about Tcl's encoding subsystem that I don't understand.

If I work it out, then a logical next step would be to also implement UTF-16BE and UTF-16LE as encodings (the built-in unicode encoding is almost one of these, but which one it is depends on the platform).