utf-8

`[https://www.unicode.org/versions/latest/ch03.pdf#G7404%|%utf-8]`, where `utf`
is short for '''unicode transformation format''', is a method of [encoding]
[unicode] characters using one to four bytes per character.  It is a superset
of [ascii], uses easily-distinguishable context-free prefixes to distinguish the
beginning of each character, and can be probabilistically differentiated from
legacy [ascii%|%extended ascii] encodings.
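
For instance, asking Tcl for the utf-8 encoding of characters from different ranges shows the width growing from one to three bytes (a minimal illustration using `[encoding convertto]`; characters beyond U+FFFF are discussed further down this page):

===
'''%''' string length [encoding convertto utf-8 A]
''1''
'''%''' string length [encoding convertto utf-8 \u00e9]
''2''
'''%''' string length [encoding convertto utf-8 \u20ac]
''3''
===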



** See Also **

   [Unicode and UTF-8]:   



** Tcl Internals **

Internally, Tcl uses
[https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8%|%modified utf-8] encoding,
which is the same as utf-8 except that the NUL character (`\u0000`) is encoded
as the bytes `0xC0` `0x80`, which is not a valid utf-8 sequence.  Since such a
string contains no null bytes, the [C]-string property that a null byte
terminates the string is preserved.
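
As a byte-level sketch of the difference (a hypothetical helper, not Tcl's actual internal code path): standard utf-8 encodes U+0000 as a single `0x00` byte, whereas the modified form substitutes the overlong two-byte sequence, so the encoded data never contains a null byte:

======
# Hypothetical helper: derive a modified-utf-8 byte string from standard
# utf-8 by replacing each 0x00 byte with the overlong form 0xC0 0x80.
proc toModifiedUtf8 {string} {
    string map [list \x00 \xc0\x80] [encoding convertto utf-8 $string]
}
======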

----
[DKF]: Here's a little utility procedure I wrote today when I needed to convert
the utf-8 encoding of a unicode character into a sequence of `\x`-escaped
hexadecimal bytes to use as a literal value in C:

======
proc toutf8 c {
    # Get the utf-8 bytes of the character.
    set s [encoding convertto utf-8 $c]
    # Read each byte as an unsigned integer.
    binary scan $s cu* x
    # Emit one \xNN escape per byte.
    format [string repeat \\x%02x [string length $s]] {*}$x
}
======

Demonstrating:

===
'''%''' toutf8 \u1234
''\xe1\x88\xb4''
'''%''' toutf8 \u0000
''\x00''
===

----
[ferrieux]: May I suggest a slight enhancement in readability (and possibly
performance, though I haven't measured it and don't expect much):

======
proc toutf8 c {
    set s [encoding convertto utf-8 $c]
    binary scan $s H* x
    regsub -all -expanded {..} $x {\x&}
}
======

The demonstrated output continues to be the same as shown above, as expected.

----

[jima] 2010-01-09: Does this work for unicode characters in the range U+010000 to U+10FFFF?

    :   U+010000 is xF0 x90 x80 x80

According to [http://www.fileformat.info/info/unicode/char/10000/index.htm]

On my box,

======
toutf8 \u10000
======

Produces

======
\xe1\x80\x80\x30
======

And (notice the extra `0` introduced here)

======
toutf8 \u010000
======

Produces

======
\xc4\x80\x30\x30
======

I have tested some codes in the other ranges defined in [http://en.wikipedia.org/wiki/UTF-8] and everything seems fine as long as we don't put any extra zeroes at the beginning:

 toutf8 \u20ac

Correctly produces

 \xe2\x82\xac
----
[Lars H], 2010-01-12, [PYK] 2020-07-22: No, Tcl can (currently) only represent characters within the '''basic multilingual plane''' of unicode, so there's no way that you can even feed a U+10000 character ''into'' `[encoding convertto]` :-(.  Fixing that is non-trivial since some parts of the Tcl [C] library require a representation of strings where all characters take up the same number of bytes.  It is possible to compile Tcl with `TCL_UTF_MAX` set to 4, meaning 32 bits per character, but that is rather wasteful, and has been reported to be not entirely compatible with [Tk].

What one can often make do with is using surrogate pairs for characters beyond the BMP, thus treating Tcl's strings as being the UTF-16 representations of the strings proper. This doesn't play well with `encoding convertto utf-8` though, as that will reencode each surrogate in the pair as a separate character. Perhaps I should get around to doing something about that…
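
For reference, the arithmetic for splitting a code point beyond the BMP into a surrogate pair is simple (a hypothetical helper, independent of any of the machinery discussed here):

======
# Hypothetical helper: split a code point above U+FFFF into its UTF-16
# high and low surrogate values.
proc surrogatePair {codepoint} {
    set u [expr {$codepoint - 0x10000}]
    list [expr {0xD800 + ($u >> 10)}] [expr {0xDC00 + ($u & 0x3FF)}]
}
# surrogatePair 0x10000  ->  55296 56320  (i.e. 0xD800 0xDC00)
======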

`\u` by design grabs no more than four hexadecimal digits (thus leaving extra zeroes alone), and would continue to do so even after Tcl is extended to support the full unicode range.  This is so that you can put a hex digit immediately after a four-digit `\u` substitution, which is not possible with `\x`, which consumes as many hexadecimal digits as it finds.  Possibly there would be an `\U` substitution for the full range; `[regexp]` already implements that, at least syntactically.
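
A small demonstration of that four-digit limit (the trailing `56` is left alone and becomes two ordinary characters):

===
'''%''' string length \u123456
''3''
===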

[jima], 2010-01-13: Thanks Lars for the explanation.

Perhaps another side of this problem is that even if one is able to generate the right utf-8 (by coding the unicode-to-utf-8 conversion algorithm oneself), one won't get the proper graphical output unless the instructions to produce it are somewhere in the belly of Tcl.

So (as I understand it), to display an image of the Unicode character U+10000, properly encoded internally as utf-8 as `\xF0\x90\x80\x80`, we would need extra information besides the algorithm depicted in the wikipedia page mentioned earlier.

[Lars H], 2010-01-19, 2010-01-26: The graphical output always depends on what is supposed to be supplying it.  [Tk] is probably difficult, but if you're instead generating text that some other program (e.g. a web browser) is supposed to render, then 4-byte UTF-8 sequences may be fine.

For what it's worth, I went ahead with the "surrogate pairs inside Tcl — 4-byte sequences outside" idea; the result so far can be found in the [Half Bakery] at http://wiki.tcl.tk/_repo/UTF/.  This is a C-coded [extension] (well, the files needed for one that are not the same as in the [sampleextension]), whose [package] name is UTF, and which defines a new encoding, "UTF-8" (upper case, whereas the built-in one is utf-8).  Upon `[encoding convertfrom]`, this converts 4-byte sequences (code points U+10000 through U+10FFFF) to surrogate pairs, and upon `[encoding convertto]` it converts surrogate pairs to 4-byte sequences.  There are even tests, which it passes!
A previous version (from 2010-01-19) had a bug that caused it to get stuck in an infinite loop when used as a channel encoding, but the current version seems to work fine.
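
A hedged usage sketch, assuming the package loads and registers the upper-case `UTF-8` encoding as described (names taken from the description above, not verified against the actual code):

======
package require UTF
# Decode the 4-byte sequence for U+10000 into a surrogate pair ...
set chars [encoding convertfrom UTF-8 \xf0\x90\x80\x80]
# ... and re-encode the pair back to the 4-byte form.
set bytes [encoding convertto UTF-8 $chars]
======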

A logical next step would be to also implement UTF-16BE and UTF-16LE as encodings (the `Unicode` built-in encoding is almost one of these, but it depends on the platform which one it is).

[AMG]: See [http://sourceforge.net/tracker/?func=detail&aid=1165752&group_id=10894&atid=360894] for the UTF-16 encoding feature request.

<<categories>>Glossary | Characters