utf-8

`[https://www.unicode.org/versions/latest/ch03.pdf#G7404%|%UTF-8]`, where `UTF`
is short for '''Unicode Transformation Format''', is a method of [encoding] in which
[Unicode] characters are represented by one to four bytes per character.  It
is an extension of [ascii], uses easily-distinguishable context-free prefixes
to distinguish the beginning of each character, and can be probabilistically
differentiated from legacy [ascii%|%extended ascii] encodings.



** See Also **

   [Unicode and UTF-8]:   

** Description **
The interesting characteristics of UTF-8 include:

   '''Valid [ASCII] is valid [UTF-8]''':   Just as Unicode extends the ASCII character map, UTF-8 extends the ASCII character format.  In ASCII the top bit of each byte is 0 and any byte in UTF-8 whose top bit is 0 represents that same ASCII character.

   '''Multibyte''':   In UTF-8 some characters are represented by a single byte, while other characters may require up to 4 bytes.

   '''[https://en.wikipedia.org/wiki/Self-synchronizing_code%|%self-synchronizing]''':   Any byte whose top two bits are not `10` is the first (and maybe only) byte of the next character.

   '''self-terminating''':   The first byte of each character determines its length, so bytes appended after a complete UTF-8 sequence cannot change how the characters already in it decode.
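
The prefix rules above can be sketched in a few lines of Tcl. This is only an illustration, assuming Tcl 8.5 or later; `utf8chunks` is a hypothetical helper name, not part of Tcl. It groups a string of UTF-8 bytes into characters by looking only at the top two bits of each byte, and it does not validate the input.

======
# Hypothetical helper: split UTF-8 bytes into one hex string per character,
# using only the top two bits of each byte to find character boundaries.
proc utf8chunks bytes {
    set codes {}
    binary scan $bytes cu* codes
    set chunks {}
    foreach b $codes {
        if {($b & 0xC0) == 0x80 && [llength $chunks]} {
            # Continuation byte (10xxxxxx): belongs to the current character.
            lset chunks end [lindex $chunks end][format %02x $b]
        } else {
            # Lead byte (0xxxxxxx or 11xxxxxx) starts a new character.
            # A stray continuation byte at the very start also lands here.
            lappend chunks [format %02x $b]
        }
    }
    return $chunks
}
======

===
'''%''' utf8chunks [encoding convertto utf-8 a\u20ac]
''61 e282ac''
===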



** Tcl Internals **

Internally, Tcl uses
[https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8%|%modified utf-8] encoding, which is the same as UTF-8 except that the NUL character (`\u0000`) is encoded
as the bytes `0xC0` `0x80`, which is not a valid UTF-8 sequence.  Since there
are no nulls in such a string, the [C]-string property that a null byte terminates the string is preserved.
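
The byte-level difference can be sketched at the script level. This is only an illustration, assuming Tcl 8.5 or later; `tomodifiedutf8` is a hypothetical helper, not part of Tcl, and it merely performs the substitution described above rather than exposing what the Tcl core does.

======
# Hypothetical helper: produce standard UTF-8 bytes for a string, then
# rewrite every NUL byte as the two-byte sequence C0 80.
proc tomodifiedutf8 str {
    set bytes [encoding convertto utf-8 $str]
    string map [list \x00 \xc0\x80] $bytes
}
======

===
'''%''' binary scan [tomodifiedutf8 A\u0000Z] H* hex; set hex
''41c0805a''
===

The result contains no zero byte, so it could be handed to C code as an ordinary NUL-terminated string.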

----

[DKF]: Here's a little utility procedure I wrote today when I needed to convert the UTF-8 bytes encoding a Unicode character into a sequence of hexadecimal digits to
use as a literal value in C:

======
proc toutf8 c {
    set s [encoding convertto utf-8 $c]
    binary scan $s cu* x
    format [string repeat \\x%02x [string length $s]] {*}$x
}
======

Demonstrating:

===
'''%''' toutf8 \u1234
''\xe1\x88\xb4''
'''%''' toutf8 \u0000
''\x00''
===

----

[ferrieux]: May I suggest a slight enhancement of readability, and possibly
performance, though not measured nor expecting much:

======
proc toutf8 c {
    set s [encoding convertto utf-8 $c]
    binary scan $s H* x
    regsub -all -expanded {..} $x {\x&}
}
======

The demonstrated output continues to be the same as shown above, as expected.

----
[jima] 2010-01-09: Does this work for Unicode characters in the range U+010000 to U+10FFFF?

    :   U+010000 is xF0 x90 x80 x80

According to [http://www.fileformat.info/info/unicode/char/10000/index.htm]

In my box

======
toutf8 \u10000
======

Produces

======
\xe1\x80\x80\x30
======

And (notice the extra `0` introduced here)

======
toutf8 \u010000
======

Produces

======
\xc4\x80\x30\x30
======

I have tested some codes in the other ranges defined in [http://en.wikipedia.org/wiki/UTF-8] and everything seems fine whilst we don't put any extra zeroes at the beginning:

 toutf8 \u20ac

Correctly produces

 \xe2\x82\xac
[Lars H], 2010-01-12 [PYK] 2020-07-22, 2023-01-12: No, Tcl < 9 can only represent characters within the '''basic multilingual plane''' of Unicode, so there's no way that you can even feed a U+10000 character ''into'' `[encoding convertto]` :-(. Fixing that is non-trivial since some parts of the Tcl [C] library require a representation of strings where all characters take up the same number of bytes. It is possible to compile Tcl with `TCL_UTF_MAX` set to 4, meaning 32 bits per character, but it's rather wasteful, and has been reported not entirely compatible with [Tk].
In Tcl < 9, surrogate pairs can be used for characters beyond the BMP, thus treating Tcl's strings as being the UTF-16 representations of the strings proper. This doesn't play well with `encoding convertto utf-8` though, as that encodes each surrogate in the pair as a separate character. Perhaps I should get around to doing something about that…
`\u` by design grabs no more than four hexadecimal digits, thus leaving extra zeroes alone, and will continue to do so even after Tcl is extended to support full Unicode.  This is so that you can put a hex digit immediately after a four-digit `\u` substitution. This is not possible with `\x`, which consumes as many hexadecimal digits as it finds. With `\U` it is possible to represent any Unicode character. `[regexp]` already implements that, at least syntactically.

[jima], 2010-01-13: Thanks Lars for the explanation.
Perhaps another side of this problem is that even if one is able to generate the right UTF-8 (by coding the Unicode to UTF-8 conversion algorithm oneself), one won't get the proper graphical output unless the instructions to produce it are somewhere in the bellies of Tcl.
So (as I understand it), to display an image of the Unicode character U+10000, properly encoded internally as UTF-8 as `\xF0\x90\x80\x80`, we would need extra information besides the algorithm depicted in the Wikipedia page mentioned earlier.
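
For reference, coding the conversion by hand might look like the sketch below. It assumes Tcl 8.5 or later, and `cp2utf8` is a hypothetical name, not an existing command. Because it takes the code point as an integer rather than as a character, it does not depend on what a Tcl string can hold, and it produces the same `\x` notation as `toutf8` above.

======
# Hypothetical helper: encode an integer code point as UTF-8 and return
# the bytes as \x escapes, without going through a Tcl string.
proc cp2utf8 cp {
    if {$cp < 0 || $cp > 0x10FFFF || ($cp >= 0xD800 && $cp <= 0xDFFF)} {
        error "not a Unicode scalar value: $cp"
    }
    if {$cp < 0x80} {
        # One byte: 0xxxxxxx
        set bytes [list $cp]
    } elseif {$cp < 0x800} {
        # Two bytes: 110xxxxx 10xxxxxx
        set bytes [list [expr {0xC0 | ($cp >> 6)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    } elseif {$cp < 0x10000} {
        # Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        set bytes [list [expr {0xE0 | ($cp >> 12)}] \
                        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    } else {
        # Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        set bytes [list [expr {0xF0 | ($cp >> 18)}] \
                        [expr {0x80 | (($cp >> 12) & 0x3F)}] \
                        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    }
    set out {}
    foreach b $bytes {
        append out [format {\x%02x} $b]
    }
    return $out
}
======

===
'''%''' cp2utf8 0x10000
''\xf0\x90\x80\x80''
'''%''' cp2utf8 0x20ac
''\xe2\x82\xac''
===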

[Lars H], 2010-01-19, 2010-01-26: The graphical output always depends on what is supposed to be supplying it. [Tk] is probably difficult, but if you're rather generating text that some other program (e.g. a web browser) is supposed to render, then 4-byte UTF-8 sequences may be fine.

For what it's worth, I went ahead with the "surrogate pairs inside Tcl, 4-byte sequences outside" idea; the result so far can be found in the [Half Bakery] at http://wiki.tcl.tk/_repo/UTF/. This is a C-coded [extension] (well, the files needed for one which are not the same as in the [sampleextension]), whose [package] name is UTF, and which defines a new encoding, "UTF-8" (upper case, whereas the built-in one is utf-8). Upon `[encoding convertfrom]`, this converts 4-byte sequences (code points U+10000 through U+10FFFF) to surrogate pairs, and upon `[encoding convertto]` it converts surrogate pairs to 4-byte sequences. There are even tests, which it passes!
A previous version (from 2010-01-19) had a bug that caused it to get stuck in an infinite loop when used as a channel encoding, but the current one seems to work fine.
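
The arithmetic behind the surrogate-pair mapping can be sketched in plain Tcl. This is only an illustration of the standard UTF-16 calculation, assuming Tcl 8.5 or later; `tosurrogates` and `fromsurrogates` are hypothetical names and are not taken from the UTF extension described above.

======
# Hypothetical helpers working on integer code points, not on strings.

# Split a code point in U+10000..U+10FFFF into a high and a low surrogate.
proc tosurrogates cp {
    if {$cp < 0x10000 || $cp > 0x10FFFF} {
        error "code point outside U+10000..U+10FFFF: $cp"
    }
    set v [expr {$cp - 0x10000}]
    list [expr {0xD800 | ($v >> 10)}] [expr {0xDC00 | ($v & 0x3FF)}]
}

# Combine a high and a low surrogate back into a single code point.
proc fromsurrogates {hi lo} {
    expr {0x10000 + (($hi & 0x3FF) << 10) + ($lo & 0x3FF)}
}
======

===
'''%''' format {%04X %04X} {*}[tosurrogates 0x1F600]
''D83D DE00''
'''%''' format %X [fromsurrogates 0xD83D 0xDE00]
''1F600''
===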

A logical next step would be to also implement UTF-16BE and UTF-16LE as encodings (the `Unicode` built-in encoding is almost one of these, but it depends on the platform which one it is).
[AMG]: See [http://sourceforge.net/tracker/?func=detail&aid=1165752&group_id=10894&atid=360894%|%#Feature Request #392, New encodings: utf-16, utf-16be, utf-16le] for the UTF-16 encoding feature request.

<<categories>>Glossary | Characters