`[https://www.unicode.org/versions/latest/ch03.pdf#G7404%|%utf-8]`, where `utf`
is short for '''unicode transformation format''', is a method of [encoding]
[unicode] characters using one to four bytes per character. It is a superset
of [ascii], uses easily-distinguishable context-free prefixes to mark the
beginning of each character, and can be probabilistically differentiated from
legacy [ascii%|%extended ascii] encodings.
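To illustrate the prefix property, here is a minimal sketch that classifies a single byte by its leading bits. The name `utf8ByteKind` and its return values are made up for this example, and it looks only at the prefix, so it does not reject bytes such as `0xC0` or `0xF5` that can never occur in valid utf-8.

======
# Hypothetical helper: classify one byte value (0..255) by its utf-8 prefix bits.
proc utf8ByteKind {b} {
    if {($b & 0x80) == 0x00} { return ascii }         ;# 0xxxxxxx
    if {($b & 0xC0) == 0x80} { return continuation }  ;# 10xxxxxx
    if {($b & 0xE0) == 0xC0} { return lead2 }         ;# 110xxxxx
    if {($b & 0xF0) == 0xE0} { return lead3 }         ;# 1110xxxx
    if {($b & 0xF8) == 0xF0} { return lead4 }         ;# 11110xxx
    return invalid
}

# The euro sign U+20AC is encoded as the three bytes e2 82 ac.
binary scan [encoding convertto utf-8 \u20ac] cu* bytes
set kinds [lmap b $bytes {utf8ByteKind $b}]
# kinds is now: lead3 continuation continuation
======

Because no lead byte can be mistaken for a continuation byte, a reader that starts in the middle of a stream can skip forward to the next character boundary without any other context.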
** See Also **
[Unicode and UTF-8]:
** Tcl Internals **
Internally, Tcl uses
[https://en.wikipedia.org/wiki/UTF-8#Modified_UTF-8%|%modified utf-8] encoding,
which is the same as utf-8 except that the NUL character (`\u0000`) is encoded
as the bytes `0xC0` `0x80`, an overlong form that is not a valid utf-8
sequence. Since no null bytes occur inside such a string, the [C]-string
property that a null byte terminates the string is preserved.
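The byte-level difference can be mimicked at script level. This is only a sketch: `toModifiedUtf8` is a made-up name, and it is not how Tcl's C code produces its internal representation.

======
# Hypothetical helper: turn standard utf-8 bytes into the modified form by
# replacing each NUL byte with the two-byte sequence C0 80.
proc toModifiedUtf8 {bytes} {
    string map [list \x00 \xC0\x80] $bytes
}

# "a", NUL, "b" as standard utf-8 is 61 00 62.
binary scan [toModifiedUtf8 "a\x00b"] H* hex
# hex is now: 61c08062
======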
----
[DKF]: Here's a little utility procedure I wrote today when I needed to convert
the utf-8 encoding of a unicode character into a set of hexadecimal escapes to
use as a literal value in C:
======
proc toutf8 c {
    set s [encoding convertto utf-8 $c]
    binary scan $s cu* x
    format [string repeat \\x%02x [string length $s]] {*}$x
}
======
Demonstrating:
===
'''%''' toutf8 \u1234
''\xe1\x88\xb4''
'''%''' toutf8 \u0000
''\x00''
===
----
[ferrieux]: May I suggest a slight enhancement of readability (and possibly of
performance, though I have not measured it and don't expect much):
======
proc toutf8 c {
    set s [encoding convertto utf-8 $c]
    binary scan $s H* x
    regsub -all -expanded {..} $x {\x&}
}
======
The demonstrated output continues to be the same as shown above, as expected.
----
[jima] 2010-01-09: Does this work for unicode characters in the range U+010000 to U+10FFFF ?
: U+010000 is xF0 x90 x80 x80
According to [http://www.fileformat.info/info/unicode/char/10000/index.htm]
On my box
======
toutf8 \u10000
======
Produces
======
\xe1\x80\x80\x30
======
And (notice the extra `0` introduced here)
======
toutf8 \u010000
======
Produces
======
\xc4\x80\x30\x30
======
I have tested some codes in the other ranges defined in [http://en.wikipedia.org/wiki/UTF-8] and everything seems fine as long as we don't put any extra zeroes at the beginning:
 toutf8 \u20ac
Correctly produces
 \xe2\x82\xac
[Lars H], 2010-01-12 [PYK] 2020-07-22: No, Tcl can (currently) only represent characters within the '''basic multilingual plane''' of unicode, so there's no way that you can even feed a U+10000 character ''into'' `[encoding convertto]` :-(. Fixing that is non-trivial since some parts of the Tcl [C] library require a representation of strings where all characters take up the same number of bytes. It is possible to compile Tcl with `TCL_UTF_MAX` set to 4, meaning 32 bits per character, but it's rather wasteful, and has been reported not entirely compatible with [Tk].
What one can often make do with is using surrogate pairs for characters beyond the BMP, thus treating Tcl's strings as being the UTF-16 representations of the strings proper. This doesn't play well with `encoding convertto utf-8` though, as that will reencode each surrogate in the pair as a separate character. Perhaps I should get around to doing something about that…
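As a sketch of the arithmetic behind surrogate pairs (the standard UTF-16 rule, not code taken from Tcl): subtract `0x10000` from the code point, then put the upper ten bits of the remainder on top of `0xD800` and the lower ten bits on top of `0xDC00`. The name `surrogatePair` is invented for this example.

======
# Hypothetical helper: split a code point above U+FFFF into a UTF-16
# surrogate pair, returned as text in \uXXXX\uXXXX notation.
proc surrogatePair {cp} {
    if {$cp < 0x10000 || $cp > 0x10FFFF} {
        error "code point outside the range U+10000..U+10FFFF"
    }
    set v  [expr {$cp - 0x10000}]           ;# 20-bit offset
    set hi [expr {0xD800 + ($v >> 10)}]     ;# upper 10 bits
    set lo [expr {0xDC00 + ($v & 0x3FF)}]   ;# lower 10 bits
    format {\u%04X\u%04X} $hi $lo
}

set first [surrogatePair 0x10000]   ;# \uD800\uDC00
set last  [surrogatePair 0x10FFFF]  ;# \uDBFF\uDFFF
======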
`\u` by design grabs no more than four hexadecimal digits (thus leaving extra zeroes alone), and would continue to do so even after Tcl is extended to support full unicode. This is so that you can put a hex digit immediately after a four-digit `\u` substitution, which is not possible with `\x`, which consumes as many hexadecimal digits as it finds. Possibly there would be a `\U` substitution for the full range. `[regexp]` already implements that, at least syntactically.
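The four-digit limit of `\u` is easy to check at script level. A small sketch (the variable names are arbitrary):

======
# \u stops after four hex digits, so the trailing 0 is a separate character.
set s "\u20ac0"
set len    [string length $s]                            ;# 2
set first  [format %04x [scan [string index $s 0] %c]]   ;# 20ac
set second [string index $s 1]                           ;# 0
======

This is the same effect that produced the stray `\x30` bytes in the `toutf8 \u10000` output above.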
[jima], 2010-01-13: Thanks Lars for the explanation.
Perhaps another side of this problem is that even if one is able to generate the right utf-8 (by coding the unicode to utf-8 conversion algorithm oneself), one won't get the proper graphical output unless the instructions to produce it are somewhere in the bellies of Tcl.
So (as I understand it), to display an image of the Unicode character U+10000, properly encoded internally as utf-8 as `\xF0\x90\x80\x80`, we would need extra information besides the algorithm depicted in the wikipedia page mentioned earlier.
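For reference, the conversion algorithm itself can be coded by hand on integers, which sidesteps the limits of `\u`. This is a minimal sketch rather than a vetted implementation: `cp2utf8` is a made-up name, it takes the code point as a number, and it does not reject the surrogate range U+D800 through U+DFFF.

======
# Hypothetical helper: encode an integer code point as utf-8 and return the
# bytes as C-style hex escapes, in the same format as toutf8 above.
proc cp2utf8 {cp} {
    if {$cp < 0 || $cp > 0x10FFFF} {
        error "code point out of range"
    }
    if {$cp < 0x80} {
        set bytes [list $cp]
    } elseif {$cp < 0x800} {
        set bytes [list [expr {0xC0 | ($cp >> 6)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    } elseif {$cp < 0x10000} {
        set bytes [list [expr {0xE0 | ($cp >> 12)}] \
                        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    } else {
        set bytes [list [expr {0xF0 | ($cp >> 18)}] \
                        [expr {0x80 | (($cp >> 12) & 0x3F)}] \
                        [expr {0x80 | (($cp >> 6) & 0x3F)}] \
                        [expr {0x80 | ($cp & 0x3F)}]]
    }
    format [string repeat {\x%02x} [llength $bytes]] {*}$bytes
}

set euro  [cp2utf8 0x20AC]   ;# \xe2\x82\xac
set linb  [cp2utf8 0x10000]  ;# \xf0\x90\x80\x80
======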
[Lars H], 2010-01-19, 2010-01-26: The graphical output always depends on what is supposed to be supplying it. [Tk] is probably difficult, but if you're rather generating text that some other program (e.g. a web browser) is supposed to render, then 4-byte UTF-8 sequences may be fine.
For what it's worth, I went ahead with the "surrogate pairs inside Tcl, 4-byte sequences outside" idea; the result so far can be found in the [Half Bakery] at http://wiki.tcl.tk/_repo/UTF/. This is a C-coded [extension] (well, the files needed for one that are not the same as in the [sampleextension]), whose [package] name is UTF, and which defines a new encoding, "UTF-8" (upper case, whereas the built-in one is utf-8). Upon `[encoding convertfrom]`, this converts 4-byte sequences (code points U+10000 through U+10FFFF) to surrogate pairs, and upon `[encoding convertto]` it converts surrogate pairs to 4-byte sequences. There are even tests, which it passes!
A previous version (from 2010-01-19) had a bug that caused it to get stuck in an infinite loop when used as a channel encoding, but the current one seems to work fine.
A logical next step would be to also implement UTF-16BE and UTF-16LE as encodings (the `Unicode` built-in encoding is almost one of these, but it depends on the platform which one it is).
[AMG]: See [http://sourceforge.net/tracker/?func=detail&aid=1165752&group_id=10894&atid=360894] for the UTF-16 encoding feature request.
<<categories>>Glossary | Characters