A method of [encoding] [UNICODE] characters. It takes a variable number of bytes per character (1 to 4 in standard UTF-8; 1 to 3 for characters in the Basic Multilingual Plane), and has the useful property that characters from the [ASCII] subset (the majority of those found in most [Tcl] programs and much other text) are encoded as single bytes.

Internally, Tcl uses a pseudo-UTF-8 encoding for most of its strings. This differs from the standard encoding in exactly one way: the NUL character (\u0000) is encoded using two bytes (0xC0 0x80, an overlong form that standard UTF-8 does not permit). This means that strings can serve as binary-safe containers while still keeping the C-string property that a zero byte terminates the string.

See also [Unicode and UTF-8].

----
[DKF]: Here's a little utility procedure I wrote today when I needed to convert a UNICODE character into a set of UTF-8 encoded hex digits (for a C string literal):

======
proc toutf8 c {
    # Encode the character as standard UTF-8 bytes
    set s [encoding convertto utf-8 $c]
    # Read the bytes as unsigned 8-bit integers
    binary scan $s cu* x
    # Emit one \xNN escape per byte
    format [string repeat \\x%02x [string length $s]] {*}$x
}
======

Demonstrating:

===
'''%''' toutf8 \u1234
''\xe1\x88\xb4''
'''%''' toutf8 \u0000
''\x00''
===

----
[ferrieux]: May I suggest a slight readability enhancement (and possibly a performance one, though I have not measured it and do not expect much):

======
proc toutf8 c {
    set s [encoding convertto utf-8 $c]
    # Read the bytes as one string of hex digits
    binary scan $s H* x
    # Prefix every pair of hex digits with \x
    regsub -all -expanded {..} $x {\x&}
}
======

The demonstrated output continues to be the same as shown above, as expected.

----
[jima] (2010-01-09) Does this work for Unicode code points in the range U+010000 to U+10FFFF?

U+010000 is xF0 x90 x80 x80 according to [http://www.fileformat.info/info/unicode/char/10000/index.htm].

On my machine

 toutf8 \u10000

produces

 \xe1\x80\x80\x30

and (notice the extra 0 introduced here)

 toutf8 \u010000

produces

 \xc4\x80\x30\x30

I have tested some code points in the other ranges defined in [http://en.wikipedia.org/wiki/UTF-8] and everything seems fine as long as we don't put any extra zeroes at the beginning:

 toutf8 \u20ac

correctly produces

 \xe2\x82\xac

----
!!!!!!
%|[Category Glossary]|%
!!!!!!
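----
A possible explanation for the results [jima] reports, offered as a sketch rather than a definitive answer: in Tcl 8.x the \u escape consumes at most four hexadecimal digits, so \u10000 is read as U+1000 followed by a literal "0" (hence the trailing \x30), and \u010000 as U+0100 followed by "00". The procedure below sidesteps both the escape and the [encoding] subsystem by computing the UTF-8 bytes arithmetically from an integer code point. The name cp2utf8 is hypothetical and not part of Tcl; the procedure assumes its argument is an integer and does not reject surrogate code points (U+D800 to U+DFFF).

======
# Hypothetical helper: convert an integer code point to \xNN escapes
# of its standard UTF-8 encoding (1 to 4 bytes).
proc cp2utf8 cp {
    # Normalize the argument (accepts 0x... notation); errors if not numeric
    set cp [expr {$cp + 0}]
    if {$cp < 0 || $cp > 0x10FFFF} {
        error "code point out of range: $cp"
    }
    if {$cp < 0x80} {
        # 1 byte: 0xxxxxxx
        set bytes [list $cp]
    } elseif {$cp < 0x800} {
        # 2 bytes: 110xxxxx 10xxxxxx
        set bytes [list \
            [expr {0xC0 | ($cp >> 6)}] \
            [expr {0x80 | ($cp & 0x3F)}]]
    } elseif {$cp < 0x10000} {
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        set bytes [list \
            [expr {0xE0 | ($cp >> 12)}] \
            [expr {0x80 | (($cp >> 6) & 0x3F)}] \
            [expr {0x80 | ($cp & 0x3F)}]]
    } else {
        # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        set bytes [list \
            [expr {0xF0 | ($cp >> 18)}] \
            [expr {0x80 | (($cp >> 12) & 0x3F)}] \
            [expr {0x80 | (($cp >> 6) & 0x3F)}] \
            [expr {0x80 | ($cp & 0x3F)}]]
    }
    set out ""
    foreach b $bytes {
        # format leaves the backslash alone; only %02x is substituted
        append out [format {\x%02x} $b]
    }
    return $out
}
======

Demonstrating with the code points discussed above:

===
'''%''' cp2utf8 0x10000
''\xf0\x90\x80\x80''
'''%''' cp2utf8 0x20ac
''\xe2\x82\xac''
===

Because the input is a number rather than a Tcl string, the result does not depend on how a particular Tcl build represents characters beyond U+FFFF.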