The lower 8 bits of each character in string are taken as a single byte, and the resulting sequence of bytes is converted from encoding to a Unicode string. If encoding is not specified, the current system encoding is used.
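For example (a minimal sketch; the bytes 0xC3 0xA9 are the utf-8 encoding of é, U+00E9):

   # Decode two utf-8 bytes into the single character é (U+00E9)
   set bytes [binary format c2 {0xC3 0xA9}]
   set str [encoding convertfrom utf-8 $bytes]
   scan $str %c codepoint   ;# $codepoint == 233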
PYK 2017-08-19: If a string to be converted from utf-8 contains invalid utf-8 byte sequences, each invalid byte is interpreted as an 8-bit integer and converted to the Unicode character at that code point. In other words, encoding convertfrom utf-8 never fails, so it cannot be used to determine whether a string is valid utf-8.
   set value [binary format c 239]
   set value [encoding convertfrom utf-8 $value]
   scan $value %c codepoint   ;# $codepoint == 239
For comparison, here is the same operation on a valid utf-8 sequence:
   set value [binary format ccc 239 188 129]
   set value [encoding convertfrom utf-8 $value]
   scan $value %c codepoint   ;# $codepoint == 65281
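Given that, one possible workaround (a sketch, assuming Tcl 8.x semantics) is to round-trip the bytes and compare: an invalid byte such as a lone 0xEF decodes to U+00EF and re-encodes as 0xC3 0xAF, so it does not survive the round trip unchanged.

   # Round-trip check: invalid bytes do not survive decode/encode unchanged
   proc validUtf8 {bytes} {
       string equal $bytes \
           [encoding convertto utf-8 [encoding convertfrom utf-8 $bytes]]
   }
   validUtf8 [binary format c 239]           ;# 0: lone 0xEF is invalid
   validUtf8 [binary format ccc 239 188 129] ;# 1: valid sequence survives

Note this is only a sketch; corner cases such as overlong encodings may behave differently depending on the Tcl version.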
   text .t -font Term
   pack .t
   .t insert end [format %c 152]
   .t insert end [encoding convertfrom cp437 [format %c 152]]
The Term font being used is available at http://8bit.memoryleak.org/Flag/Term.ttf and is designed for displaying cp437-encoded text.
Character 152 in cp437 is a y-umlaut. However, the first insert displays a placeholder character (a solid down-arrow) instead. The second does display a y-umlaut, but it does so by mapping to character 255, which isn't available in the Term font (because it has no meaning in cp437), so Tcl uses a fallback font, and it looks totally wrong (Term is fixed-width and quite bold; the fallback font, Lucida Sans Unicode, doesn't match up at all).
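The mapping described above can be confirmed directly (a small sketch, independent of any font): cp437 byte 152 decodes to ÿ at U+00FF, not to code point 152.

   # cp437 byte 0x98 is y-umlaut, which Unicode places at U+00FF
   set ch [encoding convertfrom cp437 [binary format c 152]]
   scan $ch %c codepoint   ;# $codepoint == 255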
I can use the Term font in other (non-Tcl) applications, for instance MS Word, and insert char 152, which gives a y-umlaut without any problem. I honestly have no idea what's causing this issue; can anyone shed any light?