[Richard Suchenwirth] - The little tool below lets you select an encoding in the listbox and displays the characters between \x00 and \xFF of that encoding, hence it is best suited to single-byte encodings. It is especially useful for checking the various cp... code pages.

[http://mini.net/files/codepages.jpg]

----

 package require Tk
 listbox .lb -yscrollcommand ".y set" -width 16
 # The event pattern is missing from the source; <Double-1> is assumed here
 bind .lb <Double-1> {showCodepage .t [selection get]}
 scrollbar .y -command ".lb yview"
 text .t -bg white -height 32 -wrap word
 pack .lb .y .t -side left -fill y
 pack .t -fill both -expand 1
 # Fill the listbox with all known encoding names, sorted
 foreach encoding [lsort [encoding names]] {
    .lb insert end $encoding
 }
 # Show the 256 byte values 00..FF as decoded by the given encoding
 proc showCodepage {w encoding} {
    $w delete 1.0 end
    wm title . $encoding
    set hexdigits [list 0 1 2 3 4 5 6 7 8 9 A B C D E F]
    foreach high $hexdigits {
        foreach low $hexdigits {
            set c [encoding convertfrom $encoding [subst \\x$high$low]]
            $w insert end "$high$low:$c "
        }
        $w insert end \n\n
    }
 } ;# RS

----

How does one do the reverse of this? Given an encoding name and a character, or its Unicode decimal value (possibly > 255) as obtained with

 scan $c %c decVal

how can one get from this to the code to use for that character in the particular encoding given? - [RS]: Elementary:

 encoding convertto $target_encoding $utf8string
 encoding convertto $target_encoding [format %c $decimal_unicode]

But this gives one a character, not a decimal value, doesn't it (but I am confused, so please bear with me)? For example, I know that a bullet \u2022 is actually decimal-165 in the macRoman encoding, decimal-8226 in utf-8, and something else in iso8859-1, but given a utf8string "\u2022", how do I generate those numbers (which I understand are the code-page indices or whatever of that glyph in that encoding)? - [RS]: Characters ''are'' decimal values. After encoding convertto, you can just scan it out:

 % scan [encoding convertto macRoman \u2022] %c
 165

But:

 % scan [encoding convertto utf-8 \u2022] %c
 226
 % scan [encoding convertto iso8859-1 \u2022] %c
 63

aren't correct....
- [RS]: They are; it just takes a bit of explanation. Tcl holds strings internally in utf-8 encoding. When you convert a character to '''explicit''' utf-8, you produce a series of characters (3 in this case) that correspond to the bytes of the original character's utf-8 representation (see [Unicode and UTF-8]). Decimal 226 is the first byte of this three-byte sequence (u2x is on the [encoding] page):

 % u2x [encoding convertto utf-8 \u2022]
 \u00E2\u0080\u00A2

The second is more easily explained:

 % format %c 63
 ?

As there is no mapping from \u2022 to iso8859-1, the conversion falls back to the default character, which is the question mark. Unicode has room for more than a million code points (65,536 in its Basic Multilingual Plane alone), iso8859-1 for only 256, so such an encoding conversion can lose information.

----

[Arts and crafts of Tcl-Tk programming]
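----

A minimal sketch of the reverse lookup discussed above, for readers who want all the byte values rather than only the first one that scan ... %c reports. The proc name charCodes is made up for illustration and is not part of Tcl; the sketch assumes Tcl 8.5 or later, where binary scan accepts the unsigned cu format.

```tcl
# Hypothetical helper (name invented here): return the decimal values of
# all bytes that represent $char in the encoding $enc.
proc charCodes {enc char} {
    # encoding convertto yields a byte string; "cu*" reads every byte
    # as an unsigned 8-bit integer into the list variable codes.
    binary scan [encoding convertto $enc $char] cu* codes
    return $codes
}

puts [charCodes macRoman \u2022]   ;# prints 165
puts [charCodes utf-8 \u2022]      ;# prints 226 128 162
```

For a single-byte encoding the result is a one-element list, i.e. the code-page position of the character. With the Tcl behavior described above, an unmappable character such as \u2022 in iso8859-1 comes back as 63, the fallback question mark; newer Tcl releases may raise an error instead, depending on the encoding profile in effect.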