** Summary **

[Richard Suchenwirth]: The little tool below lets you select an encoding in the listbox and displays the characters between \x00 and \xFF of that encoding, so it is best suited to single-byte encodings. It is especially useful for checking the various cp... code pages.

** Description **

[WikiDbImage codepages.jpg]

======
package require Tk

listbox .lb -yscrollcommand ".y set" -width 16
# The event name is missing from this copy of the page; <Double-1> is an assumption.
bind .lb <Double-1> {showCodepage .t [selection get]}
scrollbar .y -command ".lb yview"
text .t -bg white -height 32 -wrap word
pack .lb .y .t -side left -fill y
pack .t -fill both -expand 1
foreach encoding [lsort [encoding names]] {
    .lb insert end $encoding
}

proc showCodepage {w encoding} {
    $w delete 1.0 end
    wm title . $encoding
    set hexdigits [list 0 1 2 3 4 5 6 7 8 9 A B C D E F]
    foreach high $hexdigits {
        foreach low $hexdigits {
            # Build the one-byte string \xHL and decode it from the chosen encoding
            set c [encoding convertfrom $encoding [subst \\x$high$low]]
            $w insert end "$high$low:$c "
        }
        $w insert end \n\n
    }
} ;# RS
======

----

How does one do the reverse of this? Given an encoding name and a character, or its utf-8 decimal value (possibly > 255):

======
scan $c %c decVal
======

how can one get from this to the code to use for that character in the particular encoding given?

[RS]: Elementary:

======
encoding convertto $target_encoding $utf8string
encoding convertto $target_encoding [format %c $decimal_unicode]
======

But this gives one a character, not a decimal value, doesn't it (but I am confused, so please bear with me)? For example, I know that a bullet \u2022 is actually decimal-165 in the macRoman encoding, decimal-8226 in utf-8, and something else in iso8859-1. But given a utf8string "\u2022", how do I generate those numbers (which I understand are the code-page indices or whatever of that glyph in that encoding)?

[RS]: Characters ''are'' decimal values.
After encoding convertto, you can just scan it out:

======none
% scan [encoding convertto macRoman \u2022] %c
165
======

But these don't look correct:

======none
% scan [encoding convertto utf-8 \u2022] %c
226
% scan [encoding convertto iso8859-1 \u2022] %c
63
======

[RS]: They are, it just takes a bit of explanation. Tcl holds strings internally in utf-8 encoding. When you convert a character again to '''explicit''' utf-8, you produce a series of characters (3 in this case), one for each byte of the original character's utf-8 representation (see [Unicode and UTF-8]). Just as sqrt($x) is not sqrt(sqrt($x)), for most values of x at least. Decimal 226 is the first of this sequence, which sits in three memory bytes as E2 80 A2 (u2x is on the [encoding] page):

======none
% u2x [encoding convertto utf-8 \u2022]
\u00E2\u0080\u00A2
======

The second is easier to explain:

======none
% format %c 63
?
======

As there is no mapping from \u2022 to iso8859-1, the conversion falls back to the default character, which is the question mark. Unicode can hold 65,000 characters or more, iso8859-1 only 256, so you lose information in such an encoding conversion.

----

[Stu] 2008-11-15: Where is the proc ''getFixedFonts''?

[BHE]: Slightly different version of the encoding viewer above, which puts the character array into a table with the hex digits as column and row headings.
I also added another listbox to select a font (fixed-width fonts are listed first):

======
package require Tk

listbox .fonts -listvariable ::fonts -yscrollcommand ".sb set" -width 16
scrollbar .sb -command ".fonts yview"
listbox .lb -listvariable ::encodings -yscrollcommand ".y set" -width 16
scrollbar .y -command ".lb yview"
text .t -bg white -height 32 -wrap word -tabs {2}
pack .fonts .sb .lb .y .t -side left -fill y
pack .t -fill both -expand 1

# The event names are missing from this copy of the page; <Double-1> is an assumption.
bind .fonts <Double-1> {.t config -font [list [.fonts get active]]}
# "selection get" is not a listbox subcommand, so fetch the active item instead
bind .lb <Double-1> {showCodepage .t [.lb get active]}

# getFixedFonts is not defined on this page; it is expected to fill ::fontDB(fixed)
getFixedFonts
set ::fonts $::fontDB(fixed)
set ::encodings [lsort [encoding names]]

proc showCodepage {w encoding} {
    $w delete 1.0 end
    wm title . $encoding
    set hexdigits [list 0 1 2 3 4 5 6 7 8 9 A B C D E F]
    # Column headings, indented to line up with the "H|" row prefix
    $w insert end "  [join $hexdigits ""]\n +[string repeat "-" 16]\n"
    foreach high $hexdigits {
        $w insert end "$high|"
        foreach low $hexdigits {
            set c [encoding convertfrom $encoding [subst \\x$high$low]]
            # Newline and tab would break the table layout
            if {$c eq "\n" || $c eq "\t"} {
                set c " "
            }
            $w insert end $c
        }
        $w insert end \n
    }
}
======

I noticed that even though fonts report themselves as fixed width, only a few of the encodings are 100% fixed width for each font. I've been looking for a way to print the 'box graphics' characters for use in a console text-adventure game, like NetHack or something. "Courier + cp437" seems to be a good combination for this. Just for completeness, I made a list of the font+encoding combinations that really are fixed width, using this routine:

======
proc checkAllEncodingsForFixedWidth {f} {
    set fixedwidth_encodings {}
    puts -nonewline [format "%-20s: " [string range $f 0 20]]
    foreach enc $::encodings {
        # Decode bytes 0x20-0xFF (224 characters) from this encoding
        set t ""
        for {set i 32} {$i < 256} {incr i} {
            append t [encoding convertfrom $enc [format %c $i]]
        }
        set size [font measure [list $f] $t]
        # Truly fixed width: the string is exactly 224 times as wide as "."
        if {[font measure [list $f] "."] * 224 == $size} {
            lappend fixedwidth_encodings $enc
            puts -nonewline !
        } else {
            puts -nonewline .
        }
        update
    }
    puts ""
    return $fixedwidth_encodings
}
======

Pass in the font name and it returns a list of the available encodings that are truly fixed width. It works by comparing the display width in pixels of the decoded string of characters 0x20-0xff with the display width of a single "." times 224. It is not a perfect search algorithm, but it only returned 1 or 2 mistakes. It takes forever to run for a single font (about 30 seconds?), so I left in the puts and update calls to show progress.

Here's an array of fonts I compiled, giving for each the list of encodings that are truly fixed width. This is on a Windows XP machine with only default fonts, btw.

======
array set fixedwidth_font_encodings {
    "Courier"        {cp437 cp737 cp775 cp850 cp852 cp855 cp857 cp860 cp861 cp862 cp863 cp865 cp866 iso2022-jp koi8-r koi8-u macCentEuro macCroatian macCyrillic macIceland macRoman macRomania macUkraine}
    "Courier New"    {cp437 cp737 cp775 cp850 cp852 cp855 cp857 cp860 cp861 cp862 cp863 cp865 cp866 iso2022-jp koi8-r koi8-u macCentEuro macCroatian macCyrillic macIceland macRoman macRomania macUkraine}
    "Fixedsys"       {ascii ebcdic identity iso2022 iso2022-jp iso2022-kr iso8859-1 iso8859-15 utf-8}
    "Lucida Console" {cp437 cp737 cp775 cp850 cp852 cp855 cp857 cp860 cp861 cp863 cp865 cp866 iso2022-jp koi8-r koi8-u macCentEuro macCroatian macCyrillic macIceland macRoman macRomania macUkraine}
    "Terminal"       {iso2022-jp}
    "WST_Czec"       {iso2022-jp}
    "WST_Engl"       {iso2022-jp}
    "WST_Fren"       {iso2022-jp}
    "WST_Germ"       {iso2022-jp}
    "WST_Ital"       {iso2022-jp}
    "WST_Span"       {iso2022-jp}
    "WST_Swed"       {iso2022-jp}
}
======

<<categories>> Human Language | RS
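
----

Regarding the reverse-lookup question discussed on this page (finding the number a character has in a given encoding): `scan ... %c` only reports the first byte of the converted string, which is why utf-8 appeared to give just 226. A minimal sketch that reports every byte might look like the following. The proc name `charCodeIn` is hypothetical and not part of the original page, and the `cu*` format of `binary scan` assumes Tcl 8.5 or later.

```tcl
# Hypothetical helper (not from the original page): return the decimal
# value of each byte that a character occupies in the given encoding.
proc charCodeIn {encoding char} {
    # encoding convertto yields the byte sequence in the target encoding
    set bytes [encoding convertto $encoding $char]
    # cu* reads every byte as an unsigned 8-bit integer
    binary scan $bytes cu* codes
    return $codes
}

puts [charCodeIn utf-8 \u2022]      ;# 226 128 162
puts [charCodeIn iso8859-1 \u00e9]  ;# 233
```

By the same reasoning, `charCodeIn macRoman \u2022` should give the 165 shown in the discussion, provided the macRoman encoding is installed. For a character that has no mapping in the target encoding, the result depends on the Tcl version: it may be the fallback character's code (63 for "?") or an error.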