Version 8 of a little code page browser

Updated 2004-01-23 11:07:07

Richard Suchenwirth - The little tool below lets you select an encoding in the listbox (by double-clicking it) and displays the characters between \x00 and \xFF of that encoding, so it is best suited to single-byte encodings. It is especially useful for checking the various cp... code pages.

http://mini.net/files/codepages.jpg


 package require Tk
 # Listbox of encoding names with a scrollbar, and a text widget for the table
 listbox   .lb -yscrollcommand ".y set" -width 16
 # Double-click shows the code page of the selected entry
 bind .lb <Double-1> {showCodepage .t [.lb get [.lb curselection]]}
 scrollbar .y -command ".lb yview"
 text .t -bg white -height 32 -wrap word
 pack .lb .y .t -side left -fill y
 pack .t        -fill both -expand 1

 foreach encoding [lsort [encoding names]] {
    .lb insert end $encoding
 }
 # Fill text widget $w with the 256 characters of $encoding, 16 per row
 proc showCodepage {w encoding} {
    $w delete 1.0 end
    wm title . $encoding
    set hexdigits [list 0 1 2 3 4 5 6 7 8 9 A B C D E F]
    foreach high $hexdigits {
        foreach low $hexdigits {
            # Build the single byte 0x$high$low and decode it
            set c [encoding convertfrom $encoding [binary format H2 $high$low]]
            $w insert end "$high$low:$c "
        }
        $w insert end \n\n
    }
 } ;# RS
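
For use without Tk, the same table can be built as a plain string. The following is only a sketch: codepageTable is a hypothetical helper name that is not part of the tool above, and the catch is there on the assumption that some Tcl versions may raise an error for bytes that are invalid in the chosen encoding (in which case a question mark is shown instead).

```tcl
# Sketch: build the 16x16 code page table as a string, no Tk needed.
# codepageTable is a hypothetical name, not part of the original tool.
proc codepageTable {encoding} {
    set hexdigits {0 1 2 3 4 5 6 7 8 9 A B C D E F}
    set rows {}
    foreach high $hexdigits {
        set row {}
        foreach low $hexdigits {
            # One byte with the value 0x$high$low
            set byte [binary format H2 $high$low]
            # Fall back to "?" if the byte cannot be decoded
            if {[catch {encoding convertfrom $encoding $byte} c]} {
                set c ?
            }
            lappend row "$high$low:$c"
        }
        # join concatenates the cells as they are, separated by blanks
        lappend rows [join $row " "]
    }
    return [join $rows \n]
}
```

A call such as puts [codepageTable cp1252] would then print the table on stdout.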

How does one do the reverse of this? Suppose one is given an encoding name and a character, or its Unicode decimal value (possibly > 255), obtained for example with:

   scan $c %c decVal

How can one get from this to the code to use for that character in the particular encoding given? - RS: Elementary:

   encoding convertto $target_encoding $utf8string
   encoding convertto $target_encoding [format %c $decimal_unicode]

But this gives one a character, not a decimal value, doesn't it? (I am confused, so please bear with me.) For example, I know that a bullet \u2022 is decimal 165 in the macRoman encoding, decimal 8226 in utf-8, and something else in iso8859-1. Given a utf8string "\u2022", how do I generate those numbers (which I understand to be the code-page indices of that glyph in each encoding)? - RS: Characters are decimal values. After encoding convertto, you can just scan it out:

 %  scan [encoding convertto macRoman \u2022] %c
 165

But:

    % scan [encoding convertto utf-8 \u2022] %c
    226
    % scan [encoding convertto iso8859-1 \u2022] %c
    63

aren't correct... - RS: They are, it just takes a bit of explanation. Tcl holds strings internally in utf-8 encoding. When you explicitly convert a character to utf-8, you get a series of characters (three in this case), one for each byte of the original character's utf-8 form (see Unicode and UTF-8). Decimal 226 is the first of this three-byte sequence, and scan with %c reports only that first one. (u2x is on the encoding page.)

 % u2x [encoding convertto utf-8 \u2022]
 \u00E2\u0080\u00A2
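
To see every byte as a decimal value rather than only the first, binary scan can be used in place of scan. This is a minimal sketch: bytesOf is a hypothetical name, and the cu* format (unsigned 8-bit values) assumes Tcl 8.5 or later.

```tcl
# Sketch: list the decimal byte values a character occupies in an encoding.
# bytesOf is a hypothetical name; "cu*" needs Tcl 8.5 or later.
proc bytesOf {encoding char} {
    set bytes [encoding convertto $encoding $char]
    # cu* = all remaining bytes as unsigned 8-bit integers
    binary scan $bytes cu* values
    return $values
}
```

With this, bytesOf utf-8 \u2022 should return the three values 226 128 162 (hex E2 80 A2, matching the u2x output), while bytesOf macRoman \u2022 should return the single value 165.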

The second is easier explained:

 % format %c 63
 ?

As there is no mapping from \u2022 to iso8859-1, it falls back to the default character, which is the question mark. Unicode has room for over a million code points, iso8859-1 for only 256, so such an encoding conversion can lose information.
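
One way to tell a genuine question mark from a fallback is to convert back and compare with the original. The following is a sketch under stated assumptions: isRepresentable is a hypothetical name, and the catch is there because some newer Tcl releases may raise an error on an unmappable character instead of substituting the question mark.

```tcl
# Sketch: does $char survive a round trip through $encoding?
# isRepresentable is a hypothetical name, not an existing Tcl command.
proc isRepresentable {encoding char} {
    # An error during conversion is treated as "not representable"
    if {[catch {set bytes [encoding convertto $encoding $char]}]} {
        return 0
    }
    # A fallback character such as "?" will not compare equal to the original
    return [expr {[encoding convertfrom $encoding $bytes] eq $char}]
}
```

For the bullet, isRepresentable iso8859-1 \u2022 should give 0, while isRepresentable utf-8 \u2022 should give 1.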

