Version 9 of a little code page browser

Updated 2004-01-23 11:23:07

Richard Suchenwirth - The little tool below lets you pick an encoding in the listbox (by double-clicking it) and displays the characters between \x00 and \xFF in that encoding, so it is best suited to single-byte encodings. It is especially useful for checking the various cp... code pages.

http://mini.net/files/codepages.jpg


 package require Tk
 listbox   .lb -yscrollcommand ".y set" -width 16
 bind .lb <Double-1> {showCodepage .t [%W get @%x,%y]} ;# item under the mouse
 scrollbar .y -command ".lb yview"
 text .t -bg white -height 32 -wrap word
 pack .lb .y .t -side left -fill y
 pack .t        -fill both -expand 1

 foreach encoding [lsort [encoding names]] {
    .lb insert end $encoding
 }
 proc showCodepage {w encoding} {
    # Show all 256 byte values of $encoding in text widget $w
    $w delete 1.0 end
    wm title . $encoding
    set hexdigits [list 0 1 2 3 4 5 6 7 8 9 A B C D E F]
    foreach high $hexdigits {
        foreach low $hexdigits {
            # Build the byte \xHL, then decode it from $encoding
            set c [encoding convertfrom $encoding [subst \\x$high$low]]
            $w insert end "$high$low:$c "
        }
        $w insert end \n\n ;# blank line after each row of 16
    }
 } ;# RS
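
For use without Tk, the same table can be built as a plain string. The following is only a sketch: codepageTable is a hypothetical helper that is not part of the original page, and it assumes the chosen encoding defines all 256 byte values. Control characters are emitted raw, just as in the text widget version.

```tcl
# Hypothetical helper (not from the original page): return the 16x16
# table of a single-byte encoding as a string, one row of 16 per line.
proc codepageTable {encoding} {
    set rows {}
    for {set high 0} {$high < 16} {incr high} {
        set row {}
        for {set low 0} {$low < 16} {incr low} {
            set code [expr {$high * 16 + $low}]
            # One raw byte, decoded according to the chosen encoding
            set c [encoding convertfrom $encoding [binary format c $code]]
            lappend row [format %02X:%s $code $c]
        }
        lappend rows [join $row " "]
    }
    return [join $rows \n]
}

puts [codepageTable iso8859-1]
```

Here binary format c is used instead of subst to produce the byte, which avoids building a backslash sequence by hand.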

How does one do the reverse of this: given an encoding name and a character or utf-8 decimal value (possibly > 255):

   scan $c %c decVal

How can one get from this to the code to use for that character in the particular encoding given? - RS: Elementary:

   encoding convertto $target_encoding $utf8string
   encoding convertto $target_encoding [format %c $decimal_unicode]

But this gives one a character, not a decimal value, doesn't it (but I am confused, so please bear with me)? For example, I know that a bullet \u2022 is actually decimal-165 in the macRoman encoding, decimal-8226 in utf-8, and something else in iso8859-1, but given a utf8string "\u2022", how do I generate those numbers (which I understand are the code-page indices or whatever of that glyph in that encoding)? - RS: Characters are just numbers. After encoding convertto, you can scan the number out with %c:

 % scan [encoding convertto macRoman \u2022] %c
 165

But:

 % scan [encoding convertto utf-8 \u2022] %c
 226
 % scan [encoding convertto iso8859-1 \u2022] %c
 63

Aren't correct... - RS: They are; it just takes a bit of explanation. Tcl stores strings internally in utf-8 encoding. When you convert a character to explicit utf-8 again, you get a series of characters (three in this case), one for each byte of the original character's utf-8 form (see Unicode and UTF-8). It is just like sqrt($x) not being sqrt(sqrt($x)), for most values of x at least. Decimal 226 (hex E2) is the first of this sequence, which sits in three memory bytes as E2 80 A2. (u2x is on the encoding page.)

 % u2x [encoding convertto utf-8 \u2022]
 \u00E2\u0080\u00A2
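
Since scan with %c only reports the first character of the converted string, a multi-byte result needs a different readout. The sketch below uses binary scan to list every byte as a decimal number. encodedBytes is a hypothetical helper name, and the unsigned modifier in cu* assumes Tcl 8.5 or later.

```tcl
# Hypothetical helper: return the decimal values of all bytes that
# $string occupies in $encoding.
proc encodedBytes {encoding string} {
    # c = 8-bit integer, u = unsigned, * = all remaining bytes
    binary scan [encoding convertto $encoding $string] cu* bytes
    return $bytes
}

puts [encodedBytes utf-8 \u2022]      ;# 226 128 162
puts [encodedBytes macRoman \u2022]   ;# 165
```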

The second is more easily explained:

 % format %c 63
 ?

As there is no mapping from \u2022 to iso8859-1, the conversion falls back to the default character, which is the question mark. Unicode defines more than a million code points (65,536 of them in its Basic Multilingual Plane alone), while iso8859-1 has only 256, so such an encoding conversion can lose information.
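
One way to tell a genuine question mark from a fallback is a round trip: convert to the target encoding, convert back, and compare with the original. The following is a minimal sketch under that assumption. representable is a hypothetical name, and the catch is there because some newer Tcl releases may raise an error for an unmappable character instead of substituting the question mark.

```tcl
# Hypothetical helper: 1 if $char survives a round trip through
# $encoding unchanged, 0 otherwise.
proc representable {encoding char} {
    # Treat a conversion error as "not representable"
    if {[catch {encoding convertto $encoding $char} bytes]} {
        return 0
    }
    return [expr {[encoding convertfrom $encoding $bytes] eq $char}]
}

puts [representable macRoman  \u2022]   ;# 1
puts [representable iso8859-1 \u2022]   ;# 0
puts [representable iso8859-1 ?]        ;# 1
```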

