a little code page browser

Summary

Richard Suchenwirth: The little tool below lets you select an encoding in the listbox and display the characters between \x00 and \xFF of that encoding, hence it is most suited for single-byte encodings. Especially useful for checking the various cp... code pages.

Description

WikiDbImage codepages.jpg

package require Tk
listbox   .lb -yscrollcommand ".y set" -width 16
bind .lb <Double-1> {showCodepage .t [selection get]}
scrollbar .y -command ".lb yview"
text .t -bg white -height 32 -wrap word
pack .lb .y .t -side left -fill y
pack .t        -fill both -expand 1

foreach encoding [lsort [encoding names]] {
    .lb insert end $encoding
}
proc showCodepage {w encoding} {
    $w delete 1.0 end
    wm title . $encoding
    set hexdigits [list 0 1 2 3 4 5 6 7 8 9 A B C D E F]
    foreach high $hexdigits {
        foreach low $hexdigits {
            set c [encoding convertfrom $encoding [subst \\x$high$low]] 
            $w insert end "$high$low:$c "
        }
        $w insert end \n\n
    }
} ;# RS

How does one do the reverse of this: given an encoding name and a character or utf-8 decimal value (possibly > 255):

scan $c %c decVal

How can one get from this to the code to use for that character in the particular encoding given?

RS: Elementary:

encoding convertto $target_encoding $utf8string
encoding convertto $target_encoding [format %c $decimal_unicode]

But this gives one a character, not a decimal value, doesn't it (but I am confused, so please bear with me)? For example I know that a bullet \u2022 is actually decimal-165 in the macRoman encoding, decimal-8226 in utf-8, and something else in iso8859-1, but given a utf8string "\u2022", how do I generate those numbers (which I understand are the code-page indices or whatever of that glyph in that encoding)?

RS: Characters are decimal values. After encoding convertto, you can just scan it out:

% scan [encoding convertto macRoman \u2022] %c
165

But:

% scan [encoding convertto utf-8 \u2022] %c
226
% scan [encoding convertto iso8859-1 \u2022] %c
63

Aren't correct....

RS: They are, it just takes a bit of explanation. Tcl stores strings internally in utf-8 encoding. When you convert a character again to explicit utf-8, you produce a series of characters (3 in this case) that correspond to the bytes of the original character (see Unicode and UTF-8). Just like sqrt($x) is not sqrt(sqrt($x)), for most values of x at least. Decimal 226 is the first of this sequence, which sits in three memory bytes as E2 80 A2 (u2x is on the encoding page):

% u2x [encoding convertto utf-8 \u2022]
\u00E2\u0080\u00A2
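
For readers without the u2x helper at hand, the byte values can also be pulled out with binary scan. The following is a minimal sketch, not part of the original discussion: the proc name codes is made up for illustration, and the cu* format assumes Tcl 8.5 or later.

```tcl
# codes: hypothetical helper (name invented for this sketch).
# Returns the list of unsigned byte values that represent the
# character(s) in $ch once converted to encoding $enc.
# Assumes Tcl 8.5+ for the unsigned "cu" format of binary scan.
proc codes {enc ch} {
    set bytes [encoding convertto $enc $ch]
    binary scan $bytes cu* values
    return $values
}

puts [codes macRoman \u2022]   ;# 165
puts [codes utf-8 \u2022]      ;# 226 128 162
```

Unlike scan with %c, which only reports the first character of the converted string, this returns every byte, so multi-byte encodings show up in full.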

The second is easier explained:

% format %c 63
?

As there is no mapping from \u2022 to iso8859-1, it falls back to the default character, which is the question mark. Unicode has room for more than a million code points (65,536 in its Basic Multilingual Plane alone), iso8859-1 only 256, so you lose information by such encoding conversion.
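
Whether a conversion loses information can be checked with a round trip: convert to the target encoding, convert back, and compare. A minimal sketch, assuming a hypothetical helper named representable (not from the original page). The catch is there on the assumption that newer Tcl releases may raise an error for unmappable characters instead of substituting the fallback character.

```tcl
# representable: hypothetical helper (name invented for this sketch).
# Returns 1 if string $ch survives a round trip through encoding $enc,
# 0 if it is replaced by a fallback character or the conversion fails.
proc representable {enc ch} {
    # Some Tcl versions may throw here instead of falling back to "?"
    if {[catch {encoding convertto $enc $ch} bytes]} {
        return 0
    }
    return [expr {[encoding convertfrom $enc $bytes] eq $ch}]
}

puts [representable iso8859-1 \u2022]   ;# 0
puts [representable macRoman \u2022]    ;# 1
```

Comparing the round-tripped string rather than testing for "?" keeps a genuine question mark from being reported as a failure.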

Stu 2008-11-15: Where is the proc getFixedFonts?

BHE: Slightly different version of the encoding viewer above to put the character array into a table with the Hex characters as col/row headings. I also added another listbox to select a font (fixed width fonts are listed first):

listbox .fonts -listvariable ::fonts -yscrollcommand ".sb set" -width 16
scrollbar .sb -command ".fonts yview"
listbox .lb -listvariable ::encodings -yscrollcommand ".y set" -width 16
scrollbar .y -command ".lb yview"
text .t -bg white -height 32 -wrap word -tabs {2}
 
pack .fonts .sb .lb .y .t -side left -fill y
pack .t        -fill both -expand 1
 
bind .fonts <Double-1> {.t config -font [list [.fonts get active]]}
bind .lb <Double-1> {showCodepage .t [.lb get active]}
 
getFixedFonts
 
set ::fonts $::fontDB(fixed)
set ::encodings [lsort [encoding names]]
 
proc showCodepage {w encoding} {
    $w delete 1.0 end
    wm title . $encoding
    set hexdigits [list 0 1 2 3 4 5 6 7 8 9 A B C D E F]
    $w insert end "  [join $hexdigits ""]\n +[string repeat "-" 16]\n"
    foreach high $hexdigits {
        $w insert end "$high|"
        foreach low $hexdigits {
            set c [encoding convertfrom $encoding [subst \\x$high$low]]
            if {$c == "\n" || $c == "\t"} { set c " " }
            $w insert end "$c"
        }
        $w insert end \n
    }
}
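
The getFixedFonts proc called in the script above is not shown on this page. As a stand-in, here is a minimal sketch of what such a proc might look like. It is a guess, not BHE's code: it assumes that ::fontDB(fixed) is meant to hold the font families that Tk reports as fixed-width, and the ::fontDB(proportional) element is an invention of this sketch.

```tcl
package require Tk

# Hypothetical stand-in for the missing getFixedFonts (not the original).
# Sorts the installed font families into ::fontDB(fixed) and
# ::fontDB(proportional), based on what "font metrics -fixed" reports.
proc getFixedFonts {} {
    set ::fontDB(fixed) {}
    set ::fontDB(proportional) {}
    foreach family [lsort -dictionary [font families]] {
        if {[font metrics [list $family] -fixed]} {
            lappend ::fontDB(fixed) $family
        } else {
            lappend ::fontDB(proportional) $family
        }
    }
}
```

To use it with the script above, define it before the line that calls getFixedFonts. To list the fixed-width fonts first and the rest after them, one could set ::fonts to [concat $::fontDB(fixed) $::fontDB(proportional)].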

I noticed that even though fonts report fixedwidth, only a few of the encodings for each are 100% fixedwidth. I've been looking for a way to print the 'box graphics' characters to use in a console text-adventure game, like NetHack or something. "Courier + cp437" seems to be a good combination for this. Just for completeness, I made a list of font+encoding that were really fixedwidth using this routine:

proc checkAllEncodingsForFixedWidth {f} {
    set fixedwidth_encodings {}
    puts -nonewline [format "%-20s: " [string range $f 0 20]]
    foreach enc $::encodings {
        set t ""
        for {set i 32} {$i < 256} {incr i} {append t [encoding convertfrom $enc [format %c $i]]}
        set size [font measure [list $f] $t]
        if {[font measure [list $f] "."] * 224 == $size} {
            lappend fixedwidth_encodings $enc
            puts -nonewline !
        } else {
            puts -nonewline .
        }
        update
    }
    puts ""
    return $fixedwidth_encodings
}

Pass in the font name and it returns a list of the available encodings that are truly fixedwidth in that font. It works by comparing the display width in pixels of the string of characters 0x20-0xFF, decoded from the given encoding, with the display width of a single "." times 224 (the number of characters in that range). Not a perfect search algorithm, but it only returned 1 or 2 mistakes. It takes forever to run for a single font (about 30 seconds?), so I left in the puts and update. Here's an array of fonts I compiled, giving a list of encodings that are truly fixed width for each. This is on a Windows XP machine with only default fonts, btw.

array set fixedwidth_font_encodings {
    "Courier" {cp437 cp737 cp775 cp850 cp852 cp855 cp857 cp860 
        cp861 cp862 cp863 cp865 cp866 iso2022-jp koi8-r 
        koi8-u macCentEuro macCroatian macCyrillic 
        macIceland macRoman macRomania macUkraine}
    "Courier New" {cp437 cp737 cp775 cp850 cp852 cp855 cp857 cp860
        cp861 cp862 cp863 cp865 cp866 iso2022-jp koi8-r 
        koi8-u macCentEuro macCroatian macCyrillic 
        macIceland macRoman macRomania macUkraine}
    "Fixedsys" {ascii ebcdic identity iso2022 iso2022-jp iso2022-kr 
        iso8859-1 iso8859-15 utf-8}
    "Lucida Console" {cp437 cp737 cp775 cp850 cp852 cp855 cp857 cp860 
        cp861 cp863 cp865 cp866 iso2022-jp koi8-r koi8-u 
        macCentEuro macCroatian macCyrillic macIceland 
        macRoman macRomania macUkraine}
    "Terminal" {iso2022-jp}
    "WST_Czec" {iso2022-jp}
    "WST_Engl" {iso2022-jp}
    "WST_Fren" {iso2022-jp}
    "WST_Germ" {iso2022-jp}
    "WST_Ital" {iso2022-jp}
    "WST_Span" {iso2022-jp}
    "WST_Swed" {iso2022-jp}
}

Category Human Language

Arts and Crafts of Tcl-Tk Programming