show encoding

WJP 2007-11-24 Sometimes you would like to know how a string appears in a particular encoding. The discussion yesterday over on comp.lang.tcl motivated me to create the following general procedure for doing this. The obligatory arguments are a string and the name of an encoding. It returns a string showing how the string is encoded in the specified encoding, byte by byte, except in the case of utf-16, utf-32, and ucs-2, where the numerical code of each character is shown as a single value of at least four hex digits. The optional third argument controls whether printable ASCII characters are converted to numerical codes (the default) or shown as such. In the byte-by-byte case, if one or more characters in the string do not belong to the character set of the specified encoding, an empty string is returned.

proc ShowEncoding {s enc {asciip 0}} {
    set res ""
    #We handle utf-16, utf-32 and ucs-2 separately because they may not be
    #supported encoding names; character codes are shown instead of bytes
    if {($enc eq "utf-32") || ($enc eq "utf-16") || ($enc eq "ucs-2")} {
        foreach c [split $s ""] {
            scan $c "%c" t
            if {($t < 0x7F) && ($t > 0x20) && $asciip} {
                lappend res $c
            } else {
                lappend res [format "0x%04x" $t]
            }
        }
    } else {
        set str [encoding convertto $enc $s]
        set rev [encoding convertfrom $enc $str]
        if {$s ne $rev} {return ""}
        foreach c [split $str ""] {
            scan $c %c i
            if {$asciip} {
                # A space is shown as 0x20 so it stays visible in the space-joined result
                lappend res [expr {$i <= 0x20 || $i > 0x7E ? [format 0x%02x $i] : $c}]
            } else {
                lappend res [format 0x%02x $i]
            }
        }
    }
    return [join $res]
}
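
The byte-by-byte branch above can also be sketched with binary scan, which extracts the byte values directly instead of splitting the converted string into characters. This is only an illustrative alternative: the name HexBytes is hypothetical and not part of the original procedure, and the sketch assumes Tcl 8.5 or later for the unsigned "cu" format.

```tcl
# Hypothetical helper (not from the original page): show the bytes of
# string s in encoding enc as space-separated hex values.
proc HexBytes {s enc} {
    set bytes [encoding convertto $enc $s]
    set codes {}
    # cu* reads every remaining byte as an unsigned 8-bit integer
    binary scan $bytes cu* codes
    set res {}
    foreach b $codes {
        lappend res [format 0x%02x $b]
    }
    return [join $res]
}

puts [HexBytes "\u00e9t\u00fc" utf-8]
```

Unlike ShowEncoding, this sketch does not check that the string is representable in the target encoding.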

Demo

    set foo "\u00e9t\u00fc"
    puts [format "%-9s\t%s" ucs-2 [ShowEncoding $foo ucs-2]]
    puts [format "%-9s\t%s" utf-8 [ShowEncoding $foo utf-8]]
    puts [format "%-9s\t%s" iso8859-1 [ShowEncoding $foo iso8859-1]]
    puts [format "%-9s\t%s" cp1252 [ShowEncoding $foo cp1252]]
    puts [format "%-9s\t%s" ascii [ShowEncoding $foo ascii]]

The output from the demo is as follows:

 ucs-2           0x00e9 0x0074 0x00fc
 utf-8           0xc3 0xa9 0x74 0xc3 0xbc
 iso8859-1       0xe9 0x74 0xfc
 cp1252          0xe9 0x74 0xfc
 ascii

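Note that the second column for "ascii" is empty because the string contains non-ASCII characters: converting it to ascii and back does not reproduce the original string, so ShowEncoding returns an empty string.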
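
The round-trip test behind that empty result can be isolated as a small predicate. The following is a minimal sketch; the name RoundTrips is hypothetical. It assumes that an interpreter either substitutes a fallback character for anything the encoding cannot represent or raises an error, and catch is used so that both behaviours count as a failed round trip.

```tcl
# Hypothetical helper (not from the original page): return 1 if every
# character of s survives conversion to encoding enc and back, else 0.
proc RoundTrips {s enc} {
    # A conversion error is treated the same as a lossy conversion
    if {[catch {encoding convertto $enc $s} bytes]} {
        return 0
    }
    if {[catch {encoding convertfrom $enc $bytes} back]} {
        return 0
    }
    return [expr {$s eq $back}]
}

puts [RoundTrips "\u00e9t\u00fc" iso8859-1]
puts [RoundTrips "\u00e9t\u00fc" ascii]
```

For the demo string, the first call should report success and the second failure, matching the empty "ascii" row above.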