Univert

Univert is a Universal Unicode Converter, at least as universal as the Unicode implementation in Tcl 8.1 or later can manage on your computer.

Web site: http://scarydevil.com/~peter/sw/univert/

Univert converts interchangeably between the following formats:

Unicode: Native Tcl Unicode, whatever your computer supports.
Hex: Hex bytes or words separated by spaces.
Decimal: Decimal bytes or words separated by spaces.
Entity: HTML entities (e.g. &eacute;).
URL: UTF-8/URL encoding, for Unicode HTTP requests.
Printable: Quoted printable, using whatever the current encoding is.
String: C string, using whatever the current encoding is.
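To make a few of these formats concrete, here is a minimal sketch of how the Hex, Entity and URL forms of a string could be produced in plain Tcl. The procedure names are made up for illustration and nothing here is taken from Univert's own source; the Entity helper emits numeric character references only and does not escape &, < or >.

```tcl
# Hypothetical helpers for illustration; not taken from Univert.

# Hex: one hex word (at least four digits) per character.
proc toHexWords {s} {
    set out {}
    foreach c [split $s ""] {
        scan $c %c code
        lappend out [format %04X $code]
    }
    return [join $out " "]
}

# Entity: numeric character references for everything outside ASCII.
proc toEntities {s} {
    set out ""
    foreach c [split $s ""] {
        scan $c %c code
        if {$code > 127} {
            append out "&#$code;"
        } else {
            append out $c
        }
    }
    return $out
}

# URL: encode as UTF-8, then percent-escape every byte that is not an
# unreserved URL character.
proc toUrl {s} {
    set out ""
    binary scan [encoding convertto utf-8 $s] c* bytes
    foreach b $bytes {
        set b [expr {$b & 0xFF}]   ;# binary scan c yields signed bytes
        set ch [format %c $b]
        if {[regexp {^[A-Za-z0-9._~-]$} $ch]} {
            append out $ch
        } else {
            append out [format %%%02X $b]
        }
    }
    return $out
}

puts [toHexWords "h\u00F6lle"]   ;# 0068 00F6 006C 006C 0065
puts [toEntities "h\u00F6lle"]   ;# h&#246;lle
puts [toUrl "h\u00F6lle"]        ;# h%C3%B6lle
```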

APN: Very useful (for me). Thanks!

DKF: It'd be nice if it also had \u escapes.

RS offers this snippet, which \u-escapes all non-ASCII characters in a given string:

 proc u2x string {
    set res ""
    foreach u [split $string ""] {
        scan $u %c t
        append res [expr {$t>127? [format "\\u%04.4X" $t] : $u}]
    }
    set res
 }

 % u2x hölle
 h\u00F6lle
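Going the other way is simpler, because Tcl's own subst command already understands \u escapes. The sketch below uses a made-up name, x2u, and assumes the input contains no other backslash sequences that should stay literal: subst expands all of them (\n, \t and so on), not just \u.

```tcl
# Hypothetical inverse of u2x: expand backslash escapes, but leave
# $variables and [commands] in the text untouched.
proc x2u {escaped} {
    return [subst -nocommands -novariables $escaped]
}

# Braces keep the backslash literal, so x2u receives the six
# characters \u00F6 rather than an already-converted character.
puts [x2u {h\u00F6lle}]
```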

WJP: The above routine works as is, but contains a nasty little trap. Suppose that you don't want the \u prefix, so you change the second line of the loop to:

 append res [expr {$t>127? [format "%04.4X" $t] : $u}]

Looks innocuous, doesn't it? It isn't. The procedure is now wrong. The formatted text is the result of an expr, and expr reinterprets any result that looks like a number. If the hex digits have the form digits, E, digits, they are read as exponential notation. For example, if the input string contains U+03E3 (COPTIC SMALL LETTER SHEI), the result of the conversion will not be "03E3" as desired but "3000.0"! Purely numeric-looking results such as 0100 are at risk for the same reason.
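The trap can be reproduced in isolation. In this small sketch the two procedure names are invented for illustration; both take a single code point and differ only in whether the formatted text passes through expr.

```tcl
# Hypothetical demonstration procs; not part of Univert.

# The formatted hex text is the result of expr's ?: operator, so expr
# is free to reinterpret it as a number.
proc hexViaExpr {code} {
    return [expr {$code > 127 ? [format %04X $code] : [format %c $code]}]
}

# The same logic with a plain if: the hex text never passes through expr.
proc hexViaIf {code} {
    if {$code > 127} {
        return [format %04X $code]
    }
    return [format %c $code]
}

puts [hexViaExpr 0x3E3]   ;# 3000.0
puts [hexViaIf 0x3E3]     ;# 03E3
```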

Here's a safe version:

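 # Characters up to U+00FF pass through unchanged; anything above is
 # replaced by its code point in hex, followed by a space.
 proc UnicodeStringToHex {s} {
    set rv ""
    foreach c [split $s ""] {
        scan $c "%c" t
        # if/else instead of expr's ?: keeps the hex text out of expr,
        # so it is never reinterpreted as a number.
        if {$t <= 0xFF} {
            append rv $c
        } else {
            append rv [format "%04X " $t]
        }
    }
    return [string trim $rv]
 }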

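(This one inserts spaces as separators in lieu of \u. It also passes every character up to U+00FF through unchanged, whereas u2x escapes everything above 127.)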

Category Application

Category Human Language