Version 33 of encoding

Updated 2008-03-03 18:05:00 by LV

encoding convertfrom ?encoding? data

encoding convertto ?encoding? string

encoding dirs ?directoryList?

encoding names

encoding system ?encoding?


http://purl.org/tcl/home/man/tcl8.5/TclCmd/encoding.htm


In a recent discussion in the chat room, the frustration of receiving characters without knowing their encoding, and the wish for a utility to help turn them into something readable, led to this suggestion:

RS: Larry: Yes. One might wrap that functionality in a text widget, into which you paste the suspicious page, and have a listbox offering everything [encoding names] returns... a double-click on a listbox item converts the text contents.

RS: Rolf: "encoding convertfrom foo" turns the questionable characters into UTF-8, which you can inspect directly in your text widget. Writing it to a text file, or to stdout, involves an intrinsic "encoding convertto [encoding system]", where the system encoding is often iso8859-1 or cp1252. So if the input had Russian or Greek characters, these will not come through to the system encoding, but will be replaced by question marks.

Alas, the situation is this - LV regularly receives emails, files, web pages, etc. without an encoding specified. While most of the text appears correct, the punctuation is skewed - it shows up as ? or \x92, etc. instead of things like " or ' or -, and so forth.

All he wants is to make the weird punctuation marks (which he figures looked like various fancy punctuation characters in the creators' editors) readable.

Here's the code LV is currently trying...

 # translate those obnoxious windows cp1252 characters to something better
 # (looping on [gets] instead of testing [eof stdin] avoids printing
 #  a spurious empty line after the last line of input)
 while {[gets stdin buffer] >= 0} {
        # \xd7 is the multiplication sign, \x85 the horizontal ellipsis,
        # \x96 and \x97 the en and em dashes
        set buffer2 [string map { \xa9 (c) \xd7 x \x85 ... \x91 ' \x92 ' \x93 \" \x94 \" \x96 - \x97 -- } $buffer]
        puts $buffer2
 }
 # I've no idea why those quotes have to be back slashed...

Lars H: See the man page of lindex. Quotes have the same power in lists as they do in "ordinary" Tcl code.
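Since the bytes in question are Windows cp1252 punctuation, an alternative sketch (assuming the input really is cp1252) lets encoding convertfrom do the whole mapping at once, yielding the real curly quotes, dashes and ellipses instead of ASCII stand-ins:

```tcl
# Assumption: $raw holds bytes that were produced under cp1252.
set raw "He said \x93hello\x94 \x96 twice\x85"
set text [encoding convertfrom cp1252 $raw]
puts $text ;# curly quotes, en dash and ellipsis as real characters
```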


Examples: The Euro sign is represented in Windows cp1252 as \x80. If you receive such strings, you can see the real Euro sign with

 encoding convertfrom cp1252 \x80

Back you go with

 encoding convertto cp1252 \u20AC

You can find out which encoding is used by default for communication with the OS (including file I/O) with

 encoding system

You can easily see which encodings are delivered with your Tcl version with

 encoding names
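Since [encoding names] returns an ordinary Tcl list, it can be filtered with the usual list commands; for example, to see just the ISO 8859 family (a trivial sketch):

```tcl
# List only the iso8859-* encodings shipped with this Tcl
puts [lsearch -all -inline [encoding names] iso8859-*]
```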

VL 18 aug 2003 - A list that matches Tcl encoding names to IANA's list is available at [L1 ]


Can I use the 'encoding' command (or some appropriate 'fconfigure -encoding') to take a Tcl source file (*.tcl) in an arbitrary encoding and output a well-formed Tcl source file which is pure ASCII (i.e. all chars > 127 have been converted to \uhhhh Unicode escape sequences)?

RS: Sure, but some assembly required - like this:

 proc u2x s {
    set res ""
    foreach i [split $s ""] {
        scan $i %c int                          ;# character -> code point
        if {$int < 128} {
           append res $i                        ;# plain ASCII passes through
        } else {
           append res \\u[format %04.4X $int]   ;# escape as \uHHHH
        }
    }
    set res
 }
 set fp [open $filename]
 fconfigure $fp -encoding $originalEncoding
 set data [u2x [read $fp [file size $filename]]]
 close $fp
 set fp2 [open $newFilename w]
 puts -nonewline $fp2 $data
 close $fp2 

The "u2x" functionality is easily done, but it's also somewhere built-in in Tk - on Unix, codes for which no font has a character are substituted in "\uxxxx" style... (Windows mostly shows an empty rectangle). See Unicode and UTF-8


RS revisits this page on 2004-03-19 and now would rather write it this way:

 proc u2x s {
    set res ""
    foreach c [split $s ""] {
      scan $c %c int
      append res [expr {$int<128? $c :"\\u[format %04.4X $int]"}]
    }
    set res
 }
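A quick demonstration (the proc is repeated here so the snippet is self-contained):

```tcl
proc u2x s {
    set res ""
    foreach c [split $s ""] {
      scan $c %c int
      append res [expr {$int<128? $c :"\\u[format %04.4X $int]"}]
    }
    set res
}
puts [u2x "Euro: \u20ac"] ;# -> Euro: \u20AC
```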

Encodability: Not every Unicode character can be represented unambiguously in every encoding. In such cases a fallback character (e.g. ?) is inserted instead, and on reconversion the original character is lost. Here is a little tester for such cases, demonstrated with some German and Greek letters (the Greek ones, out of laziness, produced by Greeklish). Notice the trick of quoting the argument list so that a command result becomes an argument default - which assumes that the name of the system encoding is a single word (RS):

 proc encodable "s {enc [encoding system]}" {
    string eq $s [encoding convertfrom $enc [encoding convertto $enc $s]]
 }
 % encodable äöü ; # That's "\u00e4\u00f6\u00fc" in case this example needs to be repaired
 1
 % encodable [greeklish Aqhnai]
 0
 % encodable [greeklish Aqhnai] iso8859-7
 1
 % encodable äöü iso8859-7
 0

See also:


Note that the lack of a mapping from some other encoding into Unicode can lead to effects that appear to be font problems though they can be fixed by the creation of a new encoding, as that page illustrates.


See also Encoding table generator, which also has code to work around a bug/feature: if an encoding file gives 0000 as the target for a character, encoding convertfrom seems to fall through to iso8859-1 and take the Unicode character from there...


Is it true that the documentation lies? The documentation says that a search is made

    foreach dir $::tcl_libPath {
        set fn $dir/encoding/$name.enc
        ...
    }

(see documentation for Tcl_GetEncoding()--the bottom of [L3 ]). I think this is not true, and that only $tcl_library/encoding/$name.enc is searched. Who knows?

See TIP 258 [L4 ].
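In Tcl 8.5 (per TIP 258) the search path for .enc files is introspectable and settable at runtime with encoding dirs; a sketch (the directory name is made up):

```tcl
# Show where Tcl currently looks for .enc files
puts [encoding dirs]
# Prepend a hypothetical custom directory to the search path
encoding dirs [linsert [encoding dirs] 0 /my/custom/encodings]
```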


When do I need to include the .enc files in a distribution?

When code uses encoding convertfrom and the like, certainly. But what if it doesn't, and you are (for example) writing a Tk application that will display UTF-8, or using the message catalog to localize for other languages, and want it to display correctly on systems in different countries? Are the .enc files needed then?


MG I've never really looked at encodings before, and just started doing so for the first time while trying to hack something into a client app. The problem I've just run into is that the server uses the encoding name "iso-8859-1" (which seems to be the correct IANA name), while Tcl calls it "iso8859-1" (without the first hyphen). Is there any reason for this? Or any simple (but correct) way around it? (My immediate thought was to try matching against the Tcl encoding names with the first hyphen edited out, if the encoding can't be found - definitely simple, but also definitely not correct...) Any help would be greatly appreciated. Thanks.

A. This is indeed a problem... [L5 ]

LV 2007 Oct 18 Have you submitted a bug report at http://tcl.sf.net/ ?
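One pragmatic workaround (only a heuristic sketch, not a standards-based mapping; the proc name is made up) is to normalize both names by lowercasing and dropping hyphens before comparing against [encoding names]:

```tcl
# Hypothetical helper: guess the Tcl encoding name for an IANA-style
# charset name by comparing hyphen-stripped, lowercased forms.
proc guessTclEncoding iana {
    set want [string map {- ""} [string tolower $iana]]
    foreach enc [encoding names] {
        if {[string map {- ""} [string tolower $enc]] eq $want} {
            return $enc
        }
    }
    error "no Tcl encoding matches \"$iana\""
}
puts [guessTclEncoding "ISO-8859-1"] ;# -> iso8859-1
```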


See also: