Version 52 of encoding

Updated 2020-05-13 15:08:37 by oehhar

encoding , a built-in Tcl command, manages the conversion of strings to and from particular encodings, and controls how the definitions of particular encodings are located and what Tcl believes the OS uses for encoding strings.

See Also

Mapping of Tcl-encoding names to IANA-list
encoding to and from UTF-8
An example by KBK.
Chinese characters in file name
HaO 2012-01-30: After a question about Chinese characters in a DOS box, in the Windows wish console, and in tkcon, Donal Fellows explains the console use of encodings.
Bug in regsub with "special chars"?
AMG: Having the wrong source encoding can interact badly with regular expressions.
encodiff
Encoding table generator
Also has code to work around a bug/feature: if an encoding file gives 0000 as target for a character, encoding convertfrom seems to fall through to iso8859-1 and get the Unicode from there...
register new encodings at runtime
TIP 258
Refined the behaviour of Tcl's encoding routines.
ycl string encode/decode
Alternative routines that return an error rather than losing information.

Synopsis

encoding convertfrom ?encoding? data
encoding convertto ?encoding? string
encoding dirs ?directoryList?
encoding names
encoding system ?encoding?

Description

HaO 2020-05-13 - The encoding "unicode" may be 16 or 32 bit and little or big endian, depending on the compile options and platform endianness. See also [L1 ].
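Since the byte layout of the "unicode" encoding varies between builds, the safest way to find out what your interpreter does is to look at the bytes it actually produces. A small sketch (no particular output is assumed, since it depends on the build):

```tcl
# Inspect what the "unicode" encoding produces on this particular build.
# The byte sequence varies with Tcl version, compile options and platform
# endianness, so the comment below is only one possible result.
binary scan [encoding convertto unicode A] H* hex
puts "A encodes as bytes: $hex"   ;# e.g. 4100 on a little-endian 16-bit build
```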

Internal encodings

The following encodings are available in Tcl 8.6 without encoding files:

  • identity (deprecated; the default encoding for Tcl 8.0 through Tcl 8.5)
  • iso8859-1 (the default encoding for Tcl 8.6)
  • utf-8
  • unicode

They are hard-coded in the file generic/tclEncoding.c.

Information source: Don Porter on clt "Re: TCLCORE Default fallback encoding in TCL8.6.10 not identity but iso8859-1 ?" on 2020-05-11.

Why is iso8859-1 the default

HaO 2020-05-13: Kevin Kenny gave the following explanation on clt "Re: TCLCORE Default fallback encoding in TCL8.6.10 not identity but iso8859-1 ?", date 2020-05-11:

ISO8859-1 enjoys the uniquely privileged position that 'binary' is the same encoding. The first 256 code points of Unicode are precisely the 256 byte values of ISO8859-1. The conversion between Tcl's byte arrays and Tcl's strings is therefore also ISO8859-1 equivalent.

The implication is that ISO8859-1 is at least lossless. In particular, every UTF-8 string has some interpretation as an ISO8859-1 string, and on any Unix-based system, that's enough to represent path names so that encoding tables can be read.

The 'identity' encoding was a rather muddled attempt to produce the same effect, which did not work, and resulted in malformed UTF-8 if any data outside the seven-bit ISO646 range were present.

It's pretty subtle, but ISO8859-1 is pretty much The Right Thing anywhere that we thought 'identity' might be correct. (With the exception of some code internal to the test suite whose purpose is to test that the Core does not crash if handed a malformed CESU-8 string by a C extension.).
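The lossless property Kevin Kenny describes above can be demonstrated directly: every one of the 256 possible byte values round-trips through iso8859-1 unchanged.

```tcl
# Build a string containing all 256 code points U+0000..U+00FF, which are
# exactly the 256 byte values of ISO8859-1, and round-trip it.
set all ""
for {set i 0} {$i < 256} {incr i} {
    append all [format %c $i]
}
set roundtrip [encoding convertfrom iso8859-1 [encoding convertto iso8859-1 $all]]
puts [string equal $all $roundtrip]   ;# 1: no information was lost
```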

Discussion

In a recent discussion in the chat room, the frustration of receiving characters of unknown encoding, and the wish for a utility to turn them into something nearly readable, resulted in the following suggestion:

RS: Larry: Yes. One might wrap that functionality in a text widget, into which you paste the suspicious page, and have a listbox to choose from everything [encoding names] offers... a double-click on a listbox item converts the text contents.

RS: Rolf: "encoding convertfrom foo" turns the questionable characters into UTF-8, which you can inspect directly in your text widget. Writing it to a text file, or stdout, involves an implicit "encoding convertto [encoding system]", which often is iso8859-1 or cp1252. So if the input had Russian or Greek characters, these will not come through to the system encoding, but will be replaced by question marks.
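A minimal sketch of the tool RS describes, under the assumption that the pasted characters all lie below U+0100 (which is the case when text was mis-decoded as Latin-1): paste the suspect text into the text widget, then double-click an encoding to reinterpret it.

```tcl
package require Tk

# Text widget for the suspect text, listbox with all known encodings.
text .t -width 60 -height 12
listbox .lb -listvariable encs
set encs [lsort [encoding names]]
pack .lb -side right -fill y
pack .t -side left -fill both -expand 1

bind .lb <Double-1> {
    set enc [.lb get [.lb curselection]]
    # Recover the raw bytes (valid because chars are assumed < U+0100),
    # then reinterpret them with the chosen encoding.
    set raw [encoding convertto iso8859-1 [.t get 1.0 end-1c]]
    .t delete 1.0 end
    .t insert end [encoding convertfrom $enc $raw]
}
```

The reinterpretation trick in the binding works because iso8859-1 maps characters back to their original byte values losslessly, as explained elsewhere on this page.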

Alas, the situation is this: LV receives emails, files, web pages, etc. on a regular basis without an encoding specified. While most of the text appears correct, the punctuation is skewed: it shows up as ? or \x92, etc., instead of things like " or ' or -, and so forth.

All he wants to do is have the weird punctuation marks (which he figures in the editors of the creators look like various fancy punctuation characters) readable.

Here's the code LV is currently trying...

# translate those obnoxious windows characters to something better
# (testing the return value of gets avoids emitting a spurious
#  empty line when eof is reached)
while {[gets stdin buffer] >= 0} {
    set buffer2 [string map { \xa9 (c) \xd7 (c) \x85 - \x91 ' \x92 ' \x93 \" \x94 \" } $buffer]
    puts $buffer2
}
# I've no idea why those quotes have to be back slashed...

Lars H: See the man page of lindex. Quotes have the same power in lists as they do in "ordinary" Tcl code.

Examples

The Euro sign is represented in Windows cp1252 as \x80. If you get such strings in, you can see the real Euro sign with

encoding convertfrom cp1252 \x80

Back you go with

encoding convertto cp1252 \u20AC

To find out which encoding is used for communication with the OS (including file I/O):

encoding system

To see available encodings:

encoding names
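Since the set of available encodings varies between installations (it depends on which .enc files are shipped), it can be worth checking that a name is actually known before converting. A small helper (the name haveEncoding is made up for this example):

```tcl
# Return 1 if the given encoding name is available in this interpreter.
proc haveEncoding {name} {
    expr {$name in [encoding names]}
}

puts [haveEncoding utf-8]     ;# 1 on any Tcl installation
puts [haveEncoding klingon]   ;# 0
```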

Converting non-ASCII Characters

Can I use the 'encoding' command (or some appropriate fconfigure -encoding) to take a Tcl source file (*.tcl) in an arbitrary encoding and output a well-formed Tcl source file which is pure ASCII (i.e., all chars > 127 converted to \uhhhh Unicode escape sequences)?

RS: Sure, but some assembly required - like this:

proc u2x s {
    set res {} 
    foreach i [split $s {}] {
        scan $i %c int
        if {$int < 128} {
            append res $i
        } else {
            append res \\u[format %04.4X $int]
        }
    }
    set res
}
set fp [open $filename]
fconfigure $fp -encoding $originalEncoding
set data [u2x [read $fp [file size $filename]]]
close $fp
set fp2 [open $newFilename w]
puts -nonewline $fp2 $data
close $fp2 

The "u2x" functionality is easily done, but it is also built into Tk: on Unix, codes for which no font has a glyph are substituted in "\uxxxx" style... (Windows mostly shows an empty rectangle). See Unicode and UTF-8


RS revisits this page on 2004-03-19 and now would rather write it this way:

proc u2x s {
    set res {} 
    foreach c [split $s {}] {
        scan $c %c int
        append res [expr {$int<128? $c :"\\u[format %04.4X $int]"}]
    }
    set res
}

Encodability

Not every Unicode character can be represented unambiguously in every encoding. In such cases a fallback character (e.g. ?) is inserted instead, and on reconversion the original character is lost. Here is a little tester for such cases, demonstrated with some German and Greek letters (out of laziness, produced by Greeklish). Notice the trick of specifying the argument list in quotes, so that a command result can serve as an argument default; this assumes that the name of the system encoding is a single word (RS):

proc encodable "s {enc [encoding system]}" {
    string eq $s [encoding convertfrom $enc [encoding convertto $enc $s]]
}
% encodable äöü ; # That's "\u00e4\u00f6\u00fc" in case this example needs to be repaired
1
% encodable [greeklish Aqhnai]
0
% encodable [greeklish Aqhnai] iso8859-7
1
% encodable äöü iso8859-7
0
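The [greeklish] helper used in the session above is not defined on this page. A minimal, hypothetical transliteration covering just the "Aqhnai" example might look like this (a real Greeklish converter would map the whole alphabet):

```tcl
# Map a few Greeklish (Latin) letters to their Greek counterparts.
# Only the letters needed for "Aqhnai" (Athens) are covered here.
proc greeklish {s} {
    string map {
        A \u0391 q \u03b8 h \u03b7 n \u03bd a \u03b1 i \u03b9
    } $s
}

puts [greeklish Aqhnai]   ;# Αθηναι
```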

See also:

RFE #535705 , Tcl interface to stop on encoding errors
remove diacritic

Misc


Note that the lack of a mapping from some other encoding into Unicode can lead to effects that appear to be font problems though they can be fixed by the creation of a new encoding, as that page illustrates.


When do I need to include the .enc files in a distribution?

When code uses encoding convertfrom and such, certainly. But what if you're not, but you are (for example) writing a Tk application that's going to be displaying UTF-8, or using the message catalog to localize for other languages, and want it to display correctly on systems in different countries? Are they needed then?


MG: I've never really looked at encodings before, and just started doing so for the first time while trying to hack something into a client app. The problem I've just run into is that the server uses the encoding name "iso-8859-1" (which seems to be the correct IANA name), while Tcl calls it "iso8859-1" (without the first hyphen). Is there any reason for this? Or any simple (but correct) way around it? (My immediate thought was to try matching against the Tcl encoding names with the first hyphen edited out, if the encoding can't be found - definitely simple, but also definitely not correct...) Any help would be greatly appreciated. Thanks.

A.

This is indeed a problem... [L2 ]

LV 2007-10-18: Have you submitted a bug report at http://tcl.sf.net/ ?

ZB 2008-03-04: Maybe don't be too quick to fix that "bug"; note that "iso-8859-1" is used e.g. in web page headers, while operating systems (e.g. Linux and the BSD systems) use "iso8859-1". The latter form is also used by converting utilities (like iconv). I don't know the reason, but it seems to be a kind of convention, so there seems to be no need for a change.
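A hedged sketch of the workaround MG describes, not a complete IANA mapping: try the name as given, then fall back to stripping the first hyphen ("iso-8859-1" becomes "iso8859-1"). The proc name tclEncodingName is made up for this example.

```tcl
# Resolve an IANA-style encoding name to a name Tcl knows, if possible.
proc tclEncodingName {iana} {
    set iana [string tolower $iana]
    if {$iana in [encoding names]} {
        return $iana
    }
    # IANA writes iso-8859-1 where Tcl writes iso8859-1: drop the
    # first hyphen and try again.
    set candidate [regsub {^([a-z]+)-} $iana {\1}]
    if {$candidate in [encoding names]} {
        return $candidate
    }
    error "no Tcl encoding found for \"$iana\""
}

puts [tclEncodingName iso-8859-1]   ;# iso8859-1
puts [tclEncodingName utf-8]        ;# utf-8
```

This only covers the hyphen convention; IANA aliases like "latin1" or "us-ascii" would need an explicit alias table.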