encoding

encoding, a built-in Tcl command, converts strings to and from particular encodings, controls how the definitions of particular encodings are located, and determines what Tcl believes the OS uses for encoding strings.

See Also

Mapping of Tcl-encoding names to IANA-list
encoding to and from UTF-8
An example by KBK.
Chinese characters in file name
HaO 2012-01-30: After a question about Chinese characters in a DOS BOX, in the windows wish console and in tkcon, Donal Fellows explains the console use of encodings.
Bug in regsub with "special chars"?
AMG: Having the wrong source encoding can interact badly with regular expressions.
encodiff
Encoding table generator
Also has code to work around a bug/feature: if an encoding file gives 0000 as target for a character, encoding convertfrom seems to fall through to iso8859-1 and get the Unicode from there...
register new encodings at runtime
TIP 258
Refined the behavior of Tcl's encoding routines.
ycl string encode/decode
Alternative routines that return an error rather than losing information.

Synopsis

encoding convertfrom ?encoding? data
encoding convertto ?encoding? string
encoding dirs ?directoryList?
encoding names
encoding system ?encoding?

Description

HaO 2020-05-13 - The encoding "unicode" may be 16- or 32-bit and little- or big-endian, depending on the compile options and the platform's endianness. See also [L1 ].
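
A quick way to probe what a particular build uses (a sketch; the exact byte count is not an API guarantee):

# 2 bytes per character suggests UTF-16, 4 suggests UTF-32
puts [string length [encoding convertto unicode A]]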

Internal encodings

The following encodings are available in Tcl 8.6 without encoding files:

  • identity (deprecated; the default encoding in Tcl 8.0 through Tcl 8.5)
  • iso8859-1 (the default encoding in Tcl 8.6)
  • utf-8
  • unicode

They are hard-coded in the file generic/tclEncoding.c.

Information source: Don Porter on the core list, 2020-05-11, in the thread "Re: TCLCORE Default fallback encoding in TCL8.6.10 not identity but iso8859-1 ?".
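
A quick check of which of these names a given interpreter actually reports (a sketch; the output of [encoding names] varies by version and build):

foreach e {identity iso8859-1 utf-8 unicode} {
    set listed [expr {[lsearch -exact [encoding names] $e] >= 0}]
    puts [format "%-10s %s" $e [expr {$listed ? "listed" : "not listed"}]]
}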

Why is iso8859-1 the default?

HaO 2020-05-13: Kevin Kenny gave the following explanation on the core list on 2020-05-11 titled "Re: TCLCORE Default fallback encoding in TCL8.6.10 not identity but iso8859-1 ?":

ISO8859-1 enjoys the uniquely privileged position that 'binary' is the same encoding. The first 256 code points of Unicode are precisely the 256 byte values of ISO8859-1. The conversion between Tcl's byte arrays and Tcl's strings is therefore also ISO8859-1 equivalent.

The implication is that ISO8859-1 is at least lossless. In particular, every UTF-8 string has some interpretation as an ISO8859-1 string, and on any Unix-based system, that's enough to represent path names so that encoding tables can be read.

The 'identity' encoding was a rather muddled attempt to produce the same effect, which did not work, and resulted in malformed UTF-8 if any data outside the seven-bit ISO646 range were present.

It's pretty subtle, but ISO8859-1 is pretty much The Right Thing anywhere that we thought 'identity' might be correct. (With the exception of some code internal to the test suite whose purpose is to test that the Core does not crash if handed a malformed CESU-8 string by a C extension.)
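
The lossless property is easy to demonstrate: every possible byte value survives a round trip through iso8859-1 (a sketch):

# build a byte array containing all 256 byte values
set values {}
for {set i 0} {$i < 256} {incr i} {lappend values $i}
set all [binary format c* $values]
set s [encoding convertfrom iso8859-1 $all]
puts [string equal $all [encoding convertto iso8859-1 $s]]  ;# expect 1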

Discussion

In a discussion in the Tcl chatroom, the frustration of receiving characters without knowing their encoding, and the wish for a utility to help turn such characters into something nearly readable, led to the following suggestions:

RS: Larry: Yes. One might wrap that functionality in a text widget into which you paste the suspicious page, with a listbox to choose from everything [encoding names] offers... a double-click on a listbox item converts the text contents.

RS: Rolf: "encoding convertfrom foo" turns the questionable characters into UTF-8, which you can inspect directly in your text widget. Writing it to a text file, or stdout, involves an implicit "encoding convertto [encoding system]", where the system encoding is often iso8859-1 or cp1252. So if the input had Russian or Greek characters, these will not come through to the system encoding, but will be replaced by question marks.

Alas, the situation is this: LV receives emails, files, web pages, etc. on a regular basis without an encoding specified. While most of the text characters appear correct, the punctuation is skewed: it shows up as ? or \x92, etc., instead of things like " or ' or -, and so forth.

All he wants is to make the weird punctuation marks (which he figures looked like various fancy punctuation characters in the creators' editors) readable.

Here's the code LV is currently trying...

# translate those obnoxious windows characters to something better;
# the map is built with [list] so that the \xNN escapes are substituted
# (inside braces they would remain literal backslash sequences)
set map [list \xa9 (c) \xd7 (c) \x85 - \x91 ' \x92 ' \x93 \" \x94 \"]
while {[gets stdin buffer] >= 0} {
    puts [string map $map $buffer]
}
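
If the stray bytes really are cp1252, a simpler route (a sketch) is to let the channel layer decode them instead of mapping byte values by hand; RS's caveat above still applies for characters the system encoding cannot represent:

fconfigure stdin -encoding cp1252
while {[gets stdin line] >= 0} {
    puts $line
}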

Examples

The Euro sign is represented in Windows cp1252 as \x80. If you receive such strings, you can see the real Euro sign with

encoding convertfrom cp1252 \x80

Back you go with

encoding convertto cp1252 \u20AC
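
To check the round trip at the code-point level (a sketch; scan %c yields the Unicode code point):

% scan [encoding convertfrom cp1252 \x80] %c
8364
% format %04X [scan [encoding convertfrom cp1252 \x80] %c]
20AC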

To find out which encoding is used for communication with the OS (including file I/O):

encoding system

To see available encodings:

encoding names

Converting non-ASCII Characters

Can I use the 'encoding' command (or some appropriate fconfigure -encoding) to take a Tcl source file (*.tcl) in an arbitrary encoding and output a well-formed Tcl source file which is pure ASCII (i.e., all chars > 127 converted to \uhhhh Unicode escape sequences)?

RS: Sure, but some assembly required - like this:

proc u2x s {
    set res {}
    foreach i [split $s {}] {
        scan $i %c int
        if {$int < 128} {
            append res $i    ;# pass ASCII through unchanged
        } else {
            append res \\u[format %04.4X $int]    ;# escape everything else
        }
    }
    set res
}
set fp [open $filename]
fconfigure $fp -encoding $originalEncoding
set data [u2x [read $fp]]    ;# plain read: after conversion, chars != bytes
close $fp
set fp2 [open $newFilename w]
puts -nonewline $fp2 $data
close $fp2

The "u2x" functionality is easily done, but it's also somewhere built-in in Tk - on Unix, codes for which no font has a character are substituted in "\uxxxx" style... (Windows mostly shows an empty rectangle). See Unicode and UTF-8


RS revisits this page on 2004-03-19 and now would rather write it this way:

proc u2x s {
    set res {}
    foreach c [split $s {}] {
        scan $c %c int
        append res [expr {$int < 128 ? $c : "\\u[format %04.4X $int]"}]
    }
    set res
}
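
A quick check at the prompt:

% u2x "Ärger"
\u00C4rger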

Encodability

Not every Unicode character can be represented unambiguously in every encoding. In such cases a fallback character (e.g. ?) is inserted instead, and on re-conversion the original character is lost. Here is a little tester for such cases, demonstrated with some German and Greek letters (out of laziness, produced by Greeklish). Note the trick of quoting the argument list so that a command result can serve as an argument default; it assumes that the name of the system encoding is a single word (RS):

 proc encodable "s {enc [encoding system]}" {
    string eq $s [encoding convertfrom $enc [encoding convertto $enc $s]]
 }
% encodable äöü ; # That's "\u00e4\u00f6\u00fc" in case this example needs to be repaired
1
% encodable [greeklish Aqhnai]
0
% encodable [greeklish Aqhnai] iso8859-7
1
% encodable äöü iso8859-7
0
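
Building on encodable, one can enumerate which encodings represent a given string losslessly, much like the listbox idea in the discussion above (a sketch):

proc encodableIn {s} {
    set result {}
    foreach enc [encoding names] {
        if {[encodable $s $enc]} {lappend result $enc}
    }
    return $result
}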

See also:

RFE #535705 , Tcl interface to stop on encoding errors
remove diacritic

Misc

Font problems

Note that the lack of a mapping from some other encoding into Unicode can lead to effects that appear to be font problems though they can be fixed by the creation of a new encoding, as that page illustrates.

When do I need to include the .enc files in a distribution?

If you want to use any of the encodings which are not built in or provided by other sources (example: Tk).

But what if you're not, but you are (for example) writing a Tk application that's going to be displaying UTF-8, or using the message catalog to localize for other languages, and want it to display correctly on systems in different countries? Are they needed then?

HaO 2020-05-13: To my knowledge, no; all of this may be handled internally. Message catalogue files are sourced as utf-8, which is an internal encoding. The character drawing mechanism does not need encoding files (I am not sure about that, and it may be platform dependent). If you want to use the console, you need the system encoding to output non-ASCII characters (for example, "puts ÄÖÜ" on German Windows requires cp1252, which is not an internal encoding). Perhaps the Windows system console will evolve toward Unicode (like many Linux consoles nowadays) and this will no longer be required.
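
For console output, configuring the channel explicitly amounts to the following (a sketch; the standard channels normally default to the system encoding anyway):

chan configure stdout -encoding [encoding system]
puts "ÄÖÜ"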

IANA versus TCL encoding names

MG: I've never really looked at encodings before, and just started doing so for the first time while trying to hack something into a client app. The problem I've just run into is that the server is using the encoding name "iso-8859-1" (which seems to be the correct IANA name), while Tcl calls it "iso8859-1" (without the first hyphen). Is there any reason for this? Or any simple (but correct) way around it? (My immediate thought was to try matching against the Tcl encoding names with the first hyphen edited out, if the encoding can't be found - definitely simple, but also definitely not correct...) Any help would be greatly appreciated. Thanks.

A: This is indeed a problem... [L2 ]

HaO 2020-05-13: You may use the (unpublished, internal) routine of the http package to convert IANA names to Tcl names:

% package require http
2.9.1
% http::CharsetToEncoding iso-8859-1
iso8859-1

The description in the source file reads: "Tries to map a given IANA charset to a tcl encoding. If no encoding can be found, returns binary."
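
Per that description, unknown charsets fall back to binary:

% http::CharsetToEncoding no-such-charset
binary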