encoding, a built-in Tcl command, converts strings to and from particular encodings, controls how the definitions of particular encodings are located, and determines what Tcl believes the OS uses for encoding strings.
HaO 2020-05-13 - The encoding "unicode" may be 16- or 32-bit and little- or big-endian, depending on the compile options and the platform's endianness. See also [L1 ].
The following encodings are available in Tcl 8.6 without encoding files; they are hard-coded in the file generic/tclEncoding.c.
Information source: Don Porter on the core list, 2020-05-11, in the thread "Re: TCLCORE Default fallback encoding in TCL8.6.10 not identity but iso8859-1 ?".
HaO 2020-05-13: Kevin Kenny gave the following explanation on the core list on 2020-05-11 titled "Re: TCLCORE Default fallback encoding in TCL8.6.10 not identity but iso8859-1 ?":
ISO8859-1 enjoys the uniquely privileged position that 'binary' is the same encoding. The first 256 code points of Unicode are precisely the 256 byte values of ISO8859-1. The conversion between Tcl's byte arrays and Tcl's strings is therefore also ISO8859-1 equivalent.
The implication is that ISO8859-1 is at least lossless. In particular, every UTF-8 string has some interpretation as an ISO8859-1 string, and on any Unix-based system, that's enough to represent path names so that encoding tables can be read.
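The lossless round trip described above can be seen directly in a short session; this is a minimal sketch using arbitrary byte values, nothing specific to any real file:

```tcl
# Every byte value 0x00-0xFF maps to the Unicode code point with the
# same number under iso8859-1, so converting bytes to a string and
# back again always returns the original bytes.
set bytes [binary format c* {0 65 169 255}]          ;# arbitrary byte values
set str   [encoding convertfrom iso8859-1 $bytes]    ;# bytes  -> string
set back  [encoding convertto   iso8859-1 $str]      ;# string -> bytes
puts [string equal $bytes $back]                     ;# prints 1
```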
The 'identity' encoding was a rather muddled attempt to produce the same effect, which did not work, and resulted in malformed UTF-8 if any data outside the seven-bit ISO646 range were present.
It's pretty subtle, but ISO8859-1 is pretty much The Right Thing anywhere that we thought 'identity' might be correct. (The exception is some code internal to the test suite whose purpose is to test that the Core does not crash if handed a malformed CESU-8 string by a C extension.)
In a recent discussion in the chat room, the frustration of having characters without knowing their encoding, and the wish for a utility to help turn the characters into something nearly readable, resulted in the following suggestions:
RS: Larry: Yes. One might wrap that functionality in a text widget, into which you paste the suspicious page, and have a listbox to choose from all the encodings [encoding names] offers... a double click on a listbox item converts the text contents.

RS: Rolf: "encoding convertfrom foo" turns the questionable characters into UTF-8, which you can inspect directly in your text widget. Writing it to a text file, or to stdout, involves an implicit "encoding convertto [encoding system]", which is often iso8859-1 or cp1252. So if the input had Russian or Greek characters, these will not come through to the system encoding, but will be replaced by question marks.
Alas, the situation is this: LV receives emails, files, web pages, etc. on a regular basis without an encoding specified. While most of the text appears correct, the punctuation is skewed; it shows up as ? or \x92, etc., instead of things like " or ' or -, and so forth.

All he wants is for the weird punctuation marks (which he figures looked like various fancy punctuation characters in the creators' editors) to be readable.
Here's the code LV is currently trying...
# translate those obnoxious windows characters to something better
# (the map is built with [list] so that the \x escapes are actually
#  substituted; inside braces they would stay literal backslash sequences)
while {[gets stdin buffer] >= 0} {
    set buffer2 [string map [list \
        \xa9 (c) \xd7 x \x85 - \x91 ' \x92 ' \x93 \" \x94 \"] $buffer]
    puts $buffer2
}
The Euro sign is represented in Windows cp1252 as \x80. If you get such strings in, you can see the real Euro sign with
encoding convertfrom cp1252 \x80
Back you go with
encoding convertto cp1252 \u20AC
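For whole files rather than single strings, the same conversion can be attached to the channel instead; a sketch, where "input.txt" is a placeholder name for a cp1252-encoded file:

```tcl
# Read a Windows-encoded text file; the channel converts the bytes
# to Tcl strings on the fly, so \x80 on disk arrives as \u20AC.
set ch [open input.txt r]
fconfigure $ch -encoding cp1252   ;# bytes on disk are cp1252
set text [read $ch]
close $ch
puts $text
```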
To find out which encoding is used for communication with the OS (including file I/O):
encoding system
To see available encodings:
encoding names
Can I use the 'encoding' command (or some appropriate fconfigure -encoding) to take a Tcl source file (*.tcl) in an arbitrary encoding and output a well-formed Tcl source file that is pure ASCII (i.e., all chars > 127 converted to \uhhhh Unicode escape sequences)?
RS: Sure, but some assembly required - like this:
proc u2x s {
    set res {}
    foreach i [split $s {}] {
        scan $i %c int
        if {$int < 128} {
            append res $i
        } else {
            append res \\u[format %04.4X $int]
        }
    }
    set res
}
set fp [open $filename]
fconfigure $fp -encoding $originalEncoding
set data [u2x [read $fp [file size $filename]]]
close $fp
set fp2 [open $newFilename w]
puts -nonewline $fp2 $data
close $fp2
The "u2x" functionality is easily done, but it is also built into Tk: on Unix, codes for which no font has a character are substituted in "\uxxxx" style (Windows mostly shows an empty rectangle). See Unicode and UTF-8.
RS revisits this page on 2004-03-19 and now would rather write it this way:
proc u2x s {
    set res {}
    foreach c [split $s {}] {
        scan $c %c int
        append res [expr {$int < 128 ? $c : "\\u[format %04.4X $int]"}]
    }
    set res
}
Not every Unicode character can be represented unambiguously in every encoding. In such cases, a fallback character (e.g. ?) is inserted instead, and on re-conversion the original character is lost. Here is a little tester for such cases, demonstrated with some German and Greek letters (out of laziness, produced by Greeklish). Notice the quoted argument list used to make a command result the argument default, which assumes that the name of the system encoding is a single word (RS):
proc encodable "s {enc [encoding system]}" {
    string eq $s [encoding convertfrom $enc [encoding convertto $enc $s]]
}
% encodable äöü ;# that's "\u00e4\u00f6\u00fc" in case this example needs to be repaired
1
% encodable [greeklish Aqhnai]
0
% encodable [greeklish Aqhnai] iso8859-7
1
% encodable äöü iso8859-7
0
See also:
Note that the lack of a mapping from some other encoding into Unicode can lead to effects that appear to be font problems though they can be fixed by the creation of a new encoding, as that page illustrates.
If you want to use any of the encodings that are not built in, they have to be provided by other sources (example: Tk).
But what if you're not, but you are (for example) writing a Tk application that's going to be displaying UTF-8, or using the message catalog to localize for other languages, and want it to display correctly on systems in different countries? Are they needed then?
HaO 2020-05-13: To my knowledge: no, all of this may be handled internally. Message catalogue files are sourced as utf-8, which is an internal encoding. The character-drawing mechanism does not need encodings (I am not sure about that, and it may be platform-dependent). If you want to use the console, you need the system encoding to output non-ASCII characters (for example, "puts ÄÖÜ" on German Windows requires cp1252, which is not an internal encoding). Maybe the Windows system console will evolve to Unicode (like many Linux consoles nowadays) and this will no longer be required.
MG: I've never really looked at encodings before, and just started doing so for the first time while trying to hack something into a client app. The problem I've just run into is that the server is using the encoding name "iso-8859-1" (which seems to be the correct IANA name), while Tcl calls it "iso8859-1" (without the first hyphen). Is there any reason for this? Or any simple (but correct) way around it? (My immediate thought was to try matching against the Tcl encoding names with the first hyphen edited out, if the encoding can't be found - definitely simple, but also definitely not correct...) Any help would be greatly appreciated. Thanks.
A.
This is indeed a problem... [L2 ]
HaO 2020-05-13: You may use the unpublished internal routine of the http package to convert IANA names to Tcl names:
% package require http
2.9.1
% http::CharsetToEncoding iso-8859-1
iso8859-1
Description in the file: Tries to map a given IANA charset to a tcl encoding. If no encoding can be found, returns binary.
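If depending on an internal http routine is undesirable, MG's hyphen-stripping idea can be sketched in a few lines; the proc name iana2tcl is made up, it only handles the common iso-8859-n pattern plus exact matches, and it is not a full IANA mapping:

```tcl
# Map an IANA charset name to a Tcl encoding name. Falls back to
# retrying with the hyphen after "iso" removed (iso-8859-1 -> iso8859-1),
# and returns "binary" otherwise, like http::CharsetToEncoding does.
proc iana2tcl {name} {
    set name [string tolower $name]
    if {$name in [encoding names]} {
        return $name
    }
    set stripped [regsub {^iso-} $name iso]
    if {$stripped in [encoding names]} {
        return $stripped
    }
    return binary
}
puts [iana2tcl iso-8859-1]   ;# prints iso8859-1
```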